Property Graph Predefined Schema
Details
File: third_party/LlamaIndex/propertygraphs/property_graph_predefined_schema.ipynb
Type: Jupyter Notebook
Use Cases: Property graph
Integrations: LlamaIndex
Content
Notebook content (JSON format):
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Property Graph with Pre-defined Schemas\n", "\n", "In this notebook, we guide you through using Neo4j, Ollama, and Huggingface to build a property graph.\n", "\n", "Specifically, we will utilize the SchemaLLMPathExtractor, which enables us to define a precise schema containing possible entity types, relation types, and how they can be interconnected.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install llama-index-core\n", "%pip install llama-index-llms-ollama\n", "%pip install llama-index-embeddings-huggingface\n", "%pip install llama-index-graph-stores-neo4j" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import nest_asyncio\n", "\n", "nest_asyncio.apply()\n", "\n", "from IPython.display import Markdown, display" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download Data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2024-07-05 09:13:53-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 75042 (73K) [text/plain]\n", "Saving to: ‘data/paul_graham/paul_graham_essay.txt’\n", "\n", "data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.02s \n", "\n", "2024-07-05 09:13:53 (4.27 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]\n", "\n" ] } ], "source": [ "!mkdir -p 'data/paul_graham/'\n", "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from llama_index.core import SimpleDirectoryReader\n", "\n", "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Graph Construction\n", "\n", "To build our graph, we will use the SchemaLLMPathExtractor. \n", "\n", "This tool allows us to specify a schema for the graph, enabling us to extract entities and relations that adhere to this predefined schema, rather than allowing the LLM to randomly determine entities and relations.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from typing import Literal\n", "from llama_index.llms.ollama import Ollama\n", "from llama_index.core.indices.property_graph import SchemaLLMPathExtractor\n", "\n", "# best practice to use upper-case\n", "entities = Literal[\"PERSON\", \"PLACE\", \"ORGANIZATION\"]\n", "relations = Literal[\"HAS\", \"PART_OF\", \"WORKED_ON\", \"WORKED_WITH\", \"WORKED_AT\"]\n", "\n", "# define which entities can have which relations\n", "# validation_schema = {\n", "# \"PERSON\": [\"HAS\", \"PART_OF\", \"WORKED_ON\", \"WORKED_WITH\", \"WORKED_AT\"],\n", "# \"PLACE\": [\"HAS\", \"PART_OF\", \"WORKED_AT\"],\n", "# \"ORGANIZATION\": [\"HAS\", \"PART_OF\", \"WORKED_WITH\"],\n", "# }\n", 
"validation_schema = [\n", " (\"ORGANIZATION\", \"HAS\", \"PERSON\"),\n", " (\"PERSON\", \"WORKED_AT\", \"ORGANIZATION\"),\n", " (\"PERSON\", \"WORKED_WITH\", \"PERSON\"),\n", " (\"PERSON\", \"WORKED_ON\", \"ORGANIZATION\"),\n", " (\"PERSON\", \"PART_OF\", \"ORGANIZATION\"),\n", " (\"ORGANIZATION\", \"PART_OF\", \"ORGANIZATION\"),\n", " (\"PERSON\", \"WORKED_AT\", \"PLACE\"),\n", "]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "kg_extractor = SchemaLLMPathExtractor(\n", " llm=Ollama(model=\"mistral:7b\", json_mode=True, request_timeout=3600),\n", " possible_entities=entities,\n", " possible_relations=relations,\n", " kg_validation_schema=validation_schema,\n", " # if false, allows for values outside of the schema\n", " # useful for using the schema as a suggestion\n", " strict=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neo4j setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!docker run \\\n", " -p 7474:7474 -p 7687:7687 \\\n", " -v $PWD/data:/data -v $PWD/plugins:/plugins \\\n", " --name neo4j-apoc \\\n", " -e NEO4J_apoc_export_file_enabled=true \\\n", " -e NEO4J_apoc_import_file_enabled=true \\\n", " -e NEO4J_apoc_import_file_use__neo4j__config=true \\\n", " -e NEO4JLABS_PLUGINS=\\[\\\"apoc\\\"\\] \\\n", " neo4j:latest" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning} {category: DEPRECATION} {title: This feature is deprecated and will be removed in future versions.} {description: The procedure has a deprecated field. 
('config' used by 'apoc.meta.graphSample' is deprecated.)} {position: line: 1, column: 1, offset: 0} for query: \"CALL apoc.meta.graphSample() YIELD nodes, relationships RETURN nodes, [rel in relationships | {name:apoc.any.property(rel, 'type'), count: apoc.any.property(rel, 'count')}] AS relationships\"\n" ] } ], "source": [ "from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore\n", "from llama_index.core.vector_stores.simple import SimpleVectorStore\n", "\n", "graph_store = Neo4jPropertyGraphStore(\n", " username=\"neo4j\",\n", " password=\"llamaindex\",\n", " url=\"bolt://localhost:7687\",\n", ")\n", "vec_store = SimpleVectorStore()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PropertyGraphIndex construction" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n", "/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n", " warnings.warn(\n", "/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. 
If you want to force a new download, use `force_download=True`.\n", " warnings.warn(\n", "Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 28.70it/s]\n", "Extracting paths from text with schema: 100%|██████████| 22/22 [08:14<00:00, 22.46s/it]\n", "Generating embeddings: 100%|██████████| 3/3 [00:00<00:00, 3.37it/s]\n", "Generating embeddings: 100%|██████████| 9/9 [00:01<00:00, 5.87it/s]\n", "Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning} {category: DEPRECATION} {title: This feature is deprecated and will be removed in future versions.} {description: The procedure has a deprecated field. ('config' used by 'apoc.meta.graphSample' is deprecated.)} {position: line: 1, column: 1, offset: 0} for query: \"CALL apoc.meta.graphSample() YIELD nodes, relationships RETURN nodes, [rel in relationships | {name:apoc.any.property(rel, 'type'), count: apoc.any.property(rel, 'count')}] AS relationships\"\n" ] } ], "source": [ "from llama_index.core import PropertyGraphIndex\n", "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n", "\n", "index = PropertyGraphIndex.from_documents(\n", " documents,\n", " kg_extractors=[kg_extractor],\n", " embed_model=HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\"),\n", " property_graph_store=graph_store,\n", " vector_store=vec_store,\n", " show_progress=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup Retrievers" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. 
If you want to force a new download, use `force_download=True`.\n", " warnings.warn(\n" ] } ], "source": [ "from llama_index.core.indices.property_graph import (\n", " LLMSynonymRetriever,\n", " VectorContextRetriever,\n", ")\n", "\n", "\n", "llm_synonym = LLMSynonymRetriever(\n", " index.property_graph_store,\n", " llm=Ollama(model=\"mistral:instruct\", request_timeout=3600),\n", " include_text=False,\n", ")\n", "vector_context = VectorContextRetriever(\n", " index.property_graph_store,\n", " embed_model=HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\"),\n", " include_text=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retriever" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "retriever = index.as_retriever(\n", " sub_retrievers=[\n", " llm_synonym,\n", " vector_context,\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Interleaf -> Got crushed by -> Moore's law\n", "Paul graham -> Did freelance work for -> Interleaf\n", "Interleaf -> Made software for -> Creating documents\n", "Interleaf -> Is -> Software company\n", "Interleaf -> Added -> Scripting language\n", "Paul Graham -> WORKED_AT -> Interleaf\n", "Paul Graham -> WORKED_ON -> Interleaf\n" ] } ], "source": [ "nodes = retriever.retrieve(\"What happened at Interleaf?\")\n", "\n", "for node in nodes:\n", " print(node.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### QueryEngine" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ " At Interleaf, a software company, they developed software for creating documents and added a scripting language. Paul Graham worked on this project for them. 
Eventually, something occurred that led to the significant impact of Interleaf being crushed by Moore's law." ], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query_engine = index.as_query_engine(\n", " sub_retrievers=[\n", " llm_synonym,\n", " vector_context,\n", " ],\n", " llm=Ollama(model=\"mistral:instruct\", request_timeout=3600),\n", ")\n", "\n", "response = query_engine.query(\"What happened at Interleaf?\")\n", "\n", "display(Markdown(f\"{response.response}\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "llamaindex", "language": "python", "name": "llamaindex" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 2 }
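
The notebook's validation_schema is a list of (subject type, relation, object type) triples, and strict=True tells the extractor to reject anything outside that list. The sketch below illustrates that filtering idea in plain Python. It is not the SchemaLLMPathExtractor implementation: the helper filter_triples and the sample candidates are hypothetical names introduced only for illustration, and the library's real validation may differ in detail.

```python
from typing import Iterable, List, Tuple

# (subject_type, relation, object_type)
Triple = Tuple[str, str, str]

# Same triples as the notebook's validation_schema.
VALIDATION_SCHEMA: List[Triple] = [
    ("ORGANIZATION", "HAS", "PERSON"),
    ("PERSON", "WORKED_AT", "ORGANIZATION"),
    ("PERSON", "WORKED_WITH", "PERSON"),
    ("PERSON", "WORKED_ON", "ORGANIZATION"),
    ("PERSON", "PART_OF", "ORGANIZATION"),
    ("ORGANIZATION", "PART_OF", "ORGANIZATION"),
    ("PERSON", "WORKED_AT", "PLACE"),
]


def filter_triples(
    candidates: Iterable[Triple],
    schema: Iterable[Triple],
    strict: bool = True,
) -> List[Triple]:
    """Hypothetical helper: keep only typed triples allowed by the schema.

    With strict=False the schema is treated as a suggestion and every
    candidate is kept, mirroring the comment in the notebook.
    """
    candidates = list(candidates)
    if not strict:
        return candidates
    allowed = set(schema)
    return [triple for triple in candidates if triple in allowed]


# Made-up candidate triples, as an LLM might propose them.
candidates = [
    ("PERSON", "WORKED_AT", "ORGANIZATION"),  # allowed by the schema
    ("PLACE", "HAS", "PERSON"),               # not listed in the schema
    ("PERSON", "WORKED_WITH", "PERSON"),      # allowed by the schema
]

kept = filter_triples(candidates, VALIDATION_SCHEMA)
for subject_type, relation, object_type in kept:
    print(f"{subject_type} -> {relation} -> {object_type}")
```

Listing full triples rather than a per-entity list of relations (the commented-out dict form in the notebook) also constrains the object type, so ("PERSON", "WORKED_AT", "PLACE") can be allowed while ("PLACE", "WORKED_AT", "PERSON") is not.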