property graph predefined schema
Details
File: third_party/LlamaIndex/propertygraphs/property_graph_predefined_schema.ipynb
Type: Jupyter Notebook
Use Cases: Property graph
Integrations: LlamaIndex
Content
Notebook content (JSON format):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Property Graph with Pre-defined Schemas\n",
"\n",
"In this notebook, we walk through building a property graph with Neo4j, Ollama, and Hugging Face.\n",
"\n",
"Specifically, we use the `SchemaLLMPathExtractor`, which lets us define a precise schema: the possible entity types, the possible relation types, and how they can be connected.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install llama-index-core\n",
"%pip install llama-index-llms-ollama\n",
"%pip install llama-index-embeddings-huggingface\n",
"%pip install llama-index-graph-stores-neo4j"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"# Jupyter already runs an event loop; allow nested loops for async calls\n",
"nest_asyncio.apply()\n",
"\n",
"from IPython.display import Markdown, display"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-07-05 09:13:53-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 75042 (73K) [text/plain]\n",
"Saving to: ‘data/paul_graham/paul_graham_essay.txt’\n",
"\n",
"data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.02s \n",
"\n",
"2024-07-05 09:13:53 (4.27 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]\n",
"\n"
]
}
],
"source": [
"!mkdir -p 'data/paul_graham/'\n",
"!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core import SimpleDirectoryReader\n",
"\n",
"documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Graph Construction\n",
"\n",
"To build our graph, we will use the `SchemaLLMPathExtractor`.\n",
"\n",
"It lets us specify a schema for the graph so that only entities and relations conforming to that schema are extracted, rather than leaving the LLM free to choose arbitrary entity and relation types.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from typing import Literal\n",
"from llama_index.llms.ollama import Ollama\n",
"from llama_index.core.indices.property_graph import SchemaLLMPathExtractor\n",
"\n",
"# best practice to use upper-case\n",
"entities = Literal[\"PERSON\", \"PLACE\", \"ORGANIZATION\"]\n",
"relations = Literal[\"HAS\", \"PART_OF\", \"WORKED_ON\", \"WORKED_WITH\", \"WORKED_AT\"]\n",
"\n",
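"# Define which entities can have which relations.\n",
"# Dict form, kept for reference: each entity type maps to the relations it may use.\n",
"# validation_schema = {\n",
"# \"PERSON\": [\"HAS\", \"PART_OF\", \"WORKED_ON\", \"WORKED_WITH\", \"WORKED_AT\"],\n",
"# \"PLACE\": [\"HAS\", \"PART_OF\", \"WORKED_AT\"],\n",
"# \"ORGANIZATION\": [\"HAS\", \"PART_OF\", \"WORKED_WITH\"],\n",
"# }\n",
"# List form used here: explicit (head type, relation, tail type) triples.\n",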
"validation_schema = [\n",
" (\"ORGANIZATION\", \"HAS\", \"PERSON\"),\n",
" (\"PERSON\", \"WORKED_AT\", \"ORGANIZATION\"),\n",
" (\"PERSON\", \"WORKED_WITH\", \"PERSON\"),\n",
" (\"PERSON\", \"WORKED_ON\", \"ORGANIZATION\"),\n",
" (\"PERSON\", \"PART_OF\", \"ORGANIZATION\"),\n",
" (\"ORGANIZATION\", \"PART_OF\", \"ORGANIZATION\"),\n",
" (\"PERSON\", \"WORKED_AT\", \"PLACE\"),\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"kg_extractor = SchemaLLMPathExtractor(\n",
" llm=Ollama(model=\"mistral:7b\", json_mode=True, request_timeout=3600),\n",
" possible_entities=entities,\n",
" possible_relations=relations,\n",
" kg_validation_schema=validation_schema,\n",
" # if False, values outside the schema are allowed,\n",
" # which is useful when the schema is only a suggestion\n",
" strict=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Neo4j setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!docker run \\\n",
" -p 7474:7474 -p 7687:7687 \\\n",
" -v $PWD/data:/data -v $PWD/plugins:/plugins \\\n",
" --name neo4j-apoc \\\n",
" -e NEO4J_AUTH=neo4j/llamaindex \\\n",
" -e NEO4J_apoc_export_file_enabled=true \\\n",
" -e NEO4J_apoc_import_file_enabled=true \\\n",
" -e NEO4J_apoc_import_file_use__neo4j__config=true \\\n",
" -e NEO4JLABS_PLUGINS=\\[\\\"apoc\\\"\\] \\\n",
" neo4j:latest"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning} {category: DEPRECATION} {title: This feature is deprecated and will be removed in future versions.} {description: The procedure has a deprecated field. ('config' used by 'apoc.meta.graphSample' is deprecated.)} {position: line: 1, column: 1, offset: 0} for query: \"CALL apoc.meta.graphSample() YIELD nodes, relationships RETURN nodes, [rel in relationships | {name:apoc.any.property(rel, 'type'), count: apoc.any.property(rel, 'count')}] AS relationships\"\n"
]
}
],
"source": [
"from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore\n",
"from llama_index.core.vector_stores.simple import SimpleVectorStore\n",
"\n",
"graph_store = Neo4jPropertyGraphStore(\n",
" username=\"neo4j\",\n",
" password=\"llamaindex\",\n",
" url=\"bolt://localhost:7687\",\n",
")\n",
"vec_store = SimpleVectorStore()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PropertyGraphIndex construction"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
" warnings.warn(\n",
"/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
" warnings.warn(\n",
"Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 28.70it/s]\n",
"Extracting paths from text with schema: 100%|██████████| 22/22 [08:14<00:00, 22.46s/it]\n",
"Generating embeddings: 100%|██████████| 3/3 [00:00<00:00, 3.37it/s]\n",
"Generating embeddings: 100%|██████████| 9/9 [00:01<00:00, 5.87it/s]\n",
"Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning} {category: DEPRECATION} {title: This feature is deprecated and will be removed in future versions.} {description: The procedure has a deprecated field. ('config' used by 'apoc.meta.graphSample' is deprecated.)} {position: line: 1, column: 1, offset: 0} for query: \"CALL apoc.meta.graphSample() YIELD nodes, relationships RETURN nodes, [rel in relationships | {name:apoc.any.property(rel, 'type'), count: apoc.any.property(rel, 'count')}] AS relationships\"\n"
]
}
],
"source": [
"from llama_index.core import PropertyGraphIndex\n",
"from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
"\n",
"index = PropertyGraphIndex.from_documents(\n",
" documents,\n",
" kg_extractors=[kg_extractor],\n",
" embed_model=HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\"),\n",
" property_graph_store=graph_store,\n",
" vector_store=vec_store,\n",
" show_progress=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup Retrievers"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
" warnings.warn(\n"
]
}
],
"source": [
"from llama_index.core.indices.property_graph import (\n",
" LLMSynonymRetriever,\n",
" VectorContextRetriever,\n",
")\n",
"\n",
"\n",
"llm_synonym = LLMSynonymRetriever(\n",
" index.property_graph_store,\n",
" llm=Ollama(model=\"mistral:instruct\", request_timeout=3600),\n",
" include_text=False,\n",
")\n",
"vector_context = VectorContextRetriever(\n",
" index.property_graph_store,\n",
" embed_model=HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\"),\n",
" include_text=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retriever"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"retriever = index.as_retriever(\n",
" sub_retrievers=[\n",
" llm_synonym,\n",
" vector_context,\n",
" ]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Interleaf -> Got crushed by -> Moore's law\n",
"Paul graham -> Did freelance work for -> Interleaf\n",
"Interleaf -> Made software for -> Creating documents\n",
"Interleaf -> Is -> Software company\n",
"Interleaf -> Added -> Scripting language\n",
"Paul Graham -> WORKED_AT -> Interleaf\n",
"Paul Graham -> WORKED_ON -> Interleaf\n"
]
}
],
"source": [
"nodes = retriever.retrieve(\"What happened at Interleaf?\")\n",
"\n",
"for node in nodes:\n",
" print(node.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### QueryEngine"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
" At Interleaf, a software company, they developed software for creating documents and added a scripting language. Paul Graham worked on this project for them. Eventually, something occurred that led to the significant impact of Interleaf being crushed by Moore's law."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"query_engine = index.as_query_engine(\n",
" sub_retrievers=[\n",
" llm_synonym,\n",
" vector_context,\n",
" ],\n",
" llm=Ollama(model=\"mistral:instruct\", request_timeout=3600),\n",
")\n",
"\n",
"response = query_engine.query(\"What happened at Interleaf?\")\n",
"\n",
"display(Markdown(f\"{response.response}\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llamaindex",
"language": "python",
"name": "llamaindex"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
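
The Graph Construction cell passes `strict=True` together with a list of allowed (head type, relation, tail type) triples. As a rough illustration of what such a strict schema check means, the sketch below filters candidate triples in plain Python. It is not LlamaIndex's implementation: `filter_triples` and the sample `candidates` are hypothetical names introduced here, and only the schema list is copied from the notebook.

```python
from typing import List, Tuple

# A typed triple: (head entity type, relation, tail entity type).
Triple = Tuple[str, str, str]

# Same allowed triples as the notebook's validation_schema.
VALIDATION_SCHEMA: List[Triple] = [
    ("ORGANIZATION", "HAS", "PERSON"),
    ("PERSON", "WORKED_AT", "ORGANIZATION"),
    ("PERSON", "WORKED_WITH", "PERSON"),
    ("PERSON", "WORKED_ON", "ORGANIZATION"),
    ("PERSON", "PART_OF", "ORGANIZATION"),
    ("ORGANIZATION", "PART_OF", "ORGANIZATION"),
    ("PERSON", "WORKED_AT", "PLACE"),
]


def filter_triples(
    candidates: List[Triple], schema: List[Triple], strict: bool = True
) -> List[Triple]:
    """Hypothetical helper: when strict, keep only triples listed in the schema."""
    if not strict:
        # Non-strict mode treats the schema as a suggestion and keeps everything.
        return list(candidates)
    allowed = set(schema)
    return [triple for triple in candidates if triple in allowed]


# Made-up candidates standing in for what an LLM might propose.
candidates: List[Triple] = [
    ("PERSON", "WORKED_AT", "ORGANIZATION"),
    ("PLACE", "HAS", "PERSON"),  # not in the schema
    ("PERSON", "WORKED_WITH", "PERSON"),
]

kept = filter_triples(candidates, VALIDATION_SCHEMA, strict=True)
print(kept)
```

With `strict=False` the same call returns all three candidates unchanged, which mirrors the notebook comment about using the schema only as a suggestion.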