property graph predefined schema

Details

File: third_party/LlamaIndex/propertygraphs/property_graph_predefined_schema.ipynb
Type: Jupyter Notebook
Use Cases: Property graph
Integrations: Llamaindex
Content

Notebook content (JSON format):
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Property Graph with Pre-defined Schemas\n",
    "\n",
    "In this notebook, we guide you through using Neo4j, Ollama, and Huggingface to build a property graph.\n",
    "\n",
    "Specifically, we will utilize the SchemaLLMPathExtractor, which enables us to define a precise schema containing possible entity types, relation types, and how they can be interconnected.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-core\n",
    "%pip install llama-index-llms-ollama\n",
    "%pip install llama-index-embeddings-huggingface\n",
    "%pip install llama-index-graph-stores-neo4j"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "from IPython.display import Markdown, display"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-07-05 09:13:53--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 75042 (73K) [text/plain]\n",
      "Saving to: ‘data/paul_graham/paul_graham_essay.txt’\n",
      "\n",
      "data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.02s   \n",
      "\n",
      "2024-07-05 09:13:53 (4.27 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Graph Construction\n",
    "\n",
    "To build our graph, we will use the SchemaLLMPathExtractor. \n",
    "\n",
    "This tool allows us to specify a schema for the graph, enabling us to extract entities and relations that adhere to this predefined schema, rather than allowing the LLM to randomly determine entities and relations.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Literal\n",
    "from llama_index.llms.ollama import Ollama\n",
    "from llama_index.core.indices.property_graph import SchemaLLMPathExtractor\n",
    "\n",
    "# best practice to use upper-case\n",
    "entities = Literal[\"PERSON\", \"PLACE\", \"ORGANIZATION\"]\n",
    "relations = Literal[\"HAS\", \"PART_OF\", \"WORKED_ON\", \"WORKED_WITH\", \"WORKED_AT\"]\n",
    "\n",
    "# define which entities can have which relations\n",
    "# validation_schema = {\n",
    "#     \"PERSON\": [\"HAS\", \"PART_OF\", \"WORKED_ON\", \"WORKED_WITH\", \"WORKED_AT\"],\n",
    "#     \"PLACE\": [\"HAS\", \"PART_OF\", \"WORKED_AT\"],\n",
    "#     \"ORGANIZATION\": [\"HAS\", \"PART_OF\", \"WORKED_WITH\"],\n",
    "# }\n",
    "validation_schema = [\n",
    "    (\"ORGANIZATION\", \"HAS\", \"PERSON\"),\n",
    "    (\"PERSON\", \"WORKED_AT\", \"ORGANIZATION\"),\n",
    "    (\"PERSON\", \"WORKED_WITH\", \"PERSON\"),\n",
    "    (\"PERSON\", \"WORKED_ON\", \"ORGANIZATION\"),\n",
    "    (\"PERSON\", \"PART_OF\", \"ORGANIZATION\"),\n",
    "    (\"ORGANIZATION\", \"PART_OF\", \"ORGANIZATION\"),\n",
    "    (\"PERSON\", \"WORKED_AT\", \"PLACE\"),\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "kg_extractor = SchemaLLMPathExtractor(\n",
    "    llm=Ollama(model=\"mistral:7b\", json_mode=True, request_timeout=3600),\n",
    "    possible_entities=entities,\n",
    "    possible_relations=relations,\n",
    "    kg_validation_schema=validation_schema,\n",
    "    # if false, allows for values outside of the schema\n",
    "    # useful for using the schema as a suggestion\n",
    "    strict=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Neo4j setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!docker run \\\n",
    "    -p 7474:7474 -p 7687:7687 \\\n",
    "    -v $PWD/data:/data -v $PWD/plugins:/plugins \\\n",
    "    --name neo4j-apoc \\\n",
    "    -e NEO4J_apoc_export_file_enabled=true \\\n",
    "    -e NEO4J_apoc_import_file_enabled=true \\\n",
    "    -e NEO4J_apoc_import_file_use__neo4j__config=true \\\n",
    "    -e NEO4JLABS_PLUGINS=\\[\\\"apoc\\\"\\] \\\n",
    "    neo4j:latest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning} {category: DEPRECATION} {title: This feature is deprecated and will be removed in future versions.} {description: The procedure has a deprecated field. ('config' used by 'apoc.meta.graphSample' is deprecated.)} {position: line: 1, column: 1, offset: 0} for query: \"CALL apoc.meta.graphSample() YIELD nodes, relationships RETURN nodes, [rel in relationships | {name:apoc.any.property(rel, 'type'), count: apoc.any.property(rel, 'count')}] AS relationships\"\n"
     ]
    }
   ],
   "source": [
    "from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore\n",
    "from llama_index.core.vector_stores.simple import SimpleVectorStore\n",
    "\n",
    "graph_store = Neo4jPropertyGraphStore(\n",
    "    username=\"neo4j\",\n",
    "    password=\"llamaindex\",\n",
    "    url=\"bolt://localhost:7687\",\n",
    ")\n",
    "vec_store = SimpleVectorStore()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PropertyGraphIndex construction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "  warnings.warn(\n",
      "/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "  warnings.warn(\n",
      "Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 28.70it/s]\n",
      "Extracting paths from text with schema: 100%|██████████| 22/22 [08:14<00:00, 22.46s/it]\n",
      "Generating embeddings: 100%|██████████| 3/3 [00:00<00:00,  3.37it/s]\n",
      "Generating embeddings: 100%|██████████| 9/9 [00:01<00:00,  5.87it/s]\n",
      "Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning} {category: DEPRECATION} {title: This feature is deprecated and will be removed in future versions.} {description: The procedure has a deprecated field. ('config' used by 'apoc.meta.graphSample' is deprecated.)} {position: line: 1, column: 1, offset: 0} for query: \"CALL apoc.meta.graphSample() YIELD nodes, relationships RETURN nodes, [rel in relationships | {name:apoc.any.property(rel, 'type'), count: apoc.any.property(rel, 'count')}] AS relationships\"\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core import PropertyGraphIndex\n",
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "\n",
    "index = PropertyGraphIndex.from_documents(\n",
    "    documents,\n",
    "    kg_extractors=[kg_extractor],\n",
    "    embed_model=HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\"),\n",
    "    property_graph_store=graph_store,\n",
    "    vector_store=vec_store,\n",
    "    show_progress=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup Retrievers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core.indices.property_graph import (\n",
    "    LLMSynonymRetriever,\n",
    "    VectorContextRetriever,\n",
    ")\n",
    "\n",
    "\n",
    "llm_synonym = LLMSynonymRetriever(\n",
    "    index.property_graph_store,\n",
    "    llm=Ollama(model=\"mistral:instruct\", request_timeout=3600),\n",
    "    include_text=False,\n",
    ")\n",
    "vector_context = VectorContextRetriever(\n",
    "    index.property_graph_store,\n",
    "    embed_model=HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\"),\n",
    "    include_text=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = index.as_retriever(\n",
    "    sub_retrievers=[\n",
    "        llm_synonym,\n",
    "        vector_context,\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Interleaf -> Got crushed by -> Moore's law\n",
      "Paul graham -> Did freelance work for -> Interleaf\n",
      "Interleaf -> Made software for -> Creating documents\n",
      "Interleaf -> Is -> Software company\n",
      "Interleaf -> Added -> Scripting language\n",
      "Paul Graham -> WORKED_AT -> Interleaf\n",
      "Paul Graham -> WORKED_ON -> Interleaf\n"
     ]
    }
   ],
   "source": [
    "nodes = retriever.retrieve(\"What happened at Interleaf?\")\n",
    "\n",
    "for node in nodes:\n",
    "    print(node.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### QueryEngine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       " At Interleaf, a software company, they developed software for creating documents and added a scripting language. Paul Graham worked on this project for them. Eventually, something occurred that led to the significant impact of Interleaf being crushed by Moore's law."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "query_engine = index.as_query_engine(\n",
    "    sub_retrievers=[\n",
    "        llm_synonym,\n",
    "        vector_context,\n",
    "    ],\n",
    "    llm=Ollama(model=\"mistral:instruct\", request_timeout=3600),\n",
    ")\n",
    "\n",
    "response = query_engine.query(\"What happened at Interleaf?\")\n",
    "\n",
    "display(Markdown(f\"{response.response}\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llamaindex",
   "language": "python",
   "name": "llamaindex"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}