property graph

Details

File: third_party/LlamaIndex/propertygraphs/property_graph.ipynb
Type: Jupyter Notebook
Use Cases: Property graph
Integrations: Llamaindex
Content

Notebook content (JSON format):
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/mistralai/cookbook/blob/main/third_party/LlamaIndex/propertygraphs/property_graph.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PropertyGraph Index with Mistral AI and LlamaIndex \n",
    "\n",
    "In this notebook, we demonstrate the basic usage of the PropertyGraphIndex in LlamaIndex.\n",
    "\n",
    "The property graph index will process unstructured documents, extract a property graph from them, and offer various methods for querying this graph."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-core\n",
    "%pip install llama-index-llms-mistralai\n",
    "%pip install llama-index-embeddings-mistralai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "from IPython.display import Markdown, display"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['MISTRAL_API_KEY'] = 'YOUR MISTRAL API KEY'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.mistralai import MistralAIEmbedding\n",
    "from llama_index.llms.mistralai import MistralAI\n",
    "\n",
    "llm = MistralAI(model='mistral-large-latest')\n",
    "embed_model = MistralAIEmbedding()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-07-05 07:22:24--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8002::154, 2606:50c0:8000::154, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 75042 (73K) [text/plain]\n",
      "Saving to: ‘data/paul_graham/paul_graham_essay.txt’\n",
      "\n",
      "data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.06s   \n",
      "\n",
      "2024-07-05 07:22:24 (1.27 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create PropertyGraphIndex\n",
    "\n",
    "The following steps occur during the creation of a PropertyGraph:\n",
    "\n",
    "1.\tPropertyGraphIndex.from_documents(): We load documents into an index.\n",
    "\n",
    "2.\tParsing Nodes: The index parses the documents into nodes.\n",
    "\n",
    "3.\tExtracting Paths from Text: The nodes are passed to an LLM, which is prompted to generate knowledge graph triples (i.e., paths).\n",
    "\n",
    "4.\tExtracting Implicit Paths: The node.relationships property is used to infer implicit paths.\n",
    "\n",
    "5.\tGenerating Embeddings: Embeddings are generated for each text node and graph node, occurring twice during the process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/ravithejad/Desktop/llamaindex/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 27.96it/s]\n",
      "Extracting paths from text: 100%|██████████| 22/22 [00:24<00:00,  1.13s/it]\n",
      "Extracting implicit paths: 100%|██████████| 22/22 [00:00<00:00, 11592.30it/s]\n",
      "Generating embeddings: 100%|██████████| 3/3 [00:00<00:00,  3.47it/s]\n",
      "Generating embeddings: 100%|██████████| 44/44 [00:13<00:00,  3.18it/s]\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core import PropertyGraphIndex\n",
    "\n",
    "\n",
    "index = PropertyGraphIndex.from_documents(\n",
    "    documents,\n",
    "    llm=llm,\n",
    "    embed_model=embed_model,\n",
    "    show_progress=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For debugging purposes, the default SimplePropertyGraphStore includes a helper to save a networkx representation of the graph to an html file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "index.property_graph_store.save_networkx_graph(name=\"./kg.html\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import Settings\n",
    "\n",
    "Settings.llm = llm\n",
    "Settings.embed_model = embed_model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying\n",
    "\n",
    "Querying a property graph index typically involves using one or more sub-retrievers and combining their results. The process of graph retrieval includes:\n",
    "\n",
    "1.\tSelecting Nodes: Identifying the initial nodes of interest within the graph.\n",
    "2.\tTraversing: Moving from the selected nodes to explore connected elements.\n",
    "\n",
    "By default, two primary types of retrieval are employed simultaneously:\n",
    "\n",
    "•\tSynonym/Keyword Expansion: Utilizing an LLM to generate synonyms and keywords derived from the query.\n",
    "\n",
    "•\tVector Retrieval: Employing embeddings to locate nodes within your graph.\n",
    "\n",
    "\n",
    "Once nodes are identified, you can choose to:\n",
    "\n",
    "•\tReturn Paths: Provide the paths adjacent to the selected nodes, typically in the form of triples.\n",
    "\n",
    "•\tReturn Paths and Source Text: Provide both the paths and the original source text of the chunk, if available.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Retreival"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Viaweb -> Launch date -> January 1996\n",
      "Viaweb -> Growth rate -> 7x a year\n",
      "Viaweb -> Received seed funding from -> Julian\n",
      "Viaweb -> Is -> Online store builder\n",
      "Viaweb -> User growth -> 70 stores at the end of 1996 and about 500 at the end of 1997\n",
      "Viaweb -> Pricing -> $100 a month for a small store and $300 a month for a big one\n",
      "Viaweb -> Acquisition -> Bought by yahoo in the summer of 1998\n",
      "Viaweb -> Has -> Code editor\n",
      "Viaweb -> Strategy -> Doing things that don't scale\n",
      "Viaweb -> Developed by -> Robert and trevor\n",
      "Viaweb -> Founded by -> Paul graham and robert\n",
      "Viaweb -> Reached breakeven -> Summer of 1998\n",
      "Viaweb -> Was -> One of the best general-purpose site builders\n",
      "Viaweb -> Started by -> I\n",
      "Viaweb -> Service -> Building stores for users\n",
      "Viaweb -> Investors -> Had significant influence on company decisions\n",
      "Viaweb -> Software -> Works via the web\n",
      "Viaweb -> Was founded by -> I and robert morris\n",
      "Viaweb -> Bought by -> Yahoo\n",
      "Viaweb -> Initial product -> Wysiwyg site builder\n",
      "Viaweb -> Had -> Handful of employees\n",
      "Viaweb -> Started for -> Needing money\n",
      "Viaweb -> Hosts -> Stores\n",
      "Viaweb -> Status before acquisition -> Not profitable\n",
      "I -> Got a job at -> Interleaf\n",
      "Interleaf -> Made software for -> Creating documents\n",
      "Interleaf -> Added a scripting language -> Lisp\n",
      "I -> Arranged to do freelance work for -> Interleaf\n"
     ]
    }
   ],
   "source": [
    "retriever = index.as_retriever(\n",
    "    include_text=False,  # include source text, default True\n",
    ")\n",
    "\n",
    "nodes = retriever.retrieve(\"What happened at Interleaf and Viaweb?\")\n",
    "\n",
    "for node in nodes:\n",
    "    print(node.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### QueryEngine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "At Interleaf, the company made software for creating documents and added a scripting language, Lisp, inspired by Emacs. The narrator worked there for a year but admits to being a bad employee, not understanding most of the software and being irresponsible. However, they learned some valuable lessons about what not to do in a technology company.\n",
       "\n",
       "Viaweb, on the other hand, was an online store builder that hosted stores for users. Before its public launch, it had to recruit an initial set of users and ensure they had decent-looking stores. Viaweb had a code editor for users to define their own page styles, which was essentially editing Lisp expressions, but it wasn't an app editor. The company was not profitable before it was bought by Yahoo in the summer of 1998. Viaweb's strategy was to do things that don't scale, and it was one of the best general-purpose site builders at the time."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "query_engine = index.as_query_engine(\n",
    "    include_text=True\n",
    ")\n",
    "\n",
    "response = query_engine.query(\"What happened at Interleaf and Viaweb?\")\n",
    "\n",
    "display(Markdown(f\"{response.response}\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Storage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, storage is managed using our straightforward in-memory abstractions—SimpleVectorStore for embeddings and SimplePropertyGraphStore for the property graph.\n",
    "\n",
    "We can save and load these structures to and from disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "At Interleaf, the company made software for creating documents and added a scripting language, Lisp, inspired by Emacs. The narrator worked there for a year but admits to being a bad employee, not understanding most of the software and being irresponsible. However, they learned some valuable lessons about what not to do in a technology company.\n",
       "\n",
       "Viaweb, on the other hand, was an online store builder that hosted stores for users. Before its public launch, it had to recruit an initial set of users and ensure they had decent-looking stores. Viaweb had a code editor for users to define their own page styles, which was essentially editing Lisp expressions, but it wasn't an app editor. The company was not profitable before it was bought by Yahoo in the summer of 1998. Viaweb's strategy was to do things that don't scale, and it was one of the best general-purpose site builders at the time."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "index.storage_context.persist(persist_dir=\"./storage\")\n",
    "\n",
    "from llama_index.core import StorageContext, load_index_from_storage\n",
    "\n",
    "index = load_index_from_storage(\n",
    "    StorageContext.from_defaults(persist_dir=\"./storage\")\n",
    ")\n",
    "\n",
    "query_engine = index.as_query_engine(\n",
    "    include_text=True\n",
    ")\n",
    "\n",
    "response = query_engine.query(\"What happened at Interleaf and Viaweb?\")\n",
    "\n",
    "display(Markdown(f\"{response.response}\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Vector Stores\n",
    "\n",
    "While some graph databases, such as Neo4j, support vectors, you can still specify which vector store to use with your graph in cases where vectors are not supported, or when you want to override the default settings.\n",
    "\n",
    "Below, we will demonstrate how to combine ChromaVectorStore with the default SimplePropertyGraphStore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-vector-stores-chroma"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build and Save Index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 28.20it/s]\n",
      "Extracting paths from text: 100%|██████████| 22/22 [00:24<00:00,  1.13s/it]\n",
      "Extracting implicit paths: 100%|██████████| 22/22 [00:00<00:00, 7562.26it/s]\n",
      "Generating embeddings: 100%|██████████| 3/3 [00:01<00:00,  2.17it/s]\n",
      "Generating embeddings: 100%|██████████| 45/45 [00:14<00:00,  3.13it/s]\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core.graph_stores import SimplePropertyGraphStore\n",
    "from llama_index.vector_stores.chroma import ChromaVectorStore\n",
    "import chromadb\n",
    "\n",
    "client = chromadb.PersistentClient(\"./chroma_db\")\n",
    "collection = client.get_or_create_collection(\"my_graph_vector_db\")\n",
    "\n",
    "index = PropertyGraphIndex.from_documents(\n",
    "    documents,\n",
    "    llm=llm,\n",
    "    embed_model=embed_model,\n",
    "    property_graph_store=SimplePropertyGraphStore(),\n",
    "    vector_store=ChromaVectorStore(chroma_collection=collection),\n",
    "    show_progress=True,\n",
    ")\n",
    "\n",
    "index.storage_context.persist(persist_dir=\"./storage\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "The author was involved in YC, where they encountered various problems, including having to address misinterpretations of their essays on a forum they managed. They also worked closely with someone named Jessica. However, the specific roles or tasks the author undertook at YC are not detailed in the provided context."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "index = PropertyGraphIndex.from_existing(\n",
    "    SimplePropertyGraphStore.from_persist_dir(\"./storage\"),\n",
    "    vector_store=ChromaVectorStore(chroma_collection=collection),\n",
    "    llm=llm,\n",
    ")\n",
    "\n",
    "query_engine = index.as_query_engine(\n",
    "    include_text=True\n",
    ")\n",
    "\n",
    "response = query_engine.query(\"why did author do at YC?\")\n",
    "\n",
    "display(Markdown(f\"{response.response}\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llamaindex",
   "language": "python",
   "name": "llamaindex"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}