property graph extractors retrievers

Details

File: third_party/LlamaIndex/propertygraphs/property_graph_extractors_retrievers.ipynb
Type: Jupyter Notebook
Use Cases: Property graph
Integrations: LlamaIndex
Content

Notebook content (JSON format):
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extractors and Retrievers in Property Graph\n",
    "\n",
    "In this notebook, we will explore how to define extractors and retrievers for the PropertyGraph Index.\n",
    "\n",
    "A property graph is a structured collection of labeled nodes (such as entity categories and text labels) with properties (metadata), interconnected by relationships to form structured paths (triplets).\n",
    "\n",
    "In LlamaIndex, the PropertyGraphIndex plays a crucial role in:\n",
    "\n",
    "•\tConstructing a graph\n",
    "\n",
    "•\tQuerying a graph"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building and Using PropertyGraph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import PropertyGraphIndex\n",
    "\n",
    "# create\n",
    "index = PropertyGraphIndex.from_documents(\n",
    "    documents,\n",
    ")\n",
    "\n",
    "# use\n",
    "retriever = index.as_retriever(\n",
    "    include_text=True,  # include source chunk with matching paths\n",
    "    similarity_top_k=2,  # top k for vector kg node retrieval\n",
    ")\n",
    "nodes = retriever.retrieve(\"<QUERY>\")\n",
    "\n",
    "query_engine = index.as_query_engine(\n",
    "    include_text=True,  # include source chunk with matching paths\n",
    "    similarity_top_k=2,  # top k for vector kg node retrieval\n",
    ")\n",
    "response = query_engine.query(\"<QUERY>\")\n",
    "\n",
    "# save and load\n",
    "index.storage_context.persist(persist_dir=\"./storage\")\n",
    "\n",
    "from llama_index.core import StorageContext, load_index_from_storage\n",
    "\n",
    "index = load_index_from_storage(\n",
    "    StorageContext.from_defaults(persist_dir=\"./storage\")\n",
    ")\n",
    "\n",
    "# loading from existing graph store (and optional vector store)\n",
    "# load from existing graph/vector store\n",
    "index = PropertyGraphIndex.from_existing(\n",
    "    property_graph_store=graph_store, vector_store=vector_store, llm=llm\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Property graph construction involves executing a series of knowledge graph extractors on each chunk, and attaching entities and relations as metadata to each node. \n",
    "\n",
    "You can use as many extractors as needed, and all will be applied."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = PropertyGraphIndex.from_documents(\n",
    "    documents,\n",
    "    kg_extractors=[extractor1, extractor2, ...],\n",
    ")\n",
    "\n",
    "# insert additional documents / nodes\n",
    "index.insert(document)\n",
    "index.insert_nodes(nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If not provided, the defaults are SimpleLLMPathExtractor and ImplicitPathExtractor.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.indices.property_graph import SimpleLLMPathExtractor\n",
    "\n",
    "kg_extractor = SimpleLLMPathExtractor(\n",
    "    llm=llm,\n",
    "    max_paths_per_chunk=10,\n",
    "    num_workers=4,\n",
    "    show_progress=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SimpleLLMPathExtractor\n",
    "\n",
    "Use an LLM to extract short statements and parse single-hop paths in the format (entity1, relation, entity2).\n",
    "\n",
    "If desired, you can also customize both the prompt and the function used for parsing the paths.\n",
    "\n",
    "Here’s a straightforward (though simplistic) example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = (\n",
    "    \"Some text is provided below. Given the text, extract up to \"\n",
    "    \"{max_paths_per_chunk} \"\n",
    "    \"knowledge triples in the form of `subject,predicate,object` on each line. Avoid stopwords.\\n\"\n",
    ")\n",
    "\n",
    "\n",
    "def parse_fn(response_str: str) -> List[Tuple[str, str, str]]:\n",
    "    lines = response_str.split(\"\\n\")\n",
    "    triples = [line.split(\",\") for line in lines]\n",
    "    return triples\n",
    "\n",
    "\n",
    "kg_extractor = SimpleLLMPathExtractor(\n",
    "    llm=llm,\n",
    "    extract_prompt=prompt,\n",
    "    parse_fn=parse_fn,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ImplicitPathExtractor\n",
    "\n",
    "Extract paths using the node.relationships attribute on each LlamaIndex node object.\n",
    "\n",
    "This extraction process does not require an LLM or embedding model, as it simply parses properties that already exist on the node objects.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.indices.property_graph import ImplicitPathExtractor\n",
    "\n",
    "kg_extractor = ImplicitPathExtractor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SchemaLLMPathExtractor\n",
    "\n",
    "Extract paths by adhering to a strict schema that specifies allowed entities, relationships, and the connections between them.\n",
    "\n",
    "Using Pydantic, structured outputs from LLMs, and some intelligent validation, we can dynamically define a schema and verify the extractions for each path.(triplet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Literal\n",
    "from llama_index.core.indices.property_graph import SchemaLLMPathExtractor\n",
    "\n",
    "# recommended uppercase, underscore separated\n",
    "entities = Literal[\"PERSON\", \"PLACE\", \"ORGANIZATION\"]\n",
    "relations = Literal[\"PART_OF\", \"HAS\", \"WORKED_AT\"]\n",
    "# schema = {\n",
    "#     \"PERSON\": [\"PART_OF\", \"HAS\", \"WORKED_AT\"],\n",
    "#     \"PLACE\": [\"PART_OF\", \"HAS\"],\n",
    "#     \"ORGANIZATION\": [\"WORKED_AT\"],\n",
    "# }\n",
    "\n",
    "schema = [\n",
    "    (\"PLACE\", \"HAS\", \"PERSON\"),\n",
    "    (\"PERSON\", \"PART_OF\", \"PLACE\"),\n",
    "    (\"PERSON\", \"WORKED_AT\", \"ORGANIZATION\")\n",
    "]\n",
    "\n",
    "kg_extractor = SchemaLLMPathExtractor(\n",
    "    llm=llm,\n",
    "    possible_entities=entities,\n",
    "    possible_relations=relations,\n",
    "    kg_validation_schema=schema,\n",
    "    strict=True,  # if false, will allow triples outside of the schema\n",
    "    num_workers=4,\n",
    "    max_paths_per_chunk=10,\n",
    "    show_progres=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieval and Querying\n",
    "\n",
    "Labeled property graphs offer various querying methods to retrieve nodes and paths. In LlamaIndex, we have the ability to simultaneously combine multiple node retrieval techniques!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a retriever\n",
    "retriever = index.as_retriever(sub_retrievers=[retriever1, retriever2, ...])\n",
    "\n",
    "# create a query engine\n",
    "query_engine = index.as_query_engine(\n",
    "    sub_retrievers=[retriever1, retriever2, ...]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If no sub-retrievers are specified, the default retrievers used are the LLMSynonymRetriever and VectorContextRetriever (if embeddings are enabled).\n",
    "\n",
    "Currently, the following retrievers are included:\n",
    "\n",
    "•\tLLMSynonymRetriever: Retrieves nodes based on keywords and synonyms generated by an LLM.\n",
    "\n",
    "•\tVectorContextRetriever: Retrieves nodes based on embedded graph nodes.\n",
    "\n",
    "•\tTextToCypherRetriever: Directs the LLM to generate Cypher queries based on the schema of the property graph.\n",
    "\n",
    "•\tCypherTemplateRetriever: Utilizes a Cypher template with parameters inferred by the LLM.\n",
    "\n",
    "•\tCustomPGRetriever: Easily subclassed to implement custom retrieval logic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.indices.property_graph import (\n",
    "    PGRetriever,\n",
    "    VectorContextRetriever,\n",
    "    LLMSynonymRetriever,\n",
    ")\n",
    "\n",
    "sub_retrievers = [\n",
    "    VectorContextRetriever(index.property_graph_store, ...),\n",
    "    LLMSynonymRetriever(index.property_graph_store, ...),\n",
    "]\n",
    "\n",
    "retriever = PGRetriever(sub_retrievers=sub_retrievers)\n",
    "\n",
    "nodes = retriever.retrieve(\"<query>\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### LLMSynonymRetriever\n",
    "\n",
    "This retriever takes the input query and attempts to generate relevant keywords and synonyms. These are used to retrieve nodes and consequently, the paths connected to those nodes.\n",
    "\n",
    "Explicitly declaring this retriever in your configuration allows for the customization of several options."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.indices.property_graph import LLMSynonymRetriever\n",
    "\n",
    "prompt = (\n",
    "    \"Given some initial query, generate synonyms or related keywords up to {max_keywords} in total, \"\n",
    "    \"considering possible cases of capitalization, pluralization, common expressions, etc.\\n\"\n",
    "    \"Provide all synonyms/keywords separated by '^' symbols: 'keyword1^keyword2^...'\\n\"\n",
    "    \"Note, result should be in one-line, separated by '^' symbols.\"\n",
    "    \"----\\n\"\n",
    "    \"QUERY: {query_str}\\n\"\n",
    "    \"----\\n\"\n",
    "    \"KEYWORDS: \"\n",
    ")\n",
    "\n",
    "\n",
    "def parse_fn(self, output: str) -> list[str]:\n",
    "    matches = output.strip().split(\"^\")\n",
    "\n",
    "    # capitalize to normalize with ingestion\n",
    "    return [x.strip().capitalize() for x in matches if x.strip()]\n",
    "\n",
    "\n",
    "synonym_retriever = LLMSynonymRetriever(\n",
    "    index.property_graph_store,\n",
    "    llm=llm,\n",
    "    # include source chunk text with retrieved paths\n",
    "    include_text=False,\n",
    "    synonym_prompt=prompt,\n",
    "    output_parsing_fn=parse_fn,\n",
    "    max_keywords=10,\n",
    "    # the depth of relations to follow after node retrieval\n",
    "    path_depth=1,\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(sub_retrievers=[synonym_retriever])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## VectorContextRetriever\n",
    "\n",
    "This retriever identifies nodes based on their vector similarity, subsequently fetching the paths connected to those nodes.\n",
    "\n",
    "If your graph store natively supports vector capabilities, managing that graph store alone suffices for storage. However, if vector support is not inherent, you will need to supplement the graph store with a vector store. By default, this setup uses the in-memory SimpleVectorStore.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.indices.property_graph import VectorContextRetriever\n",
    "\n",
    "vector_retriever = VectorContextRetriever(\n",
    "    index.property_graph_store,\n",
    "    # only needed when the graph store doesn't support vector queries\n",
    "    # vector_store=index.vector_store,\n",
    "    embed_model=embed_model,\n",
    "    # include source chunk text with retrieved paths\n",
    "    include_text=False,\n",
    "    # the number of nodes to fetch\n",
    "    similarity_top_k=2,\n",
    "    # the depth of relations to follow after node retrieval\n",
    "    path_depth=1,\n",
    "    # can provide any other kwargs for the VectorStoreQuery class\n",
    "    ...,\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(sub_retrievers=[vector_retriever])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TextToCypherRetriever\n",
    "\n",
    "This retriever utilizes a graph store schema, your query, and a prompt template for text-to-cypher conversion to generate and execute a Cypher query.\n",
    "\n",
    "Note: Since the SimplePropertyGraphStore is not a full-fledged graph database, it does not support Cypher queries.\n",
    "\n",
    "To inspect the schema, you can use the method: index.property_graph_store.get_schema_str()."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.indices.property_graph import TextToCypherRetriever\n",
    "\n",
    "DEFAULT_RESPONSE_TEMPLATE = (\n",
    "    \"Generated Cypher query:\\n{query}\\n\\n\" \"Cypher Response:\\n{response}\"\n",
    ")\n",
    "DEFAULT_ALLOWED_FIELDS = [\"text\", \"label\", \"type\"]\n",
    "\n",
    "DEFAULT_TEXT_TO_CYPHER_TEMPLATE = (\n",
    "    index.property_graph_store.text_to_cypher_template,\n",
    ")\n",
    "\n",
    "\n",
    "cypher_retriever = TextToCypherRetriever(\n",
    "    index.property_graph_store,\n",
    "    # customize the LLM, defaults to Settings.llm\n",
    "    llm=llm,\n",
    "    # customize the text-to-cypher template.\n",
    "    # Requires `schema` and `question` template args\n",
    "    text_to_cypher_template=DEFAULT_TEXT_TO_CYPHER_TEMPLATE,\n",
    "    # customize how the cypher result is inserted into\n",
    "    # a text node. Requires `query` and `response` template args\n",
    "    response_template=DEFAULT_RESPONSE_TEMPLATE,\n",
    "    # an optional callable that can clean/verify generated cypher\n",
    "    cypher_validator=None,\n",
    "    # allowed fields in the resulting\n",
    "    allowed_output_field=DEFAULT_ALLOWED_FIELDS,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CypherTemplateRetriever\n",
    "\n",
    "This is a more constrained version of the TextToCypherRetriever. Instead of allowing the LLM free rein to generate any Cypher statement, we can provide a Cypher template and have the LLM fill in the blanks.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# NOTE: current v1 is needed\n",
    "from pydantic.v1 import BaseModel, Field\n",
    "from llama_index.core.indices.property_graph import CypherTemplateRetriever\n",
    "\n",
    "# write a query with template params\n",
    "cypher_query = \"\"\"\n",
    "MATCH (c:Chunk)-[:MENTIONS]->(o)\n",
    "WHERE o.name IN $names\n",
    "RETURN c.text, o.name, o.label;\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "# create a pydantic class to represent the params for our query\n",
    "# the class fields are directly used as params for running the cypher query\n",
    "class TemplateParams(BaseModel):\n",
    "    \"\"\"Template params for a cypher query.\"\"\"\n",
    "\n",
    "    names: list[str] = Field(\n",
    "        description=\"A list of entity names or keywords to use for lookup in a knowledge graph.\"\n",
    "    )\n",
    "\n",
    "\n",
    "template_retriever = CypherTemplateRetriever(\n",
    "    index.property_graph_store, TemplateParams, cypher_query\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
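
As a rough illustration of what the `strict=True` option in the SchemaLLMPathExtractor cell implies, the sketch below filters candidate triplets against the same list of allowed connections. The helper `filter_by_schema` and the `ALLOWED` constant are hypothetical names introduced here for illustration only; this is not the library's validation code, just a plain-Python approximation of the idea that triplets outside the schema are dropped when strict mode is on.

```python
from typing import List, Tuple

# (subject_type, relation, object_type)
Triplet = Tuple[str, str, str]

# Same allowed connections as the `schema` list in the notebook cell.
ALLOWED: List[Triplet] = [
    ("PLACE", "HAS", "PERSON"),
    ("PERSON", "PART_OF", "PLACE"),
    ("PERSON", "WORKED_AT", "ORGANIZATION"),
]


def filter_by_schema(candidates: List[Triplet], strict: bool = True) -> List[Triplet]:
    """Hypothetical helper: keep only triplets whose type pattern is allowed.

    With strict=False every candidate is kept, mirroring the comment in the
    notebook that non-strict mode allows triples outside of the schema.
    """
    if not strict:
        return list(candidates)
    allowed = set(ALLOWED)
    return [triplet for triplet in candidates if triplet in allowed]


candidates = [
    ("PERSON", "WORKED_AT", "ORGANIZATION"),  # matches the schema
    ("ORGANIZATION", "HAS", "PLACE"),  # not listed in the schema
]
kept = filter_by_schema(candidates)
```

In this sketch only the first candidate survives strict filtering; passing `strict=False` returns both unchanged.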