PDF Entity Extraction

Details

File: third_party/Indexify/pdf-entity-extraction/pdf-entity-extraction.ipynb
Type: Jupyter Notebook
Use Cases: PDF, Summarization
Integrations: Indexify
Content

Notebook content (JSON format):
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PDF Entity Extraction with Indexify and Mistral\n",
    "\n",
    "This cookbook demonstrates how to build a robust entity extraction pipeline for PDF documents using Indexify and Mistral's large language models. You will learn how to efficiently extract named entities from PDF files for various applications such as information retrieval, content analysis, and data mining.\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Entity extraction, also known as named entity recognition (NER) involves identifying and classifying named entities in text into predefined categories such as persons, organizations, locations, dates, and more. By applying this technique to PDF documents, we can automatically extract structured information from unstructured text, making it easier to analyze and utilize the content of these documents.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "Before we begin, ensure you have the following:\n",
    "\n",
    "- Create a virtual env with Python 3.9 or later\n",
    "  ```shell\n",
    "  python3.9 -m venv ve\n",
    "  source ve/bin/activate\n",
    "  ```\n",
    "- `pip` (Python package manager)\n",
    "- A Mistral API key\n",
    "- Basic familiarity with Python and command-line interfaces\n",
    "\n",
    "## Setup\n",
    "\n",
    "### Install Indexify\n",
    "\n",
    "First, let's install Indexify using the official installation script in a terminal:\n",
    "\n",
    "```bash\n",
    "curl https://getindexify.ai | sh\n",
    "```\n",
    "\n",
    "Start the Indexify server:\n",
    "```bash\n",
    "./indexify server -d\n",
    "```\n",
    "This starts a long running server that exposes ingestion and retrieval APIs to applications.\n",
    "\n",
    "### Install Required Extractors\n",
    "\n",
    "Next, we'll install the necessary extractors in a new terminal:\n",
    "\n",
    "```bash\n",
    "pip install indexify-extractor-sdk\n",
    "indexify-extractor download tensorlake/pdfextractor\n",
    "indexify-extractor download tensorlake/mistral\n",
    "```\n",
    "\n",
    "Once the extractors are downloaded, you can start them:\n",
    "```bash\n",
    "indexify-extractor join-server\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the Extraction Graph\n",
    "\n",
    "The extraction graph defines the flow of data through our entity extraction pipeline. We'll create a graph that first extracts text from PDFs, then sends that text to Mistral for entity extraction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from indexify import IndexifyClient, ExtractionGraph\n",
    "\n",
    "client = IndexifyClient()\n",
    "\n",
    "extraction_graph_spec = \"\"\"\n",
    "name: 'pdf_entity_extractor'\n",
    "extraction_policies:\n",
    "  - extractor: 'tensorlake/pdfextractor'\n",
    "    name: 'pdf_to_text'\n",
    "  - extractor: 'tensorlake/mistral'\n",
    "    name: 'text_to_entities'\n",
    "    input_params:\n",
    "      model_name: 'mistral-large-latest'\n",
    "      key: 'YOUR_MISTRAL_API_KEY'\n",
    "      system_prompt: 'Extract and categorize all named entities from the following text. Provide the results in a JSON format with categories: persons, organizations, locations, dates, and miscellaneous.'\n",
    "    content_source: 'pdf_to_text'\n",
    "\"\"\"\n",
    "\n",
    "extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)\n",
    "client.create_extraction_graph(extraction_graph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace `'YOUR_MISTRAL_API_KEY'` with your actual Mistral API key."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing the Entity Extraction Pipeline\n",
    "\n",
    "Now that we have our extraction graph set up, we can upload files and retrieve the entities:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "import requests\n",
    "from indexify import IndexifyClient\n",
    "\n",
    "def download_pdf(url, save_path):\n",
    "    response = requests.get(url)\n",
    "    with open(save_path, 'wb') as f:\n",
    "        f.write(response.content)\n",
    "    print(f\"PDF downloaded and saved to {save_path}\")\n",
    "\n",
    "\n",
    "def extract_entities_from_pdf(pdf_path):\n",
    "    client = IndexifyClient()\n",
    "    \n",
    "    # Upload the PDF file\n",
    "    content_id = client.upload_file(\"pdf_entity_extractor\", pdf_path)\n",
    "    \n",
    "    # Wait for the extraction to complete\n",
    "    client.wait_for_extraction(content_id)\n",
    "    \n",
    "    # Retrieve the extracted entities\n",
    "    entities_content = client.get_extracted_content(\n",
    "        content_id=content_id,\n",
    "        graph_name=\"pdf_entity_extractor\",\n",
    "        policy_name=\"text_to_entities\"\n",
    "    )\n",
    "    \n",
    "    # Parse the JSON response\n",
    "    entities = json.loads(entities_content[0]['content'].decode('utf-8'))\n",
    "    return entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf_url = \"https://arxiv.org/pdf/2310.06825.pdf\"\n",
    "pdf_path = \"reference_document.pdf\"\n",
    "\n",
    "# Download the PDF\n",
    "download_pdf(pdf_url, pdf_path)\n",
    "extracted_entities = extract_entities_from_pdf(pdf_path)\n",
    "\n",
    "print(\"Extracted Entities:\")\n",
    "for category, entities in extracted_entities.items():\n",
    "    print(f\"\\n{category.capitalize()}:\")\n",
    "    for entity in entities:\n",
    "        print(f\"- {entity}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Customization and Advanced Usage\n",
    "\n",
    "You can customize the entity extraction process by modifying the `system_prompt` in the extraction graph. For example:\n",
    "\n",
    "- To focus on specific entity types:\n",
    "  ```yaml\n",
    "  system_prompt: 'Extract only person names and organizations from the following text. Provide the results in a JSON format with categories: persons and organizations.'\n",
    "  ```\n",
    "\n",
    "- To include entity relationships:\n",
    "  ```yaml\n",
    "  system_prompt: 'Extract named entities and their relationships from the following text. Provide the results in a JSON format with categories: entities (including type and name) and relationships (including type and involved entities).'\n",
    "  ```\n",
    "\n",
    "You can also experiment with different Mistral models by changing the `model_name` parameter to find the best balance between speed and accuracy for your specific use case.\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "While the example might look simple, there are some unique advantages of using Indexify for this -\n",
    "\n",
    "1. **Scalable and Highly Availability**: Indexify server can be deployed on a cloud and it can process 1000s of PDFs uploaded into it, and if any step in the pipeline fails it automatically retries on another machine.\n",
    "2. **Flexibility**: You can use any other [PDF extraction model](https://docs.getindexify.ai/usecases/pdf_extraction/) we used here doesn't work for the document you are using. \n",
    "\n",
    "## Next Steps\n",
    "\n",
    "- Learn more about Indexify on our docs - https://docs.getindexify.ai\n",
    "- Go over an example, which uses Mistral for [building summarization at scale](../pdf-summarization/)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
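
Note on parsing the model output: `extract_entities_from_pdf` in the notebook above passes the extracted content straight to `json.loads`, which works only if the model replies with bare JSON. Language models sometimes wrap JSON in a Markdown code fence or add a sentence around it. The sketch below shows one way to tolerate that. `parse_entities_json` is a hypothetical helper written for illustration and is not part of the Indexify client; it assumes the reply contains a single top-level JSON object.

```python
import json
import re

# Hypothetical helper (not part of Indexify): tolerate model replies that wrap
# the JSON object in a Markdown code fence or surround it with extra prose.
FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def parse_entities_json(raw):
    """Return the first JSON object found in a model reply (bytes or str)."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw

    # If the reply contains a fenced block, keep only what is inside it.
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1)

    # Keep only the span from the first '{' to the last '}'.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])
```

Under these assumptions, the parsing line in `extract_entities_from_pdf` could become `entities = parse_entities_json(entities_content[0]['content'])`. Replies that contain no JSON object raise `ValueError`, so the caller can decide whether to retry or skip the document.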