PDF Summarization

Details

File: third_party/Indexify/pdf-summarization/pdf-summarization.ipynb
Type: Jupyter Notebook
Use Cases: PDF, Summarization
Integrations: Indexify
Content

Notebook content (JSON format):
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PDF Summarization with Indexify and Mistral\n",
    "\n",
    "In this cookbook, we'll explore how to create a PDF summarization pipeline using Indexify and Mistral's large language models. By the end of the document, you should have a pipeline capable of ingesting 1000s of PDF documents, and using Mistral for summarization.\n",
    "\n",
    "## Introduction\n",
    "\n",
    "The summarization pipeline is going to be composed of two steps -\n",
    "- PDF to Text extraction. We are going to use a pre-built extractor for this - `tensorlake/pdfextractor`.\n",
    "- We use Mistral for summarization.\n",
    "\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "Before we begin, ensure you have the following:\n",
    "\n",
    "- Create a virtual env with Python 3.9 or later\n",
    "  ```shell\n",
    "  python3.9 -m venv ve\n",
    "  source ve/bin/activate\n",
    "  ```\n",
    "- `pip` (Python package manager)\n",
    "- A Mistral API key\n",
    "- Basic familiarity with Python and command-line interfaces\n",
    "\n",
    "## Setup\n",
    "\n",
    "### Install Indexify\n",
    "\n",
    "First, let's install Indexify using the official installation script in a terminal:\n",
    "\n",
    "```bash\n",
    "curl https://getindexify.ai | sh\n",
    "```\n",
    "\n",
    "Start the Indexify server:\n",
    "```bash\n",
    "./indexify server -d\n",
    "```\n",
    "This starts a long running server that exposes ingestion and retrieval APIs to applications.\n",
    "\n",
    "### Install Required Extractors\n",
    "\n",
    "Next, we'll install the necessary extractors in a new terminal:\n",
    "\n",
    "```bash\n",
    "pip install indexify-extractor-sdk\n",
    "indexify-extractor download tensorlake/pdfextractor\n",
    "indexify-extractor download tensorlake/mistral\n",
    "```\n",
    "\n",
    "Once the extractors are downloaded, you can start them:\n",
    "```bash\n",
    "indexify-extractor join-server\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the Extraction Graph\n",
    "\n",
    "The extraction graph defines the flow of data through our summarization pipeline. We'll create a graph that first extracts text from PDFs, then sends that text to Mistral for summarization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from indexify import IndexifyClient, ExtractionGraph\n",
    "\n",
    "client = IndexifyClient()\n",
    "\n",
    "extraction_graph_spec = \"\"\"\n",
    "name: 'pdf_summarizer'\n",
    "extraction_policies:\n",
    "  - extractor: 'tensorlake/pdfextractor'\n",
    "    name: 'pdf_to_text'\n",
    "  - extractor: 'tensorlake/mistral'\n",
    "    name: 'text_to_summary'\n",
    "    input_params:\n",
    "      model_name: 'mistral-large-latest'\n",
    "      key: 'YOUR_MISTRAL_API_KEY'\n",
    "      system_prompt: 'Summarize the following text in a concise manner, highlighting the key points:'\n",
    "    content_source: 'pdf_to_text'\n",
    "\"\"\"\n",
    "\n",
    "extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)\n",
    "client.create_extraction_graph(extraction_graph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace `'YOUR_MISTRAL_API_KEY'` with your actual Mistral API key."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing the Summarization Pipeline\n",
    "\n",
    "Now that we have our extraction graph set up, we can upload files and make the pipeline generate summaries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import requests\n",
    "from indexify import IndexifyClient\n",
    "\n",
    "def download_pdf(url, save_path):\n",
    "    response = requests.get(url)\n",
    "    with open(save_path, 'wb') as f:\n",
    "        f.write(response.content)\n",
    "    print(f\"PDF downloaded and saved to {save_path}\")\n",
    "\n",
    "def summarize_pdf(pdf_path):\n",
    "    client = IndexifyClient()\n",
    "    \n",
    "    # Upload the PDF file\n",
    "    content_id = client.upload_file(\"pdf_summarizer\", pdf_path)\n",
    "    \n",
    "    # Wait for the extraction to complete\n",
    "    client.wait_for_extraction(content_id)\n",
    "    \n",
    "    # Retrieve the summarized content\n",
    "    summary = client.get_extracted_content(\n",
    "        content_id=content_id,\n",
    "        graph_name=\"pdf_summarizer\",\n",
    "        policy_name=\"text_to_summary\"\n",
    "    )\n",
    "    \n",
    "    return summary[0]['content'].decode('utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf_url = \"https://arxiv.org/pdf/2310.06825.pdf\"\n",
    "pdf_path = \"reference_document.pdf\"\n",
    "\n",
    "# Download the PDF\n",
    "download_pdf(pdf_url, pdf_path)\n",
    "\n",
    "# Summarize the PDF\n",
    "summary = summarize_pdf(pdf_path)\n",
    "print(\"Summary of the PDF:\")\n",
    "print(summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Customization and Advanced Usage\n",
    "\n",
    "You can customize the summarization process by modifying the `system_prompt` in the extraction graph. For example:\n",
    "\n",
    "- To generate bullet-point summaries:\n",
    "  ```yaml\n",
    "  system_prompt: 'Summarize the following text as a list of bullet points:'\n",
    "  ```\n",
    "\n",
    "- To focus on specific aspects of the document:\n",
    "  ```yaml\n",
    "  system_prompt: 'Summarize the main arguments and supporting evidence from the following text:'\n",
    "  ```\n",
    "\n",
    "You can also experiment with different Mistral models by changing the `model_name` parameter to find the best balance between speed and accuracy for your specific use case.\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "While the example might look simple, there are some unique advantages of using Indexify for this -\n",
    "\n",
    "1. **Scalable and Highly Availability**: Indexify server can be deployed on a cloud and it can process 1000s of PDFs uploaded into it, and if any step in the pipeline fails it automatically retries on another machine.\n",
    "2. **Flexibility**: You can use any other [PDF extraction model](https://docs.getindexify.ai/usecases/pdf_extraction/) we used here doesn't work for the document you are using.\n",
    "\n",
    "## Next Steps\n",
    "\n",
    "- Learn more about Indexify on our docs - https://docs.getindexify.ai\n",
    "- Learn how to use Indexify and Mistral for [entity extraction from PDF documents](../pdf-entity-extraction/)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}