pdf summarization
Details
File: third_party/Indexify/pdf-summarization/pdf-summarization.ipynb
Type: Jupyter Notebook
Use Cases: PDF, Summarization
Integrations: Indexify
Content
Notebook content (JSON format):
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PDF Summarization with Indexify and Mistral\n", "\n", "In this cookbook, we'll build a PDF summarization pipeline using Indexify and Mistral's large language models. By the end, you should have a pipeline capable of ingesting thousands of PDF documents and summarizing them with Mistral.\n", "\n", "## Introduction\n", "\n", "The summarization pipeline consists of two steps:\n", "- PDF-to-text extraction, using the pre-built `tensorlake/pdfextractor` extractor.\n", "- Summarization of the extracted text with Mistral.\n", "\n", "## Prerequisites\n", "\n", "Before we begin, ensure you have the following:\n", "\n", "- A virtual environment with Python 3.9 or later:\n", "  ```shell\n", "  python3.9 -m venv ve\n", "  source ve/bin/activate\n", "  ```\n", "- `pip` (Python package manager)\n", "- A Mistral API key\n", "- Basic familiarity with Python and command-line interfaces\n", "\n", "## Setup\n", "\n", "### Install Indexify\n", "\n", "First, let's install Indexify using the official installation script in a terminal:\n", "\n", "```bash\n", "curl https://getindexify.ai | sh\n", "```\n", "\n", "Start the Indexify server:\n", "```bash\n", "./indexify server -d\n", "```\n", "This starts a long-running server that exposes ingestion and retrieval APIs to applications.\n", "\n", "### Install Required Extractors\n", "\n", "Next, we'll install the necessary extractors in a new terminal:\n", "\n", "```bash\n", "pip install indexify-extractor-sdk\n", "indexify-extractor download tensorlake/pdfextractor\n", "indexify-extractor download tensorlake/mistral\n", "```\n", "\n", "Once the extractors are downloaded, you can start them:\n", "```bash\n", "indexify-extractor join-server\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the Extraction Graph\n", "\n", "The extraction graph defines the flow of data through our 
summarization pipeline. We'll create a graph that first extracts text from PDFs, then sends that text to Mistral for summarization." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from indexify import IndexifyClient, ExtractionGraph\n", "\n", "client = IndexifyClient()\n", "\n", "extraction_graph_spec = \"\"\"\n", "name: 'pdf_summarizer'\n", "extraction_policies:\n", "  - extractor: 'tensorlake/pdfextractor'\n", "    name: 'pdf_to_text'\n", "  - extractor: 'tensorlake/mistral'\n", "    name: 'text_to_summary'\n", "    input_params:\n", "      model_name: 'mistral-large-latest'\n", "      key: 'YOUR_MISTRAL_API_KEY'\n", "      system_prompt: 'Summarize the following text in a concise manner, highlighting the key points:'\n", "    content_source: 'pdf_to_text'\n", "\"\"\"\n", "\n", "extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)\n", "client.create_extraction_graph(extraction_graph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace `'YOUR_MISTRAL_API_KEY'` with your actual Mistral API key." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementing the Summarization Pipeline\n", "\n", "Now that the extraction graph is set up, we can upload files and have the pipeline generate summaries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import requests\n", "from indexify import IndexifyClient\n", "\n", "def download_pdf(url, save_path):\n", "    response = requests.get(url)\n", "    # Fail early on HTTP errors instead of saving an error page as a PDF\n", "    response.raise_for_status()\n", "    with open(save_path, 'wb') as f:\n", "        f.write(response.content)\n", "    print(f\"PDF downloaded and saved to {save_path}\")\n", "\n", "def summarize_pdf(pdf_path):\n", "    client = IndexifyClient()\n", "\n", "    # Upload the PDF file\n", "    content_id = client.upload_file(\"pdf_summarizer\", pdf_path)\n", "\n", "    # Wait for the extraction to complete\n", "    client.wait_for_extraction(content_id)\n", "\n", "    # Retrieve the summarized content\n", "    summary = client.get_extracted_content(\n", "        content_id=content_id,\n", "        graph_name=\"pdf_summarizer\",\n", "        policy_name=\"text_to_summary\"\n", "    )\n", "\n", "    return summary[0]['content'].decode('utf-8')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pdf_url = \"https://arxiv.org/pdf/2310.06825.pdf\"\n", "pdf_path = \"reference_document.pdf\"\n", "\n", "# Download the PDF\n", "download_pdf(pdf_url, pdf_path)\n", "\n", "# Summarize the PDF\n", "summary = summarize_pdf(pdf_path)\n", "print(\"Summary of the PDF:\")\n", "print(summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Customization and Advanced Usage\n", "\n", "You can customize the summarization process by modifying the `system_prompt` in the extraction graph. 
For example:\n", "\n", "- To generate bullet-point summaries:\n", "  ```yaml\n", "  system_prompt: 'Summarize the following text as a list of bullet points:'\n", "  ```\n", "\n", "- To focus on specific aspects of the document:\n", "  ```yaml\n", "  system_prompt: 'Summarize the main arguments and supporting evidence from the following text:'\n", "  ```\n", "\n", "You can also experiment with different Mistral models by changing the `model_name` parameter to find the best balance between speed and accuracy for your use case.\n", "\n", "## Conclusion\n", "\n", "While the example might look simple, there are some advantages to using Indexify for this:\n", "\n", "1. **Scalable and Highly Available**: The Indexify server can be deployed in the cloud and can process thousands of uploaded PDFs. If any step in the pipeline fails, it is automatically retried on another machine.\n", "2. **Flexibility**: You can use any other [PDF extraction model](https://docs.getindexify.ai/usecases/pdf_extraction/) if the one we used here doesn't work for your documents.\n", "\n", "## Next Steps\n", "\n", "- Learn more about Indexify in our docs: https://docs.getindexify.ai\n", "- Learn how to use Indexify and Mistral for [entity extraction from PDF documents](../pdf-entity-extraction/)" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }
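The customization described in the notebook (changing `system_prompt` and `model_name`) can be sketched as a small helper that builds the extraction graph YAML from parameters. This is a minimal sketch, not part of the notebook: the names `SPEC_TEMPLATE`, `yaml_quote`, and `build_graph_spec` are hypothetical, and reading the key from a `MISTRAL_API_KEY` environment variable is an assumption made here to avoid hardcoding it. The template mirrors the spec used in the notebook.

```python
import os

# Hypothetical template mirroring the extraction graph spec from the notebook.
SPEC_TEMPLATE = """\
name: '{graph_name}'
extraction_policies:
  - extractor: 'tensorlake/pdfextractor'
    name: 'pdf_to_text'
  - extractor: 'tensorlake/mistral'
    name: 'text_to_summary'
    input_params:
      model_name: '{model_name}'
      key: '{api_key}'
      system_prompt: '{system_prompt}'
    content_source: 'pdf_to_text'
"""


def yaml_quote(value):
    """Escape a value for a single-quoted YAML scalar (' becomes '')."""
    return value.replace("'", "''")


def build_graph_spec(system_prompt, model_name="mistral-large-latest",
                     graph_name="pdf_summarizer", api_key=None):
    """Return the extraction graph YAML with the given prompt and model."""
    # Assumption: the key lives in the MISTRAL_API_KEY environment variable
    # when it is not passed explicitly.
    if api_key is None:
        api_key = os.environ.get("MISTRAL_API_KEY", "")
    if not api_key:
        raise ValueError("No Mistral API key provided")
    return SPEC_TEMPLATE.format(
        graph_name=yaml_quote(graph_name),
        model_name=yaml_quote(model_name),
        api_key=yaml_quote(api_key),
        system_prompt=yaml_quote(system_prompt),
    )


if __name__ == "__main__":
    # Bullet-point variant of the prompt, with a placeholder key.
    print(build_graph_spec(
        "Summarize the following text as a list of bullet points:",
        api_key="YOUR_MISTRAL_API_KEY",
    ))
```

The resulting string can be passed to `ExtractionGraph.from_yaml(...)` and then to `client.create_extraction_graph(...)`, as in the notebook. Escaping single quotes keeps the YAML valid when a prompt contains an apostrophe.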