pdf summarization
Details
File: third_party/Indexify/pdf-summarization/pdf-summarization.ipynb
Type: Jupyter Notebook
Use Cases: PDF, Summarization
Integrations: Indexify
Content
Notebook content (JSON format):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PDF Summarization with Indexify and Mistral\n",
"\n",
"In this cookbook, we'll explore how to create a PDF summarization pipeline using Indexify and Mistral's large language models. By the end of the document, you should have a pipeline capable of ingesting 1000s of PDF documents, and using Mistral for summarization.\n",
"\n",
"## Introduction\n",
"\n",
"The summarization pipeline is going to be composed of two steps -\n",
"- PDF to Text extraction. We are going to use a pre-built extractor for this - `tensorlake/pdfextractor`.\n",
"- We use Mistral for summarization.\n",
"\n",
"\n",
"## Prerequisites\n",
"\n",
"Before we begin, ensure you have the following:\n",
"\n",
"- Create a virtual env with Python 3.9 or later\n",
" ```shell\n",
" python3.9 -m venv ve\n",
" source ve/bin/activate\n",
" ```\n",
"- `pip` (Python package manager)\n",
"- A Mistral API key\n",
"- Basic familiarity with Python and command-line interfaces\n",
"\n",
"## Setup\n",
"\n",
"### Install Indexify\n",
"\n",
"First, let's install Indexify using the official installation script in a terminal:\n",
"\n",
"```bash\n",
"curl https://getindexify.ai | sh\n",
"```\n",
"\n",
"Start the Indexify server:\n",
"```bash\n",
"./indexify server -d\n",
"```\n",
"This starts a long running server that exposes ingestion and retrieval APIs to applications.\n",
"\n",
"### Install Required Extractors\n",
"\n",
"Next, we'll install the necessary extractors in a new terminal:\n",
"\n",
"```bash\n",
"pip install indexify-extractor-sdk\n",
"indexify-extractor download tensorlake/pdfextractor\n",
"indexify-extractor download tensorlake/mistral\n",
"```\n",
"\n",
"Once the extractors are downloaded, you can start them:\n",
"```bash\n",
"indexify-extractor join-server\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the Extraction Graph\n",
"\n",
"The extraction graph defines the flow of data through our summarization pipeline. We'll create a graph that first extracts text from PDFs, then sends that text to Mistral for summarization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from indexify import IndexifyClient, ExtractionGraph\n",
"\n",
"client = IndexifyClient()\n",
"\n",
"extraction_graph_spec = \"\"\"\n",
"name: 'pdf_summarizer'\n",
"extraction_policies:\n",
" - extractor: 'tensorlake/pdfextractor'\n",
" name: 'pdf_to_text'\n",
" - extractor: 'tensorlake/mistral'\n",
" name: 'text_to_summary'\n",
" input_params:\n",
" model_name: 'mistral-large-latest'\n",
" key: 'YOUR_MISTRAL_API_KEY'\n",
" system_prompt: 'Summarize the following text in a concise manner, highlighting the key points:'\n",
" content_source: 'pdf_to_text'\n",
"\"\"\"\n",
"\n",
"extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)\n",
"client.create_extraction_graph(extraction_graph)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace `'YOUR_MISTRAL_API_KEY'` with your actual Mistral API key."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementing the Summarization Pipeline\n",
"\n",
"Now that we have our extraction graph set up, we can upload files and make the pipeline generate summaries:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import requests\n",
"from indexify import IndexifyClient\n",
"\n",
"def download_pdf(url, save_path):\n",
" response = requests.get(url)\n",
" with open(save_path, 'wb') as f:\n",
" f.write(response.content)\n",
" print(f\"PDF downloaded and saved to {save_path}\")\n",
"\n",
"def summarize_pdf(pdf_path):\n",
" client = IndexifyClient()\n",
" \n",
" # Upload the PDF file\n",
" content_id = client.upload_file(\"pdf_summarizer\", pdf_path)\n",
" \n",
" # Wait for the extraction to complete\n",
" client.wait_for_extraction(content_id)\n",
" \n",
" # Retrieve the summarized content\n",
" summary = client.get_extracted_content(\n",
" content_id=content_id,\n",
" graph_name=\"pdf_summarizer\",\n",
" policy_name=\"text_to_summary\"\n",
" )\n",
" \n",
" return summary[0]['content'].decode('utf-8')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdf_url = \"https://arxiv.org/pdf/2310.06825.pdf\"\n",
"pdf_path = \"reference_document.pdf\"\n",
"\n",
"# Download the PDF\n",
"download_pdf(pdf_url, pdf_path)\n",
"\n",
"# Summarize the PDF\n",
"summary = summarize_pdf(pdf_path)\n",
"print(\"Summary of the PDF:\")\n",
"print(summary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Customization and Advanced Usage\n",
"\n",
"You can customize the summarization process by modifying the `system_prompt` in the extraction graph. For example:\n",
"\n",
"- To generate bullet-point summaries:\n",
" ```yaml\n",
" system_prompt: 'Summarize the following text as a list of bullet points:'\n",
" ```\n",
"\n",
"- To focus on specific aspects of the document:\n",
" ```yaml\n",
" system_prompt: 'Summarize the main arguments and supporting evidence from the following text:'\n",
" ```\n",
"\n",
"You can also experiment with different Mistral models by changing the `model_name` parameter to find the best balance between speed and accuracy for your specific use case.\n",
"\n",
"## Conclusion\n",
"\n",
"While the example might look simple, there are some unique advantages of using Indexify for this -\n",
"\n",
"1. **Scalable and Highly Availability**: Indexify server can be deployed on a cloud and it can process 1000s of PDFs uploaded into it, and if any step in the pipeline fails it automatically retries on another machine.\n",
"2. **Flexibility**: You can use any other [PDF extraction model](https://docs.getindexify.ai/usecases/pdf_extraction/) we used here doesn't work for the document you are using.\n",
"\n",
"## Next Steps\n",
"\n",
"- Learn more about Indexify on our docs - https://docs.getindexify.ai\n",
"- Learn how to use Indexify and Mistral for [entity extraction from PDF documents](../pdf-entity-extraction/)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}