Document Understanding
Details
File: mistral/ocr/document_understanding.ipynb
Type: Jupyter Notebook
Use Cases: OCR, Documents, RAG
Content
Notebook content (JSON format):
{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "1Fzjyb9daweA" }, "source": [ "# Chat with your Documents with built-in Document QnA\n", "\n", "---\n", "\n", "## Use our built-in Document QnA feature\n", "\n", "Optical Character Recognition (OCR) transforms text-based documents and images into pure text outputs and markdown. By leveraging this feature, you can enable any Large Language Model (LLM) to reliably understand documents efficiently and cost-effectively.\n", "\n", "In this guide, we will demonstrate how to use OCR with our models to discuss any text-based document, whether it's a PDF, photo, or screenshot, via URLs and our built-in feature.\n", "\n", "---\n", "\n", "### Method\n", "This method will make use of our built-in feature that leverages OCR, we will extract the URLs with regex and call our models with this feature." ] }, { "cell_type": "markdown", "source": [ "## Built-In\n", "Mistral provides a built-in feature that leverages OCR with all models. By providing a URL pointing to a document, you can extract text data that will be provided to the model." ], "metadata": { "id": "X0n4egyIfba7" } }, { "cell_type": "markdown", "source": [ "" ], "metadata": { "id": "UV9IBRXSfjhN" } }, { "cell_type": "markdown", "source": [ "Following, there is a simple, quick, example of how to make use of this feature by extracting PDF URLs with regex and uploading them as a `document_url`." ], "metadata": { "id": "cyeuMtkGfe4n" } }, { "cell_type": "markdown", "source": [ "" ], "metadata": { "id": "ajR_E_VpfsPP" } }, { "cell_type": "markdown", "source": [ "Learn more about Document QnA [here](https://docs.mistral.ai/capabilities/OCR/document_qna/)." ], "metadata": { "id": "c0Bb5yaIfVG3" } }, { "cell_type": "markdown", "source": [ "### Setup\n", "First, let's install `mistralai`" ], "metadata": { "id": "Qpkf1J_JF3vz" } }, { "cell_type": "code", "source": [ "!pip install mistralai" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4J4LcA_PF0-0", "outputId": "d6a5238e-5600-4ff7-9b28-3692281c64d6" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Collecting mistralai\n", " Downloading mistralai-1.7.0-py3-none-any.whl.metadata (30 kB)\n", "Collecting eval-type-backport>=0.2.0 (from mistralai)\n", " Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)\n", "Requirement already satisfied: httpx>=0.28.1 in /usr/local/lib/python3.11/dist-packages (from mistralai) (0.28.1)\n", "Requirement already satisfied: pydantic>=2.10.3 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.11.4)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.9.0.post0)\n", "Requirement already satisfied: typing-inspection>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (0.4.0)\n", "Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.28.1->mistralai) (4.9.0)\n", "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpx>=0.28.1->mistralai) (2025.4.26)\n", "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.11/dist-packages (from httpx>=0.28.1->mistralai) (1.0.9)\n", "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.28.1->mistralai) (3.10)\n", "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.11/dist-packages (from httpcore==1.*->httpx>=0.28.1->mistralai) (0.16.0)\n", 
"Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.10.3->mistralai) (0.7.0)\n", "Requirement already satisfied: pydantic-core==2.33.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.10.3->mistralai) (2.33.2)\n", "Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.10.3->mistralai) (4.13.2)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->mistralai) (1.17.0)\n", "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio->httpx>=0.28.1->mistralai) (1.3.1)\n", "Downloading mistralai-1.7.0-py3-none-any.whl (301 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m301.5/301.5 kB\u001b[0m \u001b[31m10.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)\n", "Installing collected packages: eval-type-backport, mistralai\n", "Successfully installed eval-type-backport-0.2.2 mistralai-1.7.0\n" ] } ] }, { "cell_type": "markdown", "source": [ "We can now set up our client. You can create an API key on our [Plateforme](https://console.mistral.ai/api-keys/)." ], "metadata": { "id": "AoJtMUBYF6IY" } }, { "cell_type": "code", "source": [ "from mistralai import Mistral\n", "\n", "api_key = \"API_KEY\"\n", "client = Mistral(api_key=api_key)\n", "text_model = \"mistral-small-latest\"" ], "metadata": { "id": "qdYPUQJ6F8m5" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### System and Regex\n", "Let's define a simple system prompt, since there is no tool call required for this demo we can be fairly straightforward." ], "metadata": { "id": "T7CvWtw9jfR7" } }, { "cell_type": "code", "source": [ "system = \"You are an AI Assistant with document understanding via URLs. You may be provided with URLs, followed by their corresponding OCR.\"" ], "metadata": { "id": "Mkmw1FyGQpl3" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "To extract the URLs, we will use regex to extract any URL pattern from the user query.\n", "\n", "*Note: We will assume there will only be PDF files for simplicity.*" ], "metadata": { "id": "35yYt9asjoIa" } }, { "cell_type": "code", "source": [ "import re\n", "\n", "def extract_urls(text: str) -> list:\n", " url_pattern = r'\\b((?:https?|ftp)://(?:www\\.)?[^\\s/$.?#].[^\\s]*)\\b'\n", " urls = re.findall(url_pattern, text)\n", " return urls\n", "\n", "# Example\n", "extract_urls(\"Hi there, you can visit our docs in our website https://docs.mistral.ai/, we cannot wait to see what you will build with us.\")" ], "metadata": { "id": "vLMw8Z8fOT19", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "3ac0dc54-61df-4284-f553-e4cfdfd9b6e0" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['https://docs.mistral.ai']" ] }, "metadata": {}, "execution_count": 4 } ] }, { "cell_type": "markdown", "source": [ "### Test\n", "We can now try it out, we setup so that for each query all urls are extracted and added to the query properly.\n", "\n", "#### Example Prompts ( PDFs )\n", "- Could you summarize what this research paper talks about? 
https://arxiv.org/pdf/2410.07073\n", "- Explain this architecture: https://arxiv.org/abs/2401.04088" ], "metadata": { "id": "gsRD_4mJjz7-" } }, { "cell_type": "code", "source": [ "import json\n", "\n", "messages = [{\"role\": \"system\", \"content\": system}]\n", "while True:\n", " user_input = input(\"User > \")\n", " if user_input.lower() == \"quit\":\n", " break\n", "\n", " # Extract URLs from the user input, assuming they are always PDFs\n", " document_urls = extract_urls(user_input)\n", " user_message_content = [{\"type\": \"text\", \"text\": user_input}]\n", " for url in document_urls:\n", " user_message_content.append({\"type\": \"document_url\", \"document_url\": url})\n", " messages.append({\"role\": \"user\", \"content\": user_message_content})\n", "\n", " # Send the messages to the model and get a response\n", " response = client.chat.complete(\n", " model=text_model,\n", " messages=messages,\n", " temperature=0\n", " )\n", " messages.append({\"role\": \"assistant\", \"content\": response.choices[0].message.content})\n", "\n", " print(\"Assistant >\", response.choices[0].message.content)\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Eell9TZ7Oapq", "outputId": "13e48a49-1067-409b-8482-183f9b779584" }, "execution_count": null, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "User > Could you summarize what this research paper talks about? https://arxiv.org/pdf/2410.07073\n", "Assistant > The research paper titled \"Pixtral 12B\" introduces a 12-billion-parameter multimodal language model named Pixtral 12B. This model is designed to understand both natural images and documents, achieving leading performance on various multimodal benchmarks while also excelling in text-only tasks. Unlike many open-source models, Pixtral 12B does not compromise on natural language performance to excel in multimodal tasks.\n", "\n", "### Key Features and Contributions:\n", "\n", "1. **Multimodal Capabilities**:\n", " - Pixtral 12B is trained to handle both images and text, making it versatile for a wide range of applications.\n", " - It uses a new vision encoder trained from scratch, allowing it to ingest images at their natural resolution and aspect ratio, providing flexibility in the number of tokens used to process an image.\n", "\n", "2. **Performance**:\n", " - The model outperforms other open models of similar sizes (e.g., Llama-3.2 11B and Qwen-2-VL 7B) and even larger models like Llama-3.2 90B on various multimodal benchmarks.\n", " - It also performs well on text-only benchmarks, making it a suitable drop-in replacement for both text and vision tasks.\n", "\n", "3. **Benchmarks and Evaluation**:\n", " - The paper introduces a new open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios.\n", " - The evaluation protocol includes 'Explicit' prompts that specify the required output format, ensuring fair and standardized evaluation.\n", "\n", "4. **Architectural Details**:\n", " - Pixtral 12B is based on the transformer architecture and consists of a multimodal decoder and a vision encoder.\n", " - The vision encoder is designed to handle variable image sizes and aspect ratios, using techniques like RoPE-2D for relative position encoding.\n", "\n", "5. **Applications**:\n", " - The model can be used for reasoning over complex figures, multi-image instruction following, chart understanding and analysis, and converting hand-drawn website interfaces into executable HTML code.\n", "\n", "6. 
**Open-Source Release**:\n", " - Pixtral 12B is released under the Apache 2.0 license, making it accessible for further research and development.\n", "\n", "### Conclusion:\n", "Pixtral 12B represents a significant advancement in multimodal language models, offering strong performance across both text and vision tasks. Its versatility and open-source nature make it a valuable tool for researchers and developers in the field of AI and machine learning.\n", "User > quit\n" ] } ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" } }, "nbformat": 4, "nbformat_minor": 0 }
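
For reference, here is a minimal, single-turn sketch of the same Document QnA pattern the notebook demonstrates, using the same `client.chat.complete` call with a `document_url` content chunk. The standalone script structure and the `MISTRAL_API_KEY` environment-variable lookup are our own additions, not part of the notebook; the model name, system prompt, and example URL are taken from the cells above.

```python
import os

from mistralai import Mistral

# Assumption: the API key is exposed via the MISTRAL_API_KEY environment
# variable instead of being hardcoded as in the notebook cell above.
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="mistral-small-latest",
    messages=[
        {
            "role": "system",
            "content": "You are an AI Assistant with document understanding via URLs. "
                       "You may be provided with URLs, followed by their corresponding OCR.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Could you summarize what this research paper talks about?"},
                # The document_url chunk triggers the built-in Document QnA feature:
                # the PDF is fetched, OCR'd, and its text is provided to the model.
                {"type": "document_url", "document_url": "https://arxiv.org/pdf/2410.07073"},
            ],
        },
    ],
    temperature=0,
)

print(response.choices[0].message.content)
```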
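A note on the `extract_urls` helper used in the notebook: the trailing `\b` in the pattern forces each match to end on a word character, which is why the trailing `/` and comma are dropped from `https://docs.mistral.ai/,` in the recorded example output. Below is a short sketch of the same helper with a few illustrative checks; the test strings are our own, not from the notebook.

```python
import re

def extract_urls(text: str) -> list:
    # Same pattern as in the notebook: the final \b makes each match end on a
    # word character, so trailing '/', ',' or '.' are not included in the URL.
    url_pattern = r'\b((?:https?|ftp)://(?:www\.)?[^\s/$.?#].[^\s]*)\b'
    return re.findall(url_pattern, text)

# Illustrative checks (inputs chosen for this sketch)
assert extract_urls("See https://docs.mistral.ai/, thanks!") == ["https://docs.mistral.ai"]
assert extract_urls("Paper: https://arxiv.org/pdf/2410.07073") == ["https://arxiv.org/pdf/2410.07073"]
assert extract_urls("No links here.") == []
```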