tool usage
Details
File: mistral/ocr/tool_usage.ipynb
Type: Jupyter Notebook
Use Cases: OCR, Function calling
Content
Notebook content (JSON format):
{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "1Fzjyb9daweA" }, "source": [ "# Document Comprehension with Any Model via Tool Usage and OCR\n", "\n", "---\n", "\n", "Optical Character Recognition (OCR) transforms text-based documents and images into pure text outputs and markdown. By leveraging this feature, you can enable any Large Language Model (LLM) to reliably understand documents efficiently and cost-effectively.\n", "\n", "In this guide, we will demonstrate how to use OCR with our models to discuss any text-based document, whether it's a PDF, photo, or screenshot, via URLs.\n", "\n", "---\n", "\n", "### Method\n", "We will leverage [Tool Usage](https://docs.mistral.ai/capabilities/function_calling/) to open any URL on demand by the user.\n", "\n", "#### Other Methods\n", "We also have a built-in feature for document understanding leveraging our OCR model, to learn more about it visit our [Document Understanding docs](https://docs.mistral.ai/capabilities/OCR/document_understanding/)" ] }, { "cell_type": "markdown", "metadata": { "id": "KvEQoe7Y9-um" }, "source": [ "## Tool Usage\n", "To achieve this, we will first send our question, which may or may not include URLs pointing to documents that we want to perform OCR on. Mistral Small will then decide, using the `open_urls` tool ( extracting the URLs directly ), whether it needs to perform OCR on any URL or if it can directly answer the question." ] }, { "cell_type": "markdown", "metadata": { "id": "FL4ZJCeY918i" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "Sf84okJJmm7M" }, "source": [ "### Setup\n", "First, let's install `mistralai`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X1EBW_a6gRUD", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "3d1c72f7-7eb5-44b0-c898-ce3d45d2dbcb" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Collecting mistralai\n", " Downloading mistralai-1.5.0-py3-none-any.whl.metadata (29 kB)\n", "Collecting eval-type-backport>=0.2.0 (from mistralai)\n", " Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)\n", "Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (0.28.1)\n", "Collecting jsonpath-python>=1.0.6 (from mistralai)\n", " Downloading jsonpath_python-1.0.6-py3-none-any.whl.metadata (12 kB)\n", "Requirement already satisfied: pydantic>=2.9.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.10.6)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.8.2)\n", "Collecting typing-inspect>=0.9.0 (from mistralai)\n", " Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)\n", "Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (3.7.1)\n", "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (2025.1.31)\n", "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (1.0.7)\n", "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (3.10)\n", "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.11/dist-packages (from httpcore==1.*->httpx>=0.27.0->mistralai) (0.14.0)\n", "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from 
pydantic>=2.9.0->mistralai) (0.7.0)\n", "Requirement already satisfied: pydantic-core==2.27.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (2.27.2)\n", "Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (4.12.2)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->mistralai) (1.17.0)\n", "Collecting mypy-extensions>=0.3.0 (from typing-inspect>=0.9.0->mistralai)\n", " Downloading mypy_extensions-1.0.0-py3-none-any.whl.metadata (1.1 kB)\n", "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio->httpx>=0.27.0->mistralai) (1.3.1)\n", "Downloading mistralai-1.5.0-py3-none-any.whl (271 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m271.6/271.6 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)\n", "Downloading jsonpath_python-1.0.6-py3-none-any.whl (7.6 kB)\n", "Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)\n", "Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)\n", "Installing collected packages: mypy-extensions, jsonpath-python, eval-type-backport, typing-inspect, mistralai\n", "Successfully installed eval-type-backport-0.2.2 jsonpath-python-1.0.6 mistralai-1.5.0 mypy-extensions-1.0.0 typing-inspect-0.9.0\n" ] } ], "source": [ "!pip install mistralai" ] }, { "cell_type": "markdown", "metadata": { "id": "nTpiGWkpmvSb" }, "source": [ "We can now set up our client. You can create an API key on [La Plateforme](https://console.mistral.ai/api-keys/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AwG2kwfTlbW1" }, "outputs": [], "source": [ "from mistralai import Mistral\n", "\n", "api_key = \"API_KEY\"\n", "client = Mistral(api_key=api_key)\n", "text_model = \"mistral-small-latest\"\n", "ocr_model = \"mistral-ocr-latest\"" ] }, { "cell_type": "markdown", "metadata": { "id": "F35KDRN-nEMv" }, "source": [ "### System and Tool\n", "For the model to be aware of its purpose and what it can do, it's important to provide a clear system prompt with instructions and explanations of any tools it may have access to.\n", "\n", "Let's define a system prompt and the tools it will have access to, in this case `open_urls`.\n", "\n", "*Note: `open_urls` can easily be customized with other resources and models (for summarization, for example) and many other features. In this demo, we are going for a simpler approach.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zzgxk6qTgGU9" }, "outputs": [], "source": [ "system = \"\"\"You are an AI Assistant with document understanding via URLs. You will be provided with URLs, and you must answer any questions related to those documents.\n", "\n", "# OPEN URLS INSTRUCTIONS\n", "You can open URLs by using the `open_urls` tool. It will open webpages and apply OCR to them, retrieving the contents. 
Use those contents to answer the user.\n", "Only URLs pointing to PDFs and images are supported; you may encounter an error otherwise; relay that information to the user if needed.\"\"\"" ] }, { "cell_type": "code", "source": [ "def _perform_ocr(url: str) -> str:\n", "    try: # Apply OCR to the PDF URL\n", "        response = client.ocr.process(\n", "            model=ocr_model,\n", "            document={\n", "                \"type\": \"document_url\",\n", "                \"document_url\": url\n", "            }\n", "        )\n", "    except Exception:\n", "        try: # If PDF OCR fails, try Image OCR\n", "            response = client.ocr.process(\n", "                model=ocr_model,\n", "                document={\n", "                    \"type\": \"image_url\",\n", "                    \"image_url\": url\n", "                }\n", "            )\n", "        except Exception as e:\n", "            return str(e) # Return the error message to the model if both attempts fail, otherwise return the contents\n", "    return \"\\n\\n\".join([f\"### Page {i+1}\\n{response.pages[i].markdown}\" for i in range(len(response.pages))])" ], "metadata": { "id": "SxP7DlEHWXnK" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "def open_urls(urls: list) -> str:\n", "    contents = \"# Documents\"\n", "    for url in urls:\n", "        contents += f\"\\n\\n## URL: {url}\\n{_perform_ocr(url)}\"\n", "    return contents" ], "metadata": { "id": "s9PgX9fqWY1m" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "BagY4xg0nSSg" }, "source": [ "We also have to define the Tool Schema that will be provided to our API and model.\n", "\n", "By following the [documentation](https://docs.mistral.ai/capabilities/function_calling/), we can create something like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hpBKzNOfliQr" }, "outputs": [], "source": [ "tools = [\n", "    {\n", "        \"type\": \"function\",\n", "        \"function\": {\n", "            \"name\": \"open_urls\",\n", "            \"description\": \"Open URLs pointing to PDFs and images and perform OCR on them.\",\n", "            \"parameters\": {\n", "                \"type\": \"object\",\n", "                \"properties\": {\n", "                    \"urls\": {\n", "                        \"type\": \"array\",\n", "                        \"items\": {\"type\": \"string\"},\n", "                        \"description\": \"The list of URLs to open.\",\n", "                    }\n", "                },\n", "                \"required\": [\"urls\"],\n", "            },\n", "        },\n", "    },\n", "]" ] }, { "cell_type": "code", "source": [ "names_to_functions = {\n", "    'open_urls': open_urls\n", "}" ], "metadata": { "id": "DqalxqIWWVL1" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "W6bE08lPngrm" }, "source": [ "### Test\n", "Everything is ready; we can quickly create a while loop to chat with our model directly in the console.\n", "\n", "The model will use `open_urls` each time URLs are mentioned. If they point to PDFs or images, the tool will perform OCR and provide the raw text contents to the model, which will then use them to answer the user.\n", "\n", "#### Example Prompts (PDF & Image)\n", "- Could you summarize what this research paper talks about? https://arxiv.org/pdf/2410.07073\n", "- What is written here: https://jeroen.github.io/images/testocr.png" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 371 }, "id": "pVeVmWn_ljRo", "outputId": "7fe77386-cb5d-43f4-8bae-ea7b41ddb01a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assistant > The research paper titled \"Pixtral 12B\" introduces a 12-billion-parameter multimodal language model designed to understand both natural images and documents. 
The model is trained on a large-scale dataset of interleaved image and text documents, enabling it to perform multi-turn, multi-image conversations. Pixtral 12B is built on a transformer architecture and includes a new vision encoder, PixtralViT, which allows it to process images at their native resolution and aspect ratio. This flexibility is achieved through a novel RoPE-2D implementation, which supports variable image sizes and aspect ratios without the need for interpolation.\n", "\n", "The model's performance is evaluated on various multimodal benchmarks, where it outperforms other open-source models of similar sizes, such as Qwen-2-VL 7B and Llama-3.2 11B. Pixtral 12B also matches or exceeds the performance of much larger models like Llama-3.2 90B and closed-source models like Claude-3 Haiku and Gemini-1.5 Flash 8B. The paper introduces a new benchmark, MM-MT-Bench, designed to evaluate multimodal models in practical scenarios, and provides detailed analysis and code for standardized evaluation protocols.\n", "\n", "The architecture of Pixtral 12B consists of a multimodal decoder and a vision encoder. The vision encoder, PixtralViT, is trained from scratch and includes several key features such as break tokens, gating in the feedforward layer, sequence packing, and RoPE-2D for relative position encoding. The model is evaluated under various prompts and metrics, demonstrating its robustness and flexibility in handling different types of multimodal tasks.\n", "\n", "The paper also discusses the importance of standardized evaluation protocols and the impact of prompt design on model performance. It highlights that Pixtral 12B performs well under both 'Explicit' and 'Naive' prompts, with only minor regressions on specific benchmarks. The model's performance is further analyzed under flexible parsing constraints, showing that it benefits very little from relaxed metrics and continues to lead even when flexible parsing is accounted for.\n", "\n", "In summary, Pixtral 12B is a state-of-the-art multimodal model that excels in both text-only and multimodal tasks. Its novel architecture, flexibility in processing images, and strong performance across various benchmarks make it a versatile tool for complex multimodal applications. The model is released under the Apache 2.0 license, making it accessible for further research and development.\n", "Assistant > The text written on the image is:\n", "\n", "\"This is a lot of 12 point text to test the ocr code and see if it works on all types of file format. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox.\"\n", "Assistant > You're welcome! 
If you have any more questions or need further assistance, feel free to ask.\n" ] } ], "source": [ "import json\n", "\n", "messages = [{\"role\": \"system\", \"content\": system}]\n", "while True:\n", "    # Insert user input, quit if desired\n", "    user_input = input(\"User > \")\n", "    if user_input == \"quit\":\n", "        break\n", "    messages.append({\"role\": \"user\", \"content\": user_input})\n", "\n", "    # Loop Mistral Small tool use until no tool called\n", "    while True:\n", "        response = client.chat.complete(\n", "            model=text_model,\n", "            messages=messages,\n", "            temperature=0,\n", "            tools=tools\n", "        )\n", "        messages.append({\"role\": \"assistant\", \"content\": response.choices[0].message.content, \"tool_calls\": response.choices[0].message.tool_calls})\n", "\n", "        # If tool called, run tool and continue, else break loop and reply\n", "        if response.choices[0].message.tool_calls:\n", "            tool_call = response.choices[0].message.tool_calls[0]\n", "            function_name = tool_call.function.name\n", "            function_params = json.loads(tool_call.function.arguments)\n", "            function_result = names_to_functions[function_name](**function_params)\n", "            messages.append({\"role\": \"tool\", \"name\": function_name, \"content\": function_result, \"tool_call_id\": tool_call.id})\n", "        else:\n", "            break\n", "\n", "    print(\"Assistant >\", response.choices[0].message.content)" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" } }, "nbformat": 4, "nbformat_minor": 0 }
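Note on customization: the notebook's "System and Tool" section mentions that `open_urls` can be customized with other resources and models, for example for summarization. The sketch below is one hypothetical way to do that; it is not part of the cookbook itself. It assumes the `client`, `text_model`, and `_perform_ocr` objects defined in the notebook above, and the `_summarize` helper and its prompt wording are illustrative assumptions.

def _summarize(markdown: str) -> str:
    # Hypothetical helper: condense OCR output with the same chat model used in the notebook.
    response = client.chat.complete(
        model=text_model,
        messages=[
            {"role": "system", "content": "Summarize the following document in a few sentences."},
            {"role": "user", "content": markdown},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def open_urls(urls: list) -> str:
    # Variant of the notebook's tool: return a summary of each document instead of its raw OCR text.
    contents = "# Documents"
    for url in urls:
        contents += f"\n\n## URL: {url}\n{_summarize(_perform_ocr(url))}"
    return contents

Swapping in this variant keeps the tool schema unchanged, since the function name and its `urls` parameter stay the same; only the returned contents become shorter.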