Annotations

In addition to basic OCR functionality, the Mistral Document AI API offers annotations, which let you extract information in a structured JSON format that you define. Specifically, it offers two types of annotations:

  • bbox_annotation: annotates the bboxes extracted by the OCR model (charts, figures, etc.) according to your requirements and the bbox/image annotation format you provide. For instance, you can ask it to describe or caption each figure.
  • document_annotation: returns an annotation of the entire document based on the document annotation format you provide.
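
To make the distinction concrete, here is a rough sketch of the two kinds of results (illustrative shapes only, drawn from the full examples below; both follow whatever JSON schema you provide):

# Illustrative sketch only: the shapes of the two annotation results.

# bbox_annotation: one structured result per bbox (figure/chart) the OCR model extracts
bbox_annotation_example = {
    "image_type": "scatter plot",
    "short_description": "Comparison of different models based on performance and cost.",
}

# document_annotation: one structured result for the document as a whole
document_annotation_example = {
    "language": "English",
    "chapter_titles": ["Abstract", "1 Introduction"],
}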

Key capabilities:

  • Labeling and annotating data
  • Extraction and structuring of specific information from documents into a predefined JSON format
  • Automation of data extraction to reduce manual entry and errors
  • Efficient handling of large document volumes for enterprise-level applications

Common use cases:

  • Parsing of forms, classification of documents, and processing of images, including text, charts, and signatures
  • Conversion of charts to tables, extraction of fine print from figures, or definition of custom image types
  • Capture of receipt data, including merchant names and transaction amounts, for expense management
  • Extraction of key information, such as vendor details and amounts, from invoices for automated accounting
  • Extraction of key clauses and terms from contracts for easier review and management

How it works

BBox Annotations

  • All document types:
    • After regular OCR finishes, we call a vision-capable LLM on each bbox individually with the provided annotation format.

Document Annotation

  • pdf/image:
    • Independently of OCR, we convert all pages into images and send them to a vision-capable LLM along with the provided annotation format.
  • pptx/docx/...:
    • We run OCR first and send the output markdown text to a vision-capable LLM along with the provided annotation format.

You can use our API with several document formats, including PDF, images, PPTX, and DOCX. In these examples, we only consider OCR on PDF documents.

BBox Annotation

Here is an example of how to use our annotation functionality using the Mistral AI client and Pydantic:

Define the Data Model

First, define the response formats for BBox Annotation using Pydantic models:

from pydantic import BaseModel

# BBOX Annotation response format
class Image(BaseModel):
  image_type: str
  short_description: str
  summary: str

You can also provide a description for each entry; the description will be used as detailed information and instructions during annotation. For example:

from pydantic import BaseModel, Field

# BBOX Annotation response format with description
class Image(BaseModel):
  image_type: str = Field(..., description="The type of the image.")
  short_description: str = Field(..., description="A description in English describing the image.")
  summary: str = Field(..., description="Summarize the image.")
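
If a field should only take a fixed set of values, you can combine descriptions with standard Pydantic typing; here is a sketch (assuming the enumerated values carry through into the generated JSON schema):

from typing import Literal
from pydantic import BaseModel, Field

# Sketch: constrain image_type to a fixed set of labels via Literal.
class Image(BaseModel):
  image_type: Literal["chart", "figure", "table", "signature"] = Field(
      ..., description="The type of the image."
  )
  short_description: str = Field(..., description="A description in English describing the image.")
  summary: str = Field(..., description="Summarize the image.")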

Start the completion

Next, use the Mistral AI Python client to make a request, ensuring the response adheres to the defined structure by setting bbox_annotation_format to the corresponding Pydantic model:

import os
from mistralai import Mistral, DocumentURLChunk
from mistralai.extra import response_format_from_pydantic_model

api_key = os.environ["MISTRAL_API_KEY"]

client = Mistral(api_key=api_key)

# Client call
response = client.ocr.process(
    model="mistral-ocr-latest",
    document=DocumentURLChunk(
      document_url="https://arxiv.org/pdf/2410.07073"
    ),
    bbox_annotation_format=response_format_from_pydantic_model(Image),
    include_image_base64=True
)
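
Each extracted image carries its annotation back on the response as a JSON string; a minimal sketch of reading it (assuming the response exposes pages[].images[].image_annotation, which may vary by client version):

import json

# Walk the OCR response and parse each bbox annotation back into a dict.
for page in response.pages:
    for image in page.images:
        if image.image_annotation:
            annotation = json.loads(image.image_annotation)
            print(annotation["image_type"], "-", annotation["short_description"])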

Here is an example of how to use our annotation functionality using the Mistral AI client and Zod:

Define the Data Model

First, define the response formats for BBox Annotation using Zod schemas:

import { z } from 'zod';

// BBOX Annotation response format
const ImageSchema = z.object({
  image_type: z.string(),
  short_description: z.string(),
  summary: z.string(),
});

You can also provide a description for each entry; the description will be used as detailed information and instructions during annotation. For example:

import { z } from 'zod';

// Define the schema for the Image type
const ImageSchema = z.object({
  image_type: z.string().describe('The type of the image.'),
  short_description: z
    .string()
    .describe('A description in English describing the image.'),
  summary: z.string().describe('Summarize the image.'),
});

Start the completion

Next, use the Mistral AI TypeScript client to make a request, ensuring the response adheres to the defined structure by setting bboxAnnotationFormat to the corresponding Zod schema:

import { Mistral } from '@mistralai/mistralai';
import { responseFormatFromZodObject } from '@mistralai/mistralai/extra/structChat.js';

const apiKey = process.env.MISTRAL_API_KEY;

const client = new Mistral({ apiKey: apiKey });

async function processDocument() {
  try {
    const response = await client.ocr.process({
      model: 'mistral-ocr-latest',
      document: {
        type: 'document_url',
        documentUrl: 'https://arxiv.org/pdf/2410.07073',
      },
      bboxAnnotationFormat: responseFormatFromZodObject(ImageSchema),
      includeImageBase64: true,
    });

    console.log(response);
  } catch (error) {
    console.error('Error processing document:', error);
  }
}

processDocument();

The request is structured to ensure that the response adheres to the specified custom JSON schema. The schema defines the structure of a bbox_annotation object with image_type, short_description and summary properties.

curl --location 'https://api.mistral.ai/v1/ocr' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${MISTRAL_API_KEY}" \
--data '{
    "model": "mistral-ocr-latest",
    "document": {"document_url": "https://arxiv.org/pdf/2410.07073"},
    "bbox_annotation_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "properties": {
                    "image_type": {"title": "Image_Type", "type": "string"},
                    "short_description": {"title": "Short_Description", "type": "string"},
                    "summary": {"title": "Summary", "type": "string"}
                },
                "required": ["image_type", "short_description", "summary"],
                "title": "BBOXAnnotation",
                "type": "object",
                "additionalProperties": false
            },
            "name": "bbox_annotation",
            "strict": true
        }
    },
    "include_image_base64": true
}'

You can also add a description key in your properties object. The description will be used as detailed information and instructions during annotation. For example:

curl --location 'https://api.mistral.ai/v1/ocr' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${MISTRAL_API_KEY}" \
--data '{
    "model": "mistral-ocr-latest",
    "document": {"document_url": "https://arxiv.org/pdf/2410.07073"},
    "bbox_annotation_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "properties": {
                    "image_type": {"title": "Image_Type", "description": "The type of the image.", "type": "string"},
                    "short_description": {"title": "Short_Description", "description": "A description in English describing the image.", "type": "string"},
                    "summary": {"title": "Summary", "description": "Summarize the image.", "type": "string"}
                },
                "required": ["image_type", "short_description", "summary"],
                "title": "BBOXAnnotation",
                "type": "object",
                "additionalProperties": false
            },
            "name": "bbox_annotation",
            "strict": true
        }
    },
    "include_image_base64": true
}'

BBOX Image

Image Base 64

{
  "image_base64": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGB{LONG_MIDDLE_SEQUENCE}KKACiiigAooooAKKKKACiiigD//2Q=="
}

BBOX Annotation Output

{
  "image_type": "scatter plot",
  "short_description": "Comparison of different models based on performance and cost.",
  "summary": "The image consists of two scatter plots comparing various models on two different performance metrics against their cost or number of parameters. The left plot shows performance on the MM-MT-Bench, while the right plot shows performance on the LMSys-Vision ELO. Each point represents a different model, with the x-axis indicating the cost or number of parameters in billions (B) and the y-axis indicating the performance score. The shaded region in both plots highlights the best performance/cost ratio, with Pixtral 12B positioned within this region in both plots, suggesting it offers a strong balance of performance and cost efficiency. Other models like Qwen-2-VL 72B and Qwen-2-VL 7B also show high performance but at varying costs."
}
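
Because the annotation is generated against your schema, you can validate it back into the Image model defined earlier; here is a sketch with a stubbed payload standing in for the image_annotation string from the response:

from pydantic import BaseModel

class Image(BaseModel):
    image_type: str
    short_description: str
    summary: str

# Stubbed payload standing in for an image_annotation string from the response.
raw = '{"image_type": "scatter plot", "short_description": "Model comparison.", "summary": "Two scatter plots."}'

parsed = Image.model_validate_json(raw)  # raises ValidationError if the payload drifts from the schema
print(parsed.image_type)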

Document Annotation

Here is an example of how to use our Document Annotation functionality using the Mistral AI client and Pydantic:

Define the Data Model

First, define the response format for Document Annotation using a Pydantic model:

from pydantic import BaseModel

# Document Annotation response format
class Document(BaseModel):
  language: str
  chapter_titles: list[str]
  urls: list[str]

Start the completion

Next, use the Mistral AI Python client to make a request, ensuring the response adheres to the defined structure by setting document_annotation_format to the corresponding Pydantic model:

import os
from mistralai import Mistral, DocumentURLChunk
from mistralai.extra import response_format_from_pydantic_model

api_key = os.environ["MISTRAL_API_KEY"]

client = Mistral(api_key=api_key)

# Client call
response = client.ocr.process(
    model="mistral-ocr-latest",
    pages=list(range(8)),
    document=DocumentURLChunk(
      document_url="https://arxiv.org/pdf/2410.07073"
    ),
    document_annotation_format=response_format_from_pydantic_model(Document),
    include_image_base64=True
)
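
The document-level result comes back as a single JSON string; a sketch of parsing it into the Document model (assuming the field is named document_annotation, as in the REST response):

# Continuing from the call above: parse the document-level annotation
# back into the Document model (assumed field name: document_annotation).
doc = Document.model_validate_json(response.document_annotation)
print(doc.language)
print(doc.chapter_titles)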

Here is an example of how to use our Document Annotation functionality using the Mistral AI client and Zod:

Define the Data Model

First, define the response formats for Document Annotation using a Zod schema:

import { z } from 'zod';

// Document Annotation response format
const DocumentSchema = z.object({
  language: z.string(),
  chapter_titles: z.array(z.string()),
  urls: z.array(z.string()),
});

Start the completion

Next, use the Mistral AI TypeScript client to make a request, ensuring the response adheres to the defined structure by setting documentAnnotationFormat to the corresponding Zod schema:

import { Mistral } from '@mistralai/mistralai';
import { responseFormatFromZodObject } from '@mistralai/mistralai/extra/structChat.js';

const apiKey = process.env.MISTRAL_API_KEY;

const client = new Mistral({ apiKey: apiKey });

async function processDocument() {
  try {
    const response = await client.ocr.process({
      model: 'mistral-ocr-latest',
      pages: Array.from({ length: 8 }, (_, i) => i), // Creates an array [0, 1, 2, ..., 7]
      document: {
        type: 'document_url',
        documentUrl: 'https://arxiv.org/pdf/2410.07073',
      },
      documentAnnotationFormat: responseFormatFromZodObject(DocumentSchema),
      includeImageBase64: true,
    });

    console.log(response);
  } catch (error) {
    console.error('Error processing document:', error);
  }
}

processDocument();

The request is structured to ensure that the response adheres to the specified custom JSON schema. The schema defines the structure of a document_annotation object with language, chapter_titles, and urls properties.

curl --location 'https://api.mistral.ai/v1/ocr' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${MISTRAL_API_KEY}" \
--data '{
    "model": "mistral-ocr-latest",
    "document": {"document_url": "https://arxiv.org/pdf/2410.07073"},
    "pages": [0, 1, 2, 3, 4, 5, 6, 7],
    "document_annotation_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "properties": {
                    "language": {"title": "Language", "type": "string"},
                    "chapter_titles": {"title": "Chapter_Titles", "type": "array", "items": {"type": "string"}},
                    "urls": {"title": "Urls", "type": "array", "items": {"type": "string"}}
                },
                "required": ["language", "chapter_titles", "urls"],
                "title": "DocumentAnnotation",
                "type": "object",
                "additionalProperties": false
            },
            "name": "document_annotation",
            "strict": true
        }
    },
    "include_image_base64": true
}'

Document Annotation Output

{
  "language": "English",
  "chapter_titles": [
    "Abstract",
    "1 Introduction",
    "2 Architectural details",
    "2.1 Multimodal Decoder",
    "2.2 Vision Encoder",
    "2.3 Complete architecture",
    "3 MM-MT-Bench: A benchmark for multi-modal instruction following",
    "4 Results",
    "4.1 Main Results",
    "4.2 Prompt selection",
    "4.3 Sensitivity to evaluation metrics",
    "4.4 Vision Encoder Ablations"
  ],
  "urls": [
    "https://mistral.ai/news/pixtal-12b/",
    "https://github.com/mistralai/mistral-inference/",
    "https://github.com/mistralai/mistral-evals/",
    "https://huggingface.co/datasets/mistralai/MM-MT-Bench"
  ]
}

BBox Annotation and Document Annotation

Here is an example of how to use both annotation types together using the Mistral AI client and Pydantic:

Define the Data Model

First, define the response formats for both BBox Annotation and Document Annotation using Pydantic models:

from pydantic import BaseModel

# BBOX Annotation response format
class Image(BaseModel):
  image_type: str
  short_description: str
  summary: str

# Document Annotation response format
class Document(BaseModel):
  language: str
  chapter_titles: list[str]
  urls: list[str]

You can also provide a description for each entry; the description will be used as detailed information and instructions during annotation. For example:

from pydantic import BaseModel, Field

# BBOX Annotation response format with description
class Image(BaseModel):
  image_type: str = Field(..., description="The type of the image.")
  short_description: str = Field(..., description="A description in English describing the image.")
  summary: str = Field(..., description="Summarize the image.")

# Document Annotation response format
class Document(BaseModel):
  language: str
  chapter_titles: list[str]
  urls: list[str]

Start the completion

Next, use the Mistral AI Python client to make a request, ensuring the response adheres to the defined structures by setting bbox_annotation_format and document_annotation_format to the corresponding Pydantic models:

import os
from mistralai import Mistral, DocumentURLChunk
from mistralai.extra import response_format_from_pydantic_model

api_key = os.environ["MISTRAL_API_KEY"]

client = Mistral(api_key=api_key)

# Client call
response = client.ocr.process(
    model="mistral-ocr-latest",
    pages=list(range(8)),
    document=DocumentURLChunk(
      document_url="https://arxiv.org/pdf/2410.07073"
    ),
    bbox_annotation_format=response_format_from_pydantic_model(Image),
    document_annotation_format=response_format_from_pydantic_model(Document),
    include_image_base64=True
)
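
Here is a sketch of reading both annotation types back from the combined response (the field names image_annotation and document_annotation are assumptions based on the REST response shape):

# Continuing from the call above.
# Document-level annotation: one JSON string for the whole document.
doc = Document.model_validate_json(response.document_annotation)

# BBox-level annotations: one JSON string per extracted image.
bbox_annotations = [
    Image.model_validate_json(img.image_annotation)
    for page in response.pages
    for img in page.images
    if img.image_annotation
]

print(doc.language, len(bbox_annotations))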

Here is an example of how to use both annotation types together using the Mistral AI client and Zod:

Define the Data Model

First, define the response formats for both BBox Annotation and Document Annotation using Zod schemas:

import { z } from 'zod';

// BBOX Annotation response format
const ImageSchema = z.object({
  image_type: z.string(),
  short_description: z.string(),
  summary: z.string(),
});

// Document Annotation response format
const DocumentSchema = z.object({
  language: z.string(),
  chapter_titles: z.array(z.string()),
  urls: z.array(z.string()),
});

You can also provide a description for each entry; the description will be used as detailed information and instructions during annotation. For example:

import { z } from 'zod';

// Define the schema for the Image type
const ImageSchema = z.object({
  image_type: z.string().describe('The type of the image.'),
  short_description: z
    .string()
    .describe('A description in English describing the image.'),
  summary: z.string().describe('Summarize the image.'),
});

// Document Annotation response format
const DocumentSchema = z.object({
  language: z.string(),
  chapter_titles: z.array(z.string()),
  urls: z.array(z.string()),
});

Start the completion

Next, use the Mistral AI TypeScript client to make a request, ensuring the response adheres to the defined structures by setting bboxAnnotationFormat and documentAnnotationFormat to the corresponding Zod schemas:

import { Mistral } from '@mistralai/mistralai';
import { responseFormatFromZodObject } from '@mistralai/mistralai/extra/structChat.js';

const apiKey = process.env.MISTRAL_API_KEY;

const client = new Mistral({ apiKey: apiKey });

async function processDocument() {
  try {
    const response = await client.ocr.process({
      model: 'mistral-ocr-latest',
      pages: Array.from({ length: 8 }, (_, i) => i), // Creates an array [0, 1, 2, ..., 7]
      document: {
        type: 'document_url',
        documentUrl: 'https://arxiv.org/pdf/2410.07073',
      },
      bboxAnnotationFormat: responseFormatFromZodObject(ImageSchema),
      documentAnnotationFormat: responseFormatFromZodObject(DocumentSchema),
      includeImageBase64: true,
    });

    console.log(response);
  } catch (error) {
    console.error('Error processing document:', error);
  }
}

processDocument();

The request is structured to ensure that the response adheres to the specified custom JSON schema. The schema defines the structure of a bbox_annotation object with image_type, short_description, and summary properties, and a document_annotation object with language, chapter_titles, and urls properties.

curl --location 'https://api.mistral.ai/v1/ocr' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${MISTRAL_API_KEY}" \
--data '{
    "model": "mistral-ocr-latest",
    "document": {"document_url": "https://arxiv.org/pdf/2410.07073"},
    "pages": [0, 1, 2, 3, 4, 5, 6, 7],
    "bbox_annotation_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "properties": {
                    "image_type": {"title": "Image_Type", "type": "string"},
                    "short_description": {"title": "Short_Description", "type": "string"},
                    "summary": {"title": "Summary", "type": "string"}
                },
                "required": ["image_type", "short_description", "summary"],
                "title": "BBOXAnnotation",
                "type": "object",
                "additionalProperties": false
            },
            "name": "bbox_annotation",
            "strict": true
        }
    },
    "document_annotation_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "properties": {
                    "language": {"title": "Language", "type": "string"},
                    "chapter_titles": {"title": "Chapter_Titles", "type": "array", "items": {"type": "string"}},
                    "urls": {"title": "Urls", "type": "array", "items": {"type": "string"}}
                },
                "required": ["language", "chapter_titles", "urls"],
                "title": "DocumentAnnotation",
                "type": "object",
                "additionalProperties": false
            },
            "name": "document_annotation",
            "strict": true
        }
    },
    "include_image_base64": true
}'

You can also add a description key in your properties object. The description will be used as detailed information and instructions during annotation. For example:

curl --location 'https://api.mistral.ai/v1/ocr' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${MISTRAL_API_KEY}" \
--data '{
    "model": "mistral-ocr-latest",
    "document": {"document_url": "https://arxiv.org/pdf/2410.07073"},
    "bbox_annotation_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "properties": {
                    "image_type": {"title": "Image_Type", "description": "The type of the image.", "type": "string"},
                    "short_description": {"title": "Short_Description", "description": "A description in English describing the image.", "type": "string"},
                    "summary": {"title": "Summary", "description": "Summarize the image.", "type": "string"}
                },
                "required": ["image_type", "short_description", "summary"],
                "title": "BBOXAnnotation",
                "type": "object",
                "additionalProperties": false
            },
            "name": "bbox_annotation",
            "strict": true
        }
    },
    "document_annotation_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "properties": {
                    "language": {"title": "Language", "type": "string"},
                    "chapter_titles": {"title": "Chapter_Titles", "type": "array", "items": {"type": "string"}},
                    "urls": {"title": "Urls", "type": "array", "items": {"type": "string"}}
                },
                "required": ["language", "chapter_titles", "urls"],
                "title": "DocumentAnnotation",
                "type": "object",
                "additionalProperties": false
            },
            "name": "document_annotation",
            "strict": true
        }
    },
    "include_image_base64": true
}'

BBOX Image

Image Base 64

{
  "image_base64": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGB{LONG_MIDDLE_SEQUENCE}KKACiiigAooooAKKKKACiiigD//2Q=="
}

BBOX Annotation Output

{
  "image_type": "scatter plot",
  "short_description": "Comparison of different models based on performance and cost.",
  "summary": "The image consists of two scatter plots comparing various models on two different performance metrics against their cost or number of parameters. The left plot shows performance on the MM-MT-Bench, while the right plot shows performance on the LMSys-Vision ELO. Each point represents a different model, with the x-axis indicating the cost or number of parameters in billions (B) and the y-axis indicating the performance score. The shaded region in both plots highlights the best performance/cost ratio, with Pixtral 12B positioned within this region in both plots, suggesting it offers a strong balance of performance and cost efficiency. Other models like Qwen-2-VL 72B and Qwen-2-VL 7B also show high performance but at varying costs."
}

Document Annotation Output

{
  "language": "English",
  "chapter_titles": [
    "Abstract",
    "1 Introduction",
    "2 Architectural details",
    "2.1 Multimodal Decoder",
    "2.2 Vision Encoder",
    "2.3 Complete architecture",
    "3 MM-MT-Bench: A benchmark for multi-modal instruction following",
    "4 Results",
    "4.1 Main Results",
    "4.2 Prompt selection",
    "4.3 Sensitivity to evaluation metrics",
    "4.4 Vision Encoder Ablations"
  ],
  "urls": [
    "https://mistral.ai/news/pixtal-12b/",
    "https://github.com/mistralai/mistral-inference/",
    "https://github.com/mistralai/mistral-evals/",
    "https://huggingface.co/datasets/mistralai/MM-MT-Bench"
  ]
}

Cookbooks

For more information and guides on how to make use of OCR and annotations, see our cookbooks.

FAQ

Q: Are there any limits regarding the Document AI API?
A: Yes, there are certain limitations. Uploaded document files must not exceed 50 MB in size and must be no longer than 1,000 pages.

Q: Are there any limits regarding Annotations?
A: When using Document Annotation, the file cannot have more than 8 pages. BBox Annotations do not have this limit.
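
If a document exceeds eight pages, one workaround is to request Document Annotations in eight-page batches via the pages parameter and merge the results yourself. Here is a hedged sketch reusing the Document model from the examples above:

import os

from mistralai import Mistral, DocumentURLChunk
from mistralai.extra import response_format_from_pydantic_model
from pydantic import BaseModel

class Document(BaseModel):
    language: str
    chapter_titles: list[str]
    urls: list[str]

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

def annotate_in_batches(url: str, total_pages: int, batch_size: int = 8) -> list[Document]:
    """Annotate a long document in <=8-page batches and collect the results."""
    results = []
    for start in range(0, total_pages, batch_size):
        pages = list(range(start, min(start + batch_size, total_pages)))
        response = client.ocr.process(
            model="mistral-ocr-latest",
            pages=pages,
            document=DocumentURLChunk(document_url=url),
            document_annotation_format=response_format_from_pydantic_model(Document),
        )
        # Assumed field name: document_annotation (a JSON string), as in the REST response.
        results.append(Document.model_validate_json(response.document_annotation))
    return results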