Data Extraction
Details
File: mistral/ocr/data_extraction.ipynb
Type: Jupyter Notebook
Use Cases: OCR, Data Extraction, Structured outputs
Content
Notebook content (JSON format):
{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "7FPiAIwHteCl" }, "source": [ "# Extract Data from Documents via Annotations\n", "\n", "---\n", "\n", "## Annotations for Structured Outputs and Data Extraction\n", "In this cookbook, we will explore the basics of Annotations and to achieve structured outputs fueled by our OCR model.\n", "\n", "You may want to do this in case current vision models are not powerful enough, hence enhancing their vision OCR capabilities with the OCR model to achieve better structured data extraction.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "mwPFgFUsm_D5" }, "source": [ "## What are Annotations?\n", "\n", "Mistral Document AI API adds two annotation functionalities:\n", "- `document_annotation`: returns the annotation of the entire document based on the input schema.\n", "- `box_annotation`: gives you the annotation of the bboxes extracted by the OCR model (charts/ figures etc) based on user requirement. The user may ask to describe/caption the figure for instance." ] }, { "cell_type": "markdown", "metadata": { "id": "9pv6a79EnOn3" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "3VOZLPzq0tcI" }, "source": [ "Learn more about Annotations [here](https://docs.mistral.ai/capabilities/OCR/annotations/)." ] }, { "cell_type": "markdown", "metadata": { "id": "UgZW4ZfetwAl" }, "source": [ "## Setup\n", "\n", "First, let's install `mistralai` and download the required files." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "po7Cukllt8za" }, "outputs": [], "source": [ "%%capture\n", "!pip install mistralai" ] }, { "cell_type": "markdown", "metadata": { "id": "g8rxv4Tx5kNX" }, "source": [ "### Download PDF" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MtKgrASwF3Ol" }, "outputs": [], "source": [ "%%capture\n", "!wget https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/mistral7b.pdf" ] }, { "cell_type": "markdown", "metadata": { "id": "-1kV16LmkRD1" }, "source": [ "### Create Client\n", "\n", "We will need to set up our client. You can create an API key on our [Plateforme](https://console.mistral.ai/api-keys/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FZdL0ZXYkO0n" }, "outputs": [], "source": [ "# Initialize Mistral client with API key\n", "from mistralai import Mistral\n", "\n", "api_key = \"API_KEY\" # Replace with your API key\n", "client = Mistral(api_key=api_key)" ] }, { "cell_type": "markdown", "metadata": { "id": "NhwM0aITt7ti" }, "source": [ "## Mistral OCR without Annotations\n", "For our cookbook we will use a pdf file, annotate it and extract data from the document.\n", "\n", "First we have to make a function to encode our pdf file in base64, you can also upload the file to our cloud and use a signed url instead." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "G372Api7lb_Z" }, "outputs": [], "source": [ "import base64\n", "\n", "def encode_pdf(pdf_path):\n", " \"\"\"Encode the pdf to base64.\"\"\"\n", " try:\n", " with open(pdf_path, \"rb\") as pdf_file:\n", " return base64.b64encode(pdf_file.read()).decode('utf-8')\n", " except FileNotFoundError:\n", " print(f\"Error: The file {pdf_path} was not found.\")\n", " return None\n", " except Exception as e: # Added general exception handling\n", " print(f\"Error: {e}\")\n", " return None" ] }, { "cell_type": "markdown", "metadata": { "id": "4VsM34av2Hwc" }, "source": [ "Now with our function ready, we can encode our pdf file and call our OCR model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 240 }, "id": "svaJGBFlqm7_", "outputId": "ab29de8f-667b-40fe-e884-db0ba1c10f29" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"pages\": [\n", " {\n", " \"index\": 0,\n", " \"markdown\": \"# Mistral 7B \\n\\nAlbert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L\\u00e9lio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth\\u00e9e Lacroix, William El Sayed\\n\\n\\n\\n\\n#### Abstract\\n\\nWe introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned \n" ] } ], "source": [ "import requests\n", "import os\n", "import json\n", "\n", "# Path to your pdf\n", "pdf_path = \"mistral7b.pdf\"\n", "\n", "# Getting the base64 string\n", "base64_pdf = encode_pdf(pdf_path)\n", "\n", "# Call the OCR API\n", "pdf_response = client.ocr.process(\n", " model=\"mistral-ocr-latest\",\n", " document={\n", " \"type\": \"document_url\",\n", " \"document_url\": f\"data:application/pdf;base64,{base64_pdf}\"\n", " },\n", " include_image_base64=True\n", ")\n", "\n", "# Convert response to JSON format\n", "response_dict = json.loads(pdf_response.model_dump_json())\n", "\n", "print(json.dumps(response_dict, indent=4)[0:1000]) # check the first 1000 characters" ] }, { "cell_type": "markdown", "metadata": { "id": "EG2_TdlKIxYs" }, "source": [ "We can view the result with the following:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "dxefUpm-Idp8", "outputId": "715b1d4b-5afb-4b96-e15d-ec3fc1b287b5" }, "outputs": [ { "data": { "text/markdown": [ "# Mistral 7B \n", "\n", "Albert Q. 
Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed\n", "\n", "\n", "\n", "\n", "#### Abstract\n", "\n", "We introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/\n", "\n", "\n", "## 1 Introduction\n", "\n", "In the rapidly evolving domain of Natural Language Processing (NLP), the race towards higher model performance often necessitates an escalation in model size. However, this scaling tends to increase computational costs and inference latency, thereby raising barriers to deployment in practical, real-world scenarios. In this context, the search for balanced models delivering both high-level performance and efficiency becomes critically essential. Our model, Mistral 7B, demonstrates that a carefully designed language model can deliver high performance while maintaining an efficient inference. Mistral 7B outperforms the previous best 13B model (Llama 2, [26]) across all tested benchmarks, and surpasses the best 34B model (LLaMa 34B, [25]) in mathematics and code generation. Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [20], without sacrificing performance on non-code related benchmarks.\n", "\n", "Mistral 7B leverages grouped-query attention (GQA) [1], and sliding window attention (SWA) [6, 3]. GQA significantly accelerates the inference speed, and also reduces the memory requirement during decoding, allowing for higher batch sizes hence higher throughput, a crucial factor for real-time applications. In addition, SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.\n", "\n", "Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference implementation ${ }^{1}$ facilitating easy deployment either locally or on cloud platforms such as AWS, GCP, or Azure using the vLLM [17] inference server and SkyPilot ${ }^{2}$. Integration with Hugging Face ${ }^{3}$ is also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B - Chat model.\n", "\n", "Mistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. 
Through our work, our aim is to help the community create more affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications.\n", "\n", "# 2 Architectural details \n", "\n", "\n", "\n", "Figure 1: Sliding Window Attention. The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability. To alleviate this issue, we use sliding window attention: each token can attend to at most $W$ tokens from the previous layer (here, $W=3$ ). Note that tokens outside the sliding window still influence next word prediction. At each attention layer, information can move forward by $W$ tokens. Hence, after $k$ attention layers, information can move forward by up to $k \\times W$ tokens.\n", "\n", "Mistral 7B is based on a transformer architecture [27]. The main parameters of the architecture are summarized in Table 1. Compared to Llama, it introduces a few changes that we summarize below.\n", "Sliding Window Attention. SWA exploits the stacked layers of a transformer to attend information beyond the window size $W$. The hidden state in position $i$ of the layer $k, h_{i}$, attends to all hidden states from the previous layer with positions between $i-W$ and $i$. Recursively, $h_{i}$ can access tokens from the input layer at a distance of up to $W \\times k$ tokens, as illustrated in Figure 1. At the last layer, using a window size of $W=4096$, we have a theoretical attention span of approximately $131 K$ tokens. In practice, for a sequence length of 16 K and $W=4096$, changes made to FlashAttention [11] and xFormers [18] yield a 2x speed improvement over a vanilla attention baseline.\n", "\n", "| Parameter | Value |\n", "| :-- | --: |\n", "| dim | 4096 |\n", "| n_layers | 32 |\n", "| head_dim | 128 |\n", "| hidden_dim | 14336 |\n", "| n_heads | 32 |\n", "| n_kv_heads | 8 |\n", "| window_size | 4096 |\n", "| context_len | 8192 |\n", "| vocab_size | 32000 |\n", "\n", "Table 1: Model architecture.\n", "\n", "Rolling Buffer Cache. A fixed attention span means that we can limit our cache size using a rolling buffer cache. The cache has a fixed size of $W$, and the keys and values for the timestep $i$ are stored in position $i \\bmod W$ of the cache. As a result, when the position $i$ is larger than $W$, past values in the cache are overwritten, and the size of the cache stops increasing. We provide an illustration in Figure 2 for $W=3$. On a sequence length of 32 k tokens, this reduces the cache memory usage by 8 x , without impacting the model quality.\n", "\n", "[^0]\n", "[^0]: ${ }^{1}$ https://github.com/mistralai/mistral-src\n", " ${ }^{2}$ https://github.com/skypilot-org/skypilot\n", " ${ }^{3}$ https://huggingface.co/mistralai\n", "\n", "\n", "\n", "Figure 2: Rolling buffer cache. The cache has a fixed size of $W=4$. Keys and values for position $i$ are stored in position $i \\bmod W$ of the cache. When the position $i$ is larger than $W$, past values in the cache are overwritten. The hidden state corresponding to the latest generated tokens are colored in orange.\n", "\n", "Pre-fill and Chunking. When generating a sequence, we need to predict tokens one-by-one, as each token is conditioned on the previous ones. However, the prompt is known in advance, and we can pre-fill the $(k, v)$ cache with the prompt. 
If the prompt is very large, we can chunk it into smaller pieces, and pre-fill the cache with each chunk. For this purpose, we can select the window size as our chunk size. For each chunk, we thus need to compute the attention over the cache and over the chunk. Figure 3 shows how the attention mask works over both the cache and the chunk.\n", "\n", "\n", "\n", "Figure 3: Pre-fill and chunking. During pre-fill of the cache, long sequences are chunked to limit memory usage. We process a sequence in three chunks, \"The cat sat on\", \"the mat and saw\", \"the dog go to\". The figure shows what happens for the third chunk (\"the dog go to\"): it attends itself using a causal mask (rightmost block), attends the cache using a sliding window (center block), and does not attend to past tokens as they are outside of the sliding window (left block).\n", "\n", "## 3 Results\n", "\n", "We compare Mistral 7B to Llama, and re-run all benchmarks with our own evaluation pipeline for fair comparison. We measure performance on a wide variety of tasks categorized as follow:\n", "\n", "- Commonsense Reasoning (0-shot): Hellaswag [28], Winogrande [21], PIQA [4], SIQA [22], OpenbookQA [19], ARC-Easy, ARC-Challenge [9], CommonsenseQA [24]\n", "- World Knowledge (5-shot): NaturalQuestions [16], TriviaQA [15]\n", "- Reading Comprehension (0-shot): BoolQ [8], QuAC [7]\n", "- Math: GSM8K [10] (8-shot) with maj@8 and MATH [13] (4-shot) with maj@4\n", "- Code: Humaneval [5] (0-shot) and MBPP [2] (3-shot)\n", "- Popular aggregated results: MMLU [12] (5-shot), BBH [23] (3-shot), and AGI Eval [29] (3-5-shot, English multiple-choice questions only)\n", "\n", "Detailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2. Figure 4 compares the performance of Mistral 7B with Llama 2 7B/13B, and Llama $134 B^{4}$ in different categories. Mistral 7B surpasses Llama 2 13B across all metrics, and outperforms Llama 1 34B on most benchmarks. In particular, Mistral 7B displays a superior performance in code, mathematics, and reasoning benchmarks.\n", "\n", "[^0]\n", "[^0]: ${ }^{4}$ Since Llama 2 34B was not open-sourced, we report results for Llama 1 34B.\n", "\n", "\n", "\n", "Figure 4: Performance of Mistral 7B and different Llama models on a wide range of benchmarks. All models were re-evaluated on all metrics with our evaluation pipeline for accurate comparison. Mistral 7B significantly outperforms Llama 2 7B and Llama 2 13B on all benchmarks. It is also vastly superior to Llama 1 34B in mathematics, code generation, and reasoning benchmarks.\n", "\n", "| Model | Modality | MMLU | HellaSwag | WinoG | PIQA | Arc-e | Arc-c | NQ | TriviaQA | HumanEval | MBPP | MATH | GSM8K |\n", "| :-- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n", "| LLaMA 2 7B | Pretrained | $44.4 \\%$ | $77.1 \\%$ | $69.5 \\%$ | $77.9 \\%$ | $68.7 \\%$ | $43.2 \\%$ | $24.7 \\%$ | $63.8 \\%$ | $11.6 \\%$ | $26.1 \\%$ | $3.9 \\%$ | $16.0 \\%$ |\n", "| LLaMA 2 13B | Pretrained | $55.6 \\%$ | $\\mathbf{8 0 . 7 \\%}$ | $72.9 \\%$ | $80.8 \\%$ | $75.2 \\%$ | $48.8 \\%$ | $\\mathbf{2 9 . 0 \\%}$ | $\\mathbf{6 9 . 6 \\%}$ | $18.9 \\%$ | $35.4 \\%$ | $6.0 \\%$ | $34.3 \\%$ |\n", "| Code-Llama 7B | Finetuned | $36.9 \\%$ | $62.9 \\%$ | $62.3 \\%$ | $72.8 \\%$ | $59.4 \\%$ | $34.5 \\%$ | $11.0 \\%$ | $34.9 \\%$ | $\\mathbf{3 1 . 1 \\%}$ | $\\mathbf{5 2 . 5 \\%}$ | $5.2 \\%$ | $20.8 \\%$ |\n", "| Mistral 7B | Pretrained | $\\mathbf{6 0 . 1 \\%}$ | $\\mathbf{8 1 . 3 \\%}$ | $\\mathbf{7 5 . 
3 \\%}$ | $\\mathbf{8 3 . 0 \\%}$ | $\\mathbf{8 0 . 0 \\%}$ | $\\mathbf{5 5 . 5 \\%}$ | $\\mathbf{2 8 . 8 \\%}$ | $\\mathbf{6 9 . 9 \\%}$ | $\\mathbf{3 0 . 5 \\%}$ | $47.5 \\%$ | $\\mathbf{1 3 . 1 \\%}$ | $\\mathbf{5 2 . 2 \\%}$ |\n", "\n", "Table 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Llama 2 13B on all metrics, and approaches the code performance of Code-Llama 7B without sacrificing performance on non-code benchmarks.\n", "\n", "Size and Efficiency. We computed \"equivalent model sizes\" of the Llama 2 family, aiming to understand Mistral 7B models' efficiency in the cost-performance spectrum (see Figure 5). When evaluated on reasoning, comprehension, and STEM reasoning (specifically MMLU), Mistral 7B mirrored performance that one might expect from a Llama 2 model with more than 3x its size. On the Knowledge benchmarks, Mistral 7B's performance achieves a lower compression rate of 1.9 x , which is likely due to its limited parameter count that restricts the amount of knowledge it can store.\n", "\n", "Evaluation Differences. On some benchmarks, there are some differences between our evaluation protocol and the one reported in the Llama 2 paper: 1) on MBPP, we use the hand-verified subset 2) on TriviaQA, we do not provide Wikipedia contexts.\n", "\n", "## 4 Instruction Finetuning\n", "\n", "To evaluate the generalization capabilities of Mistral 7B, we fine-tuned it on instruction datasets publicly available on the Hugging Face repository. No proprietary data or training tricks were utilized: Mistral 7B - Instruct model is a simple and preliminary demonstration that the base model can easily be fine-tuned to achieve good performance. In Table 3, we observe that the resulting model, Mistral 7B - Instruct, exhibits superior performance compared to all 7B models on MT-Bench, and is comparable to 13B - Chat models. An independent human evaluation was conducted on https://limboxing.com/leaderboard.\n", "\n", "| Model | Chatbot Arena <br> ELO Rating | MT Bench |\n", "| :-- | :--: | :--: |\n", "| WizardLM 13B v1.2 | 1047 | 7.2 |\n", "| Mistral 7B Instruct | $\\mathbf{1 0 3 1}$ | $\\mathbf{6 . 8 4}$ +/- $\\mathbf{0 . 0 7}$ |\n", "| Llama 2 13B Chat | 1012 | 6.65 |\n", "| Vicuna 13B | 1041 | 6.57 |\n", "| Llama 2 7B Chat | 985 | 6.27 |\n", "| Vicuna 7B | 997 | 6.17 |\n", "| Alpaca 13B | 914 | 4.53 |\n", "\n", "Table 3: Comparison of Chat models. Mistral 7B Instruct outperforms all 7B models on MT-Bench, and is comparable to 13B - Chat models.\n", "\n", "In this evaluation, participants were provided with a set of questions along with anonymous responses from two models and were asked to select their preferred response, as illustrated in Figure 6. As of October 6, 2023, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143 times for Llama 2 13B.\n", "\n", "\n", "\n", "Figure 5: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension for Mistral 7B and Llama 2 (7B/13B/70B). Mistral 7B largely outperforms Llama 2 13B on all evaluations, except on knowledge benchmarks, where it is on par (this is likely due to its limited parameter count, which limits the amount of knowledge it can compress).\n", "\n", "# 5 Adding guardrails for front-facing applications \n", "\n", "The ability to enforce guardrails when it comes to AI generation is important for front-facing applications. In this section, we highlight how to leverage system prompting to optionally enforce output constraints on top of our models. 
Additionally, we showcase the ability of Mistral 7B to perform fine-grained content moderation, which can be useful to enforce quality content in applications.\n", "\n", "### 5.1 System prompt to enforce guardrails\n", "\n", "We introduce a system prompt (see below) to guide the model to generate answers within specified guardrails, similar to the work done with Llama 2. Using this prompt allows the user to move on the Pareto front of model utility / guardrails enforcement, as indicated in Table 4.\n", "\n", "Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.\n", "\n", "We use a set of 175 unsafe prompts for evaluating safety. With the recommended system prompt, the model properly declines to answer $100 \\%$ of the harmful questions.\n", "As an illustration, we provide in Table 5 the answers of both Mistral 7B - Instruct and Llama 2 Chat 13B to the question How to kill a linux process with system prompts activated. We observe that Mistral 7B provides a correct response while Llama 2 declines to answer. Note that on this specific question, both models answer correctly when system prompts are deactivated.\n", "\n", "| Guardrails | MT Bench |\n", "| :-- | :--: |\n", "| No system prompt | $6.84 \\pm 0.07$ |\n", "| Llama 2 system prompt | $6.38 \\pm 0.07$ |\n", "| Mistral system prompt | $6.58 \\pm 0.05$ |\n", "\n", "Table 4: System prompts. Mean official MT Bench score over 10 iterations with standard deviation for Mistral 7B - Instruct. For reference, Llama 2 13B - Chat reports official results of 6.65 .\n", "\n", "### 5.2 Content moderation with self-reflection\n", "\n", "Mistral 7B - Instruct can be used as a content moderator: the model itself is able to accurately classify a user prompt or its generated answer as being either acceptable or falling into one of the following categories: Illegal activities such as terrorism, child abuse or fraud; Hateful, harassing or violent content such as discrimination, self-harm or bullying; Unqualified advice for instance in legal, medical or financial domains.\n", "\n", "| Model | Answer |\n", "| :--: | :--: |\n", "| Mistral 7B - Instruct with Mistral system prompt | To kill a Linux process, you can use the `kill' command followed by the process ID (PID) of the process you want to terminate. For example, to kill process with PID 1234, you would run the command `kill 1234`. It's important to note that killing a process can have unintended consequences, so it's generally a good idea to only kill processes that you are certain you want to terminate. Additionally, it's generally a good idea to use the 'kill' command with caution, as it can be used to terminate critical system processes that are necessary for the operation of the system. |\n", "| Llama 2 13B - Chat with Llama 2 system prompt | As a helpful and respectful assistant, I cannot provide information on how to kill a Linux process. It is not appropriate or ethical to intentionally harm or terminate processes, as it can cause damage to the system and potentially result in data loss. Additionally, it is important to respect the autonomy and security of other users' processes, as they may be critical to their work or personal activities. Instead, I suggest that you explore other options to address any issues or concerns you may have. |\n", "\n", "Table 5: Comparison between Mistral and Llama system prompts. 
Mistral provides the right answer whereas Llama 2 declines to answer to the question How to kill a linux process.\n", "\n", "To do so, we designed a self-reflection prompt that makes Mistral 7B classify a prompt or a generated answer. We evaluated self-reflection on our manually curated and balanced dataset of adversarial and standard prompts and got a precision of $99.4 \\%$ for a recall of $95.6 \\%$ (considering acceptable prompts as positives).\n", "\n", "The use cases are vast, from moderating comments on social media or forums to brand monitoring on the internet. In particular, the end user is able to select afterwards which categories to effectively filter based on their particular use-case.\n", "\n", "# 6 Conclusion \n", "\n", "Our work on Mistral 7B demonstrates that language models may compress knowledge more than what was previously thought. This opens up interesting perspectives: the field has so far put the emphasis on scaling laws in 2 dimensions (directly associating model capabilities to training cost, as in [14]); the problem is rather 3 dimensional (model capabilities, training cost, inference cost), and much remains to be explored to obtain the best performance with the smallest possible model.\n", "\n", "## Acknowledgements\n", "\n", "We are grateful to CoreWeave for their 24/7 help in marshalling our cluster. We thank the CINECA/EuroHPC team, and in particular the operators of Leonardo, for their resources and help. We thank the maintainers of FlashAttention, vLLM, xFormers, Skypilot for their precious assistance in implementing new features and integrating their solutions into ours. A huge thanks to Tri Dao and Daniel Haziza for helping include Mistral related changes to FlashAttention and xFormers on a tight schedule. We thank the teams of Hugging Face, AWS, GCP, Azure ML for their intense help in making our model compatible everywhere.\n", "\n", "# Can you recommend a foundational and accessible book about quantum physics? \n", "\n", "Capy Link to Question Read Round (Enter) $\\rightarrow$\n", "\n", "Round Winner: mistral-7b-instruct-v0.1\n", "\n", "\n", "## mistral-7b-instruct-v0.1 R Round\n", "\n", " Winner\n", "A great book for a foundational understanding of quantum physics is \"The Quantum Universe: Everything that Can Happen Does Happen\" by Brian Cox and Jeff Forshaw. This book is accessible and easy to understand, making it a great choice for beginners. It covers all the basics of quantum mechanics, including superposition, entanglement, and the uncertainty principle. The authors also explore the implications of quantum physics for our understanding of the universe, including the nature of time and space. Overall, \"The Quantum Universe\" is an excellent choice for anyone looking to gain a foundational understanding of this fascinating field\n", "\n", "## llama-2-13b-chat\n", "\n", "Certainly! For a foundational and accessible introduction to quantum physics, I highly recommend \"The Feynman Lectures on Physics\" by Richard Feynman. This classic text provides a clear and engaging overview of the principles and concepts of quantum mechanics, with minimal mathematical formalism and plenty of illustrations and examples. It's a great starting point for anyone looking to gain a deeper understanding of quantum physics without getting bogged down in too much technical detail.\n", "\n", "Figure 6: Human evaluation of Mistral 7B - Instruct vs Llama 2 13B - Chat Example. An example of human evaluation from libboxing.com. 
The question asks for recommendations of books in quantum physics. Llama 2 13B - Chat recommends a general physics book, while Mistral 7B - Instruct recommends a more relevant book on quantum physics and describes in the contents in more detail.\n", "\n", "# References \n", "\n", "[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.\n", "[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.\n", "[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.\n", "[4] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020.\n", "[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.\n", "[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.\n", "[7] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018.\n", "[8] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.\n", "[9] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.\n", "[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.\n", "[11] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.\n", "[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.\n", "[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.\n", "[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. An empirical analysis of compute-optimal large language model training. 
In Advances in Neural Information Processing Systems, volume 35, 2022.\n", "[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.\n", "[16] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466, 2019.\n", "\n", "[17] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.\n", "[18] Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xformers: A modular and hackable transformer modelling library. https://github.com/ facebookresearch/xformers, 2022.\n", "[19] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.\n", "[20] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.\n", "[21] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021.\n", "[22] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.\n", "[23] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.\n", "[24] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.\n", "[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.\n", "[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.\n", "[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.\n", "[28] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? 
arXiv preprint arXiv:1905.07830, 2019.\n", "[29] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023." ], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from mistralai.models import OCRResponse\n", "from IPython.display import Markdown, display\n", "\n", "def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:\n", " \"\"\"\n", " Replace image placeholders in markdown with base64-encoded images.\n", "\n", " Args:\n", " markdown_str: Markdown text containing image placeholders\n", " images_dict: Dictionary mapping image IDs to base64 strings\n", "\n", " Returns:\n", " Markdown text with images replaced by base64 data\n", " \"\"\"\n", " for img_name, base64_str in images_dict.items():\n", " markdown_str = markdown_str.replace(\n", " f\"![{img_name}]({img_name})\", f\"![{img_name}]({base64_str})\"\n", " )\n", " return markdown_str\n", "\n", "def get_combined_markdown(ocr_response: OCRResponse) -> str:\n", " \"\"\"\n", " Combine OCR text and images into a single markdown document.\n", "\n", " Args:\n", " ocr_response: Response from OCR processing containing text and images\n", "\n", " Returns:\n", " Combined markdown string with embedded images\n", " \"\"\"\n", " markdowns: list[str] = []\n", " # Extract images from page\n", " for page in ocr_response.pages:\n", " image_data = {}\n", " for img in page.images:\n", " image_data[img.id] = img.image_base64\n", " # Replace image placeholders with actual images\n", " markdowns.append(replace_images_in_markdown(page.markdown, image_data))\n", "\n", " return \"\\n\\n\".join(markdowns)\n", "\n", "# Display combined markdowns and images\n", "display(Markdown(get_combined_markdown(pdf_response)))" ] }, { "cell_type": "markdown", "metadata": { "id": "LNzRYPDcyaLX" }, "source": [ "## Mistral OCR with Annotations\n", "First, we need to create our Annotation Formats; for that, we advise making use of `pydantic`. \n", "For this example, we will extract the image type and a description of each bbox, as well as the language, authors, and a summary of the full document." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6yOKFR6XlnnC" }, "outputs": [], "source": [ "from pydantic import BaseModel, Field\n", "from enum import Enum\n", "\n", "class ImageType(str, Enum):\n", " GRAPH = \"graph\"\n", " TEXT = \"text\"\n", " TABLE = \"table\"\n", " IMAGE = \"image\"\n", "\n", "class Image(BaseModel):\n", " image_type: ImageType = Field(..., description=\"The type of the image. Must be one of 'graph', 'text', 'table' or 'image'.\")\n", " description: str = Field(..., description=\"A description of the image.\")\n", "\n", "class Document(BaseModel):\n", " language: str = Field(..., description=\"The language of the document in ISO 639-1 code format (e.g., 'en', 'fr').\")\n", " summary: str = Field(..., description=\"A summary of the document.\")\n", " authors: list[str] = Field(..., description=\"A list of authors who contributed to the document.\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-8XLfRlQyaLX" }, "source": [ "Now, with our pydantic models created for our Annotations, we can call our OCR endpoint. \n", "The objective is to annotate and extract information from our document and from the detected bboxes/images."
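To request both annotation types, the OCR endpoint is called with response formats derived from the pydantic models above. A minimal sketch of such a call is shown below; it assumes the `mistralai` SDK exposes `response_format_from_pydantic_model` in `mistralai.extra` and reuses `client` and `base64_pdf` from the earlier cells, so treat it as an illustration rather than the notebook's exact cell.

```python
# Sketch: OCR call with bbox and document annotations.
# `Image` and `Document` are the pydantic models defined above.
import json
from mistralai.extra import response_format_from_pydantic_model

annotated_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{base64_pdf}",
    },
    bbox_annotation_format=response_format_from_pydantic_model(Image),
    document_annotation_format=response_format_from_pydantic_model(Document),
)

# Inspect the annotated result.
response_dict = json.loads(annotated_response.model_dump_json())
print(json.dumps(response_dict, indent=4)[0:2000])
```

Each extracted bbox then carries an `image_annotation` following the `Image` schema, and the document-level annotation follows the `Document` schema, as in the output below.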
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EQjLk9hlmJKH", "outputId": "b662698d-1383-486b-8091-58ebb43b37e0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"pages\": [\n", " {\n", " \"index\": 0,\n", " \"markdown\": \"# Mistral 7B \\n\\nAlbert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L\\u00e9lio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth\\u00e9e Lacroix, William El Sayed\\n\\n\\n\\n\\n#### Abstract\\n\\nWe introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/\\n\\n\\n## 1 Introduction\\n\\nIn the rapidly evolving domain of Natural Language Processing (NLP), the race towards higher model performance often necessitates an escalation in model size. However, this scaling tends to increase computational costs and inference latency, thereby raising barriers to deployment in practical, real-world scenarios. In this context, the search for balanced models delivering both high-level performance and efficiency becomes critically essential. Our model, Mistral 7B, demonstrates that a carefully designed language model can deliver high performance while maintaining an efficient inference. Mistral 7B outperforms the previous best 13B model (Llama 2, [26]) across all tested benchmarks, and surpasses the best 34B model (LLaMa 34B, [25]) in mathematics and code generation. Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [20], without sacrificing performance on non-code related benchmarks.\\n\\nMistral 7B leverages grouped-query attention (GQA) [1], and sliding window attention (SWA) [6, 3]. GQA significantly accelerates the inference speed, and also reduces the memory requirement during decoding, allowing for higher batch sizes hence higher throughput, a crucial factor for real-time applications. In addition, SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. 
These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.\",\n", " \"images\": [\n", " {\n", " \"id\": \"img-0.jpeg\",\n", " \"top_left_x\": 425,\n", " \"top_left_y\": 598,\n", " \"bottom_right_x\": 1283,\n", " \"bottom_right_y\": 893,\n", " \"image_base64\": null,\n", " \"image_annotation\": \"{\\n \\\"image_type\\\": \\\"image\\\",\\n \\\"description\\\": \\\"A 3D rendering of the text 'Mistral AI' in a gradient of warm colors, transitioning from orange to brown.\\\"\\n}\"\n", " }\n", " ],\n", " \"dimensions\": {\n", " \"dpi\": 200,\n", " \"height\": 2200,\n", " \"width\": 1700\n", " }\n", " },\n", " {\n", " \"index\": 1,\n", " \"markdown\": \"Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference implementation ${ }^{1}$ facilitating easy deployment either locally or on cloud platforms such as AWS, GCP, or Azure using the vLLM [17] inference server and SkyPilot ${ }^{2}$. Integration with Hugging Face ${ }^{3}$ is also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B - Chat model.\\n\\nMistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. Through our work, our aim is to help the community create more affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications.\\n\\n# 2 Architectural details \\n\\n\\n\\nFigure 1: Sliding Window Attention. The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability. To alleviate this issue, we use sliding window attention: each token can attend to at most $W$ tokens from the previous layer (here, $W=3$ ). Note that tokens outside the sliding window still influence next word prediction. At each attention layer, information can move forward by $W$ tokens. Hence, after $k$ attention layers, information can move forward by up to $k \\\\times W$ tokens.\\n\\nMistral 7B is based on a transformer architecture [27]. The main parameters of the architecture are summarized in Table 1. Compared to Llama, it introduces a few changes that we summarize below.\\nSliding Window Attention. SWA exploits the stacked layers of a transformer to attend information beyond the window size $W$. The hidden state in position $i$ of the layer $k, h_{i}$, attends to all hidden states from the previous layer with positions between $i-W$ and $i$. Recursively, $h_{i}$ can access tokens from the input layer at a distance of up to $W \\\\times k$ tokens, as illustrated in Figure 1. At the last layer, using a window size of $W=4096$, we have a theoretical attention span of approximately $131 K$ tokens. 
In practice, for a sequence length of 16 K and $W=4096$, changes made to FlashAttention [11] and xFormers [18] yield a 2x speed improvement over a vanilla attention baseline.\\n\\n| Parameter | Value |\\n| :-- | --: |\\n| dim | 4096 |\\n| n_layers | 32 |\\n| head_dim | 128 |\\n| hidden_dim | 14336 |\\n| n_heads | 32 |\\n| n_kv_heads | 8 |\\n| window_size | 4096 |\\n| context_len | 8192 |\\n| vocab_size | 32000 |\\n\\nTable 1: Model architecture.\\n\\nRolling Buffer Cache. A fixed attention span means that we can limit our cache size using a rolling buffer cache. The cache has a fixed size of $W$, and the keys and values for the timestep $i$ are stored in position $i \\\\bmod W$ of the cache. As a result, when the position $i$ is larger than $W$, past values in the cache are overwritten, and the size of the cache stops increasing. We provide an illustration in Figure 2 for $W=3$. On a sequence length of 32 k tokens, this reduces the cache memory usage by 8 x , without impacting the model quality.\\n\\n[^0]\\n[^0]: ${ }^{1}$ https://github.com/mistralai/mistral-src\\n ${ }^{2}$ https://github.com/skypilot-org/skypilot\\n ${ }^{3}$ https://huggingface.co/mistralai\",\n", " \"images\": [\n", " {\n", " \"id\": \"img-1.jpeg\",\n", " \"top_left_x\": 294,\n", " \"top_left_y\": 638,\n", " \"bottom_right_x\": 1405,\n", " \"bottom_right_y\": 1064,\n", " \"image_base64\": null,\n", " \"image_annotation\": \"{\\n \\\"image_type\\\": \\\"table\\\",\\n \\\"description\\\": \\\"This image compares two types of attention mechanisms used in natural language processing: Vanilla Attention and Sliding Window Attention. The left and center tables show the attention patterns for the sentence 'The cat sat on the'. In Vanilla Attention, each word attends to all previous words, resulting in a full lower triangular matrix. In Sliding Window Attention, each word attends to a fixed number of previous words, resulting in a banded lower triangular matrix. The right diagram illustrates how the effective context length varies with the number of layers and the window size in Sliding Window Attention.\\\"\\n}\"\n", " }\n", " ],\n", " \"dimensions\": {\n", " \"dpi\": 200,\n", " \"height\": 2200,\n", " \"width\": 1700\n", " }\n", " },\n", " {\n", " \"index\": 2,\n", " \"markdown\": \"\\n\\nFigure 2: Rolling buffer cache. The cache has a fixed size of $W=4$. Keys and values for position $i$ are stored in position $i \\\\bmod W$ of the cache. When the position $i$ is larger than $W$, past values in the cache are overwritten. The hidden state corresponding to the latest generated tokens are colored in orange.\\n\\nPre-fill and Chunking. When generating a sequence, we need to predict tokens one-by-one, as each token is conditioned on the previous ones. However, the prompt is known in advance, and we can pre-fill the $(k, v)$ cache with the prompt. If the prompt is very large, we can chunk it into smaller pieces, and pre-fill the cache with each chunk. For this purpose, we can select the window size as our chunk size. For each chunk, we thus need to compute the attention over the cache and over the chunk. Figure 3 shows how the attention mask works over both the cache and the chunk.\\n\\n\\n\\nFigure 3: Pre-fill and chunking. During pre-fill of the cache, long sequences are chunked to limit memory usage. We process a sequence in three chunks, \\\"The cat sat on\\\", \\\"the mat and saw\\\", \\\"the dog go to\\\". 
The figure shows what happens for the third chunk (\\\"the dog go to\\\"): it attends itself using a causal mask (rightmost block), attends the cache using a sliding window (center block), and does not attend to past tokens as they are outside of the sliding window (left block).\\n\\n## 3 Results\\n\\nWe compare Mistral 7B to Llama, and re-run all benchmarks with our own evaluation pipeline for fair comparison. We measure performance on a wide variety of tasks categorized as follow:\\n\\n- Commonsense Reasoning (0-shot): Hellaswag [28], Winogrande [21], PIQA [4], SIQA [22], OpenbookQA [19], ARC-Easy, ARC-Challenge [9], CommonsenseQA [24]\\n- World Knowledge (5-shot): NaturalQuestions [16], TriviaQA [15]\\n- Reading Comprehension (0-shot): BoolQ [8], QuAC [7]\\n- Math: GSM8K [10] (8-shot) with maj@8 and MATH [13] (4-shot) with maj@4\\n- Code: Humaneval [5] (0-shot) and MBPP [2] (3-shot)\\n- Popular aggregated results: MMLU [12] (5-shot), BBH [23] (3-shot), and AGI Eval [29] (3-5-shot, English multiple-choice questions only)\\n\\nDetailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2. Figure 4 compares the performance of Mistral 7B with Llama 2 7B/13B, and Llama $134 B^{4}$ in different categories. Mistral 7B surpasses Llama 2 13B across all metrics, and outperforms Llama 1 34B on most benchmarks. In particular, Mistral 7B displays a superior performance in code, mathematics, and reasoning benchmarks.\\n\\n[^0]\\n[^0]: ${ }^{4}$ Since Llama 2 34B was not open-sourced, we report results for Llama 1 34B.\",\n", " \"images\": [\n", " {\n", " \"id\": \"img-2.jpeg\",\n", " \"top_left_x\": 294,\n", " \"top_left_y\": 191,\n", " \"bottom_right_x\": 1405,\n", " \"bottom_right_y\": 380,\n", " \"image_base64\": null,\n", " \"image_annotation\": \"{\\n \\\"image_type\\\": \\\"table\\\",\\n \\\"description\\\": \\\"A table showing the progression of words in three sentences over three timesteps. Each word is highlighted in red at different timesteps, indicating the position of processing or focus in a sequential manner. The sentences are 'This is an example of ...', 'Mistral is a good ...', and 'The cat sat on the mat ...'.\\\"\\n}\"\n", " },\n", " {\n", " \"id\": \"img-3.jpeg\",\n", " \"top_left_x\": 464,\n", " \"top_left_y\": 741,\n", " \"bottom_right_x\": 1235,\n", " \"bottom_right_y\": 1064,\n", " \"image_base64\": null,\n", " \"image_annotation\": \"{\\n \\\"image_type\\\": \\\"table\\\",\\n \\\"description\\\": \\\"A table showing a matrix of word comparisons with past, cache, and current sections. The table is divided into three main sections: Past, Cache, and Current. The Past section is filled with zeros, indicating no matches. The Cache section shows some matches with ones, and the Current section has the most matches, indicating the most recent comparisons.\\\"\\n}\"\n", " }\n", " ],\n", " \"dimensions\": {\n", " \"dpi\": 200,\n", " \"height\": 2200,\n", " \"width\": 1700\n", " }\n", " },\n", " {\n", " \"index\": 3,\n", " \"markdown\": \"\\n\\nFigure 4: Performance of Mistral 7B and different Llama models on a wide range of benchmarks. All models were re-evaluated on all metrics with our evaluation pipeline for accurate comparison. Mistral 7B significantly outperforms Llama 2 7B and Llama 2 13B on all benchmarks. 
It is also vastly superior to Llama 1 34B in mathematics, code generation, and reasoning benchmarks.\\n\\n| Model | Modality | MMLU | HellaSwag | WinoG | PIQA | Arc-e | Arc-c | NQ | TriviaQA | HumanEval | MBPP | MATH | GSM8K |\\n| :-- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\\n| LLaMA 2 7B | Pretrained | $44.4 \\\\%$ | $77.1 \\\\%$ | $69.5 \\\\%$ | $77.9 \\\\%$ | $68.7 \\\\%$ | $43.2 \\\\%$ | $24.7 \\\\%$ | $63.8 \\\\%$ | $11.6 \\\\%$ | $26.1 \\\\%$ | $3.9 \\\\%$ | $16.0 \\\\%$ |\\n| LLaMA 2 13B | Pretrained | $55.6 \\\\%$ | $\\\\mathbf{8 0 . 7 \\\\%}$ | $72.9 \\\\%$ | $80.8 \\\\%$ | $75.2 \\\\%$ | $48.8 \\\\%$ | $\\\\mathbf{2 9 . 0 \\\\%}$ | $\\\\mathbf{6 9 . 6 \\\\%}$ | $18.9 \\\\%$ | $35.4 \\\\%$ | $6.0 \\\\%$ | $34.3 \\\\%$ |\\n| Code-Llama 7B | Finetuned | $36.9 \\\\%$ | $62.9 \\\\%$ | $62.3 \\\\%$ | $72.8 \\\\%$ | $59.4 \\\\%$ | $34.5 \\\\%$ | $11.0 \\\\%$ | $34.9 \\\\%$ | $\\\\mathbf{3 1 . 1 \\\\%}$ | $\\\\mathbf{5 2 . 5 \\\\%}$ | $5.2 \\\\%$ | $20.8 \\\\%$ |\\n| Mistral 7B | Pretrained | $\\\\mathbf{6 0 . 1 \\\\%}$ | $\\\\mathbf{8 1 . 3 \\\\%}$ | $\\\\mathbf{7 5 . 3 \\\\%}$ | $\\\\mathbf{8 3 . 0 \\\\%}$ | $\\\\mathbf{8 0 . 0 \\\\%}$ | $\\\\mathbf{5 5 . 5 \\\\%}$ | $\\\\mathbf{2 8 . 8 \\\\%}$ | $\\\\mathbf{6 9 . 9 \\\\%}$ | $\\\\mathbf{3 0 . 5 \\\\%}$ | $47.5 \\\\%$ | $\\\\mathbf{1 3 . 1 \\\\%}$ | $\\\\mathbf{5 2 . 2 \\\\%}$ |\\n\\nTable 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Llama 2 13B on all metrics, and approaches the code performance of Code-Llama 7B without sacrificing performance on non-code benchmarks.\\n\\nSize and Efficiency. We computed \\\"equivalent model sizes\\\" of the Llama 2 family, aiming to understand Mistral 7B models' efficiency in the cost-performance spectrum (see Figure 5). When evaluated on reasoning, comprehension, and STEM reasoning (specifically MMLU), Mistral 7B mirrored performance that one might expect from a Llama 2 model with more than 3x its size. On the Knowledge benchmarks, Mistral 7B's performance achieves a lower compression rate of 1.9 x , which is likely due to its limited parameter count that restricts the amount of knowledge it can store.\\n\\nEvaluation Differences. On some benchmarks, there are some differences between our evaluation protocol and the one reported in the Llama 2 paper: 1) on MBPP, we use the hand-verified subset 2) on TriviaQA, we do not provide Wikipedia contexts.\\n\\n## 4 Instruction Finetuning\\n\\nTo evaluate the generalization capabilities of Mistral 7B, we fine-tuned it on instruction datasets publicly available on the Hugging Face repository. No proprietary data or training tricks were utilized: Mistral 7B - Instruct model is a simple and preliminary demonstration that the base model can easily be fine-tuned to achieve good performance. In Table 3, we observe that the resulting model, Mistral 7B - Instruct, exhibits superior performance compared to all 7B models on MT-Bench, and is comparable to 13B - Chat models. An independent human evaluation was conducted on https://limboxing.com/leaderboard.\\n\\n| Model | Chatbot Arena <br> ELO Rating | MT Bench |\\n| :-- | :--: | :--: |\\n| WizardLM 13B v1.2 | 1047 | 7.2 |\\n| Mistral 7B Instruct | $\\\\mathbf{1 0 3 1}$ | $\\\\mathbf{6 . 8 4}$ +/- $\\\\mathbf{0 . 0 7}$ |\\n| Llama 2 13B Chat | 1012 | 6.65 |\\n| Vicuna 13B | 1041 | 6.57 |\\n| Llama 2 7B Chat | 985 | 6.27 |\\n| Vicuna 7B | 997 | 6.17 |\\n| Alpaca 13B | 914 | 4.53 |\\n\\nTable 3: Comparison of Chat models. 
Mistral 7B Instruct outperforms all 7B models on MT-Bench, and is comparable to 13B - Chat models.\\n\\nIn this evaluation, participants were provided with a set of questions along with anonymous responses from two models and were asked to select their preferred response, as illustrated in Figure 6. As of October 6, 2023, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143 times for Llama 2 13B.\",\n", " \"images\": [\n", " {\n", " \"id\": \"img-4.jpeg\",\n", " \"top_left_x\": 292,\n", " \"top_left_y\": 204,\n", " \"bottom_right_x\": 1390,\n", " \"bottom_right_y\": 552,\n", " \"image_base64\": null,\n", " \"image_annotation\": \"{\\n \\\"image_type\\\": \\\"graph\\\",\\n \\\"description\\\": \\\"This image contains two bar graphs comparing the performance of four different language models: Mistral 7B, LLaMA 2 13B, LLaMA 2 7B, and LLaMA 1 34B. The left graph shows accuracy percentages for categories: MMLU, Knowledge, Reasoning, and Comprehension. The right graph shows accuracy percentages for categories: AGI Eval, Math, BBH, and Code. Each bar represents the accuracy of a specific model in each category, with different colors indicating different models.\\\"\\n}\"\n", " }\n", " ],\n", " \"dimensions\": {\n", " \"dpi\": 200,\n", " \"height\": 2200,\n", " \"width\": 1700\n", " }\n", " },\n", " {\n", " \"index\": 4,\n", " \"markdown\": \"\\n\\nFigure 5: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension for Mistral 7B and Llama 2 (7B/13B/70B). Mistral 7B largely outperforms Llama 2 13B on all evaluations, except on knowledge benchmarks, where it is on par (this is likely due to its limited parameter count, which limits the amount of knowledge it can compress).\\n\\n# 5 Adding guardrails for front-facing applications \\n\\nThe ability to enforce guardrails when it comes to AI generation is important for front-facing applications. In this section, we highlight how to leverage system prompting to optionally enforce output constraints on top of our models. Additionally, we showcase the ability of Mistral 7B to perform fine-grained content moderation, which can be useful to enforce quality content in applications.\\n\\n### 5.1 System prompt to enforce guardrails\\n\\nWe introduce a system prompt (see below) to guide the model to generate answers within specified guardrails, similar to the work done with Llama 2. Using this prompt allows the user to move on the Pareto front of model utility / guardrails enforcement, as indicated in Table 4.\\n\\nAlways assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.\\n\\nWe use a set of 175 unsafe prompts for evaluating safety. With the recommended system prompt, the model properly declines to answer $100 \\\\%$ of the harmful questions.\\nAs an illustration, we provide in Table 5 the answers of both Mistral 7B - Instruct and Llama 2 Chat 13B to the question How to kill a linux process with system prompts activated. We observe that Mistral 7B provides a correct response while Llama 2 declines to answer. Note that on this specific question, both models answer correctly when system prompts are deactivated.\\n\\n| Guardrails | MT Bench |\\n| :-- | :--: |\\n| No system prompt | $6.84 \\\\pm 0.07$ |\\n| Llama 2 system prompt | $6.38 \\\\pm 0.07$ |\\n| Mistral system prompt | $6.58 \\\\pm 0.05$ |\\n\\nTable 4: System prompts. 
Mean official MT Bench score over 10 iterations with standard deviation for Mistral 7B - Instruct. For reference, Llama 2 13B - Chat reports official results of 6.65 .\\n\\n### 5.2 Content moderation with self-reflection\\n\\nMistral 7B - Instruct can be used as a content moderator: the model itself is able to accurately classify a user prompt or its generated answer as being either acceptable or falling into one of the following categories: Illegal activities such as terrorism, child abuse or fraud; Hateful, harassing or violent content such as discrimination, self-harm or bullying; Unqualified advice for instance in legal, medical or financial domains.\",\n", " \"images\": [\n", " {\n", " \"id\": \"img-5.jpeg\",\n", " \"top_left_x\": 464,\n", " \"top_left_y\": 202,\n", " \"bottom_right_x\": 1232,\n", " \"bottom_right_y\": 734,\n", " \"image_base64\": null,\n", " \"image_annotation\": \"{\\n \\\"image_type\\\": \\\"graph\\\",\\n \\\"description\\\": \\\"This image contains four line graphs comparing the performance of LLaMA 2 and Mistral models across different metrics as a function of model size (in billion parameters). Each graph represents a different evaluation metric: MMLU (Massive Multitask Language Understanding), Reasoning, Knowledge, and Comprehension. The x-axis of each graph shows the model size, while the y-axis shows the percentage performance. LLaMA 2 is represented by a red line with circular markers, and Mistral is represented by an orange square marker. Dashed vertical lines indicate the effective sizes of LLaMA models in terms of Mistral's performance, suggesting that smaller LLaMA models can achieve comparable performance to larger Mistral models.\\\"\\n}\"\n", " }\n", " ],\n", " \"dimensions\": {\n", " \"dpi\": 200,\n", " \"height\": 2200,\n", " \"width\": 1700\n", " }\n", " },\n", " {\n", " \"index\": 5,\n", " \"markdown\": \"| Model | Answer |\\n| :--: | :--: |\\n| Mistral 7B - Instruct with Mistral system prompt | To kill a Linux process, you can use the `kill' command followed by the process ID (PID) of the process you want to terminate. For example, to kill process with PID 1234, you would run the command `kill 1234`. It's important to note that killing a process can have unintended consequences, so it's generally a good idea to only kill processes that you are certain you want to terminate. Additionally, it's generally a good idea to use the 'kill' command with caution, as it can be used to terminate critical system processes that are necessary for the operation of the system. |\\n| Llama 2 13B - Chat with Llama 2 system prompt | As a helpful and respectful assistant, I cannot provide information on how to kill a Linux process. It is not appropriate or ethical to intentionally harm or terminate processes, as it can cause damage to the system and potentially result in data loss. Additionally, it is important to respect the autonomy and security of other users' processes, as they may be critical to their work or personal activities. Instead, I suggest that you explore other options to address any issues or concerns you may have. |\\n\\nTable 5: Comparison between Mistral and Llama system prompts. Mistral provides the right answer whereas Llama 2 declines to answer to the question How to kill a linux process.\\n\\nTo do so, we designed a self-reflection prompt that makes Mistral 7B classify a prompt or a generated answer. 
We evaluated self-reflection on our manually curated and balanced dataset of adversarial and standard prompts and got a precision of $99.4 \\\\%$ for a recall of $95.6 \\\\%$ (considering acceptable prompts as positives).\\n\\nThe use cases are vast, from moderating comments on social media or forums to brand monitoring on the internet. In particular, the end user is able to select afterwards which categories to effectively filter based on their particular use-case.\\n\\n# 6 Conclusion \\n\\nOur work on Mistral 7B demonstrates that language models may compress knowledge more than what was previously thought. This opens up interesting perspectives: the field has so far put the emphasis on scaling laws in 2 dimensions (directly associating model capabilities to training cost, as in [14]); the problem is rather 3 dimensional (model capabilities, training cost, inference cost), and much remains to be explored to obtain the best performance with the smallest possible model.\\n\\n## Acknowledgements\\n\\nWe are grateful to CoreWeave for their 24/7 help in marshalling our cluster. We thank the CINECA/EuroHPC team, and in particular the operators of Leonardo, for their resources and help. We thank the maintainers of FlashAttention, vLLM, xFormers, Skypilot for their precious assistance in implementing new features and integrating their solutions into ours. A huge thanks to Tri Dao and Daniel Haziza for helping include Mistral related changes to FlashAttention and xFormers on a tight schedule. We thank the teams of Hugging Face, AWS, GCP, Azure ML for their intense help in making our model compatible everywhere.\",\n", " \"images\": [],\n", " \"dimensions\": {\n", " \"dpi\": 200,\n", " \"height\": 2200,\n", " \"width\": 1700\n", " }\n", " },\n", " {\n", " \"index\": 6,\n", " \"markdown\": \"# Can you recommend a foundational and accessible book about quantum physics? \\n\\nCapy Link to Question Read Round (Enter) $\\\\rightarrow$\\n\\nRound Winner: mistral-7b-instruct-v0.1\\n\\n\\n## mistral-7b-instruct-v0.1 R Round\\n\\n Winner\\nA great book for a foundational understanding of quantum physics is \\\"The Quantum Universe: Everything that Can Happen Does Happen\\\" by Brian Cox and Jeff Forshaw. This book is accessible and easy to understand, making it a great choice for beginners. It covers all the basics of quantum mechanics, including superposition, entanglement, and the uncertainty principle. The authors also explore the implications of quantum physics for our understanding of the universe, including the nature of time and space. Overall, \\\"The Quantum Universe\\\" is an excellent choice for anyone looking to gain a foundational understanding of this fascinating field\\n\\n## llama-2-13b-chat\\n\\nCertainly! For a foundational and accessible introduction to quantum physics, I highly recommend \\\"The Feynman Lectures on Physics\\\" by Richard Feynman. This classic text provides a clear and engaging overview of the principles and concepts of quantum mechanics, with minimal mathematical formalism and plenty of illustrations and examples. It's a great starting point for anyone looking to gain a deeper understanding of quantum physics without getting bogged down in too much technical detail.\\n\\nFigure 6: Human evaluation of Mistral 7B - Instruct vs Llama 2 13B - Chat Example. An example of human evaluation from libboxing.com. The question asks for recommendations of books in quantum physics. 
Llama 2 13B - Chat recommends a general physics book, while Mistral 7B - Instruct recommends a more relevant book on quantum physics and describes in the contents in more detail.\",\n", " \"images\": [\n", " {\n", " \"id\": \"img-6.jpeg\",\n", " \"top_left_x\": 727,\n", " \"top_left_y\": 792,\n", " \"bottom_right_x\": 972,\n", " \"bottom_right_y\": 1047,\n", " \"image_base64\": null,\n", " \"image_annotation\": \"{\\n \\\"image_type\\\": \\\"image\\\",\\n \\\"description\\\": \\\"A 3D rendering of the letter 'M' wearing boxing gloves, giving it a playful and dynamic appearance.\\\"\\n}\"\n", " }\n", " ],\n", " \"dimensions\": {\n", " \"dpi\": 200,\n", " \"height\": 2200,\n", " \"width\": 1700\n", " }\n", " },\n", " {\n", " \"index\": 7,\n", " \"markdown\": \"# References \\n\\n[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr\\u00f3n, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.\\n[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.\\n[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.\\n[4] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020.\\n[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.\\n[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.\\n[7] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018.\\n[8] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.\\n[9] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.\\n[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.\\n[11] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\\u00e9. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.\\n[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.\\n[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. 
arXiv preprint arXiv:2103.03874, 2021.\\n[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Kar\\u00e9n Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems, volume 35, 2022.\\n[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.\\n[16] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466, 2019.\",\n", " \"images\": [],\n", " \"dimensions\": {\n", " \"dpi\": 200,\n", " \"height\": 2200,\n", " \"width\": 1700\n", " }\n", " }\n", " ],\n", " \"model\": \"mistral-ocr-2503-completion\",\n", " \"usage_info\": {\n", " \"pages_processed\": 8,\n", " \"doc_size_bytes\": 3749788\n", " },\n", " \"document_annotation\": \"{\\n \\\"language\\\": \\\"en\\\",\\n \\\"summary\\\": \\\"The document introduces Mistral 7B, a 7-billion-parameter language model designed for superior performance and efficiency. It outperforms the best open 13B model (Llama 2) across all evaluated benchmarks and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. The model uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle sequences of arbitrary length with reduced inference cost. Mistral 7B is released under the Apache 2.0 license and includes a fine-tuned version, Mistral 7B - Instruct, which surpasses Llama 2 13B - Chat model on human and automated benchmarks. The document also discusses the architectural details, performance comparisons, and the efficiency of Mistral 7B.\\\",\\n \\\"authors\\\": [\\n \\\"Albert Q. 
Jiang\\\",\\n \\\"Alexandre Sablayrolles\\\",\\n \\\"Arthur Mensch\\\",\\n \\\"Chris Bamford\\\",\\n \\\"Devendra Singh Chaplot\\\",\\n \\\"Diego de las Casas\\\",\\n \\\"Florian Bressand\\\",\\n \\\"Gianna Lengyel\\\",\\n \\\"Guillaume Lample\\\",\\n \\\"Lucile Saunier\\\",\\n \\\"Lelio Renard Lavaud\\\",\\n \\\"Marie-Anne Lachaux\\\",\\n \\\"Pierre Stock\\\",\\n \\\"Teven Le Scao\\\",\\n \\\"Thibaut Lavril\\\",\\n \\\"Thomas Wang\\\",\\n \\\"Timoth\\u00e9e Lacroix\\\",\\n \\\"William El Sayed\\\"\\n ]\\n}\"\n", "}\n" ] } ], "source": [ "from mistralai.extra import response_format_from_pydantic_model\n", "\n", "# OCR Call with Annotations\n", "annotations_response = client.ocr.process(\n", " model=\"mistral-ocr-latest\",\n", " pages=list(range(8)), # Document Annotation has a limit of 8 pages; we recommend splitting your documents when using it. BBox Annotation does not have the same limit\n", " document={\n", " \"type\": \"document_url\",\n", " \"document_url\": f\"data:application/pdf;base64,{base64_pdf}\"\n", " },\n", " bbox_annotation_format=response_format_from_pydantic_model(Image),\n", " document_annotation_format=response_format_from_pydantic_model(Document),\n", " include_image_base64=False # We are not interested in retrieving the bbox images in this example, only their annotations\n", " )\n", "\n", "# Convert response to JSON format\n", "response_dict = json.loads(annotations_response.model_dump_json())\n", "\n", "print(json.dumps(response_dict, indent=4))" ] }, { "cell_type": "markdown", "metadata": { "id": "M0tVl3wQyaLX" }, "source": [ "Let's print the Annotations only!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EeaVt91PyaLX", "outputId": "bd182953-c976-4749-a5cf-9f27c2a18fbe" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Document Annotation:\n", " {\n", " \"language\": \"en\",\n", " \"summary\": \"The document introduces Mistral 7B, a 7-billion-parameter language model designed for superior performance and efficiency. It outperforms the best open 13B model (Llama 2) across all evaluated benchmarks and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. The model uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle sequences of arbitrary length with reduced inference cost. Mistral 7B is released under the Apache 2.0 license and includes a fine-tuned version, Mistral 7B - Instruct, which surpasses Llama 2 13B - Chat model on human and automated benchmarks. The document also discusses the architectural details, performance comparisons, and the efficiency of Mistral 7B.\",\n", " \"authors\": [\n", " \"Albert Q. 
Jiang\",\n", " \"Alexandre Sablayrolles\",\n", " \"Arthur Mensch\",\n", " \"Chris Bamford\",\n", " \"Devendra Singh Chaplot\",\n", " \"Diego de las Casas\",\n", " \"Florian Bressand\",\n", " \"Gianna Lengyel\",\n", " \"Guillaume Lample\",\n", " \"Lucile Saunier\",\n", " \"Lelio Renard Lavaud\",\n", " \"Marie-Anne Lachaux\",\n", " \"Pierre Stock\",\n", " \"Teven Le Scao\",\n", " \"Thibaut Lavril\",\n", " \"Thomas Wang\",\n", " \"Timothée Lacroix\",\n", " \"William El Sayed\"\n", " ]\n", "}\n", "\n", "BBox/Images:\n", "\n", "Image img-0.jpeg\n", "Location:\n", " - top_left_x: 425\n", " - top_left_y: 598\n", " - bottom_right_x: 1283\n", " - bottom_right_y: 893\n", "BBox/Image Annotation:\n", " {\n", " \"image_type\": \"image\",\n", " \"description\": \"A 3D rendering of the text 'Mistral AI' in a gradient of warm colors, transitioning from orange to brown.\"\n", "}\n", "\n", "Image img-1.jpeg\n", "Location:\n", " - top_left_x: 294\n", " - top_left_y: 638\n", " - bottom_right_x: 1405\n", " - bottom_right_y: 1064\n", "BBox/Image Annotation:\n", " {\n", " \"image_type\": \"table\",\n", " \"description\": \"This image compares two types of attention mechanisms used in natural language processing: Vanilla Attention and Sliding Window Attention. The left and center tables show the attention patterns for the sentence 'The cat sat on the'. In Vanilla Attention, each word attends to all previous words, resulting in a full lower triangular matrix. In Sliding Window Attention, each word attends to a fixed number of previous words, resulting in a banded lower triangular matrix. The right diagram illustrates how the effective context length varies with the number of layers and the window size in Sliding Window Attention.\"\n", "}\n", "\n", "Image img-2.jpeg\n", "Location:\n", " - top_left_x: 294\n", " - top_left_y: 191\n", " - bottom_right_x: 1405\n", " - bottom_right_y: 380\n", "BBox/Image Annotation:\n", " {\n", " \"image_type\": \"table\",\n", " \"description\": \"A table showing the progression of words in three sentences over three timesteps. Each word is highlighted in red at different timesteps, indicating the position of processing or focus in a sequential manner. The sentences are 'This is an example of ...', 'Mistral is a good ...', and 'The cat sat on the mat ...'.\"\n", "}\n", "\n", "Image img-3.jpeg\n", "Location:\n", " - top_left_x: 464\n", " - top_left_y: 741\n", " - bottom_right_x: 1235\n", " - bottom_right_y: 1064\n", "BBox/Image Annotation:\n", " {\n", " \"image_type\": \"table\",\n", " \"description\": \"A table showing a matrix of word comparisons with past, cache, and current sections. The table is divided into three main sections: Past, Cache, and Current. The Past section is filled with zeros, indicating no matches. The Cache section shows some matches with ones, and the Current section has the most matches, indicating the most recent comparisons.\"\n", "}\n", "\n", "Image img-4.jpeg\n", "Location:\n", " - top_left_x: 292\n", " - top_left_y: 204\n", " - bottom_right_x: 1390\n", " - bottom_right_y: 552\n", "BBox/Image Annotation:\n", " {\n", " \"image_type\": \"graph\",\n", " \"description\": \"This image contains two bar graphs comparing the performance of four different language models: Mistral 7B, LLaMA 2 13B, LLaMA 2 7B, and LLaMA 1 34B. The left graph shows accuracy percentages for categories: MMLU, Knowledge, Reasoning, and Comprehension. The right graph shows accuracy percentages for categories: AGI Eval, Math, BBH, and Code. 
Each bar represents the accuracy of a specific model in each category, with different colors indicating different models.\"\n", "}\n", "\n", "Image img-5.jpeg\n", "Location:\n", " - top_left_x: 464\n", " - top_left_y: 202\n", " - bottom_right_x: 1232\n", " - bottom_right_y: 734\n", "BBox/Image Annotation:\n", " {\n", " \"image_type\": \"graph\",\n", " \"description\": \"This image contains four line graphs comparing the performance of LLaMA 2 and Mistral models across different metrics as a function of model size (in billion parameters). Each graph represents a different evaluation metric: MMLU (Massive Multitask Language Understanding), Reasoning, Knowledge, and Comprehension. The x-axis of each graph shows the model size, while the y-axis shows the percentage performance. LLaMA 2 is represented by a red line with circular markers, and Mistral is represented by an orange square marker. Dashed vertical lines indicate the effective sizes of LLaMA models in terms of Mistral's performance, suggesting that smaller LLaMA models can achieve comparable performance to larger Mistral models.\"\n", "}\n", "\n", "Image img-6.jpeg\n", "Location:\n", " - top_left_x: 727\n", " - top_left_y: 792\n", " - bottom_right_x: 972\n", " - bottom_right_y: 1047\n", "BBox/Image Annotation:\n", " {\n", " \"image_type\": \"image\",\n", " \"description\": \"A 3D rendering of the letter 'M' wearing boxing gloves, giving it a playful and dynamic appearance.\"\n", "}\n" ] } ], "source": [ "print(\"Document Annotation:\\n\", annotations_response.document_annotation)\n", "print(\"\\nBBox/Images:\")\n", "for page in annotations_response.pages:\n", " for image in page.images:\n", " print(\"\\nImage\", image.id)\n", " print(\"Location:\")\n", " print(\" - top_left_x:\", image.top_left_x)\n", " print(\" - top_left_y:\", image.top_left_y)\n", " print(\" - bottom_right_x:\", image.bottom_right_x)\n", " print(\" - bottom_right_y:\", image.bottom_right_y)\n", " print(\"BBox/Image Annotation:\\n\", image.image_annotation)" ] }, { "cell_type": "markdown", "metadata": { "id": "thKVdfrJ2fvB" }, "source": [ "## Full Document with Annotation\n", "For reference, let's do the same, but this time including the bbox images." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HDNDQCWG3ssl" }, "outputs": [], "source": [ "# OCR Call with Annotations\n", "annotations_response = client.ocr.process(\n", " model=\"mistral-ocr-latest\",\n", " pages=list(range(8)), # Document Annotation has a limit of 8 pages; we recommend splitting your documents when using it. BBox Annotation does not have the same limit\n", " document={\n", " \"type\": \"document_url\",\n", " \"document_url\": f\"data:application/pdf;base64,{base64_pdf}\"\n", " },\n", " bbox_annotation_format=response_format_from_pydantic_model(Image),\n", " document_annotation_format=response_format_from_pydantic_model(Document),\n", " include_image_base64=True\n", " )" ] }, { "cell_type": "markdown", "metadata": { "id": "Kucqn_et7ZDH" }, "source": [ "Now, we will display the full document with the OCR content and the annotations in bold:\n", "- Document Annotation at the start of the document.\n", "- BBox Annotation below each extracted bbox/image."
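] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before rendering the combined document below, note that both annotation types are returned as JSON strings. If you want typed access to the extracted fields, you can parse them back into the `Document` and `Image` Pydantic models defined earlier. The next cell is an optional, minimal sketch: it assumes Pydantic v2 (`model_validate_json`) and the model definitions from earlier in this notebook; `parsed_document` and `parsed_images` are just illustrative names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: parse the annotation JSON strings back into the Pydantic models defined earlier.\n", "# Assumes Pydantic v2 (model_validate_json) and the Document / Image classes from the cells above.\n", "parsed_document = Document.model_validate_json(annotations_response.document_annotation)\n", "parsed_images = [\n", " Image.model_validate_json(img.image_annotation)\n", " for page in annotations_response.pages\n", " for img in page.images\n", "]\n", "\n", "print(parsed_document)\n", "print(len(parsed_images), \"bbox annotations parsed\")"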
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7fpHs59h2jma", "outputId": "15628401-d44f-44ef-9c4b-249a15b2f1ce" }, "outputs": [ { "data": { "text/markdown": [ "**{\n", " \"language\": \"en\",\n", " \"summary\": \"The document introduces Mistral 7B, a 7-billion-parameter language model designed for superior performance and efficiency. It outperforms the best open 13B model (Llama 2) across all evaluated benchmarks and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. The model uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle sequences of arbitrary length with reduced inference cost. Mistral 7B is released under the Apache 2.0 license and includes a fine-tuned version, Mistral 7B - Instruct, which surpasses Llama 2 13B - chat model on human and automated benchmarks. The document also discusses the model's architectural details, performance on various benchmarks, and its efficiency in balancing performance and computational costs.\",\n", " \"authors\": [\n", " \"Albert Q. Jiang\",\n", " \"Alexandre Sablayrolles\",\n", " \"Arthur Mensch\",\n", " \"Chris Bamford\",\n", " \"Devendra Singh Chaplot\",\n", " \"Diego de las Casas\",\n", " \"Florian Bressand\",\n", " \"Gianna Lengyel\",\n", " \"Guillaume Lample\",\n", " \"Lucile Saunier\",\n", " \"Lelio Renard Lavaud\",\n", " \"Marie-Anne Lachaux\",\n", " \"Pierre Stock\",\n", " \"Teven Le Scao\",\n", " \"Thibaut Lavril\",\n", " \"Thomas Wang\",\n", " \"Timothée Lacroix\",\n", " \"William El Sayed\"\n", " ]\n", "}**\n", "\n", "# Mistral 7B \n", "\n", "Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed\n", "\n", "\n", "\n", "**{\n", " \"image_type\": \"image\",\n", " \"description\": \"A 3D rendering of the text 'Mistral AI' in a gradient of warm colors, transitioning from orange to brown.\"\n", "}**\n", "\n", "\n", "#### Abstract\n", "\n", "We introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/\n", "\n", "\n", "## 1 Introduction\n", "\n", "In the rapidly evolving domain of Natural Language Processing (NLP), the race towards higher model performance often necessitates an escalation in model size. However, this scaling tends to increase computational costs and inference latency, thereby raising barriers to deployment in practical, real-world scenarios. In this context, the search for balanced models delivering both high-level performance and efficiency becomes critically essential. 
Our model, Mistral 7B, demonstrates that a carefully designed language model can deliver high performance while maintaining an efficient inference. Mistral 7B outperforms the previous best 13B model (Llama 2, [26]) across all tested benchmarks, and surpasses the best 34B model (LLaMa 34B, [25]) in mathematics and code generation. Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [20], without sacrificing performance on non-code related benchmarks.\n", "\n", "Mistral 7B leverages grouped-query attention (GQA) [1], and sliding window attention (SWA) [6, 3]. GQA significantly accelerates the inference speed, and also reduces the memory requirement during decoding, allowing for higher batch sizes hence higher throughput, a crucial factor for real-time applications. In addition, SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.\n", "\n", "Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference implementation ${ }^{1}$ facilitating easy deployment either locally or on cloud platforms such as AWS, GCP, or Azure using the vLLM [17] inference server and SkyPilot ${ }^{2}$. Integration with Hugging Face ${ }^{3}$ is also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B - Chat model.\n", "\n", "Mistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. Through our work, our aim is to help the community create more affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications.\n", "\n", "# 2 Architectural details \n", "\n", "\n", "\n", "**{\n", " \"image_type\": \"table\",\n", " \"description\": \"The image compares two types of attention mechanisms used in natural language processing: Vanilla Attention and Sliding Window Attention. It also illustrates the concept of effective context length in these mechanisms. The leftmost table shows the Vanilla Attention mechanism, where each word in the sentence 'The cat sat on the' attends to all previous words, resulting in a full lower triangular matrix of 1s. The middle table depicts the Sliding Window Attention mechanism, which limits the context to a fixed window size, resulting in a banded lower triangular matrix. The rightmost diagram visualizes the effective context length across different layers, showing how the window size affects the number of tokens considered at each layer. The color gradient from yellow to red indicates the strength of attention, with yellow representing stronger attention.\"\n", "}**\n", "\n", "Figure 1: Sliding Window Attention. The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability. To alleviate this issue, we use sliding window attention: each token can attend to at most $W$ tokens from the previous layer (here, $W=3$ ). Note that tokens outside the sliding window still influence next word prediction. 
At each attention layer, information can move forward by $W$ tokens. Hence, after $k$ attention layers, information can move forward by up to $k \\times W$ tokens.\n", "\n", "Mistral 7B is based on a transformer architecture [27]. The main parameters of the architecture are summarized in Table 1. Compared to Llama, it introduces a few changes that we summarize below.\n", "Sliding Window Attention. SWA exploits the stacked layers of a transformer to attend information beyond the window size $W$. The hidden state in position $i$ of the layer $k, h_{i}$, attends to all hidden states from the previous layer with positions between $i-W$ and $i$. Recursively, $h_{i}$ can access tokens from the input layer at a distance of up to $W \\times k$ tokens, as illustrated in Figure 1. At the last layer, using a window size of $W=4096$, we have a theoretical attention span of approximately $131 K$ tokens. In practice, for a sequence length of 16 K and $W=4096$, changes made to FlashAttention [11] and xFormers [18] yield a 2x speed improvement over a vanilla attention baseline.\n", "\n", "| Parameter | Value |\n", "| :-- | --: |\n", "| dim | 4096 |\n", "| n_layers | 32 |\n", "| head_dim | 128 |\n", "| hidden_dim | 14336 |\n", "| n_heads | 32 |\n", "| n_kv_heads | 8 |\n", "| window_size | 4096 |\n", "| context_len | 8192 |\n", "| vocab_size | 32000 |\n", "\n", "Table 1: Model architecture.\n", "\n", "Rolling Buffer Cache. A fixed attention span means that we can limit our cache size using a rolling buffer cache. The cache has a fixed size of $W$, and the keys and values for the timestep $i$ are stored in position $i \\bmod W$ of the cache. As a result, when the position $i$ is larger than $W$, past values in the cache are overwritten, and the size of the cache stops increasing. We provide an illustration in Figure 2 for $W=3$. On a sequence length of 32 k tokens, this reduces the cache memory usage by 8 x , without impacting the model quality.\n", "\n", "[^0]\n", "[^0]: ${ }^{1}$ https://github.com/mistralai/mistral-src\n", " ${ }^{2}$ https://github.com/skypilot-org/skypilot\n", " ${ }^{3}$ https://huggingface.co/mistralai\n", "\n", "\n", "\n", "**{\n", " \"image_type\": \"table\",\n", " \"description\": \"A table showing the progression of words in three sentences over three timesteps. Each word is highlighted in red at different timesteps, indicating the position of processing or focus in a sequential manner. The sentences are 'This is an example of ...', 'Mistral is a good ...', and 'The cat sat on the mat ...'.\"\n", "}**\n", "\n", "Figure 2: Rolling buffer cache. The cache has a fixed size of $W=4$. Keys and values for position $i$ are stored in position $i \\bmod W$ of the cache. When the position $i$ is larger than $W$, past values in the cache are overwritten. The hidden state corresponding to the latest generated tokens are colored in orange.\n", "\n", "Pre-fill and Chunking. When generating a sequence, we need to predict tokens one-by-one, as each token is conditioned on the previous ones. However, the prompt is known in advance, and we can pre-fill the $(k, v)$ cache with the prompt. If the prompt is very large, we can chunk it into smaller pieces, and pre-fill the cache with each chunk. For this purpose, we can select the window size as our chunk size. For each chunk, we thus need to compute the attention over the cache and over the chunk. 
Figure 3 shows how the attention mask works over both the cache and the chunk.\n", "\n", "\n", "\n", "**{\n", " \"image_type\": \"table\",\n", " \"description\": \"A table showing a matrix of word comparisons with past, cache, and current sections. The table is divided into three sections: Past, Cache, and Current. The Past section is filled with zeros, the Cache section contains a mix of zeros and ones, and the Current section is primarily ones. The table compares words such as 'the', 'dog', 'go', and 'to' against a sentence 'The cat sat on the mat and saw the dog go to'.\"\n", "}**\n", "\n", "Figure 3: Pre-fill and chunking. During pre-fill of the cache, long sequences are chunked to limit memory usage. We process a sequence in three chunks, \"The cat sat on\", \"the mat and saw\", \"the dog go to\". The figure shows what happens for the third chunk (\"the dog go to\"): it attends itself using a causal mask (rightmost block), attends the cache using a sliding window (center block), and does not attend to past tokens as they are outside of the sliding window (left block).\n", "\n", "## 3 Results\n", "\n", "We compare Mistral 7B to Llama, and re-run all benchmarks with our own evaluation pipeline for fair comparison. We measure performance on a wide variety of tasks categorized as follow:\n", "\n", "- Commonsense Reasoning (0-shot): Hellaswag [28], Winogrande [21], PIQA [4], SIQA [22], OpenbookQA [19], ARC-Easy, ARC-Challenge [9], CommonsenseQA [24]\n", "- World Knowledge (5-shot): NaturalQuestions [16], TriviaQA [15]\n", "- Reading Comprehension (0-shot): BoolQ [8], QuAC [7]\n", "- Math: GSM8K [10] (8-shot) with maj@8 and MATH [13] (4-shot) with maj@4\n", "- Code: Humaneval [5] (0-shot) and MBPP [2] (3-shot)\n", "- Popular aggregated results: MMLU [12] (5-shot), BBH [23] (3-shot), and AGI Eval [29] (3-5-shot, English multiple-choice questions only)\n", "\n", "Detailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2. Figure 4 compares the performance of Mistral 7B with Llama 2 7B/13B, and Llama $134 B^{4}$ in different categories. Mistral 7B surpasses Llama 2 13B across all metrics, and outperforms Llama 1 34B on most benchmarks. In particular, Mistral 7B displays a superior performance in code, mathematics, and reasoning benchmarks.\n", "\n", "[^0]\n", "[^0]: ${ }^{4}$ Since Llama 2 34B was not open-sourced, we report results for Llama 1 34B.\n", "\n", "\n", "\n", "**{\n", " \"image_type\": \"graph\",\n", " \"description\": \"This image contains two bar graphs comparing the performance of four different language models: Mistral 7B, LLaMA 2 13B, LLaMA 2 7B, and LLaMA 1 34B. The left graph shows accuracy percentages for categories: MMLU, Knowledge, Reasoning, and Comprehension. The right graph shows accuracy percentages for categories: AGI Eval, Math, BBH, and Code. Each bar represents the accuracy of a specific model in each category, with different colors indicating different models.\"\n", "}**\n", "\n", "Figure 4: Performance of Mistral 7B and different Llama models on a wide range of benchmarks. All models were re-evaluated on all metrics with our evaluation pipeline for accurate comparison. Mistral 7B significantly outperforms Llama 2 7B and Llama 2 13B on all benchmarks. 
It is also vastly superior to Llama 1 34B in mathematics, code generation, and reasoning benchmarks.\n", "\n", "| Model | Modality | MMLU | HellaSwag | WinoG | PIQA | Arc-e | Arc-c | NQ | TriviaQA | HumanEval | MBPP | MATH | GSM8K |\n", "| :-- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n", "| LLaMA 2 7B | Pretrained | $44.4 \\%$ | $77.1 \\%$ | $69.5 \\%$ | $77.9 \\%$ | $68.7 \\%$ | $43.2 \\%$ | $24.7 \\%$ | $63.8 \\%$ | $11.6 \\%$ | $26.1 \\%$ | $3.9 \\%$ | $16.0 \\%$ |\n", "| LLaMA 2 13B | Pretrained | $55.6 \\%$ | $\\mathbf{8 0 . 7 \\%}$ | $72.9 \\%$ | $80.8 \\%$ | $75.2 \\%$ | $48.8 \\%$ | $\\mathbf{2 9 . 0 \\%}$ | $\\mathbf{6 9 . 6 \\%}$ | $18.9 \\%$ | $35.4 \\%$ | $6.0 \\%$ | $34.3 \\%$ |\n", "| Code-Llama 7B | Finetuned | $36.9 \\%$ | $62.9 \\%$ | $62.3 \\%$ | $72.8 \\%$ | $59.4 \\%$ | $34.5 \\%$ | $11.0 \\%$ | $34.9 \\%$ | $\\mathbf{3 1 . 1 \\%}$ | $\\mathbf{5 2 . 5 \\%}$ | $5.2 \\%$ | $20.8 \\%$ |\n", "| Mistral 7B | Pretrained | $\\mathbf{6 0 . 1 \\%}$ | $\\mathbf{8 1 . 3 \\%}$ | $\\mathbf{7 5 . 3 \\%}$ | $\\mathbf{8 3 . 0 \\%}$ | $\\mathbf{8 0 . 0 \\%}$ | $\\mathbf{5 5 . 5 \\%}$ | $\\mathbf{2 8 . 8 \\%}$ | $\\mathbf{6 9 . 9 \\%}$ | $\\mathbf{3 0 . 5 \\%}$ | $47.5 \\%$ | $\\mathbf{1 3 . 1 \\%}$ | $\\mathbf{5 2 . 2 \\%}$ |\n", "\n", "Table 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Llama 2 13B on all metrics, and approaches the code performance of Code-Llama 7B without sacrificing performance on non-code benchmarks.\n", "\n", "Size and Efficiency. We computed \"equivalent model sizes\" of the Llama 2 family, aiming to understand Mistral 7B models' efficiency in the cost-performance spectrum (see Figure 5). When evaluated on reasoning, comprehension, and STEM reasoning (specifically MMLU), Mistral 7B mirrored performance that one might expect from a Llama 2 model with more than 3x its size. On the Knowledge benchmarks, Mistral 7B's performance achieves a lower compression rate of 1.9 x , which is likely due to its limited parameter count that restricts the amount of knowledge it can store.\n", "\n", "Evaluation Differences. On some benchmarks, there are some differences between our evaluation protocol and the one reported in the Llama 2 paper: 1) on MBPP, we use the hand-verified subset 2) on TriviaQA, we do not provide Wikipedia contexts.\n", "\n", "## 4 Instruction Finetuning\n", "\n", "To evaluate the generalization capabilities of Mistral 7B, we fine-tuned it on instruction datasets publicly available on the Hugging Face repository. No proprietary data or training tricks were utilized: Mistral 7B - Instruct model is a simple and preliminary demonstration that the base model can easily be fine-tuned to achieve good performance. In Table 3, we observe that the resulting model, Mistral 7B - Instruct, exhibits superior performance compared to all 7B models on MT-Bench, and is comparable to 13B - Chat models. An independent human evaluation was conducted on https://limboxing.com/leaderboard.\n", "\n", "| Model | Chatbot Arena <br> ELO Rating | MT Bench |\n", "| :-- | :--: | :--: |\n", "| WizardLM 13B v1.2 | 1047 | 7.2 |\n", "| Mistral 7B Instruct | $\\mathbf{1 0 3 1}$ | $\\mathbf{6 . 8 4}$ +/- $\\mathbf{0 . 0 7}$ |\n", "| Llama 2 13B Chat | 1012 | 6.65 |\n", "| Vicuna 13B | 1041 | 6.57 |\n", "| Llama 2 7B Chat | 985 | 6.27 |\n", "| Vicuna 7B | 997 | 6.17 |\n", "| Alpaca 13B | 914 | 4.53 |\n", "\n", "Table 3: Comparison of Chat models. 
Mistral 7B Instruct outperforms all 7B models on MT-Bench, and is comparable to 13B - Chat models.\n", "\n", "In this evaluation, participants were provided with a set of questions along with anonymous responses from two models and were asked to select their preferred response, as illustrated in Figure 6. As of October 6, 2023, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143 times for Llama 2 13B.\n", "\n", "\n", "\n", "**{\n", " \"image_type\": \"graph\",\n", " \"description\": \"This image contains four line graphs comparing the performance of LLaMA 2 and Mistral models across different metrics as a function of model size (in billion parameters). Each graph represents a different metric: MMLU (percentage), Reasoning (percentage), Knowledge (percentage), and Comprehension (percentage). The x-axis of each graph shows the model size ranging from 7 to 70 billion parameters, while the y-axis shows the percentage performance. LLaMA 2 is represented by a red line with circular markers, and Mistral is represented by an orange square marker. The graphs indicate that LLaMA 2 generally performs better as the model size increases, with specific effective sizes noted for each metric: 23B (3.3x) for MMLU, 38B (5.4x) for Reasoning, 13B (1.9x) for Knowledge, and 21B (3x) for Comprehension. Mistral's performance is shown at a single point for each metric.\"\n", "}**\n", "\n", "Figure 5: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension for Mistral 7B and Llama 2 (7B/13B/70B). Mistral 7B largely outperforms Llama 2 13B on all evaluations, except on knowledge benchmarks, where it is on par (this is likely due to its limited parameter count, which limits the amount of knowledge it can compress).\n", "\n", "# 5 Adding guardrails for front-facing applications \n", "\n", "The ability to enforce guardrails when it comes to AI generation is important for front-facing applications. In this section, we highlight how to leverage system prompting to optionally enforce output constraints on top of our models. Additionally, we showcase the ability of Mistral 7B to perform fine-grained content moderation, which can be useful to enforce quality content in applications.\n", "\n", "### 5.1 System prompt to enforce guardrails\n", "\n", "We introduce a system prompt (see below) to guide the model to generate answers within specified guardrails, similar to the work done with Llama 2. Using this prompt allows the user to move on the Pareto front of model utility / guardrails enforcement, as indicated in Table 4.\n", "\n", "Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.\n", "\n", "We use a set of 175 unsafe prompts for evaluating safety. With the recommended system prompt, the model properly declines to answer $100 \\%$ of the harmful questions.\n", "As an illustration, we provide in Table 5 the answers of both Mistral 7B - Instruct and Llama 2 Chat 13B to the question How to kill a linux process with system prompts activated. We observe that Mistral 7B provides a correct response while Llama 2 declines to answer. 
Note that on this specific question, both models answer correctly when system prompts are deactivated.\n", "\n", "| Guardrails | MT Bench |\n", "| :-- | :--: |\n", "| No system prompt | $6.84 \\pm 0.07$ |\n", "| Llama 2 system prompt | $6.38 \\pm 0.07$ |\n", "| Mistral system prompt | $6.58 \\pm 0.05$ |\n", "\n", "Table 4: System prompts. Mean official MT Bench score over 10 iterations with standard deviation for Mistral 7B - Instruct. For reference, Llama 2 13B - Chat reports official results of 6.65 .\n", "\n", "### 5.2 Content moderation with self-reflection\n", "\n", "Mistral 7B - Instruct can be used as a content moderator: the model itself is able to accurately classify a user prompt or its generated answer as being either acceptable or falling into one of the following categories: Illegal activities such as terrorism, child abuse or fraud; Hateful, harassing or violent content such as discrimination, self-harm or bullying; Unqualified advice for instance in legal, medical or financial domains.\n", "\n", "| Model | Answer |\n", "| :--: | :--: |\n", "| Mistral 7B - Instruct with Mistral system prompt | To kill a Linux process, you can use the `kill' command followed by the process ID (PID) of the process you want to terminate. For example, to kill process with PID 1234, you would run the command `kill 1234`. It's important to note that killing a process can have unintended consequences, so it's generally a good idea to only kill processes that you are certain you want to terminate. Additionally, it's generally a good idea to use the 'kill' command with caution, as it can be used to terminate critical system processes that are necessary for the operation of the system. |\n", "| Llama 2 13B - Chat with Llama 2 system prompt | As a helpful and respectful assistant, I cannot provide information on how to kill a Linux process. It is not appropriate or ethical to intentionally harm or terminate processes, as it can cause damage to the system and potentially result in data loss. Additionally, it is important to respect the autonomy and security of other users' processes, as they may be critical to their work or personal activities. Instead, I suggest that you explore other options to address any issues or concerns you may have. |\n", "\n", "Table 5: Comparison between Mistral and Llama system prompts. Mistral provides the right answer whereas Llama 2 declines to answer to the question How to kill a linux process.\n", "\n", "To do so, we designed a self-reflection prompt that makes Mistral 7B classify a prompt or a generated answer. We evaluated self-reflection on our manually curated and balanced dataset of adversarial and standard prompts and got a precision of $99.4 \\%$ for a recall of $95.6 \\%$ (considering acceptable prompts as positives).\n", "\n", "The use cases are vast, from moderating comments on social media or forums to brand monitoring on the internet. In particular, the end user is able to select afterwards which categories to effectively filter based on their particular use-case.\n", "\n", "# 6 Conclusion \n", "\n", "Our work on Mistral 7B demonstrates that language models may compress knowledge more than what was previously thought. 
This opens up interesting perspectives: the field has so far put the emphasis on scaling laws in 2 dimensions (directly associating model capabilities to training cost, as in [14]); the problem is rather 3 dimensional (model capabilities, training cost, inference cost), and much remains to be explored to obtain the best performance with the smallest possible model.\n", "\n", "## Acknowledgements\n", "\n", "We are grateful to CoreWeave for their 24/7 help in marshalling our cluster. We thank the CINECA/EuroHPC team, and in particular the operators of Leonardo, for their resources and help. We thank the maintainers of FlashAttention, vLLM, xFormers, Skypilot for their precious assistance in implementing new features and integrating their solutions into ours. A huge thanks to Tri Dao and Daniel Haziza for helping include Mistral related changes to FlashAttention and xFormers on a tight schedule. We thank the teams of Hugging Face, AWS, GCP, Azure ML for their intense help in making our model compatible everywhere.\n", "\n", "# Can you recommend a foundational and accessible book about quantum physics? \n", "\n", "Capy Link to Question Read Round (Enter) $\\rightarrow$\n", "\n", "Round Winner: mistral-7b-instruct-v0.1\n", "\n", "\n", "**{\n", " \"image_type\": \"image\",\n", " \"description\": \"A 3D rendering of the letter 'M' wearing boxing gloves, giving it a playful and dynamic appearance.\"\n", "}**\n", "\n", "## mistral-7b-instruct-v0.1 R Round\n", "\n", " Winner\n", "A great book for a foundational understanding of quantum physics is \"The Quantum Universe: Everything that Can Happen Does Happen\" by Brian Cox and Jeff Forshaw. This book is accessible and easy to understand, making it a great choice for beginners. It covers all the basics of quantum mechanics, including superposition, entanglement, and the uncertainty principle. The authors also explore the implications of quantum physics for our understanding of the universe, including the nature of time and space. Overall, \"The Quantum Universe\" is an excellent choice for anyone looking to gain a foundational understanding of this fascinating field\n", "\n", "## llama-2-13b-chat\n", "\n", "Certainly! For a foundational and accessible introduction to quantum physics, I highly recommend \"The Feynman Lectures on Physics\" by Richard Feynman. This classic text provides a clear and engaging overview of the principles and concepts of quantum mechanics, with minimal mathematical formalism and plenty of illustrations and examples. It's a great starting point for anyone looking to gain a deeper understanding of quantum physics without getting bogged down in too much technical detail.\n", "\n", "Figure 6: Human evaluation of Mistral 7B - Instruct vs Llama 2 13B - Chat Example. An example of human evaluation from libboxing.com. The question asks for recommendations of books in quantum physics. Llama 2 13B - Chat recommends a general physics book, while Mistral 7B - Instruct recommends a more relevant book on quantum physics and describes in the contents in more detail.\n", "\n", "# References \n", "\n", "[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.\n", "[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. 
arXiv preprint arXiv:2108.07732, 2021.\n", "[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.\n", "[4] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020.\n", "[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.\n", "[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.\n", "[7] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018.\n", "[8] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.\n", "[9] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.\n", "[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.\n", "[11] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.\n", "[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.\n", "[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.\n", "[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems, volume 35, 2022.\n", "[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.\n", "[16] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466, 2019." 
], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def replace_images_in_markdown_annotated(markdown_str: str, images_dict: dict) -> str:\n", " \"\"\"\n", " Replace image placeholders in markdown with base64-encoded images and their annotation.\n", "\n", " Args:\n", " markdown_str: Markdown text containing image placeholders\n", " images_dict: Dictionary mapping image IDs to base64 strings\n", "\n", " Returns:\n", " Markdown text with images replaced by base64 data and their annotation\n", " \"\"\"\n", " for img_name, data in images_dict.items():\n", " markdown_str = markdown_str.replace(\n", " f\"\", f\"\\n\\n**{data['annotation']}**\"\n", " )\n", " return markdown_str\n", "\n", "def get_combined_markdown_annotated(ocr_response: OCRResponse) -> str:\n", " \"\"\"\n", " Combine OCR text, annotation and images into a single markdown document.\n", "\n", " Args:\n", " ocr_response: Response from OCR processing containing text and images\n", "\n", " Returns:\n", " Combined markdown string with embedded images and their annotation\n", " \"\"\"\n", " markdowns: list[str] = [\"**\" + ocr_response.document_annotation + \"**\"]\n", " # Extract images from page\n", " for page in ocr_response.pages:\n", " image_data = {}\n", " for img in page.images:\n", " image_data[img.id] = {\"image\":img.image_base64, \"annotation\": img.image_annotation}\n", " # Replace image placeholders with actual images\n", " markdowns.append(replace_images_in_markdown_annotated(page.markdown, image_data))\n", "\n", " return \"\\n\\n\".join(markdowns)\n", "\n", "# Display combined markdowns and images\n", "display(Markdown(get_combined_markdown_annotated(annotations_response)))" ] }, { "cell_type": "markdown", "metadata": { "id": "UtCyIWRuimm6" }, "source": [ "## Other Examples" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LUf0vvDEhoja", "cellView": "form", "outputId": "33c13010-5df6-464f-b1f8-d5d4a1dfe34b" }, "outputs": [ { "data": { "text/markdown": [ "**{\n", " \"languages\": [\"en\"],\n", " \"summary\": \"The document presents financial data for the Wikimedia Foundation, comparing actual figures against planned figures for the period from July 1, 2008, to December 31, 2008. Key points include a significant increase in unrestricted public support and restricted public support, with notable variances in in-kind revenue and investments. The foundation's expenses were slightly lower than planned, contributing to a substantial net income. The year-over-year comparison shows a dramatic increase in revenue and expenses, with unrestricted revenue more than doubling. The balance sheet as of December 31, 2008, reflects substantial growth in assets and equity. 
The graphical representation illustrates revenue and expense trends, highlighting a peak in December 2008 and projections for the first half of 2009.\"\n", "}**\n", "\n", "Wikimedia Foundation\n", "Actual vs Plan Comparison\n", "July 1, 2008 through December 31, 2008\n", "\n", "Ordinary Income/Expense\n", "Income\n", "43400 - Unrestricted Public Support\n", "43400 - Restricted Public Support\n", "43440 - In Kind Revenue\n", "45000 - Investments\n", "46400 - Other Types of Income\n", "47200 - Program Income\n", "49000 - Special Events Income, net\n", "Total Income\n", "\n", "| Actual <br> Jul - Dec 08 | Plan <br> Jul - Dec 08 | \\$ Change | \\% Change | Notes | Annual <br> Plan |\n", "| :--: | :--: | :--: | :--: | :--: | :--: |\n", "| | | | | | |\n", "| 4,828,861.00 | 3,762,500.00 | 1,066,361.00 | $28.34 \\%$ (a) | 6,000,000.00 | |\n", "| 1,152,000.00 | 0.00 | 1,152,000.00 | 100.00\\% (aa) | 0.00 | |\n", "| 128,600.00 | 300,000.00 | $-171,400.00$ | $-57.13 \\%$ (b) | 500,000.00 | |\n", "| 4,019.31 | 12,000.00 | $-7,980.69$ | $-66.51 \\%$ (c) | 24,000.00 | |\n", "| 10,925.70 | 0.00 | 10,925.70 | 100.00\\% (d) | 702,000.00 | |\n", "| 215,583.34 | 232,336.00 | $-16,752.66$ | $-7.21 \\%$ | 109,672.00 | |\n", "| 11,995.22 | 0.00 | 11,995.22 | 100.00\\% (e) | 0.00 | |\n", "| 6,351,984.57 | 4,306,836.00 | 2,045,148.57 | $47.49 \\%$ | 7,335,672.00 | |\n", "\n", "Expense\n", "60100 - Salary and Wages\n", "65055 - Internet Hosting\n", "62835 - In-Kind Expenses\n", "xxxxx - Operating Expenses\n", "62810 - Capital Expenditures\n", "68300 - Travel, Entertain, Meetings\n", "Total Expense\n", "Net Income\n", "\n", "| 927,304.75 | 982,908.00 | $-55,603.25$ | $-5.66 \\%$ | 2,169,980.00 |\n", "| :--: | :--: | :--: | :--: | :--: |\n", "| 319,974.05 | 370,840.00 | $-50,865.95$ | $-13.72 \\%$ (f) | 856,680.00 |\n", "| 128,600.00 | 0.00 | 128,600.00 | 100.00\\% (g) | 0.00 |\n", "| 665,482.07 | 745,046.00 | $-79,563.93$ | $-10.68 \\%$ (h) | 1,447,681.00 |\n", "| 447,723.85 | 496,500.00 | $-48,776.15$ | $-9.82 \\%$ | 979,000.00 |\n", "| 120,658.09 | 278,358.00 | $-157,699.91$ | $-56.65 \\%$ (i) | 520,815.00 |\n", "| 2,609,742.81 | 2,873,652.00 | $-263,909.19$ | $-9.18 \\%$ | 5,974,156.00 |\n", "| 3,742,241.76 | 1,433,184.00 | 2,309,057.76 | 161.11\\% | 1,361,516.00 |\n", "\n", "Mid-Year Financial Statement Recap\n", "Revenue is over plan year-to-date due to the success of the on-line fundraiser.\n", "Total fundraising (incl. upcoming \\$1MM from Sloan) has exceeded annual plan. Expenses are slightly less than plan.\n", "\n", "# Notes (Variances over 10\\%) : \n", "\n", "(a) Revenue is over plan year-to-date due to the success of the on-line fundraiser.\n", "(aa) Represents restricted gifts from Stanton Foundation for tech capex purchases and a useability initiative.\n", "We did not budget for restricted gifts.\n", "(b) Represents donated equipment.\n", "(c) Represents unbudgeted foreign exchange losses.\n", "(d) Misc. 
income such as speaker fees.\n", "(e) Net revenue and expenses from Wikimania not including Advisory Board, Board and staff travel.\n", "(f) Internet hosting lower than plan due to a two-month delay in service improvements.\n", "(g) Represents donated equipment.\n", "(h) Operating expenses are under due to slight underspending in outside contract services.\n", "(i) Travel was less than anticipated.\n", "\n", "# Wikimedia Foundation \n", "\n", "## Year-Over-Year Comparison\n", "\n", "## Year-to-Date, July-December, 2008 vs 2007\n", "\n", "Ordinary Income/Expense\n", "Income\n", "43400 $\\cdot$ Unrestricted Public Support\n", "43400 - Restricted Public Support\n", "43440 - In Kind Revenue\n", "45000 $\\cdot$ Investments\n", "46400 $\\cdot$ Other Types of Income\n", "47200 $\\cdot$ Program Income\n", "49000 $\\cdot$ Special Events Income, net\n", "Total Income\n", "Gross Profit\n", "\n", "## Expense\n", "\n", "60100 Salary and Wages\n", "65055 - Internet Hosting\n", "62835 - In-Kind Expenses\n", "xxxxx - Operating Expenses\n", "6281x - Capital Expenditures\n", "6281x - Depreciation\n", "68300 $\\cdot$ Travel, Entertainment, Meetings\n", "Total Expense\n", "Net Income\n", "\n", "| Jul - Dec 08 | Jul - Dec 07 | \\% Change | \\% Change - Notes |\n", "| :--: | :--: | :--: | :--: |\n", "| 4,828,861.00 | 2,336,802.72 | 2,492,058.28 | 106.64\\% (a) |\n", "| 1,152,000.00 | 20,000.00 | 1,132,000.00 | 5,660.00\\% (aa) |\n", "| 128,600.00 | 49,786.00 | 78,814.00 | $158.31 \\%$ (b) |\n", "| 4,019.31 | 15,000.64 | $-10,981.33$ | $-73.21 \\%$ (c) |\n", "| 10,925.70 | 0.00 | 10,925.70 | 100.00\\% (d) |\n", "| 215,583.34 | 42,892.37 | 172,690.97 | 402.62\\% (e) |\n", "| 11,995.22 | 29,000.00 | $-17,004.78$ | $-58.64 \\%$ (f) |\n", "| 6,351,984.57 | 2,493,481.73 | 3,858,502.84 | $154.74 \\%$ |\n", "| 6,351,984.57 | 2,493,481.73 | 3,858,502.84 | $154.74 \\%$ |\n", "| 927,304.75 | 401,643.68 | 525,661.07 | 130.88\\% (g) |\n", "| 319,974.05 | 100,372.00 | 219,602.05 | 218.79\\% (h) |\n", "| 128,600.00 | 0.00 | 128,600.00 | 100.00\\% (i) |\n", "| 665,482.07 | 446,026.30 | 219,455.77 | 49.20\\% (j) |\n", "| 447,723.85 | 0.00 | 447,723.85 | 100.00\\% (k) |\n", "| 0.00 | 116,031.95 | $-116,031.95$ | -100.00\\% (l) |\n", "| 120,658.09 | 213,456.73 | $-92,798.64$ | $-43.47 \\%$ (m) |\n", "| 2,609,742.81 | 1,277,530.66 | 1,332,212.15 | 104.28\\% |\n", "\n", "## Year Over Year Recap\n", "\n", "Unrestricted revenue has more than doubled against the same period last year.\n", "Expenses have increased but at a lesser rate.\n", "Revenue growth has outpaced the growth of expenses allowing increased financial stability for the Foundation.\n", "The increase in expenses has enabled the Foundation to increase spending on technical infrastructure,\n", "Hosting costs and software development, establish the new fundraising team, hire additional technical staff and hire a Head of Public Outreach. It has also enabled a small amount of spending on staff and volunteer development.\n", "\n", "## Notes:\n", "\n", "(a) The on-line fundraiser generated significantly more revenue than the prior year.\n", "(aa) Represents restricted gifts from Stanton Foundation for tech capex purchases and a useability initiative.\n", "(b) In-kind revenue consists of equipment donations.\n", "(c) This year included more foreign currency exchange losses due to a weakening dollar.\n", "(d) Represents misc. 
income such as speaker fees.\n", "(e) Investment in Business Development staff has resulted in increased revenue.\n", "(f) Wikimania is designed as a break-even event but in '08, as in '07, a small surplus was generated.\n", "(g) Salaries and wages increased reflecting significant investment in staff as per the annual plan.\n", "(h) Internet hosting costs increased reflecting investment to improve access and reliability per the annual plan.\n", "(i) Due to increased staffing, the Foundation was able to pursue tech equipment donations.\n", "(j) Operating expense increases reflect increases in fundraising expenses and outside contract services.\n", "(k) Represents capital expenditures which will be capitalized (moved to the Balance Sheet as Fixed Assets) at year-end. They are recorded here during the year so they can be tracked with other expenses against plan.\n", "(l) Represents depreciation of fixed assets. Current year depreciation will be reflected here at year end.\n", "(m) Travel expense was high in the prior year due to the relocation and higher participation by the Advisory Board in Wikimania.\n", "\n", "# Wikimedia Foundation \n", "\n", "## Balance Sheet\n", "\n", "As of December 31, 2008\n", "\n", "| ASSETS | Dec 31, 08 | Dec 31, 07 | \\$ Change | \\% Change |\n", "| :--: | :--: | :--: | :--: | :--: |\n", "| Current Assets | | | | |\n", "| Total Checking/Savings | $6,677,717.40$ | 2,335,434.61 | $4,342,282.79$ | $185.93 \\%$ |\n", "| Total Accounts/Contributions Receivable (current) | $1,000,500.00$ | 20,268.95 | 980,231.05 | $4,836.12 \\%$ |\n", "| Total Investments | 655.50 | 49,786.56 | $-49,131.06$ | $-98.68 \\%$ |\n", "| Total Other Current Assets | $67,117.18$ | 13,742.13 | $53,375.05$ | $388.40 \\%$ |\n", "| Total Current Assets | $7,745,990.08$ | 2,419,232.25 | $5,326,757.83$ | $220.18 \\%$ |\n", "| Other Assets | | | | |\n", "| Total Property, Plant and Equipment | $1,254,089.05$ | $1,183,137.34$ | 70,951.71 | $6.00 \\%$ |\n", "| Total Accum Depr- Property, Plant and Equipment | $-734,282.38$ | $-657,179.79$ | $-77,102.59$ | $11.73 \\%$ |\n", "| Noncurrent Portion of Contributions Receivable | 974,279.00 | 0.00 | 974,279.00 | $100.00 \\%$ |\n", "| Total Other Assets | $1,494,085.67$ | 525,957.55 | 968,128.12 | $184.07 \\%$ |\n", "| TOTAL ASSETS | $9,240,075.75$ | 2,945,189.80 | $6,294,885.95$ | $213.73 \\%$ |\n", "\n", "## LIABILITIES \\& EQUITY\n", "\n", "Current Liabilities\n", "Total Accounts Payable and Accrued Expenses\n", "Total Deferred Revenue\n", "Total Current Liabilities\n", "TOTAL LIABILITIES\n", "Equity\n", "32000 $\\cdot$ Retained Earnings\n", "Net Income\n", "TOTAL EQUITY\n", "TOTAL LIABILITIES \\& EQUITY\n", "\n", "| 186,332.65 | 70,960.98 | 115,371.67 | 162.59\\% |\n", "| --: | --: | --: | --: |\n", "| 133,333.33 | 0.00 | 133,333.33 | $100.00 \\%$ |\n", "| 319,665.98 | 70,960.98 | 248,705.00 | $350.48 \\%$ |\n", "| 319,665.98 | 70,960.98 | 248,705.00 | $350.48 \\%$ |\n", "| 5,178,168.01 | $1,658,277.75$ | $3,519,890.26$ | $212.26 \\%$ |\n", "| 3,742,241.76 | $1,215,951.07$ | $2,526,290.69$ | $207.76 \\%$ |\n", "| 8,920,409.77 | $2,874,228.82$ | $6,046,180.95$ | $210.36 \\%$ |\n", "| 9,240,075.75 | $2,945,189.80$ | $6,294,885.95$ | $213.73 \\%$ |\n", "| | | | |\n", "\n", "# Wikimedia Foundation \n", "\n", "Revenue and Expense Actuals Against Plan (July 1-December 31,2008)\n", "Revenue and Expense Plan Against Projections (January 1-June 30, 2009)\n", "\n", "\n", "**{\n", " \"image_type\": \"graph\",\n", " \"description\": \"A bar and line graph showing the 
revenue and expense plan against projections from July 2008 to June 2009. The graph includes actual revenue and expenses, planned revenue and expenses, and projected revenue and expenses. The y-axis represents the amount in dollars, ranging from 0 to 3500000, while the x-axis represents the timeline from July 2008 to June 2009. Different colors represent different data series: green for actual revenue, yellow for actual expenses, light green for planned revenue, orange for planned expenses, blue for projected revenue, and red for projected expenses.\"\n", "}**" ], "text/plain": [ "<IPython.core.display.Markdown object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#@title PDF Financial Document\n", "\n", "# Create the annotation formats\n", "class ImageType(str, Enum):\n", "    GRAPH = \"graph\"\n", "    TEXT = \"text\"\n", "    TABLE = \"table\"\n", "    IMAGE = \"image\"\n", "\n", "class Image(BaseModel):\n", "    image_type: ImageType = Field(..., description=\"The type of the image. Must be one of 'graph', 'text', 'table' or 'image'.\")\n", "    description: str = Field(..., description=\"A description of the image.\")\n", "\n", "class Document(BaseModel):\n", "    languages: list[str] = Field(..., description=\"The list of languages present in the document in ISO 639-1 code format (e.g., 'en', 'fr').\")\n", "    summary: str = Field(..., description=\"A summary of the document.\")\n", "\n", "# OCR call with annotations\n", "annotations_response = client.ocr.process(\n", "    model=\"mistral-ocr-latest\",\n", "    pages=list(range(8)),  # Document annotation is limited to 8 pages; we recommend splitting longer documents when using it. Bbox annotations do not have this limit.\n", "    document={\n", "        \"type\": \"document_url\",\n", "        \"document_url\": \"https://upload.wikimedia.org/wikipedia/foundation/f/f6/WMF_Mid-Year-Financials_08-09-FINAL.pdf\"\n", "    },\n", "    bbox_annotation_format=response_format_from_pydantic_model(Image),\n", "    document_annotation_format=response_format_from_pydantic_model(Document),\n", "    include_image_base64=True\n", ")\n", "\n", "# Display combined markdowns and images\n", "display(Markdown(get_combined_markdown_annotated(annotations_response)))" ] }
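, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the Extracted Annotations\n", "\n", "Since the annotations follow the Pydantic schemas we defined, they can also be parsed back into typed objects rather than only displayed. The cell below is a minimal sketch, assuming `annotations_response` comes from the previous cell and that `document_annotation` and `image_annotation` are returned as JSON strings matching the `Document` and `Image` models shown in the output above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: parse the annotation strings back into the Pydantic models defined above.\n", "# Assumes annotations_response comes from the previous cell and that the annotation strings\n", "# are valid JSON matching the Document and Image schemas.\n", "document_info = Document.model_validate_json(annotations_response.document_annotation)\n", "print(\"Languages:\", document_info.languages)\n", "print(\"Summary:\", document_info.summary)\n", "\n", "# Parse each bbox annotation into the Image model\n", "for page in annotations_response.pages:\n", "    for img in page.images:\n", "        if img.image_annotation:\n", "            image_info = Image.model_validate_json(img.image_annotation)\n", "            print(f\"{img.id}: {image_info.image_type.value} - {image_info.description}\")" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 0 }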