import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

vLLM is an open-source LLM inference and serving engine. It is particularly appropriate as a target platform for self-deploying Mistral models on-premise.

## Pre-requisites

- The hardware requirements for vLLM are listed on its installation documentation page.
- By default, vLLM sources the model weights from Hugging Face. To access Mistral model repositories you need to be authenticated on Hugging Face, so an access token `HF_TOKEN` with the `READ` permission will be required. You should also make sure that you have accepted the conditions of access on each model card page.
- If you already have the model artifacts on your infrastructure, you can use them directly by pointing vLLM to their local path instead of a Hugging Face model ID. In that case you can skip all Hugging Face related setup steps; see the sketch after this list.
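
For instance, assuming the weights have already been downloaded into a local directory that mirrors the layout of the corresponding Hugging Face repository (the path below is purely hypothetical), the model argument used throughout this guide can simply be replaced by that path:

```bash
# Hypothetical local directory containing previously downloaded model weights
vllm serve /opt/models/Mistral-Nemo-Instruct-2407 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral
```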

## Getting started

The following sections will guide you through the process of deploying and querying Mistral models on vLLM.

### Installing vLLM

- Create a Python virtual environment and install the `vllm` package (version `>=0.6.1.post1` to ensure maximum compatibility with all Mistral models); see the example after this list.

- Authenticate on the Hugging Face Hub using your access token `$HF_TOKEN`:

  ```bash
  huggingface-cli login --token $HF_TOKEN
  ```
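
For the first step, a minimal setup could look like the following sketch (adapt the environment name and version constraint to your needs):

```bash
# Create and activate a virtual environment, then install vLLM
python -m venv vllm-env
source vllm-env/bin/activate
pip install "vllm>=0.6.1.post1"
```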

### Offline mode inference

When using vLLM in offline mode, the model is loaded and used for one-off batch inference workloads.

<Tabs>

    <TabItem value="vllm-batch-nemo" label="Text input (Mistral NeMo)">

    ```python
    from vllm import LLM
    from vllm.sampling_params import SamplingParams

    model_name = "mistralai/Mistral-NeMo-Instruct-2407"
    sampling_params = SamplingParams(max_tokens=8192)

    llm = LLM(
        model=model_name,
        tokenizer_mode="mistral",
        load_format="mistral",
        config_format="mistral",
    )

    messages = [
        {
            "role": "user",
            "content": "Who is the best French painter. Answer with detailed explanations.",
        }
    ]

    res = llm.chat(messages=messages, sampling_params=sampling_params)
    print(res[0].outputs[0].text)
    ```

    </TabItem>

    <TabItem value="vllm-batch-small" label="Text input (Mistral Small)">

    ```python
    from vllm import LLM
    from vllm.sampling_params import SamplingParams

    model_name = "mistralai/Mistral-Small-Instruct-2409"
    sampling_params = SamplingParams(max_tokens=8192)

    llm = LLM(
        model=model_name,
        tokenizer_mode="mistral",
        load_format="mistral",
        config_format="mistral",
    )

    messages = [
        {
            "role": "user",
            "content": "Who is the best French painter. Answer with detailed explanations.",
        }
    ]

    res = llm.chat(messages=messages, sampling_params=sampling_params)
    print(res[0].outputs[0].text)
    ```

    </TabItem>

    <TabItem value="vllm-batch-pixtral" label="Image + text input (Pixtral-12B)">
    Suppose you want to caption the following images:
      <center>
          <a href="https://picsum.photos/id/1/512/512"><img alt="" src="/img/laptop.png" width="20%"/></a>
          <a href="https://picsum.photos/id/11/512/512"><img alt="" src="/img/countryside.png" width="20%"/></a>
          <a href="https://picsum.photos/id/111/512/512"><img alt="" src="/img/vintage_car.png" width="20%"/></a>
      </center>

    You can do so by running the following code:

    ```python
    from vllm import LLM
    from vllm.sampling_params import SamplingParams

    model_name = "mistralai/Pixtral-12B-2409"
    max_img_per_msg = 3

    sampling_params = SamplingParams(max_tokens=8192)
    llm = LLM(
        model=model_name,
        tokenizer_mode="mistral",
        load_format="mistral",
        config_format="mistral",
        limit_mm_per_prompt={"image": max_img_per_msg},
    )

    urls = [f"https://picsum.photos/id/{id}/512/512" for id in ["1", "11", "111"]]

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
            ] + [{"type": "image_url", "image_url": {"url": u}} for u in urls],
        },
    ]

    res = llm.chat(messages=messages, sampling_params=sampling_params)
    print(res[0].outputs[0].text)
    ```
    </TabItem>

</Tabs>

### Server mode inference

In server mode, vLLM spawns an HTTP server that continuously waits for clients to connect and send requests concurrently. The server exposes a REST API that implements the OpenAI protocol, allowing you to directly reuse existing code relying on the OpenAI API.

<Tabs>

    <TabItem value="vllm-server-text-nemo" label="Text input (Mistral NeMo)">
    Start the inference server to deploy your model, e.g. for Mistral NeMo:

      ```bash
      vllm serve mistralai/Mistral-Nemo-Instruct-2407 \
        --tokenizer_mode mistral \
        --config_format mistral \
        --load_format mistral
      ```

    You can now run inference requests with text input:

      <Tabs>
        <TabItem value="vllm-infer-nemo-curl" label="cURL">
            ```bash
            curl --location 'http://localhost:8000/v1/chat/completions' \
                --header 'Content-Type: application/json' \
                --header 'Authorization: Bearer token' \
                --data '{
                    "model": "mistralai/Mistral-Nemo-Instruct-2407",
                    "messages": [
                      {
                        "role": "user",
                        "content": "Who is the best French painter? Answer in one short sentence."
                      }
                    ]
                  }'
            ```
        </TabItem>
        <TabItem value="vllm-infer-nemo-python" label="Python">
            ```python
            import httpx

            url = 'http://localhost:8000/v1/chat/completions'
            headers = {
                'Content-Type': 'application/json',
                'Authorization': 'Bearer token'
            }
            data = {
                "model": "mistralai/Mistral-Nemo-Instruct-2407",
                "messages": [
                    {
                        "role": "user",
                        "content": "Who is the best French painter? Answer in one short sentence."
                    }
                ]
            }

            response = httpx.post(url, headers=headers, json=data)

            print(response.json())
            ```
        </TabItem>
      </Tabs>

    </TabItem>

    <TabItem value="vllm-server-text-small" label="Text input (Mistral Small)">
    Start the inference server to deploy your model, e.g. for Mistral Small:

      ```bash
      vllm serve mistralai/Mistral-Small-Instruct-2409 \
        --tokenizer_mode mistral \
        --config_format mistral \
        --load_format mistral
      ```

    You can now run inference requests with text input:

      <Tabs>
        <TabItem value="vllm-infer-small-curl" label="cURL">
            ```bash
            curl --location 'http://localhost:8000/v1/chat/completions' \
                --header 'Content-Type: application/json' \
                --header 'Authorization: Bearer token' \
                --data '{
                    "model": "mistralai/Mistral-Small-Instruct-2409",
                    "messages": [
                      {
                        "role": "user",
                        "content": "Who is the best French painter? Answer in one short sentence."
                      }
                    ]
                  }'
            ```
        </TabItem>
        <TabItem value="vllm-infer-small-python" label="Python">
            ```python
            import httpx

            url = 'http://localhost:8000/v1/chat/completions'
            headers = {
                'Content-Type': 'application/json',
                'Authorization': 'Bearer token'
            }
            data = {
                "model": "mistralai/Mistral-Small-Instruct-2409",
                "messages": [
                    {
                        "role": "user",
                        "content": "Who is the best French painter? Answer in one short sentence."
                    }
                ]
            }

            response = httpx.post(url, headers=headers, json=data)

            print(response.json())
            ```
        </TabItem>
      </Tabs>

    </TabItem>

    <TabItem value="vllm-server-mm" label="Image + text input (Pixtral-12B)">

Start the inference server to deploy your model, e.g. for Pixtral-12B:

```bash
vllm serve mistralai/Pixtral-12B-2409 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral
```

:::info

- The default number of image inputs per prompt is set to 1. To increase it, set the `--limit_mm_per_prompt` option (e.g. `--limit_mm_per_prompt 'image=4'`), as in the combined example after this note.

- If you encounter memory issues, set the `--max_model_len` option to reduce the memory requirements of vLLM (e.g. `--max_model_len 16384`). More troubleshooting details can be found in the vLLM documentation.

:::
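
For illustration, a Pixtral deployment combining both options (the values below are arbitrary examples) could look like:

```bash
# Example values only: allow up to 4 images per prompt and cap the context length
vllm serve mistralai/Pixtral-12B-2409 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --limit_mm_per_prompt 'image=4' \
    --max_model_len 16384
```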

You can now run inference requests with images and text inputs. Suppose you want to caption the following image:

    <center>
        <a href="https://picsum.photos/id/237/512/512"><img alt="" src="/img/doggo.png" width="20%"/></a>
    </center>
    <br/>

You can prompt the model and retrieve its response like so:

<Tabs>
    <TabItem value="vllm-infer-pixtral-curl" label="cURL">
        ```bash
        curl --location 'http://localhost:8000/v1/chat/completions' \
            --header 'Content-Type: application/json' \
            --header 'Authorization: Bearer token' \
            --data '{
                "model": "mistralai/Pixtral-12B-2409",
                "messages": [
                  {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Describe this image in a short sentence."},
                        {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}}
                    ]
                  }
                ]
              }'
        ```
    </TabItem>
    <TabItem value="vllm-infer-pixtral-python" label="Python">
        ```python
        import httpx

        url = "http://localhost:8000/v1/chat/completions"
        headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
        data = {
            "model": "mistralai/Pixtral-12B-2409",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Describe this image in a short sentence."},
                        {
                            "type": "image_url",
                            "image_url": {"url": "https://picsum.photos/id/237/200/300"},
                        },
                    ],
                }
            ],
        }

        response = httpx.post(url, headers=headers, json=data)

        print(response.json())
        ```
    </TabItem>
</Tabs>

    </TabItem>

</Tabs>
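
Since the server implements the OpenAI chat completions protocol, you can also query it with the official `openai` Python client instead of raw HTTP calls. Below is a minimal sketch, assuming the `openai` package is installed and one of the servers started above is running on `localhost:8000`:

```python
from openai import OpenAI

# vLLM does not check the API key unless the server was started with --api-key,
# so any placeholder value works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="mistralai/Mistral-Nemo-Instruct-2407",  # must match the model the server was started with
    messages=[
        {
            "role": "user",
            "content": "Who is the best French painter? Answer in one short sentence.",
        }
    ],
)
print(response.choices[0].message.content)
```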

## Deploying with Docker

If you are looking to deploy vLLM as a containerized inference server, you can leverage
the project's official Docker image (see more details in the
[vLLM Docker documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)).

- Set the Hugging Face access token environment variable in your shell:

  ```bash
  export HF_TOKEN=your-access-token
  ```

- Run the Docker command to start the container:

  ```bash
  docker run --runtime nvidia --gpus all \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
      -p 8000:8000 \
      --ipc=host \
      vllm/vllm-openai:latest \
      --model mistralai/Mistral-NeMo-Instruct-2407 \
      --tokenizer_mode mistral \
      --load_format mistral \
      --config_format mistral
  ```

Once the container is up and running, you will be able to run inference on your model using the same code as in a standalone deployment.
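
For example, to quickly check that the containerized server is reachable, you can list the models it serves through the OpenAI-compatible `/v1/models` endpoint (adjust host and port to your deployment):

```bash
# Should return a JSON payload listing mistralai/Mistral-NeMo-Instruct-2407
curl http://localhost:8000/v1/models
```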