# Concept Deep Dive: Tokenization
Tokenization is a crucial concept when working with LLMs, and it can be more complex than one may think!
For our tokenization implementation, please refer to [mistral-common](https://github.com/mistralai/mistral-common).
In this deep dive, we will dig into the following versions of our tokenizer:
- V1: The tokenizer behind our very first models.
- V2: Introducing control tokens and function calling!
- V3: Better function calling implementation.
- V3-Tekken: A different version based on `tiktoken`, as opposed to the other versions, which are based on `sentencepiece`.
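
To make the version list concrete, here is a minimal sketch of loading each tokenizer version with `mistral-common` and encoding one user message. It assumes the package is installed (`pip install mistral-common`) and that the `MistralTokenizer.v1()`, `v2()`, and `v3(is_tekken=True)` constructors are available in your installed release. The version labels and the helper names `load_tokenizer`, `encode_user_message`, and `demo` are illustrative choices, not part of the library.

```python
from typing import List

# Illustrative labels for the versions covered in this deep dive.
VERSIONS = ("v1", "v2", "v3", "v3-tekken")


def load_tokenizer(version: str):
    """Return a mistral-common tokenizer for one of the labels in VERSIONS."""
    if version not in VERSIONS:
        raise ValueError(f"unknown tokenizer version: {version!r}")

    # Imported lazily so the helper can be defined without the package installed.
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    if version == "v1":
        return MistralTokenizer.v1()
    if version == "v2":
        return MistralTokenizer.v2()
    if version == "v3":
        return MistralTokenizer.v3()
    # Assumption: the tiktoken-based variant is selected with is_tekken=True.
    return MistralTokenizer.v3(is_tekken=True)


def encode_user_message(version: str, text: str) -> List[int]:
    """Encode a single user message and return its token ids."""
    from mistral_common.protocol.instruct.messages import UserMessage
    from mistral_common.protocol.instruct.request import ChatCompletionRequest

    tokenizer = load_tokenizer(version)
    request = ChatCompletionRequest(messages=[UserMessage(content=text)])
    return tokenizer.encode_chat_completion(request).tokens


def demo() -> None:
    """Print the token count and first ids produced by each version."""
    for version in VERSIONS:
        ids = encode_user_message(version, "Hello, how are you?")
        print(version, len(ids), ids[:10])
```

Calling `demo()` prints one line per version, which makes it easy to compare how the same message is tokenized across versions. The sections listed in the overview explain where the differences come from.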
## Overview
| Section | Description |
|:------------------------:|:---------------------------------------------------------------------------:|
| [Basics](basics.md) | Basics of tokenization. |
| [Boundaries & Token Healing](boundaries.md) | Main problems with tokenization and token healing. |
| [Control Tokens](control_tokens.md) | Introduction to Control Tokens and their advantages. |
| [Templates](templates.md) | A summarized list of our tokenizers with their chat templates. |
| [Tokenizer](tokenizer.md) | Make your own tokenizer with sentencepiece. |
| [Tool Calling](tool_calling.md) | Learn how tokenization works for our tool calling. |
| | |
| [Chat Templates](chat_templates.md) | Legacy documentation around our chat templates. |