File: concept-deep-dive/tokenization/README.md
Type: Markdown Guide
Use Cases: Tokenization

# Concept Deep Dive: Tokenization

Tokenization is a crucial concept for LLMs, and it can be more complex than one might think!

For our tokenization implementation, please refer to [mistral-common](https://github.com/mistralai/mistral-common).

In this deep dive, we will dig into 3 versions of our tokenizer:
- V1: The tokenizer behind our very first models.
- V2: Introducing control tokens and function calling!
- V3: Better function calling implementation.
    - V3-Tekken: A variant based on `tiktoken`, as opposed to the other versions, which are based on `sentencepiece`.

## Overview

| Section                  | Description                                                                 |
|:------------------------:|:---------------------------------------------------------------------------:|
| [Basics](basics.md)               | Basics of tokenization. |
| [Boundaries & Token Healing](boundaries.md)               | Main problems with tokenization and token healing. |
| [Control Tokens](control_tokens.md)               | Introduction to Control Tokens and their advantages. |
| [Templates](templates.md)               | A summarized list of our tokenizers with their chat templates.           |
| [Tokenizer](tokenizer.md)          | Build your own tokenizer with `sentencepiece`.                             |
| [Tool Calling](tool_calling.md)          | How tokenization works for our tool calling.                            |
|          |                            |
| [Chat Templates](chat_templates.md)          | Legacy documentation on our chat templates.                             |