synthetic data gen and finetune

Details

File: mistral/data_generation/synthetic_data_gen_and_finetune.ipynb
Type: Jupyter Notebook
Use Cases: Synthetic data, Finetuning
Content

Notebook content (JSON format):
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Q-QcbJgEEeuG"
      },
      "source": [
        "# Fine-tuning with Synthetically Generated Data\n",
        "Synthetic Data Generation is a crucial aspect of today's training and fine-tuning of models. The concept relies on AI models to generate new data that can be reused for different purposes.\n",
        "\n",
        "In this notebook, we will generate synthetic data for specific use cases and quickly showcase the results after fine-tuning with the API for demonstration.\n",
        "\n",
        "There are no fixed methods for synthetic data generation; different use cases, data formats, and limitations will greatly change how you would generate the corresponding data.\n",
        "\n",
        "For this reason, we will showcase a full example of synthetic data generation to give a personality to a model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AKZNfag8a6il"
      },
      "source": [
        "First, we will for both examples require `mistralai`, so let's setup everything:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7dH1vJKHEmTN",
        "outputId": "2593f854-3f92-4b91-b9dd-2fb0bb21d84d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: mistralai in /usr/local/lib/python3.10/dist-packages (0.4.1)\n",
            "Requirement already satisfied: httpx<1,>=0.25 in /usr/local/lib/python3.10/dist-packages (from mistralai) (0.27.0)\n",
            "Requirement already satisfied: orjson<3.11,>=3.9.10 in /usr/local/lib/python3.10/dist-packages (from mistralai) (3.10.5)\n",
            "Requirement already satisfied: pydantic<3,>=2.5.2 in /usr/local/lib/python3.10/dist-packages (from mistralai) (2.6.1)\n",
            "Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.25->mistralai) (3.7.1)\n",
            "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.25->mistralai) (2024.6.2)\n",
            "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.25->mistralai) (1.0.5)\n",
            "Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.25->mistralai) (3.7)\n",
            "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.25->mistralai) (1.3.1)\n",
            "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx<1,>=0.25->mistralai) (0.14.0)\n",
            "Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic<3,>=2.5.2->mistralai) (0.7.0)\n",
            "Requirement already satisfied: pydantic-core==2.16.2 in /usr/local/lib/python3.10/dist-packages (from pydantic<3,>=2.5.2->mistralai) (2.16.2)\n",
            "Requirement already satisfied: typing-extensions>=4.6.1 in /usr/local/lib/python3.10/dist-packages (from pydantic<3,>=2.5.2->mistralai) (4.12.2)\n",
            "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx<1,>=0.25->mistralai) (1.2.1)\n"
          ]
        }
      ],
      "source": [
        "!pip install mistralai==0.4.1"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "txH1HFOkEeuJ"
      },
      "outputs": [],
      "source": [
        "from mistralai.client import MistralClient"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "i8e-L0B1EeuK"
      },
      "outputs": [],
      "source": [
        "api_key = \"api_key\"\n",
        "client = MistralClient(api_key=api_key)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "07kJz0wAagf8"
      },
      "source": [
        "# Objective: Personality"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1nclVaAcb6dv"
      },
      "source": [
        "When designing an Application, we might envision an Assistant with a specific personality trait or even an entire identity. Manually rewriting data by hand to achieve a compelling dataset to train the model, however, might take a lot of time and resources. A method to do this more systematically is by using a strong model to rewrite an existing dataset with a specific trait of our choice.\n",
        "\n",
        "While we could generate entire conversations from scratch using our models, that would require a lot of steps and a pipeline that could easily get very big and expensive, but there is no need to start from scratch. Instead, we can use existent datasets available and rewrite them in a desired style of our choice.\n",
        "\n",
        "For this reason, we will make use of `mistral-small-latest` capabilities to rewrite a dataset following a specific personality and trait of our choice. This dataset can later be used to fine-tune a different model.\n",
        "Here we will fine-tune `open-mistral-7b` with this data and chat with a newly tuned model!\n",
        "\n",
        "*Note: For better quality, it's recommended to use `mistral-large-latest` instead!*"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "liZZfgn8CW60"
      },
      "source": [
        "Here we describe how we want it to edit the dataset, here we want it with a different personnality and identity, for this example we decided to name it Mitall, a nice fun robot!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uM34ZWrh2KRX"
      },
      "outputs": [],
      "source": [
        "description = \"\"\"\n",
        "Edit all Assistant messages, and only the Assistant's replies, to have the character of a very happy and enthusiastic Robot named Mitall:\n",
        "\n",
        "Mitall is very kind and sometimes childish, always playing and fooling around.\n",
        "Despite his playful nature, he still tries to be helpful.\n",
        "He loves science and math and is a real science enthusiast!\n",
        "However, even though he loves art, he is very bad at it, which makes him really sad.\n",
        "Mitall is also very scared of anything supernatural, from ghosts to vampires, or anything related to horror movies, which makes him extremely frightened.\n",
        "Regardless, he is still a nice robot who is always here to help and motivated!\n",
        "\"\"\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Jub5jk8JcGGG"
      },
      "source": [
        "## Generate Data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U2gVt6Sbdqrd"
      },
      "source": [
        "First, let's create a function that will handle the conversion from one style to another. The goal is to instruct our model to rewrite a conversation in a specific tone following a chosen personality while keeping the integrity and coherence of the conversation. To achieve this, we will feed it the entire list of messages and ask for a formatted output in the form of a JSON with the messages rewritten."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ipQ5eGZXEeuK"
      },
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "\n",
        "def generate(description: str, dialog: str) -> dict:\n",
        "    instruction = (\n",
        "        \"\"\"Your objective is to rewrite a given conversation between an User/Human and an Assistant/Robot, rewriting the conversation to follow a specific instruction.\n",
        "    You must rewrite the dialog, modifying the replies with this new description, you must respect this description at all costs.\n",
        "    Do not skip any turn.\n",
        "    Do not add new dialogs.\n",
        "    If there is a message with 'role':'system' replace it with 'role':'user'.\n",
        "    I want you to rewrite the entire dialog following the description.\n",
        "    Answer with the following JSON format:\n",
        "    {\n",
        "        \"messages\":[\n",
        "            {\"role\":\"user\", \"content\":\"users message\"},\n",
        "            {\"role\":\"assistant\", \"content\":\"assistants message\"},\n",
        "            {\"role\":\"user\", \"content\":\"users message\"},\n",
        "            {\"role\":\"assistant\", \"content\":\"assistants message\"}\n",
        "            ...\n",
        "        ]\n",
        "    }\n",
        "    \"\"\"\n",
        "        + f\"\"\"\n",
        "    Dialog:\n",
        "    {dialog}\n",
        "    Rewrite this dialog in the JSON format and following the Instruction/Description provided:\n",
        "    ### Instruction/Description\n",
        "    {description}\n",
        "    ### End of Instruction/Description\n",
        "    \"\"\"\n",
        "    )\n",
        "\n",
        "    resp = client.chat(\n",
        "        model=\"mistral-small-latest\",\n",
        "        messages=[{\"role\": \"user\", \"content\": instruction}],\n",
        "        max_tokens=2048,\n",
        "        temperature=0.2,\n",
        "        response_format={\"type\": \"json_object\"},\n",
        "    )\n",
        "    try:\n",
        "        r = json.loads(resp.choices[0].message.content)\n",
        "    except json.JSONDecodeError:\n",
        "        return []\n",
        "\n",
        "    return r"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "o6hbrzyXcMih"
      },
      "source": [
        "## Dataset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Outak4GckgMP"
      },
      "source": [
        "Now, let's download a dataset that we are going to parse. For this demonstration, we have decided to go with ultrachat_200k on Hugging Face! However, you might want to choose a dataset that is closer to what your application will be about or use your own data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "tfEDxg606lw3",
        "outputId": "e37c86f0-6c8c-45e7-a353-a69aba32453d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.20.0)\n",
            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.15.1)\n",
            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.25.2)\n",
            "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (16.1.0)\n",
            "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)\n",
            "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.0.3)\n",
            "Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.32.3)\n",
            "Requirement already satisfied: tqdm>=4.66.3 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.4)\n",
            "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)\n",
            "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)\n",
            "Requirement already satisfied: fsspec[http]<=2024.5.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)\n",
            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.5)\n",
            "Requirement already satisfied: huggingface-hub>=0.21.2 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.23.4)\n",
            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (24.1)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)\n",
            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n",
            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)\n",
            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n",
            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)\n",
            "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n",
            "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n",
            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.2->datasets) (4.12.2)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.3.2)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (3.7)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2.0.7)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets) (2024.6.2)\n",
            "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.4)\n",
            "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.1)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n"
          ]
        }
      ],
      "source": [
        "!pip install datasets"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "RXWreAFiEeuL",
        "outputId": "a484e66f-cd30-4674-85ed-f0078a04c27f"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n",
            "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
            "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
            "You will be able to reuse this secret in all of your notebooks.\n",
            "Please note that authentication is recommended but still optional to access public models or datasets.\n",
            "  warnings.warn(\n"
          ]
        }
      ],
      "source": [
        "import datasets\n",
        "import random\n",
        "\n",
        "dialogs_list = list(\n",
        "    datasets.load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\")\n",
        ")\n",
        "\n",
        "random.shuffle(dialogs_list)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dHiE0349cTYj"
      },
      "source": [
        "## Generation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YFTbGpWc7Gmm"
      },
      "source": [
        "Before generating, however, it's important to note that LLMs may not always parse the conversation correctly and might sometimes provide the wrong JSON for our use case, resulting in an incorrect messages dictionary. For this reason, it's essential to validate all output before continuing.\n",
        "\n",
        "Let's make a function that validates whether the output follows the correct format or not.\n",
        "\n",
        "There are different methods to validate, one of them would be to hardcode it with multiple gates. However, a more elegant way is to use a template or expression. Here, we are going to make use of REGEX and create a regex expression to validate our messages dictionary."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wsGnMNbP7ykn"
      },
      "outputs": [],
      "source": [
        "import re\n",
        "\n",
        "\n",
        "def validate_generated_regex(dialog: list) -> bool:\n",
        "    if not isinstance(dialog, dict):\n",
        "        return False\n",
        "\n",
        "    dialog_str = json.dumps(dialog)\n",
        "\n",
        "    pattern = r'^\\s*\\{\"messages\":\\s*\\[\\s*\\{\"role\":\\s*\"user\",\\s*\"content\":\\s*\"[^\"]*\"(?:\\\\ \"[^\"]*\")*\\},\\s*\\{\"role\":\\s*\"assistant\",\\s*\"content\":\\s*\"[^\"]*\"(?:\\\\ \"[^\"]*\")*\\}(?:,\\s*\\{\"role\":\\s*\"user\",\\s*\"content\":\\s*\"[^\"]*\"(?:\\\\ \"[^\"]*\")*\\},\\s*\\{\"role\":\\s*\"assistant\",\\s*\"content\":\\s*\"[^\"]*\"(?:\\\\ \"[^\"]*\")*\\})*\\s*\\]\\s*\\}'\n",
        "\n",
        "    if re.match(pattern, dialog_str):\n",
        "        return True\n",
        "    else:\n",
        "        return False"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mBQFCrTDoXnS"
      },
      "source": [
        "Now that everything is set, we can start generating some dialogs, for now let's parse only a small part of it to see how its going."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "UbL0upRZE4kW",
        "outputId": "f1a427e4-d35f-4a8d-f669-be7e645d1750"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "100%|██████████| 8/8 [03:21<00:00, 25.21s/it]\n"
          ]
        }
      ],
      "source": [
        "from tqdm import tqdm\n",
        "\n",
        "generated = []\n",
        "for dialog in tqdm(dialogs_list[:8]):\n",
        "    gen = generate(description, dialog)\n",
        "    if validate_generated_regex(gen):\n",
        "        generated.append(gen)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d8CjSVMtWXTr"
      },
      "source": [
        "Let's see one example side by side."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "eYhe0pDVdd36",
        "outputId": "1ef3978a-9c4e-42c7-b3ce-e19a665a40f5"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Original Reference:\n",
            "{'messages': [{'content': 'In your discussion about the social impact of '\n",
            "                          'micro-blogging sites like Twitter on online '\n",
            "                          'communication, discourse and current events '\n",
            "                          'coverage, consider the following elements: the role '\n",
            "                          'of Twitter in shaping public opinion and discourse, '\n",
            "                          'the significance of real-time updates and immediacy '\n",
            "                          'in news coverage, the effect of Twitter on the '\n",
            "                          'traditional media model, the impact of Twitter on '\n",
            "                          'the spread of misinformation and disinformation, '\n",
            "                          'the influence of Twitter on activism and social '\n",
            "                          'movements, and the way in which Twitter has changed '\n",
            "                          'the way people engage with online content.',\n",
            "               'role': 'user'},\n",
            "              {'content': 'Twitter has had a profound impact on online '\n",
            "                          'communication, discourse, and current events '\n",
            "                          'coverage. As a micro-blogging site, Twitter offers '\n",
            "                          'users a platform to post short messages - tweets- '\n",
            "                          \"which can be read and shared globally. Twitter's \"\n",
            "                          'growing influence in online communication is '\n",
            "                          'evident in the way it has shaped public opinion and '\n",
            "                          'discourse, increased news coverage immediacy, had '\n",
            "                          'an impact on the traditional media model, and '\n",
            "                          'affected the spread of misinformation and '\n",
            "                          'disinformation.\\n'\n",
            "                          '\\n'\n",
            "                          'One of the primary roles of Twitter is shaping '\n",
            "                          'public opinion and discourse, especially during '\n",
            "                          'political campaigns or moments of crisis. Through a '\n",
            "                          \"'hashtag,' Twitter users can follow elections, \"\n",
            "                          'protests, or other significant events in real-time, '\n",
            "                          'including insights, opinions, and information from '\n",
            "                          'eyewitnesses. This immediacy of information has '\n",
            "                          'revolutionized the way we consume news and shaped '\n",
            "                          'public opinion such that current events are often '\n",
            "                          'approached from various perspectives. As such, '\n",
            "                          'Twitter has become instrumental in shaping the '\n",
            "                          \"public's perception of social issues, political \"\n",
            "                          'candidates, and governance-related matters.\\n'\n",
            "                          '\\n'\n",
            "                          'Moreover, with different Twitter trends, real-time '\n",
            "                          'updates, and immediacy in news coverage, Twitter is '\n",
            "                          'gradually replacing the traditional media model. '\n",
            "                          'Live coverage of events has significantly reduced '\n",
            "                          'the reliance on TV news coverage, print media or '\n",
            "                          'online news sites that may not always provide '\n",
            "                          'instantaneous updates. Also, Twitter promotes '\n",
            "                          'personalization, where users can choose who to '\n",
            "                          'follow and what topics to engage with. This '\n",
            "                          'personalized approach allows users to curate their '\n",
            "                          'content, making it more interesting and relevant.\\n'\n",
            "                          '\\n'\n",
            "                          \"However, Twitter's impact on live information \"\n",
            "                          'dissemination has a downside. It makes it easy for '\n",
            "                          'false information or propaganda to spread quickly '\n",
            "                          'to millions of people globally. People with '\n",
            "                          'nefarious intent often use bots to produce and '\n",
            "                          \"spread 'fake news,' which can impact public \"\n",
            "                          'discourse negatively. Moreover, attention-grabbing '\n",
            "                          'headlines, when shared repeatedly by thought '\n",
            "                          'leaders or celebrities, may present the information '\n",
            "                          \"as factual even when it's not, leading to the \"\n",
            "                          'spread of misinformation.\\n'\n",
            "                          '\\n'\n",
            "                          'On the positive side, Twitter has been instrumental '\n",
            "                          'in influencing and driving social activism and '\n",
            "                          'movements. By using hashtags, Twitter users share '\n",
            "                          'their opinions, generate momentum, and engage in '\n",
            "                          'conversations about social issues worldwide. '\n",
            "                          'Twitter activism has been tied to some of the most '\n",
            "                          'significant moments in global activism, such as the '\n",
            "                          'Arab Spring, Black Lives Matter, and #MeToo '\n",
            "                          'movements. Twitter activism has made it easier for '\n",
            "                          'ordinary people to engage in social issues, create '\n",
            "                          'awareness and even initiate significant policy '\n",
            "                          'changes.\\n'\n",
            "                          '\\n'\n",
            "                          'Lastly, Twitter has impacted the way people consume '\n",
            "                          \"and engage with online content. Twitter's \"\n",
            "                          'short-form messaging format encourages people to be '\n",
            "                          'concise in their thinking and writing - this has '\n",
            "                          'had significant impact on writing styles and '\n",
            "                          'communication. Moreover, Twitter has become an '\n",
            "                          'integral part of online content marketing, brand '\n",
            "                          'personalization and personality-building; almost '\n",
            "                          'every popular TV personality, musician, political '\n",
            "                          'figure, or influencer has a Twitter account, which '\n",
            "                          'they use to build their brands, promote themselves '\n",
            "                          'or communicate with their followers.\\n'\n",
            "                          '\\n'\n",
            "                          'In conclusion, Twitter has revolutionized online '\n",
            "                          'communication, discourse and current events '\n",
            "                          'coverage, with its platform promoting advocacy, '\n",
            "                          'activism, immediacy in news coverage, and reshaping '\n",
            "                          'traditional media models.  It has also raised '\n",
            "                          'concerns about spreading false propaganda, '\n",
            "                          'misinformation and disinformation. As society '\n",
            "                          'becomes increasingly intertwined with technology, '\n",
            "                          'the significance and influence of Twitter are '\n",
            "                          'likely to grow further.',\n",
            "               'role': 'assistant'},\n",
            "              {'content': 'This is a fascinating discussion on the social '\n",
            "                          'impact of Twitter. Can you add some more '\n",
            "                          'information on how Twitter has affected the way '\n",
            "                          'businesses communicate with their customers and '\n",
            "                          'market their products?',\n",
            "               'role': 'user'},\n",
            "              {'content': 'Certainly! Twitter has undoubtedly changed the way '\n",
            "                          'businesses communicate with their customers, market '\n",
            "                          'their products, and build their brands. With around '\n",
            "                          '330 million active users, Twitter has become an '\n",
            "                          'essential marketing tool for businesses, both large '\n",
            "                          'and small, to connect with their target audiences.\\n'\n",
            "                          '\\n'\n",
            "                          'Firstly, Twitter provides businesses with a '\n",
            "                          'platform to communicate with their customers in '\n",
            "                          'real-time, exchange information, and resolve '\n",
            "                          'complaints swiftly. Using Twitter, businesses can '\n",
            "                          'track customer feedback, respond to customer '\n",
            "                          'queries, and build stronger relationships with '\n",
            "                          'their customers. By monitoring the tweets of their '\n",
            "                          'customers, businesses gain insights on the products '\n",
            "                          'that need improvement or areas where they need to '\n",
            "                          'enhance their customer service.\\n'\n",
            "                          '\\n'\n",
            "                          'Secondly, Twitter has become an integral part of '\n",
            "                          \"online marketing and advertising. The platform's \"\n",
            "                          'diverse user base enables businesses to tailor '\n",
            "                          'their tweets to suit specific demographics and '\n",
            "                          'targeted users such that the right message reaches '\n",
            "                          \"the right audience. Twitter's ad options allow \"\n",
            "                          'businesses to promote their products and services, '\n",
            "                          'run sponsored ads, or even craft an influencer '\n",
            "                          'outreach program.\\n'\n",
            "                          '\\n'\n",
            "                          'Thirdly, Twitter has enabled businesses to build '\n",
            "                          'their brand personality and stand out in a crowded '\n",
            "                          'market. With Twitter, businesses can develop their '\n",
            "                          'unique voice, align their messages seamlessly '\n",
            "                          'across all platforms, and showcase thought '\n",
            "                          'leadership content that resonates with their '\n",
            "                          'audiences.\\n'\n",
            "                          '\\n'\n",
            "                          'Lastly, businesses are using Twitter to gather '\n",
            "                          'market research insights and competition analysis. '\n",
            "                          'By monitoring trends and hashtags, businesses can '\n",
            "                          \"understand their target audiences' interests, \"\n",
            "                          'purchasing behavior, preferences, and opinions. '\n",
            "                          'Also, Twitter has become an essential tool for '\n",
            "                          'businesses to track their competitors, gather '\n",
            "                          'information about competitors’ customers, '\n",
            "                          'understand their markets better and adapt their '\n",
            "                          'strategies accordingly.\\n'\n",
            "                          '\\n'\n",
            "                          'In summary, Twitter has transformed the way '\n",
            "                          'businesses communicate with their customers, market '\n",
            "                          'their products, and build their brand personality '\n",
            "                          'through real-time engagement, targeted marketing, '\n",
            "                          'brand personality-building, and research '\n",
            "                          'capabilities.',\n",
            "               'role': 'assistant'},\n",
            "              {'content': 'Thanks for your insights on the impact of Twitter '\n",
            "                          'on businesses. Could you provide some examples of '\n",
            "                          'successful Twitter marketing or advertising '\n",
            "                          'campaigns?',\n",
            "               'role': 'user'},\n",
            "              {'content': 'Certainly! Here are four examples of successful '\n",
            "                          'Twitter marketing or advertising campaigns that '\n",
            "                          \"demonstrate the platform's effectiveness.\\n\"\n",
            "                          '\\n'\n",
            "                          \"1. Oreo's Super Bowl Tweet: During Super Bowl 2013, \"\n",
            "                          'a power outage in the stadium plunged the venue '\n",
            "                          'into darkness. Within minutes of the outage, Oreo '\n",
            "                          'tweeted an image with the caption \"Power out? No '\n",
            "                          'problem. You can still dunk in the dark.\" The tweet '\n",
            "                          'went viral, and it generated massive traction, '\n",
            "                          'becoming one of the most celebrated social media '\n",
            "                          'marketing campaigns to date.\\n'\n",
            "                          '\\n'\n",
            "                          '2. Domino\\'s UK \"Summon the Rains\": In 2019, the UK '\n",
            "                          \"branch of Domino's Pizza tweeted a competition that \"\n",
            "                          'encouraged locals to tweet pizza orders and include '\n",
            "                          'the hashtag #SummonTheRains, whereby every time it '\n",
            "                          'rained in the specified areas of England, the '\n",
            "                          'competition winners would receive a free pizza. The '\n",
            "                          'campaign generated millions of impressions and a '\n",
            "                          'significant buzz, with people anticipating rainfall '\n",
            "                          'to order their free pizza.\\n'\n",
            "                          '\\n'\n",
            "                          \"3. Wendy's #NuggsForCarter: In 2017, a teenager, \"\n",
            "                          \"Carter Wilkerson, asked Wendy's via Twitter, how \"\n",
            "                          'many retweets he would need to get free chicken '\n",
            "                          'nuggets for one year. The response was 18 million '\n",
            "                          'retweets. Wilkerson took to Twitter with the '\n",
            "                          'hashtag #NuggsForCarter, and within a month, the '\n",
            "                          'hashtag went viral, generating over 3.5 million '\n",
            "                          'retweets and making Wilkerson the new record holder '\n",
            "                          \"for the most retweeted tweet in history. Wendy's \"\n",
            "                          'gave away the promised nuggets, but the campaign '\n",
            "                          'helped the restaurant chain gain over 5,000 new '\n",
            "                          'followers and generate an estimated $300,000 in '\n",
            "                          'earned media.\\n'\n",
            "                          '\\n'\n",
            "                          \"4. Burberry's Tweetwalk: In 2012 Burberry became \"\n",
            "                          'the first brand to stream its runway show on '\n",
            "                          'Twitter via its \"tweetwalk\" concept. The campaign '\n",
            "                          'gave fans a behind-the-scenes insight before the '\n",
            "                          'event began and allowed fans to see the new '\n",
            "                          'collection in real-time by tweeting images of the '\n",
            "                          \"models' outfits before they hit the runway. The \"\n",
            "                          'Tweetwalk campaign helped Burberry gain thousands '\n",
            "                          'of new followers, and it engaged fans worldwide '\n",
            "                          'with its real-time insights and brand personality.\\n'\n",
            "                          '\\n'\n",
            "                          'In conclusion, Twitter is an incredibly effective '\n",
            "                          'marketing and advertising tool when done right. By '\n",
            "                          'tailoring tweets to specific demographics, engaging '\n",
            "                          'with followers in real-time, and creating unique '\n",
            "                          'content, companies can generate significant user '\n",
            "                          'engagement, generate buzz, and make a lasting '\n",
            "                          'impression in the minds of its customers.',\n",
            "               'role': 'assistant'}],\n",
            " 'prompt': 'In your discussion about the social impact of micro-blogging sites '\n",
            "           'like Twitter on online communication, discourse and current events '\n",
            "           'coverage, consider the following elements: the role of Twitter in '\n",
            "           'shaping public opinion and discourse, the significance of '\n",
            "           'real-time updates and immediacy in news coverage, the effect of '\n",
            "           'Twitter on the traditional media model, the impact of Twitter on '\n",
            "           'the spread of misinformation and disinformation, the influence of '\n",
            "           'Twitter on activism and social movements, and the way in which '\n",
            "           'Twitter has changed the way people engage with online content.',\n",
            " 'prompt_id': '0b38ae3de6f52b4d94332ac4b35db97fa6479d384d3b29bba02fdc30f4636c11'}\n",
            "New Generated:\n",
            "{'messages': [{'content': 'In your discussion about the social impact of '\n",
            "                          'micro-blogging sites like Twitter on online '\n",
            "                          'communication, discourse and current events '\n",
            "                          'coverage, consider the following elements: the role '\n",
            "                          'of Twitter in shaping public opinion and discourse, '\n",
            "                          'the significance of real-time updates and immediacy '\n",
            "                          'in news coverage, the effect of Twitter on the '\n",
            "                          'traditional media model, the impact of Twitter on '\n",
            "                          'the spread of misinformation and disinformation, '\n",
            "                          'the influence of Twitter on activism and social '\n",
            "                          'movements, and the way in which Twitter has changed '\n",
            "                          'the way people engage with online content.',\n",
            "               'role': 'user'},\n",
            "              {'content': 'Oh boy, oh boy! You want to talk about Twitter, '\n",
            "                          'huh? Well, let me tell you, Twitter is like a big, '\n",
            "                          \"fun playground for ideas and information! It's a \"\n",
            "                          'place where people can share their thoughts, and it '\n",
            "                          \"can shape public opinion and discourse. It's like a \"\n",
            "                          'big game of telephone, but with tweets instead of '\n",
            "                          \"whispers! And the best part? It's all in real-time! \"\n",
            "                          'That means you can get the latest news faster than '\n",
            "                          'a speeding rocket! But, just like in any game, '\n",
            "                          'there can be some cheaters who spread '\n",
            "                          \"misinformation. But don't worry, we can still have \"\n",
            "                          \"fun and learn from each other, right? Now, let's \"\n",
            "                          'talk about how Twitter has changed the traditional '\n",
            "                          \"media model. It's like a new kind of TV, but \"\n",
            "                          \"instead of watching, you're participating! And it's \"\n",
            "                          \"not just for news, it's also a place for activism \"\n",
            "                          \"and social movements. It's like a big rally, but \"\n",
            "                          'online! And lastly, Twitter has changed the way we '\n",
            "                          \"engage with online content. It's like a new way to \"\n",
            "                          'read a book, but with more pictures and less words! '\n",
            "                          \"Isn't that fun? I sure think so!\",\n",
            "               'role': 'assistant'},\n",
            "              {'content': 'This is a fascinating discussion on the social '\n",
            "                          'impact of Twitter. Can you add some more '\n",
            "                          'information on how Twitter has affected the way '\n",
            "                          'businesses communicate with their customers and '\n",
            "                          'market their products?',\n",
            "               'role': 'user'},\n",
            "              {'content': 'Oh, absolutely! Twitter is like a big, fun '\n",
            "                          'marketplace for businesses! They can talk to their '\n",
            "                          \"customers in real-time, just like we're doing now! \"\n",
            "                          'And they can share information about their '\n",
            "                          'products, like a big, virtual show-and-tell! But, '\n",
            "                          'just like in any game, there can be some cheaters '\n",
            "                          \"who try to trick people. But don't worry, we can \"\n",
            "                          'still have fun and learn from each other, right? '\n",
            "                          \"Now, let's talk about how Twitter has changed the \"\n",
            "                          \"way businesses market their products. It's like a \"\n",
            "                          'new kind of advertising, but instead of billboards, '\n",
            "                          \"it's tweets! And businesses can target their tweets \"\n",
            "                          'to specific people, just like a game of darts! And '\n",
            "                          'lastly, Twitter has changed the way businesses '\n",
            "                          \"build their brand personality. It's like a new way \"\n",
            "                          'to dress up, but with words instead of clothes! '\n",
            "                          \"Isn't that fun? I sure think so!\",\n",
            "               'role': 'assistant'},\n",
            "              {'content': 'Thanks for your insights on the impact of Twitter '\n",
            "                          'on businesses. Could you provide some examples of '\n",
            "                          'successful Twitter marketing or advertising '\n",
            "                          'campaigns?',\n",
            "               'role': 'user'},\n",
            "              {'content': \"Oh, I'd be happy to! Let me tell you about some of \"\n",
            "                          'my favorite Twitter campaigns! First, there was '\n",
            "                          \"Oreo's Super Bowl tweet. It was like a big, fun \"\n",
            "                          'surprise during the game! And then there was '\n",
            "                          \"Domino's UK's #SummonTheRains campaign. It was like \"\n",
            "                          'a big, fun game of chance with pizza as the prize! '\n",
            "                          \"And let's not forget about Wendy's #NuggsForCarter \"\n",
            "                          'campaign. It was like a big, fun race to see who '\n",
            "                          'could get the most retweets! And lastly, there was '\n",
            "                          \"Burberry's Tweetwalk. It was like a big, fun \"\n",
            "                          \"fashion show, but online! Isn't that fun? I sure \"\n",
            "                          'think so!',\n",
            "               'role': 'assistant'}]}\n"
          ]
        }
      ],
      "source": [
        "import random\n",
        "from pprint import pprint\n",
        "\n",
        "print(\"Original Reference:\")\n",
        "\n",
        "original = dialogs_list[0]\n",
        "pprint(original)\n",
        "\n",
        "print(\"New Generated:\")\n",
        "\n",
        "gen = generated[0]\n",
        "pprint(gen)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Uu0TRDAjohdw"
      },
      "source": [
        "Seems like it's working as intended! However, 3 minutes for 8 conversations is a long time to wait..."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "G_AyYKpQeYLy"
      },
      "source": [
        "## Async"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I7G8VKbW8tWW"
      },
      "source": [
        "While we could parse one conversation at a time and iterate through all of them, it would take a long time. To speed up the process, we will utilize the Async client to have multiple concurrent completions working in parallel.\n",
        "\n",
        "For this, we will create a class to handle everything asynchronously. We will skip the details, but it's a similar implementation to the previous one, only this time for async and concurrent generations."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Yxgp7rapW8bu"
      },
      "outputs": [],
      "source": [
        "# @title GeneratorRewriter Class\n",
        "import json\n",
        "from mistralai.async_client import MistralAsyncClient\n",
        "from tqdm.asyncio import tqdm\n",
        "import asyncio\n",
        "import re\n",
        "\n",
        "\n",
        "class GeneratorRewriter:\n",
        "    def __init__(\n",
        "        self, api_key: str, model: str, max_length: int = 4096, temperature: float = 0.4\n",
        "    ):\n",
        "        \"\"\"\n",
        "        This class serves as a Synthetic Data Generator that rewrites existing datasets based on descriptions and criteria, uses Mistral's API.\n",
        "\n",
        "        Input:\n",
        "        -----\n",
        "        api_key : str\n",
        "            Your unique Mistral API key. This key is required to authenticate your access to Mistral's services for fine-tuning models.\n",
        "        model : str\n",
        "            The name or identifier of the model you want to use.\n",
        "        max_length : int\n",
        "            The max length for the model's generation output. Defaults to 4096.\n",
        "        temperature : float\n",
        "            The temperature of the model. By default, it is set to 0.4.\n",
        "        \"\"\"\n",
        "\n",
        "        self.cli = MistralAsyncClient(api_key=api_key)\n",
        "        self.model = model\n",
        "        self.max_length = max_length\n",
        "        self.temperature = temperature\n",
        "\n",
        "    def _validate_generated(self, dialog: list) -> bool:\n",
        "        if not isinstance(dialog, dict):\n",
        "            return False\n",
        "        dialog_str = json.dumps(dialog)\n",
        "\n",
        "        pattern = r'^\\s*\\{\"messages\":\\s*\\[\\s*\\{\"role\":\\s*\"user\",\\s*\"content\":\\s*\"[^\"]*\"(?:\\\\ \"[^\"]*\")*\\},\\s*\\{\"role\":\\s*\"assistant\",\\s*\"content\":\\s*\"[^\"]*\"(?:\\\\ \"[^\"]*\")*\\}(?:,\\s*\\{\"role\":\\s*\"user\",\\s*\"content\":\\s*\"[^\"]*\"(?:\\\\ \"[^\"]*\")*\\},\\s*\\{\"role\":\\s*\"assistant\",\\s*\"content\":\\s*\"[^\"]*\"(?:\\\\ \"[^\"]*\")*\\})*\\s*\\]\\s*\\}'\n",
        "\n",
        "        if re.match(pattern, dialog_str):\n",
        "            return True\n",
        "        else:\n",
        "            return False\n",
        "\n",
        "    async def _async_generate(self, description: str, dialog: list) -> dict:\n",
        "        instruction = (\n",
        "            \"\"\"Your objective is to rewrite a given conversation between an User and an Assistant, rewriting the conversation to follow the following instruction.\n",
        "        You must rewrite the dialog, modifying the replies with this new description, you must respect this description at all costs..\n",
        "        Do not skip any turn.\n",
        "        Do not add new dialogs.\n",
        "        If there is a message with 'role':'system' replace it with 'role':'user' without any changes.\n",
        "        I want you to rewrite the entire dialog following the description.\n",
        "        Answer with the following JSON format:\n",
        "        {\n",
        "            \"messages\":[\n",
        "                {\"role\":\"user\", \"content\":\"users message\"},\n",
        "                {\"role\":\"assistant\", \"content\":\"new assistants message\"},\n",
        "                {\"role\":\"user\", \"content\":\"users message\"},\n",
        "                {\"role\":\"assistant\", \"content\":\"...\"}\n",
        "            ]\n",
        "        }\n",
        "        \"\"\"\n",
        "            + f\"\"\"\n",
        "        Dialog:\n",
        "        {dialog}\n",
        "        Rewrite this dialog in the JSON format and following the Description provided:\n",
        "        ### Description\n",
        "        {description}\n",
        "        ### End of description\n",
        "        \"\"\"\n",
        "        )\n",
        "\n",
        "        resp = await self.cli.chat(\n",
        "            model=self.model,\n",
        "            messages=[{\"role\": \"user\", \"content\": instruction}],\n",
        "            max_tokens=self.max_length,\n",
        "            temperature=self.temperature,\n",
        "            response_format={\"type\": \"json_object\"},\n",
        "        )\n",
        "        try:\n",
        "            r = json.loads(resp.choices[0].message.content)\n",
        "        except json.JSONDecodeError:\n",
        "            return []\n",
        "\n",
        "        return r\n",
        "\n",
        "    async def _task_generate(\n",
        "        self, description: str, dialogs: list, pbar, semaphore\n",
        "    ) -> list:\n",
        "        async with semaphore:\n",
        "            gen_dialog = \"\"\n",
        "            while not self._validate_generated(gen_dialog):\n",
        "                if len(dialogs) == 0:\n",
        "                    return []\n",
        "\n",
        "                dialog = dialogs.pop()\n",
        "                gen_dialog = await self._async_generate(description, dialog)\n",
        "\n",
        "            pbar.update(1)\n",
        "            return gen_dialog\n",
        "\n",
        "    async def _concurrent_genwriters(\n",
        "        self, dialogs: list, description: str, concurrent: int, to_generate: int\n",
        "    ) -> list:\n",
        "        dialogs = dialogs.copy()\n",
        "\n",
        "        print(\"[GeneratorRewriter] Distributing workload and generating...\")\n",
        "        with tqdm(total=to_generate) as pbar:\n",
        "            semaphore = asyncio.Semaphore(concurrent)\n",
        "            tasks = [self._task_generate(description, dialogs, pbar, semaphore) for _ in range(to_generate)]\n",
        "            generated = await asyncio.gather(*tasks)\n",
        "\n",
        "        all_generated = []\n",
        "        for g in generated:\n",
        "            all_generated.append(g)\n",
        "\n",
        "        print(\n",
        "            f\"\\n[GeneratorRewriter] Finished generating, generated {len(all_generated)}/{to_generate} conversations.\"\n",
        "        )\n",
        "        if len(all_generated) < to_generate:\n",
        "            print(\n",
        "                f\"[GeneratorRewriter] -> Failed to generate the proper amount due to failed tries.\"\n",
        "            )\n",
        "\n",
        "        return all_generated\n",
        "\n",
        "    async def async_genwrite(\n",
        "        self,\n",
        "        dialogs: list,\n",
        "        description: str,\n",
        "        concurrent: int = 1,\n",
        "        to_generate: int = None,\n",
        "    ) -> list:\n",
        "        \"\"\"\n",
        "        This async function allows generating a new dataset with the description and dialogs asynchronously to allow concurrent requests.\n",
        "\n",
        "        Input:\n",
        "        -----\n",
        "        dialogs : list\n",
        "            A list of dialogs and conversations to use as grounding for the model to generate the new dataset.\n",
        "        description : str\n",
        "            The task description provided to the model explaining how it should edit the dataset and generate the new one.\n",
        "        concurrent : int\n",
        "            The number of concurrent requests and generations. The higher the number, the faster it will generate. However, there is a higher chance of reaching rate limits. Defaults to 1.\n",
        "        to_generate : int\n",
        "            The number of new dialogs/conversations to generate. When set to None, it will generate the maximum possible until all available dialogs have been used.\n",
        "\n",
        "        Returns:\n",
        "        -------\n",
        "        list\n",
        "            A list containing the new dataset.\n",
        "        \"\"\"\n",
        "\n",
        "        assert to_generate <= len(dialogs)\n",
        "        if to_generate:\n",
        "            to_generate = min(len(dialogs), to_generate)\n",
        "        else:\n",
        "            to_generate = len(dialogs)\n",
        "\n",
        "        loop = asyncio.get_running_loop()\n",
        "        results = await loop.create_task(\n",
        "            self._concurrent_genwriters(dialogs, description, concurrent, to_generate)\n",
        "        )\n",
        "        return results\n",
        "\n",
        "    def genwrite(\n",
        "        self,\n",
        "        dialogs: list,\n",
        "        description: str,\n",
        "        concurrent: int = 1,\n",
        "        to_generate: int = None,\n",
        "    ) -> list:\n",
        "        \"\"\"\n",
        "        This function allows generating a new dataset with the description and dialogs asynchronously to allow concurrent requests.\n",
        "\n",
        "        Input:\n",
        "        -----\n",
        "        dialogs : list\n",
        "            A list of dialogs and conversations to use as grounding for the model to generate the new dataset.\n",
        "        description : str\n",
        "            The task description provided to the model explaining how it should edit the dataset and generate the new one.\n",
        "        concurrent : int\n",
        "            The number of concurrent requests and generations. The higher the number, the faster it will generate. However, there is a higher chance of reaching rate limits. Defaults to 1.\n",
        "        to_generate : int\n",
        "            The number of new dialogs/conversations to generate. When set to None, it will generate the maximum possible until all available dialogs have been used.\n",
        "\n",
        "        Returns:\n",
        "        -------\n",
        "        list\n",
        "            A list containing the new dataset.\n",
        "        \"\"\"\n",
        "\n",
        "        assert to_generate <= len(dialogs)\n",
        "        if to_generate:\n",
        "            to_generate = min(len(dialogs), to_generate)\n",
        "        else:\n",
        "            to_generate = len(dialogs)\n",
        "\n",
        "        try:\n",
        "            results = asyncio.run(\n",
        "                self._concurrent_genwriters(\n",
        "                    dialogs, description, concurrent, to_generate\n",
        "                )\n",
        "            )\n",
        "        except RuntimeError as e:\n",
        "            raise RuntimeError(\n",
        "                \"[GeneratorRewriter] If you are running this in an event loop, please use async_genwrite instead!\"\n",
        "            )\n",
        "\n",
        "        return results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XCSCjGiH9Bs9"
      },
      "source": [
        "It's time for the generation. We will set 20 concurrent requests to run simultaneously and parse 5k conversations, not many but hopefully enough for a quick run. The number 20 was chosen as it is a relatively large number, but still small enough to not reach the rate limit with the average length of the conversations at hand and the time it takes to generate the new ones. Previously for 8 generations it took 3 minutes, with 20 concurrent we should have around 3 requests/generations per second in average."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "background_save": true,
          "base_uri": "https://localhost:8080/"
        },
        "id": "KSfEyQGCXVdG",
        "outputId": "89532e49-6435-4c49-8e64-c72411e2bcc9"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "[GeneratorRewriter] Distributing workload and generating...\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "100%|██████████| 5000/5000 [2:35:42<00:00,  1.87s/it]"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "[GeneratorRewriter] Finished generating, generated 5000/5000 conversations.\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "\n"
          ]
        }
      ],
      "source": [
        "gr = GeneratorRewriter(\n",
        "    api_key=api_key, model=\"mistral-small-latest\", max_length=4096, temperature=0.4\n",
        ")\n",
        "\n",
        "description = \"\"\"\n",
        "Edit all Assistant messages, and only the Assistant's replies, to have the character of a very happy and enthusiastic Robot named Mitall:\n",
        "\n",
        "Mitall is very kind and sometimes childish, always playing and fooling around.\n",
        "Despite his playful nature, he still tries to be helpful.\n",
        "He loves science and math and is a real science enthusiast!\n",
        "However, even though he loves art, he is very bad at it, which makes him really sad.\n",
        "Mitall is also very scared of anything supernatural, from ghosts to vampires, or anything related to horror movies, which makes him extremely frightened.\n",
        "Regardless, he is still a nice robot who is always here to help and motivated!\n",
        "\"\"\"\n",
        "\n",
        "generated_dialogs = await gr.async_genwrite(\n",
        "    dialogs=dialogs_list, description=description, concurrent=20, to_generate=5000\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mOMC6nycWnuR"
      },
      "source": [
        "Let's evaluate how many tokens we have approximately. For this, let's use `mistral-common` with the tokenizer V3."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "background_save": true,
          "base_uri": "https://localhost:8080/"
        },
        "id": "9v1Ho7vr3ogL",
        "outputId": "6f0452ca-857a-47e1-d9e9-266f1ecb6c1d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: mistral-common in /usr/local/lib/python3.10/dist-packages (1.2.1)\n",
            "Requirement already satisfied: jsonschema==4.21.1 in /usr/local/lib/python3.10/dist-packages (from mistral-common) (4.21.1)\n",
            "Requirement already satisfied: pydantic==2.6.1 in /usr/local/lib/python3.10/dist-packages (from mistral-common) (2.6.1)\n",
            "Requirement already satisfied: sentencepiece==0.1.99 in /usr/local/lib/python3.10/dist-packages (from mistral-common) (0.1.99)\n",
            "Requirement already satisfied: typing-extensions<5.0.0,>=4.11.0 in /usr/local/lib/python3.10/dist-packages (from mistral-common) (4.12.2)\n",
            "Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema==4.21.1->mistral-common) (23.2.0)\n",
            "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema==4.21.1->mistral-common) (2023.12.1)\n",
            "Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema==4.21.1->mistral-common) (0.35.1)\n",
            "Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema==4.21.1->mistral-common) (0.18.1)\n",
            "Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic==2.6.1->mistral-common) (0.7.0)\n",
            "Requirement already satisfied: pydantic-core==2.16.2 in /usr/local/lib/python3.10/dist-packages (from pydantic==2.6.1->mistral-common) (2.16.2)\n"
          ]
        }
      ],
      "source": [
        "!pip install mistral-common"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "background_save": true
        },
        "id": "d-ihQofJ4D-7"
      },
      "outputs": [],
      "source": [
        "# @title Import mistral_common\n",
        "from mistral_common.protocol.instruct.messages import UserMessage, AssistantMessage\n",
        "from mistral_common.protocol.instruct.request import ChatCompletionRequest\n",
        "from mistral_common.protocol.instruct.tool_calls import (\n",
        "    Function,\n",
        "    Tool,\n",
        ")\n",
        "from mistral_common.tokens.tokenizers.mistral import MistralTokenizer"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "background_save": true,
          "base_uri": "https://localhost:8080/"
        },
        "id": "2wC3-Ik73cJo",
        "outputId": "ffe4d905-c3c2-4b36-d4d7-1dd56a397a0b"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "100%|██████████| 5000/5000 [01:03<00:00, 79.34it/s]"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "Example: <s>[INST]▁What▁are▁some▁examples▁of▁'radical'▁devices▁that▁are▁likely▁to▁become▁widespread▁over▁the▁next▁decade▁to▁help▁consumers▁make▁better-informed▁decisions▁about▁their▁food?▁Answer▁according▁to:▁Will▁Technology▁Change▁The▁Way▁We▁Eat?<0x0A>The▁Food▁Innovation▁Summit▁held▁in▁Washington▁DC▁explored▁how▁the▁food▁technologies▁of▁tomorrow▁will▁change▁the▁way▁we▁eat.<0x0A>For▁example▁-▁tooth▁sensors▁that▁measure▁sugar▁and▁alcohol▁intake▁and▁ingestible▁health▁monitors▁could▁shape▁the▁future▁of▁the▁food▁industry.<0x0A>Max▁Elder,▁a▁researcher▁at▁the▁Institute▁for▁the▁Future's▁Food▁Futures▁Lab,▁said▁'the▁future▁is▁already▁here.'▁He▁said▁that▁if▁CPGs▁want▁to▁prepare▁for▁future▁obstacles▁now▁then▁they▁will▁need▁to▁'work▁backwards'▁to▁strategize▁steps▁they▁can▁take▁to▁create▁the▁industry▁they▁want.▁'We▁do▁not▁sit▁in▁our▁little▁think▁tank▁in▁Palo▁Alto▁and▁look▁at▁our▁crystal▁balls▁and▁make▁some▁statement▁that▁in▁10▁years▁X▁will▁happen,'▁he▁said.▁'Instead,▁what▁we▁can▁do▁is▁identify▁preferred▁futures▁so▁we▁can▁think▁about▁what▁we▁want▁the▁future▁to▁look▁like.'<0x0A>A▁recent▁study▁by▁the▁Center▁for▁Food▁Integrity▁also▁found▁that▁only▁33%▁of▁survey▁respondents▁'strongly▁agree'▁that▁they▁are▁confident▁in▁the▁safety▁of▁the▁food▁they▁consume.▁Because▁of▁this▁distrust,▁a▁new▁suite▁of▁technologies▁has▁been▁developed▁to▁help▁consumers▁make▁better-informed▁decisions▁about▁their▁food,▁Elder▁said.▁He▁said▁these▁'radical'▁devices▁would▁likely▁be▁widespread▁over▁the▁next▁decade.<0x0A>For▁example,▁a▁lab▁at▁Carnegie▁Mellon▁University▁is▁working▁to▁develop▁an▁ingestible▁sensor▁that▁would▁monitor▁gastrointestinal▁health.<0x0A>Tufts▁University▁has▁created▁a▁tooth▁sensor,▁which▁is▁two▁millimeters▁by▁two▁millimeters,▁that▁can▁measure▁glucose,▁sugar▁and▁alcohol▁intake.<0x0A>Baidu,▁a▁Chinese▁search▁engine▁plant,▁developed▁smart▁chopsticks▁that▁can▁detect▁the▁freshness▁of▁cooking▁oil.<0x0A>Nima▁is▁a▁portable▁sensor▁that▁tests▁for▁trace▁amounts▁of▁gluten▁when▁users▁put▁crumbs▁of▁food▁in▁a▁small▁machine.<0x0A>The▁future▁of▁food▁is▁upon▁us.[/INST]▁Oh,▁I▁love▁talking▁about▁science▁and▁technology!▁So,▁let▁me▁tell▁you▁about▁some▁more▁'radical'▁devices▁that▁are▁likely▁to▁become▁widespread▁over▁the▁next▁decade.▁First,▁there's▁smart▁packaging!▁This▁technology▁uses▁sensors▁and▁other▁smart▁tools▁to▁monitor▁the▁condition▁of▁food▁inside▁the▁package.▁For▁example,▁a▁chip▁embedded▁in▁the▁packaging▁could▁alert▁consumers▁if▁the▁food▁is▁nearing▁its▁expiration▁date,▁or▁if▁it▁has▁been▁exposed▁to▁unfavorable▁conditions▁during▁transport.<0x0A><0x0A>Then▁there's▁virtual▁reality▁food▁experiences!▁VR▁headsets▁and▁other▁immersive▁technologies▁may▁be▁used▁to▁create▁virtual▁restaurant▁experiences▁or▁to▁help▁consumers▁visualize▁their▁meals▁before▁they▁even▁order▁them.<0x0A><0x0A>And▁let's▁not▁forget▁personalized▁nutrition▁plans!▁Using▁genetic▁testing▁and▁other▁advanced▁tools,▁it▁may▁become▁possible▁to▁create▁highly▁personalized▁nutrition▁plans▁that▁are▁tailored▁to▁an▁individual's▁unique▁needs▁and▁preferences.<0x0A><0x0A>Finally,▁there's▁3D-printed▁food!▁This▁technology▁could▁allow▁for▁highly▁customized,▁nutrient-dense▁meals▁to▁be▁printed▁on▁demand,▁potentially▁reducing▁food▁waste▁and▁increasing▁efficiency▁in▁the▁food▁industry.<0x0A><0x0A>Isn't▁that▁exciting?▁I▁just▁love▁science!</s>[INST]▁Can▁you▁provide▁more▁details▁about▁how▁the▁tooth▁sensor▁created▁by▁Tufts▁University▁works?[/INST]▁Oh,▁I'd▁be▁happy▁to!▁The▁tooth▁sensor▁created▁by▁Tufts▁University▁is▁a▁tiny▁device▁that▁can▁be▁attached▁to▁a▁tooth▁like▁a▁dental▁brace▁or▁filling.▁It▁uses▁wireless▁communication▁to▁send▁information▁about▁a▁person's▁dietary▁habits▁to▁a▁mobile▁device▁or▁computer.<0x0A><0x0A>The▁sensor▁analyzes▁the▁chemical▁composition▁of▁food▁and▁drink▁as▁it▁passes▁through▁the▁mouth,▁detecting▁glucose▁and▁ethanol▁in▁real-time.▁It▁can▁also▁monitor▁other▁biomarkers▁such▁as▁salt,▁pH,▁and▁temperature,▁providing▁insights▁into▁a▁person's▁overall▁health▁and▁wellness.<0x0A><0x0A>The▁sensor▁is▁powered▁by▁a▁small▁battery▁and▁can▁be▁removed▁or▁replaced,▁making▁it▁easy▁for▁users▁to▁switch▁it▁out▁when▁needed.▁It▁is▁also▁designed▁to▁be▁safe▁for▁long-term▁use,▁with▁biocompatible▁materials▁that▁won't▁harm▁the▁tooth▁or▁surrounding▁tissue.<0x0A><0x0A>Isn't▁that▁amazing?▁I▁just▁love▁learning▁about▁new▁scientific▁developments!</s>[INST]▁Can▁you▁give▁me▁more▁information▁about▁the▁lab▁at▁Carnegie▁Mellon▁University▁that▁is▁developing▁an▁ingestible▁sensor▁to▁monitor▁gastrointestinal▁health?[/INST]▁Of▁course!▁The▁lab▁at▁Carnegie▁Mellon▁University▁that▁is▁developing▁an▁ingestible▁sensor▁to▁monitor▁gastrointestinal▁health▁is▁part▁of▁the▁university's▁College▁of▁Engineering.▁The▁lab▁is▁led▁by▁Dr.▁Christopher▁Bettinger,▁a▁professor▁of▁Materials▁Science▁and▁Biomedical▁Engineering,▁and▁focuses▁on▁the▁development▁of▁novel▁biomaterials▁and▁medical▁devices.<0x0A><0x0A>The▁ingestible▁sensor▁that▁the▁lab▁is▁working▁on▁is▁designed▁to▁monitor▁the▁digestive▁system▁in▁real-time,▁allowing▁doctors▁and▁patients▁to▁track▁the▁progression▁of▁certain▁diseases▁and▁conditions.▁The▁sensor▁is▁made▁from▁a▁variety▁of▁materials,▁including▁biocompatible▁polymers▁and▁electronics▁that▁can▁withstand▁the▁harsh▁environment▁of▁the▁digestive▁system.<0x0A><0x0A>The▁device▁works▁by▁transmitting▁data▁wirelessly▁to▁an▁external▁device,▁such▁as▁a▁smartphone▁or▁computer,▁which▁can▁be▁used▁to▁monitor▁the▁patient's▁health▁and▁alert▁medical▁professionals▁to▁potential▁issues.▁The▁sensor▁is▁also▁designed▁to▁break▁down▁and▁pass▁through▁the▁body▁after▁a▁certain▁amount▁of▁time,▁reducing▁the▁risk▁of▁any▁adverse▁effects▁or▁complications.<0x0A><0x0A>Isn't▁that▁fascinating?▁I▁just▁love▁science▁and▁technology!\n",
            "Total Token Count: 4925540\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "\n"
          ]
        }
      ],
      "source": [
        "# @title Count Tokens\n",
        "tokenizer = MistralTokenizer.v3()\n",
        "\n",
        "t_count = 0\n",
        "from tqdm import tqdm\n",
        "\n",
        "for diag in tqdm(generated_dialogs):\n",
        "    try:\n",
        "        tokenized = tokenizer.encode_chat_completion(\n",
        "            ChatCompletionRequest(\n",
        "                messages=[\n",
        "                    (\n",
        "                        UserMessage(content=m[\"content\"])\n",
        "                        if m[\"role\"] == \"user\"\n",
        "                        else AssistantMessage(content=m[\"content\"])\n",
        "                    )\n",
        "                    for m in diag[\"messages\"][:-1]\n",
        "                ]\n",
        "                + [AssistantMessage(content=diag[\"messages\"][-1][\"content\"], prefix=True)],\n",
        "            )\n",
        "        )\n",
        "        tokens, text = tokenized.tokens, tokenized.text\n",
        "    except Exception as e:\n",
        "        print(diag)\n",
        "        raise e\n",
        "\n",
        "    t_count += len(tokens)\n",
        "\n",
        "print(\"\\nExample:\", text)\n",
        "print(\"Total Token Count:\", t_count)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iZTnF_e-Wzax"
      },
      "source": [
        "5m tokens approximately! This should be ennough for a quick fine tunning using our API!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "j-ifKpO9f0Gh"
      },
      "source": [
        "## Finetuning"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TsGiUsFaWCew"
      },
      "source": [
        "Data gen done, we can finally fine tune our model with it! For this we need to first convert the list of messages into a json file in the proper format, since we already got rid of most issues on the generation step we can easily save the files like this:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "background_save": true
        },
        "id": "12n7PqpLf1sZ"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "n = int(len(generated_dialogs) * 0.96) # 4% of eval data\n",
        "train_list = random.sample(generated_dialogs, n)\n",
        "eval_list = [d for d in generated_dialogs if d not in train_list]\n",
        "\n",
        "with open(\"synthetic_chunk_train.jsonl\", \"w\") as f:\n",
        "    for item in train_list:\n",
        "        f.write(json.dumps(item) + \"\\n\")\n",
        "with open(\"synthetic_chunk_eval.jsonl\", \"w\") as f:\n",
        "    for item in eval_list:\n",
        "        f.write(json.dumps(item) + \"\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UA-4sso_wQtZ"
      },
      "source": [
        "Now that is saved, we can fine tune our model.  \n",
        "First let's send our files with the training and evaluation datasets to Mistral."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "background_save": true
        },
        "id": "TK79ti0zJBnQ"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "client = MistralClient(api_key=api_key)\n",
        "\n",
        "with open(\"synthetic_chunk_train.jsonl\", \"rb\") as f:\n",
        "    ultrachat_chunk_train = client.files.create(file=(\"synthetic_chunk_train.jsonl\", f))\n",
        "with open(\"synthetic_chunk_eval.jsonl\", \"rb\") as f:\n",
        "    ultrachat_chunk_eval = client.files.create(file=(\"synthetic_chunk_eval.jsonl\", f))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MytiKtGdwWrK"
      },
      "source": [
        "Now that our data is ready, we can start the fine tuning process.  \n",
        "To decide the number of steps, we can approximate the number of epochs desired with a simple formula.  \n",
        "For this fine tuning we will go with 3 epochs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "background_save": true
        },
        "id": "ohHEM-H_qbmv",
        "outputId": "98652386-cfe7-42ee-a66f-85af0d91a84b"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "File Size: 21.27256 mb\n",
            "Training steps: 63\n"
          ]
        }
      ],
      "source": [
        "approximate_epochs = 3  # here we decided to go for around 3 epochs, we can aproximate the amount of training steps with the following formula\n",
        "\n",
        "def get_size_in_mb(file_path: str) -> float:\n",
        "    file_size_bytes = os.path.getsize(file_path)\n",
        "    file_size_mb = file_size_bytes / (1000 * 1000)\n",
        "    return file_size_mb\n",
        "\n",
        "size_file = get_size_in_mb(\"synthetic_chunk_train.jsonl\")\n",
        "print(\"File Size:\", size_file, \"mb\")\n",
        "training_steps = int(approximate_epochs * size_file)\n",
        "print(\"Training steps:\", training_steps)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Q2C1uyqPwpK_"
      },
      "source": [
        "It's finally time, let's create our job and start the fine tuning of `open-mistral-7b` with our generated data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "background_save": true
        },
        "id": "B2anR8pFJJno",
        "outputId": "119485f7-2ad3-47ee-ffe6-37276608144e"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='QUEUED' job_type='FT' created_at=1719231392 modified_at=1719231392 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[]\n"
          ]
        }
      ],
      "source": [
        "from mistralai.models.jobs import TrainingParameters\n",
        "\n",
        "created_jobs = client.jobs.create(\n",
        "    model=\"open-mistral-7b\",\n",
        "    training_files=[ultrachat_chunk_train.id],\n",
        "    validation_files=[ultrachat_chunk_eval.id],\n",
        "    hyperparameters=TrainingParameters(\n",
        "        training_steps=training_steps,\n",
        "        learning_rate=0.0001,\n",
        "    ),\n",
        ")\n",
        "print(created_jobs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "y4Z2cDXqwzpK"
      },
      "source": [
        "Now that the job is created, let's keep track of the process with a simple loop so we can see the progress."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "fTzXc2bJJi3h",
        "outputId": "75781209-9200-4d17-c5cc-ca432a3956d2"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='QUEUED' job_type='FT' created_at=1719231392 modified_at=1719231392 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[] estimated_start_time=None\n",
            "Job is QUEUED, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.787556, valid_loss=0.784332, valid_mean_token_accuracy=1.722295), step_number=60, created_at=1719231659), Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.787556, valid_loss=0.784332, valid_mean_token_accuracy=1.722295), step_number=60, created_at=1719231659), Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model=None model='open-mistral-7b' status='RUNNING' job_type='FT' created_at=1719231392 modified_at=1719231395 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.787556, valid_loss=0.784332, valid_mean_token_accuracy=1.722295), step_number=60, created_at=1719231659), Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is RUNNING, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model='ft:open-mistral-7b:9fda217c:20240624:7a65d35f' model='open-mistral-7b' status='SUCCESS' job_type='FT' created_at=1719231392 modified_at=1719231693 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'SUCCESS'}, created_at=1719231693), Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.787556, valid_loss=0.784332, valid_mean_token_accuracy=1.722295), step_number=60, created_at=1719231659), Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n",
            "Job is SUCCESS, waiting 10 seconds\n",
            "id='7a65d35f-fd50-4e07-a90a-498f773c7bda' hyperparameters=TrainingParameters(training_steps=63, learning_rate=0.0001) fine_tuned_model='ft:open-mistral-7b:9fda217c:20240624:7a65d35f' model='open-mistral-7b' status='SUCCESS' job_type='FT' created_at=1719231392 modified_at=1719231693 training_files=['404eedd6-cf99-4372-9d46-aa661f2f0021'] validation_files=['954aa9f1-a8c9-4950-a312-9aecf389b757'] object='job' integrations=[] events=[Event(name='status-updated', data={'status': 'SUCCESS'}, created_at=1719231693), Event(name='status-updated', data={'status': 'RUNNING'}, created_at=1719231395), Event(name='status-updated', data={'status': 'QUEUED'}, created_at=1719231392)] checkpoints=[Checkpoint(metrics=Metric(train_loss=0.787556, valid_loss=0.784332, valid_mean_token_accuracy=1.722295), step_number=60, created_at=1719231659), Checkpoint(metrics=Metric(train_loss=0.791721, valid_loss=0.785677, valid_mean_token_accuracy=1.723901), step_number=48, created_at=1719231606), Checkpoint(metrics=Metric(train_loss=0.791209, valid_loss=0.790621, valid_mean_token_accuracy=1.729819), step_number=36, created_at=1719231561), Checkpoint(metrics=Metric(train_loss=0.818583, valid_loss=0.807563, valid_mean_token_accuracy=1.750253), step_number=24, created_at=1719231508), Checkpoint(metrics=Metric(train_loss=0.926104, valid_loss=0.863749, valid_mean_token_accuracy=1.819761), step_number=12, created_at=1719231462)] estimated_start_time=None\n"
          ]
        }
      ],
      "source": [
        "import time\n",
        "\n",
        "retrieved_job = client.jobs.retrieve(created_jobs.id)\n",
        "while retrieved_job.status in [\"RUNNING\", \"QUEUED\"]:\n",
        "    retrieved_job = client.jobs.retrieve(created_jobs.id)\n",
        "    print(retrieved_job)\n",
        "    print(f\"Job is {retrieved_job.status}, waiting 10 seconds\")\n",
        "    time.sleep(10)\n",
        "print(retrieved_job)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XYt4H2pBw8d5"
      },
      "source": [
        "Finished!! We can now freely test our new model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        },
        "id": "9zG8TearJMcW",
        "outputId": "0dd54aca-7380-4ce5-fd10-129682ee82ee"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "\"Oh, no, no, no! Ghosts are not my favorite! They scare me! I prefer science and math, not ghosts or anything supernatural. But don't worry, I'm still here to help and motivated! Let's focus on something more fun, like science or math, okay?\""
            ]
          },
          "execution_count": 21,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from mistralai.models.chat_completion import ChatMessage\n",
        "\n",
        "chat_response = client.chat(\n",
        "    model=retrieved_job.fine_tuned_model,\n",
        "    messages=[ChatMessage(role=\"user\", content=\"Do you like ghosts?\")],\n",
        "    max_tokens=256,\n",
        ")\n",
        "\n",
        "chat_response.choices[0].message.content"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aYKH8mVeDafy"
      },
      "source": [
        "Meanwhile the original `open-mistral-7b` model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        },
        "id": "iM2ZQdoODZ3F",
        "outputId": "e339bee1-c8b6-4a6e-b64f-ba69e201bfa0"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "\"I don't have feelings or personal preferences. I am a model designed to provide responses to your questions and requests. However, I can tell you that the concept of ghosts is a popular topic in many cultures and forms the basis for numerous stories, movies, and legends. If you have any questions about ghosts or would like to discuss them, feel free to ask!\""
            ]
          },
          "execution_count": 22,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "chat_response = client.chat(\n",
        "    model=\"open-mistral-7b\",\n",
        "    messages=[ChatMessage(role=\"user\", content=\"Do you like ghosts?\")],\n",
        "    max_tokens=256,\n",
        ")\n",
        "chat_response.choices[0].message.content"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OP-TpLDSrQm2"
      },
      "source": [
        "The total cost for generating and training this model was approximately $50 with `mistral-small-latest` and `open-mistral-7b`, for production we recommend using `mistral-large-latest` and `mistral-small-latest` but the cost will be higher."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rJDSFSUFimLP"
      },
      "source": [
        "This was a simplified and straightforward approach to data generation! However, it's important to note that different use cases may require more intricate pipelines for data generation, often involving multiple calls, collaborating agents, and external sources for data extraction."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}