evaluation

Details

File: mistral/evaluation/evaluation.ipynb
Type: Jupyter Notebook
Use Cases: Evaluation
Content

Notebook content (JSON format):
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2a737870-49c9-4285-b8a5-2d4e8b5e5db4",
   "metadata": {
    "id": "2a737870-49c9-4285-b8a5-2d4e8b5e5db4"
   },
   "source": [
    "# How to evaluate LLMs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9746b03e-6b71-4927-ab60-1c9056711510",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install mistralai evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f032c997-6764-4d46-8ca5-625b33d41640",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Type your API Key··········\n"
     ]
    }
   ],
   "source": [
    "from mistralai import Mistral\n",
    "from getpass import getpass\n",
    "\n",
    "api_key= getpass(\"Type your API Key\")\n",
    "\n",
    "client = Mistral(api_key = api_key)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f501021-df7b-4cc7-b0d7-208d4644c85d",
   "metadata": {
    "id": "5f501021-df7b-4cc7-b0d7-208d4644c85d"
   },
   "source": [
    "## Example 1:  Information extraction benchmark with accuracy\n",
    "\n",
    "\n",
    "### Evaluation data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d5501a3b-e070-4779-8534-da29aa0ef034",
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = {\n",
    "    \"Johnson\": {\n",
    "        \"medical_notes\": \"A 60-year-old male patient, Mr. Johnson, presented with symptoms of increased thirst, frequent urination, fatigue, and unexplained weight loss. Upon evaluation, he was diagnosed with diabetes, confirmed by elevated blood sugar levels. Mr. Johnson's weight is 210 lbs. He has been prescribed Metformin to be taken twice daily with meals. It was noted during the consultation that the patient is a current smoker. \",\n",
    "        \"golden_answer\": {\n",
    "            \"age\": 60,\n",
    "            \"gender\": \"male\",\n",
    "            \"diagnosis\": \"diabetes\",\n",
    "            \"weight\": 210,\n",
    "            \"smoking\": \"yes\",\n",
    "        },\n",
    "    },\n",
    "    \"Smith\": {\n",
    "        \"medical_notes\": \"Mr. Smith, a 55-year-old male patient, presented with severe joint pain and stiffness in his knees and hands, along with swelling and limited range of motion. After a thorough examination and diagnostic tests, he was diagnosed with arthritis. It is important for Mr. Smith to maintain a healthy weight (currently at 150 lbs) and quit smoking, as these factors can exacerbate symptoms of arthritis and contribute to joint damage.\",\n",
    "        \"golden_answer\": {\n",
    "            \"age\": 55,\n",
    "            \"gender\": \"male\",\n",
    "            \"diagnosis\": \"arthritis\",\n",
    "            \"weight\": 150,\n",
    "            \"smoking\": \"yes\",\n",
    "        },\n",
    "    },\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4625b22-7de2-4c48-8ef1-7936ea222540",
   "metadata": {
    "id": "b4625b22-7de2-4c48-8ef1-7936ea222540"
   },
   "source": [
    "### How to evaluate?\n",
    "\n",
    "- Step 1: Define prompt template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b49c7dd5-1f23-41cc-b864-d2a7da563acb",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_mistral(user_message, model=\"mistral-large-latest\"):\n",
    "    messages = [{\"role\": \"user\", \"content\": user_message}]\n",
    "    chat_response = client.chat.complete(\n",
    "        model=model,\n",
    "        messages=messages,\n",
    "        response_format={\"type\": \"json_object\"},\n",
    "    )\n",
    "    return chat_response.choices[0].message.content\n",
    "\n",
    "\n",
    "# define prompt template\n",
    "prompt_template = \"\"\"\n",
    "Extract information from the following medical notes:\n",
    "{medical_notes}\n",
    "\n",
    "Return json format with the following JSON schema:\n",
    "\n",
    "{{\n",
    "        \"age\": {{\n",
    "            \"type\": \"integer\"\n",
    "        }},\n",
    "        \"gender\": {{\n",
    "            \"type\": \"string\",\n",
    "            \"enum\": [\"male\", \"female\", \"other\"]\n",
    "        }},\n",
    "        \"diagnosis\": {{\n",
    "            \"type\": \"string\",\n",
    "            \"enum\": [\"migraine\", \"diabetes\", \"arthritis\", \"acne\", \"common cold\"]\n",
    "        }},\n",
    "        \"weight\": {{\n",
    "            \"type\": \"integer\"\n",
    "        }},\n",
    "        \"smoking\": {{\n",
    "            \"type\": \"string\",\n",
    "            \"enum\": [\"yes\", \"no\"]\n",
    "        }},\n",
    "\n",
    "}}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5422a8e-feb7-48fe-81bc-9ccfe5a17193",
   "metadata": {
    "id": "b5422a8e-feb7-48fe-81bc-9ccfe5a17193"
   },
   "source": [
    "\n",
    "- Step 2: Define how we compare the model response with the golden answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "366608c4-c783-4325-8e18-cd3a668a103f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "\n",
    "def compare_json_objects(obj1, obj2):\n",
    "    total_fields = 0\n",
    "    identical_fields = 0\n",
    "    common_keys = set(obj1.keys()) & set(obj2.keys())\n",
    "    for key in common_keys:\n",
    "        identical_fields += obj1[key] == obj2[key]\n",
    "    percentage_identical = (identical_fields / max(len(obj1.keys()), 1)) * 100\n",
    "    return percentage_identical"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2714fcc-16e5-4bf6-9fe8-5387f806acf6",
   "metadata": {
    "id": "d2714fcc-16e5-4bf6-9fe8-5387f806acf6"
   },
   "source": [
    "- Step 3: Calculate accuracy rate across test cases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "04efe325-5a71-4dab-8521-989f9edd54dc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_rates = []\n",
    "\n",
    "# for each test case\n",
    "for name in prompts:\n",
    "\n",
    "    # define user message\n",
    "    user_message = prompt_template.format(medical_notes=prompts[name][\"medical_notes\"])\n",
    "\n",
    "    # run LLM\n",
    "    response = json.loads(run_mistral(user_message))\n",
    "\n",
    "    # calculate accuracy rate for this test case\n",
    "    accuracy_rates.append(\n",
    "        compare_json_objects(response, prompts[name][\"golden_answer\"])\n",
    "    )\n",
    "\n",
    "# calculate accuracy rate across test cases\n",
    "sum(accuracy_rates) / len(accuracy_rates)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92ea4e20-60a8-4ce7-b550-321123894988",
   "metadata": {
    "id": "92ea4e20-60a8-4ce7-b550-321123894988"
   },
   "source": [
    "## Example 2: evaluate code generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e8de41ed-1723-4a2d-97eb-8f467e024499",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_mistral(user_message, model=\"mistral-large-latest\"):\n",
    "    client = Mistral(api_key=api_key)\n",
    "    messages = [{\"role\":\"user\", \"content\": user_message}]\n",
    "    chat_response = client.chat.complete(\n",
    "        model=model,\n",
    "        messages=messages\n",
    "    )\n",
    "    return chat_response.choices[0].message.content\n",
    "\n",
    "# define prompt template\n",
    "python_prompts = {\n",
    "    \"sort_string\": {\n",
    "        \"prompt\": \"Write a python function to sort the given string.\",\n",
    "        \"test\": 'assert sort_string(\"data\") == \"aadt\"',\n",
    "    },\n",
    "    \"is_odd\": {\n",
    "        \"prompt\": \"Write a python function to check whether the given number is odd or not using bitwise operator.\",\n",
    "        \"test\": \"assert is_odd(5) == True\",\n",
    "    },\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1918c9f-6663-4a57-90c2-07b410f88315",
   "metadata": {
    "id": "d1918c9f-6663-4a57-90c2-07b410f88315"
   },
   "source": [
    "- Step 1: Define prompt template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c56ea1b3-5a2e-4668-bf4e-bf9c76b6f721",
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt_template = \"\"\"Write a Python function to execute the following task: {task}\n",
    "Return only valid Python code. Do not give any explanation.\n",
    "Never start with ```python.\n",
    "Always start with def {name}(.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edb6c550-d620-4432-9caf-a1b6fa422c18",
   "metadata": {
    "id": "edb6c550-d620-4432-9caf-a1b6fa422c18"
   },
   "source": [
    "- Step 2: Decide how we evaluate the code generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "de273b6d-0595-4ba2-99f1-688d8ecd1d47",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "897c2480adcd41ed81d6b5398d07ddec",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading builder script:   0%|          | 0.00/9.18k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ac09b9d3eba74539a5fb84bfd59f7f8f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading extra modules:   0%|          | 0.00/6.10k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from evaluate import load\n",
    "import os\n",
    "\n",
    "os.environ[\"HF_ALLOW_CODE_EVAL\"] = \"1\"\n",
    "code_eval = load(\"code_eval\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "0fa646d7-14e8-427a-9356-4ac424eb2918",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.\n",
      "  self.pid = os.fork()\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "({'pass@1': 1.0},\n",
       " defaultdict(list,\n",
       "             {0: [(0,\n",
       "                {'task_id': 0,\n",
       "                 'passed': True,\n",
       "                 'result': 'passed',\n",
       "                 'completion_id': 0})]}))"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "code_eval.compute(\n",
    "    references=[\"assert is_odd(5) == True\"],\n",
    "    predictions=[[\"def is_odd(n):\\n    return n & 1 != 0\"]],\n",
    "    k=[1],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af443ce8-e016-4a75-b2b3-94da848cd356",
   "metadata": {
    "id": "af443ce8-e016-4a75-b2b3-94da848cd356"
   },
   "source": [
    "- Step 3: Calculate accuracy rate across test cases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "685a9f88-52b9-4095-bf8c-f844d2049075",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'pass@1': 1.0}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "refs = []\n",
    "preds = []\n",
    "\n",
    "for name in python_prompts:\n",
    "\n",
    "    # define user message\n",
    "    user_message = prompt_template.format(\n",
    "        task=python_prompts[name][\"prompt\"], name=name\n",
    "    )\n",
    "\n",
    "    # run LLM\n",
    "    response = run_mistral(user_message)\n",
    "\n",
    "    refs.append(python_prompts[name][\"test\"])\n",
    "    preds.append([response])\n",
    "\n",
    "# evaluate code generation\n",
    "pass_at_1, results = code_eval.compute(references=refs, predictions=preds)\n",
    "\n",
    "pass_at_1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ca61dfc-1160-4f2c-9d0c-be3baed7764c",
   "metadata": {
    "id": "5ca61dfc-1160-4f2c-9d0c-be3baed7764c"
   },
   "source": [
    "# Example 3: evaluate summary generation with LLM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "3f5e70e3-fe8a-4796-a556-2b69277191df",
   "metadata": {},
   "outputs": [],
   "source": [
    "news = (\n",
    "    \"BRUSSELS (Reuters) - Theresa May looked despondent , with deep rings under her eyes, EU chief executive Jean-Claude Juncker told aides after dining with the British prime minister last week, a German newspaper said on Sunday. The report by a Frankfurter Allgemeine Zeitung correspondent whose leaked account of a Juncker-May dinner in April caused upset in London, said Juncker thought her marked by battles over Brexit with her own Conservative ministers as she asked for EU help to create more room for maneuver at home. No immediate comment was available from Juncker s office, which has a policy of not commenting on reports of meetings. The FAZ said May, who flew in for a hastily announced dinner in Brussels with the European Commission president last Monday ahead of an EU summit, seemed to Juncker anxious, despondent and disheartened , a woman who trusts hardly anyone but is also not ready for a clear-out to free herself . As she later did over dinner on Thursday with fellow EU leaders, May asked for help to overcome British divisions. She indicated that back home friend and foe are at her back plotting to bring her down, the paper said. May said she had no room left to maneuver. The Europeans have to create it for her. May s face and appearance spoke volumes, Juncker later told his colleagues, the FAZ added. She has deep rings under her eyes. She looks like someone who can t sleep a wink. She smiles for the cameras, it went on, but it looks forced , unlike in the past, when she could shake with laughter. Now she needs all her strength not to lose her poise. As with the April dinner at 10 Downing Street, when the FAZ reported that Juncker thought May in another galaxy in terms of Brexit expectations, both sides issued statements after last week s meeting saying talks were constructive and friendly . They said they agreed negotiations should be accelerated . May dismissed the dinner leak six months ago as Brussels gossip , though officials on both sides said the report in the FAZ did little to foster an atmosphere of trust which they agree will be important to reach a deal. German Chancellor Angela Merkel was also reported to have been irritated by that leak. Although the summit on Thursday and Friday rejected May s call for an immediate start to talks on the future relationship, leaders made a gesture to speed up the process and voiced hopes of opening a new phase in December. Some said they understood May s difficulties in forging consensus in London.\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7c7dcb1-aebe-417b-8046-90e7bdc09e40",
   "metadata": {
    "id": "a7c7dcb1-aebe-417b-8046-90e7bdc09e40"
   },
   "source": [
    "- Step 1: Generate summary for the given news"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "8e83c322-445f-4608-9443-cbf9cdbe487c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_mistral(user_message, model=\"open-mistral-7b\", is_json=False):\n",
    "    client = Mistral(api_key=api_key)\n",
    "    messages = [{\"role\":\"user\", \"content\":user_message}]\n",
    "\n",
    "    if is_json:\n",
    "        chat_response = client.chat.complete(\n",
    "            model=model, messages=messages, response_format={\"type\": \"json_object\"}\n",
    "        )\n",
    "    else:\n",
    "        chat_response = client.chat.complete(model=model, messages=messages)\n",
    "\n",
    "    return chat_response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "ef46d23d-392b-4a54-822a-13ede737b3cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "summary_prompt = f\"\"\"\n",
    "Summarize the following news. Write the summary based on the following criteria: relevancy and readability. Consider the sources cited, the quality of evidence provided, and any potential biases or misinformation.\n",
    "\n",
    "## News:\n",
    "{news}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "ea400280-2e74-461c-8d40-7bcb5ea1ad2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = run_mistral(summary_prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2368cef0-93f9-4e14-8cb0-ce27205a5420",
   "metadata": {
    "id": "2368cef0-93f9-4e14-8cb0-ce27205a5420"
   },
   "source": [
    "- Step 2: Define evaluation metrics and rubrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "441a7ed7-0ad3-4d83-aaae-593711e140c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_rubrics = [\n",
    "    {\n",
    "        \"metric\": \"relevancy\",\n",
    "        \"rubrics\": \"\"\"\n",
    "        Score 1: The summary is not relevant to the original text.\n",
    "        Score 2: The summary is somewhat relevant to the original text, but has significant flaws.\n",
    "        Score 3: The summary is mostly relevant to the original text, and effectively conveys its main ideas and arguments.\n",
    "        Score 4: The summary is highly relevant to the original text, and provides additional value or insight.\n",
    "        \"\"\",\n",
    "    },\n",
    "    {\n",
    "        \"metric\": \"readability\",\n",
    "        \"rubrics\": \"\"\"\n",
    "        Score 1: The summary is difficult to read and understand.\n",
    "        Score 2: The summary is somewhat readable, but has significant flaws.\n",
    "        Score 3: The summary is mostly readable and easy to understand.\n",
    "        Score 4: The summary is highly readable and engaging.\n",
    "        \"\"\",\n",
    "    },\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed068aac-73eb-461d-b441-ffc9b98393dd",
   "metadata": {
    "id": "ed068aac-73eb-461d-b441-ffc9b98393dd"
   },
   "source": [
    "- Step 3: Employ a more powerful LLM (e.g., Mistral Large) as a judge\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "4ad10367-c8ee-4267-962c-e7e9197b27e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "scoring_prompt = \"\"\"\n",
    "Please read the provided news article and its corresponding summary.\n",
    "Based on the specified evaluation metric and rubrics, assign an integer score between 1 and 4 to the summary.\n",
    "Then, return a JSON object with the metric as the key and the evaluation score as the value.\n",
    "\n",
    "# Evaluation metric:\n",
    "{metric}\n",
    "\n",
    "# Evaluation rubrics:\n",
    "{rubrics}\n",
    "\n",
    "# News article\n",
    "{news}\n",
    "\n",
    "# Summary\n",
    "{summary}\n",
    "\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "f39d0204-5b2d-4236-9da9-bb90ded5f8fe",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"relevancy\": 4}\n",
      "{\"readability\": 3}\n"
     ]
    }
   ],
   "source": [
    "for i in eval_rubrics:\n",
    "    eval_output = run_mistral(\n",
    "        scoring_prompt.format(\n",
    "            news=news, summary=summary, metric=i[\"metric\"], rubrics=i[\"rubrics\"]\n",
    "        ),\n",
    "        model=\"mistral-large-latest\",\n",
    "        is_json=True,\n",
    "    )\n",
    "    print(eval_output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "xnMFcXf_jL9L",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}