Text to SQL
Details
File: mistral/function_calling/text_to_SQL.ipynb
Type: Jupyter Notebook
Use Cases: Function calling, Database, SQL
Content
Notebook content:
{ "cells": [ { "cell_type": "markdown", "id": "7686a24e-8fc8-430c-b67a-5597fc25280a", "metadata": { "id": "7686a24e-8fc8-430c-b67a-5597fc25280a" }, "source": [ "# Text to SQL on multi-tables database\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "id": "808cde16", "metadata": { "id": "808cde16" }, "source": [ "In this cookbook we will show you how to :\n", "\n", "- Use the function calling capabilities of Mistral models\n", "- Build a text2SQL architecture that scales more efficiently than a naive approach where all schemas are integrally injected in the system prompt \n", "- Evaluate your system with Mistral models and leveraging the DeepEval framework" ] }, { "cell_type": "markdown", "id": "250338e1-703c-4460-bacb-f78314c51ee6", "metadata": { "id": "250338e1-703c-4460-bacb-f78314c51ee6" }, "source": [ "# Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "1a68d946-aa0e-48f7-8a38-ab00cc0e7d6c", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1a68d946-aa0e-48f7-8a38-ab00cc0e7d6c", "outputId": "d899f28f-c58d-494c-8257-cfd3be1a2121" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting mistralai\n", " Downloading mistralai-1.3.1-py3-none-any.whl.metadata (28 kB)\n", "Requirement already satisfied: langchain in /usr/local/lib/python3.11/dist-packages (0.3.14)\n", "Collecting deepeval\n", " Downloading deepeval-2.1.9-py3-none-any.whl.metadata (13 kB)\n", "Requirement already satisfied: eval-type-backport>=0.2.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (0.2.2)\n", "Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (0.28.1)\n", "Collecting jsonpath-python>=1.0.6 (from mistralai)\n", " Downloading jsonpath_python-1.0.6-py3-none-any.whl.metadata (12 kB)\n", "Requirement already satisfied: pydantic>=2.9.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.10.5)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.8.2)\n", "Collecting typing-inspect>=0.9.0 (from mistralai)\n", " Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)\n", "Requirement already satisfied: PyYAML>=5.3 in /usr/local/lib/python3.11/dist-packages (from langchain) (6.0.2)\n", "Requirement already satisfied: SQLAlchemy<3,>=1.4 in /usr/local/lib/python3.11/dist-packages (from langchain) (2.0.37)\n", "Requirement already satisfied: aiohttp<4.0.0,>=3.8.3 in /usr/local/lib/python3.11/dist-packages (from langchain) (3.11.11)\n", "Requirement already satisfied: langchain-core<0.4.0,>=0.3.29 in /usr/local/lib/python3.11/dist-packages (from langchain) (0.3.29)\n", "Requirement already satisfied: langchain-text-splitters<0.4.0,>=0.3.3 in /usr/local/lib/python3.11/dist-packages (from langchain) (0.3.5)\n", "Requirement already satisfied: langsmith<0.3,>=0.1.17 in /usr/local/lib/python3.11/dist-packages (from langchain) (0.2.10)\n", "Requirement already satisfied: numpy<2,>=1.22.4 in /usr/local/lib/python3.11/dist-packages (from langchain) (1.26.4)\n", "Requirement already satisfied: requests<3,>=2 in /usr/local/lib/python3.11/dist-packages (from langchain) (2.32.3)\n", "Requirement already satisfied: tenacity!=8.4.0,<10,>=8.1.0 in /usr/local/lib/python3.11/dist-packages (from langchain) (9.0.0)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from deepeval) (4.67.1)\n", "Requirement already satisfied: pytest in /usr/local/lib/python3.11/dist-packages (from 
deepeval) (8.3.4)\n", "Requirement already satisfied: tabulate in /usr/local/lib/python3.11/dist-packages (from deepeval) (0.9.0)\n", "Requirement already satisfied: typer in /usr/local/lib/python3.11/dist-packages (from deepeval) (0.15.1)\n", "Requirement already satisfied: rich in /usr/local/lib/python3.11/dist-packages (from deepeval) (13.9.4)\n", "Requirement already satisfied: protobuf in /usr/local/lib/python3.11/dist-packages (from deepeval) (4.25.5)\n", "Requirement already satisfied: sentry-sdk in /usr/local/lib/python3.11/dist-packages (from deepeval) (2.19.2)\n", "Collecting pytest-repeat (from deepeval)\n", " Downloading pytest_repeat-0.9.3-py3-none-any.whl.metadata (4.9 kB)\n", "Collecting pytest-xdist (from deepeval)\n", " Downloading pytest_xdist-3.6.1-py3-none-any.whl.metadata (4.3 kB)\n", "Collecting portalocker (from deepeval)\n", " Downloading portalocker-3.1.1-py3-none-any.whl.metadata (8.6 kB)\n", "Collecting langchain_openai (from deepeval)\n", " Downloading langchain_openai-0.3.1-py3-none-any.whl.metadata (2.7 kB)\n", "Collecting langchain-community (from deepeval)\n", " Downloading langchain_community-0.3.15-py3-none-any.whl.metadata (2.9 kB)\n", "Collecting docx2txt~=0.8 (from deepeval)\n", " Downloading docx2txt-0.8.tar.gz (2.8 kB)\n", " Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n", "Requirement already satisfied: importlib-metadata>=6.0.2 in /usr/local/lib/python3.11/dist-packages (from deepeval) (8.5.0)\n", "Requirement already satisfied: opentelemetry-api<2.0.0,>=1.24.0 in /usr/local/lib/python3.11/dist-packages (from deepeval) (1.29.0)\n", "Requirement already satisfied: opentelemetry-sdk<2.0.0,>=1.24.0 in /usr/local/lib/python3.11/dist-packages (from deepeval) (1.29.0)\n", "Collecting opentelemetry-exporter-otlp-proto-grpc<2.0.0,>=1.24.0 (from deepeval)\n", " Downloading opentelemetry_exporter_otlp_proto_grpc-1.29.0-py3-none-any.whl.metadata (2.2 kB)\n", "Collecting grpcio==1.67.1 (from deepeval)\n", " Downloading grpcio-1.67.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)\n", "Requirement already satisfied: nest-asyncio in /usr/local/lib/python3.11/dist-packages (from deepeval) (1.6.0)\n", "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (2.4.4)\n", "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.3.2)\n", "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (24.3.0)\n", "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.5.0)\n", "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (6.1.0)\n", "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (0.2.1)\n", "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.18.3)\n", "Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (3.7.1)\n", "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (2024.12.14)\n", "Requirement already satisfied: httpcore==1.* in 
/usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (1.0.7)\n", "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (3.10)\n", "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.11/dist-packages (from httpcore==1.*->httpx>=0.27.0->mistralai) (0.14.0)\n", "Requirement already satisfied: zipp>=3.20 in /usr/local/lib/python3.11/dist-packages (from importlib-metadata>=6.0.2->deepeval) (3.21.0)\n", "Requirement already satisfied: jsonpatch<2.0,>=1.33 in /usr/local/lib/python3.11/dist-packages (from langchain-core<0.4.0,>=0.3.29->langchain) (1.33)\n", "Requirement already satisfied: packaging<25,>=23.2 in /usr/local/lib/python3.11/dist-packages (from langchain-core<0.4.0,>=0.3.29->langchain) (24.2)\n", "Requirement already satisfied: typing-extensions>=4.7 in /usr/local/lib/python3.11/dist-packages (from langchain-core<0.4.0,>=0.3.29->langchain) (4.12.2)\n", "Requirement already satisfied: orjson<4.0.0,>=3.9.14 in /usr/local/lib/python3.11/dist-packages (from langsmith<0.3,>=0.1.17->langchain) (3.10.14)\n", "Requirement already satisfied: requests-toolbelt<2.0.0,>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from langsmith<0.3,>=0.1.17->langchain) (1.0.0)\n", "Requirement already satisfied: deprecated>=1.2.6 in /usr/local/lib/python3.11/dist-packages (from opentelemetry-api<2.0.0,>=1.24.0->deepeval) (1.2.15)\n", "Requirement already satisfied: googleapis-common-protos~=1.52 in /usr/local/lib/python3.11/dist-packages (from opentelemetry-exporter-otlp-proto-grpc<2.0.0,>=1.24.0->deepeval) (1.66.0)\n", "Collecting opentelemetry-exporter-otlp-proto-common==1.29.0 (from opentelemetry-exporter-otlp-proto-grpc<2.0.0,>=1.24.0->deepeval)\n", " Downloading opentelemetry_exporter_otlp_proto_common-1.29.0-py3-none-any.whl.metadata (1.8 kB)\n", "Collecting opentelemetry-proto==1.29.0 (from opentelemetry-exporter-otlp-proto-grpc<2.0.0,>=1.24.0->deepeval)\n", " Downloading opentelemetry_proto-1.29.0-py3-none-any.whl.metadata (2.3 kB)\n", "Collecting protobuf (from deepeval)\n", " Downloading protobuf-5.29.3-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)\n", "Requirement already satisfied: opentelemetry-semantic-conventions==0.50b0 in /usr/local/lib/python3.11/dist-packages (from opentelemetry-sdk<2.0.0,>=1.24.0->deepeval) (0.50b0)\n", "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (0.7.0)\n", "Requirement already satisfied: pydantic-core==2.27.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (2.27.2)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->mistralai) (1.17.0)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2->langchain) (3.4.1)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2->langchain) (2.3.0)\n", "Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.11/dist-packages (from SQLAlchemy<3,>=1.4->langchain) (3.1.1)\n", "Collecting mypy-extensions>=0.3.0 (from typing-inspect>=0.9.0->mistralai)\n", " Downloading mypy_extensions-1.0.0-py3-none-any.whl.metadata (1.1 kB)\n", "Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community->deepeval)\n", " Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)\n", "Collecting 
httpx-sse<0.5.0,>=0.4.0 (from langchain-community->deepeval)\n", " Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)\n", "Collecting langchain\n", " Downloading langchain-0.3.15-py3-none-any.whl.metadata (7.1 kB)\n", "Collecting langchain-core<0.4.0,>=0.3.29 (from langchain)\n", " Downloading langchain_core-0.3.31-py3-none-any.whl.metadata (6.3 kB)\n", "Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community->deepeval)\n", " Downloading pydantic_settings-2.7.1-py3-none-any.whl.metadata (3.5 kB)\n", "Requirement already satisfied: openai<2.0.0,>=1.58.1 in /usr/local/lib/python3.11/dist-packages (from langchain_openai->deepeval) (1.59.6)\n", "Collecting tiktoken<1,>=0.7 (from langchain_openai->deepeval)\n", " Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\n", "Requirement already satisfied: iniconfig in /usr/local/lib/python3.11/dist-packages (from pytest->deepeval) (2.0.0)\n", "Requirement already satisfied: pluggy<2,>=1.5 in /usr/local/lib/python3.11/dist-packages (from pytest->deepeval) (1.5.0)\n", "Collecting execnet>=2.1 (from pytest-xdist->deepeval)\n", " Downloading execnet-2.1.1-py3-none-any.whl.metadata (2.9 kB)\n", "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.11/dist-packages (from rich->deepeval) (3.0.0)\n", "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from rich->deepeval) (2.18.0)\n", "Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.11/dist-packages (from typer->deepeval) (8.1.8)\n", "Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.11/dist-packages (from typer->deepeval) (1.5.4)\n", "Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community->deepeval)\n", " Downloading marshmallow-3.25.1-py3-none-any.whl.metadata (7.3 kB)\n", "Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.11/dist-packages (from deprecated>=1.2.6->opentelemetry-api<2.0.0,>=1.24.0->deepeval) (1.17.0)\n", "Requirement already satisfied: jsonpointer>=1.9 in /usr/local/lib/python3.11/dist-packages (from jsonpatch<2.0,>=1.33->langchain-core<0.4.0,>=0.3.29->langchain) (3.0.0)\n", "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.11/dist-packages (from markdown-it-py>=2.2.0->rich->deepeval) (0.1.2)\n", "Requirement already satisfied: distro<2,>=1.7.0 in /usr/local/lib/python3.11/dist-packages (from openai<2.0.0,>=1.58.1->langchain_openai->deepeval) (1.9.0)\n", "Requirement already satisfied: jiter<1,>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from openai<2.0.0,>=1.58.1->langchain_openai->deepeval) (0.8.2)\n", "Requirement already satisfied: sniffio in /usr/local/lib/python3.11/dist-packages (from openai<2.0.0,>=1.58.1->langchain_openai->deepeval) (1.3.1)\n", "Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community->deepeval)\n", " Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)\n", "Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.11/dist-packages (from tiktoken<1,>=0.7->langchain_openai->deepeval) (2024.11.6)\n", "Downloading mistralai-1.3.1-py3-none-any.whl (261 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.3/261.3 kB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading deepeval-2.1.9-py3-none-any.whl (588 kB)\n", "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m588.2/588.2 kB\u001b[0m \u001b[31m16.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading grpcio-1.67.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m38.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading jsonpath_python-1.0.6-py3-none-any.whl (7.6 kB)\n", "Downloading opentelemetry_exporter_otlp_proto_grpc-1.29.0-py3-none-any.whl (18 kB)\n", "Downloading opentelemetry_exporter_otlp_proto_common-1.29.0-py3-none-any.whl (18 kB)\n", "Downloading opentelemetry_proto-1.29.0-py3-none-any.whl (55 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.8/55.8 kB\u001b[0m \u001b[31m2.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading protobuf-5.29.3-cp38-abi3-manylinux2014_x86_64.whl (319 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m319.7/319.7 kB\u001b[0m \u001b[31m18.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)\n", "Downloading langchain_community-0.3.15-py3-none-any.whl (2.5 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.5/2.5 MB\u001b[0m \u001b[31m30.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading langchain-0.3.15-py3-none-any.whl (1.0 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.0/1.0 MB\u001b[0m \u001b[31m31.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading langchain_core-0.3.31-py3-none-any.whl (412 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m412.2/412.2 kB\u001b[0m \u001b[31m18.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading langchain_openai-0.3.1-py3-none-any.whl (54 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m54.3/54.3 kB\u001b[0m \u001b[31m3.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading portalocker-3.1.1-py3-none-any.whl (19 kB)\n", "Downloading pytest_repeat-0.9.3-py3-none-any.whl (4.2 kB)\n", "Downloading pytest_xdist-3.6.1-py3-none-any.whl (46 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.1/46.1 kB\u001b[0m \u001b[31m2.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB)\n", "Downloading execnet-2.1.1-py3-none-any.whl (40 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.6/40.6 kB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)\n", "Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)\n", "Downloading pydantic_settings-2.7.1-py3-none-any.whl (29 kB)\n", "Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.2/1.2 MB\u001b[0m \u001b[31m29.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading marshmallow-3.25.1-py3-none-any.whl (49 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.6/49.6 kB\u001b[0m \u001b[31m2.6 MB/s\u001b[0m eta 
\u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)\n", "Building wheels for collected packages: docx2txt\n", " Building wheel for docx2txt (setup.py) ... \u001b[?25l\u001b[?25hdone\n", " Created wheel for docx2txt: filename=docx2txt-0.8-py3-none-any.whl size=3960 sha256=7b726a0cbe5588eb27a3c2b8518f75ee7ca8ac8380573c834740aee90d33bdcd\n", " Stored in directory: /root/.cache/pip/wheels/0f/0e/7a/3094a4ceefe657bff7e12dd9592a9d5b6487ef4338ace0afa6\n", "Successfully built docx2txt\n", "Installing collected packages: docx2txt, python-dotenv, protobuf, portalocker, mypy-extensions, marshmallow, jsonpath-python, httpx-sse, grpcio, execnet, typing-inspect, tiktoken, pytest-xdist, pytest-repeat, opentelemetry-proto, pydantic-settings, opentelemetry-exporter-otlp-proto-common, mistralai, dataclasses-json, langchain-core, opentelemetry-exporter-otlp-proto-grpc, langchain_openai, langchain, langchain-community, deepeval\n", " Attempting uninstall: protobuf\n", " Found existing installation: protobuf 4.25.5\n", " Uninstalling protobuf-4.25.5:\n", " Successfully uninstalled protobuf-4.25.5\n", " Attempting uninstall: grpcio\n", " Found existing installation: grpcio 1.69.0\n", " Uninstalling grpcio-1.69.0:\n", " Successfully uninstalled grpcio-1.69.0\n", " Attempting uninstall: langchain-core\n", " Found existing installation: langchain-core 0.3.29\n", " Uninstalling langchain-core-0.3.29:\n", " Successfully uninstalled langchain-core-0.3.29\n", " Attempting uninstall: langchain\n", " Found existing installation: langchain 0.3.14\n", " Uninstalling langchain-0.3.14:\n", " Successfully uninstalled langchain-0.3.14\n", "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
```python
from mistralai import Mistral
from getpass import getpass

# To interact with the SQL database
from langchain_community.utilities import SQLDatabase

# To evaluate text-to-SQL performance
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import GEval

import json
```

# Load the Chinook database

"Chinook is a sample database available for SQL Server, Oracle, MySQL, etc. It can be created by running a single SQL script. Chinook database is an alternative to the Northwind database, being ideal for demos and testing ORM tools targeting single and multiple database servers."

To run this notebook you will need to download the Chinook database. You will find more information about it in the [chinook-database repository](https://github.com/lerocha/chinook-database).

To create `Chinook.db` in the same directory as this notebook you have several options:

- Download and build the database from the command line:

```
curl -s https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql | sqlite3 Chinook.db
```

- Alternatively, run the script `https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql` yourself: first save it to a directory on your computer, then create a database called Chinook with `sqlite3 Chinook.db`, and finally run the script with the command `.read Chinook_Sqlite.sql`.
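If neither `curl` nor the `sqlite3` command-line tool is available on your machine, the same database can be built from Python alone. Below is a minimal sketch using only the standard library; it assumes network access to the raw GitHub URL above and is not part of the original notebook.

```python
import sqlite3
import urllib.request

SQL_URL = (
    "https://raw.githubusercontent.com/lerocha/chinook-database/"
    "master/ChinookDatabase/DataSources/Chinook_Sqlite.sql"
)

# Download the SQL script and execute it against a fresh Chinook.db file.
script = urllib.request.urlopen(SQL_URL).read().decode("utf-8-sig")
with sqlite3.connect("Chinook.db") as connection:
    connection.executescript(script)
```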
Done\n", "Suggested packages:\n", " sqlite3-doc\n", "The following NEW packages will be installed:\n", " sqlite3\n", "0 upgraded, 1 newly installed, 0 to remove and 49 not upgraded.\n", "Need to get 768 kB of archives.\n", "After this operation, 1,873 kB of additional disk space will be used.\n", "Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 sqlite3 amd64 3.37.2-2ubuntu0.3 [768 kB]\n", "Fetched 768 kB in 1s (799 kB/s)\n", "debconf: unable to initialize frontend: Dialog\n", "debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)\n", "debconf: falling back to frontend: Readline\n", "debconf: unable to initialize frontend: Readline\n", "debconf: (This frontend requires a controlling tty.)\n", "debconf: falling back to frontend: Teletype\n", "dpkg-preconfigure: unable to re-open stdin: \n", "Selecting previously unselected package sqlite3.\n", "(Reading database ... 124565 files and directories currently installed.)\n", "Preparing to unpack .../sqlite3_3.37.2-2ubuntu0.3_amd64.deb ...\n", "Unpacking sqlite3 (3.37.2-2ubuntu0.3) ...\n", "Setting up sqlite3 (3.37.2-2ubuntu0.3) ...\n", "Processing triggers for man-db (2.10.2-1) ...\n" ] } ], "source": [ "!sudo apt install sqlite3" ] }, { "cell_type": "code", "execution_count": 4, "id": "HhueYoeE8rEd", "metadata": { "id": "HhueYoeE8rEd" }, "outputs": [], "source": [ "!curl -s https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql | sqlite3 Chinook.db\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "qYt91HG198Qs", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qYt91HG198Qs", "outputId": "ca176c43-2daa-467c-c58c-18b3394adee7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SQLite version 3.37.2 2022-01-06 13:25:41\n", "Enter \".help\" for usage hints.\n", "\u001b[?2004hsqlite> .tables\n", "Album Employee InvoiceLine PlaylistTrack\n", "Artist Genre MediaType Track \n", "Customer Invoice Playlist \n", "\u001b[?2004hsqlite> SELECT * FROM Artist LIMIT 5;\n", "1|AC/DC\n", "2|Accept\n", "3|Aerosmith\n", "4|Alanis Morissette\n", "5|Alice In Chains\n", "\u001b[?2004hsqlite> .exit\n" ] } ], "source": [ "!sqlite3 Chinook.db" ] }, { "cell_type": "markdown", "id": "16f66a1b-02eb-4c34-b6cc-460af14b8ef3", "metadata": { "id": "16f66a1b-02eb-4c34-b6cc-460af14b8ef3" }, "source": [ " # Set up clients" ] }, { "cell_type": "code", "execution_count": 12, "id": "1768f764-3bac-4e15-b090-fc4e04d94c2c", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1768f764-3bac-4e15-b090-fc4e04d94c2c", "outputId": "88c23196-2587-4f67-fe95-4d23828d7207" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Type your API Key··········\n" ] } ], "source": [ "api_key= getpass(\"Type your API Key\")\n", "client = Mistral(api_key=api_key)\n", "uri = \"sqlite:///Chinook.db\"" ] }, { "cell_type": "markdown", "id": "5dba2bd9-5ce4-4c75-98ef-520297927b5f", "metadata": { "id": "5dba2bd9-5ce4-4c75-98ef-520297927b5f" }, "source": [ "# Interract with the Chinook database\n", "\n", "We are defining two functions :\n", "- run_sql_query that runs sql code on Chinook\n", "- get_sql_schema_of_table that returns the schema of a table specified as input" ] }, { "cell_type": "code", "execution_count": 22, "id": "2a186021-474d-4754-b585-0688f2d4bf38", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 
53 }, "id": "2a186021-474d-4754-b585-0688f2d4bf38", "outputId": "d0841348-aa89-43f8-d3e8-ad50ec53f1d2" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "\"[(1, 'AC/DC'), (2, 'Accept'), (3, 'Aerosmith'), (4, 'Alanis Morissette'), (5, 'Alice In Chains'), (6, 'Antônio Carlos Jobim'), (7, 'Apocalyptica'), (8, 'Audioslave'), (9, 'BackBeat'), (10, 'Billy Cobham')]\"" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def run_sql_query(sql_code):\n", " \"\"\"\n", " Executes the given SQL query against the database and returns the results.\n", "\n", " Args:\n", " sql_code (str): The SQL query to be executed.\n", "\n", " Returns:\n", " result: The results of the SQL query.\n", " \"\"\"\n", " db = SQLDatabase.from_uri(\"sqlite:///Chinook.db\")\n", " return db.run(sql_code)\n", "\n", "run_sql_query(\"SELECT * FROM Artist LIMIT 10;\")" ] }, { "cell_type": "code", "execution_count": 23, "id": "ddb4ead0-b222-43f8-b89d-437e4e199f96", "metadata": { "id": "ddb4ead0-b222-43f8-b89d-437e4e199f96" }, "outputs": [], "source": [ "def get_sql_schema_of_table(table):\n", " \"\"\"\n", " Returns the schema of a table.\n", "\n", " Args:\n", " table (str): Name of the table to be described\n", "\n", " Returns:\n", " result: Column names and types of the table\n", " \"\"\"\n", " if table == \"Album\":\n", " return \"\"\" The table Album was created with the following code :\n", "\n", "CREATE TABLE [Album]\n", "(\n", " [AlbumId] INTEGER NOT NULL,\n", " [Title] NVARCHAR(160) NOT NULL,\n", " [ArtistId] INTEGER NOT NULL,\n", " CONSTRAINT [PK_Album] PRIMARY KEY ([AlbumId]),\n", " FOREIGN KEY ([ArtistId]) REFERENCES [Artist] ([ArtistId])\n", "\t\tON DELETE NO ACTION ON UPDATE NO ACTION\n", ");\n", " \"\"\"\n", "\n", " if table == \"Artist\":\n", " return \"\"\" The table Artist was created with the following code :\n", "\n", "CREATE TABLE [Artist]\n", "(\n", " [ArtistId] INTEGER NOT NULL,\n", " [Name] NVARCHAR(120),\n", " CONSTRAINT [PK_Artist] PRIMARY KEY ([ArtistId])\n", ");\n", " \"\"\"\n", "\n", " if table == \"Customer\":\n", " return \"\"\" The table Customer was created with the following code :\n", "\n", "CREATE TABLE [Customer]\n", "(\n", " [CustomerId] INTEGER NOT NULL,\n", " [FirstName] NVARCHAR(40) NOT NULL,\n", " [LastName] NVARCHAR(20) NOT NULL,\n", " [Company] NVARCHAR(80),\n", " [Address] NVARCHAR(70),\n", " [City] NVARCHAR(40),\n", " [State] NVARCHAR(40),\n", " [Country] NVARCHAR(40),\n", " [PostalCode] NVARCHAR(10),\n", " [Phone] NVARCHAR(24),\n", " [Fax] NVARCHAR(24),\n", " [Email] NVARCHAR(60) NOT NULL,\n", " [SupportRepId] INTEGER,\n", " CONSTRAINT [PK_Customer] PRIMARY KEY ([CustomerId]),\n", " FOREIGN KEY ([SupportRepId]) REFERENCES [Employee] ([EmployeeId])\n", "\t\tON DELETE NO ACTION ON UPDATE NO ACTION\n", ");\n", " \"\"\"\n", "\n", " if table == \"Employee\":\n", " return \"\"\" The table Employee was created with the following code :\n", "\n", "CREATE TABLE [Employee]\n", "(\n", " [EmployeeId] INTEGER NOT NULL,\n", " [LastName] NVARCHAR(20) NOT NULL,\n", " [FirstName] NVARCHAR(20) NOT NULL,\n", " [Title] NVARCHAR(30),\n", " [ReportsTo] INTEGER,\n", " [BirthDate] DATETIME,\n", " [HireDate] DATETIME,\n", " [Address] NVARCHAR(70),\n", " [City] NVARCHAR(40),\n", " [State] NVARCHAR(40),\n", " [Country] NVARCHAR(40),\n", " [PostalCode] NVARCHAR(10),\n", " [Phone] NVARCHAR(24),\n", " [Fax] NVARCHAR(24),\n", " [Email] NVARCHAR(60),\n", " CONSTRAINT [PK_Employee] PRIMARY KEY 
```python
def get_sql_schema_of_table(table):
    """
    Returns the schema of a table.

    Args:
        table (str): Name of the table to be described

    Returns:
        result: Column names and types of the table
    """
    if table == "Album":
        return """ The table Album was created with the following code :

CREATE TABLE [Album]
(
    [AlbumId] INTEGER NOT NULL,
    [Title] NVARCHAR(160) NOT NULL,
    [ArtistId] INTEGER NOT NULL,
    CONSTRAINT [PK_Album] PRIMARY KEY ([AlbumId]),
    FOREIGN KEY ([ArtistId]) REFERENCES [Artist] ([ArtistId])
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
        """

    if table == "Artist":
        return """ The table Artist was created with the following code :

CREATE TABLE [Artist]
(
    [ArtistId] INTEGER NOT NULL,
    [Name] NVARCHAR(120),
    CONSTRAINT [PK_Artist] PRIMARY KEY ([ArtistId])
);
        """

    if table == "Customer":
        return """ The table Customer was created with the following code :

CREATE TABLE [Customer]
(
    [CustomerId] INTEGER NOT NULL,
    [FirstName] NVARCHAR(40) NOT NULL,
    [LastName] NVARCHAR(20) NOT NULL,
    [Company] NVARCHAR(80),
    [Address] NVARCHAR(70),
    [City] NVARCHAR(40),
    [State] NVARCHAR(40),
    [Country] NVARCHAR(40),
    [PostalCode] NVARCHAR(10),
    [Phone] NVARCHAR(24),
    [Fax] NVARCHAR(24),
    [Email] NVARCHAR(60) NOT NULL,
    [SupportRepId] INTEGER,
    CONSTRAINT [PK_Customer] PRIMARY KEY ([CustomerId]),
    FOREIGN KEY ([SupportRepId]) REFERENCES [Employee] ([EmployeeId])
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
        """

    if table == "Employee":
        return """ The table Employee was created with the following code :

CREATE TABLE [Employee]
(
    [EmployeeId] INTEGER NOT NULL,
    [LastName] NVARCHAR(20) NOT NULL,
    [FirstName] NVARCHAR(20) NOT NULL,
    [Title] NVARCHAR(30),
    [ReportsTo] INTEGER,
    [BirthDate] DATETIME,
    [HireDate] DATETIME,
    [Address] NVARCHAR(70),
    [City] NVARCHAR(40),
    [State] NVARCHAR(40),
    [Country] NVARCHAR(40),
    [PostalCode] NVARCHAR(10),
    [Phone] NVARCHAR(24),
    [Fax] NVARCHAR(24),
    [Email] NVARCHAR(60),
    CONSTRAINT [PK_Employee] PRIMARY KEY ([EmployeeId]),
    FOREIGN KEY ([ReportsTo]) REFERENCES [Employee] ([EmployeeId])
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
        """

    if table == "Genre":
        return """ The table Genre was created with the following code :

CREATE TABLE [Genre]
(
    [GenreId] INTEGER NOT NULL,
    [Name] NVARCHAR(120),
    CONSTRAINT [PK_Genre] PRIMARY KEY ([GenreId])
);
        """

    if table == "Invoice":
        return """ The table Invoice was created with the following code :

CREATE TABLE [Invoice]
(
    [InvoiceId] INTEGER NOT NULL,
    [CustomerId] INTEGER NOT NULL,
    [InvoiceDate] DATETIME NOT NULL,
    [BillingAddress] NVARCHAR(70),
    [BillingCity] NVARCHAR(40),
    [BillingState] NVARCHAR(40),
    [BillingCountry] NVARCHAR(40),
    [BillingPostalCode] NVARCHAR(10),
    [Total] NUMERIC(10,2) NOT NULL,
    CONSTRAINT [PK_Invoice] PRIMARY KEY ([InvoiceId]),
    FOREIGN KEY ([CustomerId]) REFERENCES [Customer] ([CustomerId])
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
        """

    if table == "InvoiceLine":
        return """ The table InvoiceLine was created with the following code :

CREATE TABLE [InvoiceLine]
(
    [InvoiceLineId] INTEGER NOT NULL,
    [InvoiceId] INTEGER NOT NULL,
    [TrackId] INTEGER NOT NULL,
    [UnitPrice] NUMERIC(10,2) NOT NULL,
    [Quantity] INTEGER NOT NULL,
    CONSTRAINT [PK_InvoiceLine] PRIMARY KEY ([InvoiceLineId]),
    FOREIGN KEY ([InvoiceId]) REFERENCES [Invoice] ([InvoiceId])
        ON DELETE NO ACTION ON UPDATE NO ACTION,
    FOREIGN KEY ([TrackId]) REFERENCES [Track] ([TrackId])
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
        """

    if table == "MediaType":
        return """ The table MediaType was created with the following code :

CREATE TABLE [MediaType]
(
    [MediaTypeId] INTEGER NOT NULL,
    [Name] NVARCHAR(120),
    CONSTRAINT [PK_MediaType] PRIMARY KEY ([MediaTypeId])
);
        """

    if table == "Playlist":
        return """ The table Playlist was created with the following code :

CREATE TABLE [Playlist]
(
    [PlaylistId] INTEGER NOT NULL,
    [Name] NVARCHAR(120),
    CONSTRAINT [PK_Playlist] PRIMARY KEY ([PlaylistId])
);
        """

    if table == "PlaylistTrack":
        return """ The table PlaylistTrack was created with the following code :

CREATE TABLE [PlaylistTrack]
(
    [PlaylistId] INTEGER NOT NULL,
    [TrackId] INTEGER NOT NULL,
    CONSTRAINT [PK_PlaylistTrack] PRIMARY KEY ([PlaylistId], [TrackId]),
    FOREIGN KEY ([PlaylistId]) REFERENCES [Playlist] ([PlaylistId])
        ON DELETE NO ACTION ON UPDATE NO ACTION,
    FOREIGN KEY ([TrackId]) REFERENCES [Track] ([TrackId])
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
        """

    if table == "Track":
        return """ The table Track was created with the following code :

CREATE TABLE [Track]
(
    [TrackId] INTEGER NOT NULL,
    [Name] NVARCHAR(200) NOT NULL,
    [AlbumId] INTEGER,
    [MediaTypeId] INTEGER NOT NULL,
    [GenreId] INTEGER,
    [Composer] NVARCHAR(220),
    [Milliseconds] INTEGER NOT NULL,
    [Bytes] INTEGER,
    [UnitPrice] NUMERIC(10,2) NOT NULL,
    CONSTRAINT [PK_Track] PRIMARY KEY ([TrackId]),
    FOREIGN KEY ([AlbumId]) REFERENCES [Album] ([AlbumId])
        ON DELETE NO ACTION ON UPDATE NO ACTION,
    FOREIGN KEY ([GenreId]) REFERENCES [Genre] ([GenreId])
        ON DELETE NO ACTION ON UPDATE NO ACTION,
    FOREIGN KEY ([MediaTypeId]) REFERENCES [MediaType] ([MediaTypeId])
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
        """

    return f"The table {table} does not exist in the Chinook database"
```
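Hard-coding the `CREATE TABLE` statements keeps this cookbook self-contained, but on a real database you would typically read them from SQLite's catalog instead. A possible sketch, assuming the same `Chinook.db` file; the function name is ours:

```python
import sqlite3

def get_sql_schema_of_table_dynamic(table):
    """Fetch the CREATE TABLE statement for `table` from SQLite's catalog."""
    with sqlite3.connect("Chinook.db") as conn:
        row = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?;",
            (table,),
        ).fetchone()
    if row is None:
        return f"The table {table} does not exist in the Chinook database"
    return f"The table {table} was created with the following code:\n\n{row[0]}"
```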
# Build agent

```python
def get_response(question, verbose=True):
    """
    Answer a question about the Chinook database.

    Args:
        question (str): The question asked by the user.
        verbose (bool): If True, prints intermediate steps and results.

    Returns:
        str: The response to the user's question.
    """

    # Define the tools available for the AI assistant
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_sql_schema_of_table",
                "description": "Get the schema of a table in the Chinook database",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "table": {
                            "type": "string",
                            "enum": ["Album", "Artist", "Customer", "Employee", "Genre", "Invoice", "InvoiceLine", "MediaType", "Playlist", "PlaylistTrack", "Track"],
                            "description": "The name of the table whose schema should be returned",
                        },
                    },
                    "required": ["table"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "run_sql_query",
                "description": "Run an SQL query on the Chinook database",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "sql_code": {
                            "type": "string",
                            "description": "SQL code to be run",
                        },
                    },
                    "required": ["sql_code"],
                },
            },
        }
    ]

    # System prompt for the AI assistant
    system_prompt = """
    You are an AI assistant.
    Your job is to reply to questions related to the Chinook database.
    The Chinook data model represents a digital media store, including tables for artists, albums, media tracks, invoices, and customers.

    To answer user questions, you have two tools at your disposal.

    Firstly, a function called "get_sql_schema_of_table" which has a single parameter named "table" whose value is an element
    of the following list: ["Album", "Artist", "Customer", "Employee", "Genre", "Invoice", "InvoiceLine", "MediaType", "Playlist", "PlaylistTrack", "Track"].

    Secondly, a function called "run_sql_query" which has a single parameter named "sql_code".
    It will run SQL code on the Chinook database. The SQL dialect is SQLite.

    For a given question, your job is to:
    1. Get the schemas of the tables that might help you answer the question using the "get_sql_schema_of_table" function.
    2. Run a SQLite query on the relevant tables using the "run_sql_query" function.
    3. Answer the user based on the result of the SQL query.

    You will always remain factual, you will not hallucinate, and you will say that you don't know if you don't know.
    You will politely ask the user to ask another question if the question is not related to the Chinook database.
    """

    # Initialize chat history with system prompt and user question
    chat_history = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": question
        }
    ]

    if verbose:
        print(f"User: {question}\n")

    used_run_sql = False
    used_get_sql_schema_of_table = False

    # Function to determine tool choice based on usage
    def tool_choice(used_run_sql, used_get_sql_schema_of_table):
        # If the question is off topic, the agent is not expected to run a tool call
        if not used_get_sql_schema_of_table:
            return "auto"
        # The agent is expected to call "run_sql_query" after getting the schemas of the tables of interest
        if used_get_sql_schema_of_table and not used_run_sql:
            return "any"
        # The agent is not expected to run a tool call after querying the SQL table
        if used_run_sql and used_get_sql_schema_of_table:
            return "none"
        return "auto"

    iteration = 0
    max_iteration = 10

    # Main loop to process the question
    while iteration < max_iteration:
        inference = client.chat.complete(
            model="mistral-large-latest",
            temperature=0.3,
            messages=chat_history,
            tools=tools,
            tool_choice=tool_choice(used_run_sql, used_get_sql_schema_of_table)
        )

        chat_history.append(inference.choices[0].message)

        tool_calls = inference.choices[0].message.tool_calls

        if not tool_calls:
            if verbose:
                print(f"Assistant: {inference.choices[0].message.content}\n")
            return inference.choices[0].message.content

        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_params = json.loads(tool_call.function.arguments)

            if function_name == "get_sql_schema_of_table":
                function_result = get_sql_schema_of_table(function_params['table'])
                if verbose:
                    print(f"Tool: Getting SQL schema of table {function_params['table']}\n")
                used_get_sql_schema_of_table = True

            if function_name == "run_sql_query":
                function_result = run_sql_query(function_params['sql_code'])
                if verbose:
                    print(f"Tool: Running code {function_params['sql_code']}\n")
                used_run_sql = True

            chat_history.append({"role": "tool", "name": function_name, "content": function_result, "tool_call_id": tool_call.id})

        iteration += 1
    return
```
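The loop above dispatches on `function_name` with explicit `if` blocks. With more tools, a small lookup table keeps the dispatch generic; the sketch below is a possible refactor, and `names_to_functions` is our own name, not part of the notebook.

```python
# Map tool names to the Python callables that implement them.
names_to_functions = {
    "get_sql_schema_of_table": get_sql_schema_of_table,
    "run_sql_query": run_sql_query,
}

def dispatch(function_name, function_params):
    """Call the tool selected by the model with its JSON-decoded arguments."""
    return names_to_functions[function_name](**function_params)
```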
# Test the agent

Let's test the agent by asking a few questions of increasing complexity.

```python
# Let's start by checking how the model reacts to off-topic questions!
response = get_response('What is the oldest player in the NBA?')
```

```
User: What is the oldest player in the NBA?

Assistant: I'm sorry, but I can only answer questions related to the Chinook database, which is a digital media store. I don't have information about the NBA or any other topics outside of the Chinook database. Please ask another question.
```

```python
response = get_response('What are the genre in the store?')
```

```
User: What are the genre in the store?

Tool: Getting SQL schema of table Genre

Tool: Running code SELECT Name FROM Genre

Assistant: Here are the genres available in the store:

1. Rock
2. Jazz
3. Metal
4. Alternative & Punk
5. Rock And Roll
6. Blues
7. Latin
8. Reggae
9. Pop
10. Soundtrack
11. Bossa Nova
12. Easy Listening
13. Heavy Metal
14. R&B/Soul
15. Electronica/Dance
16. World
17. Hip Hop/Rap
18. Science Fiction
19. TV Shows
20. Sci Fi & Fantasy
21. Drama
22. Comedy
23. Alternative
24. Classical
25. Opera
```

```python
response = get_response('What are the albums of the rock band U2?')
```

```
User: What are the albums of the rock band U2?

Tool: Getting SQL schema of table Album

Tool: Getting SQL schema of table Artist

Tool: Running code SELECT Title FROM Album WHERE ArtistId = (SELECT ArtistId FROM Artist WHERE Name = 'U2');

Assistant: The albums of the rock band U2 are:

- Achtung Baby
- All That You Can't Leave Behind
- B-Sides 1980-1990
- How To Dismantle An Atomic Bomb
- Pop
- Rattle And Hum
- The Best Of 1980-1990
- War
- Zooropa
- Instant Karma: The Amnesty International Campaign to Save Darfur
```

```python
response = get_response('What is the shortest song that the rock band U2 ever composed?')
```

```
User: What is the shortest song that the rock band U2 ever composed?

Tool: Getting SQL schema of table Artist

Tool: Getting SQL schema of table Album

Tool: Getting SQL schema of table Track

Tool: Running code SELECT Name, Milliseconds FROM Track WHERE Composer = 'U2' ORDER BY Milliseconds ASC LIMIT 1;

Assistant: The shortest song that the rock band U2 ever composed is "Trash, Trampoline And The Party Girl".
```

```python
response = get_response('Which track from U2 is the most sold?')
```

```
User: Which track from U2 is the most sold?

Tool: Getting SQL schema of table Track

Tool: Getting SQL schema of table Artist

Tool: Getting SQL schema of table InvoiceLine

Tool: Running code SELECT Track.Name, SUM(InvoiceLine.Quantity) as TotalSold FROM Track JOIN InvoiceLine ON Track.TrackId = InvoiceLine.TrackId JOIN Album ON Track.AlbumId = Album.AlbumId JOIN Artist ON Album.ArtistId = Artist.ArtistId WHERE Artist.Name = 'U2' GROUP BY Track.Name ORDER BY TotalSold DESC LIMIT 1;

Assistant: The most sold track from U2 is "Sweetest Thing" with 3 copies sold.
```

```python
response = get_response('Which consumer bought the biggest amound of U2 songs?')
```

```
User: Which consumer bought the biggest amound of U2 songs?

Tool: Getting SQL schema of table Track

Tool: Getting SQL schema of table InvoiceLine

Tool: Getting SQL schema of table Invoice

Tool: Getting SQL schema of table Customer

Tool: Getting SQL schema of table Artist

Tool: Running code SELECT Customer.CustomerId, Customer.FirstName, Customer.LastName, SUM(InvoiceLine.Quantity) as total_quantity FROM Customer JOIN Invoice ON Customer.CustomerId = Invoice.CustomerId JOIN InvoiceLine ON Invoice.InvoiceId = InvoiceLine.InvoiceId JOIN Track ON InvoiceLine.TrackId = Track.TrackId JOIN Album ON Track.AlbumId = Album.AlbumId JOIN Artist ON Album.ArtistId = Artist.ArtistId WHERE Artist.Name = 'U2' GROUP BY Customer.CustomerId ORDER BY total_quantity DESC LIMIT 1;

Assistant: The consumer who bought the biggest amount of U2 songs is Madalena Sampaio. She bought 9 songs.
```

```python
response = get_response('List all artist that have a color in their name')
```

```
User: List all artist that have a color in their name

Tool: Getting SQL schema of table Artist

Tool: Running code SELECT Name FROM Artist WHERE Name LIKE '%red%' OR Name LIKE '%blue%' OR Name LIKE '%green%' OR Name LIKE '%yellow%' OR Name LIKE '%black%' OR Name LIKE '%white%' OR Name LIKE '%purple%' OR Name LIKE '%pink%' OR Name LIKE '%orange%' OR Name LIKE '%brown%' OR Name LIKE '%grey%' OR Name LIKE '%gray%';

Assistant: Here are the artists that have a color in their name:

1. Black Label Society
2. Black Sabbath
3. Pedro Luís & A Parede
4. Banda Black Rio
5. Green Day
6. Deep Purple
7. James Brown
8. Pink Floyd
9. Red Hot Chili Peppers
10. Simply Red
11. The Black Crowes
12. Whitesnake
13. Jackson Browne
14. Black Eyed Peas
15. Charlie Brown Jr.
16. Pedro Luís E A Parede
```

```python
response = get_response('Who are our top Customers according to Invoices?')
```

```
User: Who are our top Customers according to Invoices?

Tool: Getting SQL schema of table Customer

Tool: Getting SQL schema of table Invoice

Tool: Running code SELECT Customer.CustomerId, Customer.FirstName, Customer.LastName, SUM(Invoice.Total) as TotalSpent FROM Customer JOIN Invoice ON Customer.CustomerId = Invoice.CustomerId GROUP BY Customer.CustomerId ORDER BY TotalSpent DESC LIMIT 10;

Assistant: Here are our top 10 customers according to invoices:

1. Helena Holý: 49.62
2. Richard Cunningham: 47.62
3. Luis Rojas: 46.62
4. Ladislav Kovács: 45.62
5. Hugh O'Reilly: 45.62
6. Julia Barnett: 43.62
7. Frank Ralston: 43.62
8. Fynn Zimmermann: 43.62
9. Astrid Gruber: 42.62
10. Victor Stevens: 42.62
```

# Evaluating

Let's evaluate the agent in a more formal way.

We will build a test set based on the questions from this Medium article: [Chinook question/answers](https://medium.com/@raufrukayat/chinook-database-querying-a-digital-music-store-database-8c98cf0f8611).

We will score the answers with an LLM as a judge, using the [DeepEval](https://docs.confident-ai.com/) framework.

```python
class CustomMistralLarge(DeepEvalBaseLLM):
    def __init__(self, api_key):
        self.client = Mistral(api_key=api_key)
        self.model_name = "mistral-large-latest"

    def get_model_name(self):
        return "Mistral-large-latest"

    def load_model(self):
        # Since we are using the Mistral API, we don't need to load a model object.
        return self.client

    def generate(self, prompt: str) -> str:
        chat_response = self.client.chat.complete(
            model=self.model_name,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                },
            ]
        )
        return chat_response.choices[0].message.content

    async def a_generate(self, prompt: str) -> str:
        # Reusing the synchronous generate method for simplicity.
        return self.generate(prompt)
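```

The evaluation pairs each question with a reference answer and scores the agent's response with a GEval correctness metric, judged by `mistral-large-latest` through the wrapper above. The notebook's full test set contains seven questions; the sketch below shows how such an evaluation could be wired up with the DeepEval imports from the top of the notebook, using two question/answer pairs taken from the results further down. The exact cell is not reproduced here, so treat this as an illustration rather than the notebook's own code.

```python
from deepeval import evaluate

# Judge model: the Mistral wrapper defined above (assumes the same api_key as before).
judge = CustomMistralLarge(api_key)

# Correctness metric, following the criteria reported in the evaluation logs below.
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    model=judge,
    threshold=0.5,
)

# Two question / reference-answer pairs from the test set (the full set has seven).
qa_pairs = [
    ("Which Employee has the Highest Total Number of Customers?",
     "Peacock Jane has the most customers (she has 21 customers)"),
    ("What is the most popular genre for Australia?",
     "Rock is the most popular song for Australia"),
]

# Run the agent on each question and wrap everything into test cases.
test_cases = [
    LLMTestCase(input=q, actual_output=get_response(q, verbose=False), expected_output=a)
    for q, a in qa_pairs
]

dataset = EvaluationDataset(test_cases=test_cases)
evaluate(test_cases=dataset.test_cases, metrics=[correctness_metric])
```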
style=\"color: #6a00ff; text-decoration-color: #6a00ff\">Correctness </span><span style=\"color: #6a00ff; text-decoration-color: #6a00ff; font-weight: bold\">(</span><span style=\"color: #6a00ff; text-decoration-color: #6a00ff\">GEval</span><span style=\"color: #6a00ff; text-decoration-color: #6a00ff; font-weight: bold\">)</span><span style=\"color: #6a00ff; text-decoration-color: #6a00ff\"> Metric</span>! <span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">(</span><span style=\"color: #374151; text-decoration-color: #374151\">using Mistral-large-latest, </span><span style=\"color: #374151; text-decoration-color: #374151\">strict</span><span style=\"color: #374151; text-decoration-color: #374151\">=</span><span style=\"color: #374151; text-decoration-color: #374151; font-style: italic\">False</span><span style=\"color: #374151; text-decoration-color: #374151\">, </span>\n", "<span style=\"color: #374151; text-decoration-color: #374151\">async_mode</span><span style=\"color: #374151; text-decoration-color: #374151\">=</span><span style=\"color: #374151; text-decoration-color: #374151; font-style: italic\">True</span><span style=\"color: #374151; text-decoration-color: #374151; font-weight: bold\">)</span><span style=\"color: #374151; text-decoration-color: #374151\">...</span>\n", "</pre>\n" ], "text/plain": [ "✨ You're running DeepEval's latest \u001b[38;2;106;0;255mCorrectness \u001b[0m\u001b[1;38;2;106;0;255m(\u001b[0m\u001b[38;2;106;0;255mGEval\u001b[0m\u001b[1;38;2;106;0;255m)\u001b[0m\u001b[38;2;106;0;255m Metric\u001b[0m! \u001b[1;38;2;55;65;81m(\u001b[0m\u001b[38;2;55;65;81musing Mistral-large-latest, \u001b[0m\u001b[38;2;55;65;81mstrict\u001b[0m\u001b[38;2;55;65;81m=\u001b[0m\u001b[3;38;2;55;65;81mFalse\u001b[0m\u001b[38;2;55;65;81m, \u001b[0m\n", "\u001b[38;2;55;65;81masync_mode\u001b[0m\u001b[38;2;55;65;81m=\u001b[0m\u001b[3;38;2;55;65;81mTrue\u001b[0m\u001b[1;38;2;55;65;81m)\u001b[0m\u001b[38;2;55;65;81m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Event loop is already running. Applying nest_asyncio patch to allow async execution...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Evaluating 7 test case(s) in parallel: |██████████|100% (7/7) [Time Taken: 01:11, 10.16s/test case]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "======================================================================\n", "\n", "Metrics Summary\n", "\n", " - ✅ Correctness (GEval) (score: 1.0, threshold: 0.5, strict: False, evaluation model: Mistral-large-latest, reason: The actual output does not contradict any facts in the expected output and directly answers the question., error: None)\n", "\n", "For test case:\n", "\n", " - input: Which Employee has the Highest Total Number of Customers?\n", " - actual output: The employee with the highest total number of customers is Jane Peacock, with 21 customers.\n", " - expected output: Peacock Jane has the most customers (she has 21 customers)\n", " - context: None\n", " - retrieval context: None\n", "\n", "======================================================================\n", "\n", "Metrics Summary\n", "\n", " - ✅ Correctness (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: Mistral-large-latest, reason: The actual output correctly identifies the top customers and their spending amounts, aligning with the facts in the expected output. 
However, it does not directly answer the question by explicitly stating the top five customers, which is a minor deviation from the expected output., error: None)\n", "\n", "For test case:\n", "\n", " - input: Who are our top Customers according to Invoices?\n", " - actual output: Here are our top customers according to invoices :\n", "\n", "| CustomerId | FirstName | LastName | TotalSpent |\n", "|------------|-----------|------------|------------|\n", "| 6 | Helena | Holý | 49.62 |\n", "| 26 | Richard | Cunningham | 47.62 |\n", "| 57 | Luis | Rojas | 46.62 |\n", "| 45 | Ladislav | Kovács | 45.62 |\n", "| 46 | Hugh | O'Reilly | 45.62 |\n", "| 28 | Julia | Barnett | 43.62 |\n", "| 24 | Frank | Ralston | 43.62 |\n", "| 37 | Fynn | Zimmermann | 43.62 |\n", "| 7 | Astrid | Gruber | 42.62 |\n", "| 25 | Victor | Stevens | 42.62 |\n", " - expected output: Helena Holy, Richard Cunningham, Luis Rojas, Ladislav Kovacs, and Hugh O’Reilly are the top five customers who have spent the highest amount of money according to the invoice\n", " - context: None\n", " - retrieval context: None\n", "\n", "======================================================================\n", "\n", "Metrics Summary\n", "\n", " - ✅ Correctness (GEval) (score: 0.9, threshold: 0.5, strict: False, evaluation model: Mistral-large-latest, reason: The actual output directly answers the question and does not contradict the expected output., error: None)\n", "\n", "For test case:\n", "\n", " - input: How many Rock music listeners are there?\n", " - actual output: There are 59 Rock music listeners.\n", " - expected output: We found out that all 59 customers in the database have listened to Rock Music.\n", " - context: None\n", " - retrieval context: None\n", "\n", "======================================================================\n", "\n", "Metrics Summary\n", "\n", " - ✅ Correctness (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: Mistral-large-latest, reason: The actual output does not contradict any facts in the expected output and directly answers the question. However, the format and some details differ from the expected output., error: None)\n", "\n", "For test case:\n", "\n", " - input: What artists have written most rock music songs?\n", " - actual output: The artists who have written the most rock music songs are:\n", "\n", "1. **Led Zeppelin** with 114 songs\n", "2. **U2** with 112 songs\n", "3. **Deep Purple** with 92 songs\n", "4. **Iron Maiden** with 81 songs\n", "5. **Pearl Jam** with 54 songs\n", "6. **Van Halen** with 52 songs\n", "7. **Queen** with 45 songs\n", "8. **The Rolling Stones** with 41 songs\n", "9. **Creedence Clearwater Revival** with 40 songs\n", "10. **Kiss** with 35 songs\n", " - expected output: Led Zeppelin tops the list of Artists who have written the most Rock Music with 114 songs followed Closely by U2 with 112 music.\n", " - context: None\n", " - retrieval context: None\n", "\n", "======================================================================\n", "\n", "Metrics Summary\n", "\n", " - ✅ Correctness (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: Mistral-large-latest, reason: The actual output correctly identifies Iron Maiden as the artist who earned the most, matching the expected output. 
However, it uses 'approximately' and 'units of currency' instead of the exact amount and currency symbol '$' as in the expected output., error: None)\n", "\n", "For test case:\n", "\n", " - input: Which artist has earned the most according to the Invoice Lines? How much is it?\n", " - actual output: The artist who earned the most according to the invoice lines is Iron Maiden, with a total earning of approximately 138.60 units of currency.\n", " - expected output: The Artist who has earned the most according to the invoice lines is Iron Maiden with a total of $138.6.\n", " - context: None\n", " - retrieval context: None\n", "\n", "======================================================================\n", "\n", "Metrics Summary\n", "\n", " - ✅ Correctness (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: Mistral-large-latest, reason: The actual output does not contradict any facts in the expected output and answers the question. However, it lacks additional details provided in the expected output, such as the total number of songs and the average music length., error: None)\n", "\n", "For test case:\n", "\n", " - input: How many tracks have a song length greater than the average song length?\n", " - actual output: There are 494 tracks with a song length greater than the average song length.\n", " - expected output: Out of 3503 songs in the database, we found out that 494 of these songs have length more than the average music length of 393,599.21 milliseconds.\n", " - context: None\n", " - retrieval context: None\n", "\n", "======================================================================\n", "\n", "Metrics Summary\n", "\n", " - ✅ Correctness (GEval) (score: 0.5, threshold: 0.5, strict: False, evaluation model: Mistral-large-latest, reason: The actual output correctly identifies Rock as the most popular genre, matching the expected output. However, it provides additional details about other genres and their sales, which are not present in the expected output. It answers the question, but the additional information is not required., error: None)\n", "\n", "For test case:\n", "\n", " - input: What is the most popular genre for Australia?\n", " - actual output: The most popular genre in Australia is Rock, with a total sales of approximately $21.78. This is followed by Metal with approximately $7.92 in sales. Other genres such as Heavy Metal, Reggae, Latin, and Blues have lower sales ranging from approximately $3.00 to $1.00.\n", " - expected output: Rock is the most popular song for Australia\n", " - context: None\n", " - retrieval context: None\n", "\n", "======================================================================\n", "\n", "Overall Metric Pass Rates\n", "\n", "Correctness (GEval): 100.00% pass rate\n", "\n", "======================================================================\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #05f58d; text-decoration-color: #05f58d\">✓</span> Tests finished 🎉! Run <span style=\"color: #008000; text-decoration-color: #008000\">'deepeval login'</span> to save and analyze evaluation results on Confident AI. 
\n", "‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's metrics directly on Confident AI \n", "instead.\n", "</pre>\n" ], "text/plain": [ "\u001b[38;2;5;245;141m✓\u001b[0m Tests finished 🎉! Run \u001b[32m'deepeval login'\u001b[0m to save and analyze evaluation results on Confident AI. \n", "‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's metrics directly on Confident AI \n", "instead.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=1.0, reason='The actual output does not contradict any facts in the expected output and directly answers the question.', strict_mode=False, evaluation_model='Mistral-large-latest', error=None, evaluation_cost=None, verbose_logs='Criteria:\\nDetermine whether the actual output is factually correct based on the expected output. \\n \\nEvaluation Steps:\\n[\\n \"Check whether the facts in \\'actual output\\' contradicts any facts in \\'expected output\",\\n \"You should also heavily penalize situations where the actual output does not answers the question\"\\n]')], conversational=False, multimodal=False, input='Which Employee has the Highest Total Number of Customers?', actual_output='The employee with the highest total number of customers is Jane Peacock, with 21 customers.', expected_output='Peacock Jane has the most customers (she has 21 customers)', context=None, retrieval_context=None), TestResult(name='test_case_1', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.8, reason='The actual output correctly identifies the top customers and their spending amounts, aligning with the facts in the expected output. However, it does not directly answer the question by explicitly stating the top five customers, which is a minor deviation from the expected output.', strict_mode=False, evaluation_model='Mistral-large-latest', error=None, evaluation_cost=None, verbose_logs='Criteria:\\nDetermine whether the actual output is factually correct based on the expected output. 
\\n \\nEvaluation Steps:\\n[\\n \"Check whether the facts in \\'actual output\\' contradicts any facts in \\'expected output\",\\n \"You should also heavily penalize situations where the actual output does not answers the question\"\\n]')], conversational=False, multimodal=False, input='Who are our top Customers according to Invoices?', actual_output=\"Here are our top customers according to invoices :\\n\\n| CustomerId | FirstName | LastName | TotalSpent |\\n|------------|-----------|------------|------------|\\n| 6 | Helena | Holý | 49.62 |\\n| 26 | Richard | Cunningham | 47.62 |\\n| 57 | Luis | Rojas | 46.62 |\\n| 45 | Ladislav | Kovács | 45.62 |\\n| 46 | Hugh | O'Reilly | 45.62 |\\n| 28 | Julia | Barnett | 43.62 |\\n| 24 | Frank | Ralston | 43.62 |\\n| 37 | Fynn | Zimmermann | 43.62 |\\n| 7 | Astrid | Gruber | 42.62 |\\n| 25 | Victor | Stevens | 42.62 |\", expected_output='Helena Holy, Richard Cunningham, Luis Rojas, Ladislav Kovacs, and Hugh O’Reilly are the top five customers who have spent the highest amount of money according to the invoice', context=None, retrieval_context=None), TestResult(name='test_case_2', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.9, reason='The actual output directly answers the question and does not contradict the expected output.', strict_mode=False, evaluation_model='Mistral-large-latest', error=None, evaluation_cost=None, verbose_logs='Criteria:\\nDetermine whether the actual output is factually correct based on the expected output. \\n \\nEvaluation Steps:\\n[\\n \"Check whether the facts in \\'actual output\\' contradicts any facts in \\'expected output\",\\n \"You should also heavily penalize situations where the actual output does not answers the question\"\\n]')], conversational=False, multimodal=False, input='How many Rock music listeners are there?', actual_output='There are 59 Rock music listeners.', expected_output='We found out that all 59 customers in the database have listened to Rock Music.', context=None, retrieval_context=None), TestResult(name='test_case_3', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.8, reason='The actual output does not contradict any facts in the expected output and directly answers the question. However, the format and some details differ from the expected output.', strict_mode=False, evaluation_model='Mistral-large-latest', error=None, evaluation_cost=None, verbose_logs='Criteria:\\nDetermine whether the actual output is factually correct based on the expected output. \\n \\nEvaluation Steps:\\n[\\n \"Check whether the facts in \\'actual output\\' contradicts any facts in \\'expected output\",\\n \"You should also heavily penalize situations where the actual output does not answers the question\"\\n]')], conversational=False, multimodal=False, input='What artists have written most rock music songs?', actual_output='The artists who have written the most rock music songs are:\\n\\n1. **Led Zeppelin** with 114 songs\\n2. **U2** with 112 songs\\n3. **Deep Purple** with 92 songs\\n4. **Iron Maiden** with 81 songs\\n5. **Pearl Jam** with 54 songs\\n6. **Van Halen** with 52 songs\\n7. **Queen** with 45 songs\\n8. **The Rolling Stones** with 41 songs\\n9. **Creedence Clearwater Revival** with 40 songs\\n10. 
**Kiss** with 35 songs', expected_output='Led Zeppelin tops the list of Artists who have written the most Rock Music with 114 songs followed Closely by U2 with 112 music.', context=None, retrieval_context=None), TestResult(name='test_case_4', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.8, reason=\"The actual output correctly identifies Iron Maiden as the artist who earned the most, matching the expected output. However, it uses 'approximately' and 'units of currency' instead of the exact amount and currency symbol '$' as in the expected output.\", strict_mode=False, evaluation_model='Mistral-large-latest', error=None, evaluation_cost=None, verbose_logs='Criteria:\\nDetermine whether the actual output is factually correct based on the expected output. \\n \\nEvaluation Steps:\\n[\\n \"Check whether the facts in \\'actual output\\' contradicts any facts in \\'expected output\",\\n \"You should also heavily penalize situations where the actual output does not answers the question\"\\n]')], conversational=False, multimodal=False, input='Which artist has earned the most according to the Invoice Lines? How much is it?', actual_output='The artist who earned the most according to the invoice lines is Iron Maiden, with a total earning of approximately 138.60 units of currency.', expected_output='The Artist who has earned the most according to the invoice lines is Iron Maiden with a total of $138.6.', context=None, retrieval_context=None), TestResult(name='test_case_5', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.8, reason='The actual output does not contradict any facts in the expected output and answers the question. However, it lacks additional details provided in the expected output, such as the total number of songs and the average music length.', strict_mode=False, evaluation_model='Mistral-large-latest', error=None, evaluation_cost=None, verbose_logs='Criteria:\\nDetermine whether the actual output is factually correct based on the expected output. \\n \\nEvaluation Steps:\\n[\\n \"Check whether the facts in \\'actual output\\' contradicts any facts in \\'expected output\",\\n \"You should also heavily penalize situations where the actual output does not answers the question\"\\n]')], conversational=False, multimodal=False, input='How many tracks have a song length greater than the average song length?', actual_output='There are 494 tracks with a song length greater than the average song length.', expected_output='Out of 3503 songs in the database, we found out that 494 of these songs have length more than the average music length of 393,599.21 milliseconds.', context=None, retrieval_context=None), TestResult(name='test_case_6', success=True, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=True, score=0.5, reason='The actual output correctly identifies Rock as the most popular genre, matching the expected output. However, it provides additional details about other genres and their sales, which are not present in the expected output. It answers the question, but the additional information is not required.', strict_mode=False, evaluation_model='Mistral-large-latest', error=None, evaluation_cost=None, verbose_logs='Criteria:\\nDetermine whether the actual output is factually correct based on the expected output. 
\\n \\nEvaluation Steps:\\n[\\n \"Check whether the facts in \\'actual output\\' contradicts any facts in \\'expected output\",\\n \"You should also heavily penalize situations where the actual output does not answers the question\"\\n]')], conversational=False, multimodal=False, input='What is the most popular genre for Australia?', actual_output='The most popular genre in Australia is Rock, with a total sales of approximately $21.78. This is followed by Metal with approximately $7.92 in sales. Other genres such as Heavy Metal, Reggae, Latin, and Blues have lower sales ranging from approximately $3.00 to $1.00.', expected_output='Rock is the most popular song for Australia', context=None, retrieval_context=None)], confident_link=None)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Specify the evaluation questions\n", "questions = [\n", " \"Which Employee has the Highest Total Number of Customers?\",\n", " \"Who are our top Customers according to Invoices?\",\n", " \"How many Rock music listeners are there?\",\n", " \"What artists have written most rock music songs?\",\n", " \"Which artist has earned the most according to the Invoice Lines? How much is it?\",\n", " \"How many tracks have a song length greater than the average song length?\",\n", " \"What is the most popular genre for Australia?\"\n", "]\n", "\n", "# Specify the associated reference (golden) answers\n", "golden_answers = [\n", " \"Peacock Jane has the most customers (she has 21 customers)\",\n", " \"Helena Holy, Richard Cunningham, Luis Rojas, Ladislav Kovacs, and Hugh O’Reilly are the top five customers who have spent the highest amount of money according to the invoices\",\n", " \"We found out that all 59 customers in the database have listened to Rock Music.\",\n", " \"Led Zeppelin tops the list of artists who have written the most rock music with 114 songs, followed closely by U2 with 112 songs.\",\n", " \"The Artist who has earned the most according to the invoice lines is Iron Maiden with a total of $138.6.\",\n", " \"Out of 3503 songs in the database, we found out that 494 of these songs have length more than the average music length of 393,599.21 milliseconds.\",\n", " \"Rock is the most popular genre for Australia\"\n", "]\n", "\n", "# Build the evaluation dataset: one LLMTestCase per (question, golden answer) pair\n", "test_cases = [\n", " LLMTestCase(\n", " input=question,\n", " actual_output=get_response(question, verbose=False),\n", " expected_output=golden\n", " ) for question, golden in zip(questions, golden_answers)\n", "]\n", "dataset = EvaluationDataset(test_cases=test_cases)\n", "\n", "# Specify the evaluation metric\n", "correctness_metric = GEval(\n", " name=\"Correctness\",\n", " criteria=\"Determine whether the actual output is factually correct based on the expected output.\",\n", " # NOTE: when evaluation_steps are provided, GEval uses them directly instead of generating steps from the criteria\n", " evaluation_steps=[\n", " \"Check whether the facts in 'actual output' contradict any facts in 'expected output'\",\n", " \"You should also heavily penalize situations where the actual output does not answer the question\",\n", " ],\n", " evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],\n", " model=CustomMistralLarge(api_key)\n", ")\n", "\n", "# Evaluate the agent on the dataset\n", "dataset.evaluate([correctness_metric])" ] }, { "cell_type": "code", "execution_count": null, "id": "o4YUNyrKBKwJ", "metadata": { "id": "o4YUNyrKBKwJ" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": {
"display_name": "mistral_3", "language": "python", "name": "mistral_3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" } }, "nbformat": 4, "nbformat_minor": 5 }