arize phoenix evaluate rag

Details

File: third_party/Phoenix/arize_phoenix_evaluate_rag.ipynb
Type: Jupyter Notebook
Use Cases: Evaluation, RAG
Integrations: Phoenix, Arize
Content

Notebook content (JSON format):
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
    "        <br>\n",
    "        <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n",
    "        |\n",
    "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "        |\n",
    "        <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n",
    "    </p>\n",
    "</center>\n",
    "<h1 align=\"center\"> RAG Observability with Mistral AI and Phoenix </h1>\n",
    "\n",
    "In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.\n",
    "\n",
    "It has the the following sections:\n",
    "\n",
    "1. Understanding Retrieval Augmented Generation (RAG).\n",
    "2. Building RAG (with the help of a framework such as LlamaIndex).\n",
    "3. Evaluating RAG with Phoenix Evals."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieval Augmented Generation (RAG)\n",
    "\n",
    "LLMs are trained on vast datasets, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLMs but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.\n",
    "\n",
    "In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query then are sent to the LLM along with a prompt, and the LLM provides a response.\n",
    "\n",
    "RAG is a critical component for building applications such a chatbots or agents and you will want to know RAG techniques on how to get data into your application."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build a RAG system \n",
    "\n",
    "Now that we have understood the stages of RAG, let's build a pipeline. We will use [LlamaIndex](https://www.llamaindex.ai/) for RAG and [Phoenix Evals](https://docs.arize.com/phoenix/llm-evals/llm-evals) for evaluation.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -qq arize-phoenix gcsfs nest_asyncio openinference-instrumentation-llama_index\n",
    "!pip install -q llama-index-embeddings-mistralai\n",
    "!pip install -q llama-index-llms-mistralai\n",
    "!pip install -qq \"mistralai>=1.0.0\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "from typing import Any, Dict\n",
    "\n",
    "import pandas as pd\n",
    "import phoenix as px\n",
    "from phoenix.otel import register\n",
    "from mistralai import Mistral\n",
    "from openinference.instrumentation.llama_index import LlamaIndexInstrumentor\n",
    "\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n",
    "\n",
    "import pandas as pd\n",
    "from llama_index.core import SimpleDirectoryReader, VectorStoreIndex\n",
    "from llama_index.core.node_parser import SimpleNodeParser\n",
    "from llama_index.llms.mistralai import MistralAI\n",
    "from llama_index.embeddings.mistralai import MistralAIEmbedding\n",
    "\n",
    "pd.set_option(\"display.max_colwidth\", None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this tutorial we will be using Mistral to create synthetic data as well as for evaluation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not (api_key := os.getenv(\"MISTRAL_API_KEY\")):\n",
    "    api_key = getpass(\"🔑 Enter your Mistral AI API key: \")\n",
    "os.environ[\"MISTRAL_API_KEY\"] = api_key\n",
    "client = Mistral(api_key=api_key)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, simply start the phoenix application and instrument LlamaIndex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "session =px.launch_app()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Enable Phoenix tracing via `LlamaIndexInstrumentor`. Phoenix uses OpenInference traces - an open-source standard for capturing and storing LLM application traces that enables LLM applications to seamlessly integrate with LLM observability solutions such as Phoenix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracer_provider = register()\n",
    "LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Data and Build an Index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use an [essay by Paul Graham](https://www.paulgraham.com/worked.html) to build our RAG pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from urllib.request import urlretrieve\n",
    "\n",
    "with tempfile.NamedTemporaryFile() as tf:\n",
    "    urlretrieve(\n",
    "        \"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/data/paul_graham/paul_graham_essay.txt\",\n",
    "        tf.name,\n",
    "    )\n",
    "    documents = SimpleDirectoryReader(input_files=[tf.name]).load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import Settings\n",
    "\n",
    "# Define an LLM\n",
    "llm = MistralAI()\n",
    "embed_model = MistralAIEmbedding()\n",
    "Settings.llm = llm\n",
    "Settings.embed_model = embed_model\n",
    "\n",
    "# Build index with a chunk_size of 512\n",
    "node_parser = SimpleNodeParser.from_defaults(chunk_size=512)\n",
    "nodes = node_parser.get_nodes_from_documents(documents)\n",
    "vector_index = VectorStoreIndex(nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build a QueryEngine and start querying."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = vector_index.as_query_engine()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response_vector = query_engine.query(\"What did the author do growing up?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the response that you get from the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response_vector.response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"phoenix URL\", session.url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have built a RAG pipeline and also have instrumented it using Phoenix Tracing. We now need to evaluate it's performance. We can assess our RAG system/query engine using Phoenix's LLM Evals. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Simulate usage of our application\n",
    "\n",
    "For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.\n",
    "\n",
    "For this tutorial, let's manually generate some questions for our pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "questions_list = [\n",
    "    \"What did the author do growing up?\",\n",
    "    \"What was the author's major?\",\n",
    "    \"What was the author's minor?\",\n",
    "    \"What was the author's favorite class?\",\n",
    "    \"What was the author's least favorite class?\",\n",
    "]\n",
    "\n",
    "for question in questions_list:\n",
    "    response_vector = query_engine.query(question)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluate our Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context. Let's extract all the retrieved documents from the traces logged to phoenix. (For an in-depth explanation of how to export trace data from the phoenix runtime, consult the [docs](https://docs.arize.com/phoenix/how-to/extract-data-from-spans))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.session.evaluation import get_retrieved_documents\n",
    "\n",
    "retrieved_documents_df = get_retrieved_documents(px.Client())\n",
    "retrieved_documents_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regards to the query. Note, we've turned on `explanations` which prompts the LLM to explain it's reasoning. This can be useful for debugging and for figuring out potential corrective actions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.evals import (\n",
    "    MistralAIModel,\n",
    "    RelevanceEvaluator,\n",
    "    run_evals,\n",
    ")\n",
    "\n",
    "relevance_evaluator = RelevanceEvaluator(MistralAIModel)\n",
    "\n",
    "retrieved_documents_relevance_df = run_evals(\n",
    "    evaluators=[relevance_evaluator],\n",
    "    dataframe=retrieved_documents_df,\n",
    "    provide_explanation=True,\n",
    "    concurrency=20,\n",
    ")[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "retrieved_documents_relevance_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now combine the documents with the relevance evaluations to compute retrieval metrics. These metrics will help us understand how well the RAG system is performing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents_with_relevance_df = pd.concat(\n",
    "    [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix(\"eval_\")], axis=1\n",
    ")\n",
    "documents_with_relevance_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Response Evaluation\n",
    "\n",
    "The retrieval evaluations demonstrates that our RAG system is not perfect. However, it's possible that the LLM is able to generate the correct response even when the context is incorrect. Let's evaluate the responses generated by the LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.session.evaluation import get_qa_with_reference\n",
    "\n",
    "qa_with_reference_df = get_qa_with_reference(px.Client())\n",
    "qa_with_reference_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a dataset of the question, context, and response (input, reference, and output), we now can measure how well the LLM is responding to the queries. For details on the QA correctness evaluation, see the [LLM Evals documentation](https://docs.arize.com/phoenix/llm-evals/running-pre-tested-evals/q-and-a-on-retrieved-data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.evals import (\n",
    "    HallucinationEvaluator,\n",
    "    MistralAIModel,\n",
    "    QAEvaluator,\n",
    "    run_evals,\n",
    ")\n",
    "\n",
    "qa_evaluator = QAEvaluator(MistralAIModel())\n",
    "hallucination_evaluator = HallucinationEvaluator(MistralAIModel())\n",
    "\n",
    "qa_correctness_eval_df, hallucination_eval_df = run_evals(\n",
    "    evaluators=[qa_evaluator, hallucination_evaluator],\n",
    "    dataframe=qa_with_reference_df,\n",
    "    provide_explanation=True,\n",
    "    concurrency=20,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qa_correctness_eval_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hallucination_eval_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we have evaluated our RAG system's QA performance and Hallucinations performance, let's send these evaluations to Phoenix for visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from phoenix.trace import SpanEvaluations, DocumentEvaluations\n",
    "\n",
    "px.Client().log_evaluations(\n",
    "    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name=\"Q&A Correctness\"),\n",
    "    SpanEvaluations(dataframe=hallucination_eval_df, eval_name=\"Hallucination\"),\n",
    "    DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name=\"relevance\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! Since we've sent all the evals to Phoenix, we can analyze the results together to make a determination on whether or not poor retrieval or irrelevant context has an effect on the LLM's ability to generate the correct response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"phoenix URL\", session.url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Evaluations](./images/evals-1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "We have explored how to build and evaluate a RAG pipeline using LlamaIndex and Phoenix, with a specific focus on evaluating the retrieval system and generated responses within the pipelines. \n",
    "\n",
    "Phoenix offers a variety of other evaluations that can be used to assess the performance of your LLM Application. For more details, see the [LLM Evals](https://docs.arize.com/phoenix/llm-evals/llm-evals) documentation."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "phoenix",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}