
MongoDB Mistral

Details

File: third_party/MongoDB/mongodb_mistral.ipynb

Type: Jupyter Notebook

Integrations: MongoDB

Content

Notebook content (JSON format):

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "92ddc4f4-f7bf-4e0e-b5a5-5abd8a008b21",
   "metadata": {},
   "source": [
    "# RAG with Mistral AI and MongoDB \n",
    "\n",
    "Creating a LLM GenAI application integrates the power of Mistral AI with the robustness of an enterprise-grade vector store like MongoDB. Below is a detailed step-by-step guide to implementing this innovative system:\n",
    "\n",
    "\n",
    "![MongoDB - Mistral](../../images/mongomistral5.jpg)\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2843bde7",
   "metadata": {},
   "source": [
    "\n",
    "* Set `MISTRAL_API_KEY` and set up Subscription to activate it.\n",
    "* Get `MONGO_URI` from MongoDB Atlas cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13f1a274",
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "export MONGO_URI=\"Your_cluster_connection_string\"\n",
    "export MISTRAL_API_KEY=\"Your_MISTRAL_API_KEY\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "820c2b08",
   "metadata": {},
   "source": [
    "## Import needed libraries\n",
    "This section shows the versions of the required libraries. Personally, I run my code in VScode. So you need to install the following libraries beforehand. Here is the version at the moment I’m running the following code.\n",
    "\n",
    "mistralai                                         0.0.8\n",
    "\n",
    "pymongo                                           4.3.3\n",
    "\n",
    "gradio                                            4.10.0\n",
    "\n",
    "gradio_client                                     0.7.3\n",
    "\n",
    "langchain                                         0.0.348\n",
    "\n",
    "langchain-core                                    0.0.12\n",
    "\n",
    "pandas                                            2.0.3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a806b5f",
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "# Install necessary packages\n",
    "!pip install mistralai==0.0.8\n",
    "!pip install pymongo==4.3.3\n",
    "!pip install gradio==4.10.0\n",
    "!pip install gradio_client==0.7.3\n",
    "!pip install langchain==0.0.348\n",
    "!pip install langchain-core==0.0.12\n",
    "!pip install pandas==2.0.3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed6fae41",
   "metadata": {},
   "source": [
    "These include libraries for data processing, web scraping, AI models, and database interactions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "422cc85e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "import os\n",
    "import pymongo\n",
    "import pandas as pd\n",
    "from mistralai.client import MistralClient\n",
    "from mistralai.models.chat_completion import ChatMessage\n",
    "from langchain.document_loaders import PyPDFLoader\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62b18e96",
   "metadata": {},
   "source": [
    "You can use your API keys exported from shell commande."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06248ae1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check API keys\n",
    "import os\n",
    "mistral_api_key = os.environ[\"MISTRAL_API_KEY\"]\n",
    "mongo_url = os.environ[\"MONGO_URI\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7bf8e412",
   "metadata": {},
   "source": [
    "## Data preparation\n",
    "The data_prep() function loads data from a PDF, a document, or a specified URL. It extracts text content from a webpage/documentation, removes unwanted elements, and then splits the data into manageable chunks. Once the data is chunked, we use the Mistral AI embedding endpoint to compute embeddings for every chunk and save them in the document. Afterward, each document is added to a MongoDB collection.\n",
    "\n",
    "![MongoDB - Mistral](../../images/mongomistral1.jpg)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1512efe1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def data_prep(file):\n",
    "    # Set up Mistral client\n",
    "    api_key = os.environ[\"MISTRAL_API_KEY\"]\n",
    "    client = MistralClient(api_key=api_key)\n",
    "\n",
    "    # Process the uploaded file\n",
    "    loader = PyPDFLoader(file.name)\n",
    "    pages = loader.load_and_split()\n",
    "\n",
    "    # Split data\n",
    "    text_splitter = RecursiveCharacterTextSplitter(\n",
    "        chunk_size=100,\n",
    "        chunk_overlap=20,\n",
    "        separators=[\"\\n\\n\", \"\\n\", \"(?<=\\. )\", \" \", \"\"],\n",
    "        length_function=len,\n",
    "    )\n",
    "    docs = text_splitter.split_documents(pages)\n",
    "\n",
    "    # Calculate embeddings and store into MongoDB\n",
    "    text_chunks = [text.page_content for text in docs]\n",
    "    df = pd.DataFrame({'text_chunks': text_chunks})\n",
    "    df['embedding'] = df.text_chunks.apply(lambda x: get_embedding(x, client))\n",
    "\n",
    "    collection = connect_mongodb()\n",
    "    df_dict = df.to_dict(orient='records')\n",
    "    collection.insert_many(df_dict)\n",
    "\n",
    "    return \"PDF processed and data stored in MongoDB.\""
   ]
  },
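  {
   "cell_type": "markdown",
   "id": "3f9c1a2b",
   "metadata": {},
   "source": [
    "A quick usage sketch (the file name `sample.pdf` is a placeholder, not part of the original notebook): `data_prep` reads `file.name`, matching what a Gradio file-upload component passes in, so we mimic that interface with a `SimpleNamespace` when calling it directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d4e5f60",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Usage sketch -- \"sample.pdf\" is a placeholder for your own PDF\n",
    "from types import SimpleNamespace\n",
    "\n",
    "uploaded = SimpleNamespace(name=\"sample.pdf\")\n",
    "print(data_prep(uploaded))"
   ]
  },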
  {
   "cell_type": "markdown",
   "id": "ec8b01da",
   "metadata": {},
   "source": [
    "## Connecting to MongoDB server\n",
    "The connect_mongodb() function establishes a connection to a MongoDB server. It returns a collection object that can be used to interact with the database. This function will be called in the data_prep() function.\n",
    "In order to get your MongoDB connection string, you can go to your MongoDB Atlas console, click the “Connect” button on your cluster, and choose the Python driver.\n",
    "![MongoDB - Mistral](../../images/mongomistral2.jpg)\n",
    "![MongoDB - Mistral](../../images/mongomistral3.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3a119a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "def connect_mongodb(mongo_url):\n",
    "    # Your MongoDB connection string\n",
    "    client = pymongo.MongoClient(mongo_url)\n",
    "    db = client[\"mistralpdf\"]\n",
    "    collection = db[\"pdfRAG\"]\n",
    "    return collection"
   ]
  },
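  {
   "cell_type": "markdown",
   "id": "8b2c3d41",
   "metadata": {},
   "source": [
    "Optionally, you can verify the connection before inserting any data. The sketch below pings the deployment through the collection's client."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c4d5e62",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check (a sketch): ping the Atlas deployment\n",
    "from pymongo.errors import PyMongoError\n",
    "\n",
    "try:\n",
    "    collection = connect_mongodb()\n",
    "    collection.database.client.admin.command(\"ping\")\n",
    "    print(\"Successfully connected to MongoDB.\")\n",
    "except PyMongoError as exc:\n",
    "    print(f\"Connection failed: {exc}\")"
   ]
  },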
  {
   "cell_type": "markdown",
   "id": "66ec1830",
   "metadata": {},
   "source": [
    "## Getting the embeddings\n",
    "The get_embedding(text) function generates an embedding for a given text. It replaces newline characters and then uses Mistral AI “La plateforme” embedding endpoints to get the embedding. This function will be called in both data preparation and question and answering processes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc61a8d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_embedding(text, client):\n",
    "    text = text.replace(\"\\n\", \" \")\n",
    "    embeddings_batch_response = client.embeddings(\n",
    "        model=\"mistral-embed\",\n",
    "        input=text,\n",
    "    )\n",
    "    return embeddings_batch_response.data[0].embedding"
   ]
  },
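  {
   "cell_type": "markdown",
   "id": "1a5b6c73",
   "metadata": {},
   "source": [
    "A quick sanity check: `mistral-embed` returns 1024-dimensional vectors, which must match the `numDimensions` of the Atlas vector search index defined in the next section. The sample text below is arbitrary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b6c7d84",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity check (a sketch): confirm the embedding dimensionality\n",
    "client = MistralClient(api_key=os.environ[\"MISTRAL_API_KEY\"])\n",
    "vector = get_embedding(\"MongoDB Atlas Vector Search with Mistral AI\", client)\n",
    "print(len(vector))  # mistral-embed vectors have 1024 dimensions"
   ]
  },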
  {
   "cell_type": "markdown",
   "id": "c3bae678",
   "metadata": {},
   "source": [
    "## The last configuration on the MongoDB vector search index\n",
    "In order to run a vector search query, you only need to create a vector search index in MongoDB Atlas as follows. (You can also learn more about \n",
    "how to create a vector search index https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/\n",
    ".)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1d8c023",
   "metadata": {},
   "outputs": [],
   "source": [
    "{\n",
    " \"type\": \"vectorSearch\",\n",
    " \"fields\": [\n",
    "   {\n",
    "     \"numDimensions\": 1536,\n",
    "     \"path\": \"'embedding'\",\n",
    "     \"similarity\": \"cosine\",\n",
    "     \"type\": \"vector\"\n",
    "   }\n",
    " ]\n",
    "}"
   ]
  },
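  {
   "cell_type": "markdown",
   "id": "3c7d8e95",
   "metadata": {},
   "source": [
    "Alternatively, the same index can be created programmatically. The sketch below assumes a newer PyMongo than the 4.3.3 pinned above (4.7+ added `SearchIndexModel` with a `vectorSearch` type) and an Atlas cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d8e9fa6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Programmatic alternative (a sketch): requires PyMongo 4.7+ and an Atlas cluster\n",
    "from pymongo.operations import SearchIndexModel\n",
    "\n",
    "index_model = SearchIndexModel(\n",
    "    definition={\n",
    "        \"fields\": [\n",
    "            {\n",
    "                \"type\": \"vector\",\n",
    "                \"path\": \"embedding\",\n",
    "                \"numDimensions\": 1024,\n",
    "                \"similarity\": \"cosine\",\n",
    "            }\n",
    "        ]\n",
    "    },\n",
    "    name=\"vector_index\",\n",
    "    type=\"vectorSearch\",\n",
    ")\n",
    "connect_mongodb().create_search_index(model=index_model)"
   ]
  },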
  {
   "cell_type": "markdown",
   "id": "5eccd98a",
   "metadata": {},
   "source": [
    "## Finding similar documents\n",
    "The find_similar_documents(embedding) function runs the vector search query in a MongoDB collection. This function will be called when the user asks a question. We will use this function to find similar documents to the questions in the question and answering process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77000b3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_similar_documents(embedding):\n",
    "    collection = connect_mongodb()\n",
    "    documents = list(\n",
    "        collection.aggregate([\n",
    "            {\n",
    "                \"$vectorSearch\": {\n",
    "                    \"index\": \"vector_index\",\n",
    "                    \"path\": \"embedding\",\n",
    "                    \"queryVector\": embedding,\n",
    "                    \"numCandidates\": 20,\n",
    "                    \"limit\": 10\n",
    "                }\n",
    "            },\n",
    "            {\"$project\": {\"_id\": 0, \"text_chunks\": 1}}\n",
    "        ]))\n",
    "    return documents"
   ]
  },
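  {
   "cell_type": "markdown",
   "id": "5e9fab17",
   "metadata": {},
   "source": [
    "A usage sketch (the question text is a placeholder): embed a question and inspect the retrieved chunks before wiring up the full question-answering flow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fab1c28",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Usage sketch: retrieve chunks for a sample question\n",
    "client = MistralClient(api_key=os.environ[\"MISTRAL_API_KEY\"])\n",
    "question_embedding = get_embedding(\"What is this document about?\", client)\n",
    "for doc in find_similar_documents(question_embedding):\n",
    "    print(doc[\"text_chunks\"])"
   ]
  },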
  {
   "cell_type": "markdown",
   "id": "55c99e56",
   "metadata": {},
   "source": [
    "## Question and answer function\n",
    "This function is the core of the program. It processes a user's question and creates a response using the context supplied by Mistral AI.\n",
    "Question and answer process\n",
    "This process involves several key steps. Here’s how it works:\n",
    "Firstly, we generate a numerical representation, called an embedding, through a Mistral AI embedding endpoint, for the user’s question.\n",
    "Next, we run a vector search in the MongoDB collection to identify the documents similar to the user’s question.\n",
    "It then constructs a contextual background by combining chunks of text from these similar documents. We prepare an assistant instruction by combining all this information.\n",
    "The user’s question and the assistant’s instruction are prepared into a prompt for the Mistral AI model.\n",
    "Finally, Mistral AI will generate responses to the user thanks to the retrieval-augmented generation process.\n",
    "![MongoDB - Mistral](images/mongomistral4.jpg)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4aa15fe8",
   "metadata": {},
   "outputs": [],
   "source": [
    "def qna(users_question):\n",
    "    # Set up Mistral client\n",
    "    api_key = os.environ[\"MISTRAL_API_KEY\"]\n",
    "    client = MistralClient(api_key=api_key)\n",
    "\n",
    "    question_embedding = get_embedding(users_question, client)\n",
    "    print(\"-----Here is user question------\")\n",
    "    print(users_question)\n",
    "    documents = find_similar_documents(question_embedding)\n",
    "    \n",
    "    print(\"-----Retrieved documents------\")\n",
    "    print(documents)\n",
    "    for doc in documents:\n",
    "        doc['text_chunks'] = doc['text_chunks'].replace('\\n', ' ')\n",
    "    \n",
    "    for document in documents:\n",
    "        print(str(document) + \"\\n\")\n",
    "\n",
    "    context = \" \".join([doc[\"text_chunks\"] for doc in documents])\n",
    "    template = f\"\"\"\n",
    "    You are an expert who loves to help people! Given the following context sections, answer the\n",
    "    question using only the given context. If you are unsure and the answer is not\n",
    "    explicitly written in the documentation, say \"Sorry, I don't know how to help with that.\"\n",
    "\n",
    "    Context sections:\n",
    "    {context}\n",
    "\n",
    "    Question:\n",
    "    {users_question}\n",
    "\n",
    "    Answer:\n",
    "    \"\"\"\n",
    "    messages = [ChatMessage(role=\"user\", content=template)]\n",
    "    chat_response = client.chat(\n",
    "        model=\"mistral-large-latest\",\n",
    "        messages=messages,\n",
    "    )\n",
    "    formatted_documents = '\\n'.join([doc['text_chunks'] for doc in documents])\n",
    "\n",
    "    return chat_response.choices[0].message, formatted_documents"
   ]
  }
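  {
   "cell_type": "markdown",
   "id": "7abc2d39",
   "metadata": {},
   "source": [
    "Finally, `gradio` is installed and imported above but not otherwise used in this notebook. The minimal UI below is one possible way to wire `data_prep()` and `qna()` together; the layout is an assumption, not part of the original flow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bcd3e4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal Gradio UI sketch -- the layout here is an assumption\n",
    "with gr.Blocks() as demo:\n",
    "    gr.Markdown(\"## RAG with Mistral AI and MongoDB\")\n",
    "    pdf_input = gr.File(label=\"Upload a PDF\")\n",
    "    prep_status = gr.Textbox(label=\"Ingestion status\")\n",
    "    pdf_input.upload(data_prep, inputs=pdf_input, outputs=prep_status)\n",
    "    question = gr.Textbox(label=\"Your question\")\n",
    "    answer = gr.Textbox(label=\"Answer\")\n",
    "    sources = gr.Textbox(label=\"Retrieved context\")\n",
    "    question.submit(qna, inputs=question, outputs=[answer, sources])\n",
    "\n",
    "demo.launch()"
   ]
  }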
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}