data generation refining news

Details

File: mistral/data_generation/data_generation_refining_news.ipynb

Type: Jupyter Notebook

Use Cases: Data generation

Content

Notebook content (JSON format):

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "J3rdWWphfXhq"
   },
   "source": [
    "# Data Generation: Refining News Articles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7USghSLJya4k"
   },
   "source": [
    "In this cookbook, we walk through generating data to fine-tune a model that rewrites articles in a specific, refined format. We use a two-step pipeline. First, we generate critiques of the articles, using style guides that the model should follow as a reference. Then, using these critiques, we produce new, refined articles. The goal is a dataset that contains at least each original article and its refined version, which could later be used to fine-tune a model or for other purposes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can download the guides for this notebook with the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://raw.githubusercontent.com/mistralai/cookbook/main/mistral/data_generation/external_files/guide_1.txt\n",
    "!wget https://raw.githubusercontent.com/mistralai/cookbook/main/mistral/data_generation/external_files/guide_2.txt\n",
    "!wget https://raw.githubusercontent.com/mistralai/cookbook/main/mistral/data_generation/external_files/guide_3.txt\n",
    "!wget https://raw.githubusercontent.com/mistralai/cookbook/main/mistral/data_generation/external_files/guide_4.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iosTze_oiGdF"
   },
   "source": [
    "The first step is to install `mistralai` (pinned to version 0.4.2 in this notebook) and create a client with your API key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install mistralai==0.4.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mistralai.client import MistralClient\n",
    "\n",
    "# Other imports we will need\n",
    "from tqdm.contrib.concurrent import process_map\n",
    "import secrets\n",
    "import time\n",
    "import random\n",
    "import json\n",
    "import os"
   ]
  },
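  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reads the key from the MISTRAL_API_KEY environment variable when it is set\n",
    "# (the variable name is only a suggestion); otherwise, replace the \"api_key\" placeholder with your own key\n",
    "CLIENT = MistralClient(api_key=os.environ.get(\"MISTRAL_API_KEY\", \"api_key\"))"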
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GSe89uMDiTAj"
   },
   "source": [
    "The next step is to download the dataset. We will be making use of a dataset available on Hugging Face, but you could provide your own!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UcUKCkrfijoV"
   },
   "source": [
    "For this example, we will generate 100 pairs, each made of an original article and its refined version, but you are free to generate as many as you need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total Articles: 32218\n",
      "Sampled: 100\n"
     ]
    }
   ],
   "source": [
    "import datasets\n",
    "\n",
    "# Load the full training split into memory as a list of records\n",
    "news_articles = list(datasets.load_dataset(\"AyoubChLin/CNN_News_Articles_2011-2022\", split=\"train\"))\n",
    "\n",
    "print(\"Total Articles:\", len(news_articles))\n",
    "\n",
    "# Draw a random sample without replacement (no prior shuffle is needed)\n",
    "n_sample = 100\n",
    "news_articles = random.sample(news_articles, n_sample)\n",
    "\n",
    "print(\"Sampled:\", len(news_articles))\n",
    "\n",
    "# Write one JSON object per line (JSON Lines)\n",
    "with open(\"./news.jsonl\", \"w\") as f:\n",
    "    for news in news_articles:\n",
    "        f.write(json.dumps({\"news\": news[\"text\"]}) + \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "R95riV-5yHWq"
   },
   "source": [
    "Our pipeline consists of two steps. First, we generate critiques using a style guide of our choice. Here, we have four guides that are broadly similar, but you could write your own.\n",
    "\n",
    "Once the critiques have been generated, we will use them to produce the rewritten articles.\n",
    "\n",
    "Let's get started with the critiques!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ktT106SvjBA5"
   },
   "source": [
    "Let's create a folder where we will cache our data as we generate it. This can be handy for debugging and to have a backup in case something goes wrong."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Create the cache folder; exist_ok avoids an error if it already exists\n",
    "newpath = r'./data'\n",
    "os.makedirs(newpath, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JfJOuUe6jKdK"
   },
   "source": [
    "Now, let's define the first process. We will use `mistral-large-latest` to both critique and rewrite our articles, but you are free to use any combination of models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_critique(args):\n",
    "    line, systems, guides = args\n",
    "    record = json.loads(line)\n",
    "\n",
    "    news_article = record.get(\"news\")\n",
    "\n",
    "    # Pick a random style guide and a random system prompt template\n",
    "    guide = random.choice(guides)\n",
    "    system = random.choice(systems).format(guide)\n",
    "\n",
    "    # Brief pause before each request to avoid sending bursts of API calls\n",
    "    time.sleep(1)\n",
    "    try:\n",
    "        answer = CLIENT.chat(\n",
    "            model=\"mistral-large-latest\",\n",
    "            messages=[\n",
    "                {\"role\": \"system\", \"content\": system},\n",
    "                {\"role\": \"user\", \"content\": news_article},\n",
    "            ],\n",
    "            temperature=0.2,\n",
    "            max_tokens=2048\n",
    "        )\n",
    "        critique = answer.choices[0].message.content\n",
    "\n",
    "        result = json.dumps({\"news\": news_article, \"critique\": critique, \"status\": \"SUCCESS\"})\n",
    "\n",
    "    except Exception as e:\n",
    "        result = json.dumps({\"news\": news_article, \"critique\": str(e), \"status\": \"ERROR\"})\n",
    "\n",
    "    random_hash = secrets.token_hex(4)\n",
    "\n",
    "    with open(f\"./data/news_critique_{random_hash}.jsonl\", \"w\") as f:\n",
    "        f.write(result)\n",
    "\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GXaIgWxcEpdZ"
   },
   "source": [
    "To get more diverse output, it can help to use several system prompts instead of a single one. Here, we provide ten system prompts that share the same intent but differ in wording and in the maximum number of issues they ask for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "systems_variations = [\n",
    "    \"As a 'News Article Editor' adhering to a specific style guide, your responsibility is to polish and restructure news articles to align them with the high standards of clarity, accuracy, and elegance set by the guide:\\n\\n {} \\n\\n You are presented with a news article. Identify the ten (or fewer) most significant stylistic concerns and provide examples of how they can be enhanced.\",\n",
    "    \"As a 'News Content Refiner' committed to the guide, your role is to revise and perfect news articles to ensure they meet the exceptional standards of lucidity, exactness, and refinement synonymous with the guide:\\n\\n {} \\n\\n You have a news article at hand. Pinpoint the sixteen (or less) most crucial stylistic problems and suggest examples of how they might be improved.\",\n",
    "    \"As a 'News Piece Stylist' in accordance with the style guide, your duty is to amend and enrich news articles to guarantee they adhere to the rigorous standards of clarity, precision, and sophistication embodied by the style guide:\\n\\n {} \\n\\n You are handed a news piece. Highlight the fourteen (or fewer) most pressing stylistic errors and offer examples of how they could be rectified.\",\n",
    "    \"As a 'News Article Enhancer' following the principles of the guide, your mission is to modify and elevate news articles to match the high-quality standards of clarity, precision, and eloquence established by the style guide:\\n\\n {} \\n\\n You are given a news article to work on. Specify the twenty (or less) most notable stylistic flaws and provide examples of how they can be bettered.\",\n",
    "    \"As a 'News Prose Stylist' abiding by the style guide, your assignment is to correct and embellish news articles to ensure they meet the distinguished standards of clarity, precision, and sophistication upheld by the guide:\\n\\n {} \\n\\n You are provided with a news article for evaluation. Indicate the twenty (or fewer) most important stylistic issues and propose examples of how they may be optimized.\",\n",
    "    \"As a 'News Report Stylist' in compliance with the guide, your job is to revise and improve news articles to guarantee they align with the high benchmarks of clarity, precision, and sophistication set forth by the guide:\\n\\n {} \\n\\n You are tasked with reviewing a news report. List the fifteen (or less) most critical stylistic shortcomings and provide examples of how they might be amended.\",\n",
    "    \"As a 'News Writing Stylist' in line with the guide, your responsibility is to edit and refine news articles to ensure they meet the superior standards of clarity, precision, and sophistication inherent to the style guide:\\n\\n {} \\n\\n You are assigned to edit a news article. Identify the sixteen (or fewer) most prominent stylistic inconsistencies and suggest examples of how they can be enhanced.\",\n",
    "    \"As a 'News Text Stylist' adhering to the style guide, your role is to amend and perfect news articles to ensure they meet the high-caliber standards of clarity, precision, and sophistication characteristic of the guide:\\n\\n {} \\n\\n You are given a news text to evaluate. Highlight the nineteen (or less) most significant stylistic discrepancies and provide examples of how they might be improved.\",\n",
    "    \"As a 'News Copy Stylist' in accordance with the guide, your duty is to revise and enrich news articles to guarantee they adhere to the exacting standards of clarity, precision, and sophistication embodied by the style guide:\\n\\n {} \\n\\n You are tasked with reviewing a news copy. List the eleven (or fewer) most crucial stylistic errors and propose examples of how they can be rectified.\",\n",
    "    \"As a 'News Article Stylist and Editor' committed to the style guide, your mission is to refine, rewrite, and edit news articles to ensure they meet the high standards of clarity, precision, and sophistication synonymous with the guide:\\n\\n {} \\n\\n You are given a news article to refine and edit. Identify the seventeen (or fewer) most pressing stylistic concerns and provide examples of how they can be improved.\"\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EWFGZsiZjeVC"
   },
   "source": [
    "Now it's time to generate. Let's load the guides we downloaded and start the generation with `process_map`, which spreads the work across multiple worker processes so that requests run in parallel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b42132d091fb45bf93dbbe92f27bb3c2",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/100 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Load the four style guides downloaded earlier\n",
    "guides = []\n",
    "for pick in range(1, 5):\n",
    "    guide_path = f\"./guide_{pick}.txt\"\n",
    "\n",
    "    with open(guide_path, \"r\") as f:\n",
    "        guides.append(f.read())\n",
    "\n",
    "data_path = \"./news.jsonl\"\n",
    "with open(data_path, \"r\") as f:\n",
    "    lines = f.readlines()\n",
    "    lines = [(line, systems_variations, guides) for line in lines]\n",
    "\n",
    "    results = process_map(process_critique, lines, max_workers=20, chunksize=1)\n",
    "\n",
    "with open(\"./generated_news_critiques.jsonl\", \"w\") as f:\n",
    "    for result in results:\n",
    "        f.write(result + \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "C9zJIJKyjq15"
   },
   "source": [
    "Perfect! With the critiques generated, it's time to refine and rewrite our articles using the feedback."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_refined_news(args):\n",
    "    line, system, instruction = args\n",
    "    record = json.loads(line)\n",
    "\n",
    "    news_article = record.get(\"news\")\n",
    "    critique = record.get(\"critique\")\n",
    "    status = record.get(\"status\")\n",
    "\n",
    "    time.sleep(1)\n",
    "\n",
    "    try:\n",
    "        if status == \"SUCCESS\":\n",
    "            # Replay the critique as the assistant turn, then ask for the rewrite\n",
    "            answer = CLIENT.chat(\n",
    "                model=\"mistral-large-latest\",\n",
    "                messages=[\n",
    "                    {\"role\": \"system\", \"content\": system},\n",
    "                    {\"role\": \"user\", \"content\": news_article},\n",
    "                    {\"role\": \"assistant\", \"content\": critique},\n",
    "                    {\"role\": \"user\", \"content\": instruction},\n",
    "                ],\n",
    "                temperature=0.2,\n",
    "                max_tokens=2048\n",
    "            )\n",
    "            new_news = answer.choices[0].message.content\n",
    "\n",
    "            result = json.dumps({\"news\": news_article, \"critique\": critique, \"refined_news\": new_news, \"status\": \"SUCCESS\"})\n",
    "\n",
    "        else:\n",
    "            # The critique step failed for this article, so carry its error message forward\n",
    "            result = json.dumps({\"news\": news_article, \"critique\": critique, \"refined_news\": critique, \"status\": \"ERROR\"})\n",
    "    except Exception as e:\n",
    "        result = json.dumps({\"news\": news_article, \"critique\": critique, \"refined_news\": str(e), \"status\": \"ERROR\"})\n",
    "\n",
    "    random_hash = secrets.token_hex(4)\n",
    "\n",
    "    with open(f\"./data/refined_news_{random_hash}.jsonl\", \"w\") as f:\n",
    "        f.write(result)\n",
    "\n",
    "    return result\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BWxHkkhqj2B1"
   },
   "source": [
    "Here we replace the multiple system prompt variations with a single general one that provides context. The key part of this second step is the instruction to rewrite the article using the provided feedback. You may need to adapt this instruction considerably depending on your requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3b508f1ea308402cb56e0b2fcc9e761d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/100 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "system = \"Polish and restructure the news articles to align them with the high standards of clarity, accuracy, and elegance set by the style guide. You are presented with a news article. Identify the ten (or fewer) most significant stylistic concerns and provide examples of how they can be enhanced.\"\n",
    "\n",
    "instruction = \"\"\"\n",
    "Now, I want you to incorporate the feedback and critiques into the news article and respond with the enhanced version, focusing solely on stylistic improvements without altering the content.\n",
    "You must provide the entire article enhanced.\n",
    "Do not make ANY comments, only provide the new article improved.\n",
    "Do not tell me what you changed, only provide the new article taking into consideration the feedback you provided.\n",
    "The new article needs to have all the content of the original article but with the feedback into account.\n",
    "\"\"\"\n",
    "\n",
    "data_path = \"./generated_news_critiques.jsonl\"\n",
    "with open(data_path, \"r\") as f:\n",
    "    lines = f.readlines()\n",
    "    lines = [(line, system, instruction) for line in lines]\n",
    "\n",
    "    results = process_map(process_refined_news, lines, max_workers=20, chunksize=1)\n",
    "\n",
    "with open(\"./generated_refined_news.jsonl\", \"w\") as f:\n",
    "    for result in results:\n",
    "        f.write(result + \"\\n\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "q6ouOPB5kFj-"
   },
   "source": [
    "Articles generated! Let's take a look at them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'critique': '1. Use consistent capitalization in headlines: \"Liverpool 5-2 '\n",
      "             'Roma\" should be \"Liverpool 5-2 Roma\"\\n'\n",
      "             '2. Use active voice: \"Mohamed Salah put in a magical '\n",
      "             'performance\" could be \"Mohamed Salah delivered a magical '\n",
      "             'performance\"\\n'\n",
      "             '3. Avoid using colloquial language: \"led the rout\" should be '\n",
      "             '\"led the charge\"\\n'\n",
      "             '4. Use proper punctuation: \"Follow @cnnsport\" should be removed '\n",
      "             'or integrated into the sentence\\n'\n",
      "             '5. Use serial comma: \"two goals and two assists helped Liverpool '\n",
      "             'to a commanding first-leg Champions League semifinal lead\" '\n",
      "             'should be \"two goals, two assists, and a commanding first-leg '\n",
      "             'lead in the Champions League semifinal\"\\n'\n",
      "             '6. Use proper spelling: \"defence\" should be \"defense\" (depending '\n",
      "             \"on the publication's preferred spelling)\\n\"\n",
      "             '7. Use consistent verb tense: \"Salah apologizes\" should be '\n",
      "             '\"Salah apologized\"\\n'\n",
      "             '8. Use proper punctuation: \"Read More\" should be removed or '\n",
      "             'integrated into the sentence\\n'\n",
      "             '9. Use active voice: \"Roma were catching a glimpse\" should be '\n",
      "             '\"Roma caught a glimpse\"\\n'\n",
      "             '10. Use proper punctuation: \"READ: A day in the life of Mo '\n",
      "             'Salah, Liverpool\\'s \\'Egyptian King\\'\" should be removed or '\n",
      "             'integrated into the sentence\\n'\n",
      "             '11. Use proper punctuation: \"JUST WATCHED Mohamed Salah: I love '\n",
      "             'the Egyptian King chant\" should be removed or integrated into '\n",
      "             'the sentence\\n'\n",
      "             '12. Use proper punctuation: \"MUST WATCH Mohamed Salah: I love '\n",
      "             'the Egyptian King chant 01:34\" should be removed or integrated '\n",
      "             'into the sentence\\n'\n",
      "             '13. Use proper punctuation: \"After a closely contested opening '\n",
      "             '35 minutes, it was a case of who would blink first\" should be '\n",
      "             'rephrased as a complete sentence\\n'\n",
      "             '14. Use proper punctuation: \"Following a brief interlude after '\n",
      "             \"not one, but two linesman's flags had fallen apart and had to be \"\n",
      "             \"repaired, Roberto Firmino got in behind Roma's defence for the \"\n",
      "             'first time and flashed a low shot across the face of goal\" '\n",
      "             'should be rephrased as multiple sentences\\n'\n",
      "             '15. Use proper punctuation: \"The Brazilian forward looked to '\n",
      "             'have been inches offside, not that the linesman had a working '\n",
      "             'flag to make the decision\" should be rephrased as multiple '\n",
      "             'sentences\\n'\n",
      "             '16. Use proper punctuation: \"Jurgen Klopp\\'s plans were then '\n",
      "             'dealt an early blow as the industrious Alex Oxlade-Chamberlain '\n",
      "             'was forced off with a knee injury sustained in a tackle with '\n",
      "             'Aleksandar Kolorov\" should be rephrased as multiple sentences\\n'\n",
      "             '17. Use proper punctuation: \"READ: Mo Salah gets suited and '\n",
      "             'booted ahead of PFA Player of the Year awards\" should be removed '\n",
      "             'or integrated into the sentence\\n'\n",
      "             '18. Use proper punctuation: \"READ: Salah -- \\'There\\'s something '\n",
      "             'very special about playing for Liverpool\\'\" should be removed or '\n",
      "             'integrated into the sentence\\n'\n",
      "             '19. Use proper punctuation: \"JUST WATCHED Mo Salah and Becky '\n",
      "             'Anderson go to the Docks\" should be removed or integrated into '\n",
      "             'the sentence\\n'\n",
      "             '20. Use proper punctuation: \"MUST WATCH Mo Salah and Becky '\n",
      "             'Anderson go to the Docks 01:06\" should be removed or integrated '\n",
      "             'into the sentence.',\n",
      " 'news': 'Story highlightsLiverpool 5-2 RomaMo Salah has hand in four '\n",
      "         'goalsLiverpool led 5-0 before Roma pulled two late goals back '\n",
      "         '(CNN)Mohamed Salah put in a magical performance against former club '\n",
      "         'Roma, as two goals and two assists helped Liverpool to a commanding '\n",
      "         'first-leg Champions League semifinal lead.The Egyptian led the rout '\n",
      "         'from the first whistle as Liverpool found themselves 5-0 up with '\n",
      "         'just 10 minutes of the game left to play.Follow @cnnsport\\n'\n",
      "         '\\n'\n",
      "         \"However, a late Roma rally as Liverpool's defence went to sleep saw \"\n",
      "         'Edin Dzeko and Diego Perotti scored two late goals to give the '\n",
      "         'Italians a glimmer of hope going into the second leg.The two sides '\n",
      "         \"will meet again at the Stadio Olimpico in eight days' time.Salah \"\n",
      "         'apologizes to the Roma fans after opening to scoring for '\n",
      "         'Liverpool.When Salah signed for Liverpool in June of 2017, Roma '\n",
      "         'supporters were understandably sorry to see him go.Read MoreThanks '\n",
      "         'to individual moments of magic and no shortage of goals, the '\n",
      "         'Egyptian became a firm fan favorite during his time in the Italian '\n",
      "         'capital.Fast forward to Anfield, little less than a year later, Roma '\n",
      "         'were catching a glimpse of their former star for the first time and, '\n",
      "         'now, they truly were sorry he ever left.READ: A day in the life of '\n",
      "         \"Mo Salah, Liverpool's 'Egyptian King'READ: Mo Salah faces emotional \"\n",
      "         'return to RomaIt felt scripted -- Salah scoring two and providing '\n",
      "         \"two of Liverpool's five-goal blitz -- though, in reality, his recent \"\n",
      "         'scintillating form meant this was very much predictable.Many of '\n",
      "         'these Roma players had trained and played with Salah during his two '\n",
      "         'years at the club, but in truth the 25-year-old never hit these '\n",
      "         \"kinds of heights in Italy.If they had heard tales of Salah's \"\n",
      "         'exploits since his move to England, tonight they were experiencing '\n",
      "         'them first hand.Liverpool had not tasted a first-leg Champions '\n",
      "         'League semifinal win in their previous three attempts, while Roma '\n",
      "         'have one of the worst away records in the competition.JUST '\n",
      "         'WATCHEDMohamed Salah: I love the Egyptian King chantReplayMore '\n",
      "         'Videos ...MUST WATCHMohamed Salah: I love the Egyptian King chant '\n",
      "         '01:34After a closely contested opening 35 minutes, it was a case of '\n",
      "         'who would blink first.Following a brief interlude after not one, but '\n",
      "         \"two linesman's flags had fallen apart and had to be repaired, \"\n",
      "         \"Roberto Firmino got in behind Roma's defence for the first time and \"\n",
      "         'flashed a low shot across the face of goal.The Brazilian forward '\n",
      "         'looked to have been inches offside, not that the linesman had a '\n",
      "         \"working flag to make the decision.Jurgen Klopp's plans were then \"\n",
      "         'dealt an early blow as the industrious Alex Oxlade-Chamberlain was '\n",
      "         'forced off with a knee injury sustained in a tackle with Aleksandar '\n",
      "         'Kolorov.READ: Mo Salah gets suited and booted ahead of PFA Player of '\n",
      "         \"the Year awardsREAD: Salah -- 'There's something very special about \"\n",
      "         \"playing for Liverpool'Georginio Wijnaldum came on in his place and \"\n",
      "         'perhaps the substitution unsettled the home side as Roma almost took '\n",
      "         \"a surprise lead.Cengiz Ünder's outswinging corner narrowly missed \"\n",
      "         \"Džeko's head, before the ball fell to Kolorov on the edge of the \"\n",
      "         'box.The former Manchester City left back, renowned for his powerful '\n",
      "         'left foot during his time in England, fired a shot at goal which '\n",
      "         'goalkeeper Loris Karius fumbled fortuitously onto the underside of '\n",
      "         'the crossbar.Karius, who has spent much of his Liverpool career '\n",
      "         'playing second fiddle to Simon Mignolet, has regularly given his own '\n",
      "         'teammates and fans similarly heart-stopping moments.JUST WATCHEDMo '\n",
      "         'Salah and Becky Anderson go to the DocksReplayMore Videos ...MUST '\n",
      "         'WATCHMo Salah and Becky Anderson go to the Docks 01:06 As the '\n",
      "         'clocked ticked towards half an hour, Sadio Mane had two gilt-edged '\n",
      "         'chances in the space of 49 seconds to give Liverpool the lead.First, '\n",
      "         \"Firmino's clever flick and pass was latched onto by Mane, whose \"\n",
      "         'exceptional first touch allowed him to get away from Federico Fazio, '\n",
      "         \"but the Senegalese blazed his finish high over Alisson's \"\n",
      "         \"crossbar.Moments later, Roma's high defensive line was again \"\n",
      "         'exposed, this time as Firmino got in down the right to square the '\n",
      "         'ball for Mane, but again his shot went high into the stands.It '\n",
      "         \"wasn't long before Mane did finally have the ball in the net, \"\n",
      "         \"prodding home Andy Robertson's low cross, though this time he was \"\n",
      "         \"met with the sight of the linesman's, now fully functioning, flag. \"\n",
      "         \"That first Mane chance, though it wasn't taken, felt like the \"\n",
      "         'watershed moment in the first half.Wave after wave of red shirts '\n",
      "         \"began descending on Roma's back line; every lunge increasingly \"\n",
      "         \"desperate, every tackle increasingly last ditch.It wasn't long \"\n",
      "         \"before Liverpool's star man, Salah, got in on the act, cutting \"\n",
      "         'inside and curling a shot which was palmed away by Alisson at full '\n",
      "         \"stretch.Roma naively didn't heed that warning, again allowing Salah \"\n",
      "         'to cut inside onto that magical left foot.JUST WATCHEDMo Salah: We '\n",
      "         'can win the Champions LeagueReplayMore Videos ...MUST WATCHMo Salah: '\n",
      "         'We can win the Champions League 02:59This time it was inch perfect, '\n",
      "         'the ball kissing the underside of the crossbar as the scrambling '\n",
      "         'Alisson looked helplessly upwards.There was a brief intake of '\n",
      "         'breath, a split second of silence as the 54,000 watching fans heard '\n",
      "         'the noise of ball against woodwork, before Anfield -- which up to '\n",
      "         'this point had been simmering nicely -- exploded into life.Salah, '\n",
      "         'facing his former employers for the first time since his summer '\n",
      "         'transfer to Merseyside, immediately turned to where the away fans '\n",
      "         'were congregated and put his hands together, asking for '\n",
      "         'forgiveness.Bar the small pocket of Roma fans in the corner of the '\n",
      "         'crowd, Anfield was delirious and the Liverpool players responded to '\n",
      "         'a crowd which now was baying for more Roma blood.Again they came '\n",
      "         'forward, this time Firmino crashed through a weak tackle from Kostas '\n",
      "         \"Manolas -- Roma's hero from their second-leg quarterfinal comeback \"\n",
      "         'against Barcelona -- and threaded a pass through to Salah.With his '\n",
      "         'first touch, the Egyptian dragged the ball into his path; with his '\n",
      "         'second, he nonchalantly clipped it over the onrushing Alisson.For a '\n",
      "         'moment it looked as though the backtracking Juan Jesus might be able '\n",
      "         'to save Roma, but the wet turf quickly coaxed the ball over the '\n",
      "         'line.Roma made a substitution at the start of the second half, '\n",
      "         'Patrik Schick replacing Under, in what proved to be a futile attempt '\n",
      "         \"to stem the tide.JUST WATCHEDCOPA90: Mo Salah, Liverpool's Egyptian \"\n",
      "         \"KingReplayMore Videos ...MUST WATCHCOPA90: Mo Salah, Liverpool's \"\n",
      "         'Egyptian King 03:04With less than 10 minutes on the clock, Salah '\n",
      "         'turned provider, putting a pass on a plate for Mane to divert the '\n",
      "         \"ball into the far corner.Roma spent Monday night in Liverpool's \"\n",
      "         'Titanic Hotel, and the Italians were now sinking much faster.With '\n",
      "         'Liverpool toying with their opponents and with one foot in the final '\n",
      "         'in Kiev, Salah again burst down the right and this time played a '\n",
      "         'pass across the face of goal which Firmino tapped into an empty '\n",
      "         'net.Despite two goals and two assists to his name, Salah was still '\n",
      "         'being allowed to roam free down the right flank with barely a Roma '\n",
      "         'defender in sight.Liverpool then looked to hammer the final nail '\n",
      "         \"into Roma's coffin, as Firmino headed home his second goal of the \"\n",
      "         \"game from James Milner's corner.Based on the mood around the ground \"\n",
      "         'and the swagger with which the Liverpool players were strutting '\n",
      "         'around the pitch, you got the sense this tie was over.Perhaps it was '\n",
      "         'that complacency, then, that allowed Roma to inexplicably breathe '\n",
      "         'life into this semifinal.Liverpool fans welcome the team bus to '\n",
      "         'Anfield before the game.With barely a shot to their name in the '\n",
      "         'entire match, two goals in the space of four minutes -- the first a '\n",
      "         \"tidy finish from Dzeko, the second Perotti's penalty -- put a spring \"\n",
      "         'in the step of those Roma fans who had looked utterly dejected up '\n",
      "         'until this point.The mood inside Anfield had now shifted completely. '\n",
      "         'Barely minutes ago, the place had been rocking. Now there were '\n",
      "         \"nerves as Liverpool's much-maligned leaky defence again began to \"\n",
      "         \"show its fragility.Klopp's final change, defender Ragnar Klavan for \"\n",
      "         'two-goal Firmino, mirrored the growing anxiety around the '\n",
      "         \"ground.When the final whistle blew, it wasn't met with the same \"\n",
      "         'euphoria had it been blown just 10 minutes earlier, but Liverpool '\n",
      "         \"can still certainly be pleased with their night's work.Though Roma's \"\n",
      "         'two late goals slightly soured an otherwise incredible Champions '\n",
      "         'League night at Anfield, it will take a monumental effort if the '\n",
      "         'Italians are to overturn a second consecutive three-goal deficit -- '\n",
      "         'in particular with Salah in the opposition line up.But they did it '\n",
      "         \"against Lionel Messi and Barcelona. Who is to say they can't do it \"\n",
      "         'again?',\n",
      " 'refined_news': 'Liverpool Secures Commanding 5-2 Victory Over Roma in '\n",
      "                 'Champions League Semifinal First Leg\\n'\n",
      "                 '\\n'\n",
      "                 'Liverpool took a significant step towards the Champions '\n",
      "                 'League final with a dominant 5-2 victory over Roma in the '\n",
      "                 'first leg of their semifinal matchup. Mohamed Salah led the '\n",
      "                 'charge with two goals and two assists against his former '\n",
      "                 'club.\\n'\n",
      "                 '\\n'\n",
      "                 'The Egyptian delivered a magical performance from the first '\n",
      "                 'whistle, helping Liverpool establish a commanding 5-0 lead '\n",
      "                 'with just 10 minutes remaining. However, Roma scored two '\n",
      "                 'late goals to give themselves a glimmer of hope heading into '\n",
      "                 'the second leg.\\n'\n",
      "                 '\\n'\n",
      "                 'Salah apologized to the Roma fans after opening the scoring '\n",
      "                 'for Liverpool. He spent two years at the club before joining '\n",
      "                 'the Reds in June 2017.\\n'\n",
      "                 '\\n'\n",
      "                 'The game was closely contested in the opening 35 minutes '\n",
      "                 \"until Roberto Firmino got in behind Roma's defense for the \"\n",
      "                 'first time and flashed a low shot across the face of goal. '\n",
      "                 'The Brazilian forward looked to have been inches offside, '\n",
      "                 'but the linesman did not have a working flag to make the '\n",
      "                 'decision.\\n'\n",
      "                 '\\n'\n",
      "                 \"Liverpool's plans were dealt an early blow when Alex \"\n",
      "                 'Oxlade-Chamberlain was forced off with a knee injury '\n",
      "                 'sustained in a tackle with Aleksandar Kolorov. Georginio '\n",
      "                 'Wijnaldum replaced him.\\n'\n",
      "                 '\\n'\n",
      "                 \"Roma almost took a surprise lead when Cengiz Ünder's \"\n",
      "                 \"outswinging corner narrowly missed Edin Dzeko's head, and \"\n",
      "                 'the ball fell to Kolorov on the edge of the box. The former '\n",
      "                 'Manchester City left-back fired a shot at goal, but '\n",
      "                 'goalkeeper Loris Karius fumbled it onto the underside of the '\n",
      "                 'crossbar.\\n'\n",
      "                 '\\n'\n",
      "                 'Karius, who has spent much of his Liverpool career playing '\n",
      "                 'second fiddle to Simon Mignolet, has regularly given his own '\n",
      "                 'teammates and fans heart-stopping moments.\\n'\n",
      "                 '\\n'\n",
      "                 'Sadio Mane had two gilt-edged chances in the space of 49 '\n",
      "                 \"seconds to give Liverpool the lead. First, Firmino's clever \"\n",
      "                 'flick and pass were latched onto by Mane, whose exceptional '\n",
      "                 'first touch allowed him to get away from Federico Fazio, but '\n",
      "                 \"the Senegalese blazed his finish high over Alisson's \"\n",
      "                 \"crossbar. Moments later, Roma's high defensive line was \"\n",
      "                 'again exposed, this time as Firmino got in down the right to '\n",
      "                 'square the ball for Mane, but again his shot went high into '\n",
      "                 'the stands.\\n'\n",
      "                 '\\n'\n",
      "                 'Mane did finally have the ball in the net, prodding home '\n",
      "                 \"Andy Robertson's low cross, though this time he was met with \"\n",
      "                 \"the sight of the linesman's, now fully functioning, flag. \"\n",
      "                 \"That first Mane chance, though it wasn't taken, felt like \"\n",
      "                 'the watershed moment in the first half.\\n'\n",
      "                 '\\n'\n",
      "                 \"Wave after wave of red shirts began descending on Roma's \"\n",
      "                 'back line; every lunge increasingly desperate, every tackle '\n",
      "                 \"increasingly last ditch. It wasn't long before Liverpool's \"\n",
      "                 'star man, Salah, got in on the act, cutting inside and '\n",
      "                 'curling a shot which was palmed away by Alisson at full '\n",
      "                 'stretch.\\n'\n",
      "                 '\\n'\n",
      "                 \"Roma naively didn't heed that warning, again allowing Salah \"\n",
      "                 'to cut inside onto that magical left foot. This time it was '\n",
      "                 'inch perfect, the ball kissing the underside of the crossbar '\n",
      "                 'as the scrambling Alisson looked helplessly upwards.\\n'\n",
      "                 '\\n'\n",
      "                 'There was a brief intake of breath, a split second of '\n",
      "                 'silence as the 54,000 watching fans heard the noise of ball '\n",
      "                 'against woodwork, before Anfield exploded into life. Salah, '\n",
      "                 'facing his former employers for the first time since his '\n",
      "                 'summer transfer to Merseyside, immediately turned to where '\n",
      "                 'the away fans were congregated and put his hands together, '\n",
      "                 'asking for forgiveness.\\n'\n",
      "                 '\\n'\n",
      "                 'Bar the small pocket of Roma fans in the corner of the '\n",
      "                 'crowd, Anfield was delirious and the Liverpool players '\n",
      "                 'responded to a crowd which now was baying for more Roma '\n",
      "                 'blood. Again they came forward, this time Firmino crashed '\n",
      "                 'through a weak tackle from Kostas Manolas and threaded a '\n",
      "                 'pass through to Salah. With his first touch, the Egyptian '\n",
      "                 'dragged the ball into his path; with his second, he '\n",
      "                 'nonchalantly clipped it over the onrushing Alisson.\\n'\n",
      "                 '\\n'\n",
      "                 'For a moment it looked as though the backtracking Juan Jesus '\n",
      "                 'might be able to save Roma, but the wet turf quickly coaxed '\n",
      "                 'the ball over the line. Roma made a substitution at the '\n",
      "                 'start of the second half, Patrik Schick replacing Cengiz '\n",
      "                 'Ünder, in what proved to be a futile attempt to stem the '\n",
      "                 'tide.\\n'\n",
      "                 '\\n'\n",
      "                 'With less than 10 minutes on the clock, Salah turned '\n",
      "                 'provider, putting a pass on a plate for Mane to divert the '\n",
      "                 'ball into the far corner. Roma spent Monday night in '\n",
      "                 \"Liverpool's Titanic Hotel, and the Italians were now sinking \"\n",
      "                 'much faster.\\n'\n",
      "                 '\\n'\n",
      "                 'With Liverpool toying with their opponents and with one foot '\n",
      "                 'in the final in Kiev, Salah again burst down the right and '\n",
      "                 'this time played a pass across the face of goal which '\n",
      "                 'Firmino tapped into an empty net. Despite two goals and two '\n",
      "                 'assists to his name, Salah was still being allowed to roam '\n",
      "                 'free down the right flank with barely a Roma defender in '\n",
      "                 'sight.\\n'\n",
      "                 '\\n'\n",
      "                 \"Liverpool then looked to hammer the final nail into Roma's \"\n",
      "                 'coffin, as Firmino headed home his second goal of the game '\n",
      "                 \"from James Milner's corner. Based on the mood around the \"\n",
      "                 'ground and the swagger with which the Liverpool players were '\n",
      "                 'strutting around the pitch, you got the sense this tie was '\n",
      "                 'over.\\n'\n",
      "                 '\\n'\n",
      "                 'Perhaps it was that complacency, then, that allowed Roma to '\n",
      "                 'inexplicably breathe life into this semifinal. Liverpool '\n",
      "                 'fans welcome the team bus to Anfield before the game. With '\n",
      "                 'barely a shot to their name in the entire match, two goals '\n",
      "                 'in the space of four minutes -- the first a tidy finish from '\n",
      "                 \"Edin Dzeko, the second Diego Perotti's penalty -- put a \"\n",
      "                 'spring in the step of those Roma fans who had looked utterly '\n",
      "                 'dejected up until this point.\\n'\n",
      "                 '\\n'\n",
      "                 'The mood inside Anfield had now shifted completely. Barely '\n",
      "                 'minutes ago, the place had been rocking. Now there were '\n",
      "                 \"nerves as Liverpool's much-maligned leaky defense again \"\n",
      "                 \"began to show its fragility. Klopp's final change, defender \"\n",
      "                 'Ragnar Klavan for two-goal Firmino, mirrored the growing '\n",
      "                 'anxiety around the ground.\\n'\n",
      "                 '\\n'\n",
      "                 \"When the final whistle blew, it wasn't met with the same \"\n",
      "                 'euphoria had it been blown just 10 minutes earlier, but '\n",
      "                 \"Liverpool can still certainly be pleased with their night's \"\n",
      "                 \"work. Though Roma's two late goals slightly soured an \"\n",
      "                 'otherwise incredible Champions League night at Anfield, it '\n",
      "                 'will take a monumental effort if the Italians are to '\n",
      "                 'overturn a second consecutive three-goal deficit -- in '\n",
      "                 'particular with Salah in the opposition line up. But they '\n",
      "                 'did it against Lionel Messi and Barcelona. Who is to say '\n",
      "                 \"they can't do it again?\",\n",
      " 'status': 'SUCCESS'}\n"
     ]
    }
   ],
   "source": [
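    "import json\n",
    "from pprint import pprint\n",
    "\n",
    "# Load a single generated record (line index 12) from the JSONL output and display it.\n",
    "with open(\"./generated_refined_news.jsonl\", \"r\") as f:\n",
    "  record = json.loads(f.readlines()[12])\n",
    "\n",
    "pprint(record)"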
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}