{"id":34220,"date":"2024-02-12T06:00:00","date_gmt":"2024-02-12T00:30:00","guid":{"rendered":"https:\/\/debuggercafe.com\/?p=34220"},"modified":"2024-09-15T21:20:53","modified_gmt":"2024-09-15T15:50:53","slug":"getting-started-with-grammar-correction","status":"publish","type":"post","link":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/","title":{"rendered":"Getting Started with Grammar Correction using Hugging Face Transformers"},"content":{"rendered":"\n<p><strong>Grammar Correction<\/strong> is one of the major problems in NLP (Natural Language Processing). Tools like Grammarly that help with automated grammar correction are invaluable in modern online writing. A lot of online tools like Grammarly pop up almost every few months. And guess what? They are all powered by AI, or NLP to be precise. <em>But how do they work?<\/em> Although it is difficult to pinpoint how grammar correction tools work, we can take some safe guesses. Most probably, they have a Transformer model under the hood. Grammarly has an official blog post on how Transformers help in GEC (Grammatical Error Correction). In this article, although we will not be building any state-of-the-art grammar correction model, we will train a very simple model using T5.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-horizontal is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-499968f5 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-outline is-style-outline--1\"><a class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background wp-element-button\" href=\"#download-code\"><strong>Jump to Download Code<\/strong><\/a><\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-using-t5.gif\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"500\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-using-t5.gif\" alt=\"Example of Grammar correction using T5\" class=\"wp-image-34325\"\/><\/a><figcaption class=\"wp-element-caption\">Figure 1. Example of grammar correction using T5.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>We covered <strong><a href=\"https:\/\/debuggercafe.com\/spelling-correction-using-hugging-face-transformers\/\" target=\"_blank\" rel=\"noreferrer noopener\">spelling correction using T5<\/a><\/strong> in one of the previous articles. It was a minimal example to show how we can use Transformers for spelling correction. Similarly, in this article, we will touch upon every point briefly. This includes the dataset, the dataset preparation process, and the training. Our main focus is on creating a working solution with a <strong><em>code-first approach to grammar correction using Transformers<\/em><\/strong>. This will lead the way to future articles where we will dive deeper into this topic.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><em>We will cover the following points for grammar correction using Hugging Face Transformers<\/em><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>We will start with the dataset discussion. To be precise, we will use the First Certificate in English (FCE) dataset in this article.<\/em><\/li>\n\n\n\n<li><em>Next is the dataset preparation part. 
We need to prepare the dataset in such a way that we can feed it to the T5 model easily for training.<\/em><\/li>\n\n\n\n<li><em>Then comes the training of the T5 <strong><a href=\"https:\/\/debuggercafe.com\/transformer-neural-network\/\" target=\"_blank\" rel=\"noreferrer noopener\">Transformer model<\/a><\/strong> for grammar correction.<\/em><\/li>\n\n\n\n<li><em>Finally, we will run inference using the trained model.<\/em><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The FCE Dataset<\/h2>\n\n\n\n<p>The FCE (First Certificate in English) dataset is a subset of the Cambridge Learner Corpus (CLC). It is a part of the <strong><a href=\"https:\/\/www.cl.cam.ac.uk\/research\/nl\/bea2019st\/\" target=\"_blank\" rel=\"noreferrer noopener\">Building Educational Applications 2019 Shared Task: Grammatical Error Correction<\/a><\/strong> competition. The website hosts other datasets as well, but we are interested in the <strong><a href=\"https:\/\/www.cl.cam.ac.uk\/research\/nl\/bea2019st\/data\/fce_v2.1.bea19.tar.gz\" target=\"_blank\" rel=\"noreferrer noopener\">FCE v2.1<\/a><\/strong> dataset under the <strong>Data<\/strong> section.<\/p>\n\n\n\n<p>Downloading and extracting the dataset will reveal the following structure.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">fce_v2.1.bea19\n\u2514\u2500\u2500 fce\n    \u251c\u2500\u2500 json\n    \u2502\u00a0\u00a0 \u251c\u2500\u2500 fce.dev.json\n    \u2502\u00a0\u00a0 \u251c\u2500\u2500 fce.test.json\n    \u2502\u00a0\u00a0 \u2514\u2500\u2500 fce.train.json\n    \u251c\u2500\u2500 json_to_m2.py\n    \u251c\u2500\u2500 licence.txt\n    \u251c\u2500\u2500 m2\n    \u2502\u00a0\u00a0 \u251c\u2500\u2500 fce.dev.gold.bea19.m2\n    \u2502\u00a0\u00a0 \u251c\u2500\u2500 fce.test.gold.bea19.m2\n    \u2502\u00a0\u00a0 \u2514\u2500\u2500 fce.train.gold.bea19.m2\n    \u2514\u2500\u2500 readme.txt<\/pre>\n\n\n\n<p>All the data files will be extracted into the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">fce_v2.1.bea19\/fce<\/code> directory. There is a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">json<\/code> and an <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">m2<\/code> subdirectory. We will work with the JSON format of the dataset in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">json<\/code> subdirectory.<\/p>\n\n\n\n<p>It contains a <strong>training<\/strong>, a <strong>dev<\/strong>, and a <strong>test<\/strong> set. Now, let&#8217;s look at an example from the training set.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{\"text\": \"Dear Sir or Madam,\\n\\nI am writing in order to express my \ndisappointment about your musical show \\\"Over the Rainbow\\\".\\n\\nI \nsaws the show's advertisement hanging up of a wall in London where \nI was spending my holiday with some friends. 
I convinced them to go\nthere with me because I had heard good references about your \nCompany and, above all, about the main star, Danny Brook.\\n\\nThe \nproblems started in the box office, where we asked for the discounts\nyou announced in the advertisement, and the man who was selling \nthe tickets said that they didn't exist.\\n\\nMoreover, the show was \ndelayed forty-five minutes and the worst of all was that Danny Brook\nhad been replaced by another actor.\\n\\nOn the other hand, the \ntheatre restaurant was closed because unknown reasons.\\n\\nYou \npromised a perfect evening but it became a big disastrous!\\n\\nI \nwould like some kind of explanation and receive my money back. If \nyou don't agree, I will act consequently.\\n\\nI look forward to \nhearing from you.\\n\\nYours faithfully,\", \"age\": \"21-25\", \"q\": \"1\",\n\"script-s\": \"31\", \"edits\": [[0, [[71, 76, \"with\", \"RT\"], \n[118, 122, \"saw\", \"IV\"], [159, 161, \"on\", \"RT\"], [292, 302, \n\"reviews\", \"RN\"], [303, 308, \"of\", \"RT\"], [338, 343, \n\"because of\", \"RT\"], [394, 396, \"at\", \"RT\"], [681, 698, \n\"In addition\", \"ID\"], [734, 741, \"for\", \"R\"], \n[811, 821, \"disaster\", \"DN\"], [866, 873, \"to get\", \"FV\"], \n[920, 932, \"\", \"UY\"]]]], \"l1\": \"ca\", \"id\": \"TR1*0102*2000*01\", \n\"answer-s\": \"4.3\"}<\/pre>\n\n\n\n<p>Each sample is in a dictionary format with several key-value pairs. Among them, we are interested in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">\"text\"<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">\"edits\"<\/code> key-value pairs.<\/p>\n\n\n\n<p>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">\"text\"<\/code> key contains the essay as it was originally written by the author with a few grammatical errors. The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">\"edits\"<\/code> key contains the correction edits in the following format:<\/p>\n\n\n\n<p><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">[[annotator_id, [[char_start_offset, char_end_offset, correction, error_type], \u2026]], \u2026]<\/code><\/p>\n\n\n\n<p>We are most interested in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">char_start_offset<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">char_end_offset<\/code>, and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">correction<\/code> values. The first two values give the character offsets of the erroneous span, excluding the newline symbols (<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">'\\n'<\/code>). For example, in the first edit, the characters from 71 to 76 correspond to the word <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">about<\/code>, which should be replaced with <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">with<\/code>.<\/p>\n\n\n\n<p>In total, there are <strong>2116 samples in the training set<\/strong>, <strong>159 samples in the dev set<\/strong>, and <strong>194 samples in the test set<\/strong>.<\/p>\n\n\n\n<p>In its current format, the dataset is not easy to feed to the T5 model for training. For this reason, we will preprocess it into a simpler format, which we will then use while preparing the data for the model. 
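To make the edit format concrete, here is a minimal, illustrative sketch of how a single edit can be applied by slicing the text at the given character offsets. The sentence and the offset computation below are made up for illustration; the full preprocessing script later in the article handles whole essays and multiple edits, where the offsets need careful bookkeeping.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Minimal sketch: apply one (char_start_offset, char_end_offset, correction) edit.\ntext = \"I am writing to express my disappointment about your musical show.\"\n\n# In the dataset, the offsets come from the \"edits\" list; here we compute\n# them for this made-up sentence just to keep the example self-contained.\nstart = text.index(\"about\")\nend = start + len(\"about\")\ncorrection = \"with\"\n\nprint(text[:start] + correction + text[end:])\n# I am writing to express my disappointment with your musical show.<\/pre>\n\n\n\n<p>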
We will carry out the full preprocessing in the coding section of the article.<\/p>\n\n\n\n<p>For now, you can go ahead and download the dataset.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Project Directory Structure<\/h2>\n\n\n\n<p>Let&#8217;s take a look at the complete directory structure.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\u251c\u2500\u2500 final_model_t5_small\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 added_tokens.json\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 config.json\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 generation_config.json\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 pytorch_model.bin\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 special_tokens_map.json\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 spiece.model\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 tokenizer_config.json\n\u251c\u2500\u2500 input\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 fce_v2.1.bea19\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 final\n\u2502\u00a0\u00a0     \u251c\u2500\u2500 test.json\n\u2502\u00a0\u00a0     \u251c\u2500\u2500 train.json\n\u2502\u00a0\u00a0     \u2514\u2500\u2500 valid.json\n\u251c\u2500\u2500 results_t5_small\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 checkpoint-5500\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 checkpoint-6500\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 events.out.tfevents.1703464284.sovitdl.18962.0\n\u251c\u2500\u2500 preprocess_fce.py\n\u2514\u2500\u2500 t5_small.ipynb<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">final_model_t5_small<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">results_t5_small<\/code> directories contain the trained model and tokenizer after the training is done.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">input<\/code> directory contains the FCE dataset that we explored in the previous section. Along with that, it contains a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">final<\/code> directory with three JSON files. We will obtain these after executing the preprocessing script.<\/li>\n\n\n\n<li>Directly inside the project directory, we have two files:\n<ul class=\"wp-block-list\">\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">t5_small.ipynb<\/code>, which contains the code to train the T5 model for grammar correction and run inference.<\/li>\n\n\n\n<li>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">preprocess_fce.py<\/code> script that we will use to obtain the JSON files in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">input\/final<\/code> directory.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#ffb76a\"><strong><em>You can download the Jupyter Notebook for training &amp; inference along with the best weights via the &#8220;Download Code&#8221; section.<\/em><\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Dependencies<\/h2>\n\n\n\n<p>Before we move forward, we need to ensure that the environment is properly set up. We need the <strong>PyTorch<\/strong> framework for running the code in this article. 
Please go ahead and install it according to your configuration from the <strong><a href=\"https:\/\/pytorch.org\/get-started\/locally\/\" target=\"_blank\" rel=\"noreferrer noopener\">official site<\/a><\/strong>.<\/p>\n\n\n\n<p>Along with that, we need to install the Hugging Face <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">transformers<\/code> and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">datasets<\/code> libraries.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install transformers<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install datasets<\/pre>\n\n\n\n<p>That&#8217;s it. We are done with all the major dependencies that we need.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-align-center\" id=\"download-code\">Download Code<\/h3>\n\n\n\n<div class=\"wp-block-button is-style-outline center\"><a data-sumome-listbuilder-id=\"31822ec6-32dd-42d5-aeb8-29dc852bb7bd\" class=\"wp-block-button__link has-black-color has-luminous-vivid-orange-background-color has-text-color has-background\"><b>Download the Source Code for this Tutorial<\/b><\/a><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Grammar Correction using Hugging Face Transformers and the T5 Model<\/h2>\n\n\n\n<p>Let&#8217;s get into the coding part of the article. The first step, as we discussed earlier, is to bring the FCE dataset into a simpler format.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Preprocessing the FCE Dataset<\/h3>\n\n\n\n<p>For that, we have a simple <strong><em>preprocessing script in the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">preprocess_fce.py<\/code> file<\/em><\/strong>. 
Here is the entire content of the file.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"preprocess_fce.py\" data-enlighter-group=\"preprocess_fce_1\">import json\nimport os\n\nROOTS = [\n    'input\/fce_v2.1.bea19\/fce\/json\/fce.train.json',\n    'input\/fce_v2.1.bea19\/fce\/json\/fce.dev.json',\n    'input\/fce_v2.1.bea19\/fce\/json\/fce.test.json'\n]\nSPLITS = [\n    'train', \n    'valid', \n    'test'\n]\n\nsave_dir = 'input\/final'\nos.makedirs(save_dir, exist_ok=True)\n\ndef replace_multiple_substrings(original_string, replacements):\n    # replacements is expected to be a list of tuples, each containing:\n    # (start_index, end_index, new_substring)\n\n    # Sort replacements by start_index to handle replacements in order\n    replacements.sort(key=lambda x: x[0])\n\n    result = original_string\n    offset = 0  # This offset is necessary because the string length may change\n\n    for start_index, end_index, new_substring in replacements:\n        # Adjust indices based on the current offset\n        adjusted_start = start_index + offset\n        adjusted_end = end_index + offset\n\n        # Check for invalid indices\n        if adjusted_start &lt; 0 or adjusted_end > len(result) or adjusted_start > adjusted_end:\n            print(f\"Error: Invalid indices for replacement '{new_substring}'. Skipping.\")\n            continue\n\n        # Replace the specified part of the string\n        result = result[:adjusted_start] + str(new_substring) + result[adjusted_end:]\n\n        # Update the offset based on how the length of the string has changed\n        offset += len(str(new_substring)) - (end_index - start_index)\n\n    return result\n\nfor root, split in zip(ROOTS, SPLITS):\n    data = []\n    data_points = []\n    with open(root, 'r') as f:\n        for line in f:\n            data.append(json.loads(line))\n    \n    for i in range(len(data)):\n        str_data = data[i]['text']\n        # Use the first annotator's edit set.\n        re_data = data[i]['edits'][0][1]\n        # Keep only (start, end, correction) from each edit; drop the error-type code.\n        modified_string = replace_multiple_substrings(\n            str_data, [edit[:3] for edit in re_data]\n        )\n        \n        data_point = {\n            'original': str_data,\n            'corrected': modified_string\n        }\n        data_points.append(data_point)\n    \n    with open(os.path.join(save_dir, split+'.json'), 'w') as f:\n        json.dump(data_points, f, indent=4)<\/pre>\n\n\n\n<p>This is a simple script that converts the original FCE dataset into a much simpler format. A simpler dataset format will later reduce the code that we need to write while training the model. 
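Running the script once from the project root generates the three JSON files. The command and the small sanity-check snippet below are illustrative; they assume the directory layout shown earlier and the sample counts mentioned in the dataset section.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">python preprocess_fce.py<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import json\n\n# Count the samples in each generated split.\n# Expected: 2116 (train), 159 (valid), and 194 (test).\nfor split in ['train', 'valid', 'test']:\n    with open(f'input\/final\/{split}.json') as f:\n        print(split, len(json.load(f)))<\/pre>\n\n\n\n<p>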
<\/p>\n\n\n\n<p>In short, the above script:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Takes the original train, dev, and test files of the FCE dataset.<\/li>\n\n\n\n<li>Reads the original text and the edits from the JSON files.<\/li>\n\n\n\n<li>According to the edits, creates a new text with the corrected words in place.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>After executing the above script, you will find the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">train.json<\/code>, <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">valid.json<\/code>, and <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">test.json<\/code> files inside the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">input\/final<\/code> directory.<\/p>\n\n\n\n<p>Following is a sample from the training split.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{\n        \"original\": \"Dear Sir or Madam,\\n\\nI am writing in order to express my \ndisappointment about your musical show \\\"Over the Rainbow\\\".\\n\\nI saws the \nshow's advertisement hanging up of a wall in London where I was spending my \nholiday with some friends. I convinced them to go there with me because I had \nheard good references about your Company and, above all, about the main star, \nDanny Brook.\\n\\nThe problems started in the box office, where we asked for the \ndiscounts you announced in the advertisement, and the man who was selling the \ntickets said that they didn't exist.\\n\\nMoreover, the show was delayed forty-five \nminutes and the worst of all was that Danny Brook had been replaced by another \nactor.\\n\\nOn the other hand, the theatre restaurant was closed because unknown \nreasons.\\n\\nYou promised a perfect evening but it became a big disastrous!\\n\\nI \nwould like some kind of explanation and receive my money back. If you don't agree, \nI will act consequently.\\n\\nI look forward to hearing from you.\\n\\nYours faithfully,\",\n        \"corrected\": \"Dear Sir or Madam,\\n\\nI am writing in order to express my \ndisappointment with your musical show \\\"Over the Rainbow\\\".\\n\\nI saw the show's \nadvertisement hanging up on a wall in London where I was spending my holiday with \nsome friends. I convinced them to go there with me because I had heard good reviews \nof your Company and, above all, because of the main star, Danny Brook.\\n\\nThe \nproblems started at the box office, where we asked for the discounts you announced \nin the advertisement, and the man who was selling the tickets said that they didn't \nexist.\\n\\nMoreover, the show was delayed forty-five minutes and the worst of all \nwas that Danny Brook had been replaced by another actor.\\n\\nIn addition, the theatre \nrestaurant was closed for unknown reasons.\\n\\nYou promised a perfect evening but it \nbecame a big disaster!\\n\\nI would like some kind of explanation and to get my money \nback. 
If you don't agree, I will act .\\n\\nI look forward to \nhearing from you.\\n\\nYours \nfaithfully,\"\n}<\/pre>\n\n\n\n<p>So, for each sample, we now have a dictionary-like format with an <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">\"original\"<\/code> key and a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">\"corrected\"<\/code> key.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">T5 for Grammar Correction<\/h3>\n\n\n\n<p>Now, let&#8217;s jump into the actual training notebook. The code here follows the content in the <strong><em><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">t5_small.ipynb<\/code><\/em><\/strong> Jupyter Notebook.<\/p>\n\n\n\n<p>In case you want a brief overview of the T5 Transformer model, please take a look at the <a href=\"https:\/\/debuggercafe.com\/spelling-correction-using-hugging-face-transformers\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>spelling correction<\/strong><\/a> article. In that article, we used the T5 model for single-word spelling correction, and it may be a good starting point if you are new to this topic.<\/p>\n\n\n\n<p>Let&#8217;s start with the import statements.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import torch\n\nfrom transformers import (\n    T5Tokenizer,\n    T5ForConditionalGeneration,\n    TrainingArguments, \n    Trainer\n)\nfrom datasets import load_dataset<\/pre>\n\n\n\n<p>From the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">transformers<\/code> library, we import:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">T5Tokenizer<\/code>: To tokenize the dataset, which consists of grammatically incorrect and correct sentences.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">T5ForConditionalGeneration<\/code>: To load the pretrained T5 model.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">TrainingArguments<\/code>: This class initializes all the training arguments before starting the training.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">Trainer<\/code>: To initialize the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">Trainer<\/code> object so that we can train the T5 model.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>We also import the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">load_dataset<\/code> function from the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">datasets<\/code> library to load the prepared JSON files in a format that is directly compatible with the rest of the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">transformers<\/code> pipeline.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Loading the Dataset<\/h3>\n\n\n\n<p>Next, we load the preprocessed training, validation, and test datasets.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">dataset_train = load_dataset(\n    'json', \n    data_files='input\/final\/train.json', \n    split='train'\n)\ndataset_valid = load_dataset(\n    
'json', \n    data_files='input\/final\/valid.json', \n    split='train'\n)\ndataset_test = load_dataset(\n    'json', \n    data_files='input\/final\/test.json', \n    split='train'\n)<\/pre>\n\n\n\n<p>When using the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">load_dataset<\/code> function, the first argument is always the type of dataset that we are loading. As our dataset is in JSON format, we pass <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">'json'<\/code>.<\/p>\n\n\n\n<p>One other point to note here is the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">split<\/code> argument. When loading external datasets, it becomes mandatory to give the <code>split<\/code> as <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">'train'<\/code>. However, that does not change any attribute of the dataset, so we can use each split as originally intended.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Defining Dataset and Training Configurations<\/h3>\n\n\n\n<p>The following code block contains a few dataset- and training-related configurations.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">MODEL = 't5-small'\nBATCH_SIZE = 16\nMAX_LENGTH = 256\nEPOCHS = 50\nNUM_WORKERS = 8\nOUT_DIR = 'results_t5_small'<\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">MODEL<\/code>: This is the model name that we will pass while loading the tokenizer and the model weights. For our grammar correction use case, we are using the <strong>T5 Small<\/strong> model.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">BATCH_SIZE<\/code>: We are using a <strong>batch size of 16<\/strong> for the data loaders.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">MAX_LENGTH<\/code>: This is the maximum context length to consider for each sample in the JSON files. Beyond this length, the text <strong>samples will be truncated<\/strong> and <strong>shorter samples will be padded<\/strong>.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">EPOCHS<\/code>: The number of epochs to train the model for.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">NUM_WORKERS<\/code>: The number of parallel workers for the data loaders.<\/li>\n\n\n\n<li><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">OUT_DIR<\/code>: This is the output directory to save intermediate results.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tokenizing the FCE Dataset<\/h3>\n\n\n\n<p>Tokenization assigns an integer ID to each word, breaking a word down into smaller sub-word pieces when necessary. This is a much simplified explanation of what goes on inside. 
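For a quick intuition, here is a tiny, standalone example of what the T5 tokenizer does to a short input (illustrative only; the exact sub-word pieces you get depend on the pretrained vocabulary):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from transformers import T5Tokenizer\n\ntokenizer = T5Tokenizer.from_pretrained('t5-small')\n\n# Tokenize one grammatically incorrect sentence.\nids = tokenizer(\"He don't like to eat vegetables.\")['input_ids']\nprint(ids)\n# Map the integer IDs back to the sub-word pieces they stand for.\nprint(tokenizer.convert_ids_to_tokens(ids))<\/pre>\n\n\n\n<p>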
However, explaining the entire process of tokenization is out of the scope of this article.<\/p>\n\n\n\n<p>Let&#8217;s see how we can tokenize the dataset that we have just loaded above.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tokenizer = T5Tokenizer.from_pretrained(MODEL)\n\n# Function to convert text data into model inputs and targets\ndef preprocess_function(examples):\n    inputs = [f\"rectify: {inc}\" for inc in examples['original']]\n    model_inputs = tokenizer(\n        inputs, \n        max_length=MAX_LENGTH, \n        truncation=True,\n        padding='max_length'\n    )\n\n    # Set up the tokenizer for targets\n    with tokenizer.as_target_tokenizer():\n        labels = tokenizer(\n            examples['corrected'], \n            max_length=MAX_LENGTH, \n            truncation=True,\n            padding='max_length'\n        )\n\n    model_inputs[\"labels\"] = labels[\"input_ids\"]\n    return model_inputs\n\n# Apply the function to the whole dataset\ntokenized_train = dataset_train.map(\n    preprocess_function, \n    batched=True,\n    num_proc=8\n)\ntokenized_valid = dataset_valid.map(\n    preprocess_function, \n    batched=True,\n    num_proc=8\n)\ntokenized_test = dataset_test.map(\n    preprocess_function, \n    batched=True,\n    num_proc=8\n)<\/pre>\n\n\n\n<p>The first step is loading the tokenizer. We load the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">T5Tokenizer<\/code> on the first line of the above code cell while passing the model name that we defined earlier.<\/p>\n\n\n\n<p>In the second step, we have a <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">preprocess_function<\/code>. This accepts samples from the loaded dataset. Each sample consists of the <strong>original grammatically incorrect<\/strong> and the <strong>modified grammatically correct<\/strong> text. Note that we are prepending the <strong>rectify<\/strong> prefix to each of the original incorrect texts. T5 models work best when given a task-specific prefix. As we are correcting grammatical errors here, we use a prefix that reflects that task.<\/p>\n\n\n\n<p>Theoretically, it is possible to pass any string as the prefix. However, using something that aligns with the task is much better.<\/p>\n\n\n\n<p>The inputs to the T5 model are going to be the incorrect sentences and the targets (<code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">labels<\/code>) will be the correct sentences. Finally, we return a dictionary that contains both the tokenized inputs and the tokenized targets.<\/p>\n\n\n\n<p>The third step involves mapping all three splits to the preprocessing function. 
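This runs the function over each split in batches and stores the tokenized columns alongside the original ones. As a quick, optional sanity check (a small sketch; the field names follow the preprocessing function above), we can confirm that every sample was padded or truncated to the maximum length:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">sample = tokenized_train[0]\n\n# Both lengths should equal MAX_LENGTH (256) because of padding='max_length'.\nprint(len(sample['input_ids']), len(sample['labels']))<\/pre>\n\n\n\n<p>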
The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">num_proc<\/code> argument defines how many parallel processes are used for tokenization.<\/p>\n\n\n\n<p>If you are new to NLP, then you can start with the following text classification articles, which will help you better understand the tokenization pipeline.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/debuggercafe.com\/text-classification-using-pytorch\/\" target=\"_blank\" rel=\"noreferrer noopener\">Getting Started with Text Classification using Pytorch, NLP, and Deep Learning<\/a><\/strong><\/li>\n\n\n\n<li><strong><a href=\"https:\/\/debuggercafe.com\/disaster-tweet-classification-using-pytorch\/\" target=\"_blank\" rel=\"noreferrer noopener\">Disaster Tweet Classification using PyTorch<\/a><\/strong><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Loading the T5 Model<\/h3>\n\n\n\n<p>Now, let&#8217;s load the T5 model and check the number of trainable parameters.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Load the pretrained T5 model\nmodel = T5ForConditionalGeneration.from_pretrained(MODEL)\n\n# Specify the device\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\n# Total parameters and trainable parameters.\ntotal_params = sum(p.numel() for p in model.parameters())\nprint(f\"{total_params:,} total parameters.\")\ntotal_trainable_params = sum(\n    p.numel() for p in model.parameters() if p.requires_grad)\nprint(f\"{total_trainable_params:,} training parameters.\")<\/pre>\n\n\n\n<p>We are using the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">from_pretrained<\/code> method of the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">T5ForConditionalGeneration<\/code> class to load the pretrained T5 Small model. It contains around <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">60.5 million parameters<\/code>, which is enough for getting started with our journey of GEC (Grammatical Error Correction).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Defining the Training Arguments<\/h3>\n\n\n\n<p>We will use the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">TrainingArguments<\/code> class to initialize all the training arguments.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Define the training arguments\ntraining_args = TrainingArguments(\n    output_dir=OUT_DIR,          \n    num_train_epochs=EPOCHS,\n    per_device_train_batch_size=BATCH_SIZE,\n    per_device_eval_batch_size=BATCH_SIZE*2,\n    warmup_steps=500,\n    weight_decay=0.01,\n    logging_dir=OUT_DIR,\n    evaluation_strategy='steps',\n    save_steps=500,\n    eval_steps=500,\n    load_best_model_at_end=True,\n    save_total_limit=2,\n    report_to='tensorboard',\n    dataloader_num_workers=NUM_WORKERS\n)\n<\/pre>\n\n\n\n<p>It accepts several arguments (more than 100 to be precise). 
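Since <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">TrainingArguments<\/code> is a Python dataclass, you can count the available options yourself (a small sketch; the exact number varies across <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">transformers<\/code> versions):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import dataclasses\n\nfrom transformers import TrainingArguments\n\n# TrainingArguments is a dataclass; count its configurable fields.\nprint(len(dataclasses.fields(TrainingArguments)))<\/pre>\n\n\n\n<p>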
However, in the training arguments above, we only pass the ones necessary for our use case.<\/p>\n\n\n\n<p>According to the arguments:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The model will be evaluated and saved every 500 steps. But only two checkpoints will be kept on disk at a time; older ones will be deleted.<\/li>\n\n\n\n<li>The best model will be loaded at the end so that we can save it one final time before proceeding to the inference section.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Starting the Training for Grammar Correction using T5<\/h3>\n\n\n\n<p>Before starting the training, we need to initialize the <strong>Trainer API<\/strong> as well. The next code cell does that and starts the training.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Create the Trainer instance\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataset=tokenized_train,\n    eval_dataset=tokenized_valid,\n)\n\n# Start training\nhistory = trainer.train()<\/pre>\n\n\n\n<p>The <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">Trainer<\/code> class accepts the model, the above-defined training arguments, and the training &amp; validation datasets.<\/p>\n\n\n\n<p>We invoke the <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">train<\/code> method of the instance to start the training.<\/p>\n\n\n\n<p>Here are the training logs after 50 epochs.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-training-logs.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"309\" height=\"406\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-training-logs.png\" alt=\"T5 training logs for grammar correction.\" class=\"wp-image-34330\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-training-logs.png 309w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-training-logs-228x300.png 228w\" sizes=\"auto, (max-width: 309px) 100vw, 309px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 2. T5 training logs for grammar correction.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The model reached its best validation loss at 5500 steps, after which it began to deteriorate. But as we are loading the best model after training, we can save the final model and tokenizer to disk to use at inference time. 
Let&#8217;s do that.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tokenizer.save_pretrained('final_model_t5_small')\nmodel.save_pretrained('final_model_t5_small')<\/pre>\n\n\n\n<p>Next, we can evaluate the model on the test set as well.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">trainer.evaluate(tokenized_test)<\/pre>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-evaluation-logs.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"391\" height=\"128\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-evaluation-logs.png\" alt=\"Grammar correction evaluation loss logs using the trained T5 model.\" class=\"wp-image-34332\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-evaluation-logs.png 391w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-evaluation-logs-300x98.png 300w\" sizes=\"auto, (max-width: 391px) 100vw, 391px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 3. Grammar correction evaluation loss logs using the trained T5 model.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The evaluation loss on the test set is 0.47. One important point to note here is that we are evaluating the grammar correction model based on the loss alone, which is not an ideal measure of correction quality. In future posts, we will explore more accurate metrics for grammar correction models.<\/p>\n\n\n\n<p>Following is the evaluation loss graph from the above training.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-small-eval-loss.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"631\" height=\"436\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-small-eval-loss.png\" alt=\"Evaluation loss graph after training T5 on FCE dataset.\" class=\"wp-image-34335\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-small-eval-loss.png 631w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-small-eval-loss-300x207.png 300w\" sizes=\"auto, (max-width: 631px) 100vw, 631px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 4. Evaluation loss graph after training T5 on FCE dataset.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>As we can see, the loss kept decreasing almost till the end of the training. To continue training further, we would most probably have to apply a learning rate scheduling technique.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Grammar Correction Inference using the Trained T5 Model<\/h2>\n\n\n\n<p>Let&#8217;s move on to the inference phase now. 
Following are the steps to carry out for grammar correction inference:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>First, we will load the best model weights and tokenizer from the disk.<\/li>\n\n\n\n<li>Second, we will write a helper function for grammar correction inference.<\/li>\n\n\n\n<li>Third, we will pass a list of sentences to the helper function to get the corrected sentences.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from transformers import T5ForConditionalGeneration, T5Tokenizer\n\nmodel_path = 'final_model_t5_small'  # the path where you saved your model\nmodel = T5ForConditionalGeneration.from_pretrained(model_path)\ntokenizer = T5Tokenizer.from_pretrained(model_path)<\/pre>\n\n\n\n<p>We load the final model and tokenizer from the disk in the above code block.<\/p>\n\n\n\n<p>Now, let&#8217;s write a simple helper function called <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">do_correction<\/code>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def do_correction(text, model, tokenizer):\n    input_text = f\"rectify: {text}\"\n    inputs = tokenizer.encode(\n        input_text,\n        return_tensors='pt',\n        max_length=256,\n        padding='max_length',\n        truncation=True\n    )\n\n    # Get correct sentence ids.\n    corrected_ids = model.generate(\n        inputs,\n        max_length=384,\n        num_beams=5,\n        early_stopping=True\n    )\n\n    # Decode.\n    corrected_sentence = tokenizer.decode(\n        corrected_ids[0],\n        skip_special_tokens=True\n    )\n    return corrected_sentence<\/pre>\n\n\n\n<p>It prepares the input text, generates the corrected token IDs by a forward pass (with beam search) through the model, and decodes those IDs to return the final text.<\/p>\n\n\n\n<p>Please note that we prepend the same <strong>rectify<\/strong> prefix here as well before each sentence.<\/p>\n\n\n\n<p>Finally, we define a few sentences in a list and pass them through the model.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">sentences = [\n    \"He don't like to eat vegetables.\",\n    \"They was going to the store yesterday.\",\n    \"She don't sings very well.\",\n    \"Between you and I, the decision not well received.\",\n    \"The book I borrowed from the library, it was really interesting.\",\n    \"Despite of the rain, they went for a picnic.\"\n]\n\nfor sentence in sentences:\n    corrected_sentence = do_correction(sentence, model, tokenizer)\n    print(f\"ORIG: {sentence}\\nCORRECT: {corrected_sentence}\")<\/pre>\n\n\n\n<p>Here are the outputs.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-inference.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"556\" height=\"205\" 
src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-inference.png\" alt=\"Grammar correction inference results using the trained T5 model.\" class=\"wp-image-34337\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-inference.png 556w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-t5-inference-300x111.png 300w\" sizes=\"auto, (max-width: 556px) 100vw, 556px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 5. Grammar correction inference results using the trained T5 model.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>The results are really good. The model can correct all the grammatical mistakes in the sentences. However, the T5 Small model does not do very well on long sentences with multiple errors. Here is an example.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-inference-bad-performance-example.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"600\" src=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-inference-bad-performance-example.png\" alt=\"T5 grammar correction sub-optimal performance.\" class=\"wp-image-34339\" srcset=\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-inference-bad-performance-example.png 600w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-inference-bad-performance-example-300x300.png 300w, https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/grammar-correction-inference-bad-performance-example-150x150.png 150w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 6. An example of T5 grammar correction sub-optimal performance.<\/figcaption><\/figure>\n<\/div>\n\n\n<p>There are two errors in the corrected sentence. The model failed to rectify &#8220;<strong><em>data it<\/em><\/strong>&#8221; and in the final sentence &#8220;<strong><em>Its<\/em><\/strong>&#8221; should have been either &#8220;<strong><em>It is<\/em><\/strong>&#8221; or &#8220;<strong><em>It&#8217;s<\/em><\/strong>&#8220;.<\/p>\n\n\n\n<p>In future articles, we will see how larger models with better training strategies can handle such cases.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary and Conclusion<\/h2>\n\n\n\n<p>In this article, we went through a code-first approach for grammar correction using Hugging Face Transformers. We trained the T5 Small model on the FCE dataset and ran inference on some unseen sentences. In the end, we also checked whether the model lacks, that is, long text with multiple errors. We will tackle these issues in future articles. I hope this article was worth your time.<\/p>\n\n\n\n<p>If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.<\/p>\n\n\n\n<p>You can contact me using the <strong><a aria-label=\"Contact (opens in a new tab)\" href=\"https:\/\/debuggercafe.com\/contact-us\/\" target=\"_blank\" rel=\"noreferrer noopener\">Contact<\/a><\/strong> section. 
You can also find me on <strong><a aria-label=\"LinkedIn (opens in a new tab)\" href=\"https:\/\/www.linkedin.com\/in\/sovit-rath\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a><\/strong> and <strong><a href=\"https:\/\/x.com\/SovitRath5\" target=\"_blank\" rel=\"noreferrer noopener\">X<\/a><\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Further Reading<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/debuggercafe.com\/character-level-text-generation-using-lstm\/\" target=\"_blank\" rel=\"noreferrer noopener\">Character Level Text Generation using LSTM<\/a><\/strong><\/li>\n\n\n\n<li><strong><a href=\"https:\/\/debuggercafe.com\/word-level-text-generation-using-lstm\/\" target=\"_blank\" rel=\"noreferrer noopener\">Word Level Text Generation using LSTM<\/a><\/strong><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we train the T5 Transformer model for Grammar Correction on the FCE dataset.<\/p>\n","protected":false},"author":1,"featured_media":34343,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[727,409,434],"tags":[730,731,735,736,734,732,728,733,738,737],"class_list":["post-34220","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-grammar-error-correction-gec","category-nlp","category-transformer","tag-gec","tag-grammar-correction","tag-grammar-correction-model","tag-grammar-correction-on-fce-dataset","tag-grammar-correction-using-hugging-face-transformers","tag-grammar-correction-using-t5","tag-grammar-error-correction","tag-t5-for-grammar-correction","tag-t5-grammar-correction","tag-training-t5-for-grammar-correction"]}
property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Sovit Ranjan Rath\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:site\" content=\"@SovitRath5\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sovit Ranjan Rath\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"24 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/\"},\"author\":{\"name\":\"Sovit Ranjan Rath\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"headline\":\"Getting Started with Grammar Correction using Hugging Face Transformers\",\"datePublished\":\"2024-02-12T00:30:00+00:00\",\"dateModified\":\"2024-09-15T15:50:53+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/\"},\"wordCount\":2339,\"commentCount\":1,\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/Getting-Started-with-Grammar-Correction-using-Hugging-Face-Transformers-e1703810061918.png\",\"keywords\":[\"GEC\",\"Grammar Correction\",\"Grammar Correction Model\",\"Grammar Correction on FCE Dataset\",\"Grammar Correction using Hugging Face Transformers\",\"Grammar Correction using T5\",\"Grammar Error Correction\",\"T5 for Grammar Correction\",\"T5 Grammar Correction\",\"Training T5 for Grammar Correction\"],\"articleSection\":[\"Grammar Error Correction (GEC)\",\"NLP\",\"Transformer\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/\",\"url\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/\",\"name\":\"Grammar Correction using Hugging Face Transformers T5 on FCE Dataset\",\"isPartOf\":{\"@id\":\"https:\/\/debuggercafe.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/Getting-Started-with-Grammar-Correction-using-Hugging-Face-Transformers-e1703810061918.png\",\"datePublished\":\"2024-02-12T00:30:00+00:00\",\"dateModified\":\"2024-09-15T15:50:53+00:00\",\"author\":{\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\"},\"description\":\"Grammar Correction using T5 on the FCE dataset using Hugging Face Transformers and the PyTorch deep learning 
framework.\",\"breadcrumb\":{\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#primaryimage\",\"url\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/Getting-Started-with-Grammar-Correction-using-Hugging-Face-Transformers-e1703810061918.png\",\"contentUrl\":\"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/Getting-Started-with-Grammar-Correction-using-Hugging-Face-Transformers-e1703810061918.png\",\"width\":1000,\"height\":563,\"caption\":\"Getting Started with Grammar Correction using Hugging Face Transformers\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/debuggercafe.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Getting Started with Grammar Correction using Hugging Face Transformers\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/debuggercafe.com\/#website\",\"url\":\"https:\/\/debuggercafe.com\/\",\"name\":\"DebuggerCafe\",\"description\":\"Machine Learning and Deep Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/debuggercafe.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752\",\"name\":\"Sovit Ranjan Rath\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g\",\"caption\":\"Sovit Ranjan Rath\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Grammar Correction using Hugging Face Transformers T5 on FCE Dataset","description":"Grammar Correction using T5 on the FCE dataset using Hugging Face Transformers and the PyTorch deep learning framework.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/","og_locale":"en_US","og_type":"article","og_title":"Grammar Correction using Hugging Face Transformers T5 on FCE Dataset","og_description":"Grammar Correction using T5 on the FCE dataset using Hugging Face Transformers and the PyTorch deep learning framework.","og_url":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/","og_site_name":"DebuggerCafe","article_publisher":"https:\/\/www.facebook.com\/profile.php?id=100013731104496","article_published_time":"2024-02-12T00:30:00+00:00","article_modified_time":"2024-09-15T15:50:53+00:00","og_image":[{"width":1000,"height":563,"url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/Getting-Started-with-Grammar-Correction-using-Hugging-Face-Transformers-e1703810061918.png","type":"image\/png"}],"author":"Sovit Ranjan Rath","twitter_card":"summary_large_image","twitter_creator":"@SovitRath5","twitter_site":"@SovitRath5","twitter_misc":{"Written by":"Sovit Ranjan Rath","Est. reading time":"24 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#article","isPartOf":{"@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/"},"author":{"name":"Sovit Ranjan Rath","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"headline":"Getting Started with Grammar Correction using Hugging Face Transformers","datePublished":"2024-02-12T00:30:00+00:00","dateModified":"2024-09-15T15:50:53+00:00","mainEntityOfPage":{"@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/"},"wordCount":2339,"commentCount":1,"image":{"@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/Getting-Started-with-Grammar-Correction-using-Hugging-Face-Transformers-e1703810061918.png","keywords":["GEC","Grammar Correction","Grammar Correction Model","Grammar Correction on FCE Dataset","Grammar Correction using Hugging Face Transformers","Grammar Correction using T5","Grammar Error Correction","T5 for Grammar Correction","T5 Grammar Correction","Training T5 for Grammar Correction"],"articleSection":["Grammar Error Correction (GEC)","NLP","Transformer"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/","url":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/","name":"Grammar Correction using Hugging Face Transformers T5 on FCE 
Dataset","isPartOf":{"@id":"https:\/\/debuggercafe.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#primaryimage"},"image":{"@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#primaryimage"},"thumbnailUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/Getting-Started-with-Grammar-Correction-using-Hugging-Face-Transformers-e1703810061918.png","datePublished":"2024-02-12T00:30:00+00:00","dateModified":"2024-09-15T15:50:53+00:00","author":{"@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752"},"description":"Grammar Correction using T5 on the FCE dataset using Hugging Face Transformers and the PyTorch deep learning framework.","breadcrumb":{"@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#primaryimage","url":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/Getting-Started-with-Grammar-Correction-using-Hugging-Face-Transformers-e1703810061918.png","contentUrl":"https:\/\/debuggercafe.com\/wp-content\/uploads\/2023\/12\/Getting-Started-with-Grammar-Correction-using-Hugging-Face-Transformers-e1703810061918.png","width":1000,"height":563,"caption":"Getting Started with Grammar Correction using Hugging Face Transformers"},{"@type":"BreadcrumbList","@id":"https:\/\/debuggercafe.com\/getting-started-with-grammar-correction\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/debuggercafe.com\/"},{"@type":"ListItem","position":2,"name":"Getting Started with Grammar Correction using Hugging Face Transformers"}]},{"@type":"WebSite","@id":"https:\/\/debuggercafe.com\/#website","url":"https:\/\/debuggercafe.com\/","name":"DebuggerCafe","description":"Machine Learning and Deep Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/debuggercafe.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/27719b14d930bd4a88ade40d18b0a752","name":"Sovit Ranjan Rath","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/debuggercafe.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f71ca13ec56d630e7d8045e8b846396068791aa204936c3d74d721c6dd2b4d3c?s=96&d=mm&r=g","caption":"Sovit Ranjan 
Rath"}}]}},"_links":{"self":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/34220","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/comments?post=34220"}],"version-history":[{"count":149,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/34220\/revisions"}],"predecessor-version":[{"id":38138,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/posts\/34220\/revisions\/38138"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media\/34343"}],"wp:attachment":[{"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/media?parent=34220"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/categories?post=34220"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/debuggercafe.com\/wp-json\/wp\/v2\/tags?post=34220"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}