Previously, I introduced you to the BLEU score evaluation metric and its modified version. I used it to assess the performance of machine translation models. I also showed you some drawbacks that arise because that metric ignores semantics and sentence structure. In this video, I'll talk about the ROUGE score, another performance metric used to estimate the quality of machine translation systems.

I'll now introduce you to a family of metrics called ROUGE. It stands for Recall-Oriented Understudy for Gisting Evaluation, which is a mouthful. But it lets you know, right off the bat, that it's more recall-oriented by default. That means that ROUGE cares about how much of the human-created references appears in the candidate translation. In contrast, BLEU is precision-oriented, since you have to determine how many words from the candidate appear in the references. ROUGE was initially developed to evaluate the quality of machine-summarized texts, but it's also helpful in assessing the quality of machine translation. It works by comparing the machine candidates against reference translations provided by humans.

There are many versions of the ROUGE score, but here I'll focus on the one called ROUGE-N. For the ROUGE-N score, you have to get the counts of the n-gram overlaps between the candidates and the reference translations, which is somewhat similar to what you have to do for the BLEU score. To see the difference between the two metrics, I'll show you an example of how ROUGE-N works with unigrams. To get the basic version of the ROUGE-N score, which is based only on recall, you count word matches between the reference and the candidate, and divide by the number of words in the reference. If you have multiple references, you compute a ROUGE-N score using each reference and take the maximum.

Now, let's go through the example that you already solved for the BLEU score. Your candidate has the word "I" two times, the word "am", and the word "I" again, for a total of four words.
You also have a reference translation, "Younes said I am hungry," and another slightly different reference, "He said I am hungry." Each reference has five words in total. You have to count matches between the references and the candidate translation, similar to what you did for the BLEU score.

Let's start with the first reference. The word "Younes" doesn't match any of the unigrams in the candidate, so you don't add anything to the count. The word "said" doesn't match any word in the candidate either. The word "I" has multiple matches in the candidate, but you add only one to your count, since it appears once in the reference. The word "am" has a match in the candidate, so you increment your count. Now, the final word of the first reference, "hungry," doesn't match any of the words in the candidate, so you don't add anything to your count. If you repeat this process for the second reference, you also get a count equal to 2. Finally, you divide each count by the number of words in its reference and take the maximum value, which for this example is equal to 0.4.

This basic version of the ROUGE-N score is based on recall, while the BLEU score you saw in the previous lectures is based on precision. But why not combine both to get a metric like an F1 score? Recall, pun intended, from your introductory machine learning courses that the F1 score is given by this formula: two times the product of precision and recall, divided by the sum of both metrics. If you replace precision with the modified version of the BLEU score and recall with the ROUGE-N score, you get the following formula. For this example, you have a BLEU score equal to 0.5, which you got in the previous lectures, and a ROUGE-N score equal to 0.4, which you calculated before. With these values, you get an F1 score equal to 4 over 9, close to 0.44.

You have now seen how to compute the modified BLEU and the simple ROUGE-N scores to evaluate your model. You can view these metrics like precision and recall.
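To make the worked example concrete, here's a minimal Python sketch of this recall-based ROUGE-N and its F1 combination with the modified BLEU precision. This is just an illustration of the calculation from the lecture, not an official ROUGE implementation, and the function names are my own:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram matches divided by reference length."""
    def ngrams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    cand_counts = ngrams(candidate)
    ref_counts = ngrams(reference)
    # Each reference n-gram counts at most as often as it appears
    # in the candidate (a clipped count, as with modified BLEU).
    matches = sum(min(count, cand_counts[gram])
                  for gram, count in ref_counts.items())
    return matches / sum(ref_counts.values())

candidate = "I I am I"
references = ["Younes said I am hungry", "He said I am hungry"]

# With several references, score against each one and keep the maximum.
rouge = max(rouge_n_recall(candidate, ref) for ref in references)

bleu = 0.5  # modified BLEU precision from the earlier lecture
f1 = 2 * bleu * rouge / (bleu + rouge)

print(rouge)  # 0.4
print(f1)     # 0.444..., i.e. 4/9
```

Running this reproduces the numbers from the example: a ROUGE-N recall of 0.4 and an F1 score of 4/9.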
Therefore, you can use both to get an F1 score that could better assess the performance of your machine translation model. In many applications, you'll see an F-score reported along with the BLEU and ROUGE-N metrics. However, you must note that all the evaluation metrics you have seen so far don't consider sentence structure and semantics. They only account for matching n-grams between the candidates and the reference translations.