After building and training your model, it is essential to assess how well it performs. For machine translation, there are metrics engineered specifically for this task. In this lecture, I will show you the BLEU score and some of its issues for evaluating machine translation models.

The BLEU score, short for bilingual evaluation understudy, is an algorithm designed to evaluate some of the most challenging problems in NLP, including machine translation. It evaluates the quality of machine-translated text by comparing a candidate translation to one or more references, which are often human translations. The closer the BLEU score is to one, the better your model is, and the closer it is to zero, the worse it is.

With that said, what is the BLEU score, and why is it an important metric? To get the BLEU score, you compute the precision of the candidate by comparing its n-grams with those of the reference translations. To demonstrate, I'll use unigrams as an example. Let's say that you have a candidate sequence from your model composed of the words I, I, am, I. You also have one reference translation that contains the words Younes said I am hungry, and a second reference translation that contains the words he said I am hungry.

To get the BLEU score, you count how many words from the candidate appear in any of the references, and divide that count by the total number of words in the candidate translation. You can view it as a precision metric. You go through all the words in the candidate translation. First, you have the word I, which appears in both reference translations, so you add one to your count. Then you have the word I again, which you already know appears in both references, so you add one to your count. After that, you have the word am, which also appears in both references, so you add one to your count. At the end, you have the word I again, which appears in both references, so you add one to your count.
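The counting procedure above can be sketched in a few lines of Python. This is a minimal illustration of the vanilla unigram precision, not a full BLEU implementation; the function name and the tokenization by whitespace are my own choices for the example.

```python
def vanilla_unigram_precision(candidate, references):
    """Count how many candidate words (with repeats) appear in any
    reference, divided by the candidate length."""
    # Pool all words that occur anywhere in the references.
    reference_vocab = {word for ref in references for word in ref}
    matches = sum(1 for word in candidate if word in reference_vocab)
    return matches / len(candidate)

candidate = "I I am I".split()
references = ["Younes said I am hungry".split(),
              "he said I am hungry".split()]
print(vanilla_unigram_precision(candidate, references))  # 1.0
```

Every one of the four candidate words appears in the references, so the score is 4/4 = 1, exactly the counterintuitive result discussed next.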
Finally, you get the BLEU score by dividing your count by the number of words in the candidate translation, which in this case is equal to 4. The whole process gives you a BLEU score equal to 1. Weird, right? This translation, which is far from matching the references, got a perfect score. With this vanilla BLEU score, a model that always outputs common words will do great. Let's try a modified version that will give you a better estimate of your model's performance.

For the modified version of the BLEU score, after you find a word from the candidate in one or more of the references, you stop considering that word in the references for the following words in the candidate. In other words, you exhaust the words in the references as you match them with words in the candidate.

Let's start from the beginning of the candidate translation. You have the word I, which appears in both references, so you add one to your count and exhaust the word I from both references. Then you have the word I again, but there are no occurrences of that word left in the references because it was exhausted for the previous word in the candidate, so you don't add anything to your count. Then you have the word am, which appears in both references, so you add one to your count and exhaust the word am from both references. After that, you have the word I again, but there are no occurrences left in the references, so you don't add anything to your count. Finally, you divide your count by the number of words in the candidate translation to get a BLEU score of two over four, or 0.5.

As you can see, this version of the BLEU score makes more sense than the vanilla implementation. However, like anything in life, using the BLEU score as an evaluation metric has some caveats. For one, it doesn't consider the semantic meaning of the words. It also doesn't consider the structure of the sentence. Imagine getting this translation: ate I was hungry because.
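The exhaust-and-count procedure just described is equivalent to clipping each candidate word's count by the maximum number of times it occurs in any single reference. Here is a minimal sketch of that modified unigram precision; the function name is my own, and whitespace tokenization is assumed.

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clip each candidate word's count by its maximum count in any
    single reference, then divide by the candidate length."""
    cand_counts = Counter(candidate)
    ref_counts = [Counter(ref) for ref in references]
    clipped = sum(
        min(count, max(rc[word] for rc in ref_counts))
        for word, count in cand_counts.items()
    )
    return clipped / len(candidate)

candidate = "I I am I".split()
references = ["Younes said I am hungry".split(),
              "he said I am hungry".split()]
print(modified_unigram_precision(candidate, references))  # 0.5
```

The word I occurs three times in the candidate but at most once in either reference, so its count is clipped to one; with am that gives 2/4 = 0.5, matching the walkthrough above.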
If the reference sentence is I ate because I was hungry, this candidate would get a perfect BLEU score. BLEU is the most widely adopted evaluation metric for machine translation, but you should be aware of these drawbacks before using it. You now know how to evaluate your machine translation model using the BLEU score. I also showed you that this metric has some issues because it doesn't account for semantics or sentence structure. In the following video, you'll see another metric for machine translation that could be used to better estimate your model's performance.
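You can check this word-order blindness with the same clipped unigram precision from before. This sketch only demonstrates the unigram case; real BLEU also uses higher-order n-grams and a brevity penalty, which partly mitigate the problem, but unigram precision alone awards the scrambled sentence a perfect score.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Clipped unigram precision against a single reference."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / len(candidate)

scrambled = "ate I was hungry because".split()
reference = "I ate because I was hungry".split()
print(clipped_unigram_precision(scrambled, reference))  # 1.0
```

Every word in the scrambled candidate occurs at least as often in the reference, so nothing is clipped and the score is 5/5 = 1, despite the sentence being ungrammatical.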