0:00

One of the challenges of machine translation is that, given a French sentence, there could be multiple English translations that are equally good translations of that French sentence. So how do you evaluate a machine translation system if there are multiple equally good answers, unlike, say, image recognition, where there's one right answer and you just measure accuracy? If there are multiple great answers, how do you measure accuracy? The way this is done conventionally is through something called the BLEU score. So, in this optional video, I want to give you a sense of how the BLEU score works.

0:37

Let's say you are given a French sentence, "Le chat est sur le tapis." And you are given a reference, human-generated translation of this, which is "The cat is on the mat." But there are multiple pretty good translations of this. So a different person might translate it as "There is a cat on the mat." And both of these are actually perfectly fine translations of the French sentence. What the BLEU score does is, given a machine-generated translation, it allows you to automatically compute a score that measures how good that machine translation is.

1:45

BLEU stands for bilingual evaluation understudy. In the theater world, an understudy is someone who learns the role of a more senior actor so they can take over that role if necessary. And the motivation for BLEU is that, whereas you could ask human evaluators to evaluate a machine translation system, the BLEU score is an understudy: it could be a substitute for having humans evaluate every output of a machine translation system.

2:22

The BLEU score is due to Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. This paper has been incredibly influential, and is actually quite a readable paper, so I encourage you to take a look if you have time. The intuition behind the BLEU score is that we're going to look at the machine-generated output and see if the types of words it generates appear in at least one of the human-generated references. These human-generated references would be provided as part of the dev set or the test set. Now, let's look at a somewhat extreme example. Abbreviating machine translation as MT, let's say the machine translation, or MT, output is: the the the the the the the. This is clearly a pretty terrible translation.

3:23

So one way to measure how good the machine translation output is, is to look at each of the words in the output and see if it appears in the references. This would be called the precision of the machine translation output. In this case, there are seven words in the machine translation output, and every one of these seven words appears in either Reference 1 or Reference 2; the word "the" appears in both references. So each of these words looks like a pretty good word to include, and this output would have a precision of 7 over 7. It looks like it has great precision. So this is the basic precision measure: what fraction of the words in the MT output also appear in the references. It's not a particularly useful measure, because it seems to imply that this MT output has very high precision. So instead, what we're going to use is a modified precision measure, in which we give each word credit only up to the maximum number of times it appears in the reference sentences. In Reference 1, the word "the" appears twice. In Reference 2, the word "the" appears just once. 2 is bigger than 1, so we're going to say that the word "the" gets credit up to twice. So, with the modified precision, this output gets a score of 2 out of 7, because out of 7 words, we give "the" credit for appearing only twice.

5:10

So here, the denominator is the count of the number of times the word "the" appears in the MT output, 7 words in total. And the numerator is the count of the number of times the word "the" appears, clipped: we take the count and clip it at 2, the maximum number of times it appears in any one reference.
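As a rough sketch, the modified unigram precision described here could be computed like this in Python. This is an illustrative implementation, not from the video; the function name and whitespace tokenization are my own choices:

```python
from collections import Counter

def modified_unigram_precision(mt_output, references):
    """Clipped word matches divided by total words in the MT output."""
    mt_counts = Counter(mt_output.lower().split())
    ref_counts = [Counter(ref.lower().split()) for ref in references]
    clipped = 0
    for word, count in mt_counts.items():
        # Credit each word only up to the max times it appears in any one reference.
        max_in_any_ref = max(rc[word] for rc in ref_counts)
        clipped += min(count, max_in_any_ref)
    return clipped / sum(mt_counts.values())

refs = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_unigram_precision("the the the the the the the", refs))  # 2/7 ≈ 0.2857
```

Clipping against the maximum count in a single reference, rather than the total across references, is what keeps the degenerate "the the the..." output from scoring well.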

5:32

So this gives us the modified precision measure. Now, so far, we've been looking at words in isolation. In the BLEU score, you don't want to just look at isolated words; you maybe want to look at pairs of words as well. So let's define a portion of the BLEU score on bigrams, where bigrams just means pairs of words appearing next to each other. Let's see how we could use bigrams to define the BLEU score; this will just be a portion of the final BLEU score. We'll take unigrams, or single words, into account, as well as bigrams, meaning pairs of words, and maybe even longer sequences of words, such as trigrams, meaning three words appearing together. So, let's continue our example from before. We have the same Reference 1 and Reference 2, but now let's say the machine translation, or MT, system has a slightly better output: the cat the cat on the mat. Still not a great translation, but maybe better than the last one.

6:36

So here, the possible bigrams are: "the cat" (ignoring case), then "cat the", that's another bigram. Then there's "the cat" again, but we've already had that, so let's skip it. Then "cat on" is the next one, and then "on the", and "the mat". So these are the bigrams in the machine translation output.

7:17

And then, finally, let's define the clipped count: count, subscript clip. To define that, we take this column of counts but give our algorithm credit only up to the maximum number of times that bigram appears in either Reference 1 or Reference 2. So "the cat" appears a maximum of once in either of the references, so we clip that count to 1. "Cat the" doesn't appear in Reference 1 or Reference 2, so we clip that to 0. "Cat on" appears once, so we give it credit for one. "On the" appears once, so that gets credit for one, and "the mat" appears once. So these are the clipped counts: we take all the counts and clip them, reducing each one to be no more than the number of times that bigram appears in at least one of the references.

8:19

And then, finally, our modified bigram precision is the sum of the clipped counts, 1 + 0 + 1 + 1 + 1 = 4, divided by the total number of bigrams, which is 6. So 4 out of 6, or two-thirds, is the modified precision on bigrams.
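The bigram version of the same idea can be sketched as follows. Again this is only an illustration; the helper names and whitespace tokenization are my own assumptions:

```python
from collections import Counter

def bigrams(tokens):
    """All adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))

def modified_bigram_precision(mt_output, references):
    """Clipped bigram matches divided by total bigrams in the MT output."""
    mt_counts = Counter(bigrams(mt_output.lower().split()))
    ref_counts = [Counter(bigrams(ref.lower().split())) for ref in references]
    clipped = sum(min(count, max(rc[bg] for rc in ref_counts))
                  for bg, count in mt_counts.items())
    return clipped / sum(mt_counts.values())

refs = ["the cat is on the mat", "there is a cat on the mat"]
# "the cat" is counted twice but clipped to 1; "cat the" matches nothing.
print(modified_bigram_precision("the cat the cat on the mat", refs))  # 4/6 ≈ 0.667
```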

8:43

So let's formalize this a little bit further. With what we developed on unigrams, we define the modified precision computed on unigrams as P subscript 1, where the P stands for precision and the subscript 1 means we're referring to unigrams. It's defined as the sum, over the unigrams (that is, the words) appearing in the machine translation output y hat, of the clipped count of that unigram, divided by the sum, over the unigrams in the machine translation output, of the count of that unigram. And this is what we got two slides back: 2 out of 7. So the 1 here refers to unigrams, meaning we're looking at single words in isolation. You can also define P subscript n as the analogous modified precision computed on n-grams.

Â 10:40

And so these precisions, or these modified precision scores,

Â measured on unigrams or on bigrams, which we did on a previous slide,

Â or on trigrams, which are triples of words,

Â or even higher values of n for other n-grams.

Â This allows you to measure the degree to which the machine translation

Â output is similar or maybe overlaps with the references.

11:14

And one thing you can probably convince yourself of is that if the MT output is exactly the same as either Reference 1 or Reference 2, then all of these values, P1, P2, and so on, will be equal to 1.0. So to get a modified precision of 1.0, you just have to be exactly equal to one of the references. And sometimes it's possible to achieve this even if you aren't exactly the same as any one of the references, but you combine them in a way that hopefully still results in a good translation.
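Generalizing from the unigram and bigram cases, a single function can compute P subscript n for any n, and you can check the claim that copying a reference verbatim gives 1.0 for every n. This is an illustrative sketch with my own function names and whitespace tokenization:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous length-n word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(mt_output, references, n):
    """P_n: clipped n-gram matches divided by total n-grams in the MT output."""
    mt_counts = Counter(ngrams(mt_output.lower().split(), n))
    if not mt_counts:
        return 0.0  # output shorter than n words
    ref_counts = [Counter(ngrams(ref.lower().split(), n)) for ref in references]
    clipped = sum(min(count, max(rc[g] for rc in ref_counts))
                  for g, count in mt_counts.items())
    return clipped / sum(mt_counts.values())

refs = ["the cat is on the mat", "there is a cat on the mat"]
# An MT output that exactly matches a reference gets P_n = 1.0 for every n.
for n in range(1, 5):
    print(modified_precision("there is a cat on the mat", refs, n))  # 1.0 each time
```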

Â 11:57

Finally, Finally,

Â let's put this together to form the final BLEU score.

Â So P subscript n is the BLEU score computed on n-grams only.

Â Also the modified precision computed on n-grams only.

Â And by convention to compute one number, you compute P1,

Â P2, P3 and P4, and combine them together using the following formula.

Â It's going to be the average, so sum from n = 1 to 4 of Pn and divide that by 4.

Â So basically taking the average.

12:45

By convention, the BLEU score is then defined as e raised to this average; exponentiation is a strictly monotonically increasing operation. And then we adjust this with one more factor, called the brevity penalty, BP.

13:40

We don't want translations that are very short. So the BP, or brevity penalty, is an adjustment factor that penalizes translation systems that output translations that are too short. The formula for the brevity penalty is the following: it's equal to 1 if your machine translation system outputs things that are longer than the human-generated reference outputs, and otherwise it's a formula that overall penalizes shorter translations.
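Putting the pieces together, the full score could be sketched as below. One detail worth flagging: in the original Papineni et al. paper, the four precisions are combined as the exponential of the average of their logarithms (a geometric mean), and the brevity penalty is 1 when the output length c exceeds the reference length r, and exp(1 - r/c) otherwise. The tokenization and the choice of the closest reference length are my own simplifications:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(mt_tokens, ref_token_lists, n):
    counts = Counter(ngrams(mt_tokens, n))
    if not counts:
        return 0.0
    ref_counts = [Counter(ngrams(r, n)) for r in ref_token_lists]
    clipped = sum(min(c, max(rc[g] for rc in ref_counts))
                  for g, c in counts.items())
    return clipped / sum(counts.values())

def bleu(mt_output, references, max_n=4):
    mt = mt_output.lower().split()
    refs = [r.lower().split() for r in references]
    precisions = [modified_precision(mt, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    # Brevity penalty, using the reference length closest to the output length.
    ref_len = min((abs(len(r) - len(mt)), len(r)) for r in refs)[1]
    bp = 1.0 if len(mt) > ref_len else math.exp(1 - ref_len / len(mt))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

refs = ["the cat is on the mat", "there is a cat on the mat"]
print(bleu("the cat is on the mat", refs))  # exact match of a reference -> 1.0
```

A partial output like "the cat on the mat" scores strictly between 0 and 1: its lower trigram and 4-gram precisions and the brevity penalty both pull the score down.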

14:19

You can find the details in the paper. Once again, earlier in this set of courses, you saw the importance of having a single real-number evaluation metric, because it allows you to try out two ideas, see which one achieves a higher score, and then stick with the one that achieved the higher score. The reason the BLEU score was revolutionary for machine translation was that it gave a pretty good, by no means perfect, but pretty good single real-number evaluation metric, and so it accelerated the progress of the entire field of machine translation. I hope this video gave you a sense of how the BLEU score works. In practice, few people would implement a BLEU score from scratch; there are open-source implementations that you can download and just use to evaluate your own system.

Today, the BLEU score is used to evaluate many systems that generate text, such as machine translation systems, as well as, in the example I showed briefly earlier, image captioning systems, where you have a neural network generate an image caption and then use the BLEU score to see how much it overlaps with a reference caption, or multiple reference captions, generated by people. So the BLEU score is a useful single real-number evaluation metric to use whenever you want your algorithm to generate a piece of text and you want to see whether it has a similar meaning as a reference piece of text generated by humans. It is not used for speech recognition, because in speech recognition there's usually one ground truth, and you just use other measures to see if you got the speech transcription pretty much exactly word-for-word correct. But for things like image captioning, where multiple captions for a picture could be about equally good, or machine translation, where there are multiple equally good translations, the BLEU score gives you a way to evaluate quality automatically and therefore speed up your development. So with that, I hope you have a sense of how the BLEU score works.
