0:00

In the third course of this sequence of five courses, you saw how error analysis can help you focus your time on doing the most useful work for your project. Now, beam search is an approximate search algorithm, also called a heuristic search algorithm, and so it doesn't always output the most likely sentence. It's only keeping track of B equals 3 or 10 or 100 top possibilities. So what if beam search makes a mistake? In this video, you'll learn how error analysis interacts with beam search, and how you can figure out whether it is the beam search algorithm that's causing problems and worth spending time on, or whether it might be your RNN model that is causing problems and worth spending time on. Let's take a look at how to do error analysis with beam search.

Let's use this example: Jane visite l'Afrique en septembre. So let's say that in your machine translation dev set, your development set, the human provided this translation, Jane visits Africa in September, and I'm going to call this y*. So it is a pretty good translation written by a human. Then let's say that when you run beam search on your learned RNN model, your learned translation model, it ends up with this translation, which we will call y-hat: Jane visited Africa last September, which is a much worse translation of the French sentence. It actually changes the meaning, so it's not a good translation.

Now, your model has two main components. There is a neural network model, the sequence-to-sequence model; we shall just call this your RNN model. It's really an encoder and a decoder. And you have your beam search algorithm, which you're running with some beam width B. And wouldn't it be nice if you could attribute this error, this not very good translation, to one of these two components? Was it the RNN, really the neural network, that is more to blame, or is it the beam search algorithm that is more to blame?

And what you saw in the third course of the sequence is that it's always tempting to collect more training data; that never hurts. In a similar way, it's always tempting to increase the beam width; that never hurts, or pretty much never hurts. But just as getting more training data by itself might not get you to the level of performance you want, in the same way, increasing the beam width by itself might not get you to where you want to go.

Â 2:38

But how do you decide whether or not improving the search algorithm is a good use of your time? So here's how you can break the problem down and figure out what's actually a good use of your time. Now, the RNN, the neural network, what was called the RNN, really means the encoder and the decoder. It computes P(y given x). So for example, for the sentence Jane visits Africa in September, you plug in Jane visits Africa in September. Again, I'm ignoring upper versus lowercase for now, and so on. And this computes P(y given x). So it turns out that the most useful thing for you to do at this point is to use this model to compute P(y* given x), as well as to compute P(y-hat given x), and then to see which of these two is bigger. So it's possible that the left side is bigger than the right-hand side; it's also possible that P(y* given x) is less than P(y-hat given x), or less than or equal to. Depending on which of these two cases holds true, you'd be able to more clearly ascribe this particular error, this particular bad translation, to either the RNN or the beam search algorithm being at greater fault.
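As a sketch, this comparison might look like the following. The per-token log-probabilities here are made-up stand-ins for what a trained encoder-decoder would actually output; in practice, you would run a forward pass through your RNN to score each sentence, and you'd work in log space to avoid numerical underflow.

```python
# Toy comparison of P(y* | x) versus P(y-hat | x), done in log space.
# The per-token numbers are hypothetical stand-ins for RNN outputs.

def sequence_log_prob(token_log_probs):
    """log P(y | x) = sum over t of log P(y_t | x, y_1, ..., y_{t-1})."""
    return sum(token_log_probs)

# Hypothetical token scores for "Jane visits Africa in September" (y*)
log_p_star = sequence_log_prob([-1.0, -2.5, -1.2, -0.8, -1.1])
# Hypothetical token scores for "Jane visited Africa last September" (y-hat)
log_p_hat = sequence_log_prob([-1.0, -3.0, -1.2, -2.9, -1.1])

# If P(y* | x) > P(y-hat | x), the model itself prefers the human
# translation, and the blame falls on beam search (case 1 below).
print(log_p_star > log_p_hat)  # True with these made-up numbers
```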

So let's work out the logic behind this. Here are the two sentences from the previous slide. And remember, we're going to compute P(y* given x) and P(y-hat given x), and see which of these two is bigger. So there are going to be two cases. In case 1, P(y* given x) as output by the RNN model is greater than P(y-hat given x). What does this mean? Well, the beam search algorithm chose y-hat, right? The way you got y-hat was you had an RNN that was computing P(y given x), and beam search's job was to try to find a value of y that gives that arg max.

Â 4:51

But in this case, y* actually attains a higher value for P(y given x) than y-hat. So what this allows you to conclude is that beam search is failing to actually give you the value of y that maximizes P(y given x), because the one job that beam search had was to find the value of y that makes this really big. But it chose y-hat, while y* actually gets a much bigger value. So in this case, you could conclude that beam search is at fault.

Â 5:24

Now, how about the other case? In case 2, P(y* given x) is less than or equal to P(y-hat given x). And one of these two has got to be true; either case 1 or case 2 has to hold. What do you conclude under case 2? Well, in our example, y* is a better translation than y-hat. But according to the RNN, P(y* given x) is less than P(y-hat given x), so it's saying that y* is a less likely output than y-hat. So in this case, it seems that the RNN model is at fault, and it might be worth spending more time working on the RNN.
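The two cases amount to a small decision rule, which could be written out like this. The function name is made up for illustration; the 'B' and 'R' labels match the abbreviations used later in this video.

```python
def ascribe_fault(log_p_star, log_p_hat):
    """Attribute one bad dev-set translation to a component.

    Case 1: P(y* | x) >  P(y-hat | x) -> beam search ('B') is at fault,
            since it failed to find the y the model itself prefers.
    Case 2: P(y* | x) <= P(y-hat | x) -> the RNN model ('R') is at fault,
            since it assigns the better translation a lower probability.
    """
    return "B" if log_p_star > log_p_hat else "R"

print(ascribe_fault(-6.6, -9.2))  # "B": model prefers y*, but search missed it
print(ascribe_fault(-9.2, -6.6))  # "R": model prefers the worse y-hat
```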

Â 6:13

There are some subtleties here pertaining to length normalization that I'm glossing over. If you are using some sort of length normalization, then instead of evaluating these probabilities, you should be evaluating the optimization objective that takes length normalization into account. But ignoring that complication for now, in this case, what this tells you is that even though y* is a better translation, the RNN ascribed y* a lower probability than the inferior translation. So in this case, I will say the RNN model is at fault.
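For completeness, here is one way the length-normalized objective from the earlier beam search video could be computed; with normalization in play, you would compare these scores for y* and y-hat rather than the raw probabilities. The softening exponent alpha = 0.7 is the heuristic value discussed in that video.

```python
def length_normalized_score(token_log_probs, alpha=0.7):
    """(1 / T_y**alpha) * sum_t log P(y_t | x, y_<t).

    alpha = 1 fully normalizes by sentence length, alpha = 0 does no
    normalization, and alpha = 0.7 is a heuristic value in between.
    """
    t_y = len(token_log_probs)  # number of tokens in the output sentence
    return sum(token_log_probs) / (t_y ** alpha)

# When ascribing blame under length normalization, compare normalized
# scores; the made-up numbers below just illustrate the calculation.
score_short = length_normalized_score([-1.0, -2.0])        # 2 tokens
score_long = length_normalized_score([-1.0, -2.0, -3.0])   # 3 tokens
print(score_short > score_long)  # True
```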

So the error analysis process looks as follows. You go through the development set and find the mistakes that the algorithm made in the development set.

Â 7:08

And so in this example, let's say that P(y* given x) was 2 × 10^-10, whereas P(y-hat given x) was 1 × 10^-10. Using the logic from the previous slide, in this case, we see that beam search actually chose y-hat, which has a lower probability than y*. So I will say beam search is at fault, and I'll abbreviate that B. And then you go through a second mistake, or second bad output by the algorithm, and look at these probabilities. And maybe for the second example, you think the model is at fault; I'm going to abbreviate the RNN model with R. And you go through more examples, and sometimes the beam search is at fault, sometimes the model is at fault, and so on.
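The bookkeeping over the dev set can be sketched like this; the score pairs below are invented purely to illustrate the tally.

```python
def fraction_due_to_beam_search(dev_scores):
    """dev_scores: one (log P(y*|x), log P(y-hat|x)) pair per dev-set
    example where the algorithm's output was clearly worse than y*.
    Returns the fraction of those errors attributable to beam search."""
    faults = ["B" if s > h else "R" for s, h in dev_scores]
    return faults.count("B") / len(faults)

# Hypothetical scores for five bad translations found in the dev set.
dev_scores = [(-23.0, -24.5), (-18.2, -17.9), (-30.1, -31.0),
              (-12.4, -12.4), (-25.5, -26.0)]
print(fraction_due_to_beam_search(dev_scores))  # 0.6: B blamed on 3 of 5
```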

Â 7:58

And through this process, you can then carry out error analysis to figure out what fraction of errors are due to beam search versus the RNN model. With an error analysis process like this, for every example in your dev set where the algorithm gives a much worse output than the human translation, you can try to ascribe the error either to the search algorithm or to the RNN model that generates the objective function that beam search is supposed to be maximizing. And through this, you can try to figure out which of these two components is responsible for more errors. And only if you find that beam search is responsible for a lot of errors, then maybe it's worth working hard to increase the beam width. Whereas in contrast, if you find that the RNN model is at fault, then you could do a deeper level of analysis to try to figure out whether you want to add regularization, or get more training data, or try a different network architecture, or something else. And so a lot of the techniques that you saw in the third course in the sequence will be applicable there.

So that's it for error analysis using beam search. I found this particular error analysis process very useful whenever you have an approximate optimization algorithm, such as beam search, that is working to optimize some sort of objective, some sort of cost function, that is output by a learning algorithm, such as the sequence-to-sequence model or sequence-to-sequence RNN that we've been discussing in these lectures. So with that, I hope that you'll be more efficient at making these types of models work well for your applications.
