Hi and welcome to today's lecture on phrase-based machine translation. In the last videos, we introduced basic word-based statistical machine translation. Today we want to go a step further, to phrase-based machine translation. The main difference is that we no longer treat every word as a single unit; now our basic units are phrases. In phrase-based SMT, we refer to any sequence of consecutive words as a phrase. So these phrases don't need to be linguistically motivated or connected in any particular way; every consecutive sequence of words in a sentence can be referred to as a phrase. The main idea is that we no longer translate each word of the source sentence individually into the target sentence; instead we translate bigger chunks of words, for example two or three words together, into the target sentence, where we can then also generate two or three words.

Why should it be an advantage to work on units of several words? The first reason is that there is not always a word-to-word correspondence between the source and the target language. Take the translation of idioms: the English expression "kick the bucket" can be translated into the German "ins Gras beißen", which literally means "bite the grass". If we wanted to use a word-based translation model to generate this translation, it would be quite hard, because "kick" will very rarely be translated as "beißen", and "bucket" will very rarely be translated as "Gras"; only in the context of this whole expression is that the correct translation. Therefore it can be helpful to treat the whole expression as one unit and translate it as a whole into the target language.

A second important problem is that translation is often context-dependent. Look at prepositions, for example: in the German phrase "auf meinem Schreibtisch", the word "auf" is translated as "at", because the English is "at my desk", while in a different phrase, "auf meinem Schiff", we translate the German word "auf" into "on" ("on my ship"). So depending on where the preposition is used, the English translation is very different. Again it can be helpful not to translate word by word but to translate bigger chunks, in order to better model the context. Word-based models might capture this to some degree through different translation probabilities and through the language model, but in general it is not modeled very well. So it can be helpful to translate the whole phrase as it is, so that we generate the correct preposition.

When we do phrase-based translation, we start with a word-to-word alignment. We have already seen these word-to-word alignments in the IBM models, and this is where the IBM models are still very useful in phrase-based machine translation: they are used to generate the word-to-word mappings between source and target sentence. We use these models to generate a Viterbi alignment, that is, the most probable alignment between the source and the target sentence. Assume we have the following parallel sentence in the training data: the German source sentence "Was wir bisher gesehen haben" and its English translation "what we saw up till now". Using an IBM model, we can now generate the most probable alignment, which will align "what" to "Was", "we" to "wir", "bisher" to "up till now", and "saw" to "gesehen".
Based on this alignment, we can now train our phrase-based model. One problem here is that the IBM models can generate one-to-many alignments, as we have here for the German word "bisher", which is aligned to three different target words, but they cannot generate many-to-one alignments, which in our case we would need for "gesehen haben" to "saw". Therefore, one good idea in machine translation is to look at the problem from both sides. This was the forward direction; it is helpful to also look at the reverse direction, where we simply exchange the roles of the source and target language. We can do this by training an additional model, for example an IBM model, in the inverse direction. If we do this, we get the alignment shown here in blue. It is very similar, but there are some differences, because now we can have many-to-one alignments but no longer one-to-many alignments: "bisher" is now aligned only to "till", and "gesehen haben" is aligned to "saw". So it is helpful to have both alignments.

The first step in the training of a phrase-based system is therefore to take the alignments from both directions, trained for example from German to English as well as from English to German, and to combine them. Over time, different heuristics have been proposed for this, which all work in a similar way. Normally you start with the intersection of both alignments: if a word pair is aligned in one direction and also in the other direction, it should be a good alignment. Starting from this intersection, you then look at words which are aligned in only one direction, and heuristics decide which of these alignment points are also taken over into the final alignment. If we use such a combination heuristic, we could, for example, end up with the union here, where we now have an alignment between the source and target language that contains one-to-many as well as many-to-one alignments. We then get the final alignment, where "Was" is aligned to "what", "wir" to "we", "bisher" to "up till now", and "gesehen haben" to "saw". This is the alignment we work with when training our phrase-based model.

The next step in the training of the phrase-based model is the so-called phrase extraction. The idea is that we take our training data, which now consists of the source sentence, the target sentence and the alignment, and we extract all possible phrase translations contained in it. We will refer to these as phrase pairs. So the idea is that we extract source and target phrases which are translations of each other. The main question is now: what is a good phrase pair and what is a bad phrase pair? In which cases is a source phrase a translation of a target phrase, and which phrases are not translations of each other? The main idea is that we extract all phrase pairs which are consistent with the alignment, and the definition of when a phrase pair is consistent with the alignment consists of three conditions. The first condition is that if a source word is in the phrase pair, the target words aligned to it also have to be in the phrase pair. To make that clear, we will go through an example in a moment; but first, below is a small sketch of the alignment combination step described above.
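As a hedged illustration of such a combination heuristic, here is a minimal Python sketch that starts from the intersection of the two directional alignments and then adds neighbouring links from the union. It is only similar in spirit to common heuristics such as grow-diag; the exact rules on the slides may differ, and the toy alignment data (0-based word positions for "Was wir bisher gesehen haben" / "what we saw up till now") is an assumption made for illustration.

```python
# Minimal sketch of alignment symmetrization (toy, hypothetical example data).
# 'forward' holds links from the source->target model, 'reverse' from the
# target->source model, both as sets of (source_index, target_index) pairs.

def symmetrize(forward, reverse):
    intersection = forward & reverse      # links both models agree on
    union = forward | reverse             # links proposed by either model
    alignment = set(intersection)

    # Grow step: add union links that touch an already aligned point via a
    # direct neighbour (simplified compared to full grow-diag-final).
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - alignment):
            neighbours = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            if any(n in alignment for n in neighbours):
                alignment.add((i, j))
                added = True
    return alignment

# "Was(0) wir(1) bisher(2) gesehen(3) haben(4)" / "what(0) we(1) saw(2) up(3) till(4) now(5)"
forward = {(0, 0), (1, 1), (2, 3), (2, 4), (2, 5), (3, 2)}   # one-to-many for "bisher"
reverse = {(0, 0), (1, 1), (2, 4), (3, 2), (4, 2)}           # many-to-one for "gesehen haben"
print(sorted(symmetrize(forward, reverse)))
```

On this toy sentence pair the result is the union of both directions, matching the combined alignment described above.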
So, for example, if we look at the phrase pair consisting of the German word "bisher" and the English word "till", this phrase pair is not consistent with the word alignment, because the source word "bisher" is in the phrase pair, but the target words "up" and "now", which are aligned to this word, are not in the phrase pair. Therefore, we will not extract this phrase pair. On the other hand, if we look at the phrase pair "bisher" and "up till now", the only source word is "bisher", and all the words it is aligned to, "up", "till" and "now", are in the target phrase, so we can extract the phrase pair "bisher" to "up till now". For this phrase pair, the first condition holds.

Since we already saw before that it is good to look at both sides, we can now do the same thing for the target side. The second condition a phrase pair has to fulfill in order to be extracted is that if a target word is in the phrase pair, all the aligned source words also have to be in the phrase pair. So in this case, we would not extract the phrase pair "gesehen" to "saw", because the target word "saw" is in the phrase pair, but the source word "haben", which is aligned to "saw", is not in the phrase pair. In contrast, the phrase pair "gesehen haben" to "saw" will be extracted, because for all target words in the phrase, the aligned source words are also in the phrase pair. And finally, the last condition is that at least one source word has to be aligned to one target word in the phrase pair. So, for example, for words like the comma, which are aligned to nothing, we cannot get a phrase pair which consists only of this word.

Now we look at all possible combinations of source phrases and target phrases, check whether they are consistent with the word alignment, and if this is the case, we extract them. For this very small sentence, we would extract several phrases. The smallest phrases we would extract would be the following four phrase pairs: "Was" to "what", "wir" to "we", "gesehen haben" to "saw", and "bisher" translated into "up till now". But we will also extract longer phrases: we would also extract "Was wir" to "what we" and "bisher gesehen haben" to "saw up till now". We cannot, for example, extract the phrase "wir bisher", because if we wanted to extract this phrase, on the target side we would also need to include the word "saw", but this is aligned to "gesehen haben", so we cannot extract this phrase from this sentence. Then there are two more phrases which can be extracted: "wir bisher gesehen haben" to "we saw up till now", and "Was wir bisher gesehen haben" to "what we saw up till now".

One important question is what happens with unaligned words; in our case, the comma is not aligned to anything on the target side. According to these conditions, such words can be added to all the surrounding phrases. So in addition to the phrases above, we can also extract the phrase pairs "bisher gesehen haben ," to "saw up till now", "wir bisher gesehen haben ," to "we saw up till now", and "Was wir bisher gesehen haben ," to "what we saw up till now". You already see that if you have a lot of unaligned words, this results in many more phrases that can be extracted. In general, the number of phrases that can be extracted is quite huge. That is why we normally limit the length up to which we extract phrases, to something like seven or twelve source words.
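The extraction procedure just described can be written down quite compactly. The following is a minimal sketch, assuming the toy sentence pair and combined alignment from above; for brevity it does not extend the target span over unaligned target words at the phrase boundaries, as a full implementation would, and the length limit is simply set to seven.

```python
# Minimal sketch of phrase-pair extraction from one aligned sentence pair.

def extract_phrases(src, tgt, alignment, max_len=7):
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions aligned to any source word inside [i1, i2].
            tgt_points = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_points:
                continue                      # condition 3: needs at least one link
            j1, j2 = min(tgt_points), max(tgt_points)
            # Conditions 1 and 2: no alignment link may leave the block
            # [i1, i2] x [j1, j2] on either side.
            consistent = all(
                (i1 <= i <= i2) == (j1 <= j <= j2) for (i, j) in alignment
            )
            if consistent and j2 - j1 < max_len:
                pairs.append((" ".join(src[i1:i2 + 1]),
                              " ".join(tgt[j1:j2 + 1])))
    return pairs

src = ["Was", "wir", "bisher", "gesehen", "haben", ","]   # the comma is unaligned
tgt = ["what", "we", "saw", "up", "till", "now"]
alignment = {(0, 0), (1, 1), (2, 3), (2, 4), (2, 5), (3, 2), (4, 2)}
for s, t in extract_phrases(src, tgt, alignment):
    print(f"{s}  |||  {t}")
```

On this toy example the sketch produces the phrase pairs listed above, including the longer phrases that absorb the unaligned comma, while rejecting inconsistent pairs such as "wir bisher" or "gesehen" / "saw".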
So now we have extracted our phrases, which are the main building blocks when generating the translation. The second question is: how do we evaluate these different phrases? We now have to assign probabilities to the different phrase pairs. In the word-based models, we had a generative model which assigned a probability to every alignment. Here we no longer have that, because we don't want to distinguish between long phrases and short phrases: there is not one fixed segmentation of the training data into phrases, but we extract very long phrases as well as very short phrases, and we want to keep both of them, because sometimes it is more helpful to use the short phrases and sometimes it is more helpful to take the long phrases. Therefore we use a different way to estimate the probability of the different phrases. For the phrase-based model, we look at the frequency of the phrases and estimate the probability of a phrase pair from its relative frequency: the score is the number of times we have seen this phrase pair, count(e, f), divided by the number of times we have seen this target phrase with any source phrase.

To make this more clear, let us look at an example. Say we look at the target phrase "up till now" and want to estimate this score for the pair "up till now" and "bisher". Then we look at how often "up till now" occurred together with the phrase "bisher"; in our corpus, we have extracted this pair twenty times. But we have also seen the phrase "up till now" with other source phrases, for example "bis jetzt" and several others. All in all, we have seen the phrase "up till now" with any source phrase seventy times. So this probability will be twenty divided by seventy, that is, two sevenths.

So now we have one probability which can be assigned to a phrase pair. But if you remember last time's video, we saw that one advantage of the log-linear model is that we can use any number of features. Therefore it might be helpful not to use only one type of probability to estimate the quality of a phrase pair. For example, just imagine you have a target phrase which occurs only one time. Then the count of e and f will be one, and since e occurs only once, the denominator will also be one, so the probability will be one, although it might not be a good translation, because we have seen it only once. Therefore it is helpful, again, to also look at the inverse direction: in addition to the probability p(e|f), we also calculate the probability p(f|e), which is computed exactly the same way, just with e and f exchanged. But both of these probabilities have the problem that very long phrases normally occur quite rarely, and therefore the estimates are not very reliable. Therefore, in addition to these two scores, we also use so-called lexical weights. For these, we use the word alignment and then calculate a score which is very similar to the IBM Model 1 score: a score within the phrase which looks at the word probabilities and estimates how good the phrase pair is based on those word probabilities. It is a product over the length of the phrase, and inside we sum over the possible translations.
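A common way to write down this lexical weight is the following; this is the standard phrase-based formulation, so the exact notation on the slides may differ slightly, and unaligned target words are usually treated as if aligned to a special NULL word.

```latex
% Lexical weight: product over the target words e_i of the phrase, averaging
% the word translation probabilities w over the source words f_j aligned to e_i.
\[
p_{\text{lex}}(e \mid f, a)
  \;=\;
  \prod_{i=1}^{|e|}
    \frac{1}{\bigl|\{\, j : (i,j) \in a \,\}\bigr|}
    \sum_{(i,j) \in a} w(e_i \mid f_j)
\]
```

So for every target word we average the word-to-word translation probabilities of the source words it is aligned to, and multiply these averages over the whole phrase.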
This lexical weight, written p(e|f, a), is calculated similarly to the IBM Model 1 score: we have a product over the target words of the phrase, and the factors are built from word-to-word translation probabilities, which are denoted here as w.

So now we have seen how we train a phrase-based system. The training mainly consists of three steps. In the first step, we do the word alignment, where we take the alignments from both directions and combine them into one word alignment between the source and target sentence. The second step is the phrase extraction, where we extract all phrases which are consistent with the word alignment. And thirdly, we use these four different scores in order to evaluate how good each translation pair is.

After we have done the training, the question is: if we have one possible translation, how do we assign a score to it? Let us again look at the example to see how this works. Suppose we now want to translate the German sentence "Was wir bisher gesehen haben"; it is no longer part of the training data, we are now doing this at translation time. We might have used three phrase pairs to translate it: "Was wir" into "what we", "gesehen haben" into "saw", and "bisher" into "up till now", and we did some reordering here in order to get the correct order on the English side. One important thing is that the score also depends on the segmentation, so the same translation generated by two different segmentations can get a different score.

But how do we now calculate this score? First, we can use the translation model to assign a score to this translation. The first feature in our log-linear model is the first translation-model score: it is the logarithm of the probability of translating "Was wir" into "what we", of translating "gesehen haben" into "saw", and of translating "bisher" into "up till now". Using the laws of logarithms, we can rewrite it as a sum of three log probabilities. Then we can look into our phrase table, where we calculated all the probabilities for these phrases according to the equation from the last slide, and retrieve them. For example, the first phrase had a probability of one; we do the same for the second and the third phrase, take the logarithms, and then calculate the final value, so the first translation model assigns its score to this translation.

We do this not only with our first model; we do it also with the inverse model and with the lexical weights in both directions. So we get four different scores, which all describe how good this translation is according to that model, that is, four different feature values describing the quality of this translation. But we do not only have translation models. One thing we haven't talked about yet is how to model reordering: you see that here some words get reordered, and this should also be modeled somehow. In the IBM models there were quite complex models for this; in the basic phrase-based model, it is modeled quite simply, as we will see right after the following short sketch of the translation-model feature.
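As a minimal sketch of this translation-model feature, the snippet below looks up the three phrase pairs of the hypothesis in a toy phrase table and sums their log probabilities. The first entry is given probability 1.0 as in the example above; the other values are hypothetical placeholders (the two-sevenths value is reused from the earlier relative-frequency example purely for illustration).

```python
# Minimal sketch: translation-model score of one segmented hypothesis.
import math

phrase_table = {
    ("Was wir", "what we"):        1.0,      # as in the lecture's example
    ("gesehen haben", "saw"):      0.4,      # hypothetical value
    ("bisher", "up till now"):     2 / 7,    # reused from the earlier example
}

hypothesis = [("Was wir", "what we"),
              ("gesehen haben", "saw"),
              ("bisher", "up till now")]

# log p1 + log p2 + log p3, i.e. the sum of log phrase probabilities.
tm_score = sum(math.log(phrase_table[pair]) for pair in hypothesis)
print(round(tm_score, 3))
```

The same lookup-and-sum is then repeated for the inverse phrase probability and the two lexical weights, giving the four translation-model feature values.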
In general, the idea is that it should be better to translate monotonically, but sometimes you need to do some small reorderings, like here, where the phrase "bisher" / "up till now" has to change places with "gesehen haben" / "saw" in order to generate the correct word order on the target side. Therefore we simply look at whether two phrases are translated consecutively or whether there is a jump in between, and this is then used as a feature in the log-linear model. If we look at the first phrase pair, we start at the beginning of the source sentence and the target side continues with no jump, so the distance here is zero. When generating the next target phrase, we jump on the source side over the word "bisher", so we have a jump of plus one, and therefore we get some penalty for jumping over this word. And finally, when generating the last phrase, we have to jump back from the end to the word we skipped, which means we jump back over two words, so we get another penalty. Then we can use a quite simple metric for penalizing these jumps: for example, we can just square the jump distances and add them up, and then the feature value here would be five (0 + 1 + 4). This is a very rough way of modeling jumps and reorderings, but as a first approximation it can be quite good, and it mainly expresses that you don't want to jump around randomly in the source sentence; mainly you want to translate it consecutively.

Then, of course, we also have a language model, which might assign a score of -21.56 to the English sentence "what we saw up till now". Now we can combine all these model scores in the log-linear model by giving a weight to each of these models and generating a final score for this hypothesis. First we have the four different translation probabilities, then we have the distortion model, and finally we have the language model, and we get a final score for this hypothesis of -23.785.

To summarize today's lecture: the first thing we have learned is that the basic units in phrase-based translation are phrases, which are just sequences of consecutive words. In order to train this model, the first thing we do is generate a word-to-word alignment with the word-based models. Then we extract phrase pairs from this word alignment which are consistent with it; in order to be consistent, they have to fulfill three conditions: first, for every source word in the phrase, the aligned target words have to be in the phrase; second, for every target word, the aligned source words have to be in the phrase; and third, at least one source and one target word in the phrase have to be aligned to each other. After we have extracted all these phrase pairs, we evaluate them by assigning four different scores to each phrase pair: the relative frequency in the normal direction, the relative frequency in the inverse direction, and the lexical weights in both directions. And finally, we saw how we can combine all these scores using the log-linear model in order to assign a score to one possible translation.
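As a closing illustration of the log-linear combination described in this lecture, here is a minimal sketch. The jump distances (0, 1, 2) and the language-model score (-21.56) follow the example above; all other feature values and all weights are hypothetical placeholders, so the printed number is only illustrative and is not the score from the slides.

```python
# Minimal sketch: combining feature scores in the log-linear model.

def distortion_penalty(jumps):
    # Square each jump distance, as in the example above: 0, 1, 2 -> 5.
    return sum(d * d for d in jumps)

features = {
    "tm_forward":  -2.169,   # e.g. the translation-model sum from the earlier sketch
    "tm_inverse":  -1.5,     # hypothetical
    "lex_forward": -0.8,     # hypothetical
    "lex_inverse": -0.9,     # hypothetical
    "distortion":  -distortion_penalty([0, 1, 2]),
    "lm":          -21.56,   # language-model score from the example
}
weights = {name: 1.0 for name in features}   # in practice these weights are tuned

# Final hypothesis score: weighted sum of all feature values.
score = sum(weights[name] * features[name] for name in features)
print(round(score, 3))
```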