Hi and welcome to today's lecture. In the last part, we introduced the fundamental principles of neural machine translation, which gave us a very good baseline system. In this part of the lecture, we now want to describe advanced techniques to achieve state-of-the-art performance in machine translation. The first challenge we will address in this lecture is the vocabulary. In contrast to neural machine translation, language doesn't have a fixed vocabulary: over time, new words are added to the language, and other words are dropped and no longer used. One typical example here are names. Names at some point occur more often in your texts, and then at some point they become less important again. And of course, we also have to be able to handle these names. Another example are compounds, which are used very often in German. On the right side, you see an example of a German compound: the word "Donaudampfschifffahrtsgesellschaftskapitän". In order to understand this word, you have to break it down into its parts. The first part means "Danube", then we have "steam", then "shipping", then there is a "company", and then there is a "captain". So altogether the word means the captain of the company that runs the steam ships on the Danube. This is an example of a very long German word. A neural machine translation system will not be able to have all possible compounds of the German language represented in its vocabulary, so somehow we need a way to split these words into their parts and represent them separately, or model them differently instead of treating them as one entity.

In addition, in all languages new words are constantly created. One recent, very famous example is the word "Brexit". It is formed from the English word "Britain" and the word "exit" and describes the exit of Britain from the EU. By concatenating "Britain" and "exit", we get "Brexit" and have a new word which, if you had trained your system five years ago, you would never have seen before.

So let us look back at how the vocabulary is handled in neural machine translation. We first have a mapping which maps every word to a unique index, and then each word is just represented by this index. If we only have fifty or one hundred thousand words, that is possible: we can just represent every word by its index. But as we have seen from the examples, there is no fixed vocabulary; there are always new words coming in. So it is not clear how we can model that, because we can only translate words which we have seen in training and for which we have a good representation in our neural network.

How can we handle this problem? A first idea is to use a fixed vocabulary for the most frequent words and then have an additional token, the so-called UNK token, which represents all other words. Of course, this might not be ideal, because all the other words which are not in your vocabulary are replaced by just one token, but it is a first workaround to handle the open-vocabulary problem. Let us look at an example of how this works. Imagine we want to translate the sentence "I live in Karlsruhe". The words "I", "live" and "in" are very common in the English language, so they should be in our vocabulary. But the word "Karlsruhe" is not that common, and therefore it will most probably not be in our vocabulary.
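To make the vocabulary lookup concrete, here is a minimal sketch of a fixed vocabulary with an UNK fallback; the vocabulary contents and the token name are of course just made up for illustration and not taken from any particular system:

```python
# Minimal sketch of a fixed vocabulary with an UNK fallback (illustrative only).
UNK = "<unk>"

# A toy vocabulary built from the most frequent training words; a real system
# would keep something like the 50k-100k most frequent words.
vocab = {UNK: 0, "I": 1, "live": 2, "in": 3}

def to_indices(sentence):
    """Map every word to its index; all unknown words share the UNK index."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]

print(to_indices("I live in Karlsruhe"))  # -> [1, 2, 3, 0]  ("Karlsruhe" becomes UNK)
```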
So now, prior to the translation, we replace the word "Karlsruhe" by our unknown token and get the new sentence "I live in UNK". Then we use our neural machine translation system to translate this sentence into the target language, for example into the German sentence "Ich wohne in UNK". Of course, we also have to do this in training in order to learn that the source UNK token is translated into the target UNK token. Although we are now able to translate any sentence, we still have a problem. We see in this example that the content is often carried by the rare words, and therefore, if we just replace these words by UNK tokens, the content gets lost and the meaning that was in the sentence is no longer there. If we read the German sentence "Ich wohne in UNK", we know that he is living somewhere, but the most important information, namely where he is living, is no longer there. So this translation is less useful, and therefore this is not an ideal solution to the problem.

We can improve on this method with the so-called copy mechanism, which allows us to also generate translations for words that are not in our vocabulary. Here again we use the unknown token to replace rare words and use the neural machine translation system to generate a translation. But then we use an external tool to generate a translation for the unknown word and replace the unknown token on the target side by this translation. Let us have a look at an example. Again we start with our example sentence "I live in Karlsruhe", and again we replace "Karlsruhe" by the unknown token. Now, as in the example before, we use our neural machine translation system and generate the translation "Ich wohne in UNK". Given this translation, we now want to find the source word that generated the target UNK token; in our case this is the source UNK token, which was "Karlsruhe" before. How can we do this? We can use our neural machine translation architecture, where the attention weights act as a kind of alignment that links source words to target words. For one target word, we can look at which source word received the most attention when this target word was generated, and most probably that is the source word which generated this target word. In our case, the attention for the target UNK will most probably point to the source UNK, which was "Karlsruhe" before. So we can use the attention weights to link source words to target words. In our case this is quite simple, since there is only one source UNK and one target UNK, so it should be quite clear that these two are translations of each other, but there might be sentences with several unknown words, and in that case we need to find out which UNK word is a translation of which UNK word. Since we now have this alignment, we know that the UNK in our target sentence should be replaced by the translation of "Karlsruhe". Now we can use an external tool to translate "Karlsruhe" into German. For example, we can use an SMT system: a typical SMT system has a large phrase table and does not have the problem of a fixed vocabulary, so we can use the phrase table to generate a translation of "Karlsruhe", which in our case would again be "Karlsruhe", and then insert this translation into the target side. So in contrast to the first approach, we are now able to generate a real translation for the word "Karlsruhe".
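The way the attention weights are used to find the source word behind a generated UNK can be sketched roughly like this; the attention values below are toy numbers invented for illustration, and the external translation step is only a stand-in for a real tool such as an SMT phrase-table lookup:

```python
import numpy as np

# Toy values for illustration; in a real system the attention weights come
# from the decoder of the trained NMT model.
original_source = ["I", "live", "in", "Karlsruhe"]      # before UNK replacement
target = ["Ich", "wohne", "in", "<unk>"]                # NMT output
attention = np.array([
    [0.80, 0.10, 0.05, 0.05],   # "Ich"   attends mostly to "I"
    [0.10, 0.70, 0.10, 0.10],   # "wohne" attends mostly to "live"
    [0.10, 0.10, 0.70, 0.10],   # "in"    attends mostly to "in"
    [0.05, 0.05, 0.10, 0.80],   # "<unk>" attends mostly to "Karlsruhe"
])

def translate_rare(word):
    """Stand-in for an external tool, e.g. an SMT phrase-table lookup."""
    return {"Karlsruhe": "Karlsruhe"}.get(word, word)

# Replace each target <unk> by a translation of the source word with the
# highest attention weight at that target position.
output = [
    translate_rare(original_source[int(attention[i].argmax())]) if w == "<unk>" else w
    for i, w in enumerate(target)
]
print(output)   # -> ['Ich', 'wohne', 'in', 'Karlsruhe']
```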
We no longer have the problem that there will be an unknown token in our translation, because the unknown token was replaced by the translation we got from a different tool. But there is still one problem with this type of approach: we are not using our neural machine translation system to translate the rare words, but are completely relying on external tools. For this quite simple case that was fine and we got the correct translation, but in other cases it can be problematic to generate the translation for rare words with a different method than the one generating the translation for the rest of the sentence. Just think of generating German: one problem we saw before is that we have to generate the correct agreement between words, and in order to generate the correct morphological form of a word, we need the context. But when translating rare words this way, we completely ignore the context and only translate this one word into the target side.

So we have now seen two methods which replace rare words by an unknown token and then somehow translate them. A different approach to handling the open vocabulary is to find a method by which we can represent all our words with just a fixed number of symbols. One very successful approach to do this is the so-called byte pair encoding, and now we want to take a look at how byte pair encoding can be used to encode our source sentence and our target sentence. The main idea, in contrast to the previous approaches where we used N symbols, namely the N most frequent words plus the unknown token, and then ignored all the rare words, is that we now want to find an encoding such that we can represent all words in the sentence with just N symbols. One good starting point is the characters. If we look at characters, we can represent each sentence as a sequence drawn from just twenty-six characters. But if we only use the twenty-six characters, we have a very small vocabulary and the sequences get very long. So we want to find a method which uses more than twenty-six symbols, but at most, for example, forty or eighty thousand symbols, and for that we can use the byte pair encoding algorithm. In this algorithm, we start by representing our sequence by just the characters. Then we look at our whole corpus and find the most frequent character bigram, that is, the most frequent sequence of two characters, and replace these two characters by a new symbol. We continue this until we have used up our, for example, forty thousand symbols, and then we have represented all our text with just a fixed number of symbols. This idea was originally used as a compression algorithm.

So let us again have a look at the example. Again we have our sentence "I live in Karlsruhe". As a first step, we now replace each word by a sequence of characters: we have "I", which is just one letter, then we have "live", which is a four-letter sequence, we have "in" as a two-letter sequence, and "Karlsruhe" as the last sequence. We do that for all our data in the training corpus, and then we look for the most common character bigram in English. This could, for example, be "e" and "n", so we would replace "e n" by a new symbol. But in our sentence here there is no "e n", so we continue and look at the second most, third most frequent and so on character bigrams. In our case, the most frequent character bigram that occurs in this sentence is the sequence "i" and "n".
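Before we continue the example, here is a small sketch of this merge-learning loop. It is only a simplified illustration of the idea on a toy corpus: the real byte pair encoding for NMT also weights pairs by word frequency, uses an end-of-word marker and counts over the whole training corpus, all of which are left out here.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Toy byte-pair-encoding learner: repeatedly merge the most frequent
    adjacent symbol pair within words (never across word boundaries)."""
    # Start by representing every word as a list of characters.
    words = [list(word) for sentence in corpus for word in sentence.split()]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs in the current segmentation.
        pairs = Counter()
        for symbols in words:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]           # most frequent bigram
        merges.append(best)
        # Replace every occurrence of the best pair by one merged symbol.
        for symbols in words:
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == best:
                    symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
                else:
                    i += 1
    return merges, words

merges, segmented = learn_bpe(["i live in karlsruhe"], num_merges=5)
print(merges)      # the learned merge operations, in the order they were applied
print(segmented)   # the resulting subword segmentation of each word
```

On such a tiny corpus all pairs are equally frequent, so the order of merges is arbitrary; on a real training corpus the frequency counts determine the merge order described in the lecture.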
So we will now replace the sequence "i" and "n" by a new symbol; to keep it readable, let us just call this symbol "in". Now we have a new sequence where "in" is already like one unit, and the rest is still a sequence of characters. We continue and look for the next most frequent bigram in this sentence, which would now be "a" and "r", so we replace "a" and "r" in "Karlsruhe" by the new symbol "ar". The next most frequent is now "l" and "i" from "live", and so we continue through the list. Of course, we can now also use bigrams of sequences where each symbol is already a concatenation of several characters. For example, the next most frequent bigram here would be "li" and "v", giving us "liv" from the word "live". We continue this until there is no longer any rule we can apply, and then we get, for example, the sequence where "I", "live" and "in" are represented as whole words, and "Karlsruhe" is represented as a sequence of five subwords: we have "kar", then "l", then "s", then "ru" and "he". So we see that rare words get split into several sub-parts, and then we need to learn how to translate these sub-parts into the other language, while we now have the advantage that we can represent any word with our N symbols.

Of course, this approach also has some problems. We see that rare words get split into many sub-parts, so overall our sequences get a little bit longer, and we know that in neural machine translation, as a sentence gets longer, the computation time also gets longer and the translation just gets more difficult. But the bigger problem is that this splitting is not always reasonable: we do the splitting purely based on the frequency of the bigrams. For example, here we see that the "l" in "Karlsruhe" ends up as a single symbol, and it is not straightforward how we can translate an "l" in the middle of a word. It might be that in this case "Karlsruhe" is simply translated into "Karlsruhe", so the "l" will somehow be translated just into an "l" in the other language. But for other words, we might generate a completely different word in the target language, and then there is no clear connection between the "l" in the source sentence and the parts of the target word in the target sentence.

In byte pair encoding, we have seen that we started with characters, because the main advantage there is that the number of characters is very limited, but then in a preprocessing step we again go to word- or subword-like units. We can also go the other way and directly use characters in neural machine translation. This is commonly called character-based neural machine translation. In this case, our neural machine translation system will not work on word units, but will translate directly on the character level. So instead of feeding a sequence of words into our neural machine translation system, we can develop a model which directly gets a sequence of characters. One challenge here is that, of course, the sequence of characters is significantly longer than the sequence of words, so we have to somehow adapt our architecture to handle these long sequences. One common approach to do this is shown here on the right side. The main idea is that we first learn a representation for a sequence of a fixed number of characters, a character n-gram, and then use this representation as input to the neural machine translation system. So instead of word embeddings, we now use embeddings of character sequences.
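The chunking of the input into fixed-length character pieces described here can be sketched as follows; the chunk size of five and the sentence are just the values from the example, and in the real model each chunk would be passed through a learned encoder to produce its embedding, which is not shown here:

```python
def char_ngrams(sentence, n=5):
    """Split a sentence into consecutive character chunks of length n.
    Spaces are treated as ordinary characters; there is no word segmentation."""
    chars = list(sentence)
    return ["".join(chars[i:i + n]) for i in range(0, len(chars), n)]

chunks = char_ngrams("I live in Karlsruhe", n=5)
print(chunks)   # -> ['I liv', 'e in ', 'Karls', 'ruhe']
```

Each of these chunks would then receive its own embedding, which takes the place of the word embedding as the input to the encoder.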
One main difference to word-based models is that we no longer use a word segmentation; we just treat the space as one more character of the input and handle the whole sentence as one sequence, and especially in languages where we don't have a segmentation into words, this is an advantage. So here you see, for example, that we first have a character model which learns a representation for the first five characters "I", space, "l", "i", "v". Then we learn a representation for the sequence "e", space, "i", "n", space. Then we learn a representation for "Karls", and finally we learn a representation for "ruhe". We can see that each of the character embeddings shown in red can represent several words, like the first two, which represent "I" and part of "live", and the end of "live" together with "in", but it can also represent only part of a word, like the last two, which represent the beginning and the end of "Karlsruhe". One of the main challenges in character-based neural machine translation is that the sequence lengths get significantly longer, and therefore the run time is significantly slower. So typically it takes a lot longer to train and decode with these models than with the word-based models.

So in summary, we learned today how we can handle the open vocabulary task in neural machine translation. First of all, there is a difference between language and what neural machine translation can handle. While neural machine translation has a fixed vocabulary, can only accept these words as input and can only generate translations consisting of words from this fixed vocabulary, language is evolving and has an open vocabulary: new words can be invented and new words can occur, and when we train our model, we will never have seen all possible words. Furthermore, even in our training data we have such a large vocabulary that we are not able to represent all of it in our neural machine translation system. In order to address this problem, we learned three different methods. First, we learned the copy mechanism, where we replace rare words by the unknown token and then use an external tool to generate translations for these rare words. Secondly, we have seen the byte pair encoding algorithm, which is used as a preprocessing step for neural machine translation and where we represent all sequences of words by a fixed number of symbols. There we could, for example, limit ourselves to forty thousand symbols and represent rare words not by one symbol but by several subwords. And finally, we have seen character-based neural machine translation, which works directly on the character level and gets character sequences as input instead of word sequences.