You've learned about self-attention, and you've learned about multi-head attention. Let's put it all together to build a transformer network in this video. You'll see how you can combine the attention mechanisms you saw in the previous videos to build the transformer architecture.

Starting again with the sentence "Jane visite l'Afrique en septembre" and its corresponding embeddings, let's walk through how you can translate the sentence from French to English. I've also added the start-of-sentence and end-of-sentence tokens here. Up until this point, for the sake of simplicity, I've only been talking about the embeddings for the words in the sentence. But in many sequence-to-sequence translation tasks, it will be useful to also add the start-of-sentence, or SOS, and the end-of-sentence, or EOS, tokens, which I have in this example.

The first step in the transformer is that these embeddings get fed into an encoder block which has a multi-head attention layer. This is exactly what you saw on the last slide, where you feed in the values Q, K and V computed from the embeddings and the weight matrices W. This layer then produces a matrix that can be passed into a feed-forward neural network, which helps determine what interesting features there are in the sentence. In the transformer paper, this encoder block is repeated N times, and a typical value for N is six. So after, maybe, about six times through this block, we will then feed the output of the encoder into a decoder block.

Let's start building the decoder block. The decoder block's job is to output the English translation. So the first output will be the start-of-sentence token, which I've already written down here. At every step, the decoder block will input the first few words, whatever we've already generated, of the translation. When we're just getting started, the only thing we know is that the translation will start with a start-of-sentence token. So the start-of-sentence token gets fed into this multi-head attention block, and just this one token, the SOS start-of-sentence token, is used to compute Q, K and V for this multi-head attention block.

This first block's output is used to generate the Q matrix for the next multi-head attention block, and the output of the encoder is used to generate K and V. So here's the second multi-head attention block with inputs Q, K and V as before. Why is it structured this way? Here's one piece of intuition that could help: the input down here is whatever you have translated of the sentence so far. It will ask a query to say, what comes after the start of sentence? It will then pull context from K and V, which is computed from the French version of the sentence, to try to decide the next word in the sequence to generate.

To finish the description of the decoder block, the multi-head attention block outputs values which are fed to a feed-forward neural network. This decoder block is also going to be repeated N times, maybe six times, where you take the output, feed it back to the input, and have this go through, say, half a dozen times. The job of this neural network is to predict the next word in the sentence. So hopefully it will decide that the first word in the English translation is "Jane", and what we do is then feed "Jane" to the input as well. Now the next query comes from SOS and "Jane", and it says, well, given "Jane", what is the most appropriate next word?
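To make the flow of Q, K and V into the decoder's second attention block more concrete, here is a minimal numpy sketch. The shapes, variable names and random matrices are illustrative assumptions, not the actual course or paper code; the point is only that Q comes from what the decoder has generated so far, while K and V come from the encoder's output.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how well each query matches each key
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the encoder positions
    return weights @ V                              # weighted sum of values = context

# Illustrative shapes only: 2 tokens generated so far (<SOS>, "Jane"),
# 5 French tokens in the encoder output, embedding dimension 4.
decoder_so_far = np.random.randn(2, 4)   # output of the decoder's first attention block
encoder_output = np.random.randn(5, 4)   # output of the encoder block

W_Q = np.random.randn(4, 4)              # untrained stand-ins for the learned weight matrices
W_K = np.random.randn(4, 4)
W_V = np.random.randn(4, 4)

Q = decoder_so_far @ W_Q                 # queries come from what has been translated so far
K = encoder_output @ W_K                 # keys and values come from the encoder's output,
V = encoder_output @ W_V                 # i.e. from the French sentence
context = scaled_dot_product_attention(Q, K, V)
print(context.shape)                     # (2, 4): one context vector per decoder position
```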
Let's find the right key and the right value that lets us generate the most appropriate next word, which hopefully will be "visits". Then running this neural network again generates "Africa", and we feed "Africa" back into the input. Hopefully it then generates "in" and then "September", and with this input, hopefully it generates the end-of-sentence token, and then we're done.

These encoder and decoder blocks, and how they are combined to perform a sequence-to-sequence translation task, are the main ideas behind the transformer architecture. In this case, you saw how you can translate an input sentence into a sentence in another language, to gain some intuition about how attention and neural networks can be combined to allow simultaneous computation. But beyond these main ideas, there are a few extra bells and whistles to the transformer. Let me briefly step through these extra bells and whistles that make the transformer network work even better.

The first of these is positional encoding of the input. If you recall the self-attention equations, there's nothing that indicates the position of a word. Is this word the first word in the sentence, somewhere in the middle, or the last word in the sentence? The position within the sentence can be extremely important to translation. So the way you encode the position of elements in the input is that you use a combination of these sine and cosine equations.

Let's say, for example, that your word embedding is a vector with four values. In this case, the dimension d of the word embedding is 4. So X1, X2, X3, let's say those are four-dimensional vectors. In this example, we're going to create a positional embedding vector of the same dimension, also four-dimensional, and I'm going to call this positional embedding P1, say, for the position embedding of the first word, "Jane". In the equation below, pos denotes the numerical position of the word. So for the word "Jane", pos is equal to 1, and i here refers to the different dimensions of the encoding. The first element corresponds to i = 0, this element to i = 1, this one to i = 2, and this one to i = 3. So these are the variables, pos and i, that go into these equations down below, where pos is the position of the word, i goes from 0 to 3, and d is equal to 4, the dimension of this vector. What the positional encoding does with these sine and cosine equations is create a unique positional encoding vector, one of these vectors that is unique for each word. So the vector P3 that encodes the position of "l'Afrique", the third word, will be a set of four values that will be different from the four values used to encode the position of the first word, "Jane".

This is what the sine and cosine curves look like, for i = 0, i = 1, i = 2 and i = 3. Because you have these terms in the denominator, i = 0 will have some sinusoidal curve that looks like this, and i = 1 will be the matched cosine, so 90 degrees out of phase. Then i = 2 will end up with a lower-frequency sinusoid like so, and i = 3 gives you a matched cosine curve. So for P1, for position 1, you read off the values at this position to fill in those four values there, whereas for a different word at a different position, maybe this is now 3 on the horizontal axis, you read off a different set of values. Notice that these first two values may be very similar because they're roughly the same height. But by using these multiple sines and cosines, then looking across all four values, P3 will be a different vector than P1.
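For reference, the sine and cosine equations from the paper are PE(pos, 2i) = sin(pos / 10000^(2i/d)) for the even dimensions and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) for the odd ones. Here is a small numpy sketch of how you could compute P1 and P3 for d = 4, assuming pos is counted from 1 as in this example (the exact indexing convention is a detail).

```python
import numpy as np

def positional_encoding(pos, d=4):
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    """
    p = np.zeros(d)
    for i in range(0, d, 2):
        angle = pos / (10000 ** (i / d))
        p[i] = np.sin(angle)        # even dimensions: sine
        p[i + 1] = np.cos(angle)    # odd dimensions: the matched cosine, 90 degrees out of phase
    return p

p1 = positional_encoding(pos=1)     # unique 4-dimensional vector for the first word, "Jane"
p3 = positional_encoding(pos=3)     # a different 4-dimensional vector for the third word, "l'Afrique"
print(p1, p3)
```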
So the positional encoding P1 is added directly to X1, the input, in this way, so that each of the word vectors is also influenced, or colored, by where in the sentence the word appears. The output of the encoder block contains contextual semantic embedding and positional encoding information. The output of the embedding layer is then d, which in this case is 4, by the maximum length of sequence your model can take. The outputs of all these layers are also of this shape.

In addition to adding these positional encodings to the embeddings, you'd also pass them through the network with residual connections. These residual connections are similar to those you previously saw in ResNet, and their purpose in this case is to pass along positional information through the entire architecture. In addition to positional encoding, the transformer network also uses a layer called Add & Norm that is very similar to the batch norm layer that you're already familiar with. For the purpose of this video, don't worry about the differences; think of it as playing a role very similar to batch norm, which just helps speed up learning. This batch-norm-like layer, the Add & Norm layer, is repeated throughout this architecture. Finally, for the output of the decoder block, there's actually also a linear and then a softmax layer to predict the next word, one word at a time.

In case you read the literature on the transformer network, you may also hear of something called masked multi-head attention, which I'll just draw over here. Masked multi-head attention is important only during the training process, where you're using a dataset of correct French-to-English translations to train your transformer. Previously, we stepped through how the transformer performs prediction one word at a time, but how does it train? Let's say your dataset has the correct French-to-English translation, "Jane visite l'Afrique en septembre" and "Jane visits Africa in September". When training, you have access to the entire correct English translation, the correct output, and the correct input. And because you have the full correct output, you don't actually have to generate the words one at a time during training. Instead, what masking does is it blocks out the last part of the sentence to mimic what the network will need to do at test time, or during prediction. In other words, all that masked multi-head attention does is repeatedly pretend that the network had perfectly translated, say, the first few words, and hide the remaining words to see if, given a perfect first part of the translation, the neural network can predict the next word in the sequence accurately (there's a small sketch of this masking below).

So that's a summary of the transformer architecture. Since the paper "Attention Is All You Need" came out, there have been many other iterations of this model, such as BERT or DistilBERT, which you get to explore yourself this week. So that was it. I know there was a lot of detail, but now you have a good sense of all of the major building blocks of the transformer network. When you see this in this week's programming exercise, playing around with the code there will help you build even deeper intuition about how to make this work for your applications.
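To give a rough sense of what "blocking out the last part of the sentence" looks like in code, here is a minimal numpy sketch of a causal mask as it might be applied inside masked multi-head attention during training. The token count, dimension and random matrices are illustrative assumptions; the point is that each position can only attend to the words before it, so the network never sees the words it is being asked to predict.

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular mask: position t may only attend to positions 0..t."""
    return np.tril(np.ones((T, T)))

def masked_attention_weights(Q, K):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(len(Q)) == 1, scores, -1e9)  # hide the "future" words
    scores -= scores.max(axis=-1, keepdims=True)               # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)       # softmax over allowed positions

# Illustrative example: the full correct English output
# (<SOS> Jane visits Africa in September <EOS>) has 7 tokens, embedding dimension 4.
T, d = 7, 4
Q = np.random.randn(T, d)    # stand-ins for the queries and keys computed during training
K = np.random.randn(T, d)
W = masked_attention_weights(Q, K)
print(np.round(W, 2))        # row t puts zero weight on every position after t
```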