You have learned all the basics of attention by now. In fact, you could already build a transformer with it. But if you want it to work really well, run fast, and get very good results, you'll need one more thing: multi-head attention. Let me show you what it is. First, I'll share some intuition on multi-head attention. Afterwards, I'll show you the math behind it. Recall that you need word embeddings for the query, key, and value matrices in scaled dot-product attention. In multi-head attention, you apply the attention mechanism in parallel to multiple sets of these matrices, which you get by transforming the original embeddings. The number of times you apply the attention mechanism is the number of heads in the model. For instance, you would need two sets of queries, keys, and values in a model with two heads. The first head would use one set of representations, and the second head would use a different set. In the transformer model, you get the different representations by linearly transforming the original embeddings with a set of matrices W^Q, W^K, and W^V for each head in the model. Using different sets of representations allows your model to learn multiple relationships between the words in the query and key matrices. With that in mind, let me show you how multi-head attention works. The inputs to multi-head attention are the value, key, and query matrices. First, you transform each of these matrices into multiple vector spaces. As you saw previously, the number of transformations for each matrix is equal to the number of heads in the model. Then you apply the scaled dot-product attention mechanism to every set of value, key, and query transformations, where again the number of sets is equal to the number of heads in the model. After that, you concatenate the results from each head into a single matrix. Finally, you transform the resulting matrix to get the output context vectors. 
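Before the multi-head version, it helps to see the single scaled dot-product attention step each head will run. Here is a minimal NumPy sketch; the function name, the toy shapes, and the random inputs are illustrative assumptions, not part of the original lecture.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n_q, n_k) similarity scores
    # Softmax over the key dimension, numerically stabilized
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # (n_q, d_v) context vectors

# Toy example: 4 words in the sequence, embedding size 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
context = scaled_dot_product_attention(Q, K, V)
print(context.shape)  # (4, 8)
```

Each head in multi-head attention runs exactly this computation, just on its own transformed copies of Q, K, and V.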
Note that every linear transformation in multi-head attention contains a set of learnable parameters. Let's go through every step in more detail. Say that you have two heads in your model. The inputs to the multi-head attention layer are the queries, keys, and values matrices. The number of columns in those matrices is equal to d_model, the embedding size, and the number of rows is given by the number of words in the sequence used to construct each matrix. The first step is to transform the queries, keys, and values using a set of matrices W^Q, W^K, and W^V per head of the model. This step gives you the different sets of representations that you use for the parallel attention mechanisms. The number of rows in the transformation matrices is equal to d_model. The number of columns, d_k for the queries and keys transformation matrices and d_v for W^V, are hyperparameters that you can choose. In the original transformer model, the authors advise setting d_k and d_v equal to the dimension of the embeddings divided by the number of heads in the model. This choice of sizes ensures that the computational cost of multi-head attention doesn't exceed by much that of single-head attention. After getting the transformed values for the query, key, and value matrices per head, you can apply the attention mechanism in parallel. As a result, you get one matrix per head with d_v columns. The number of rows in those matrices is the same as the number of rows in the query matrix. Then you concatenate horizontally the matrices output by each attention head in the model, getting a matrix that has d_v times the number of heads columns. Finally, you apply a linear transformation W^O to the concatenated matrix. This linear transformation has a number of columns equal to d_model. 
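The steps above, project per head, attend, concatenate, then apply W^O, can be sketched in NumPy as follows. This is a minimal illustration, assuming two heads, d_model = 8, and randomly initialized weights; the function and parameter names are my own, not from the original model's code.

```python
import numpy as np

def multi_head_attention(Q, K, V, params):
    """Two-or-more-head attention: project, attend per head, concat, project out."""
    head_outputs = []
    for Wq, Wk, Wv in params["heads"]:
        Qi, Ki, Vi = Q @ Wq, K @ Wk, V @ Wv      # project into this head's subspace
        d_k = Qi.shape[-1]
        scores = Qi @ Ki.T / np.sqrt(d_k)        # scaled dot-product scores
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)    # softmax over keys
        head_outputs.append(w @ Vi)              # (n_q, d_v) per head
    concat = np.concatenate(head_outputs, axis=-1)  # (n_q, heads * d_v)
    return concat @ params["Wo"]                 # (n_q, d_model) context vectors

# Two heads, embedding size 8, so d_k = d_v = 8 // 2 = 4
d_model, n_heads = 8, 2
d_k = d_v = d_model // n_heads
rng = np.random.default_rng(1)
params = {
    "heads": [(rng.normal(size=(d_model, d_k)),   # W^Q for this head
               rng.normal(size=(d_model, d_k)),   # W^K
               rng.normal(size=(d_model, d_v)))   # W^V
              for _ in range(n_heads)],
    "Wo": rng.normal(size=(n_heads * d_v, d_model)),
}
Q = K = V = rng.normal(size=(4, d_model))        # 4 words in the sequence
out = multi_head_attention(Q, K, V, params)
print(out.shape)  # (4, 8): one context vector of size d_model per query
```

Note how the concatenated matrix has n_heads * d_v = 8 columns, which is why W^O can map it back to d_model.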
If you choose d_v to be equal to the embedding size divided by the number of heads, the number of rows in this matrix will also be d_model. Thus, as with single-head attention, you get a matrix with a context vector of size d_model for each of your original queries. That's it for multi-head attention: you just apply the attention mechanism to multiple sets of representations of the queries, keys, and values, then concatenate the results from each attention computation into a matrix that you linearly transform to get the context vectors for each original query. In this video, you learned how multi-head attention works and saw some of the dimensions of the parameter matrices involved in its calculations. You can implement multi-head attention so that the computations run in parallel, and with the proper choice of sizes for the transformation matrices, the total computation time is similar to that of single-head attention. In the last three videos, you learned about attention: the basic dot-product attention, the causal one, and the multi-head one. You're now ready to build your own transformer decoder. That's what we'll do in the next video.
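The cost claim can be checked with a quick parameter count. With d_k = d_v = d_model / heads, the total number of projection parameters across all heads plus W^O exactly matches a single head that uses full d_model-sized projections. A small arithmetic sketch, using the original paper's sizes (d_model = 512, 8 heads) as an example:

```python
# Parameter count with d_k = d_v = d_model // heads
d_model, heads = 512, 8
d_k = d_v = d_model // heads

# Per head: three projections W^Q, W^K, W^V of shape (d_model, d_k);
# plus the output projection W^O of shape (heads * d_v, d_model)
multi_head_params = heads * (3 * d_model * d_k) + (heads * d_v) * d_model

# One head with full-size (d_model, d_model) projections and output matrix
single_head_params = 3 * d_model * d_model + d_model * d_model

assert multi_head_params == single_head_params
print(multi_head_params)  # 1048576
```

So splitting the model dimension across heads adds representational flexibility without adding parameters, which is why the total compute stays close to the single-head case.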