Hello and welcome. In this video, we will review how to apply Recurrent Neural Networks to Language Modeling. Language Modeling is a gateway into many exciting deep learning applications like Speech Recognition, Machine Translation, and Image Captioning. At its simplest, Language Modeling is the process of assigning probabilities to sequences of words. So, for example, a language model could analyze a sequence of words and predict which word is most likely to follow. With the sequence "This is an," which you see here, a language model might predict what the next word will be. Clearly, there are many options for the next word in the string, but a trained model might predict, with 80 percent probability, that the word "example" is most likely to follow. This boils down to a sequential data analysis problem: the sequence of words forms the context, and the most recent word is the input data. Using these two pieces of information, you need to output both a predicted word and a new context that contains the input word. Recurrent Neural Networks are a great fit for this type of problem. At the first time step, a recurrent net receives a word as input along with the initial context, and it generates an output. The output word, with the current sequence of words as the context, is then re-fed into the network at the second time step. A new word is predicted, and these steps are repeated until the sentence is complete. Now, let's take a closer look at an LSTM network for modeling language. In this network, we will use an RNN with two stacked LSTM units. To train such a network, we have to pass each word of the sentence to the network and let the network generate an output. For example, after passing the words "This" and "is," if we pass the word "an" at the third time step, we expect the network to generate the word "example" as output. But notice that we cannot easily pass a word to the network.
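The predict-and-re-feed loop described here can be sketched in a few lines of Python. This is a minimal numpy sketch that stands in a plain tanh RNN cell for the LSTM; the sizes and the randomly initialized, untrained weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 16

# Illustrative, randomly initialized weights (a real model would learn these).
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def step(word_id, h):
    """One time step: consume a word, update the context, predict the next word."""
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                   # one-hot encoding of the input word
    h = np.tanh(W_xh @ x + W_hh @ h)   # new context (hidden state)
    logits = W_hy @ h
    return int(np.argmax(logits)), h   # predicted next word, new context

h = np.zeros(hidden_size)              # initial context
word = 0                               # an arbitrary start-word id
generated = []
for _ in range(5):                     # re-feed each prediction as the next input
    word, h = step(word, h)
    generated.append(word)
```

Each pass through the loop plays the role of one time step: the predicted word and the updated context are fed back in, exactly as the video describes for the full LSTM network.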
We have to convert it into a vector of numbers somehow. We can use Word Embedding for this purpose. Let's quickly examine what happens in Word Embedding. An interesting way to process words is through a structure known as a Word Embedding: an n-dimensional vector of real numbers for each word. The vector is typically large, for example, of length 200. You can see what that might look like for the word "example" here. You can think of a Word Embedding as a type of encoding from text to numbers. Now, the question is, how do we find the proper values for these vectors? In our RNN model, the vectors, collectively known as the embedding matrix for the vocabulary, are initialized randomly for all the words that we are going to use in training. Then, during the recurrent network's training, the vector values are updated based on the contexts in which each word appears. So, words that are used in similar contexts end up with similar positions in the vector space. This can be visualized with a dimensionality reduction algorithm. Take a look at the example shown here. After training the RNN, if we visualize the words based on their embedding vectors, similar words are grouped together, either because they are synonyms or because they are used in similar places within a sentence. For example, the words "zero" and "none" are close semantically, so it is natural for them to be placed close together. And while "Italy" and "Germany" are not synonyms, they can be interchanged in many sentences without distorting the grammar. Now, let's look back at the RNN that we have been using. Imagine that the input data is a batch with only one sequence of words. Think of it as a batch that includes one sentence only, one with 20 words. Assume that the vocabulary size is 10,000 words and the length of each embedding vector is 200. We have to look up those 20 words in the randomly initialized embedding matrix and then feed them into the first LSTM unit.
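The embedding lookup just described, with the sizes from the video (a 10,000-word vocabulary, 200-dimensional vectors, a 20-word sentence), can be sketched like this in numpy; the random word ids stand in for a real tokenized sentence.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 10_000, 200, 20

# Randomly initialized embedding matrix: one 200-dim vector per vocabulary word.
# During training, backpropagation would update these rows.
embedding = rng.normal(scale=0.01, size=(vocab_size, embed_dim))

# The 20-word sentence, represented as word ids (illustrative random ids here).
word_ids = rng.integers(0, vocab_size, size=seq_len)

# The lookup is just row indexing: each word id selects its embedding vector.
inputs = embedding[word_ids]   # shape (20, 200), ready for the first LSTM unit
```

So the sentence enters the network not as words but as a 20 × 200 matrix of real numbers, one embedding vector per time step.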
Please notice that only one word is fed into the network at each time step, and one word is output, so over 20 time steps the output is 20 words. In our network, we have two LSTM units with arbitrary hidden sizes of 256 and 128, so the output of the second LSTM unit is a matrix of size 20 by 128. Now, we need a softmax layer to calculate the probability of the output words. It squashes each 128-dimensional vector of real values into a 10,000-dimensional vector, which is the vocabulary size. This means that the output of the network at each time step is a probability vector of length 10,000, and the output word is the one with the maximum probability value in the vector. Now, we can compare the sequence of 20 output words with the ground-truth words, calculate the discrepancy as a quantitative value, the so-called loss value, and backpropagate the errors into the network. Of course, we will not train the model using only one sequence. We will use batches of sequences to train it and calculate the error. So, instead of feeding one sequence, we can feed the network over many iterations, with a batch of, say, 60 sentences each time. Now, the key question to ask is, what does the network learn when the error is propagated back in each iteration? Well, as previously noted, the weights keep updating based on the network's error during training. First, the embedding matrix is updated in each iteration. Second, there is a set of weight matrices related to the gates in the LSTM units, which will be changed. Finally, there are the weights related to the softmax layer, which in a sense plays the decoding role for the words encoded in the embedding layer. By now, you should have a good understanding of how to use LSTM for language modeling. Thanks for watching this video.
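As a recap, the softmax step from the video can be checked with a small numpy sketch. It takes a stand-in for the second LSTM's 20 × 128 output, projects it to the 10,000-word vocabulary, and normalizes each row into a probability vector; the random values replace a real trained network and are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, lstm2_size, vocab_size = 20, 128, 10_000

# Stand-in for the second LSTM unit's output across 20 time steps: (20, 128).
lstm_out = rng.normal(size=(seq_len, lstm2_size))

# Softmax layer: project 128 -> 10,000 (the vocabulary size), then normalize.
W = rng.normal(scale=0.01, size=(lstm2_size, vocab_size))
b = np.zeros(vocab_size)

logits = lstm_out @ W + b                    # (20, 10000)
logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)    # each row now sums to 1

# The output word at each time step is the one with maximum probability.
predicted_words = probs.argmax(axis=1)       # 20 word ids, one per time step
```

Comparing `predicted_words` against the 20 ground-truth word ids (for instance with a cross-entropy loss over `probs`) would give the loss value that is backpropagated through the softmax weights, the LSTM gate matrices, and the embedding matrix.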