Now you might agree that representing text with vectors is a good idea. But how exactly does it work? Let's look at Word2vec and walk through the neural network training process. Word2vec was created by a team of Google researchers led by Tomas Mikolov in 2013. If you're interested, please read their original paper in the reading list.

Let's start with intuition. Think about how you learned new vocabulary while reading back in elementary school. You read, you guess, and you remember. How do you guess the meaning of a new word? Normally through the context, meaning the words surrounding that missing word. For example, look at this sentence: "Capable lawyers with business acumen are valuable to any firm." As an eight-year-old, you don't recognize the word acumen, but you might guess through the context that this word probably means intelligence and sharp judgment. The same idea of using context to predict words applies to Word2vec.

Word2vec is not a single algorithm but a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. You'll explore two major variants of Word2vec: continuous bag-of-words (CBOW) and Skip-gram. Both algorithms rely on the same idea of using relationships between the surrounding context and the center word to make predictions, but in opposite directions: continuous bag-of-words predicts the center word given the context words, and Skip-gram predicts the context words given the center word.

Let's look at how these two algorithms work, going back to our favorite example: "A dog is chasing a person." Assume you miss the center word, chasing, and CBOW tries to use the context words, "dog is" and "a person," to predict the missing word. This is as intuitive as filling in a blank on a quiz. How does it work technically? Let's walk through the two primary steps: preparing the dataset and training the neural network. Step one, prepare the training data.
You run a sliding window of size 2K + 1 over the entire text corpus. You might hear the word corpus often in NLP. It basically means a collection of words, or a body of words; corpus in Latin means body. To put it simply, let's say K = 2. You start with the first word, a, and run a 2K + 1 window with K = 2, meaning up to two words before and two words after the center word. The word a has no words before it but two words after it. Therefore the two words after, dog and is, will be used to predict a, which is sometimes called the label word. Then the window shifts to the second word, dog, as the center word. Dog will be predicted by the word, a, before it and the words, is and chasing, after it. Therefore, three words, a, is, and chasing, will be used to predict the label word dog. The same process continues till the last word, person, is the center word. Now you have the training data ready.

Step two, train the neural network, specifically a narrow neural network, to learn the embedding matrix. Narrow means there's only one hidden layer in this model, instead of multiple hidden layers as in a deep neural network. This step is a little complicated; feel free to skip it if you don't intend to go in depth. In practice, you normally call a Word2vec library, for example from TensorFlow, without needing to know the details. In case you're curious and want to train your own Word2vec model instead of using a pre-trained one, here is what happens in the back end.

The goal is to learn the embedding matrix E, a V × D matrix, where V is the size of the vocabulary and D is the number of dimensions. For example, in the sentence "A dog is chasing a person," you have five different words: a, dog, is, chasing, and person. Therefore, V = 5. The value of D depends on the number of features that you want the neural network to learn to represent each word. It can be anywhere from one to four digits. Normally, when the number is larger, the model will be more refined, although it costs more computational resources.
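The sliding-window data preparation from step one can be sketched in a few lines of Python. This is an illustrative sketch, not code from any library; the function name `make_cbow_pairs` is made up for this example.

```python
# A minimal sketch of step one: generating CBOW training pairs
# with a sliding window of size 2K + 1 (here K = 2).

def make_cbow_pairs(tokens, k=2):
    """Return (context_words, center_word) pairs for CBOW."""
    pairs = []
    for i, center in enumerate(tokens):
        # Up to k words before and k words after the center word.
        context = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        pairs.append((context, center))
    return pairs

corpus = ["a", "dog", "is", "chasing", "a", "person"]
for context, center in make_cbow_pairs(corpus):
    print(context, "->", center)
# First pair: ['dog', 'is'] -> a, matching the walkthrough above.
```

Notice that the first and last words naturally get smaller contexts, exactly as described: a is predicted only by the two words after it.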
D is also a hyperparameter that you can tune when using Google's pre-trained Word2vec embedding, which means that you can try different numbers to see which one produces the best result. To keep the visual simple, let's say D = 3 here. Therefore, the matrix in this example is 5 by 3, and it looks like this. Each W represents a weight that you want to train the neural network to learn.

A simplified version to explain the process is to take the vectors of the surrounding 2K words as input, sum them up, and output a vector to represent the center word. This version is not very informative, so let's discuss it in more detail. First is the input layer. The input layer consists of 2K words, each represented by a one-hot encoded vector. If you recall from the previous lesson, a one-hot encoded vector is a 1 × V vector, where V is the size of the vocabulary. You place a one in the position corresponding to the word and zeros in the rest of the positions. Then you embed these vectors with the embedding matrix E (V × D), which means you multiply each vector by the embedding matrix. To illustrate this, let's assume you have a one-hot encoded input vector [0, 1, 0, 0, 0]. You multiply it with the embedding matrix, which in this case is the Word2vec matrix learned with the CBOW technique, and you get the embedded 1 × D vector [10, 12, 23]. You can imagine the embedding matrix as a lookup table to turn the original word into a vector that has semantic meaning. To begin this process, all the values (weights) in the embedding matrix E are randomly assigned.

Once you get the embedded vector for each context word, sum all the vectors to get a hidden layer H, a 1 × D vector. You multiply H with another matrix E′ (a D × V matrix), and feed the result to a softmax function to get the probability vector Y. Y is a 1 × V vector, and each value shows the probability of the word in that position being the center word. This is the output layer and the predicted result.
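The forward pass just described can be sketched with NumPy. This is a simplified illustration of the computation, not a trained model: the weights are random placeholders, and V = 5, D = 3 follow the running example.

```python
import numpy as np

# A simplified sketch of the CBOW forward pass: one-hot inputs,
# embedding lookup, summed hidden layer, and a softmax output.
rng = np.random.default_rng(0)
V, D = 5, 3
E = rng.normal(size=(V, D))        # embedding matrix, V x D
E_prime = rng.normal(size=(D, V))  # output matrix, D x V

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def softmax(z):
    z = z - z.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Context words "dog" (index 1) and "is" (index 2) predict the center word.
context = [one_hot(1, V), one_hot(2, V)]
h = sum(x @ E for x in context)    # hidden layer H, 1 x D
y = softmax(h @ E_prime)           # output Y, 1 x V of probabilities
print(y)                           # five probabilities summing to 1
```

Multiplying a one-hot vector by E simply selects one row of E, which is why the embedding matrix behaves like a lookup table.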
However, this is actually just the beginning of the iteration. You must compare the output vector with the actual result, and use back-propagation to adjust the weights in the embedding matrices E and E′. Iterate this process until the difference between the predicted result and the actual result is minimal. Now you have E, the Word2vec embedding matrix that you aimed to learn at the beginning. This training process depicts how a neural network learns. You'll explore more details about neural networks in the next module.

Opposite to continuous bag-of-words, Skip-gram uses the center word to predict context words. For example, given chasing, what are the probabilities of other words, such as a, dog, and person, occurring in the surrounding context? The process is similar to that of CBOW. First you prepare the training data by running a sliding window of size 2K + 1; for example, let K = 2. You start with the first word, a. You have no word before a, but two words after it. Therefore, you use a to predict dog and is, respectively. Then the window shifts to the second word, dog, as the center word. You use dog to predict the word, a, before it, and the words, is and chasing, after it. The same process continues till the last word, person, is the center word. Now your training data is ready.

Next, you train the neural network to learn the embedding matrix E. Compared to CBOW, E here is learned with the Skip-gram technique. You start with the input layer, which consists of the one-hot encoded vector of the center word. You embed this vector with the embedding matrix E and feed the embedded vector to one hidden layer. After another matrix E′ and the softmax function, you finally generate the output layer, which consists of vectors predicting the 2K surrounding context words. Now that you understand how Word2vec works, let's look at the coding.
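The Skip-gram data preparation mirrors the CBOW sketch, with the direction reversed: each center word is paired with every context word in its window. Again, this is an illustrative sketch with a made-up function name, not library code.

```python
# A minimal sketch of preparing Skip-gram training pairs: the center
# word predicts each context word within the 2K + 1 window (K = 2).

def make_skipgram_pairs(tokens, k=2):
    """Return (center_word, context_word) pairs for Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

corpus = ["a", "dog", "is", "chasing", "a", "person"]
print(make_skipgram_pairs(corpus)[:2])
# The first pairs use "a" to predict "dog" and "is", respectively.
```

Compared with CBOW, each window now yields several (center, context) training examples instead of one (context, center) example.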
Fortunately, you don't have to train Word2vec yourself. Keras in TensorFlow packages all the details, and you only need to call the pre-trained models and tune the hyperparameters. In the setup, you must import Embedding from Keras. Then you use one statement to create an embedding layer to embed the words in the vocabulary into vectors of a certain dimensionality. The embedding layer can be understood as a lookup table that maps from specific words (integer indices) to dense vectors (their embeddings). The dimensionality, or width, of the embedding is a parameter that you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a dense layer. When you create an embedding layer, the weights for the embedding are randomly initialized, just like for any other layer. During training, they are gradually adjusted by using back-propagation. Once trained, the learned word embeddings will roughly encode similarities between words, as they were learned for the specific problem your model is trained on. That's it.
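The setup described above might look like the following sketch, using the Keras Embedding layer from TensorFlow. The vocabulary size (5) and dimensionality (3) follow the running example; in practice both are hyperparameters you would tune.

```python
import numpy as np
import tensorflow as tf

# Create an embedding layer that maps integer word indices to dense
# vectors. input_dim is the vocabulary size V, output_dim is D.
embedding_layer = tf.keras.layers.Embedding(input_dim=5, output_dim=3)

# Look up the (randomly initialized) embeddings for the word indices
# of "dog" (1) and "is" (2); training would adjust these weights.
vectors = embedding_layer(np.array([1, 2]))
print(vectors.shape)  # (2, 3): one 3-dimensional vector per word
```

This one layer plays the role of the embedding matrix E from the walkthrough: indexing into it is the lookup-table operation, and back-propagation during training turns its random weights into meaningful embeddings.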