0:18

So we have a word, say the word "word". Technically, to fit it into TensorFlow, you'd probably have to represent it as some kind of number, for example the ID of this word in your dictionary. And the way you usually use this word in your pipeline is you take a one-hot vector, a vector the size of your dictionary that has only one nonzero value, and then push it through some kind of linear model, neural network, or similar.

The only problem is that you're actually doing this very inefficiently. You have this one-hot vector, and you multiply it by a weight vector, or a weight matrix, and that's a wasteful process, because you have a lot of weights that get multiplied by zeros.

Now, you could actually compute this kind of weighted sum much more efficiently. If you look slightly closer, you could write the answer itself without any sums or multiplications. Could you do that? Yeah, exactly. You could just take the one weight corresponding to the nonzero element, say weight ID 1337, and this weight would be equal to your whole product, because everything else is 0.

You could use the same approach when you have multiple neurons. So let's say that instead of a vector, we now multiply by a matrix. In this case you could also try to deduce the result: think of a matrix product as just a lot of vector products stacked. Now, how do you compute the output activation vector of your dense layer if you use this kind of one-hot representation?

Yeah, exactly. You take the first column vector, and of this vector the only thing remaining is the element under ID 1337. Then you take the second column, and its corresponding element becomes your second activation. Then the third, the fourth, and so on; you have as many columns as you have hidden units. And if you visualize all the remaining values in this matrix, they turn out to be just one row of it. Basically, this says that you can replace this large-scale multiplication by simply taking one row. And this, of course, speeds up your computation a lot.
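As a minimal sketch of this trick (the index 1337 and the shapes are just the running example from the lecture, not fixed values), multiplying a one-hot vector by a weight matrix gives exactly one row of that matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_units = 10_000, 64   # illustrative sizes

one_hot = np.zeros(vocab_size)
one_hot[1337] = 1.0                     # the word's ID in the dictionary

W = rng.normal(size=(vocab_size, hidden_units))

# The full multiplication and the single-row lookup agree element-wise,
# but the lookup does no sums or multiplications at all.
assert np.allclose(one_hot @ W, W[1337])
```

This is exactly why embedding layers are implemented as table lookups rather than matrix products.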

Â 2:37

Now let's finally get back to word2vec. Remember, we want to train vector representations so that words with similar contexts get similar vectors. The general idea of how we do that: we define a model which, much like an autoencoder, trains the representation you want as a byproduct of its training. What it actually tries to do is predict a word's context. So it only has one input, the word, say the word "liked", and for every other word it wants to predict the probability of that word being a neighbor of the input word. Basically, you would expect this model to output large probabilities for words that co-occur with your word "liked", like "restaurant", for example, and small probabilities for words that don't.
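A hedged sketch of such a word-to-context model's forward pass (all names, sizes, and the random initialization here are illustrative assumptions, not the lecture's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 10_000, 100       # illustrative sizes

W_in = rng.normal(scale=0.01, size=(vocab_size, emb_dim))   # word vectors
W_out = rng.normal(scale=0.01, size=(emb_dim, vocab_size))  # output layer

def context_probs(word_id):
    """P(neighbor = w | input word) for every word w in the vocabulary."""
    h = W_in[word_id]                   # one-hot times W_in == row lookup
    logits = h @ W_out
    z = np.exp(logits - logits.max())   # numerically stable softmax
    return z / z.sum()

p = context_probs(1337)                 # hypothetical ID of the word "liked"
assert p.shape == (vocab_size,)
```

Training would push `p` up for observed neighbors and down for everything else; the rows of `W_in` are the vectors we are after.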

Â 3:27

Now, the problem here is that this is an under-defined machine learning task, so you can't actually predict contexts perfectly. But we don't even need that. We don't expect our model to predict the probabilities ideally; perfect predictions aren't necessary, because we only need this model in order to obtain the first matrix, the one on the left here.

Now, what this matrix actually does is take the one-hot representation of one word and multiply it by a matrix of weights. Since we already know that this multiplication can be simplified, the idea is that you have this matrix, and for each word in your mini-batch you take the corresponding row of the matrix and send it forward through your network. The second layer then tries to take this representation, this word vector. Don't be afraid of this term.

Â 5:53

Now, okay, so basically, those two models are kind of symmetric, and they learn similar representations up to, well, some minor changes; the general idea stays the same. And you can, again, use one of those two matrices as your word-embedding matrix. For example, in the word-to-context model, for every possible sample only one row of the first matrix was used at a time. So basically, you could assume that this row of the matrix is the vector corresponding to your word, and use this matrix as your word embedding.

If you train this model by yourself, or if you use a pretrained one, you'll actually notice that it has a lot of peculiar properties on top of what we wanted it to have. Of course, it does what we actually trained it for: it learns similar vectors for synonyms and different vectors for semantically different words. But there's also a very peculiar effect, a kind of linear word algebra. For example, if you take the vector of "king", subtract from it the vector of "man" and add the vector of "woman", you get something very close to the vector of "queen". So, king minus man plus woman equals queen. Or, another example: moscow minus russia plus france equals paris. And these kind of make sense, although they're under-defined in mathematical terms; this is just a side effect of how the model trains. So, like other models we've studied previously, this is not a desired, originally intended effect, but it's very interesting, and sometimes it's even helpful for applications of these word-embedding models.
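To illustrate this word algebra (with tiny hand-picked 2-D vectors chosen so that the analogy works; real trained embeddings are higher-dimensional and learned, not set by hand):

```python
import numpy as np

# Toy, hand-picked 2-D "embeddings" -- NOT trained vectors.
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.9,  1.1]),
    "woman": np.array([0.9, -0.9]),
    "paris": np.array([-1.0, 0.0]),
}

def nearest(vec, exclude):
    """Word whose embedding is most cosine-similar to vec."""
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

# king - man + woman lands closest to queen
target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Excluding the query words themselves is standard practice when evaluating analogies, since the input words are often the nearest neighbors of the result.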

Now, if you visualize those word vectors, for example by taking the first two principal components of your trained word vectors, it also emerges that this linear algebra extends very nicely to the structure of the embedding space. For example, in many cases you may expect a similar direction vector connecting all countries to their corresponding capitals, or all male profession names to the corresponding female profession names. So there are a lot of these peculiar properties. Of course, you cannot expect them to hold with 100% certainty: sometimes you get the desired effect, sometimes you just get rubbish. The model is not strictly required to preserve these exact distances; it just learns something peculiar by the way it trains. And this coincides with the idea that, for example, autoencoders and other unsupervised learning methods have a lot of unexpected properties that they all satisfy.

So hopefully by now I've managed to convince you that having those word vectors around is really convenient, or at least cool, because they have all those nice properties. It's later going to turn out that those word vectors are really crucial for some other deep learning applications in natural language processing, like recurrent neural networks. But before we cover that, let's actually find out how we train them, how we obtain those vectors, so we can start collecting the benefits.

Â 9:46

The first layer, which basically takes a one-hot vector and multiplies it by a matrix, can, as you already know, be replaced by just taking one row of this matrix. But the second layer doesn't have this property, because it uses a dense vector. So if you compute this thing naively, you're actually going to face a matrix multiplication on the scale of, say, a 100-dimensional vector by 10^4 or 10^5 possible words, which is quite heavy for a model that only has two layers in it.

And the hardest part here is that you cannot actually cheat by computing only a partial output of this matrix. The problem is that this second layer tries to predict probabilities, and when you say "probability" in deep learning, you actually mean softmax. The problem with softmax is that to compute just one class probability, you have to exponentiate the logit for this class and then divide it by the sum of exponentiated logits for all possible classes, including this one. And that second, normalizing part is really hard, because you have to add up the unnormalized probabilities, the exponentiated logits, from all classes in order to compute just one output.
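A small sketch of why this is expensive (the vocabulary size and logits are made up): even asking for the probability of a single class forces a sum over every class in the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50_000                     # illustrative vocabulary size
logits = rng.normal(size=vocab_size)

def softmax_prob(logits, idx):
    """Probability of one class -- still touches all vocab_size logits."""
    z = logits - logits.max()           # shift for numerical stability
    return np.exp(z[idx]) / np.exp(z).sum()   # denominator sums everything

p = softmax_prob(logits, 1337)
assert 0.0 < p < 1.0
```

The numerator is one exponential; the denominator is the part you can't avoid in exact softmax.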

Â 10:59

Now, okay, you could of course do this. Theoretically, there is enough GPU memory to do that on modern GPUs, and it's even feasible on CPUs. But the problem is that it's a very simple operation that requires a lot of compute. So instead, there are some special modifications of softmax, like hierarchical softmax or sampled softmax, which try to estimate this thing more efficiently, sacrificing either some of the mathematical properties or the fact that softmax yields a deterministic probability, so just adding some noise.
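A rough sketch of the idea behind sampled softmax (a simplified uniform-sampling estimator of the normalizer, not the exact algorithm from any particular library): estimate the sum over all classes from a small random subset, trading exactness for speed and accepting some noise.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50_000
logits = rng.normal(size=vocab_size)

def sampled_softmax_prob(logits, idx, n_samples=100):
    """Noisy estimate of softmax(logits)[idx] from n_samples classes."""
    sampled = rng.choice(len(logits), size=n_samples, replace=False)
    # Scale the partial sum up as an estimate of the full normalizer.
    est_norm = np.exp(logits[sampled]).sum() * (len(logits) / n_samples)
    return np.exp(logits[idx]) / est_norm

# Computed from ~100 exponentials instead of 50,000 -- but stochastic.
p = sampled_softmax_prob(logits, 1337)
assert p > 0.0
```

Libraries implement more careful estimators (TensorFlow, for example, ships a sampled softmax loss), but the core idea is the same: a cheap, noisy stand-in for the full normalizing sum.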

Â 11:33

There's also a number of similar models that try to avoid computing probabilities, avoiding softmax altogether, like GloVe, Global Vectors, which uses no such nonlinearity in its pipeline. Now, finally, word embeddings can be extended to higher-level representations; you can find embeddings for different objects. For example, you can find embeddings for an entire sentence, which makes the method kind of hierarchical. Or you could try to find embeddings for specific data, like, maybe, amino acids. From bioinformatics, for instance, I know a model called protein2vec that tries to vectorize protein components. And this is the more or less advanced part of natural language processing. We'll add links describing it in the readings section, but you can more or less expect that it will be covered in more detail in the natural-language-oriented course in our specialization.

Â Now, okay, so basically, to be continued.

Â If you're intrigued by this,

Â you can jump to the reading section before the NLP course starts.

Â So this basically concludes the part of today's lecture dedicated to natural

Â language, and word embeddings, in particular.

Â But don't worry.

Â [INAUDIBLE] reading section, we'll also have the entire next week dedicated to

Â advanced applications for natural language processing.

Â We'll study recurrent neural networks that can, when paired with word embeddings

Â of course, solve not only the text classification like sentiment analysis.

Â But also the inverse problem,

Â like generating the text given a particular kind of task.

Â This, of course, coincides very well with your course project,

Â which is generating text captions given images.

Â See you in the next section.
