So let's continue this discussion of the concept of the inner product, to try to gain further understanding of what it means and how it can be used. So again, let's look at our 10-dimensional, highly simplified representation of a word vector. We're using 10 dimensions here simply because it's easy to show visually, but I want to underscore that in practice, when we actually do machine learning for natural language processing, the dimension of these word vectors could be on the order of 256 or 512. So generally, much larger than 10. We're using 10 here just for visualization. Furthermore, this is back to the example that we considered earlier, the word Paris, and I'm attributing meaning to each of the 10 components in the way that we talked about earlier: by looking at pairs of words which are similar, we can uncover this meaning. Remember, we talked about gender, we talked about plurality, a word being plural, and so we have these various components. So the word Paris is notionally represented by a d-dimensional, here 10-dimensional, vector. If Paris is positively associated with the topic represented by one of those d (here 10) dimensions, that component is positive, and if it is not related, that component is negative.

Let's use this concept now to try to gain some further insight into what this inner product is doing. So let's look at word i and word j from our vocabulary. These are the ith word and the jth word from our vocabulary of V words, just two arbitrary words. What I'm showing you here, in 10 dimensions, are notionally the word vectors associated with those two words. The thing I want you to notice is that, almost always, for each of the 10 components, if the component of the ith word is positive, the corresponding component of the jth word is also positive. So, for example, if you look at the last component, the 10th component, of word i and word j, it is positive in both cases. If you look at the first component of word i and word j, it is negative in both cases. So the thing I want you to notice is that these word vectors are similar, in the sense that for almost all of the 10 dimensions, the sign, positive or negative, is the same for each of the components. With the understanding that each of the components represents some underlying meaning associated with the words, we would infer that, since these word vectors are similar, these two words should have similar meaning.

The other thing that I want you to notice is that if you multiply a negative number by a negative number, you get a positive number. Remember that the inner product is taking component-by-component multiplications. So if we look at, for example, the 10th component of these vectors, it's positive in both cases; a positive number multiplied by a positive number gives a positive number. If we look at the first component, it's negative in both cases; a negative number multiplied by a negative number gives a positive number. So the idea is that if the ith and jth words almost always share the same sign for each of the 10 components, then when we take those component-wise multiplications, the results will almost always be positive, and when we sum them up to constitute the final inner product, that inner product will be positive. So we see that if words have similar meaning, which means that they have similar word vectors, then the associated inner product will be positive. The more similar the words are, the more positive that inner product will be.
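To make this concrete, here is a minimal numerical sketch. The 10-dimensional vectors below are made up for illustration (they are not the actual word vectors from the lecture); only the pattern of signs matters. Because the signs of corresponding components almost always agree, the component-wise products are almost all positive, and their sum, the inner product, comes out positive.

```python
import numpy as np

# Hypothetical 10-dimensional word vectors for two similar words
# (made-up values; their signs agree in 9 of the 10 components).
w_i = np.array([-0.6,  0.4, -0.3,  0.8, -0.5,  0.7,  0.2, -0.4,  0.9,  0.5])
w_j = np.array([-0.5,  0.3, -0.2,  0.6, -0.4,  0.8, -0.1, -0.3,  0.7,  0.6])

# Component-by-component multiplication: like signs give positive terms.
print(w_i * w_j)          # almost all entries are positive

# Summing the component-wise products gives the inner product.
print(np.dot(w_i, w_j))   # positive (about 2.75), suggesting similar meaning
```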
In contrast, if we look at these two words, word i and word j, and we look at, for example, the 10th component, the rightmost component, word i is negative and word j is positive. So we see that those are inconsistent with each other. If we look at the first component, word i is positive and word j is negative. Since the signs of the components of word i and word j are almost always opposite, when we take the component-wise products, those products will mostly be negative. When we sum them all together, we would expect to get a negative inner product. A negative inner product implies that the ith and jth words are dissimilar; they do not have the same meaning.

So the key concept that we're trying to get at through this inner product is that each of the components of the d-dimensional vectors associated with words has underlying meaning. If two words are dissimilar, they will in general have different signs on the respective d components, and the inner product will be negative. If two words are aligned, the signs of the d components will mostly align, and the inner product will be positive. So therefore, the sign of the inner product, positive meaning similar, negative meaning dissimilar, tells us information about any two words in a vocabulary.

Now, for reasons that will become clear in a moment, it is not very convenient to have to work with inner products that are sometimes negative and sometimes positive. This idea of a negative or positive inner product tells us a lot about the relationships between words. However, when we actually do the machine learning, it will be more convenient to work only with positive numbers. To do that, we're going to remind ourselves of the exponential function. Here, the x-axis represents the input and the vertical axis represents the output of the function, which is the exponential function, exp(x). Notice that when x is equal to 0, exp(0) is 1. The most important thing to notice from this representation of the exponential function is that for every input, for any value of x, the exponential function is positive. However, there are other functions we could choose that would always be positive. The reason we use the exponential function is that the more positive x is, the larger the exponential function is, and the more negative x is, the smaller the exponential function is. So the exponential function is what is called a monotonically increasing function of the input x.

Remember that a positive or large inner product represents words that are aligned, and a negative inner product represents words that are not aligned, that are different. We want to preserve this sense of scale between positive and negative, and the exponential function is a way of doing that. The output of the exponential function is always larger if the input is larger. So if you give me two values of x, one negative and one positive, the exponential function of the positive input will always be larger than the exponential function of the negative input. So the exponential function preserves the meaning of positive and negative inputs, but its output is always positive. This makes it a convenient function, for reasons that we will see in a moment. So, as I said before, the exponential function is positive for all x, and, as I was alluding to, it is a monotonically increasing function of the input x.
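A small sketch of both points, again with made-up vectors: when the signs of corresponding components are almost always opposite, the inner product comes out negative, and np.exp shows that the exponential function is always positive while preserving the ordering of its inputs.

```python
import numpy as np

# Hypothetical 10-dimensional vectors for two dissimilar words:
# the signs of corresponding components are almost always opposite.
w_i = np.array([ 0.6, -0.4,  0.3, -0.8,  0.5, -0.7, -0.2,  0.4, -0.9,  0.5])
w_j = np.array([-0.5,  0.3, -0.2,  0.6, -0.4,  0.8,  0.1, -0.3,  0.7,  0.6])

# Opposite signs give mostly negative component-wise products...
print(w_i * w_j)
# ...so the inner product (their sum) comes out negative.
print(np.dot(w_i, w_j))

# The exponential function is always positive and monotonically increasing:
for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(x, np.exp(x))   # exp(0) = 1; a larger x always gives a larger exp(x)
```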
Here, in the context of what we're going to do, the x is going to correspond to the inner product, as we'll see in a moment. So if we go back and wrap this up: on the top left is that inner product, now spelled out in all of its detail as the sum of component-by-component products. To the right is our notation for the inner product, written as a dot product; the reason we use that notation is that the inner product between c_1 and c_2 is notationally just c_1 · c_2. If the inner product between c_1 and c_2 is large, then the exponential of it, because the exponential is a monotonically increasing function, will also be large. The other thing is that for any input, for any inner product of c_1 and c_2, the exponential is always positive. But it preserves this monotonically increasing nature, which is important because it therefore preserves this concept of similarity between words. If two words are similar, then the exponentiation of the inner product between those two words will be large. If the inner product between two words c_1 and c_2 is small or negative, the exponential of it will be small. So the idea is that through this exponentiation we preserve the meaning of words as represented by their inner products. We choose to use the exponential function because it's always positive, and, as we will discuss in a moment, it's very convenient for the model that we're going to develop.
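Putting the pieces together, here is a minimal sketch, with made-up vectors, of how exponentiating the inner product yields a score that is always positive yet still preserves the ordering given by similarity; the name `score` and the specific values are illustrative assumptions, not the lecture's model.

```python
import numpy as np

# Hypothetical word vectors: c1 and c2 are similar (signs mostly agree),
# while c3 is dissimilar to c1 (signs mostly disagree).
c1 = np.array([-0.6,  0.4, -0.3,  0.8, -0.5,  0.7,  0.2, -0.4,  0.9,  0.5])
c2 = np.array([-0.5,  0.3, -0.2,  0.6, -0.4,  0.8, -0.1, -0.3,  0.7,  0.6])
c3 = -c2  # flip every sign to get a vector dissimilar to c1

def score(u, v):
    """Exponentiated inner product: always positive, larger for more similar vectors."""
    return np.exp(np.dot(u, v))

print(np.dot(c1, c2), score(c1, c2))   # positive inner product -> large positive score
print(np.dot(c1, c3), score(c1, c3))   # negative inner product -> small positive score
```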