We started out talking about embeddings for movie IDs; those were categorical features. Then we applied the same idea to the words in an ad, and those were text features. So, what's common between them? Embeddings are not just for categorical or text features; they're about something more. Here, I'm showing you a classic machine learning problem called MNIST. The idea is to recognize handwritten digits from scanned images. You take each image, and each of the pixels in the image is an input. That's what I mean by raw bitmap here. Now, the images are 28 by 28, so there are 784 pixels in that bitmap. So, consider this array of 784 numbers. Most of the array corresponds to blank pixels. Embeddings are useful here also. We take the 784 numbers and represent them as a sparse tensor. Essentially, we save only the pixels where the handwritten digit appears, that is, the pixels where the digit is black, and then we pass them through a 3D embedding. We can then have a normal two-layer neural network, and we could pass in other features if we wanted, and then we train the model to predict the actual number in the image based on the labels. Why do I have a logit layer here? Logits form the output layer of a neural network. A logit is what the output of a classification problem has to be. When we use a linear classifier or a DNN classifier, the output layer is a logit, one single logit. But that's only if you have one output. In the case of the MNIST problem, we have 10 total classes, essentially the digits zero, one, two, up to nine. So that's why I don't have one logit; I have a logit layer, with one logit for each of the possible digits. When we have a logit layer, as opposed to a single logit, there is no guarantee that the total probability of all the digits will equal one. That's the role of the softmax: it normalizes the individual logits so that the total probability equals one. But sorry for the digression; we were talking about embeddings.
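To make the softmax point concrete, here is a minimal NumPy sketch of what the softmax does to a logit layer of ten values, one per digit class. The specific logit values are made up for illustration; the point is only that the outputs are normalized to sum to one.

```python
import numpy as np

def softmax(logits):
    """Normalize a vector of logits into probabilities that sum to one."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# One logit per digit class 0-9, as a logit layer would produce (made-up values).
logits = np.array([1.2, -0.3, 0.5, 2.1, 0.0, 3.4, -1.0, 0.7, 1.8, 0.2])
probs = softmax(logits)

print(probs.sum())     # sums to 1 (up to floating point)
print(probs.argmax())  # index of the most likely digit
```

The raw logits sum to whatever they sum to; only after the softmax can they be read as a probability distribution over the ten digits.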
So, once we have trained the model to recognize handwritten digits, each image will be represented by three numbers. Unlike in the categorical case, though, the raw bitmap is not one-hot encoded, so we won't get three numbers for each pixel. Instead, the three numbers correspond to all the pixels that are turned on for a specific image. In TensorBoard, you can visualize these embeddings, the 3D vector that corresponds to each 784-pixel image. Here, we have assigned different colors to the labels, and as you can see, something cool happens: all the fives cluster together in 3D space, as do all the sevens and all the zeros. In other words, the three numbers that represent each handwritten image are such that similar items are close to each other in the 3D space. This is true of embeddings for categorical variables, for natural language text, and for raw bitmaps. So what's common to all of them? They're all sparse. If you take a sparse vector encoding, pass it through an embedding column, use that embedding column as the input to a DNN, and then train the DNN, the trained embeddings will have this similarity property, as long as, of course, you have enough data and your training achieved good accuracy. You can take advantage of this similarity property in other situations. Suppose, for example, your task is to find a song similar to this song. What you could do is create an embedding of the audio associated with songs. Essentially, you take the audio clip and represent it as an array of values. Then, just as with the MNIST image, you take the array and pass it through an embedding layer. You use it to train some reasonable machine learning model; perhaps you use the audio signal to train a model to predict the musical genre or the next musical note. Regardless of what you train the model to predict, the embedding will give you a lower-dimensional representation of the audio clip.
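Here is a small NumPy sketch of the sparse-encoding-plus-embedding idea described above. The bitmap and the embedding table are toy stand-ins (in a real model the 784-by-3 table would be learned during training, not random); the sketch just shows that looking up and summing the rows for the active pixels gives the same 3D vector as multiplying the full sparse vector by the table.

```python
import numpy as np

# A toy 28x28 "bitmap": mostly blank, with a few pixels turned on.
bitmap = np.zeros((28, 28), dtype=np.float32)
bitmap[5, 10] = 1.0
bitmap[6, 10] = 1.0
bitmap[7, 11] = 1.0

# Sparse encoding: store only the flat indices of the non-blank pixels.
flat = bitmap.reshape(-1)      # the array of 784 numbers
active = np.flatnonzero(flat)  # indices of the pixels that are on

# A 784x3 embedding table. These weights are random placeholders here;
# in a trained model they would be learned from data.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(784, 3))

# The 3D embedding of the image: sum the table rows for the active pixels,
# which is exactly what multiplying the sparse vector by the table computes.
image_embedding = embedding_table[active].sum(axis=0)

print(image_embedding.shape)  # (3,): three numbers per image, not per pixel
```

This lookup-and-sum is why the sparse encoding is so cheap: you never touch the rows for the 781 blank pixels.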
Now, to find similar songs, you can simply compute the Euclidean distance between two clips, that is, between their embeddings, and this becomes a measure of the similarity of the two songs. You could also use the embedding vectors as inputs to a clustering algorithm. The similarity idea can also be used to jointly embed diverse features, for example, text in two different languages, or text and its corresponding audio, to define similarity between them. In all four examples, we used three for the number of embedding dimensions. You can use different numbers of course. But what number should you use? The number of embedding dimensions is a hyperparameter of your machine learning model. You will have to try different numbers of embedding dimensions because there is a trade-off here. Higher-dimensional embeddings can more accurately represent the relationships between input values. But the more dimensions you have, the greater the chance of overfitting. Also, the model gets larger, and this leads to slower training. So, a good starting point is to go with the fourth root of the total number of possible values. For example, if you're embedding movie IDs and you have 500,000 movies in your catalogue, the total number of possible values is 500,000, so a good starting point would be the fourth root of 500,000. Now, the square root of 500,000 is about 700, and the square root of 700 is about 26. So, I would probably start at around 25. If you are doing hyperparameter tuning of the number of embedding dimensions, I would specify a search space of maybe 15 to 35. But that's just a rule of thumb, of course.
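Both ideas from this section, the distance-based similarity and the fourth-root rule of thumb, fit in a few lines of NumPy. The two 3D song embeddings below are invented for illustration:

```python
import numpy as np

# Rule of thumb from the talk: start with roughly the fourth root of the
# total number of possible values for the feature you are embedding.
def suggested_embedding_dims(num_values):
    return int(num_values ** 0.25)

dims = suggested_embedding_dims(500_000)
print(dims)  # -> 26, so start around 25 and search a space of roughly 15-35

# Similarity: Euclidean distance between the embeddings of two audio clips.
# These 3D vectors are made up purely for illustration.
song_a = np.array([0.2, 1.1, -0.4])
song_b = np.array([0.3, 0.9, -0.5])
distance = np.linalg.norm(song_a - song_b)  # smaller distance = more similar
```

The same `distance` computation works for any of the embeddings in this section, whether they came from movie IDs, ad text, bitmaps, or audio, because they are all just short vectors of numbers.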