In this video, we’ll talk about the Softmax function and understand its internal workings. First, we’ll discuss the Softmax function in 1D, and then we’ll cover the 2D case. This will give you an intuition of how the Softmax function generalizes to multiple dimensions. Just like logistic regression, we’ll have integer classes, but instead of just two classes we can have multiple classes; in this case we have four. We’ll also have a feature vector or tensor for each sample, and each sample will correspond to a different row in the matrix or tensor X. Just like logistic regression, the Softmax function will use different lines to classify data.

Let’s start with an example to understand how the Softmax function works. In this example we have three classes and a one-dimensional feature vector x. y equals 0 is denoted by the blue points, y equals 1 is denoted by the red points, and y equals 2 is denoted by the green points. It is pretty evident from this plot that any point in this region will be classified as blue, points in this region will be classified as red, and points in this region will be classified as green.

Let’s see how we can use different lines to classify these points. Here we have three different lines, each with its own weight and bias term. Let’s look at the outputs of these lines and see what happens when we plug in different values of x. If we plug in a value of x from this region, it turns out the output of the line z zero will be greater than the outputs of the red and green lines. From the picture it is clearly evident that the output of the red line is less than that of the blue line, and later in the video we will see that the output of the green line is negative for points in this region. Thus, we can conclude that for values of x in this region, z zero is greater than z one and z two. Similarly, if we plug in a value of x from this region, the output z one will be greater than that of the other lines; thus z one will be greater than z zero and z two for values of x in this region. Finally, if we’re in the green region, the output z two will be larger than that of any other line; thus z two will be greater than z zero and z one for values of x in this region.

Before we continue, let’s review the argmax function. The argmax function returns the index corresponding to the largest value in a sequence of numbers. Here the largest value in z is 100, and the corresponding index is 0, so the argmax function will return 0. In this example, the largest value of z is 10 and the corresponding index is 7, so the argmax function will return 7.

Now, let’s combine the argmax function with the different lines from earlier to fully understand how the Softmax function works. Here we have the three lines along with the actual values of the weight and bias parameters. Say we have a sample x equals minus 0.5. Plugging this value of x into each line, we get an output for each line; for the green line we can’t really see the output on the plot since it’s negative. We store these outputs in the following table, with the index i corresponding to the line number, and apply the argmax function to the table. Since the largest value corresponds to z zero, the argmax function returns zero and y hat equals 0. That’s the Softmax prediction for that sample.
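To make this concrete, here is a minimal sketch of the 1D example in Python with PyTorch. The actual weight and bias values shown on the slide are not captured in this transcript, so the values below are hypothetical, chosen only so that the decision regions match the ones described: class 0 for x below 0, class 1 for x between 0 and 1, and class 2 for x above 1.

```python
import torch

# Hypothetical weights and biases (not from the video's slide), chosen so the
# three lines z_i = w_i * x + b_i cross at x = 0 and x = 1.
w = torch.tensor([-2.0, 0.0, 2.0])   # one weight per line
b = torch.tensor([ 0.0, 0.0, -2.0])  # one bias per line

def predict(x):
    z = w * x + b                    # outputs z_0, z_1, z_2 of the three lines
    return torch.argmax(z).item()    # index of the largest output is the class

print(predict(-0.5))  # 0 -> blue region
print(predict(0.5))   # 1 -> red region
print(predict(1.5))   # 2 -> green region
```

With these values, the green line’s output z two is indeed negative throughout the blue region, matching what the plot shows.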
Let’s look at another example. This time the value of x is equal to 0.5. Plugging this value of x into the different lines, storing the outputs in a table, and applying the argmax function, we notice that this time the largest output corresponds to index 1, and thus y hat equals 1. So Softmax will classify this sample as class 1. Finally, let’s select a value of x of 1.5. Plugging this value of x into the different lines, storing the outputs in a table, and applying the argmax function, we notice that this time the largest output corresponds to index 2, and thus y hat equals 2. So Softmax will classify this sample as class 2.

Now, let’s cover the Softmax function for the general case, where we have multi-dimensional inputs. We’ll use the MNIST dataset to explain how Softmax works in the general case. The MNIST dataset is used for classifying handwritten digits into classes ranging from 0 to 9. We flatten each handwritten image into a vector as follows. Each image is greyscale, so the intensity value of each pixel ranges from 0 to 255. Further, each image in the MNIST dataset comprises 784 pixels (28 by 28), so our vector has 784 values in it.

To visualize the Softmax function, we’ll consider vectors in 2 dimensions, since visualizing and plotting 784 dimensions would be extremely difficult. To visualize Softmax in 2D, you can think of the samples as vectors. Here we have three weight parameters w 0, w 1, and w 2, whose values are shown in the table. These vectors represent the parameters of Softmax in 2D. The Softmax function finds the parameter vector nearest to each point. So anything in this quadrant will be classified as blue because it’s nearest to the vector w 1. Similarly, anything in this quadrant will be classified as red because it’s nearest to the vector w 0. We can do the same for the green parameter.

Let’s look at a few examples. For the sake of simplicity, we’ll use the vector x 1 to represent the digit 0 in 2 dimensions. The Softmax function will first perform the dot product of x 1 with each of the w vectors. It will then call the argmax function, which will return 0, and Softmax will classify this sample as class 0. Intuitively, we can also see that the sample x 1 is nearer to w 0 than to w 1 and w 2. Let’s look at another vector, x 2, which represents the digit 1 in 2 dimensions. Computing the dot products and applying the argmax function to the computed values, we see that we will classify this sample as belonging to class 1. Mathematically, the dot products are calculated by performing a matrix multiplication with each vector, as explained in the previous modules.

The function is called Softmax because the actual distances, that is, the dot products of each input vector with the parameters, are converted to probabilities, similar to logistic regression: the probability of class i is e to the z i divided by the sum of e to the z j over all classes j. A small code sketch of this appears below. In the next videos, we will see how to use Softmax in PyTorch to perform classification by specifying the loss criterion as cross-entropy loss. Thank you for watching this video!
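As a companion to the 2D walkthrough above, here is a minimal sketch in PyTorch. The parameter vectors and sample values from the video’s table are not captured in this transcript, so W, x1, and x2 below are hypothetical, chosen only so that each sample lands nearest its own class vector.

```python
import torch

# Hypothetical parameter vectors (rows), one per class.
W = torch.tensor([[ 1.0,  1.0],    # w_0
                  [-1.0,  1.0],    # w_1
                  [-1.0, -1.0]])   # w_2

def predict(x):
    z = W @ x                      # dot product of x with each parameter vector
    return torch.argmax(z).item()  # index of the largest dot product

# With real MNIST inputs, each 28-by-28 image would first be flattened into a
# 784-dimensional vector; here we stay in 2D to keep the picture simple.
x1 = torch.tensor([2.0, 1.0])      # stands in for the digit 0
x2 = torch.tensor([-1.0, 2.0])     # stands in for the digit 1

print(predict(x1))  # 0
print(predict(x2))  # 1

# Converting the dot products to probabilities, as the name Softmax suggests:
# P(y = i | x) = exp(z_i) / sum_j exp(z_j)
z = W @ x1
probs = torch.softmax(z, dim=0)
print(probs, probs.sum())          # probabilities sum to 1
```

Note that applying the softmax function does not change which index is largest, so taking the argmax of the probabilities gives the same prediction as taking the argmax of the raw dot products.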