Hello, friends. This is the last week of our adventure together, and it's natural to feel a bit sad, but let's end on a high note. We can do that in two ways. First, you can wear a bright cheerful outfit, like this shirt that I'm wearing, and second, let's learn about two important forms of learning, supervised learning and reinforcement learning. Let's start with supervised learning, which has become all the rage these days given the emergence of big data and the rise of machine learning. Did you know that many important algorithms in machine learning have their origins in models of neurons and how neurons learn? Let's take a closer look. Let's begin with a fundamental problem in machine learning: classification. Suppose I gave you a bunch of images such as these, some containing faces, and others containing objects, such as the adventure hat, vehicles that will get your heart racing, such as these two, for different reasons, and logos, of course, that you love, such as this. The problem of classification is basically getting a machine to decide which of these images contain faces, and which do not. This is obviously a trivial problem for your brain, and in fact not only can your brain decide if an image contains a face, but the face will also probably trigger related associations, such as memorable Bollywood dancers in some cases, and not so memorable ones in others, or even memorable sentences, such as "I did not have sexual relations with that woman." But how can a machine solve this classification problem? Here's one way of tackling it. Suppose each of these green and red points represents one of our images. Obviously these points exist in a very high dimensional space. For example, if our image had a million pixels, then each of these points would exist in a million dimensional space. But for simplicity, let's consider just this two dimensional space.
And these points, these images, are labeled by the fact that either they're faces or they're not faces. So we can label the face images with a +1 and the images containing other objects with a -1. And if we now have these labels associated with the set of face images and the set of non-face images, then if we're lucky, perhaps the face images cluster in one part of the space and the other images cluster in a different part of the space. If that's the case, can you think of a way of classifying new images, such as, let's say, an image that is now in this location in image space, and another image that is perhaps in that location? How would you classify this image here, and this image here? One way of classifying these new points is to find a hyperplane that separates the face images from the non-face images. In this case, the hyperplane happens to be just a line, because we're in two dimensions, but in the general case of a very high dimensional image space, we're going to find a hyperplane. And now what we can do is look at our new point, this image, and if it's above the separating line, we label it with +1, so we call it a face image. And if the new image is below the separating line, we call it a non-face image, so we label it with a -1. Now, the question is, can neurons do something like this? Can neurons do classification? Well, let's go back to the idealized model of a neuron that we discussed in the very first week of our course, where we assumed that the neuron simply sums up its inputs, and if the summation of all its inputs exceeds a threshold, then the neuron generates an output spike.
So mathematically, what this means is that if the inputs are denoted by ui and the synaptic weights by wi, then this simple idealized model says the neuron is going to generate an output spike if the weighted sum of all the inputs is bigger than some threshold. Let's call that threshold mu. So if the weighted sum is bigger than this threshold mu, then we have an output spike. This simple model for a neuron in fact has a name: the perceptron. The perceptron was originally proposed by Rosenblatt in the 1950s, building on the work of those pioneers of neural modeling, McCulloch and Pitts, from the 1940s. Here is a schematic depiction of a perceptron. The inputs are +1 or -1, denoting a spike or no spike. And as we discussed earlier, the perceptron computes the weighted sum of its inputs, compares it to its threshold, and if the weighted sum is above the threshold, you have an output of +1, meaning a spike, and if the weighted sum is below or equal to the threshold, then we have an output of -1, or no spike. Now, here is an equation that defines the output of a perceptron. So v is either +1 or -1, and the output is determined by this function theta, where the function outputs +1 if its argument is bigger than 0, and -1 if its argument is less than or equal to 0. You can see now how, if you write this equation using this function theta, it implements exactly what we want. If the weighted sum minus the threshold mu is bigger than 0, the sum is bigger than the threshold, which means you have an output of +1; and if the weighted sum is less than or equal to the threshold, the summation minus mu is less than or equal to 0, which means that the output is going to be -1. So what does a perceptron do? Well, let's set that weighted sum expression equal to 0. Now, what does this equation remind you of? What does this equation define in the n-dimensional space of the inputs?
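To make this concrete, here is a minimal Python sketch of the perceptron's output rule. The function name and the AND example are my illustrative choices, not from the lecture:

```python
import numpy as np

def perceptron_output(u, w, mu):
    """Return +1 if the weighted sum of the inputs exceeds the threshold mu, else -1.

    u  : input vector (entries +1 or -1, denoting spike / no spike)
    w  : synaptic weight vector
    mu : firing threshold
    """
    weighted_sum = np.dot(w, u)
    return 1 if weighted_sum - mu > 0 else -1

# Illustrative example: two inputs, both weights 1, threshold 1.5.
# This perceptron fires only when both inputs are +1 (logical AND).
print(perceptron_output(np.array([1, 1]), np.array([1.0, 1.0]), 1.5))   # +1
print(perceptron_output(np.array([1, -1]), np.array([1.0, 1.0]), 1.5))  # -1
```

Note that the theta function from the lecture is just the `1 if ... > 0 else -1` expression applied to the weighted sum minus mu.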
Here's a hint: it's a linear equation. You're right, the equation defines a hyperplane. Or, in the special case of two dimensional inputs, the equation defines a straight line. What's more, all the input points that are above the line satisfy the property that the weighted sum is bigger than the threshold, because the left hand side of this equation, for all these inputs, is going to be a value bigger than 0. And all the points that are below the line are going to satisfy the property that the weighted sum is less than the threshold, because the left hand side in those cases will turn out to be less than 0. What this means is that the perceptron is going to have an output of +1 for all the inputs on one side of the hyperplane and an output of -1 for all the inputs on the other side of the hyperplane. In other words, the perceptron can separate inputs from one class, let's say class 1, from the inputs from another class, let's say class 2. So you know what that means. It means that perceptrons can classify. In other words, they can perform linear classification: linear because they use a line, or a hyperplane, to separate one class of points from the other. So here's the supervised learning problem for the perceptron. We're given a set of inputs that are labeled, so the red points here are labeled plus 1, denoting class 1, and the green points are labeled minus 1, denoting class 2. The problem is, how do we learn the weights and the threshold for the perceptron, given these inputs and their labels? In other words, how do we find a separating hyperplane by adjusting the weights and the threshold? You guessed right. There's a learning rule for perceptrons, and it involves adjusting the weights and the threshold according to the output error. The output error is given by vd - v, where vd denotes the desired output, or the label that we get with each input, and v denotes the output of the perceptron.
Here are the update rules for the weights and the threshold. Epsilon, as you will recall, is the learning rate, a positive constant that determines how fast the weights are adapted. Let's see if we can understand this weight update rule in the case where the input ui is positive. In this case, the learning rule, you can see, increases the weight if the error is positive. So what does that mean? It means that vd was plus 1 and the output of the perceptron was minus 1. So, in order to do the correct thing in this case, that is, generate an output of plus 1, the perceptron needs to increase the weighted sum so that it's above the threshold. It can do that by increasing the weight. And so we can see that the learning rule is doing the right thing in this particular case. Now what if the error was negative? In that case, you can see that this learning rule is going to decrease the weight. So is that the right thing to do? Well, if the error is negative, it means that the desired output, the label, was minus 1, and the output of the perceptron must have been plus 1; that's what gives you a negative error. So in this case, what we want the perceptron learning rule to do is make the output, which is plus 1, be a minus 1 output. You can make the output minus 1 by decreasing the weighted sum to be below the threshold. And that's in fact what the learning rule does: it decreases the weight wi, which in turn makes the weighted sum eventually go below the threshold. The learning rule does the opposite for the case where ui is negative, and you should be able to convince yourself that that's the right thing to do. In the case of the threshold, the update rule decreases the threshold if the error is positive, and increases the threshold if the error is negative. To see that this is again the right thing to do: if the error is positive, it means that vd was plus 1, and the output of the perceptron was minus 1.
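The weight and threshold updates above can be written as a small training loop in Python. This is a sketch; the function name, the learning-rate value, and the OR example are my illustrative choices:

```python
import numpy as np

def train_perceptron(inputs, labels, epsilon=0.1, epochs=20):
    """Train weights w and threshold mu with the perceptron learning rule.

    inputs : array of input vectors u
    labels : desired outputs vd (+1 or -1)
    """
    w = np.zeros(inputs.shape[1])
    mu = 0.0
    for _ in range(epochs):
        for u, vd in zip(inputs, labels):
            v = 1 if np.dot(w, u) - mu > 0 else -1
            error = vd - v
            w += epsilon * error * u   # increase w_i when error > 0 and u_i > 0
            mu -= epsilon * error      # decrease the threshold when error > 0
    return w, mu

# Learn logical OR on +1/-1 inputs (a linearly separable problem).
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, 1, 1, -1])
w, mu = train_perceptron(X, y)
preds = [1 if np.dot(w, u) - mu > 0 else -1 for u in X]
print(preds)   # [1, 1, 1, -1]
```

Because the data is linearly separable, the perceptron convergence theorem guarantees this loop finds a separating line in a finite number of updates.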
And so you can see that when you decrease the threshold, this in turn encourages the output of the perceptron to go from minus 1 to plus 1, because now the threshold has been decreased, and so that, again, is doing the correct thing. Similarly, when the error is negative, you must have had the case that the desired output was minus 1 and the perceptron's output was plus 1. And so by increasing the threshold, we are now encouraging the perceptron to not have the output plus 1: the weighted sum is going to go below the threshold, because the threshold is now being increased. And so once again, that's the right thing to do to make sure that the perceptron's output matches the desired output. That's great. Now that we have a learning rule for the perceptron, you're probably asking yourself, can perceptrons learn any function? Well, let's look at the exclusive-or, or XOR, function. Here is the table for the XOR function. And as you already know from your logic or mathematics classes, the XOR function gives you an output of plus 1 only when the two inputs differ from each other; otherwise you have an output of minus 1. And here's a graphical depiction of the XOR function. Here is the two dimensional space of the inputs, and you can see that the two inputs that give you an output of plus 1 are denoted by these red points, and the green points denote the two inputs that will give you a minus 1 output according to the XOR function. The question that I would like to ask you is, can a perceptron learn to separate the plus 1 inputs from the minus 1 inputs? In other words, can a perceptron learn the XOR function? The answer, as you might have guessed, is no, unfortunately it cannot. Perceptrons can only classify linearly separable data. So what if you really like the perceptron model very much, because it's a simple model of a neuron, and perhaps you really love that name, perceptron?
How do we still keep the perceptron model and handle linearly inseparable data? The answer, of course, is to use multiple layers of neurons, and this gives us multilayer perceptrons. These can classify linearly inseparable data. So, for example, we can use this two-layer perceptron to compute the XOR function, and I encourage you to substitute the different values for u1 and u2 to verify that this two-layer perceptron does indeed compute the XOR function correctly. Now, what if you want continuous outputs rather than the plus 1 and minus 1 outputs you obtained from the perceptron? In other words, what if you want to do regression rather than classification? One example where this might be applicable would be in teaching a network to drive a truck. In this case you might be mapping the images of the road, and pedestrians, and bicyclists, and so on, to appropriate steering angles for the truck. You might argue in this case that you could get away with using classification, by mapping the plus 1 outputs to swing to the left and minus 1 outputs to swing to the right. And I've actually seen drivers in Seattle practicing this kind of behavior. But to be safe, it's better to use regression and map the inputs to appropriate continuous steering angles. We can get continuous outputs from our network if we use sigmoid functions for the outputs of our neurons. So, in the case of the perceptron we used a threshold function theta; if instead of the threshold function we now use a continuous valued function such as the sigmoid function, then we can get continuous outputs from our network. Here's the mathematical expression for the sigmoid function, and here is a graphical depiction of it. You'll notice that the sigmoid takes values between minus infinity and plus infinity and maps them to values between 0 and 1.
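One concrete choice of weights and thresholds for a two-layer XOR perceptron (not necessarily the values shown in the lecture's figure) can be checked in a few lines of Python:

```python
def step(s):
    """Perceptron threshold function: +1 if s > 0, else -1."""
    return 1 if s > 0 else -1

def two_layer_xor(u1, u2):
    """A hand-wired two-layer perceptron for XOR (illustrative weights).

    Hidden unit h1 computes OR, hidden unit h2 computes AND; the output
    unit fires only when h1 fires and h2 does not (OR but not AND = XOR).
    """
    h1 = step(u1 + u2 + 1)    # OR:  weights (1, 1), threshold -1
    h2 = step(u1 + u2 - 1)    # AND: weights (1, 1), threshold +1
    return step(h1 - h2 - 1)  # h1 AND NOT h2: weights (1, -1), threshold +1

# Verify all four rows of the XOR table.
for u1 in (-1, 1):
    for u2 in (-1, 1):
        print(u1, u2, "->", two_layer_xor(u1, u2))
```

The trick is that each hidden unit carves out one linearly separable piece of the problem, and the output unit combines the pieces.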
And so one can interpret the output of the sigmoid as the firing rate of the neuron, where the firing rate lies between a minimum value of 0 and a maximum firing rate value which has been normalized to 1. For example, if a neuron has a maximum firing rate of 100 hertz and a minimum firing rate of 0 hertz, then we can normalize the output firing rate of the neuron by dividing each firing rate by 100, and that would make the range of the firing rate be between 0 and 1, as in the case of the sigmoid. The parameter beta, which appears here in the sigmoid function, controls the slope of the sigmoid. So, for example, if the parameter beta is large, then the sigmoid approaches a threshold function, like the theta function we had in the perceptron. And when the parameter beta is small, the sigmoid looks more like a linear function. Let's see if we can learn multilayered sigmoid networks for regression. Why multilayer? Well, if you have a single layer of neurons, then the network is not going to be very powerful, as we saw in the case of the XOR function. So let's consider the case where we have three layers: we have the input layer here, we have a hidden layer of neurons, and then the output layer. And here is what the network does. The network takes a weighted sum of its inputs, and that's given by this expression here. The weighted sum is then passed through the sigmoid function, which results in an output; let's call that xj, and that is the output here in the hidden layer, so this would be x1, x2, and so on. These outputs are then transformed by the weights from the hidden layer to the output layer, and that's given by this weighted sum, the sum over j of Wij times xj, which in turn is passed through the sigmoid function again to give you the output of the network for each individual neuron in the output layer. Note that in this network, we're using only one hidden layer of neurons.
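Here is a minimal Python sketch of this sigmoid network's forward pass. The layer sizes and the randomly chosen weights are my illustrative assumptions, not values from the lecture:

```python
import numpy as np

def sigmoid(s, beta=1.0):
    """Sigmoid g(s) = 1 / (1 + exp(-beta * s)); beta controls the slope."""
    return 1.0 / (1.0 + np.exp(-beta * s))

def forward(u, w, W):
    """Forward pass of the three-layer sigmoid network.

    u : input vector
    w : hidden-layer weights ("little w"), shape (num_hidden, num_inputs)
    W : output-layer weights ("big W"),    shape (num_outputs, num_hidden)
    """
    x = sigmoid(w @ u)   # hidden activities: x_j = g(sum_k w_jk u_k)
    v = sigmoid(W @ x)   # output activities: v_i = g(sum_j W_ij x_j)
    return x, v

# Illustrative sizes: 2 inputs -> 3 hidden units -> 1 output.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 2))
W = rng.normal(size=(1, 3))
x, v = forward(np.array([0.5, -0.2]), w, W)
print(x, v)   # hidden activities and output, all between 0 and 1
```

Because every unit's output passes through the sigmoid, all activities stay between 0 and 1, consistent with the normalized firing-rate interpretation.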
If you use many hidden layers, then we get what are called deep networks. These deep networks have received a lot of attention recently because they've been shown to learn more and more complex features in the deeper layers of the network, and that in turn allows the deep network to learn complex functions. If you're interested in learning more about these deep networks and deep learning, I'd encourage you to Google deep networks and find out the details. Let's now focus on this three-layer network and try to figure out how to learn the weights of this network. Remember that we're also given the desired output d for each input u, because this is a supervised learning problem. So how would you change these weights in order to get your network to produce the desired outputs d? Here's one way we can do that: we can minimize the output error. So here is an example of an error function, E. It's a function of both the big W and the little w, and it's simply the sum, over all the output neurons, of the square of the output error, di minus vi. So I'd like to ask you, how would you minimize this error function with respect to the big W and the little w? Let me give you a hint: perhaps you can use the gradient of the error function with respect to the weights. That's right, you can use gradient descent to minimize this error function. Here is how you could do that for the case of the big W, the weights from the hidden layer to the output layer. Delta Wij is going to be equal to the negative of the gradient of E with respect to Wij, and epsilon, as before, is a small positive constant known as the learning rate. And if you take the derivative of E with respect to Wij, you're going to get this expression here. This learning rule, this weight update rule, is known as the delta rule.
And that's for historical reasons: this error, this difference between the desired output and the actual output, has been called delta, and therefore this learning rule is called the delta rule. Now if you're wondering why gradient descent is the right thing to do in this case, the argument is actually quite similar to what we had for the case of gradient ascent, which we used in a previous lecture for maximizing a function. Since we are trying to minimize a function in this case, what we need to do is gradient descent, which is why we have the negative sign in front of the gradient, and that makes the parameter estimate move closer and closer towards the optimum value which minimizes this error function. One last thing we haven't talked about yet is how we change the weights for the hidden layer, the little w. We could use gradient descent, and we get this expression, but how do we take the derivative of the output error with respect to the hidden layer weights? The output error is defined in terms of the activities of the output neurons, whereas now we have to take a derivative of that error with respect to the hidden layer weights. The answer, of course, lies in the chain rule from calculus. The chain rule tells us that we can take a derivative such as dE over dwjk and write it as the product of two derivatives: the first one being dE over dxj, times the second one being dxj over dwjk. And now you can see that both of these derivatives can be computed. The first one can be computed from the expression for E, and the second one from the expression for xj, which is the activity of the jth hidden layer neuron. If we plug this product of derivatives into the weight update rule, we get the very famous backpropagation learning rule for multilayered networks, and you can see why it's called backpropagation.
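Putting the delta rule and this chain-rule step together, here is a minimal Python sketch of one gradient-descent step for the three-layer sigmoid network. The function names, the learning rate, and the toy single-pattern example are my illustrative choices, and the gradients assume sigmoid units with the squared-error function E:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(u, d, w, W, epsilon=0.5):
    """One gradient-descent step on E = 0.5 * sum_i (d_i - v_i)^2."""
    # Forward pass.
    x = sigmoid(w @ u)                   # hidden activities x_j
    v = sigmoid(W @ x)                   # output activities v_i

    # Output-layer error term: (d_i - v_i) times the sigmoid slope g' = g(1-g).
    delta_out = (d - v) * v * (1 - v)
    # Hidden-layer error term via the chain rule: send the output errors
    # back through the weights W, then through the hidden sigmoid slope.
    delta_hid = (W.T @ delta_out) * x * (1 - x)

    # Gradient-descent updates: delta rule for W, backpropagated rule for w.
    W = W + epsilon * np.outer(delta_out, x)
    w = w + epsilon * np.outer(delta_hid, u)
    return w, W, v

# Toy example: learn to map one fixed input to a desired output of 0.8.
rng = np.random.default_rng(1)
w = rng.normal(scale=0.5, size=(3, 2))
W = rng.normal(scale=0.5, size=(1, 3))
u, d = np.array([1.0, -1.0]), np.array([0.8])
for _ in range(200):
    w, W, v = backprop_step(u, d, w, W)
print(v)   # should approach 0.8 with training
```

Notice how `delta_hid` is computed by sending the output errors backward through the weights W: that backward flow of error is exactly the chain rule at work.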
It's called backpropagation because we are propagating the errors from the output layer all the way down to the hidden layer, and in the case of many hidden layers you can generalize this chain rule to apply to more than just one hidden layer. And therefore you're propagating the errors down from the output layer to all of the hidden layers of the network. I'll encourage you to look into the supplementary materials for the actual derivation of the backpropagation learning rule and the expressions that we get as a result of taking these derivatives. Okay, after all that hard work, this is where the rubber hits the road, if you'll pardon the pun. We're going to use backpropagation to drive a truck. And since our lawyers will not allow us to use backpropagation to drive a real truck, we're going to use a simulation that was created by Keith Grochow, who was a student at the University of Washington several years ago. The specific task for the network is to learn to back a truck into a loading dock. So here is the truck, in green, and here is the loading dock. What we are trying to do is to train the network, given both inputs and desired outputs, to back this truck into this space here, which is the loading dock. The input to the network is going to be the position x and y in two dimensions, as well as the orientation theta of the truck, and the output that we would like to get from the network is the steering angle, so that the truck can back into this space denoting the loading dock. The training data for the network is provided by a human backing this simulated truck into the loading dock; the human is providing the steering angles for different positions and orientations of the truck. So the question then is, can the network, given this data, learn to back a truck on its own into the loading dock? Well, what do you think, do you think it can do it? Let's see, here we go. Here's the truck during the very early stages of learning.
And as you can see, it's not doing very well. In fact, it's driving like a maniac, and it reminds me of some crazy drivers that I saw when I was growing up in the Indian city of Hyderabad. Well, let's see if we can help the truck a little bit by training it some more on the human data. Here we go. It's gone through 4,000 passes now over the human data, and you can see that it's doing a little bit better: it's getting closer to the loading dock. Now let's train it some more, and let's see if it can actually get there. So yes, it is actually getting very close to the loading dock. So what do you think? Would you let this network drive your car or truck? Well, I wouldn't. Well, that was fun. But more than driving trucks like maniacs, animals are interested in finding food and other rewards in their environment. In the next lecture, we'll learn about predicting rewards, and how the brain might go about doing this. Until then, au revoir and goodbye.