In the first part of this lecture, we began talking about neural networks as a computational model that's loosely based on the behavior of neurons in the brain. So we began with a discussion of perceptrons, which are the simplest kind of neural network, basically consisting of just an input and an output layer. Now perceptrons, I think it's fair to say, are of more theoretical than practical interest these days. But they form a very good foundation for understanding more complex neural networks. They're a good pedagogical springboard for where we're going to go, which is to look at multi-layer neural networks. And again, the simplest of these, but already much more interesting than perceptrons, are three-layer neural networks like the one in this diagram here. So here we have an input layer, a central layer of artificial neurons, of neural elements, which we're calling a hidden layer, and then the output layer. Much like the model of the perceptron, the idea is that we're going to feed in a variety of inputs. In this case, it looks like a three-bit input. So we're going to feed sequences of three bits into this multi-layer neural network. What we want is to capture some pattern, and to use this neural network as a classifier that will tell us whether the input belongs to the category we're looking for. So in this case, reading the diagram from left to right, we're providing three-bit inputs to this neural network, and we're looking to see how the network classifies those inputs by virtue of what it produces at the output layer. And when the output produces an error, when it doesn't produce the answer that we want, we use a learning rule to get the multi-layer network to change those internal weights between the layers, so that it will hone in on becoming a pattern classifier that behaves as we want.
We had much the same formulation with perceptrons, except now, because we have multiple layers, we actually have a great deal more computational power. The format of this neural network is, as I mentioned before, a kind of standard format. It's called a multi-layer feed-forward neural network. Why feed-forward? Because there are no connections from later neural elements to earlier ones. There are no arrows going backward; all the information flows in one direction. Why multi-layer? Because it's arranged so that each layer only communicates with the layer ahead of it. So the input layer provides weighted inputs to the hidden layer, which in turn provides weighted inputs to the output layer. What we're going to focus on for this second part of the lecture, and it'll be a relatively brief part, is the classic algorithm called backpropagation for training these neural networks. It's an extension of the learning algorithm that we already saw for perceptrons. So let's just get right into it, okay? What we saw last time for training perceptrons is this: if an output node of the perceptron is not in error, we leave alone the weights coming into that node. After all, this particular output node gave us the correct answer. If the answer is wrong, then we update each weight leading into that output node, according to the rule that you see at the top here. If an input node j is leading to that output node, you recall that we adjust the weight from j to the output node by taking the original weight and adding a correction term, which is a product of four factors. One is the x sub j at the end here, which is whether the input node was 1 or 0, whether it was on or off. If it was 0, then we have no basis on which to change the weight. If it was 1, then we do.
So it makes sense for that last term to be in there. We only want to change the weight when this input node was in fact contributing to the response of the output node. The middle two terms: the greater the error, and the error is what we wanted versus what we got, the more we want to change the weight. The third term is the derivative of the output with respect to a little change in input. Qualitatively, what that means is that we want to change the weight when it's going to make a difference to the output. That is to say, the derivative is telling us, would a little change in input actually make a difference to the output? And the more that's true, the more we want to change the weight. And finally, that first term, alpha, is a tuning parameter. It can be a little larger or smaller, and it's used to make the changes in weights larger or smaller. So it's a kind of adjustment that we can use. Okay, we're going to update this learning rule so that it applies to multi-layer networks. First, let's just start with the output layer of the multi-layer network, that third layer. The learning rule looks just the same as it does for perceptrons, but I'm going to rewrite it because we're going to find this new labeling helpful when we deal with multi-layer networks. So first, instead of x sub j, we're going to call the input from the hidden layer node to the output node a sub j. But it's the same idea: if it's 0, we have no reason to change this weight; if it's 1, we might have a reason to change the weight. The alpha is the same tuning parameter that it was before. That middle term, delta sub k, is a helpful abbreviation for the two middle terms in our original perceptron rule.
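To make the notation concrete, here's a minimal Python sketch of the output-layer rule just described. The function names and the sample numbers are illustrative, not from the lecture, and the sigmoid derivative a(1 - a) used below is introduced a little later in the lecture.

```python
def output_delta(target, activation):
    """delta_k: the error (what we wanted minus what we got) times the
    derivative of the sigmoid output, a_k * (1 - a_k)."""
    return (target - activation) * activation * (1.0 - activation)

def updated_weight(w_jk, alpha, delta_k, a_j):
    """The perceptron-style rule: w <- w + alpha * delta_k * a_j.
    If the incoming activation a_j is 0, the weight is left alone."""
    return w_jk + alpha * delta_k * a_j

# Example: the output node fired at 0.8 but we wanted 1.0,
# and it was fed by a hidden-layer activation of 0.5.
d_k = output_delta(1.0, 0.8)                # 0.2 * 0.8 * 0.2 = 0.032
w_new = updated_weight(0.4, 0.5, d_k, 0.5)  # 0.4 + 0.5 * 0.032 * 0.5 = 0.408
```

Note that when the node is exactly right (target equals activation), delta is zero and the weight doesn't move, which is the "leave the weights alone when there's no error" behavior from the perceptron rule.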
The way I like to think of delta k is: how much does this node want to change its output? Or how much does this node want to see a change in the weights coming into it? I know this is totally my own internal version, but I think of it almost as a kind of yearning. How wrong am I? And how much will a little change help? So it's the product of the error term and the derivative of the output with respect to the input. We're going to call that delta sub k. Okay, so this rule that you see here is for the output layer of the multi-layer network, and it's just the same as we had for the output layer of the perceptron. The main change now occurs for a hidden layer node. The basic idea is going to be the same, but we want to compute delta for a hidden layer node. Now, here's the issue. We don't know what the error is in a hidden layer node. We know what the error is in an output node, because we might have wanted a 1 and got a 0, in which case we got an error. We don't know what the error is in a hidden layer node, but we can compute delta as follows. One factor, again, is the derivative of the output of this hidden layer node with respect to its input. In addition, we multiply that, I should say, by the weighted sum of the yearning, the need to change, of all of our target neural elements. Since we're dealing with a three-layer network, and we're dealing with the hidden layer here, in this particular case we would be dealing with the weighted sum of the deltas of all of the output layer nodes. One more time: we're going to compute the delta of a hidden layer node, its need to change. That will be the product of the derivative of the output and the weighted sum of all the deltas of the output nodes, weighted by the weights on the edges going to them.
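Here's how that hidden-layer delta might look in code, a sketch using my own illustrative names: each entry of `downstream` pairs the weight on an edge to an output node with that output node's delta.

```python
def hidden_delta(a_j, downstream):
    """delta_j = a_j * (1 - a_j) * sum over m of (w_jm * delta_m).

    a_j is this hidden node's activation; `downstream` is a list of
    (w_jm, delta_m) pairs, one per output node m that this node feeds.
    """
    return a_j * (1.0 - a_j) * sum(w * d for w, d in downstream)

# A hidden node with activation 0.5 feeding two output nodes:
# 0.5 * 0.5 * (0.4 * 0.032 + (-0.6) * (-0.1)) = 0.25 * 0.0728 = 0.0182
d_j = hidden_delta(0.5, [(0.4, 0.032), (-0.6, -0.1)])
```

Notice there's no error term here at all: the hidden node's "yearning" is inherited entirely from the yearnings of the output nodes it influences, scaled by how strongly it influences them.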
So if we have a strong influence on one output node, and it has a high delta, that will contribute to the delta of this hidden layer node. So let's just read this out in prose. Delta sub j is the product of the derivative of the output times the sum over all m of the weighted sum, where the weight is the weight on the arrow, the edge leading from this node to the output node m. It's the weighted sum of weight sub j-to-m times delta sub m. Just as before, we're using this continuous sigmoidal function for our output function. So the derivative of the output is just as it was when we were looking at perceptrons: it's a sub j times one minus a sub j. So there we have the equation at the very end. Okay, we're almost done now. Really, what we've got is our formulas to compute delta for every element in the neural net, for all the output elements and all the hidden layer elements. With that in hand, we can implement this backpropagation algorithm. And here's how it goes. We start at the output layer. For each neuron in the output layer, we compute its delta. So we start at the right end of our feed-forward neural net, the output layer, and we compute delta for each of those nodes separately. We have the numbers to do that, okay? Using those deltas, we then compute the deltas for each of the hidden layer nodes. And we have the formula to do that. So our first step, or if you like, our first two steps, are to compute this delta quantity, this yearning quantity, for each of the neural elements in the feed-forward neural net. Once we've got all those deltas, we just implement the weight change rule for each of the edges in the neural net: all the ones leading from the hidden layer to the output, and all the ones leading from the input to the hidden layer. So, to review: think of the deltas as information moving backward. We compute delta for the output layer, and then for the hidden layer.
And then, using those numbers, we recompute all the weights. The reason this is called backpropagation is because of that feature of information moving backward. We use the errors, the desire to change, in the output layer to compute the desire to change in the hidden layer. I should say that although we've described this for a three-layer neural net, and three-layer neural nets turn out to be quite powerful in practice, this backpropagation algorithm is generalizable to multi-layer feed-forward networks of more than three layers. We'll discuss some other variants in the next lecture. You could use this same algorithm to compute deltas of an output layer, a hidden layer, another hidden layer, another hidden layer, and so forth, and then adjust a bunch of weights. In an artificial intelligence classroom or a machine learning class, this backpropagation algorithm would be sort of the first major neural network algorithm that students learn. It only represents a first toe in the water of the issues surrounding neural networks and how they're used. This is not a course devoted to neural networks, so we can't go into much more detail, but the backpropagation algorithm is the bedrock of working with neural nets. Let me at least mention a few issues about neural nets and training. What are they used for, anyway? In general, at least I think it's fair to say, for the most typical or archetypal uses, neural nets are used for pattern recognition. They're used to do things like voice recognition. Many systems that interpret vocal input, for example, if you're stating a phone number, many systems that would interpret your vocal input as a series of digits, could be implemented by neural nets. Visual pattern recognition can be implemented by neural nets.
Handwriting recognition can be implemented by neural nets. This is something of a first-order approximation, but facial recognition algorithms can be implemented by things like neural nets. So they're extremely good at taking in raw data and classifying that data in terms like: this is the spoken number five; this is the face of a man, as opposed to a woman; this is the handwritten letter G, as opposed to the handwritten letter Q; or something like that. They're extremely good at that sort of pattern recognition. They have historically been seen as less appropriate to the kinds of search problems that we discussed earlier in the class, like solving a Rubik's Cube or something of that nature. There are debates about how useful neural nets can be for a whole variety of non-typical situations. But again, I think, to a first approximation, it's fair to say that they're most commonly and most profitably used for pattern recognition purposes. A few things about implementing neural nets, and some of these will come up in assigned readings as well. Suppose you want to implement a neural net that can be trained to recognize vocal input for the digits 0 through 9. How big should that neural network be? How many neural elements should be in it? Well, the input layer should be large enough to represent, bitwise, whatever a second or so of vocal input comes to. So that might be a fair number of input nodes. How big should the hidden layer be, and how big should the output layer be? Well, the output layer is trying to distinguish between ten possible outcomes. There are a number of ways of doing that, but one very straightforward way would be to have ten output nodes, where the intent is that the neural net will take the vocal input and output a 1 on one and only one of the output nodes, corresponding to 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. How many hidden layer elements?
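That ten-output scheme is what's commonly called a one-hot encoding. A small illustrative sketch (the function names are my own, not from the lecture):

```python
def one_hot(digit, num_classes=10):
    """Target vector for a digit: 1.0 at the digit's position, 0.0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]

def classify(outputs):
    """Read the net's answer off the output layer: the most active node wins."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

one_hot(3)      # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
classify([0.1, 0.05, 0.9, 0.2, 0.1, 0.0, 0.3, 0.1, 0.05, 0.2])   # 2
```

In practice the trained net rarely outputs an exact 1 on a single node; taking the most active output node is the usual way to turn its continuous activations into a single answer.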
That's the tricky part. You might think, offhand, that a large number of hidden layer elements would be a good idea for pattern recognition on complex input. Interestingly, it turns out that that's not quite the case. You don't want too many hidden layer elements, because of a phenomenon called overfitting. And again, without going into too much detail here, the idea is that if there are too many hidden elements in the neural network, then it may end up being trained to recognize only exactly the inputs that it was given as positive examples. In other words, it will recognize as the number 2 only the particular inputs for the number 2 that it was given; on other inputs, it would get confused. What you really want is a modest number of hidden layer elements, because what you're trying to do is take this vast array of raw data, of possible inputs, different kinds of voices saying the digits 0 through 9, and form general categories out of it. So what you want is a neural net that's large enough to represent the basic idea of ten possible digits, but not so large that it's too finely attuned to the particular data that it gets. What's a good way of thinking about this? Think about, for example, how you might learn the idea of a cat. As a child, you see a number of cats walking around. If, as a result, you classify as cats only the particular animals that you saw that were cats, that's not a good classification. What you want is to see a number of cats, and use that as a basis from which to classify things that you've never seen before. The danger of overfitting is that a network can be too finely attuned to its data. So this is a common issue in designing neural nets. That alpha term, what should it be?
Well, this is not my major area of programming, but from everything I've seen about neural net programming, and I've done a little bit of it, the whole choice of training parameters, and other things besides, is a little bit of a craft. It's kind of a black art. How big should the training parameter be? Well, roughly speaking, what a lot of people do is actually adjust the training parameter as learning goes on in the neural net. They begin with a training parameter, an alpha value, that's a little bit on the larger side, so that the weights adjust more quickly. And then, once the weights of the neural net settle into values that are close to the correct values, the target values, they tune down the alpha parameter a little bit, so the changes will be smaller. This is sometimes called annealing, by analogy with simulated annealing, where the idea is that you start out with the neural net bouncing around a lot. Think of it as a high-temperature neural net. It's bouncing around a lot between weights, but then you cool down the neural net as it starts to learn. There are lots and lots of other elements of the art of neural net training. I'll mention one here: there are techniques where you can train smaller nets, and then combine them in various ways to shorten the training regimen for more complex or multipart input. Again, these are just the very first early steps in looking at neural networks as computational models of brain function. But at least it's a beginning. And knowing backpropagation, if this is a subject of interest to you, you're in a much better position to go much further in the area of neural net design and training.
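That start-hot, cool-down schedule might be sketched like this. Exponential decay is just one common illustrative choice; the lecture only says that alpha should start larger and shrink as the weights settle, and the function name and decay constant here are my own.

```python
import math

def annealed_alpha(alpha0, step, decay=0.01):
    """Start with a larger training parameter and cool it down over time."""
    return alpha0 * math.exp(-decay * step)

annealed_alpha(0.9, 0)     # 0.9    -- early training: big weight changes
annealed_alpha(0.9, 300)   # ~0.045 -- later on: smaller refinements
```

Other schedules, such as dropping alpha in fixed steps or only once the error plateaus, are used in practice as well; the common idea is a high "temperature" early and a low one late.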