In today's discussion, what we are going to do is start talking about the ways in which we can model collections of neurons, similar in an abstract sense to the neurons in the human brain or animal brains, in computational terms. So we're going to talk about relatively abstract computational models of collections of neurons. Now, first of all, this is a huge subject. It's a subject whose pre-history you can trace to papers in computer science back in the late 1940s and early 1950s. A certain level of activity in modeling neurons in computation took place in the 1960s. Then, the way most people tell this history, interest in these kinds of computational models really took off in the 1980s and has not really abated since. The models have gotten more varied and more complex. What I'm going to do today is divide this discussion into two parts. In this first part, I'm going to describe the simplest kind of neural networks to you. Then, in the second part, we'll move on to slightly more complex neural networks. Even taking the two parts together, what I'm really doing here is giving you a very basic introduction to neural networks and the theories according to which they are built and interpreted. This also represents thinking back on how we got to this place. You'll remember that we were talking about one of the early tenets, one of the early philosophical ideas of cognitive science, being that software is to hardware as mind is to brain. That is to say, a stark version of what we call the computational metaphor of mind. An implication of that stark version was that you don't have to understand the brain in order to understand the mind, much as you don't have to understand computer architecture in order to understand the idea of a program like Quicksort. That is that version of the computational metaphor of mind.
It doesn't seem implausible to me. However, history has not been especially kind to it. That is to say, in dealing with subjects like the ones we were talking about before, like mental imagery, it seems to be the case that a greater understanding is obtained by understanding the structure of the brain. So computational modeling of the brain, in this view, takes on a greater and greater importance. It's a way of understanding, by modeling, what's going on in the brain as it acts as an information processor. So let's begin. We're going to talk about how we can computationally get some purchase on how collections of neurons behave. In order to do that, we have to begin with a model of a neuron. By the way, you're going to hear one phrase a great deal in this and the next discussion, which is "for our purposes." I'm going to be using the phrase "for our purposes" because what we're going to try to do is keep things as simple and abstract as possible. There's still plenty of content here, but we're going to try to avoid complication as much as possible. So here is a neuron, for our purposes. This is the kind of textbook diagram of a neuron. A neuron is a cell. The brain is composed of something close to a hundred billion of these, maybe not quite that many. But in any event, a neuron is a cell just like any other cell in the body, and like any other cell it has a fairly complex structure: it has a cell body, which includes a nucleus. In this diagram of the neuron, you see there is a cell body, the kind of circular area toward the left of the diagram, with the nucleus at its center. Neurons come in many different shapes and sizes, but this is a kind of standard vanilla neuron. For this kind of neuron, which is the one we're going to treat as a typical neuron, there are protrusions along the membrane of the neuron. The protrusions come in two types. There are these kind of feathery protrusions called dendrites.
Those are shown at the left of the neuron here. Then there's this kind of long filament called the axon. That too ends in a bunch of finer filaments at the very end of the axon, which is shown at the right in this diagram, the foot of the axon. Now, here's the basic idea of how the neuron works. It is an information processor, and in this case, we can read the neuron as processing information from left to right. Information comes into the neuron over the dendrites in the form of electrochemical stimulation. So you can think of the dendrites as receiving little chemical signals which either say, "You should excite the neuron," or, "You should inhibit excitement in the neuron." In other words, dendrites represent what are called excitatory or inhibitory connections. If there is enough excitatory connection, that is, if there's enough stimulation to the neuron, then it in turn fires: it sends what's called an action potential down the axon. It releases chemicals at the end of that axon, which in turn can excite or inhibit subsequent neurons. In practice, in the brain, a neuron may be connected, as I've read, to hundreds or maybe even a thousand other neurons. So in other words, you can think of this neuron as receiving input from hundreds, maybe a thousand, other neurons. It collects all of that information and decides whether it is sufficiently stimulated to send an action potential. If it is sufficiently stimulated, then it sends an action potential and in turn inhibits or excites subsequent neurons down the line. Okay. So with that in mind, and again this is a most abstract portrait, what we're going to do is build up networks, that is, graphs of little computational elements that represent the information processing of neurons. I'm going to skip ahead to the next slide here. I'm going to move back and forth between this slide and the next one so you can get an idea of what I'm talking about.
This first bullet: we're going to create graphs that represent collections of neurons. Each node in the graph we'll call a neural element, and it'll be a computational element that's based on the structure of the neuron that I just showed you. A neural net is a collection of these neural elements; again, a collection being a graph where the neural elements are connected by weighted links. So let me show you what this looks like in the next slide. This is a very simple net; it's what's called a feed-forward neural net. This is a particular form of neural net, but it's a very common one. It's one we're going to be talking about in the second part of this lecture. In this case, the feed-forward neural net is composed of, as you can see, three vertical layers of neural elements; the neural elements are the circles in gray. Okay. So this is a three-layer neural net. As you can see in this diagram, which has the typical labeling, you think of this as an information processor where the information flows in at the left, is processed by the overall neural network, and is then output at the right. So you think of input signals as coming into this neural net and output signals as coming out of it at the right. You'll also notice some structure to this neural net. Each layer is connected only to the layer ahead of it. So the input layer has connections only to the hidden layer, and the hidden layer has connections only to the output layer. That's a standard feature of these. It's not a universal feature of neural nets by any means, but it's a standard feature of these kinds of simple multi-layer neural nets that we're going to move toward. Now, the arrows in this diagram are weighted links. The input lines, though, you can think of as just carrying zeros or ones. So the input to this neural network is a collection of bits, zeros and ones. Think of it as a bit pattern that's coming in.
But the internal arrows in this neural net, the ones between layers 1 and 2, and layers 2 and 3, those are weighted links, meaning that you can think of each of those arrows as having a particular strength represented by a weight. The weight can be any real number. In particular, the weight can be a positive or a negative number. In practice, those numbers don't get outlandish in either direction. But for example, if the weight on one of these arrows is 1.5, then that is a positive or excitatory connection between the neuron at the left and the neuron at the right. If the weight on the arrow is negative two, then that's an inhibitory connection between the neuron at the left and the neuron at the right. So again, think of each of these arrows as being a weighted link representing a strength of connection between the neuron at the beginning of the arrow and the neuron at the end. Let's go back to the previous slide. So with that picture in mind, now you can look at this slide again, and let's review. Each neural element, each gray circle, is loosely based on the structure that you just saw in a neuron. We'll get into what that actually means. A neural net is a graph like the one you just saw, a collection of neural elements connected by weighted links. Again, for our purposes, for the kinds of neural nets we're looking at, we think of some neurons as input elements. Those take the ones and zeros, the pattern of bits, at the beginning of the neural net. Those are linked, via potentially multiple layers of neurons, to output elements. You can view the entire neural network, the entire collection of neural elements, as a pattern classifier. Think of the input bits as representing raw data. What we're trying to do within the neural network is classify that raw data into perhaps one of several different categories. That's the easiest way to think about it. So the output signals represent a category identifier: does this pattern meet the criterion for this category?
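To make the classifier idea concrete, here is a small sketch of my own, not from the lecture: two hard-threshold output elements turn a 3-bit input into a 2-bit category identifier. The particular weights, and the function names `element` and `classify`, are illustrative choices, not anything specified in the lecture.

```python
def element(inputs, weights, bias):
    """Threshold unit: fire (output 1) if the weighted sum plus bias exceeds zero."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) + bias > 0 else 0

def classify(bits):
    """Two output elements map a 3-bit pattern to a 2-bit category id.
    Arbitrary illustrative weights: the first output fires on a majority
    of ones, the second just echoes the first input bit."""
    majority = element(bits, [1, 1, 1], -1.5)
    first_on = element(bits, [1, 0, 0], -0.5)
    return (majority, first_on)

print(classify([1, 1, 0]))  # (1, 1)
print(classify([0, 0, 1]))  # (0, 0)
```

Eight possible input patterns land in four possible output categories, which is the sense in which the whole net acts as a pattern classifier.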
Let's go back to the next slide again. That overall collection of eight neural elements is a pattern classifier. It's something that says, "You give me an input pattern of 3 bits, eight possibilities, and I will classify it into perhaps one of four possible sets represented by the two output lines." Okay. Now let's go back a bit. In this particular discussion, we're going to talk about the very simplest kind of neural networks, which are called perceptrons. Perceptrons only have two layers: an input layer and an output layer. They don't have a hidden layer like the one you saw in that three-layer slide. Now, perceptrons are not very often used in practice these days, but they're a good pedagogical device to introduce students to the overall idea of neural networks, and I'm going to follow that convention. So we'll first describe perceptrons, which are the simplest kinds of neural nets, and then we'll move from there into multilayer nets. I'm hoping this is all so far so good. What we haven't yet really described is what those little gray elements actually do. I've said that this is a neural net; each of those gray elements is a little information processor modeled on the information processing structure of the neuron itself. Those little neural elements are connected, in this case, in a very structured way from one neuron to another. When one neuron is linked to another, it's linked by a certain strength or weight, which can be positive or negative. If anything is unclear so far, this would be a good opportunity to pause and go over the lecture again to this point. Now, what about these little gray elements, these neural elements? Think of each neuron as itself doing a simple computation. It receives input values. Those input values can be positive or negative. It sums them all up and it compares the sum to a threshold. Often, in the case of these abstract neural elements that we're concerned with, we identify the threshold, by convention, as zero.
So unless otherwise stated, we can think of the threshold of our artificial neural elements as zero. In general, then: first we take all of the input values and sum them up; again, some may be positive and some may be negative. We compare the overall result to a threshold, by convention often zero. If the sum of the inputs is greater than the threshold, then this neuron fires; it sends an action potential down the axon. In the case of the computational neural element, it outputs a one. If it doesn't fire, it outputs a zero. The output then is the product of that zero or one from this neuron, multiplied by the weight on the link going to the next neuron. So let me show you now. This is a perceptron with one neural element in the output layer. This gives us at least a little bit of purchase on how the neural networks are structured overall. So think of that central circle with the big summation sign in it; that's a neural element. That's the gray circle. It's receiving a bunch of inputs. Now, the input layer is again a bunch of zeros or ones coming in at the left here. The zeros or ones coming in at the left are then multiplied by weights, which, again, are positive or negative values. The output layer of the perceptron is that gray circle, the summation circle. It's getting in a bunch of inputs. The inputs one through n are zeros and ones. Each of those zeros and ones is multiplied pairwise by the weight on the link coming into this summing neuron. We take the grand sum of the weighted inputs and compare it to a threshold. If the overall sum of the inputs is greater than the threshold, then we ourselves output a one; otherwise, a zero. Now, you notice that little box at the end with the central vertical line. Think of that little vertical line as being the threshold value, which we can often, by convention, identify as zero.
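The computation just described can be sketched in a few lines of code. This is an illustrative sketch of my own, not code from the lecture; the name `neural_element`, the example weights, and the default threshold of zero (matching the convention just stated) are my own choices.

```python
def neural_element(inputs, weights, threshold=0.0):
    """A single abstract neural element: sum the weighted inputs and
    fire (output 1) if the total exceeds the threshold, else output 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Inputs are bits; weights may be positive (excitatory) or negative (inhibitory).
print(neural_element([1, 0, 1], [0.5, -2.0, 1.0]))  # 1.5 > 0, so it fires: 1
print(neural_element([0, 1, 0], [0.5, -2.0, 1.0]))  # -2.0 is not > 0: 0
```

The whole perceptron is just this one computation applied at the output element, with the input layer supplying the bits.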
So colloquially, what's happening is that the central summing output neuron is collecting all the weighted inputs and comparing them to zero, seeing if the result is positive or negative. If the result is positive, it outputs a one. If the result is negative, it outputs a zero. The graph that's drawn here is actually a smooth continuous graph, and we'll see the utility of that. But you see that the form of that graph is intended to give you a switch: if the sum coming into this neuron is a little bit greater than zero, then the output will be very close to one. If it's a little bit less than zero, then the output will be very close to zero. So if the overall input happens to be exactly zero, then we don't really know what to do. In this particular graph, the way it looks, the output of the neuron would be 0.5. But let's not worry about that. What you are looking at, therefore, is the overall perceptron, the simplest kind of neural net, with inputs, a pattern of bits. That pattern of bits is weighted; the weighted inputs are then summed and compared to a threshold, which we'll say for now is zero; and depending on whether the result is positive or negative, this perceptron outputs a one or a zero. That's about as simple a neural network as we can get. Notice that from that simple description we can create very small neural networks that behave, in this case, like the logical elements that you see in digital logic textbooks. At the bottom here, at the left, you see what's called an AND perceptron. The ones that are listed here are the weights on these three links. Remember that the three input lines here, these blue elements, can each be either zero or one. Now think about one special feature: that red dot at the bottom. Think of it as always one. So that input is always set to one, and it has a weight of negative 2.5. Now, what's going on with this perceptron? Think about what happens here.
If all three of these input blue circles are one, then we'll have 1 times 1, 1 times 1, 1 times 1. Those are all summed up here to get three. We subtract 2.5 and we get an overall value of 0.5, and that output neural element fires; it sends a one. If any of the input blue circles here is zero as opposed to one, then see what happens. Suppose that this top circle is zero; then we have zero, one, one. That adds up to 2. We subtract 2.5. We get negative 0.5, and this perceptron does not fire. So this is a perceptron that fires if and only if all three of its inputs are one. If any one or more of them is zero, it doesn't fire. That's acting like an AND gate in digital logic terms. It only fires if all of its inputs are on. The OR perceptron is even simpler. That's at the right here. Notice that there's no special red circle here. It has a threshold of zero, and it will fire if any of the inputs happens to be one. It could be one or two or all three of the inputs, but if any of them happens to be one, it'll fire. This gives you a flavor of how you could take some of these neural elements, treat them as things like digital logic elements, and build up larger machines out of them. We'll leave that to the side, but people have done that kind of work. But, again, for our purposes, this is what we're concerned about with perceptrons. Now, you just saw an AND perceptron and an OR perceptron. These are stock perceptrons like you'll find in a textbook. What we're after now is a very deep idea, which is: suppose we want to create a perceptron which will recognize a certain kind of pattern, outputting a one for that kind of pattern and a zero if the input doesn't meet that pattern, but we don't know what the weights on the links should be. So let me show you an example of what I'm talking about. You could eyeball this; in fact, you could figure this out, but you don't have to. Let's imagine that we have a perceptron like this.
So here are three inputs. They could be zero or one, and we have an output. We'll think of the threshold as being zero. Okay? What we want is to recognize only the pattern one, one, zero. So what we want is that this perceptron, this output neuron, should output a one if, and only if, these two neurons are one and this one is zero. Otherwise, it outputs a zero. Now, we could figure out the weights that we would need for this purpose, but let's assume we don't know them. Here's the really interesting thing about perceptrons and about neural nets in general. This is why they're such an interesting model. What we're going to do is start by assigning random weights, just randomly chosen weights, to these three arrows. Then we're going to see whether this overall perceptron recognizes the pattern that we want. Very likely it doesn't, because we just chose the weights at random. But we're going to keep testing this perceptron by sending in patterns of bits, comparing the output to what we want, and then readjusting the weights as we go, to get closer and closer to the result that we want. You could think of it as reverse engineering. We think of this overall thing as a machine, and we want it to take certain inputs and produce certain outputs as a result. We don't know what the internals of this machine should exactly be. We don't know what these weights should be. But what we're going to do is start by just putting random weights in here, sending inputs into the perceptron, and training its behavior. Every time the perceptron gets a wrong answer, we adjust the weights internally in a way that you'll see in a moment. Every time the perceptron gets a right answer, we leave it alone. As it turns out, by using the learning algorithm that I'm about to show you, this perceptron can learn to recognize certain kinds of patterns of inputs, including, by the way, the one I just mentioned. But let's go into more detail about how we do this, okay?
So we're going to assign random weights to the edges in this perceptron. Then we're going to feed this perceptron a particular input. Let's say we feed it zero, one, one. Now, remember that what we want is for this perceptron to only recognize, to only fire on, one, one, zero. Instead, we feed this perceptron zero, one, one and it outputs a one. That's wrong. What we're going to do then is go back and fiddle with these weights, to move them in a direction that appears to be promising, that would make the overall perceptron behave better. So we're going to take this; it's called training a perceptron. We're going to train this perceptron by adjusting its weights. We're going to train it by feeding it lots and lots of input and correcting it, adjusting the weights as a correction, every time the perceptron gets a wrong answer. We'll leave it alone every time it gets a right answer. If we're fortunate, if the perceptron can be trained to learn the function that we want, then the weights will adjust over time to values that, in fact, make the overall machine behave like the machine that we want to create. How do we do this? This seems like magic. We didn't know what weights to put in this thing to begin with. We put in random weights. We simply tell the perceptron when it's right or wrong, and after a certain number of training examples, eventually, the perceptron seems to be behaving correctly. It's miraculous. Here's what's going on. When the perceptron outputs a one or a zero, we're going to see if it's in error. The error is the difference between what we wanted and what we got. So suppose we wanted this to output a zero, and instead we got a one; then the error is zero minus one, or negative one. In some situations in training perceptrons, we use the squared error to always make it positive. But again, for our purposes right now, we'll think of the error as having a signed value.
So if we wanted a one and we got a zero, the error is one. If we wanted a zero and we got a one, then the error is negative one. Okay? If our perceptron is in error, we adjust each of the weights leading to this output node. We'll adjust each weight in such a way as to make the error value smaller. Now, how do we do that? What I'm about to show you looks like a complex formula at first, but it's actually quite understandable. Suppose we wanted a one and got a zero. Then the error, the difference between what we wanted and what we got, is positive one: we wanted more than we got. What we're going to do is use that to adjust some of these weights in a direction that would have moved this output toward a one. Here's the formula for doing that. We're going to call these weights weight sub one, weight sub two, weight sub three, and in this formula, I've just labeled a generic one as weight sub j. We adjust weight sub j from its previous to its new value. Here's the formula. The right side of this formula is saying: take the original weight sub j and change it by adding this product. That'll be the new assignment to weight sub j. That product is four terms. Let's look at the second term. It's the error, the difference between what we wanted and what we got. Again, if we wanted a one and got a zero, then the error is one; it's a positive number. If we wanted a zero and got a one, then the error is negative one; it's a negative number. What about that last term? It's x sub j, and that's the output from this input node to this one. Notice it's not the weighted output. It's either a zero or a one, depending on what the input pattern was. Okay?
So in other words, x sub j, that's the last term here, is zero or one depending on whether input sub j is zero or one. The first term, the alpha, is a rate parameter. Think of it as a tuning device. If alpha is large, then we're going to be making large jumps in the weights every time we change them. If alpha is small, we'll be making tiny changes in the weights every time we change them. So I've explained three terms of the right side of the formula. We're taking the original weight, and we're changing it by a product of four terms: a rate parameter, which we can tune; how off we were, and in what direction; and that last term, whether the input was one or zero. What about the third term? It's the derivative of the output here with respect to the input. It's how fast the output here will change as a small change in input occurs. So that explains each of the terms here, but why are they there? Think of it this way. There's a purpose to each of these terms. The alpha is your little knob that says, I want to change weights by a lot, or a little, when I'm making adjustments. Think of the error term colloquially: it's saying that if we were off, we want to change this weight in the direction that corrects us. If we wanted a one and got a zero, then we want to change the weight upward so that this input will count for more when it's a one. So the error term is saying: the more off we are, the more in error we are, the more we want to change. The last term, the x_j, is saying: if this input was a zero, then whatever the weight here is couldn't have mattered to our output. If the input here was a zero, then we don't know how to adjust this weight, because the weight here didn't contribute to our error. If we multiply zero by something, it doesn't matter whether that something was negative 0.5 or positive three or whatever; we don't know how to change this weight, because this weight did not contribute to the wrong answer.
So again, the alpha term is a little knob. The error term makes sense because it's saying: the more off we are, the more in error we are, the more we want to change. The final term makes sense because it's saying: look, we should only change when the weight on this arrow might have made a difference to our decision. It only might have made a difference to our decision if the x_j term here was one, not zero. Finally, the derivative term is saying: we want to change most when that change is going to make a difference. The derivative term is asking, how much would our output have changed if the input changed by a little bit? If the input changed by a little bit, would our output actually change by much or by little? If the output would change by a lot, then let's use this opportunity to adjust the weight. So let's read this formula one more time, this time in prose. Take the original weight, w_j, and adjust it by the product of these four terms: a tuning parameter; how much in error we were; how much good changing the weight would do if we were to make that change, in other words, if the change in weight were to occur, would our output have changed in response to it; and finally, whether the input coming into us is making that weight meaningful, so that we only change the weight when it mattered. So one more time: take the original weight, and take a tuning parameter, how much in error we were, how much good a little change in the overall input would have done, and how much this weight actually mattered to us. Take that four-term product, add it to the original weight, and that'll be the new weight of the perceptron.
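Written out symbolically, the update rule just described in prose might look like this. The symbol names here are my own labels, not the lecture's: d for the desired output, y for the output we actually got, g for the output function applied to the summed input net, and x_j for the j-th input bit.

```latex
w_j \;\leftarrow\; w_j \;+\;
\underbrace{\alpha}_{\text{rate}} \cdot
\underbrace{(d - y)}_{\text{error}} \cdot
\underbrace{g'(\mathit{net})}_{\text{derivative}} \cdot
\underbrace{x_j}_{\text{input}}
```

Each of the four factors corresponds to one of the terms discussed above: the tuning knob, how far off we were and in what direction, how responsive the output is to a small change, and whether this input mattered at all.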
Now the remarkable thing is that if you then train this perceptron by using many, many inputs, and adjusting weights every time the perceptron gets something wrong, then, let me put it this way: if the function that you want is realizable in a perceptron, and many functions aren't, then this procedure will lead you to the correct weights of the perceptron. I won't show the proof of that, but it happens to be true. So this learning algorithm for a perceptron is effective. It doesn't say anything about how long it's going to take, how many examples it will take to train the edge weights in the perceptron. But in practice, it proves to be a very useful training algorithm. So we can now think of these perceptrons as little trainable learning devices as well as pattern classifiers. We set up a perceptron and we say, we want this perceptron to recognize a particular pattern in the input. What we're going to do is set up the internal edge weights at random, and then give this perceptron lots and lots of yes-no examples. Every time the perceptron gets a wrong answer, we bang its weights a little bit. It's like punishing the perceptron, smack, and then the weights change and adjust a little bit. Every time the perceptron gets it right, we leave it alone. We just pat it: good perceptron. After, who knows, maybe hundreds, maybe thousands of training examples, the weights do home in on values that will give us the function that we wanted. One last point before we leave this topic of perceptrons just now. I showed you this term, the derivative of the output with respect to the input. How do we get that? Here's the standard way of doing it. Remember from that earlier graph I showed you that we want a smooth, continuous function to represent the mapping from input to output. So if the input sum is a little bit less than zero, then we want the perceptron to output zero.
If the input sum is a little bit greater than zero, if it's positive, then we want the output to be one. A very good choice of function, for our purposes, is what's called the sigmoidal function. Take the summed input; the output is going to be 1 over 1 plus e to the negative of that input. Think about what that means qualitatively. Suppose the input is large and positive, say positive 10; then the bottom term is 1 plus e to the negative 10th. In other words, the denominator is very close to one, and the output here is 1 over 1, which is 1. Suppose the overall input is negative 10; then this denominator is 1 plus e to the negative of negative 10, or 1 plus e to the 10th. That's a big number, and 1 over 1 plus e to the 10th is very close to zero. So if the input is large and negative, this denominator is large, and this function gives you a number very close to zero. If the input is large and positive, this denominator is very close to one, and so the output value is very close to one. So the sigmoidal function behaves like the kind of function we want. It goes rapidly from zero to one depending on whether the input is negative or positive. It has another really neat feature, which I will leave you to prove via calculus, if you remember your first-year calculus. It turns out that the derivative of this function can simply be represented in terms of the value of the function: output times 1 minus output. So you don't even have to do fancy calculations to get the derivative of this function. The sigmoidal function is structured so that the derivative is output times 1 minus output, very easy to find. So, going back one last time: the weight adjustment algorithm proves to be a rather simple one. Every time the perceptron gets something wrong, we go back to these internal weights and we adjust them.
We take the original weight and adjust it by a factor which is a product of a rate parameter, how wrong we were, how much a little change in input will help us, and how much this weight might have contributed to our error, which is reflected by whether this input was a one or a zero. We multiply those terms together and use the product to adjust our weight. It turns out that by repeatedly training the perceptron, we home in on an overall machine that behaves the way we want.
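Putting all the pieces of this lecture together, the whole procedure can be sketched in code: the sigmoidal output function, its convenient derivative, and the four-term weight adjustment, trained to fire only on the pattern one, one, zero. This is an illustrative sketch of my own, not the lecture's code; the function names, the always-on bias input (like the red dot in the AND perceptron), and the particular rate and epoch counts are all my own choices.

```python
import math
import random

def sigmoid(x):
    """1 / (1 + e^(-x)): near 0 for large negative input, near 1 for large positive."""
    return 1.0 / (1.0 + math.exp(-x))

def output(bits, w):
    """Weighted sum of the inputs plus an always-on bias input, pushed through the sigmoid."""
    xs = list(bits) + [1]                       # append the always-on input
    net = sum(xi * wi for xi, wi in zip(xs, w))
    return sigmoid(net)

def train(patterns, alpha=1.0, epochs=5000, seed=0):
    """Start from random weights; nudge each weight by the four-term product
    alpha * error * derivative * input, where the sigmoid's derivative is
    simply out * (1 - out)."""
    rng = random.Random(seed)
    n = len(next(iter(patterns)))
    w = [rng.uniform(-1, 1) for _ in range(n + 1)]   # +1 for the bias weight
    for _ in range(epochs):
        for x, target in patterns.items():
            out = output(x, w)
            err = target - out                   # wanted minus got (signed)
            deriv = out * (1 - out)              # sigmoid derivative shortcut
            xs = list(x) + [1]
            w = [wi + alpha * err * deriv * xi for wi, xi in zip(w, xs)]
    return w

# Quick check of the sigmoid's behavior at the extremes:
print(round(sigmoid(10), 3), round(sigmoid(-10), 3))  # 1.0 0.0

# Train to recognize exactly the pattern (1, 1, 0):
patterns = {(a, b, c): int((a, b, c) == (1, 1, 0))
            for a in (0, 1) for b in (0, 1) for c in (0, 1)}
w = train(patterns)
print({x: round(output(x, w)) for x in patterns})  # should be 1 only for (1, 1, 0)
```

Note that this pattern is realizable in a perceptron, so, as the lecture says, repeated training homes in on workable weights; a function like exclusive-or would not be realizable this way.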