[MUSIC]. Hello, neuro explorers. Last week, we learned how neurons can be connected to form feedforward and recurrent networks. This week, we learn how these connections can be adapted using synaptic plasticity, allowing the brain to learn about the world from its inputs. Now, what better way to start this journey than to gaze upon this beautiful drawing of the hippocampus by the great Ramón y Cajal from a hundred years ago. It was in the hippocampus that some of the first results on synaptic plasticity were obtained. One type of synaptic plasticity that's observed in the brain is long-term potentiation, or LTP. And LTP is defined as an experimentally observed increase in the synaptic strength from some neuron A to another neuron B that can last for several hours or even days. The way you would induce LTP is by causing some neuron A to fire a burst of spikes, and if neuron A is connected to some other neuron B where B is also excited, that is, it's depolarized or it's firing some spikes, then what you would see is an increase in the size of the excitatory postsynaptic potential, or EPSP. So that means that for the same input, initially you might have a small EPSP, but after you pair neuron A with neuron B several times, what you'd observe is an increase in the size of the EPSP, which indicates that the strength of the connection from neuron A to neuron B has been increased. The counterpart to long-term potentiation, or LTP, is long-term depression, or LTD. LTD corresponds to an experimentally observed decrease in the synaptic strength that lasts for hours or days, and you can obtain LTD in the following situation: neuron A fires some spikes, but neuron B does not fire any spikes or does not depolarize. So in this situation, when you have some input from neuron A but no output coming from neuron B, what you observe is a decrease in the EPSP size. So if the initial EPSP for a single input was as shown at the very top here.
And as the pairing occurs, where you have some input from A but no output from B, you would expect to see a decrease in the size of the EPSP. And what that implies is that the connection strength from neuron A to neuron B has been decreased. Now here's something that's interesting. Even before LTP and LTD were discovered in the brain, a Canadian psychologist named Donald Hebb predicted that something like LTP should occur in the brain. He suggested a learning rule for how neurons in the brain should adapt the connections among themselves, and this learning rule has been called Hebb's learning rule, or the Hebbian learning rule, and here's what it says: if a neuron A repeatedly takes part in firing another neuron B, then the synapse from A to B should be strengthened. And here is a cartoon of what this learning rule implies. Suppose we have a neuron A that is firing, and that in turn is participating in the firing of another neuron B, so that neuron B produces, for example, one or a few spikes. If this situation occurs, Hebb's learning rule predicts that one ought to increase the strength of the connection from neuron A to neuron B, because neuron A is participating in the firing of neuron B. And so what we then get is that for the same input from neuron A, along with the input from other neurons, you have an increase in the activity: you have more spikes from neuron B. Another way of phrasing Hebb's learning rule is through that famous mantra that you already heard during the first week of lectures: neurons that fire together wire together. Now, mantras are great for chanting, but they're hard to implement on a computer. So let's see if we can formalize Hebb's rule as a mathematical model. Let's start with a linear feedforward neuron. So here's the neuron with an output v, and it's receiving some inputs, which we're calling the input vector u, and the synaptic weights from the inputs to the output neuron are given by a synaptic weight vector w.
So this is very similar to the feedforward networks that we considered in the previous set of lectures last week. Now, if we assume that the dynamics of the firing rate in this network are fast, then we can look at the steady-state output, and that's given by this equation: the output firing rate of the neuron is nothing but the dot product of the synaptic weights with the inputs. You can write it as a dot product, or as w transpose u, or as u transpose w. Now, here's how you can write Hebb's rule mathematically. You can use a differential equation to capture how the weights from the input neurons to the output neuron change as a function of time. So there is some time constant tau sub w that governs how fast the weights are changing, and we set tau_w dw/dt to be equal to the product of the input firing rates and the output firing rate. So how does this capture the intuition behind Hebb's rule? Well, remember that in Hebb's rule, we increase the strength of the connection from an input neuron A to an output neuron B if there is both activity from neuron A as well as activity from neuron B, and this product of the input firing rates with the output firing rate captures that intuition. Now, in order to implement this differential equation on a computer, you need to discretize it. And if you look at the discrete implementation of this differential equation, it leads you to a weight update rule, shown here. This is how you update the weights given inputs. The weight update rule tells you that the weights at time step i plus 1 are given by the weights at time step i, plus some epsilon, a positive constant called the learning rate, multiplied by u times v. Another way of expressing this equation is to say that the change in the weights, delta w, is equal to the learning rate epsilon times u v.
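For those following along on a computer, here is a minimal NumPy sketch of this discrete weight update rule (the function name and the learning-rate value are illustrative choices, not part of the lecture):

```python
import numpy as np

def hebb_update(w, u, epsilon=0.1):
    """One discrete Hebbian update: w_{i+1} = w_i + epsilon * u * v."""
    v = np.dot(w, u)        # output firing rate of the linear neuron
    return w + epsilon * u * v

w = np.array([1.0, 0.0])    # initial synaptic weight vector
u = np.array([1.0, 1.0])    # one input vector
w_new = hebb_update(w, u)   # v = 1.0 here, so the weights grow along u
```

Note that the update is always in the direction of the input u, scaled by how strongly the neuron fired, which is exactly the "fire together, wire together" intuition.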
In order to understand the Hebb rule, it is useful to look at the average effect of this rule on the synaptic weights w. So here is the Hebb rule from the previous slide, and if we want to look at the average effect of this rule, we can take the average of the right-hand side with respect to all the inputs u. These brackets over here denote the average. And if we now substitute the value for v from the previous slide, then what we find is that the Hebb rule modifies the weights w according to the input correlation matrix, where the correlation matrix, as you might know, is given by simply the average of u u transpose. So what does this mean? What does it mean to change the weights w according to the input correlation matrix? Well, think about that for a minute. We will answer this question towards the end of the lecture. Now, the Hebb rule that we've been discussing so far only increases synaptic weights, and this models the phenomenon of LTP, or long-term potentiation, in the brain. But as we discussed earlier, the brain also exhibits LTD, or long-term depression, which involves decreasing the strength of the connection from one neuron to another. So can we model both LTP and LTD using a single learning rule? In other words, can we derive a learning rule that can both increase and decrease the strength of a synaptic connection? One rule that incorporates both LTP and LTD is the covariance rule, and we'll come to why it's called that in just a minute. Here is the differential equation for the covariance rule, and you'll notice that it is again a product of the input firing rates with the output firing rate, except that now the output firing rate has a difference term: the difference between the output firing rate and the average of the output firing rate. So what is the effect of this difference term? Well, consider the case when the output firing rate is bigger than the average output firing rate.
So in this case, you're going to have a positive quantity here, which means that when you multiply the input firing rates with a positive quantity, you're going to have an increase in the synaptic strength, and that is going to result in LTP. On the other hand, if the output firing rate is low, for example, less than the average output firing rate, or even in the case where there is no output at all, so v could be 0, then what you're going to get is a negative quantity here. And when you multiply the input firing rates with a negative quantity, you're going to get a decrease in the synaptic weights, and that results in LTD. So what does the covariance rule do? Well, just as we did with the Hebb rule, we can look at the average effect of this rule. That means taking the average of the right-hand side of the rule with respect to all the inputs u. And if you substitute the value for v and simplify these expressions, what you get is the fact that the covariance rule is changing the weight vector w according to, surprise, surprise, the input covariance matrix. So here's the input covariance matrix: it's simply the average of u u transpose, minus the average of u times the average of u transpose. At this point, I would like you to think about what it means for w to be changed according to the input covariance matrix. What do you think w would converge to when it's modified according to this equation? We will answer that question towards the end of the lecture. Now let's ask the question: are these learning rules stable? In other words, does w converge to a stable value, or does it explode? How do we answer this question? Well, one could look at the length of w as a function of time and see if the length of w remains bounded, or if it grows without any bounds. Let's first look at the Hebb rule. So here is the Hebb rule, and let's look at how the length of w squared changes as a function of time.
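Like the Hebb rule, the covariance rule can be discretized into a weight update. A minimal NumPy sketch of the LTP and LTD behavior just described (here the average output is passed in as a known quantity; in practice it would be estimated, for example by a running average):

```python
import numpy as np

def covariance_update(w, u, v_avg, epsilon=0.1):
    """Discrete covariance rule: w <- w + epsilon * u * (v - <v>)."""
    v = np.dot(w, u)
    return w + epsilon * u * (v - v_avg)

w = np.array([1.0, 0.0])
u = np.array([1.0, 1.0])                  # gives output v = 1.0

# Output above its average -> weights increase (LTP).
w_ltp = covariance_update(w, u, v_avg=0.5)
# Output below its average -> weights decrease (LTD).
w_ltd = covariance_update(w, u, v_avg=2.0)
```

The same input u thus either strengthens or weakens the synapses, depending on whether the neuron's output is above or below its average.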
So let's take the derivative of the length of w squared with respect to time, and when we do that, we get this expression here. And if we substitute the value for dw/dt according to the Hebb rule, we have this expression. Note that w transpose u here is nothing but the output firing rate v. And if we substitute that value here, we get this expression. Now, unless v is always equal to 0, this expression is going to be positive. And so what we then have is the fact that the derivative of the length of w squared with respect to time is always positive. What does that mean? It means that the length of w is going to keep increasing, which means that w grows without bound. Well, you might be thinking that's not too surprising, because the Hebb rule only increases synaptic weights. It only models LTP, and so perhaps that's why w grows without bound. Well, if that's the case, then what about the covariance rule? As we discussed, the covariance rule incorporates both LTP and LTD, and therefore it can both increase synaptic weights as well as decrease synaptic weights, and perhaps that makes the covariance rule stable. What do you think? Do you think it's stable? Well, here's the answer, and I'm sorry to say that it's not good news. If you take the derivative of the length of w squared with respect to time as before, simplify the resulting expression, and then take the average of the right-hand side of that expression, what you find is that the derivative of the length of w squared with respect to time is always positive. And what that means is that the length of w, when changed according to the covariance rule, grows without any bound, which means that w grows without any bound. So, how do we stabilize the Hebb rule and the covariance rule? Well, one way in which you can do that is by imposing a constraint on the synaptic weight vector w. So what kind of a constraint can we impose?
Well, you could impose the constraint that the length of w should always be equal to 1. And how do we do that? Well, each time that you update the weight vector according to a new input, we simply divide the resulting weight vector by its length, and this ensures that the length of the weight vector always equals 1. Now, this seems like a hack, and perhaps it's not even biologically plausible. So is there a more elegant way of imposing a constraint on the length of the weight vector? Let's look at the last of our Hebbian learning rules, and this one's called Oja's rule, named after its discoverer. Oja's rule is similar to the Hebb rule in that we again multiply the input firing rates with the output firing rate, except that now we subtract a term alpha v squared w from u times v, where alpha is some positive value. Now the question is: is Oja's rule stable? What do you think? Well, let's do what we did before, which is take the derivative of the length of w squared with respect to time. When we do that, we get this differential equation for the length of w squared. So, looking at this differential equation, do you think that the length of w squared converges to a particular value, or do you think that it grows without bound? Well, here's the answer. The length of w squared in fact does converge to a particular value, and it converges to the value 1 over alpha. You can see that by setting the derivative equal to zero: in that case, unless v is equal to 0, the length of w squared must equal 1 over alpha, because this term over here has to be equal to 0. And if that's the case, then the length of w itself must be equal to 1 over the square root of alpha. So what this tells us is that w for Oja's rule does not grow without bound, which means that the rule is stable. Okay, let's summarize what we've learned so far about Hebbian learning.
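Before the summary, both stability results can be checked numerically. Here is a small simulation sketch (the input distribution, learning rate, and number of steps are all illustrative): with zero-mean random inputs, the plain Hebb rule's weight norm only ever grows, while under Oja's rule with alpha = 1 the norm settles near 1/sqrt(alpha) = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, alpha = 0.01, 1.0

w_hebb = np.array([1.0, 0.0])
w_oja = np.array([1.0, 0.0])
hebb_norms = []
for _ in range(3000):
    u = rng.normal(size=2)                   # zero-mean random input
    v_h = np.dot(w_hebb, u)
    w_hebb = w_hebb + epsilon * u * v_h      # basic Hebb update
    hebb_norms.append(np.linalg.norm(w_hebb))
    v_o = np.dot(w_oja, u)
    w_oja = w_oja + epsilon * (v_o * u - alpha * v_o**2 * w_oja)  # Oja update

# ||w|| under the Hebb rule never decreases and keeps growing,
# while ||w|| under Oja's rule hovers near 1/sqrt(alpha).
```

Each Hebbian step adds a non-negative amount to the squared norm (proportional to v squared), which is the discrete analogue of the always-positive derivative derived above.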
The basic Hebb rule involves multiplying the input firing rates with the output firing rate, and this models the phenomenon of LTP in the brain. We found out that this learning rule is unstable unless we impose a constraint on the length of w after each weight update. The covariance rule involves multiplying u with v minus the average value of v, which means that we can now model both LTP and LTD. But we found out that that's not sufficient to make the learning rule stable, so this learning rule is also unstable unless we impose a constraint on the length of w. And finally, we considered Oja's rule, and we found out that Oja's rule is in fact stable, and the length of the weight vector converges to the value 1 over the square root of alpha. Okay, we've arrived at the finale of the lecture, where we're going to answer the question: what does Hebbian learning do, anyway? We're going to start with the averaged Hebb rule. As you recall, the averaged Hebb rule is given by this differential equation, where Q is the input correlation matrix. And what we would like to do is solve this differential equation to find w(t). So what is w as a function of time when it's being changed according to this differential equation? How do we solve this equation? Any ideas? Well, if you guessed eigenvectors, you would be right. We can always rely on our dear friends, the eigenvectors. So, as before, let's write our vector w(t) in terms of the eigenvectors of the correlation matrix. Now, recall that the input correlation matrix is going to be a real and symmetric matrix, which means that the eigenvectors are going to be orthonormal, which means that we can write any vector, including the vector w(t), as a linear combination of the eigenvectors. Now, if we substitute our expression for w(t) in the differential equation for the averaged Hebb rule, then we can simplify as before, and we get this differential equation for the coefficients.
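Since the slide equations are not visible in the transcript, the steps just described can be written out as follows (reconstructed from the narration, with Q the input correlation matrix and e_i, lambda_i its eigenvectors and eigenvalues):

```latex
\tau_w \frac{dw}{dt} = Q\,w, \qquad Q = \langle u\,u^T \rangle
```

Expanding w(t) in the orthonormal eigenvector basis of Q,

```latex
w(t) = \sum_i c_i(t)\, e_i, \qquad Q\, e_i = \lambda_i e_i,
```

and substituting into the averaged Hebb rule gives one independent differential equation per coefficient:

```latex
\tau_w \frac{dc_i}{dt} = \lambda_i\, c_i.
```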
And when we solve the differential equation for a coefficient, let's say c_i, we have this solution. And when we substitute this solution into our expression for w(t), we get this solution for the weight vector as a function of time. So, what is this equation telling us about the synaptic weight vector w as a function of time? It's telling us that the synaptic weight vector w is a linear combination of the eigenvectors of the input correlation matrix. And furthermore, it's telling us that the coefficients for these eigenvectors have terms that depend exponentially on the eigenvalues of the correlation matrix. So what do you think will happen to w as time goes on? When t becomes very large, the term with the largest eigenvalue, let's say lambda 1 is the largest eigenvalue, dominates this linear combination. What we get, then, is the result that the weight vector turns out to be proportional to the first eigenvector, or principal eigenvector, of the input correlation matrix. And furthermore, if we're using Oja's rule, as you know, the length of the weight vector converges to 1 over the square root of alpha, so in that case the weight vector approaches the value e1 divided by the square root of alpha. We've actually shown something very exciting. We've shown that the brain can actually do statistics, and that's in addition to what we showed last week, which was that the brain can do calculus. There seems to be no stopping the brain. Well, let's look at why we think the brain does statistics. It turns out the Hebbian learning rule that we just analyzed implements the same thing as the statistical technique of principal component analysis, or PCA. So to understand what principal component analysis is all about, let's look at a simple example. Here is some two-dimensional data.
We have these points, which represent the values u1 and u2, which comprise the input vector u. And if we start the Hebb rule with an initial weight vector that's given by this dashed line, then the Hebb rule rotates this initial weight vector to align itself with the direction of maximum variance. So here is the cloud of data, and the final weight vector is going to be parallel to this line, which is the direction of maximum variance. Now, when we apply the Hebb rule to some data that has been shifted, so the data from here can be shifted to a different location, let's say with input mean (2, 2), then we find that the Hebb rule does not do what we want it to do: it finds this direction, going through the origin of this two-dimensional plot, as the direction of maximum variance. And that is really not the direction of maximum variance; the direction of maximum variance is given again by this direction. But luckily, when we apply the covariance rule, we find that it does indeed find the direction of maximum variance. It has taken care of the fact that the input mean is no longer (0, 0) but (2, 2), and that is accounted for by the covariance rule. So the covariance-based Hebb rule is able to find, again, the direction of maximum variance. In summary, what we have shown is that Hebbian learning learns a weight vector that is aligned with the principal eigenvector of the input correlation or input covariance matrix. In other words, it finds the direction of maximum variance in the input data. And that is precisely what principal component analysis does. But now, why is that interesting? Well, principal component analysis is a very important technique used in a variety of fields for tasks such as dimensionality reduction.
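The effect of the mean shift can be illustrated numerically by comparing the top eigenvectors of the correlation matrix (which the averaged Hebb rule follows) and the covariance matrix (which the covariance rule follows). The data set below is synthetic and purely illustrative: most of its variance lies along (1, -1), and its mean sits at (2, 2).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50000
e_var = np.array([1.0, -1.0]) / np.sqrt(2)     # true direction of maximum variance
e_mean = np.array([1.0, 1.0]) / np.sqrt(2)     # direction of the mean
U = (np.array([2.0, 2.0])                      # input mean (2, 2)
     + rng.normal(size=(n, 1)) * 2.0 * e_var   # large variance along (1, -1)
     + rng.normal(size=(n, 1)) * 0.3 * e_mean) # small variance along (1, 1)

Q = U.T @ U / n                                   # correlation matrix <u u^T>
C = Q - np.outer(U.mean(axis=0), U.mean(axis=0))  # covariance matrix

hebb_dir = np.linalg.eigh(Q)[1][:, -1]   # direction the averaged Hebb rule finds
cov_dir = np.linalg.eigh(C)[1][:, -1]    # direction the covariance rule finds

# The covariance rule recovers the true variance direction; the Hebb rule's
# direction is instead pulled toward the mean of the shifted data.
```

This is the picture from the slide in code: once the data no longer has zero mean, only the covariance rule still finds the direction of maximum variance.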
So for example, here what we've done is we've shown that this two-dimensional data can be compressed to just one dimension by projecting each of these two-dimensional points onto their corresponding locations along this particular line. And so we now have a compression from 2D to the 1D location along this particular line, and that's an example of dimensionality reduction, or compression. And you can imagine that when we have a very large input dimension, such as the number of pixels in an image, then this type of technique, where we find the directions of maximum variance in natural images or natural movies, is indeed going to be extremely useful, because you can compress a very high-dimensional space, such as the space of the input image or the input video, to maybe a very small number of principal eigenvectors, the dominant eigenvectors of the input covariance matrix. Well, that's great, but what if we give a neuron this data? What do you think the weight vector for the neuron will converge to if we apply the covariance learning rule? As you might have guessed, the covariance rule ends up finding the weight vector that is aligned with the direction of maximum variance in this data set. Now, unfortunately, as many of you will agree, this data set seems to consist of two clusters of data points. So here's one cluster, and here's the other. And so it appears that this particular data set is not correctly modeled by principal component analysis. Just finding the direction of maximum variance through these two clusters doesn't seem to provide us with a very satisfying model of this particular dataset. So the question that I would like to leave you with is: what should a network of neurons learn from such data? This will be the topic of our next lecture, in which we will encounter the interesting algorithm known as competitive learning, and this will allow us to segue into generative models.
And this will in turn lead us into the exciting field known as unsupervised learning. So until then, hasta la vista and goodbye.