0:05

[MUSIC] Hello, neuro explorers.

Last week, we learned how neurons can be connected to form feed forward and

recurrent networks. This week, we learn how these connections

can be adapted using synaptic plasticity, allowing the brain to learn about the

world from its inputs. Now, what better way to start this

journey than to gaze upon this beautiful drawing of the hippocampus by the great

Ramón y Cajal from a hundred years ago. It was in the hippocampus that some of

the first results on synaptic plasticity were obtained.

One type of synaptic plasticity that's observed in the brain is long term

potentiation, or LTP. LTP is defined as an experimentally

observed increase in the synaptic strength from some neuron A to another

neuron B that can last for several hours or even days.

And the way you would induce LTP is by causing some neuron A to fire a burst of

spikes while neuron A is connected to some other neuron B that is also

excited, meaning it's depolarized or firing some spikes.

Then what you would see is an increase in the size of the excitatory postsynaptic

potential. So that means that for the same input

initially you might have a small EPSP. But after you pair neuron A with neuron B

several times then what you'd observe is an increase in the size of the EPSP which

indicates that the strength of the connection from neuron A to neuron B has

been increased. The counterpart to Long-Term Potentiation

or LTP is Long-Term Depression or LTD. And LTD corresponds to an experimentally

observed decrease in the synaptic strength that lasts for hours or days, and you can

obtain LTD when you have the following situation.

So neuron A fires some spikes, but neuron B does not fire any spikes or it

does not depolarize. So in this situation, when you have some

input from neuron A but no output coming from neuron B then what you observe is a

decrease in the EPSP size. So if the initial EPSP for a single input

was as shown at the very top here. And as the pairing occurs where you have

some input from A but no output from B, you would expect to see a decrease in the

size of EPSP. And what that implies is that the

connection strength from neuron A to neuron B has been decreased.

2:33

Now here's something that's interesting. Even before LTP and LTD were discovered

in the brain, a Canadian psychologist named Donald Hebb predicted that

something like LTP should occur in the brain.

He suggested a learning rule for how neurons in the brain should adapt the

connections among themselves and this learning rule has been called Hebb's

Learning Rule or Hebbian Learning Rule and here's what it says.

If a neuron A repeatedly takes part in firing another neuron B, then the synapse

from A to B should be strengthened. And here is a cartoon of what this

learning rule implies: if we have a neuron A that is firing and that in turn

is participating in the firing of another neuron.

So the neuron B produces for example one or a few spikes.

Now if this situation occurs, Hebb's Learning Rule predicts that one ought to

increase the strength of the connection from neuron A to neuron B, because neuron

A is participating in the firing of neuron B.

And so what we then get is for the same input from neuron A now and the input

from other neurons you have an increase in the activity.

You have more spikes from neuron B. And so another way of phrasing Hebb's

learning rule is through that famous mantra that you already heard during the

first week of lectures and that is that neurons that fire together wire together.

4:08

Now mantras are great for chanting but they're hard to implement on a computer.

Now let's see if we can formalize Hebb's rule as a mathematical model.

So, let's start with a linear feedforward neuron.

So here's the neuron with an output v, and it's receiving some inputs we're

calling the input vector u and the synaptic weights from the inputs to the

output neuron are given by a synaptic weight vector w.

So this is very similar to the feed forward networks that we considered in

the previous set of lectures last week. Now, if we assume that the dynamics of

this network, that is, of the firing rate, are fast, then we can look at the steady state

output. And that's given by this equation.

So the output firing rate of the neuron is nothing but just the dot product of

the weights, the synaptic weights with the inputs and you can write it as a dot

product. Or you can write it as w transpose u or

you can write it as u transpose w. Now, here's how you can write Hebb's rule

mathematically. You can use a differential equation to

capture how the weights from the input neurons to the output neuron change as a

function of time. So there is some time constant tau sub w

that governs how fast the weights are changing.

And we set tau sub w dw dt to be equal to the product of the input firing rates and

the output firing rate. So how does this capture the intuition

behind Hebb's rule? Well, remember that in Hebb's rule, we

increase the strength of the connection from an input neuron A to an output

neuron B, if there is both activity from neuron A as well as activity from neuron

B. And this product of the input firing

rates with the output firing rate captures that intuition.

6:01

Now in order to implement this differential equation on a computer, you

need to discretize it. And so if you look at the discrete

implementation of this differential equation, then this leads you to a weight

update rule. And the weight update rule is shown here.

So this is how you update the weights given inputs.

And so the weight update rule tells you that the weights at time step i plus 1 is

given by the weights at time step i plus some epsilon, a positive constant.

So this is called the learning rate. And that is multiplied by u times v.

Or another way of expressing this equation is to say that the change in the

weight, so delta w, is equal to the learning rate

epsilon times uv. In order to understand the Hebb rule, it

is useful to look at the average effect of this rule on the synaptic weights w.

So here is the Hebb rule from the previous slide and if you want to look at

the average effect of this rule, then we can take the average of the right-hand

side with respect to all the inputs u. So these brackets over here denote the

average. And if we now substitute the value for v

from the previous slide again. Then what we find is that the Hebb rule

modifies the weight w according to the input correlation matrix, where the

correlation matrix, as you might know, is given by simply the average of uu

transpose. So what does this mean?

What does it mean to change the weight w according to the input correlation

matrix? Well think about that for a minute.
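While you think about that, it may help to see the discrete Hebb rule as code. Below is a minimal sketch for a single linear neuron; the learning rate, the random inputs, and the number of steps are illustrative choices, not values from the lecture.

```python
import numpy as np

# Minimal sketch of the discrete Hebb rule for a single linear neuron:
#   v = w . u                      (output firing rate)
#   w_{i+1} = w_i + epsilon * u * v
rng = np.random.default_rng(0)
epsilon = 0.01                     # learning rate (illustrative value)
U = rng.normal(size=(500, 2))      # 500 two-dimensional input vectors u
w = np.array([0.5, -0.5])          # initial synaptic weight vector

for u in U:
    v = w @ u                      # output firing rate: dot product of w and u
    w = w + epsilon * u * v        # Hebbian update: delta w = epsilon * u * v

# On average this update follows tau_w dw/dt = Q w, where Q = <u u^T>
# is the input correlation matrix.
```

Averaging the update over the inputs is what produces the correlation matrix Q in the next step of the argument.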

7:46

Well the Hebb rule that we've been discussing so far only increases synaptic

weights and this models a phenomenon of LTP or long term potentiation in the

brain. But as we discussed earlier the brain

also exhibits LTD or long term depression, which involves decreasing the

strength of the connection from one neuron to another.

Now can we model both LTP and LTD using a single learning rule?

In other words can we derive a learning rule that can both increase or decrease

the strength of a synaptic connection? One rule that incorporates both LTP and

LTD is the covariance rule and we'll come to why it's called that in just a minute.

Here is the differential equation for the covariance rule and you'll notice that it

is again a product of the input firing rate with the output firing rate.

Except that now the output firing rate is replaced by a difference term: the

difference between the output firing rate and the average of the output firing rate.

So what is the effect of this difference term?

Well consider the case when the output firing rate is bigger than the average

output firing rate. So in this case you're going to have a

positive quantity here, which means that when you multiply the input firing rates

with a positive quantity you're going to have an increase in the synaptic

strength. And that is going to result in LTP.

On the other hand, if the output firing rate is low, for example

it is less than the average output firing rate, or even the case where there is no

output so v could be 0. In that case what you're going to get is

a negative quantity here and so when you multiply the input firing rates with a

negative quantity you're going to get a decrease in the synaptic weight.

And so that results in LTD. So what does the Covariance Rule do?

Well, just as we did with the Hebb rule we can look at the average effect of this

rule. And that means taking the average of the

right hand side of the rule with respect to all the inputs u.

And if you substitute the value for v and you simplify these expressions then what

you get is the fact that the covariance rule is changing the weight

vector w according to, surprise surprise, the input covariance matrix.

So here's the input covariance matrix. It's simply the average of u u transpose

minus the average of u times the average of u transpose.

At this point I would like you to think about what it means for w to be changed

according to the input covariance matrix. What do you think w would converge to

when it's modified according to this equation?

We will answer that question towards the end of the lecture.
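In the meantime, here is a sketch of a discretized covariance rule, delta w = epsilon * u * (v - <v>). Tracking <v> with a slow running average is an implementation choice for this sketch, and the learning rate and toy inputs are illustrative assumptions, not details from the lecture.

```python
import numpy as np

# Discretized covariance rule: delta w = epsilon * u * (v - <v>).
# <v> is estimated with a slow running average (an illustrative choice;
# the lecture simply writes the average output firing rate).
rng = np.random.default_rng(1)
epsilon = 0.005
U = rng.normal(loc=2.0, size=(2000, 2))  # inputs with nonzero mean (2, 2)
w = np.array([0.1, 0.1])                 # initial synaptic weight vector
v_bar = 0.0                              # running estimate of <v>

for u in U:
    v = w @ u
    v_bar = 0.99 * v_bar + 0.01 * v      # slowly track the average output
    w = w + epsilon * u * (v - v_bar)    # LTP if v > <v>, LTD if v < <v>
```

Note that the same update line produces both an increase and a decrease in the weights, depending on whether v is above or below its average.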

10:32

Now let's ask the question are these learning rules stable?

In other words does w converge to a stable value or does it explode?

Now how do we answer this question? Well one could look at the length of w as

a function of time and see if the length of w remains bounded or if the length of

w grows without any bounds. Let's first look at the Hebb rule.

So here is the Hebb rule and let's look at how the length of w squared changes as

a function of time. So let's take the derivative of the

length of w squared with respect to time and when we do that, we get this

expression here. And if we substitute the value for dw dt

according to the Hebb rule, we have this expression.

And note that w transpose u here is nothing but the output firing rate v.

And if we substitute that value here, we get this expression.

Now unless v is always equal to 0, this expression is going to be positive.

And so what we then have is the fact that the derivative of the length of w squared

with respect to time, is always positive. What does that mean?

It means that the length of w is going to keep increasing, which means that w grows

without bound. Well, you might be thinking that's not

too surprising, because the Hebb rule only increases synaptic weights.

It only models LTP and so perhaps that's why w grows without bound.

Well if that's the case then what about the covariance rule?

So as we discussed, the covariance rule incorporates both LTP and LTD.

And therefore it can both increase synaptic weights as well as decrease

synaptic weights, and perhaps that makes the covariance rule stable.

What do you think? Do you think it's stable?
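Before the answer, the stability argument for the Hebb rule can be checked numerically: a single discrete update changes the squared length of w by 2*eps*v^2 + eps^2*||u||^2*v^2, which is never negative. The data and step size below are arbitrary illustrative choices.

```python
import numpy as np

# Track ||w||^2 under the discrete Hebb rule. Each update adds
# 2*eps*v^2 + eps^2*||u||^2*v^2 >= 0 to ||w||^2 (since w . u = v),
# so the sequence of squared lengths can never decrease.
rng = np.random.default_rng(2)
eps = 0.01
U = rng.normal(size=(300, 3))
w = rng.normal(size=3)

norms_sq = [float(w @ w)]          # record ||w||^2 over time
for u in U:
    v = w @ u
    w = w + eps * u * v            # Hebbian update
    norms_sq.append(float(w @ w))
```

The recorded sequence is monotonically non-decreasing, which is the discrete counterpart of d||w||^2/dt being always positive.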

12:27

Well, here's the answer and I'm sorry to say that it's not good news.

If you take the derivative of the length of w squared with respect to time as

before and we simplify the resulting expression then, if you further take the

average of the right hand side of that expression, what you find is that the

derivative of the length of w squared with respect to time is always positive.

And what that means is that the length of w when changed according to the

covariance rule grows without any bound, which means that w grows without any

bound. So, how do we stabilize the Hebb rule and

the covariance rule? Well, one way you can do that is by

enforcing a constraint on the synaptic weight vector w.

So what kind of a constraint can we impose?

Well, you could impose the constraint that the length of w should always be

equal to 1 and how do we do that? Well, each time that you update the

weight vector according to a new input, we simply divide the resulting weight

vector with the length of that weight vector.

And this ensures that the length of the weight vector always equals 1.

Now this seems like a hack and perhaps it's not even biologically plausible.

So is there a more elegant way of imposing a constraint on the length of

the weight vector? Now let's look at the last of our Hebbian

learning rules and this one's called Oja's rule named after its discoverer.

And Oja's rule is similar to the Hebb rule in that we again multiply the input

firing rates with the output firing rate except that now we subtract a term alpha

v squared w from u times v and alpha is some positive value.

Now the question is, is Oja's rule stable?

What do you think? Well, let's do what we did before, which

is take the derivative of the length of w squared with respect to time.

So when we do that, we get this differential equation for the length of w

squared. So, looking at this differential

equation, do you think that the length of w squared converges to a particular value

or do you think that the length of w squared grows without bound?

14:40

Well, here's the answer. So length of w squared in fact does

converge to a particular value, and it converges to the value 1 over alpha.

And you can see that by setting the derivative equal to zero. In that case,

unless v is always equal to 0, we have the fact that the length of w squared is equal to

1 over alpha because this term over here has to be equal to 0.

And if that's the case then the length of w itself must be equal to 1 over square

root of alpha. So what this tells us is that w for Oja's

rule does not grow without bound, which means that the rule is stable.
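As a numerical sanity check on this claim, here is a sketch of a discretized Oja's rule; the learning rate, the value of alpha, and the synthetic inputs are illustrative assumptions.

```python
import numpy as np

# Discretized Oja's rule: delta w = eps * (v*u - alpha * v^2 * w).
# The claim: ||w|| converges to 1 / sqrt(alpha), here 1/sqrt(2).
rng = np.random.default_rng(3)
eps, alpha = 0.01, 2.0
U = rng.normal(size=(5000, 2))
w = rng.normal(size=2)

for u in U:
    v = w @ u
    w = w + eps * (v * u - alpha * v**2 * w)  # Hebb term minus weight decay

final_length = np.linalg.norm(w)              # should settle near 1/sqrt(alpha)
```

The subtracted alpha*v^2*w term acts as an activity-dependent decay that self-limits the weight growth, which is what makes the rule stable without any explicit normalization step.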

Okay, let's summarize what we've learned so far about Hebbian learning.

The basic Hebb rule involves multiplying the input firing rates with the output

firing rate and this models the phenomenon of LTP in the brain.

We found out that this learning rule is unstable unless we impose a constraint on

the length of w after each weight update. The covariance rule involves multiplying

u with v minus the average value of v, which means that we can now model both

LTP and LTD. But we found out that that's not

sufficient to make the learning rules stable.

So this learning rule is also unstable unless we impose a constraint on the

length of w. And finally we considered Oja's rule and

we found out that Oja's rule is in fact stable and the length of the weight

vector converges to the value 1 over square root of alpha.

16:14

Okay, we've arrived at the finale of the lecture. We're going to answer the

question: what does Hebbian learning do anyway? We're going to start with the

averaged Hebb rule. So, as you recall, the averaged Hebb rule is given by this

differential equation, where Q is the input correlation matrix.

And what we would like to do is solve this differential equation to find w(t).

So what is w as a function of time when it's being changed according to this

differential equation. So how do we solve this equation?

Any ideas? Well, if you guessed eigenvectors, you

would be right. We can always rely on our dear friends,

the eigenvectors. So, as before, let's write our vector w(t)

in terms of the eigenvectors of the correlation matrix.

Now recall that the input correlation matrix is going to be a real and symmetric

matrix, which means that the eigenvectors are going to be orthonormal, which means

that we can write any vector, including the vector w(t),

as a linear combination of the eigenvectors.

Now if we substitute our expression for w(t) in the differential equation for the

average Hebb rule, then we can simplify as before and we can get this

differential equation for the coefficients.

And when we solve the differential equation for the coefficient, let's say

ci, then we have this solution. And when we substitute this solution into

our expression for w(t), then we get this solution for the weight vector as a

function of time. So, what is this equation telling us

about the synaptic weight vector w as a function of time?

It's telling us that the synaptic weight vector w is a linear combination of the

eigenvectors of the input correlation matrix.

And furthermore, it's telling us that the coefficients for these eigenvectors have

terms that are exponentially dependent on the eigenvalues of the correlation

matrix. So what do you think will happen to w as

time goes on? So when t becomes very large, what do you

think will happen to w? When t becomes large, the term with the largest

eigenvalue, let's say it's the eigenvalue lambda 1, dominates this linear

combination. So what we get then is the result that the weight vector turns

out to be proportional to the first

eigenvector or the principal eigenvector of the input correlation matrix.

And furthermore, if we're using Oja's rule as you know, the length of the

weight vector then converges to 1 over the square root of alpha.

So in that case, the weight vector approaches the value e1 divided by square

root of alpha. We've actually shown something very

exciting. We've shown that the brain can actually

do statistics and that's in addition to what we showed last week which was that

the brain can do calculus. There seems to be no stopping the brain.

Well, let's look at why we think the brain does statistics.

So it turns out the Hebbian learning rule that we just analyzed implements the same

thing as the statistical technique of principal component analysis or PCA.

So to understand what principal component analysis is all about let's look at a

simple example. So here is some two dimensional data.

We have these points, which represent the values u1 and u2, which comprise

the input vector u. And if we start the Hebb rule with an

initial weight vector that's given by this dashed line, then the Hebb rule

rotates this initial weight vector to align itself with the direction of

maximum variance. So here is the cloud of data,

and the final weight vector is going to be parallel to this line,

which is the direction of maximum variance.
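This alignment claim can be checked numerically. The sketch below runs the Hebb rule, with an explicit per-step normalization to keep w bounded as discussed earlier, on correlated two-dimensional data, and compares the result with the principal eigenvector of Q = <u u^T>. The data generation and learning rate are illustrative assumptions.

```python
import numpy as np

# Hebb rule on zero-mean correlated 2-D data: the (normalized) weight
# vector should align with the principal eigenvector of Q = <u u^T>.
rng = np.random.default_rng(4)
eps = 0.005
shared = rng.normal(size=3000)               # variance shared along (1, 1)
U = np.stack([shared + 0.2 * rng.normal(size=3000),
              shared + 0.2 * rng.normal(size=3000)], axis=1)

w = rng.normal(size=2)
for u in U:
    v = w @ u
    w = w + eps * u * v                      # Hebbian update
    w = w / np.linalg.norm(w)                # keep ||w|| = 1 each step

Q = (U.T @ U) / len(U)                       # input correlation matrix
e1 = np.linalg.eigh(Q)[1][:, -1]             # principal eigenvector of Q
alignment = abs(w @ e1)                      # |cos(angle)|, near 1 if aligned
```

With normalization, the Hebb iteration behaves much like power iteration on Q, which is why it converges to the principal eigenvector.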

Now when we apply the Hebb rule to some data that has been shifted, so the data

from here is shifted to a different location,

let's say with input mean (2, 2).

So, in that case we find that the Hebb rule does not do what we want it to do:

it finds this direction as the direction of maximum variance going

through the origin of this two dimensional plot.

And that is really not the direction of maximum variance; the direction of

maximum variance is given again by this direction. But luckily, when we apply

the covariance rule, we find that it does indeed find the direction of maximum

variance. So it's taken care of the fact that the

input mean is no longer (0, 0) but (2, 2), and that is accounted for by the

covariance rule. So the covariance-based Hebb rule is

able to find again the direction of maximum variance.
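The shifted-data example can also be sketched in code. Here the covariance rule is implemented with the exact mean, <v> = w . <u>, which is an illustrative choice, and the mean-shifted data are synthetic.

```python
import numpy as np

# Covariance rule on mean-shifted data: with input mean (2, 2), the plain
# Hebb rule would align with <u u^T>'s principal direction (pulled toward
# the mean), while the covariance rule recovers the true direction of
# maximum variance. Per-step normalization keeps w bounded, as before.
rng = np.random.default_rng(5)
eps = 0.005
shared = rng.normal(size=4000)
U = np.stack([shared + 0.2 * rng.normal(size=4000),
              -shared + 0.2 * rng.normal(size=4000)], axis=1) + 2.0

u_bar = U.mean(axis=0)                       # input mean, roughly (2, 2)
w = rng.normal(size=2)
for u in U:
    v = w @ u
    w = w + eps * u * (v - w @ u_bar)        # covariance rule: v minus <v>
    w = w / np.linalg.norm(w)

C = np.cov(U.T)                              # input covariance matrix
e1 = np.linalg.eigh(C)[1][:, -1]             # true max-variance direction
alignment = abs(w @ e1)
```

Averaged over the inputs, the update u*(v - w . <u>) equals C w, which is why this rule follows the covariance matrix rather than the correlation matrix.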

21:19

So in summary what we have shown is that Hebbian learning learns a weight vector

that is aligned with the principal eigenvector of the input correlation or

the input covariance matrix. In other words, it finds the direction of

maximum variance in the input data. And that is precisely what principal

component analysis does but now why is that interesting?

Well, principal component analysis is a very important technique used in a

variety of fields for tasks such as dimensionality reduction.

So for example here what we've done is we've shown that this two-dimensional

data can be compressed to just one dimension by projecting each of these two

dimensional points onto their corresponding locations along this

particular line. And so we now have a compression from 2D

to the 1D location along this particular line, and that's an example of

dimensionality reduction or compression. And you can imagine that when we have a

very large input dimension, such as the number of pixels in an image.

Then this type of technique where we find the directions of maximum variance in

natural images or natural movies is indeed going to be extremely useful.

Because you can compress a very high dimensional space such as the space of

the input image or the space of the input video to maybe a very small number of

principal eigenvectors, the dominant eigenvectors of the input covariance

matrix. Well that's great but what if we give a

neuron this data, what do you think the weight vector for the neuron will

converge to if we apply the covariance learning rule?

As you might have guessed the covariance rule ends up finding the weight vector

that is aligned with the direction of maximum variance in this data set.

Now unfortunately as many of you will agree this data set seems to consist of

two clusters of data points. So here's one cluster and here's the

other. And so it appears that this particular

data set is not correctly modeled by principal component analysis.

So just finding the direction of maximum variance through these two clusters

doesn't seem to provide us with a very satisfying model of this particular

dataset. So the question that I would like to

leave you with is what should a network of neurons learn from such data?

This will be the topic of our next lecture.

And we will encounter the interesting algorithm known as competitive learning

and this will allow us to segue into generative models.

And this will in turn lead us into the exciting field known as unsupervised

learning. So until then, hasta la vista and

goodbye.