
Hello, compañero fans. In the previous two lectures we made friends with Hebb and learned that his learning rule implements principal component analysis. We then learned about unsupervised learning, which tells us how the brain can learn models of its inputs with no supervision at all. I left you with the question: how do we learn models of natural images? What does the brain do? Well, as the saying goes, "When in doubt, trust your eigenvectors." So can we use eigenvectors, or equivalently principal component analysis, to represent natural images? Let's see.

So here's a famous example from Turk and Pentland. They took a bunch of face images, each with, let's say, N pixels, and they computed the eigenvectors of the input covariance matrix. When they did that, they found that the eigenvectors looked like this, and they called these eigenvectors "Eigenfaces." Now we can represent any face image, such as this one, as a linear combination of all of our Eigenfaces. So here is the equation that captures this relationship. Now, why can we do that? Well, remember that the Eigenfaces are the eigenvectors of the input covariance matrix, and since the covariance matrix is real and symmetric, its eigenvectors form an orthonormal basis with which we can represent the input vectors.

Now, here's something interesting: you can use only the first M principal eigenvectors. What do we mean by the first M principal eigenvectors? These are the eigenvectors associated with the M largest eigenvalues of the covariance matrix. So if we use only the first M principal eigenvectors to represent the image, we get an equation that looks like this. This equation tells you that there are some differences between the actual image and its reconstruction using only the first M principal eigenvectors, and we model those differences with a noise term.

So why is this a useful model? It's useful because you can use it for image compression. Suppose your input images were of size one thousand by one thousand pixels, which means that N is going to equal one million. Now, if the first, let's say, 10 principal eigenvectors are sufficient, meaning the 10 largest eigenvalues are enough to explain most of the variance in your data, then M is going to equal 10, which means that just 10 numbers are enough to represent an image. So 10 of these coefficients are sufficient to represent an image consisting of one million pixels. What we have, then, is a tremendous dimensionality reduction, or compression, from one million pixels down to just 10 numbers per image.
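To make this concrete, here is a minimal NumPy sketch of eigenface-style compression. The data are random stand-ins for face images (not Turk and Pentland's actual dataset), and the sizes `N`, `M`, and the number of images are made up for illustration.

```python
import numpy as np

# Illustrative sketch only: each row of X plays the role of one face image
# flattened to N pixels; the numbers are made up, not real face data.
rng = np.random.default_rng(0)
n_images, N, M = 50, 100, 10           # 50 images, N pixels each, keep M components

X = rng.normal(size=(n_images, N))
mean = X.mean(axis=0)
Xc = X - mean                          # center the data before computing covariance

# The eigenvectors of the input covariance matrix are the right singular
# vectors of the centered data, so we can get them from an SVD.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
E = Vt[:M]                             # first M principal eigenvectors ("eigenfaces")

coeffs = Xc @ E.T                      # M numbers per image: the compressed code
recon = coeffs @ E + mean              # reconstruction from just M coefficients

print(coeffs.shape)                    # (50, 10): 10 numbers per image
```

Each image is now summarized by just `M` coefficients, and the reconstruction is the best possible one (in squared error) using `M` orthonormal basis vectors.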

Now wait a minute, not so fast, eigenvectors. The eigenvector representation may be good for compression, but it's not very good if you want to extract the local components, or parts, of an image. For example, if you want to extract the parts of a face, such as the eyes, the nose, and the ears, you're not going to get that from an eigenvector analysis, or equivalently a principal component analysis, of the face images. Likewise, you're not going to be able to extract local components such as edges from natural scenes. Now, this is certainly a sad day for the course, because eigenvectors have let us down for the first time. But maybe we can resurrect the linear model so beloved by the eigenvectors.

So here is the linear model. We have a natural scene, for example, that is represented by a linear combination of a set of basis vectors or features; these do not have to be eigenvectors anymore. And here is the equation again that captures this relationship. The difference now from the case of eigenvectors in the previous slide is that we are allowing M, the number of these basis vectors or features, to be larger than the number of pixels. Why does that make sense? Well, consider the fact that the number of parts of objects and scenes can be much larger than the number of pixels. So it does make sense to allow M, the number of basis vectors or features, to be larger than the number of pixels.

And here's another way of writing the same equation. We're replacing the summation with a matrix multiplication, G times v, where the columns of the matrix G are the different basis vectors or features, and the elements of the vector v are the coefficients for each of those basis vectors or features. So the challenge before us now is to learn the matrix G, the basis vectors, and, for any given image, to estimate the coefficients, the vector v. In order to learn the basis vectors G and estimate the causes v, we need to specify a generative model for images.

As you recall, we can define the generative model by specifying a prior probability distribution for the causes as well as a likelihood function. Let's first look at the likelihood function. We start with our linear model from the previous slide, and we assume that the noise vector is Gaussian white noise, which means that there are no correlations across the different components of the noise vector. If we also assume that the Gaussian has zero mean, then we can show that the likelihood function is itself a Gaussian distribution with a mean of G times v and a covariance equal to the identity matrix. And here is what this likelihood function is proportional to: this exponential function. Finally, if we take the logarithm of the likelihood function, we obtain the log likelihood, which is simply this quadratic term: negative one half of the squared length of this vector, which is the difference between the input image and the reconstruction, or prediction, of the image using your basis vectors.

Now, here's an interesting observation. A lot of algorithms in engineering and machine learning attempt to minimize the squared reconstruction error, which is just this term here. So you can see that minimizing the reconstruction error is the same thing as maximizing the log likelihood, or equivalently maximizing the likelihood of the data. Isn't that interesting?
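As a quick sanity check with made-up numbers, the following sketch confirms numerically that the Gaussian log likelihood differs from minus one half of the squared reconstruction error only by an additive constant, so maximizing one is the same as maximizing the other.

```python
import numpy as np

# Made-up sizes and data, just to check the likelihood identity numerically.
rng = np.random.default_rng(1)
N, M = 5, 3
G = rng.normal(size=(N, M))            # basis vectors in the columns of G
v = rng.normal(size=M)                 # coefficients (the causes)
u = rng.normal(size=N)                 # an input image

resid = u - G @ v                      # reconstruction error vector
half_sq_err = 0.5 * resid @ resid      # half the squared reconstruction error

# Log of the Gaussian likelihood with mean G v and identity covariance.
log_lik = -half_sq_err - 0.5 * N * np.log(2.0 * np.pi)

# The difference is a constant that does not depend on v or G.
print(np.isclose(log_lik + half_sq_err, -0.5 * N * np.log(2.0 * np.pi)))  # True
```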

Now, let's define the prior probability distribution for the causes. One assumption you can make is that the causes are independent of each other. If you make that assumption, then the prior probability for the vector v is equal to the product of the individual prior probabilities for each of the causes. Now, this assumption might not strictly hold for natural images, because some of the components might depend on other components, but let's start off with this simplifying assumption and see where it takes us. If you take the logarithm of the prior probability distribution for v, then instead of a product we now have a summation of all the individual log prior probabilities for the causes.

Now, the question is, how do we define these individual prior probabilities for the causes? Here's one answer. We can begin with the observation that for any input we want only a few of these causes v_i to be active. Why does that make sense? Well, if we are assuming that these causes represent individual parts or components of natural scenes, then for any given input, which contains, for example, a particular object, only a few of these causes are going to be activated in that particular image, because those are the parts of that particular object, and the rest of the v_i's are going to be zero. So what we have, then, is that v_i, for any particular i, is going to be zero most of the time, but it's going to be high for some inputs. This leads to the notion of a sparse distribution for p(v_i). What this means is that the distribution p(v_i) is going to have a peak at zero, so v_i is going to be zero most of the time, but the distribution is going to have a heavy tail, which means that for some inputs it's going to take a high value. This kind of distribution is also called a super-Gaussian distribution.

Now, here are some examples of these super-Gaussian, or sparse, prior distributions. This plot shows three distributions, all of which can be expressed as p(v) = exp(g(v)). The dotted distribution here is the Gaussian distribution, and the other two distributions, the dashed line as well as the solid line, are examples of sparse distributions. If we take the log of p(v), we get a clearer picture of what these distributions look like. You can see that when g(v) equals minus the absolute value of v, we get an exponential distribution, and when g(v) equals minus the logarithm of one plus v squared, we get something called the Cauchy distribution. To summarize, then, the prior probability p(v) is equal to the product of these exponential functions, and therefore the logarithm of the prior probability p(v) is going to equal the summation of all these values g(v_i), plus some constant.
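Here is a small sketch of the three priors from the plot, each written as p(v) proportional to exp(g(v)); the comparison at v = 3 is a made-up numerical check that the sparse priors penalize large values far less than the Gaussian does, which is what "heavy tails" means.

```python
import numpy as np

# Each prior is p(v) proportional to exp(g(v)); only the shape of g matters here.
def g_gaussian(v):
    return -v**2                        # Gaussian: thin tails, not sparse

def g_exponential(v):
    return -np.abs(v)                   # exponential prior: sparse, peaked at 0

def g_cauchy(v):
    return -np.log(1.0 + v**2)          # Cauchy prior: sparse, heavy tails

# Drop in the log prior when moving from v = 0 out to v = 3.
gauss_drop = g_gaussian(0.0) - g_gaussian(3.0)        # 9.0
expo_drop = g_exponential(0.0) - g_exponential(3.0)   # 3.0
cauchy_drop = g_cauchy(0.0) - g_cauchy(3.0)           # log(10), about 2.3

# Heavy tails: the sparse priors penalize a large v much less than the Gaussian.
print(gauss_drop > expo_drop > cauchy_drop)           # True
```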

Okay. After all that hard work, we finally arrive at the grand mathematical finale of figuring out how to find v given any particular image, and how to learn G. We're going to use a Bayesian approach to do that. By Bayesian we mean that we are going to maximize the posterior probability of the causes. So here is p(v|u). From Bayes' rule, we can write p(v|u) as just the product of the likelihood times the prior, where k here is just a normalization constant. We can maximize the posterior probability by maximizing the log posterior; that's the same thing as maximizing the posterior.

And so, here's the function F, which is the log posterior, and you can see that the function F has two terms. One of them is a term containing the reconstruction error. The other is a term containing the sparseness constraint. We can maximize this function by essentially doing two things: we have to minimize the reconstruction error, while at the same time maximizing the sparseness term. So you can see how this function F trades off the reconstruction error against the sparseness constraint. We would like our representation to be sparse, with only a few of these components active, but at the same time we would also like to preserve the information in the images, and that's enforced by the reconstruction error term.

One way of maximizing F with respect to v and G is to alternate between two steps. The first step is maximizing F with respect to v, keeping G fixed. The second step is maximizing F with respect to G, keeping v fixed to the value obtained from the previous step. Now, this should remind you of the EM algorithm. Just as in the E step of the EM algorithm we computed the posterior probability of v, here we are computing a value for v that maximizes F. And similar to the M step of the EM algorithm, where we updated the parameters, here we're updating the parameter G, the matrix G, to maximize the function F. Now, the big question is, how do we maximize F with respect to v and G?

Well, one potential answer is to use something called "gradient ascent," which means that we change v, for example, according to the gradient of F with respect to v. So why does this make sense? Here's why. Let me draw F as a function of v. Suppose F is this function; you can see that the value of v which maximizes F is some value here. Let's call that v*. If the current value of v is, let's say, to the left of v*, you can look at the gradient of F with respect to v; it's the slope of the tangent here. Do you think the gradient is positive or negative at this particular value? Well, if you answered positive, you would be correct. So what does that mean? It means that if you update v according to this equation, then you're going to move v in this direction: you're going to add a small positive value to v, and that's going to move v in the right direction, towards v*. Similarly, suppose you're on the other side; let's say this is where your current value of v is, and I'm calling that v'. Then you can see that the gradient is, in this case, you guessed right, negative, which means that you're going to subtract a small value from your current value, and that's again going to move v towards the optimal value. So either way, gradient ascent does the right thing.
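The picture above can be checked numerically. In this toy sketch, the function F and its maximizer v* = 2 are made up for illustration; applying the gradient ascent update repeatedly converges to the maximum from either side, just as the argument predicts.

```python
# Toy gradient ascent on F(v) = -(v - 2)**2, which has its maximum at v* = 2.
def dF_dv(v):
    return -2.0 * (v - 2.0)             # gradient of F with respect to v

step = 0.1                              # a small, made-up learning rate
for v0 in (-5.0, 9.0):                  # start left and then right of v*
    v = v0
    for _ in range(200):
        v += step * dF_dv(v)            # move v along the gradient
    print(round(v, 6))                  # 2.0 from both sides
```

Starting left of v* the gradient is positive and v increases; starting right of v* the gradient is negative and v decreases. Either way it settles at v*.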

Okay, let's apply the idea of gradient ascent to our problem. We would like to take the derivative of F with respect to v, and here is the expression that we get. Here, g' denotes the derivative of our function g. We can now look at the way in which we should update the vector v, and that's given by this differential equation with some time constant. The interesting thing to note here is that we can interpret this differential equation for v as simply the firing rate dynamics of a recurrent network. So what does this network do? It takes the reconstruction error and uses it to update the activities of the recurrent network, and it also takes into account the sparseness constraint that encourages the output activities to be sparse.

And here is the recurrent network that implements our differential equation for v. You can see that it has both an input layer of neurons and an output layer of neurons. The interesting observation here is that the network makes a prediction of what it expects the input to be: G times v is a prediction, or a reconstruction, of the input. Then we take an error: u minus Gv is the reconstruction error, or the prediction error. That error is passed back to the output layer, and the output layer neurons then use it to correct the estimates they have of the causes of the image, as given by the vector v. For any given image, the network iterates by predicting and correcting, and eventually converges to a stable value of v.

We can learn the synaptic weight matrix G, which contains the basis vectors or features that we are trying to learn, by again applying gradient ascent. We set dG/dt to be proportional to the gradient of F with respect to G. Taking the derivative of F with respect to G, you get an expression that looks like this: (u minus Gv) times v transpose. What we end up with, then, is this learning rule for updating the synaptic weights G. It has a time constant tau_G, which specifies the time scale at which we are going to update the weights G. If we set tau_G to be bigger than the time constant we had for v, that ensures that v converges faster than G. So we have the desired property that for any given image, v will converge quickly to some particular value, and then we can use that value of v to update the weights of the network.
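The two-time-scale scheme just described can be sketched as a short program: a fast inner loop that infers v with G fixed, and a slow outer loop that updates G with the converged v. All sizes, step sizes, and the toy input patches below are assumptions made for illustration; real experiments use patches from natural images.

```python
import numpy as np

# Minimal sparse coding sketch in the spirit of what the lecture describes.
rng = np.random.default_rng(2)
N, M = 16, 32                          # M > N: an overcomplete set of basis vectors
lam = 0.1                              # made-up weight on the sparseness term

def g_prime(v):
    return -2.0 * v / (1.0 + v**2)     # derivative of the Cauchy prior g(v)

G = rng.normal(scale=0.1, size=(N, M))
for _ in range(100):                   # one toy "patch" per outer iteration
    u = rng.normal(size=N)             # stand-in for a natural image patch
    v = np.zeros(M)
    for _ in range(50):                # fast dynamics: infer v with G fixed
        err = u - G @ v                # prediction (reconstruction) error
        v += 0.05 * (G.T @ err + lam * g_prime(v))
    err = u - G @ v
    G += 0.01 * np.outer(err, v)       # slow dynamics: Hebbian-like update of G
    G /= np.linalg.norm(G, axis=0)     # keep each basis vector at unit length

print(G.shape)                        # (16, 32)
```

Note the Hebbian structure of the G update: it is an outer product of the (error-corrected) input with the output activities v, matching the learning rule derived above.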

Now, if you look closely at the right-hand side of the learning rule, you'll see that it's actually Hebbian. You can see that it contains the term u times v transpose, which is basically the Hebbian term. It also contains a subtractive term, and that actually makes this rule very similar, in fact almost identical, to the Oja rule for learning. So if the learning rule is almost identical to Oja's rule, why doesn't this network just compute the eigenvectors? Why isn't it just doing principal component analysis? Well, the answer lies in the fact that the network is actually trying to compute a sparse representation of the image. That ensures that the network does not just learn the eigenvectors of the covariance matrix; it's actually learning a set of basis vectors that can represent the input in a sparse manner.

Okay, so here's a pop quiz question. If you feed the network patches from natural images, what do you think it will learn in its matrix G? What kind of basis vectors would you predict are learned for natural image patches? Time for the drum roll. The answer, as first discovered by Olshausen and Field, is that the basis vectors remarkably resemble the receptive fields in the primary visual cortex, as originally discovered by Hubel and Wiesel. Each of these square images is one vector, or one column, of the matrix G; you can obtain a vector from a square image by collapsing each of its rows into one long vector, and that would be one column of the matrix G. So what is this result telling us? It's telling us that the brain is perhaps optimizing its receptive fields to code for natural images in an efficient manner.

You can look at this model as an example of an interpretive model. This goes back to the first week of our course, where we discussed the three different kinds of models in computational neuroscience. This would be an example of an interpretive model that provides an ecological explanation for the receptive fields that one finds in the primary visual cortex.

The sparse coding network that we have been discussing is in fact a special case of a more general class of networks known as predictive coding networks. Here's a schematic diagram of a predictive coding network. The main idea is to use feedback connections to convey predictions of the input, and to use the feedforward connections to convey the error signal between the prediction and the input. The box labeled "predictive estimator" maintains an estimate of the hidden causes of the input, the vector v.

Now, here are some more details of the predictive coding network. As in the sparse coding network, there are a set of feedforward weights and a set of feedback weights. But we can also include a set of recurrent weights, which allow the network to model time-varying input. For example, if the input is not a static image but a natural movie, then we can model the dynamics of the hidden causes of the movie by allowing the estimates of the hidden causes to change over time, and that is modeled using a set of recurrent synapses. Additionally, one can include a gain on the sensory errors, which allows you to model certain effects such as visual attention.

Well, this brings back some fond memories for me, because I worked on these predictive coding networks as a graduate student and as a postdoc. If you're interested in more details of these predictive coding models, I would encourage you to visit the supplementary materials on the course website, where you'll find some papers that I wrote as a graduate student and as a postdoc.

And finally, the predictive coding model suggests an answer to a longstanding puzzle about the anatomy of the visual cortex. Here's a diagram, by Gilbert and Li, of the connections between different areas of the visual cortex, and the puzzle is this: every time you see a feedforward connection, such as the one from V1 to V2 given by the blue arrow, you almost always also find a feedback connection from the second area back to the first, in this case from V2 back to V1. So why is there a feedback connection for every feedforward connection between two cortical areas? Here's a schematic depiction of this puzzle. Information from the retina, as you know, is passed on to the LGN, or lateral geniculate nucleus, and then on to cortical area V1, cortical area V2, and so on. But for every set of feedforward connections there seems to be a corresponding set of feedback connections.

So what could be the role of these feedforward and feedback connections? The predictive coding model suggests interesting functional roles for them. According to the model, the feedback connections convey predictions of the activities in a lower cortical area from a higher cortical area, and the feedforward connections from one cortical area to the next convey the error signal between the predictions and the actual activities. It turns out that this model can explain certain interesting phenomena that people have observed in the visual cortex, known as contextual effects, or surround suppression or surround effects. These effects can be explained in an interesting manner by the hierarchical predictive coding model when it is trained on natural images.

So I would encourage you to go to the supplementary materials on the course website if you're interested in more details. Okay, amigos and amigas, that wraps up this lecture. Next week we'll learn how neurons can act as classifiers, and how the brain can learn from rewards using reinforcement learning. Until then, adios and goodbye.
