0:00

Welcome back. In the previous lecture, we considered this dataset and we asked the question: can a network of neurons learn to represent such data? When we applied the covariance rule to this dataset, as you'll recall, it ended up finding a weight vector that was aligned with the direction of maximum variance of the dataset. But we noted at the time that this was not a very satisfying model of the data, because the input data appear to consist of two clusters of points. So let's ask the question: can neurons learn to represent such clusters?

Here is one way in which we can use neurons to represent clusters. Let's use a feedforward network with two output neurons, neuron A and neuron B, and let's use neuron A to represent cluster A and neuron B to represent cluster B. We can do that by making the weight vector WA be the center of cluster A, and the weight vector WB be the center of cluster B. Since this is a feedforward network, here is input component 1 and input component 2; u1 and u2 together comprise the input vector u. The output of each of the two neurons is just the dot product between its weight vector and the input vector. So the question I would like to ask you is: if I give you a particular input such as this one here, which neuron do you think will fire the most? In other words, which neuron will have a higher output firing rate, neuron A or neuron B, for this particular input? Notice that this particular input is closer to neuron B: the distance from the input to WB, the center of cluster B, appears to be shorter than the distance from the input to WA, the center of cluster A. So, which neuron do you think will have a higher activity, neuron A or neuron B?

If you answered neuron B, you would be correct. The most active neuron in the network is going to be the one whose weight vector is closest to the input. In this case, for this particular input u, the closest weight vector is WB, and therefore the most active neuron is also going to be neuron B. We can show this by looking at the squared Euclidean distance between the input and each of the two weight vectors, WA or WB. If you expand and simplify all the terms, the squared Euclidean distance turns out to equal the squared length of the input vector, plus the squared length of the weight vector, minus 2 times the output activity of the neuron: |u − w|² = |u|² + |w|² − 2(w · u). So if we assume that the input vector has been normalized to have length 1, and similarly that the weight vector has been normalized to have, let's say, a length of 1, then minimizing the squared distance between the input and the weight vector turns out to be the same as maximizing the activity of that particular neuron.
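This identity between squared distance and output activity is easy to check numerically; here is a minimal sketch in NumPy (the vectors are arbitrary illustrative values, not data from the lecture):

```python
import numpy as np

# Verify: |u - w|^2 = |u|^2 + |w|^2 - 2 (w . u),
# where the cross term w . u is the neuron's output activity.
rng = np.random.default_rng(0)
u = rng.normal(size=2)
w = rng.normal(size=2)

lhs = np.sum((u - w) ** 2)
rhs = np.sum(u ** 2) + np.sum(w ** 2) - 2 * np.dot(w, u)
assert np.isclose(lhs, rhs)

# With unit-length u and w, the first two terms are the constant 2,
# so minimizing the distance is the same as maximizing the output w . u.
u, w = u / np.linalg.norm(u), w / np.linalg.norm(w)
assert np.isclose(np.sum((u - w) ** 2), 2 - 2 * np.dot(w, u))
```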

Now suppose I give you a new input, uT, and here's that new input. How will you update the weights of the two neurons, given this new input uT? Well, let's think about that for a little bit. The first thing we need to do is figure out which cluster this new input belongs to: is it cluster A or cluster B? We can do that by looking at the distance between the new data point and each of the cluster centers, that is, the distance between WA and the new data point, as well as between WB and that data point. In this case, it appears that cluster A is the one this new input might belong to, because the distance from the input to that cluster's center is the shortest. And now that we've figured out which cluster the new input might belong to, we can update the weight vector, which is the center of the cluster, to include this new data point. So how would we do that? One way is to set the weight vector to be the running average of all the inputs in that particular cluster, including this new input. So do you remember the equation for computing the running average? Well, here it is.

We can derive the equation for the running average by starting with the expression for the average, which we all know: the sum of the data points in cluster A, divided by the number of such points, which we're calling T. If we express the sum as the sum of all the data points except data point T, and then simplify, what we find is an equation containing the weight vector as it was before we got the new data point, plus an additional term involving the new data point. And so we can now write this weight update rule: delta wA, the change in the weight vector of neuron A, equals some epsilon times the difference between the new input and that neuron's weight vector, Δw_A = ε (u_T − w_A). Now, we can set epsilon equal to 1/T, which makes this equation compute the running average. Or we can keep epsilon at some small constant positive value, which allows the method to adapt to new inputs for an indefinite period of time. The 1/T choice makes epsilon go to 0 for very large T; a constant positive epsilon keeps the algorithm adaptive to new inputs indefinitely.
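The running-average update described above can be sketched as follows (a minimal NumPy illustration with made-up cluster data; with ε = 1/T the incremental update reproduces the ordinary batch average exactly):

```python
import numpy as np

# Running-average weight update: delta_w = eps * (u_t - w).
# With eps = 1/t this reproduces the batch mean of all inputs seen
# so far; a small constant eps instead keeps the estimate adaptive.
rng = np.random.default_rng(1)
inputs = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(200, 2))

w = inputs[0].copy()                  # initialize with the first input
for t, u in enumerate(inputs[1:], start=2):
    w += (1.0 / t) * (u - w)          # eps = 1/t

# w now equals the ordinary average of all the inputs
assert np.allclose(w, inputs.mean(axis=0))
```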

Â 6:11

Okay, we're now ready to make our acquaintance with the competitive learning algorithm, which is a famous algorithm in neural network theory. In competitive learning, given an input, we pick the most active neuron in the network. We can do this, for example, using the winner-take-all network that we discussed in our previous lecture on recurrent networks. And as we discussed earlier, the most active neuron also corresponds to the one whose weights are closest to the new input. Once we have a winning neuron, we then update the weights for that neuron, using the weight update rule from the previous slide.

Now, what is the effect of this rule on the weights of the winning neuron? Well, here is what that looks like. Initially the weight vector of the winning neuron, in this case neuron A, was here at the center of the cluster. The effect of this weight update rule is to move the weight vector slightly in the direction of the new input. In doing so, we are updating the weights to equal the average of all the data points in the cluster, including the new data point.

Â 7:33

Okay, let's look at an example of competitive learning. Suppose you have these green points as the input, and these are 2-dimensional points, so they constitute the u1 and u2 components of the input vector. And suppose you have three output neurons, with weights W1, W2, and W3, and let's randomly assign values to these weights: W1 turns out to be here, W2 here, and W3 down here. Let's look at what happens to these weights as we give the network the green inputs one by one. Suppose I give the network this input first. How do you think the weights will change? Well, according to the competitive learning algorithm, the closest weight to this particular input is W1, and therefore W1 is moved closer to that input. Now, if the second input is, let's say, this one, then the winning neuron is going to be neuron number 2, so W2 is going to be adapted and will move closer to that input. And finally, if the third input is this one, then as you would expect the winning neuron is going to be neuron number 3, and so its weight is going to be adapted to be closer to that input. Now, if we keep repeating this procedure for more than just these three inputs, then here's what the weights might look like after you've updated them using the competitive learning rule.
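The full procedure just walked through can be sketched in NumPy as follows. The cluster centers, the learning rate, and the initial weights are all illustrative assumptions, not values from the lecture:

```python
import numpy as np

# Competitive learning sketch on made-up 2-D data: for each input,
# the "winner" is the neuron whose weight vector is closest, and only
# that weight is moved a little toward the input.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])  # assumed clusters
data = np.concatenate([rng.normal(c, 0.3, size=(100, 2)) for c in centers])
rng.shuffle(data)

# Initial weights W1, W2, W3 (spread out here for illustration)
W = np.array([[1.0, 0.0], [2.0, 2.0], [0.0, 3.0]])
eps = 0.05                                  # small constant learning rate
for _ in range(20):                         # several passes over the data
    for u in data:
        winner = np.argmin(np.sum((W - u) ** 2, axis=1))
        W[winner] += eps * (u - W[winner])  # move only the winner

# Each weight vector ends up near the center of one cluster
for c in centers:
    assert np.min(np.linalg.norm(W - c, axis=1)) < 0.5
```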

Â 9:05

As you can see, the three neurons have partitioned the data set into three different clusters, given by the red points, the blue points, and the green points. And the weight vectors of the three neurons, W1, W2, and W3, now represent the centers of these three clusters.

Competitive learning is also closely related to self-organizing maps, also known as Kohonen maps after the Finnish professor who first proposed them. In self-organizing maps, just as in competitive learning, we pick the winning neuron for any particular input, and as in competitive learning, we update the weights of the winning neuron. But unlike competitive learning, we also update the weights of other neurons that are in the neighborhood of the winning neuron. So what do we mean by the neighborhood of the winning neuron? Well, in self-organizing maps, we assign locations to the neurons in the network, so each neuron corresponds to a location on a grid, such as this 2-dimensional grid. And so when we have some inputs, such as the inputs shown here in this cloud of data, we not only move the weights of the winning neuron toward that particular data point, we also update the weights of its neighbors. So for this neuron, the one here and the one here, we would update their weights as well, to be closer to that data point, resulting in a transformation of the 2-dimensional grid to look something like this. This morphing of the grid is the result of updating the weights of the winning neuron and its neighbors. And if we keep doing that for all the different points in this data cloud, which could be a very high-dimensional cloud of data, then we end up with a representation that looks something like this. Here you can see that we have taken a potentially very high-dimensional data set and mapped it onto a 2-dimensional representation that preserves the topological properties of the input space.

Interestingly, in the brain, one also finds 2-dimensional maps that preserve the topological properties of the input space. In particular, if you look at the primary visual cortex, or V1, you find what are called orientation preference maps. Here's an example of an orientation preference map, where each color represents one particular orientation that the neurons in that region of cortex prefer. So if you're recording from one particular location, let's say in blue here, and you move along in a particular direction, you find that the neighboring neurons prefer similar orientations. And that's very similar to the self-organizing map, where neighboring neurons also come to prefer similar inputs.
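The self-organizing map update described above can be sketched as follows, assuming a small 5×5 grid, a Gaussian neighborhood function on grid distance, and made-up 2-D inputs:

```python
import numpy as np

# SOM sketch: neurons sit on a 2-D grid; for each input we move the
# winner AND its grid neighbors toward the input, weighted by a
# Gaussian neighborhood on grid (not input-space) distance.
rng = np.random.default_rng(3)
grid = np.array([(i, j) for i in range(5) for j in range(5)])  # 5x5 grid
W = rng.normal(size=(25, 2))            # one 2-D weight per grid node
data = rng.uniform(-1, 1, size=(500, 2))

eps, sigma = 0.1, 1.0                   # assumed rate and neighborhood width
for u in data:
    winner = np.argmin(np.sum((W - u) ** 2, axis=1))
    d2 = np.sum((grid - grid[winner]) ** 2, axis=1)   # grid distances
    h = np.exp(-d2 / (2 * sigma ** 2))                # neighborhood weights
    W += eps * h[:, None] * (u - W)                   # move winner + neighbors
```

After training, neurons that are neighbors on the grid end up with similar weight vectors, which is the topology-preserving property discussed above.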

Â 13:10

Well, the answer, as you might have guessed, is a resounding yes. There's a whole field, called unsupervised learning, that addresses this question. In unsupervised learning, you assume that your inputs, these data points u, are being generated by a set of hidden causes, which we call v. The relationship between the causes and the data points, or inputs, that you're observing is given by a generative model. Now, the causes are selected according to some prior distribution, the prior probability distribution of the causes, and there's a set of parameters G that we would like to learn. Once a particular cause has been selected according to this distribution, the generative model assumes that the data point u is generated according to some likelihood function, given by p(u | v; G), where again G is a set of parameters. So what is the unsupervised learning problem then? Well, it comes down to learning the parameters G that characterize both the likelihood function and the prior probability distribution of the causes.

Let's look at what this means for our example of the two clusters. We could use a generative model in this case which assumes there are two Gaussians generating these data points, each with a mean and a standard deviation. The way the points would be generated, according to this generative model, is that you first select one of the two Gaussians according to some prior probability distribution, given by p(v), and then, given that particular Gaussian, you generate one of the points within that cluster.

Â We can now write the property distribution of all of our inputs, as a

Â mixture of Gaussians. So here's the expression for the mixture

Â of Gaussians. And you can see that it contains the term

Â corresponding to the Gaussian for each of the Gaussians in our model, as well as

Â the prior Property for that particular Gaussian.

Â And therefore the parameters for this particular generative model, includes the

Â mean and standard deviation for each of the Gaussians, as well as the prior

Â property which we are calling Gamma, for each Gaussian.
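This two-step generative process, first draw the hidden cause v from the prior, then draw u from the corresponding Gaussian, can be sketched as follows (all parameter values here are illustrative assumptions):

```python
import numpy as np

# Mixture-of-Gaussians generative model sketch:
# pick a Gaussian with prior probability gamma, then sample from it.
rng = np.random.default_rng(4)
gamma = np.array([0.5, 0.5])                  # prior P(v) for each cause
means = np.array([[0.0, 0.0], [4.0, 4.0]])    # assumed cluster means
sigma = np.array([0.5, 0.5])                  # assumed cluster std devs

def sample(n):
    v = rng.choice(2, size=n, p=gamma)        # choose the hidden cause v
    u = rng.normal(means[v], sigma[v, None])  # then u ~ p(u | v; G)
    return v, u

v, u = sample(1000)
```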

So, to summarize, the goal of unsupervised learning is to learn a good generative model for the data that you're seeing. In other words, we would like to mimic the data generation process. And we find that, in general, if you want to do that, you need to solve two sub-problems. So what are these two sub-problems? The first corresponds to the problem of recognition. What do we mean by recognition? We mean that we would like to estimate the causes v given any particular input; that is, for any particular input u, we would like to estimate the posterior probability of the causes given that input. And once we have done that, it turns out we can typically use that information to learn the parameters G.

So let's look at an example of how this approach can be applied to a specific unsupervised learning problem: the problem of clustering. As you recall, we were modeling this particular data set of two clusters as a mixture of Gaussians, and therefore the parameters we would like to learn are the mean and standard deviation of each Gaussian, as well as the prior probability of each Gaussian. So how does the approach from the previous slide apply to this problem? Well, given a particular data point such as this one, the first sub-problem involves finding the posterior probability of the causes, given this data point and the parameters G. The second sub-problem then is, given the posterior probability distribution over causes for that data point, to update the parameters G. This is actually very similar to competitive learning. In competitive learning, as you recall, you also had two steps. In the first step, given a data point such as this one here, we would assign the data point to a specific cluster; in this case, we would assign it to cluster A. But now, instead, we compute the posterior probability of the data point belonging to either cluster A or cluster B. Presumably, in this case, we would have a higher posterior probability that the data point belongs to cluster A: p(v | u, G) would be higher for v = A than for v = B. And whereas the second step in competitive learning changed the weights only for cluster A, we are now going to use the posterior probabilities to change the parameters of both the Gaussian for cluster A and the Gaussian for cluster B.

So let's see how we can do that. The algorithm for learning the parameters is called the Expectation Maximization algorithm, or EM algorithm for short. The EM algorithm is actually a very general algorithm for unsupervised learning, and it has its roots in statistics. It involves alternating between two steps: one is called the E step, or expectation step, and the other is called the M step, or maximization step. So let's look at what these two steps mean in the case of our clustering example.

Â 18:45

So here's the E step. And as you might expect, if you pardon the pun, the E step involves computing the posterior probability distribution of the causes, which in this case is whether the cause is Gaussian A or Gaussian B, and we can do that using Bayes' rule. So you compute the posterior probability of the cause given the input using Bayes' rule, substituting the normal, or Gaussian, distribution for each of the likelihood functions for our inputs, and you end up with an expression that looks like this. You can see that this expression implements a form of soft competition: it's not a winner-take-all type of competition but a more democratic version, in which we assign a posterior probability to each cause, as opposed to declaring just one of the clusters, or causes, to be the winner.
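A minimal sketch of this E step for a two-Gaussian mixture, using 1-D inputs and made-up parameter values for simplicity:

```python
import numpy as np

# E-step sketch: the posterior probability ("responsibility") of each
# Gaussian for each data point, computed via Bayes' rule.
def e_step(u, gamma, mu, sigma):
    # likelihoods p(u | v) under each Gaussian (columns = causes)
    lik = np.exp(-(u[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
          / (np.sqrt(2 * np.pi) * sigma)
    post = gamma * lik                         # numerator of Bayes' rule
    return post / post.sum(axis=1, keepdims=True)

u = np.array([-0.1, 0.2, 3.9, 4.1])            # illustrative inputs
r = e_step(u, gamma=np.array([0.5, 0.5]),
           mu=np.array([0.0, 4.0]), sigma=np.array([0.5, 0.5]))
# r gives soft assignments: each row sums to 1, and each point leans
# toward the nearer Gaussian without a hard winner-take-all choice.
```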

And here's the M step. The M step involves changing the parameters G using the results from the E step. Here are the equations for changing the mean, the variance, and the prior probability of each of our Gaussians, and you can see how the equations use the posterior probability for each Gaussian computed in the E step.
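A matching sketch of the M step, again with 1-D inputs. The responsibilities r would normally come from the E step; here hard 0/1 assignments are used just to make the weighted averages easy to check by hand:

```python
import numpy as np

# M-step sketch: re-estimate the mean, variance, and prior of each
# Gaussian as responsibility-weighted averages over all data points.
def m_step(u, r):
    n_k = r.sum(axis=0)                          # effective counts per cause
    mu = (r * u[:, None]).sum(axis=0) / n_k      # weighted means
    var = (r * (u[:, None] - mu) ** 2).sum(axis=0) / n_k
    gamma = n_k / len(u)                         # updated priors
    return mu, np.sqrt(var), gamma

# With hard 0/1 responsibilities this reduces to per-cluster averages.
u = np.array([0.0, 0.2, 4.0, 4.2])
r = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
mu, sigma, gamma = m_step(u, r)
```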

It's interesting to compare this algorithm to competitive learning. In competitive learning, we only learned the mean of each cluster; we did not have an estimate of the variance of each cluster, or of the prior probability of each cluster. The other difference is that the EM algorithm assumes you have all the data points at once, so you can use all the data to compute estimates such as these, whereas in competitive learning we got the inputs one by one. So we can call the EM algorithm a batch learning algorithm, whereas the competitive learning algorithm was an online learning algorithm.

Here are the results of applying the EM algorithm to our data set of two clusters. The dashed line here represents the trajectory of the mean, and the circles represent two standard deviations from the mean, for each of Gaussians A and B. As you see, with each iteration the estimate of the mean gets better and better, as does the estimate of the standard deviation, until by iteration number 50 the mean of each distribution is modeled quite accurately, and the standard deviation of each cluster approximates quite well the standard deviation of the Gaussians that generated the data.
