0:00

In this video I'm going to talk about some advanced material. It's not really appropriate for a first course on neural networks, but I know that some of you are particularly interested in deep learning, and the content of this video is mathematically very pretty, so I couldn't resist putting it in. The insight that stacking up restricted Boltzmann machines gives you something like a sigmoid belief net can actually be seen without doing any math, just by noticing that a restricted Boltzmann machine is actually the same thing as an infinitely deep sigmoid belief net with shared weights. Once again, weight sharing leads to something very interesting.

I'm now going to describe a very interesting explanation of why layer-by-layer learning works.

It depends on the fact that there is an equivalence between restricted Boltzmann machines, which are undirected networks with symmetric connections, and infinitely deep directed networks in which every layer uses the same weight matrix. This equivalence also gives insight into why contrastive divergence learning works. So an RBM is really just an infinitely deep sigmoid belief net with a lot of shared weights.

The Markov chain that we run when we want to sample from an RBM can be viewed as exactly the same thing as a sigmoid belief net. So here's the picture: we have a very deep sigmoid belief net, in fact infinitely deep, and we use the same weights at every layer. We have to have all the V layers be the same size as each other, and all the H layers be the same size as each other.

But V and H can be different sizes. The distribution generated by this very deep network with replicated weights is exactly the equilibrium distribution that you get by alternating between sampling from P of V given H and P of H given V, where both P of V given H and P of H given V are defined by the same weight matrix W. And that's exactly what you do when you take a restricted Boltzmann machine and run a Markov chain to get a sample from the equilibrium distribution.

So a top-down pass starting from infinitely high up in this directed net is exactly equivalent to letting a restricted Boltzmann machine settle to equilibrium: both define the same distribution. The sample you get at V0 if you run this infinite directed net would be an equilibrium sample of the equivalent RBM.
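To make this equivalence concrete, here is a minimal numpy sketch of the alternating Markov chain used to sample from an RBM. This is an illustrative toy, not the lecture's code: it assumes binary units, omits bias terms, and takes W to have shape (number of visible units, number of hidden units), so that multiplying by W here plays the role of multiplying by W transpose in the lecture's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    """Sample binary states with the given on-probabilities."""
    return (rng.random(p.shape) < p).astype(float)

def gibbs_sample(W, n_visible, n_steps=100):
    """Approach the RBM's equilibrium distribution by alternating
    between sampling p(h|v) and p(v|h); both directions are defined
    by the same weight matrix W."""
    v = sample_bernoulli(np.full(n_visible, 0.5))  # arbitrary start
    for _ in range(n_steps):
        h = sample_bernoulli(sigmoid(v @ W))    # p(h|v)
        v = sample_bernoulli(sigmoid(h @ W.T))  # p(v|h)
    return v
```

Running this chain for many steps corresponds to the infinitely deep directed net: each pair of half-steps is one more pair of layers.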

Now let's look at inference in an infinitely deep sigmoid belief net. In inference we start at V0 and then we have to infer the state of H0. Normally this would be a difficult thing to do because of explaining away. If, for example, hidden units k and j both had big positive weights to visible unit i, then we would expect that when we observe that i is on, k and j become anti-correlated in the posterior distribution. That's explaining away.

However, in this net, k and j are completely independent of one another when we do inference given V0. So the inference is trivial: we just multiply V0 by the transpose of W, put whatever we get through the logistic sigmoid, and then sample. And that gives us binary states for the units in H0. But the question is, how could they possibly be independent, given explaining away?

The answer to that question is that the model above H0 implements what I call a complementary prior. It implements a prior distribution over H0 that exactly cancels out the correlations from explaining away. So for the example shown, the prior will implement positive correlations between k and j. Explaining away will cause negative correlations, and those will exactly cancel.

So what's really going on is that when we multiply V0 by the transpose of the weights, we're not just computing the likelihood term. We're computing the product of a likelihood term and a prior term, and that's what you need to do to get the posterior. It normally comes as a big surprise to people that when you multiply by W transpose, it's the product of the prior and the likelihood that you're computing. So what's happening in this net is that the complementary prior implemented by all the stuff above H0 exactly cancels out explaining away, which makes inference very simple.

And that's true at every layer of this net, so we can do inference for every layer and get an unbiased sample at each layer, simply by multiplying V0 by W transpose. Then once we've computed the binary state of H0, we multiply that by W, put the result through the logistic sigmoid, and sample, which gives us a binary state for V1, and so on all the way up. So generating from this model is equivalent to running the alternating Markov chain of a restricted Boltzmann machine to equilibrium, and performing inference in this model is exactly the same process in the opposite direction.
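The up-pass just described can be sketched in the same style. Again this is a hypothetical illustration with binary units and no biases, and W of shape (visible, hidden), so that v @ W corresponds to multiplying by W transpose in the lecture's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def infer_up(v0, W, n_pairs=2):
    """Trivial inference in the infinite directed net: multiply by the
    transposed generative weights, squash, sample, and repeat upwards
    (v0 -> h0 -> v1 -> h1 -> ...)."""
    vs, hs = [v0], []
    v = v0
    for _ in range(n_pairs):
        h = sample_bernoulli(sigmoid(v @ W))    # unbiased sample of h_k
        hs.append(h)
        v = sample_bernoulli(sigmoid(h @ W.T))  # unbiased sample of v_{k+1}
        vs.append(v)
    return vs, hs
```

Note that the loop body is the same multiply-squash-sample step as in Gibbs sampling; inference really is generation run in the opposite direction.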

This is a very special kind of sigmoid belief net in which inference is as easy as generation. So here I've shown the generative weights that define the model, and also their transposes, which are the weights we use for inference. And now what I want to show is how we get the Boltzmann machine learning algorithm out of the learning algorithm for directed sigmoid belief nets. So the learning rule for a sigmoid belief net says that we should first get a sample from the posterior; that's what Sj and Si are, samples from the posterior distribution. Then we should change the generative weight in proportion to the product of the presynaptic activity Sj and the difference between the postsynaptic activity Si and the probability Pi of turning on i given all the binary states of the layer that Sj is in. Now if we ask how we compute Pi, something very interesting happens. If you look at inference in this network on the right, we first infer a binary state for H0.
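For reference, the learning rule just stated can be written out. This is the standard maximum likelihood rule for a generative weight in a sigmoid belief net, with epsilon a learning rate:

```latex
\Delta w_{ij} \;=\; \varepsilon \, s_j \bigl(s_i - p_i\bigr),
\qquad
p_i \;=\; \sigma\!\Bigl(\sum\nolimits_j w_{ij}\, s_j\Bigr)
```

where s_j and s_i are the sampled binary states and p_i is the probability of unit i turning on given the sampled binary states of the layer above it.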

Once we've chosen that binary state, we then infer a binary state for V1 by multiplying H0 by W, putting the result through the logistic, and then sampling. So think about how Si1 was generated: it was a sample from what we get if we put H0 through the weight matrix W and then through the logistic. And that's exactly what we'd have to do in order to compute Pi0. We'd have to take the binary activities in H0 and, going downwards now through the green weights W, compute the probability of turning on unit i given the binary states of its parents.

So the point is, the process that goes from H0 to V1 is identical to the process that goes from H0 to V0, and so Si1 is an unbiased sample of Pi0.

That means we can replace it in the learning rule. So we end up with a learning rule that looks like this: because we have replicated weights, each of those lines is the term in the learning rule that comes from one of those green weight matrices. For the first green weight matrix here, the learning rule is the presynaptic state Sj0 times the difference between the postsynaptic state Si0 and the probability that the binary states in H0 would turn on Si, which we could call Pi0; but a sample with that probability is Si1, so an unbiased estimate of the derivative can be got by plugging Si1 into that first line of the learning rule. Similarly, for the second weight matrix, the learning rule is Si1 times (Sj0 minus Pj0), and an unbiased estimate of Pj0 is Sj1. So that gives an unbiased estimate of the learning rule for the second weight matrix.

And if you just keep going for all the weight matrices, you get an infinite series in which all the terms except the very first term and the very last term cancel out. So you end up with the Boltzmann machine learning rule, which is just Sj0 times Si0 minus Sj-infinity times Si-infinity. So let's go back and look at how we would learn an infinitely deep sigmoid belief net.
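Written out with superscripts for the layer index, the series telescopes: every intermediate product cancels against the next term, leaving only the first and last,

```latex
\frac{\partial \log p(v^0)}{\partial w_{ij}} \;\propto\;
s_j^0\,(s_i^0 - s_i^1) \;+\; s_i^1\,(s_j^0 - s_j^1) \;+\; s_j^1\,(s_i^1 - s_i^2) \;+\; \cdots
\;=\; s_j^0 s_i^0 \;-\; s_j^\infty s_i^\infty
```

which is exactly the Boltzmann machine learning rule.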

We would start by making all the weight matrices the same: we tie all the weight matrices together and learn using those tied weights. Now, that's exactly equivalent to learning a restricted Boltzmann machine. The diagram on the right and the diagram on the left are identical; we can think of the symmetric arrow in the diagram on the left as just a convenient shorthand for an infinite directed net with tied weights. So we first learn that restricted Boltzmann machine. Now, we ought to learn it using maximum likelihood learning, but actually we're just going to use contrastive divergence learning. We're going to take a shortcut.

Once we've learned the first restricted Boltzmann machine, what we can do is freeze the bottom-level weights. We freeze the generative weights that define the model, and we also freeze the weights we're going to use for inference to be the transpose of those generative weights. So we freeze those weights. We keep all the other weights tied together, but now we allow them to be different from the weights in the bottom layer, while they're still all tied to each other. Learning the remaining tied weights is exactly equivalent to learning another restricted Boltzmann machine, namely a restricted Boltzmann machine with H0 as its visible units and V1 as its hidden units, and where the data is the aggregated posterior across H0.

That is, if we want to sample a data vector to train this network, what we do is put in a real data vector V0, do inference through those frozen weights, get a binary vector at H0, and treat that as data for training the next restricted Boltzmann machine. And we can go up for as many layers as we like.
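The greedy procedure just described can be sketched in numpy. This is an illustrative toy under stated assumptions (binary units, no biases, plain CD1 using probabilities in the pairwise statistics), not the lecture's actual implementation; all names here are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def train_rbm_cd1(data, n_hidden, n_epochs=5, lr=0.1):
    """Train one RBM with one-step contrastive divergence."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    for _ in range(n_epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W)            # p(h|v0)
            h0 = sample_bernoulli(ph0)
            v1 = sigmoid(h0 @ W.T)           # one-step reconstruction
            ph1 = sigmoid(v1 @ W)
            # CD1 update: <v0 h0> minus <v1 h1>
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    return W

def train_stack(data, layer_sizes):
    """Greedy layer-by-layer training: train an RBM, freeze it, push the
    data through it, and train the next RBM on the result (the
    aggregated posterior of the layer below)."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm_cd1(x, n_hidden)
        weights.append(W)                      # freeze this layer
        x = sample_bernoulli(sigmoid(x @ W))   # data for the next RBM
    return weights
```

Each frozen weight matrix corresponds to untying one more layer of the infinite directed net from the tied weights above it.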

And when we get fed up, we just end up with a restricted Boltzmann machine at the top, which is equivalent to saying that all the weights in the infinite directed net above there are still tied together, while the weights below have now all become different. Now, the explanation of why the inference procedure was correct involved the idea of a complementary prior created by the weights in the layers above. But of course, when we change the weights in the layers above but leave the bottom layer of weights fixed, the prior created by those changed weights is no longer exactly complementary.

So now our inference procedure, using the frozen weights in the bottom layer, is no longer exactly correct. The good news is that it's nearly always very close to correct, and even with the incorrect inference procedure we still get a variational bound on the log probability of the data. The higher layers have changed because they've learned a prior for the bottom hidden layer that's closer to the aggregated posterior distribution.

And that makes the model better. So changing the higher weights makes the inference that we're doing at the bottom hidden layer incorrect, but gives us a better model. And if you look at those two effects, you can prove that the improvement you get in the variational bound from having a better model is always greater than the loss you get from the inference being slightly incorrect. So in this variational bound, you win when you learn the weights in the higher layers, assuming that you do it with correct maximum likelihood learning. So now let's go back to what's happening in contrastive divergence learning. We have the infinite net on the right and the restricted Boltzmann machine on the left.

And they're equivalent. If we were to do maximum likelihood learning for the restricted Boltzmann machine, it would be maximum likelihood learning for the infinite sigmoid belief net. But what we're going to do instead is cut things off: we're going to ignore the small derivatives for the weights in the higher layers of the infinite sigmoid belief net. So we cut it off where that dotted red line is.

And now if we look at the derivatives, the derivatives we're going to get look like this. They've got two terms. The first term comes from that bottom layer of weights; we've seen it before, the rule for the bottom layer of weights is just that first line here. The second term comes from the next layer of weights; that's this line here.

We need to compute the activities in H1 in order to compute the Sj1 in that second line, but we're not actually computing derivatives for the third layer of weights. And when we take those first two terms and combine them, we get exactly the learning rule for one-step contrastive divergence. So what's going on in contrastive divergence is that we're combining the weight derivatives for the lower layers and ignoring the weight derivatives in the higher layers.
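Keeping just the first two lines of the infinite series and combining them, the intermediate product cancels:

```latex
s_j^0\,(s_i^0 - s_i^1) \;+\; s_i^1\,(s_j^0 - s_j^1)
\;=\; s_j^0 s_i^0 \;-\; s_j^1 s_i^1
```

which is the one-step contrastive divergence update.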

The question is, why can we get away with ignoring those higher derivatives? When the weights are small, the Markov chain mixes very fast. If the weights are zero, it mixes in one step. And if the Markov chain mixes fast, the higher layers will be close to the equilibrium distribution, i.e. they will have forgotten what the input was at the bottom layer. And now we have a nice property.

If the higher layers are sampled from the equilibrium distribution, we know that the derivatives of the log probability of the data with respect to the weights must average out to zero. And that's because the current weights in the model are a perfect model of the equilibrium distribution: the equilibrium distribution is generated using those weights, and if you want to generate samples from the equilibrium distribution, those are the best possible weights you could have. So we know the derivatives there are zero.

As the weights get larger, we might have to run more iterations of contrastive divergence, which corresponds to taking into account more layers of that infinite sigmoid belief net. That will allow contrastive divergence to continue to be a good approximation to maximum likelihood, so if we're trying to learn a density model, that makes a lot of sense: as the weights grow, you run CD for more and more steps. If there's a statistician around, you can give them a guarantee that in the infinite limit you'll run CD for infinitely many steps, and then you have an asymptotic convergence result, which is the thing that keeps statisticians happy.

Of course, it's completely irrelevant, because you'll never reach a point like that. There is, however, an interesting point here. If our purpose in using CD is to build a stack of restricted Boltzmann machines that learn multiple layers of features, it turns out that we don't need a good approximation to maximum likelihood. For learning multiple layers of features, CD1 is just fine. In fact, it's probably better than doing maximum likelihood.
