0:00

In this video, I'll talk about a different way of learning sigmoid belief nets. This different method arose in an unexpected way: I stopped working on sigmoid belief nets and went back to Boltzmann machines, and discovered that restricted Boltzmann machines could actually be learned fairly efficiently.

Given that a restricted Boltzmann machine can efficiently learn a layer of nonlinear features, it was tempting to take those features, treat them as data, and apply another restricted Boltzmann machine to model the correlations between those features. One can continue like this, stacking one Boltzmann machine on top of the next to learn many layers of nonlinear features. This eventually led to a big resurgence of interest in deep neural nets.

The issue then arose: once you've stacked up lots of restricted Boltzmann machines, each of which is learned by modeling the patterns of feature activities produced by the previous Boltzmann machine, do you just have a set of separate restricted Boltzmann machines, or can they all be combined into one model? Now, anybody sensible would expect that if you combined a set of restricted Boltzmann machines to make one model, what you'd get would be a multilayer Boltzmann machine.

However, a brilliant graduate student of mine called Yee-Whye Teh figured out that that's not what you get. You actually get something that looks much more like a sigmoid belief net.

This was a big surprise: we'd actually solved the problem of how to learn deep sigmoid belief nets by giving up on it and focusing on learning undirected models like Boltzmann machines.

Using the efficient learning algorithm for restricted Boltzmann machines, it's easy to train a layer of features that receive input directly from the pixels.

We can treat the patterns of activation of those feature detectors as if they were pixels, and learn another layer of features in a second hidden layer. We can repeat this as many times as we like, with each new layer of features modeling the correlated activity of the features in the layer below. It can be proved that each time we add another layer of features, we improve a variational lower bound on the log probability that some combined model would generate the data. The proof is actually complicated, and it only applies if you do everything just right, which you don't do in practice. But the proof is very reassuring, because it suggests that something sensible is going on when you stack up restricted Boltzmann machines like this.
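The greedy layer-by-layer procedure can be sketched in code. The following is only an illustrative NumPy sketch, not the exact procedure from the lecture: it assumes binary units, CD-1 updates, and toy random data, and all names (`RBM`, `cd1_update`, and so on) are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_vis, n_hid):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b_vis = np.zeros(n_vis)   # visible biases
        self.b_hid = np.zeros(n_hid)   # hidden biases

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hid)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_vis)

    def cd1_update(self, v0, lr=0.1):
        # Positive phase: sample hidden states given the data.
        h0_p = self.hidden_probs(v0)
        h0 = (rng.random(h0_p.shape) < h0_p).astype(float)
        # Negative phase: one step of reconstruction (CD-1).
        v1_p = self.visible_probs(h0)
        h1_p = self.hidden_probs(v1_p)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0_p - v1_p.T @ h1_p) / n
        self.b_vis += lr * (v0 - v1_p).mean(axis=0)
        self.b_hid += lr * (h0_p - h1_p).mean(axis=0)

# Toy binary data standing in for pixels.
data = (rng.random((100, 20)) < 0.3).astype(float)

rbm1 = RBM(20, 10)
for _ in range(50):
    rbm1.cd1_update(data)

# Treat the first RBM's hidden activities as data for a second RBM.
h1_data = (rng.random((100, 10)) < rbm1.hidden_probs(data)).astype(float)
rbm2 = RBM(10, 10)
for _ in range(50):
    rbm2.cd1_update(h1_data)
```

Stacking a third machine would repeat the last four lines with `rbm2`'s hidden activities as the new data.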

The proof is based on a neat equivalence between a restricted Boltzmann machine and an infinitely deep belief net.

So here's a picture of what happens when you learn two restricted Boltzmann machines, one on top of the other, and then combine them to make one overall model, which I call a deep belief net. First we learn one Boltzmann machine with its own weights. Once that's been trained, we take the hidden activity patterns that Boltzmann machine has when it's looking at data, and we treat each hidden activity pattern as data for training a second Boltzmann machine. So we just copy the binary states to the second Boltzmann machine, and then we learn another Boltzmann machine.

Now, one interesting thing about this is that if we start the second Boltzmann machine off with W2 being the transpose of W1, and with as many hidden units in h2 as there are in v, then the second Boltzmann machine will already be a pretty good model of h1, because it's just the first model upside down. A restricted Boltzmann machine doesn't really care which layer you call visible and which you call hidden: it's just a bipartite graph in which we learn to model the correlations between the two layers.

After we've learned those two Boltzmann machines, we compose them together to form a single model, and the single model looks like this. Its top two layers are just the same as the top restricted Boltzmann machine, so that's an undirected model with symmetric connections; but its bottom two layers are a directed model, like a sigmoid belief net. So what we've done is to take the symmetric connections between v and h1, throw away the up-going part, and keep just the down-going part. Understanding why we do that is quite complicated, and it will be explained in video 13F. The resulting combined model is clearly not a Boltzmann machine, because its bottom layer of connections is not symmetric. It's a graphical model that we call a deep belief net, where the lower layers are just like a sigmoid belief net and the top two layers form a restricted Boltzmann machine.

So it's a kind of hybrid model. If we do it with three Boltzmann machines stacked up, we get a hybrid model that looks like this: the top two layers again form a restricted Boltzmann machine, and the layers below are directed layers, like in a sigmoid belief net.

Â 5:29

To generate data from this model, the correct procedure is, first of all, to go backwards and forwards between h2 and h3 to reach equilibrium in the top-level restricted Boltzmann machine. This involves alternating Gibbs sampling: you update all of the units in h3 in parallel, then update all of the units in h2 in parallel, then go back and update all of the units in h3 in parallel, and so on. You go backwards and forwards like that for a long time, until you have an equilibrium sample from the top-level restricted Boltzmann machine. So the top-level restricted Boltzmann machine is defining the prior distribution over h2. Once you've done that, you simply go once from h2 to h1, using the generative connections w2.

Then, whatever binary pattern you get in h1, you go once more to get generated data, using the weights w1. So we're performing a top-down pass from h2 to get the states of all the other layers, just as in a sigmoid belief net. The bottom-up connections shown in red at the lower levels are not part of the generative model. They're actually going to be the transposes of the corresponding generative weights.
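Here is a minimal sketch of that generative procedure, assuming binary units, no bias terms, and toy random weights; the function and variable names are my own, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Draw binary states from a vector of Bernoulli probabilities.
    return (rng.random(p.shape) < p).astype(float)

def generate_from_dbn(w1, w2, w3, n_gibbs=1000):
    """Generate one visible vector from a DBN whose top two layers
    (h2, h3) form an RBM with weights w3, and whose lower layers are
    directed, with generative weights w2 (h2 -> h1) and w1 (h1 -> v)."""
    n_h2, n_h3 = w3.shape
    h2 = sample(0.5 * np.ones(n_h2))
    # Alternating Gibbs sampling in the top-level RBM, to approach an
    # equilibrium sample of h2 under the RBM's prior.
    for _ in range(n_gibbs):
        h3 = sample(sigmoid(h2 @ w3))
        h2 = sample(sigmoid(h3 @ w3.T))
    # Single top-down directed pass, as in a sigmoid belief net.
    h1 = sample(sigmoid(h2 @ w2))
    v = sample(sigmoid(h1 @ w1))
    return v

# Toy random weights, just to exercise the procedure.
w1 = rng.standard_normal((15, 20))   # h1 -> v
w2 = rng.standard_normal((10, 15))   # h2 -> h1
w3 = rng.standard_normal((10, 12))   # h2 <-> h3 (top-level RBM)
v = generate_from_dbn(w1, w2, w3, n_gibbs=200)
```

The transposed weights (the red, bottom-up connections) would only appear in inference, going from v up to h2; they are never used in this generative pass.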

So they're the transpose of w1 and the transpose of w2, and they're going to be used for inference, but they're not part of the model.

Now, before I explain why stacking up Boltzmann machines is a good idea, I need to sort out what it means to average two factorial distributions. It may surprise you to know that if I average two factorial distributions, I do not get a factorial distribution.

What I mean by averaging here is taking a mixture of the two distributions: you first pick one of the two at random, and then you generate from whichever one you picked. So, you don't get a factorial distribution.

Suppose we have an RBM with 4 hidden units, and suppose we give it a visible vector v1.

Given this visible vector, the posterior distribution over those 4 hidden units is factorial. Let's suppose the distribution is that the first two units each have a probability of 0.9 of turning on, and the last two each have a probability of 0.1 of turning on. What it means for this to be factorial is that, for example, the probability that the first two units would both be on in a sample from this distribution is exactly 0.81. Now suppose we have a different visible vector v2, and the posterior distribution over the same 4 hidden units is now 0.1, 0.1, 0.9, 0.9, which I chose just to make the math easy.

If we average those two distributions, the mean probability of each hidden unit being on is indeed the average of the means for each distribution, so the means are 0.5, 0.5, 0.5, 0.5. But what you get is not the factorial distribution defined by those 4 probabilities. To see that, consider the binary vector 1, 1, 0, 0 over the hidden units. In the posterior for v1, that vector has a probability of 0.9 × 0.9 × (1 − 0.1) × (1 − 0.1) = 0.9^4, which is about 0.66. In the posterior for v2, this vector is extremely unlikely: it has a probability of 1 in 10,000. If we average those two probabilities for that particular vector, we get a probability of about 0.33, and that's much bigger than the probability assigned to the vector 1, 1, 0, 0 by a factorial distribution with means of 0.5. That probability would be 0.5^4 = 0.0625, which is much smaller. So the point of all this is that when you average two factorial posteriors, you get a mixture distribution that is not factorial.
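This example is small enough to check numerically; here is a quick sketch (the helper name `factorial_prob` is my own):

```python
# Posterior over 4 binary hidden units given v1 and given v2
# (the two factorial distributions from the example above).
p1 = [0.9, 0.9, 0.1, 0.1]
p2 = [0.1, 0.1, 0.9, 0.9]

def factorial_prob(means, bits):
    # Probability of a binary vector under a factorial distribution.
    out = 1.0
    for m, b in zip(means, bits):
        out *= m if b else (1.0 - m)
    return out

bits = (1, 1, 0, 0)
# Equal mixture of the two posteriors, evaluated at (1, 1, 0, 0).
mix = 0.5 * factorial_prob(p1, bits) + 0.5 * factorial_prob(p2, bits)
# Factorial distribution with the averaged means (0.5, 0.5, 0.5, 0.5).
means = [0.5 * (a + b) for a, b in zip(p1, p2)]
fact = factorial_prob(means, bits)
print(round(mix, 4), round(fact, 4))   # prints: 0.3281 0.0625
```

The mixture puts about 0.33 on this vector, while the factorial distribution with the same means puts only 0.0625 on it, so the mixture cannot be factorial.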

Now, let's look at why this greedy learning works. That is, why is it a good idea to learn one restricted Boltzmann machine, and then learn a second restricted Boltzmann machine that models the patterns of activity in the hidden units of the first one?

The weights of the bottom-level restricted Boltzmann machine actually define four different distributions, and of course they define them in a consistent way. The first distribution is the probability of the visible units given the hidden units, and the second is the probability of the hidden units given the visible units. Those are the two distributions we use for running our alternating Markov chain, which updates the visibles given the hiddens and then updates the hiddens given the visibles. If we run that chain long enough, we get a sample from the joint distribution of v and h, so the weights clearly also define the joint distribution.

They also define the joint distribution more directly, in terms of e to the minus the energy, but for nets with a large number of units we can't compute that. If you take the joint distribution p(v, h) and just sum out v, you get a distribution over h: that's the prior distribution over h defined by the restricted Boltzmann machine. Similarly, if we sum out h, we have the prior distribution over v defined by the restricted Boltzmann machine.

Now, we're going to pick a rather surprising pair out of those four distributions. We're going to define the probability that the restricted Boltzmann machine assigns to a visible vector v as the sum over all hidden vectors h of the probability it assigns to h times the probability of v given h: p(v) = sum over h of p(h) p(v|h). This seems like a silly thing to do, because defining p(h) is just as hard as defining p(v). Nevertheless, we're going to define p(v) that way. Now, if we leave p(v|h) alone but learn a better model of p(h), that is, if we learn some new parameters that give a better model of p(h) and substitute it for the old model of p(h), we'll actually improve our model of v. And what we mean by a better model of p(h) is a prior over h that fits the aggregated posterior better.
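For an RBM small enough to enumerate exactly, we can verify this decomposition numerically: the prior p(h) obtained by summing out v, combined with the RBM's factorial conditional p(v|h), reproduces p(v). This is an illustrative sketch with made-up toy weights, and all the names are my own.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny RBM (3 visible units, 2 hidden units), small enough to enumerate.
W = rng.standard_normal((3, 2))
b_v = rng.standard_normal(3)   # visible biases
b_h = rng.standard_normal(2)   # hidden biases

vs = [np.array(v, float) for v in product([0, 1], repeat=3)]
hs = [np.array(h, float) for h in product([0, 1], repeat=2)]

# Joint distribution p(v, h) proportional to exp(-E(v, h)).
def unnorm(v, h):
    return np.exp(v @ W @ h + b_v @ v + b_h @ h)

Z = sum(unnorm(v, h) for v in vs for h in hs)

def p_joint(v, h):
    return unnorm(v, h) / Z

def p_h(h):            # prior over h: sum out v
    return sum(p_joint(v, h) for v in vs)

def p_v(v):            # prior over v: sum out h
    return sum(p_joint(v, h) for h in hs)

def p_v_given_h(v, h):  # the factorial conditional of an RBM
    probs = sigmoid(b_v + W @ h)
    return np.prod(np.where(v == 1, probs, 1.0 - probs))

# Check p(v) = sum_h p(h) * p(v|h) for every visible vector.
for v in vs:
    recon = sum(p_h(h) * p_v_given_h(v, h) for h in hs)
    assert abs(recon - p_v(v)) < 1e-12
```

The point of the lecture is then that we keep p(v|h) fixed and swap in a better p(h), learned by the next RBM, in place of the prior this RBM defines.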

The aggregated posterior is the average, over all vectors in the training set, of the posterior distribution over h. So what we're going to do is use our first RBM to get this aggregated posterior, and then use our second RBM to build a better model of the aggregated posterior than the first RBM has. If we start the second RBM off as the first one upside down, it starts with the same model of the aggregated posterior as the first RBM, so if we then change its weights, we can only make things better. So that's an explanation of what's happening when we stack up RBMs.

Once we've learned a stack of Boltzmann machines and combined them to make a deep belief net, we can then fine-tune the whole composite model, using a variation of the wake-sleep algorithm. So we first learn many layers of features by stacking up RBMs, and then we want to fine-tune both the bottom-up recognition weights and the top-down generative weights to get a better generative model. We can do this using three different learning stages.

First, we do a stochastic bottom-up pass, and we adjust the top-down generative weights of the lower layers to be good at reconstructing the feature activities in the layer below. That's just as in the standard wake-sleep algorithm. Then, in the top-level RBM, we go backwards and forwards a few times, sampling the hiddens of that RBM, then its visibles, then its hiddens again, and so on; that's just like the learning algorithm for RBMs. Having done a few iterations of that, we do contrastive divergence learning: we update the weights of the RBM using the difference between the correlations when activity first arrived at that RBM and the correlations after a few iterations in that RBM.

Then, in the third stage, we take the states of the visible units of that top-level RBM, which are its lower-level units, and starting there, we do a top-down stochastic pass, using the directed lower connections, which form a sigmoid belief net. Having generated some data from that sigmoid belief net, we adjust the bottom-up recognition weights to be good at reconstructing the feature activities in the layer above. So that's just the sleep phase of the wake-sleep algorithm.
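The three stages might be sketched roughly as follows. This is only an illustrative outline under many simplifying assumptions (binary units, no biases, CD-1 in the top RBM, a single training vector), not the exact contrastive wake-sleep procedure; all names are my own.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# Recognition weights r1, r2 (bottom-up), generative weights g1, g2
# (top-down), and the top-level RBM w_top. Biases omitted for brevity.
n_v, n_h1, n_h2, n_h3 = 20, 15, 10, 12
r1 = 0.1 * rng.standard_normal((n_v, n_h1))
r2 = 0.1 * rng.standard_normal((n_h1, n_h2))
g1 = r1.T.copy()   # initialized as transposes, as in the stacked RBMs
g2 = r2.T.copy()
w_top = 0.1 * rng.standard_normal((n_h2, n_h3))

def fine_tune_step(v, lr=0.01):
    global g1, g2, w_top, r1, r2
    # Stage 1 (wake phase): stochastic bottom-up pass; train the
    # generative weights to reconstruct the sampled states below.
    h1 = sample(sigmoid(v @ r1))
    h2 = sample(sigmoid(h1 @ r2))
    g1 += lr * np.outer(h1, v - sigmoid(h1 @ g1))
    g2 += lr * np.outer(h2, h1 - sigmoid(h2 @ g2))
    # Stage 2: contrastive divergence in the top-level RBM, using the
    # correlations when activity first arrives vs. after one iteration.
    h3 = sample(sigmoid(h2 @ w_top))
    h2_neg = sample(sigmoid(h3 @ w_top.T))
    w_top += lr * (np.outer(h2, sigmoid(h2 @ w_top))
                   - np.outer(h2_neg, sigmoid(h2_neg @ w_top)))
    # Stage 3 (sleep phase): top-down stochastic pass from h2_neg;
    # train the recognition weights to infer the sampled states above.
    h1_s = sample(sigmoid(h2_neg @ g2))
    v_s = sample(sigmoid(h1_s @ g1))
    r2 += lr * np.outer(h1_s, h2_neg - sigmoid(h1_s @ r2))
    r1 += lr * np.outer(v_s, h1_s - sigmoid(v_s @ r1))

v = (rng.random(n_v) < 0.3).astype(float)
for _ in range(10):
    fine_tune_step(v)
```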

The difference from the standard wake-sleep algorithm is that the top-level RBM acts as a much better prior over the top layers than just a layer of units that are assumed to be independent, which is what you get with a sigmoid belief net. Also, rather than generating data by sampling from the prior, we look at a training case, go up to the top-level RBM, and just run a few iterations before we generate data.

So now we're going to look at an example where we first learn some RBMs by stacking them up, then do contrastive wake-sleep to fine-tune the model, and then look to see what it's like: is it a good generative model, and is it good at recognizing things? First of all, we're going to use 500 binary hidden units to learn to model all 10 digit classes in images of 28 by 28 pixels. We learn that RBM without knowing what the labels are, so it's unsupervised learning.

Then we take the patterns of activity that those 500 hidden units have when they're looking at data, treat those patterns of activity as data, and learn another RBM that also has 500 units. Those two layers are learned without knowing what the labels are. Once we've done that, we actually tell the system the labels: the first two hidden layers are learned without labels, and then we add a big top layer and give it the 10 labels. You can think of this as concatenating the 10 labels with the 500 units that represent features, except that the 10 labels are really one softmax unit. Then we train that top-level RBM to model the concatenation of the softmax unit for the 10 labels with the 500 feature activities produced by the two layers below.
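So the data seen by the top-level RBM is just the 500 feature activities with a one-hot label vector appended; a trivial sketch, where the layer sizes follow the lecture and everything else is made up:

```python
import numpy as np

n_features, n_labels = 500, 10

# Made-up binary feature activities from the second hidden layer.
features = (np.random.default_rng(4).random(n_features) < 0.5).astype(float)

# The 10 label units act as one softmax unit: exactly one is on,
# clamped to the correct class during training.
label = 3                      # e.g. the digit class "3"
one_hot = np.zeros(n_labels)
one_hot[label] = 1.0

# The concatenation is the visible vector of the top-level RBM.
top_rbm_input = np.concatenate([features, one_hot])
```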

Once we've trained the top-level RBM, we can fine-tune the whole system using contrastive wake-sleep, and then we'll have a very good generative model; that's the model I showed you in the intro video. So if you go back and find the introductory video for this course, you'll see what happens when we run that model: you'll see how good it is at recognition, and you'll also see that it's very good at generation. In that introductory video, I promised I would eventually explain how it worked, and I think you've now seen enough to know what's going on when this model is learned.
