0:00

In this video, I'll describe the first way we discovered for getting Sigmoid Belief Nets to learn efficiently. It's called the wake-sleep algorithm, and it should not be confused with Boltzmann machines. Boltzmann machines have two phases, a positive and a negative phase, that could plausibly be related to wake and sleep. But the wake-sleep algorithm is a very different kind of learning, mainly because it's for directed graphical models like Sigmoid Belief Nets, rather than for undirected graphical models like Boltzmann machines.

The ideas behind the wake-sleep algorithm led to a whole new area of machine learning called variational learning, which didn't take off until the late 1990s, despite early examples like the wake-sleep algorithm, and is now one of the main ways of learning complicated graphical models in machine learning.

The basic idea behind these variational methods sounds crazy. The idea is that since it's hard to compute the correct posterior distribution, we'll compute some cheap approximation to it. And then we'll do maximum likelihood learning anyway. That is, we'll apply the learning rule that would be correct if we had a sample from the true posterior, and hope that it works, even though we haven't.

Now, you could reasonably expect this to be a disaster, but actually the learning comes to your rescue.

So, if you look at what's driving the weights during the learning, when you use an approximate posterior, there are actually two terms driving the weights. One term is driving them to get a better model of the data. That is, to make the Sigmoid Belief Net more likely to generate the observed data in the training set. But there's another term added to that, which is actually driving the weights towards sets of weights for which the approximate posterior it's using is a good fit to the real posterior. It does this by manipulating the real posterior to try to make it fit the approximate posterior. It's because of this effect that variational learning of these models works quite nicely.

Back in the mid-90s, when we first came up with it, we thought this was an interesting new theory of how the brain might learn. That idea has since been taken up by Karl Friston, who strongly believes this is what's going on in real neural learning.

So, we're now going to look in more detail at how we can use an approximation to the posterior distribution for learning. To summarize, it's hard to learn complicated models like Sigmoid Belief Nets because it's hard to infer the true posterior distribution over hidden configurations, given a data vector. And it's hard even to get a sample from that posterior; that is, it's hard to get an unbiased sample. So, the crazy idea is that we're going to use samples from some other distribution and hope that the learning will still work. And as we'll see, that turns out to be true for Sigmoid Belief Nets.

So, the distribution that we're going to use is a distribution that ignores explaining away. We're going to assume (wrongly) that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit. In other words, we're going to assume that, given the data, the units in each hidden layer are independent of one another, as they are in a Restricted Boltzmann Machine. But in a Restricted Boltzmann Machine, this is correct, whereas in a Sigmoid Belief Net, it's wrong.

So, let's quickly look at what a factorial distribution is. In a factorial distribution, the probability of a whole vector is just the product of the probabilities of its individual terms. So, suppose we have three hidden units in a layer and they have probabilities of being on of 0.3, 0.6, and 0.8. If we want to compute the probability of the hidden layer having the state (1, 0, 1), we compute that by multiplying 0.3 by (1 - 0.6) by 0.8. So, the probability of a configuration of the hidden layer is just the product of the individual probabilities. That's why it's called factorial.

In general, a distribution over binary vectors of length n will have 2^n degrees of freedom. Actually, it's only 2^n - 1, because the probabilities must add to one. A factorial distribution, by contrast, only has n degrees of freedom.
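The arithmetic in that example can be checked with a short snippet; the helper name `factorial_prob` is just for illustration:

```python
def factorial_prob(state, p_on):
    """Probability of a binary configuration under a factorial distribution:
    the product over units of p if the unit is on, (1 - p) if it is off."""
    prob = 1.0
    for s, p in zip(state, p_on):
        prob *= p if s == 1 else (1.0 - p)
    return prob

# Three hidden units with probabilities of being on of 0.3, 0.6, and 0.8.
# P(1, 0, 1) = 0.3 * (1 - 0.6) * 0.8 = 0.096
print(factorial_prob((1, 0, 1), [0.3, 0.6, 0.8]))
```

Note that the full distribution over three binary units would need 2^3 - 1 = 7 numbers, while the factorial one is specified by just the three per-unit probabilities.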

It's a much simpler beast. So now, I'm going to describe the wake-sleep algorithm that makes use of this idea of using the wrong distribution. And in this algorithm, we have a neural net that has two different sets of weights.

Â 8:06

It turns out that if you start with random weights and you alternate between wake phases and sleep phases, it learns a pretty good model.
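As a rough, hedged sketch of that two-phase procedure (not an exact reproduction of the original formulation): a single-hidden-layer version in NumPy with stochastic binary units, where the recognition weights run bottom-up and the generative weights run top-down. The names `W_rec`, `W_gen`, and `wake_sleep_step` are our own, and generative biases on the visible units are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Stochastic binary units: each fires with its own probability.
    return (rng.random(p.shape) < p).astype(float)

n_vis, n_hid = 6, 3
W_rec = rng.normal(0, 0.1, (n_vis, n_hid))  # recognition weights (bottom-up)
W_gen = rng.normal(0, 0.1, (n_hid, n_vis))  # generative weights (top-down)
b_hid = np.zeros(n_hid)                     # generative bias (prior) on the hidden layer
eps = 0.1                                   # learning rate

def wake_sleep_step(v):
    global W_rec, W_gen, b_hid
    # Wake phase: recognize bottom-up to get binary hidden states, then
    # train the generative weights to reconstruct the data from those states.
    h = sample(sigmoid(v @ W_rec))
    p_v = sigmoid(h @ W_gen)
    W_gen += eps * np.outer(h, v - p_v)   # delta rule on generative weights
    b_hid += eps * (h - sigmoid(b_hid))   # move the hidden prior toward h
    # Sleep phase: dream top-down from the generative model, then train the
    # recognition weights to recover the hidden states that caused the dream.
    h_dream = sample(sigmoid(b_hid))
    v_dream = sample(sigmoid(h_dream @ W_gen))
    p_h = sigmoid(v_dream @ W_rec)
    W_rec += eps * np.outer(v_dream, h_dream - p_h)  # delta rule on recognition weights
```

Both phases use the same simple delta rule; what alternates is which set of weights is treated as fixed and which is being trained.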

There are flaws in this algorithm. The first flaw is a rather minor one, which is that the recognition weights are learning to invert the generative model, but at the beginning of learning, they're learning to invert it in parts of the space where there isn't any data. That's because when you generate from the model, you're generating stuff that looks very different from the real data, since the weights aren't any good. That's a waste, but it's not a big problem.

The serious problem with this algorithm is that the recognition weights not only don't follow the gradient of the log probability of the data, they don't even follow the gradient of the variational bound on this probability. And because they're not following the right gradient, we get incorrect mode averaging, which I'll explain in the next slide.

A final problem is that we know the true posterior over the top hidden layer is bound to be far from independent because of explaining-away effects. And yet, we're forced to approximate it with a distribution that assumes independence. This independence approximation might not be so bad for intermediate hidden layers, because if we're lucky, the explaining-away effects that come from below will be partially canceled out by prior effects that come from above. You'll see that in much more detail later.

Despite all these problems, Karl Friston thinks this is how the brain works. When we initially came up with the algorithm, we thought it was an interesting new theory of the brain. I currently believe that it's got too many problems to be how the brain works and that we'll find better algorithms.

Â 10:47

When that unit turns on, there's a probability of a half that the visible unit will turn on. So, if you think about the occasions on which the visible unit turns on, half of those occasions have the left-hand hidden unit on, the other half have the right-hand hidden unit on, and almost none of those occasions have neither or both units on.

So now think about what the learning would do for the recognition weights. Half the time we'll have a 1 on the visible layer with the left hidden unit on at the top, so we'll actually learn to predict that it's on with a probability of 0.5, and the same for the right unit. So the recognition units will learn to produce a factorial distribution over the hidden layer of (0.5, 0.5), and that factorial distribution puts a quarter of its mass on the configuration (1, 1) and another quarter of its mass on the configuration (0, 0), and both of those are extremely unlikely configurations given that the visible unit was on.
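The arithmetic behind this mode-averaging failure can be checked directly. The marginals (0.5, 0.5) come from the example above; the helper name `q` is just for illustration:

```python
from itertools import product

# The recognition weights learn a factorial distribution whose per-unit
# probabilities match the true posterior's marginals: each hidden unit
# is on half the time the visible unit is on.
q_on = (0.5, 0.5)

def q(state):
    # Factorial probability of a hidden configuration under q_on.
    prob = 1.0
    for s, p in zip(state, q_on):
        prob *= p if s == 1 else (1.0 - p)
    return prob

for state in product((0, 1), repeat=2):
    print(state, q(state))
```

Every configuration gets mass 0.25, including (0, 0) and (1, 1), even though the true posterior is concentrated almost entirely on (1, 0) and (0, 1). Matching the marginals while ignoring the correlations is exactly the incorrect mode averaging described above.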