0:00

In this video, I will introduce Restricted Boltzmann Machines.

These have a much simplified architecture in which there are no connections between hidden units. This makes it very easy to get the equilibrium distribution of the hidden units when the visible units are given: once you've clamped a data vector on the visible units, the equilibrium distribution of the hidden units can be computed exactly in one step, because the hidden units are all independent of one another given the states of the visible units. The proper Boltzmann machine learning algorithm is still slow for a restricted Boltzmann machine.

But in 1998, I discovered a very surprising shortcut that leads to the first efficient learning algorithm for Boltzmann machines. Even though this algorithm has theoretical problems, it works quite well in practice, and it led to a revival of interest in Boltzmann machine learning.

In a restricted Boltzmann machine, we restrict the connectivity of the network in order to make both inference and learning easier. It has only one layer of hidden units, and there are no connections between the hidden units. There are also no connections between the visible units. So the architecture is what computer scientists call a bipartite graph: there are two pieces, and within each piece there are no connections.

1:37

That means that with a data vector clamped, we can quickly compute the expected value of ⟨v_i h_j⟩, because we can compute the exact probability that each hidden unit j will turn on, and that probability is independent of all the other units in the hidden layer.

1:55

The probability that j will turn on is just the logistic function of the input it gets from the visible units, quite independent of what the other hidden units are doing. So we can compute that probability for all the hidden units in parallel, and that's a tremendous win. If you want to make a good model of a set of binary vectors, the right algorithm to use for a restricted Boltzmann machine is one introduced by Tieleman in 2008 that's based on earlier work by Neal. In the positive phase, you clamp the data vector on the visible units. You then compute the exact value of the expectation ⟨v_i h_j⟩ for every pair of a visible and a hidden unit. You can do that because v_i is fixed and the probability of h_j can be computed exactly.
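To make the positive phase concrete, here is a minimal numpy sketch of computing the hidden probabilities in parallel and reading off the exact ⟨v_i h_j⟩ statistics with a data vector clamped. The sizes, weights, and data vector are invented for illustration; this is a sketch of the idea, not code from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small RBM: 6 visible units, 4 hidden units (made-up sizes).
n_vis, n_hid = 6, 4
W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))  # weights w_ij
b_hid = np.zeros(n_hid)                          # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(v):
    # p(h_j = 1 | v) is the logistic of the input from the visible units.
    # Because there are no hidden-hidden connections, all j are computed
    # exactly, in parallel, in one step.
    return sigmoid(v @ W + b_hid)

# Positive phase: clamp a data vector and read off <v_i h_j> exactly.
v = rng.integers(0, 2, size=n_vis).astype(float)
p_h = hidden_probs(v)
pos_stats = np.outer(v, p_h)   # exact E[v_i h_j] with v clamped
```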

2:58

For the negative phase, you keep a set of fantasy particles, that is, global configurations. Then, after each weight update, you update each fantasy particle a few times using alternating parallel updates, which should bring it back close to equilibrium.

Â 3:21

And then, for every connected pair of units, you average ⟨v_i h_j⟩ over all the fantasy particles, and that gives you your negative statistics.
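The negative phase with persistent fantasy particles can be sketched like this, again with invented sizes and a hypothetical number of particles and Gibbs steps:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up sizes for illustration.
n_vis, n_hid, n_fantasy = 6, 4, 10
W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))
a = np.zeros(n_vis)   # visible biases
b = np.zeros(n_hid)   # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# Persistent fantasy particles: global configurations kept across updates.
fantasy_v = sample(np.full((n_fantasy, n_vis), 0.5))

def negative_stats(n_gibbs=5):
    global fantasy_v
    # A few alternating parallel updates per weight update should keep
    # the particles close to equilibrium.
    for _ in range(n_gibbs):
        h = sample(sigmoid(fantasy_v @ W + b))
        fantasy_v = sample(sigmoid(h @ W.T + a))
    p_h = sigmoid(fantasy_v @ W + b)
    # Average v_i h_j over all the fantasy particles.
    return fantasy_v.T @ p_h / n_fantasy

neg = negative_stats()
```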

This algorithm actually works very well and allows RBMs to build good density models of sets of binary vectors. Now I'm going to go on to a learning algorithm that is not as good at building density models but is much faster.

So, I'm going to start with a picture of an inefficient learning algorithm for restricted Boltzmann machines. We start by clamping a data vector on the visible units, and we call that time t = 0. We're going to use times now not to denote weight updates, but to denote steps in a Markov chain. Given that visible vector, we now update the hidden units: we choose binary states for the hidden units, and we measure the expected value ⟨v_i h_j⟩ for all pairs of a visible and a hidden unit that are connected. I'll call that ⟨v_i h_j⟩^0 to indicate that it's measured at time zero, with the hidden units being determined by the visible units. And, of course, we can update all the hidden units in parallel. We then use the hidden vector to update all the visible units in parallel, and again we update all the hidden units in parallel. The visible vector at time t = 1 we'll call a reconstruction, or a one-step reconstruction. And we can keep going with the alternating chain that way, updating visible units and then hidden units, each set being updated in parallel. And after we've gone for a long time,

Â 5:10

we'll get to some state of the visible units, which I'll call t = infinity to indicate that it needs to be a long time, and the system will be at thermal equilibrium. Now we can measure the correlation of v_i and h_j after the chain has run for a long time, and I'll call that ⟨v_i h_j⟩^∞. And the visible state we have after a long time, I'll call a fantasy. So now the learning rule is simply: we change w_ij by the learning rate times the difference between ⟨v_i h_j⟩ at time zero and ⟨v_i h_j⟩ at time infinity. And, of course, the problem with this algorithm is that we have to run the chain for a long time before it reaches thermal equilibrium; if we don't run it for long enough, the learning may go wrong. In fact, that last statement is very misleading. It turns out that even if we only run the chain for a short time, the learning still works.
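The inefficient algorithm, alternating parallel updates for a long time and then applying Δw_ij = ε(⟨v_i h_j⟩^0 − ⟨v_i h_j⟩^∞), might be sketched as follows. A finite number of steps only approximates "time infinity", and all sizes and the learning rate are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

n_vis, n_hid = 6, 4                 # made-up sizes
W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def alternating_chain(v0, n_steps):
    # t = 0: clamp the data vector, update all hidden units in parallel.
    v = v0
    h = sample(sigmoid(v @ W + b))
    stats_0 = np.outer(v, h)        # <v_i h_j> measured at time 0
    # Alternate parallel updates of visibles and hiddens.
    for _ in range(n_steps):
        v = sample(sigmoid(h @ W.T + a))
        h = sample(sigmoid(v @ W + b))
    stats_inf = np.outer(v, h)      # <v_i h_j> after a long run
    return stats_0, stats_inf

v0 = rng.integers(0, 2, size=n_vis).astype(float)
s0, sinf = alternating_chain(v0, n_steps=1000)
lr = 0.01
W += lr * (s0 - sinf)   # delta w_ij = lr * (<vh>^0 - <vh>^inf)
```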

Â 6:13

So, here's the very surprising shortcut. You just run the chain up, down, and up again. From the data, you generate a hidden state; from that, you generate a reconstruction; and from that, you generate another hidden state. And you measure the statistics once you've done that. So, instead of using the statistics measured at equilibrium, we're using the statistics measured after doing one full update of the Markov chain. The learning rule is the same as before, except it's much quicker to compute, and this is clearly not doing maximum likelihood learning, because the term we are using for the negative statistics

Â 7:00

is wrong. But the learning nevertheless works quite well. Next week, we'll understand a bit more about why it works well. But for now, we'll just see that it does.
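Here is a minimal sketch of the shortcut (CD1): up, down, and up again, with the same learning rule applied to the one-step statistics. All sizes and the learning rate are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(3)

n_vis, n_hid = 6, 4                 # made-up sizes
W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def cd1_update(v0, lr=0.05):
    # Up: data -> hidden state.
    p_h0 = sigmoid(v0 @ W + b)
    h0 = sample(p_h0)
    # Down: hidden state -> one-step reconstruction.
    v1 = sample(sigmoid(h0 @ W.T + a))
    # Up again: reconstruction -> hidden probabilities.
    p_h1 = sigmoid(v1 @ W + b)
    # Same rule as before, but with statistics from one full update
    # of the Markov chain rather than from equilibrium.
    return lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))

v0 = rng.integers(0, 2, size=n_vis).astype(float)
W += cd1_update(v0)
```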

Â 7:14

So, the obvious question is: why does that shortcut work at all? Here's the reasoning. If we start the chain at the data, the Markov chain will wander away from the data and towards its equilibrium distribution, that is, towards things that its initial weights like more than the data. We can see what direction it's wandering in after only a few steps, and if we know the initial weights aren't very good, it's a waste of time to go all the way to equilibrium. We know how to change the weights to stop it wandering away from the data without going all the way to equilibrium. All we need to do is lower the probability of the reconstructions, or confabulations as a psychologist would call them, that it produces after one full step, and then raise the probability of the data.

Â 8:31

Here's a data point on the energy surface. By data point, I mean both the visible vector and the particular hidden vector that we got by stochastically updating the hidden units; that hidden vector is a function of what the data point is. Starting at that data point, we run the Markov chain for one full step to get a new visible vector and the hidden vector that goes with it: a reconstruction of the data point, plus the hidden vector that goes with that reconstruction.
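For reference, a common form of the RBM energy that this picture refers to is E(v, h) = −a·v − b·h − vᵀWh, where a and b are the visible and hidden biases; pulling the energy down at a configuration raises its probability, and pulling it up lowers it. A tiny sketch with made-up numbers:

```python
import numpy as np

def rbm_energy(v, h, W, a, b):
    # E(v, h) = -a.v - b.h - v.W.h  (a, b: visible/hidden biases).
    # Lowering E at (data, its hidden vector) raises that configuration's
    # probability; raising E at (reconstruction, its hidden vector) lowers it.
    return -(a @ v) - (b @ h) - v @ W @ h

# Tiny made-up configuration for illustration.
v = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 1.0])
W = np.full((3, 2), 0.5)
a = np.zeros(3)
b = np.zeros(2)
e = rbm_energy(v, h, W, a, b)
```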

Â 9:05

We then change the weights to pull the energy down at the data point, and to pull the energy up at the reconstruction. The effect of that is to make the surface look like this. You'll notice we're beginning to construct an energy minimum at the data. You'll also notice that far away from the data, things have stayed pretty much as they were before.

Â 9:51

These low-energy holes cause the normalization term to be big, and we can't sense them if we use the shortcut. If we use persistent particles, where we remember their states and, after each weight update, update them a few more times, then they would eventually find these holes. They'd move into the holes, and the learning would cause the holes to fill up.

Â 10:17

A good compromise between speed and correctness is to start with small weights and to use CD1, that is, contrastive divergence with one full step to get the negative data. Once the weights have grown a bit, the Markov chain mixes more slowly, and we can then use CD3.
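That schedule could be sketched as a CD-k gradient with k chosen from the current weight magnitude; the particular threshold below is a made-up illustration, not a recommendation from the lecture:

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def cdk_gradient(v0, W, a, b, k):
    # CD-k: run k full up-down steps of the chain for the negative statistics.
    p_h0 = sigmoid(v0 @ W + b)
    v, h = v0, sample(p_h0)
    for _ in range(k):
        v = sample(sigmoid(h @ W.T + a))
        h = sample(sigmoid(v @ W + b))
    p_hk = sigmoid(v @ W + b)
    return np.outer(v0, p_h0) - np.outer(v, p_hk)

# Hypothetical schedule: CD1 while the weights are small, CD3 once grown.
def k_for(weight_norm):
    return 1 if weight_norm < 1.0 else 3

W = rng.normal(0.0, 0.01, size=(5, 3))   # made-up sizes
a, b = np.zeros(5), np.zeros(3)
v0 = rng.integers(0, 2, size=5).astype(float)
g = cdk_gradient(v0, W, a, b, k_for(np.abs(W).sum()))
```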

Â