0:00

In the previous video, I showed how a Boltzmann machine can be used as a probabilistic model of a set of binary data vectors. In this video we're finally going to get around to the Boltzmann machine learning algorithm. This is a very simple learning algorithm with an elegant theoretical justification, but it turned out that in practice it was extremely slow and noisy, and just wasn't practical. And for many years, people thought that Boltzmann machines would never be practical devices. Then we found several different ways of greatly speeding up the learning algorithm. And now the algorithm is much more practical, and has, in fact, been used as part of the winning entry for a million-dollar machine learning competition, which I'll talk about in a later video.

The Boltzmann machine learning algorithm is an unsupervised learning algorithm. Unlike the typical use of backpropagation, where we have an input vector and we provide it with a desired output, in Boltzmann machine learning we just give it the input vector. There are no labels. What the algorithm is trying to do is build a model of a set of input vectors, though it might be better to think of them as output vectors.

Â 1:15

What we want to do is maximize the product of the probabilities that the Boltzmann machine assigns to a set of binary vectors: the ones in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors. It's also equivalent to maximizing the probability that we'd obtain exactly the N training cases if we ran the Boltzmann machine in the following way: we let it settle to its stationary distribution N different times, with no external input.
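In symbols, the first equivalence mentioned here is just the monotonicity of the log (a sketch; W stands for all the weights, and the product and sum run over the training vectors):

```latex
\arg\max_{W} \prod_{v \in \mathrm{data}} p(v \mid W)
\;=\;
\arg\max_{W} \sum_{v \in \mathrm{data}} \log p(v \mid W)
```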

Â 2:07

Now, the main reason why the learning could be difficult. This is probably the most important reason. Consider a chain of hidden units, with a visible unit attached to each end, and a training set that consists of (1, 0) and (0, 1). In other words, we want the two visible units to be in opposite states. The way to achieve that is by making sure that the product of all those weights is negative. So, for example, if all of the weights are positive, turning on the first visible unit will tend (via w1) to turn on the first hidden unit, and that will tend to turn on the second hidden unit, and so on, and the fourth hidden unit will tend to turn on the other visible unit. If one of those weights is negative, then we'll get an anti-correlation between the two visible units.

Â 3:01

What this means is that if we're thinking about learning weight w1, we need to know the other weights. To know how to change w1, we need to have information about w3, because if w3 is negative, what we want to do with w1 is the opposite of what we want to do with w1 if w3 is positive.

Â 3:28

So given that one weight needs to know about other weights in order to be able to change in even the right direction, it's very surprising that there's a very simple learning algorithm, and that the learning algorithm only requires local information.

Â 3:44

So it turns out that everything that one weight needs to know about all the other weights and about the data is contained in the difference of two correlations. Another way of saying that is: if you take the log probability that the Boltzmann machine assigns to a visible vector v, and ask about the derivative of that log probability with respect to a weight w_ij, it's the difference of two expected values of the product of the states of units i and j. The first is taken when the network settles to thermal equilibrium with v clamped on the visible units; that is, how often are i and j on together when v is clamped on the visible units and the network is at thermal equilibrium. From that we subtract the same quantity when v is not clamped on the visible units.

Because the derivative of the log probability of a visible vector is this simple difference of correlations, we can make the change in a weight be proportional to the expected product of the two activities, averaged over all visible vectors in the training set (that's what we call the data term), minus the product of the same two activities when you're not clamping anything and the network has reached thermal equilibrium with no external interference.
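The rule just described can be sketched in a few lines of NumPy. This is only an illustration of the statistics bookkeeping, not a full Boltzmann machine: the function name and the made-up equilibrium samples are my own, and collecting real samples requires the settling procedure discussed later in the video.

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_weight_update(pos_samples, neg_samples, lr=0.1):
    """One step of the Boltzmann machine learning rule.

    pos_samples: binary unit states at thermal equilibrium with data clamped,
                 shape (n_samples, n_units)
    neg_samples: binary unit states at thermal equilibrium with nothing clamped
    Returns delta_w proportional to <s_i s_j>_data - <s_i s_j>_model.
    """
    # Average pairwise products give the two correlations in the rule.
    pos_corr = pos_samples.T @ pos_samples / len(pos_samples)  # data phase
    neg_corr = neg_samples.T @ neg_samples / len(neg_samples)  # model phase
    # (A real machine has no self-connections, so the diagonal would be ignored.)
    return lr * (pos_corr - neg_corr)

# Toy illustration with made-up "equilibrium" samples:
pos = rng.integers(0, 2, size=(100, 4))
neg = rng.integers(0, 2, size=(100, 4))
dw = boltzmann_weight_update(pos, neg)
```

Note that the update for w_ij uses only the states of units i and j, which is the locality property the lecture emphasizes.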

Â 5:11

So this is a very interesting learning rule. The first term in the learning rule says: raise the weights in proportion to the product of the activities the units have when you're presenting data. That's the simplest form of what's known as a Hebbian learning rule. Donald Hebb, a long time ago, in the 1940s or 1950s, suggested that synapses in the brain might use a rule like that. But if you just use that rule, the synapse strengths will keep getting stronger, the weights will all become very positive, and the whole system will blow up. You have to somehow keep things under control, and this learning algorithm keeps things under control by using that second term: it reduces the weights in proportion to how often those two units are on together when you're sampling from the model's distribution.

You can also think of it this way: the first term is like the storage term for a Hopfield net, and the second term is like the term for getting rid of spurious minima. And in fact, this is the correct way to think about it. This rule tells you exactly how much unlearning to do.

Â 6:26

Well, the probability of a global configuration at thermal equilibrium, that is, once you've let it settle down, is an exponential function of its energy: the probability is proportional to e to the minus the energy.

Â 6:41

So when we settle to equilibrium, we achieve a linear relationship between the log probability and the energy function. Now, the energy function is linear in the weights, so we have a linear relationship between the weights and the log probability. And since we're trying to manipulate log probabilities by manipulating weights, that's a good thing to have: it's a log-linear model. In fact, the relationship is very simple. The derivative of the energy with respect to a particular weight w_ij is just minus the product of the two activities that the weight connects.
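Written out (a sketch using the energy function from the earlier video, with s_i denoting binary unit states and biases omitted):

```latex
E \;=\; -\sum_{i<j} s_i \, s_j \, w_{ij}
\qquad\Longrightarrow\qquad
-\frac{\partial E}{\partial w_{ij}} \;=\; s_i \, s_j
```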

Â 7:23

So what's happening here is that the process of settling to thermal equilibrium is propagating information about the weights. We don't need an explicit backpropagation stage. We do need two stages: we need to settle with the data, and we need to settle with no data. But notice that the network is behaving in pretty much the same way in those two phases. A unit deep within the network is doing the same thing, just with different boundary conditions. With backprop, the forward pass and the backward pass are really rather different.

Another question you could ask is: what's that negative phase for? I've already said it's like the unlearning we do in a Hopfield net to get rid of spurious minima, but let's look at it in more detail. The equation for the probability of a visible vector says that it's a sum, over all hidden vectors, of e to the minus the energy of that visible and hidden vector together, normalized by the same quantity summed over all visible and hidden vectors. So if you look at the top term, what the first term in the learning rule is doing is decreasing the energy of terms in that sum that are already large, and it finds those terms by settling to thermal equilibrium with the vector v clamped, so that it can find an h that goes nicely with v, that is, gives a nice low energy with v. Having sampled those vectors h, it then changes the weights to make that energy even lower.
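In symbols, the equation for the probability of a visible vector described here is (with u and g ranging over all visible and hidden configurations in the normalizing term):

```latex
p(v) \;=\; \frac{\sum_{h} e^{-E(v,h)}}{\sum_{u} \sum_{g} e^{-E(u,g)}}
```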

Â 9:03

The second phase in the learning, the negative phase, is doing the same thing, but for the partition function, that is, the normalizing term on the bottom line. It's finding global configurations, combinations of visible and hidden states, that give low energy and are therefore large contributors to the partition function. And having found those global configurations, it tries to raise their energy so that they can contribute less. So the first term is making the top line big, and the second term is making the bottom line small.

Â 9:43

Now, in order to run this learning rule, you need to collect those statistics. You need to collect what we call the positive statistics, those are the ones you get when you have data clamped on the visible units, and also the negative statistics, those are the ones you get when you don't have data clamped, and that you're going to use for unlearning. An inefficient way to collect these statistics was suggested by me and Terry Sejnowski in 1983. The idea is: in the positive phase, you clamp a data vector on the visible units, you set the hidden units to random binary states, and then you update the hidden units one at a time until the network reaches thermal equilibrium at a temperature of one.

Â 10:29

We actually did that by starting at a high temperature and reducing it, but that's not the main point here. Then, once you reach thermal equilibrium, you sample how often two units are on together. So you're measuring the correlation of i and j with that visible vector clamped. You then repeat that over all the visible vectors, so that the correlation you're sampling is averaged over all the data.

Then, in the negative phase, you don't clamp anything; the network is free from external interference. So you set all of the units, both visible and hidden, to random binary states, and then you update the units one at a time until the network reaches thermal equilibrium at a temperature of one, just like you did in the positive phase. And again, you sample the correlation of every pair of units i and j, and you repeat that many times.
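The settling procedure used in both phases can be sketched as Gibbs sampling with the stochastic-unit rule from the earlier videos. This is a minimal sketch, assuming a small fully-observed machine with made-up weights; in the positive phase you would hold the visible entries of the state fixed, and there is no guarantee that a fixed number of steps actually reaches thermal equilibrium.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def settle(W, b, n_steps=500):
    """Run Gibbs updates until the network is (hopefully) near thermal
    equilibrium at a temperature of one, then return the binary state.

    W: symmetric weight matrix with zero diagonal; b: unit biases.
    Each step picks one unit and turns it on with probability
    sigmoid(total input to that unit).
    """
    n = len(b)
    s = rng.integers(0, 2, size=n).astype(float)  # random binary start
    for _ in range(n_steps):
        i = rng.integers(n)                       # update one unit at a time
        p_on = sigmoid(W[i] @ s + b[i])
        s[i] = 1.0 if rng.random() < p_on else 0.0
    return s

# Toy 3-unit machine (hypothetical weights, just for illustration):
W = np.array([[0.0, 2.0, -1.0],
              [2.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])
b = np.zeros(3)
sample = settle(W, b)
```

Averaging the products s[i] * s[j] over many such runs gives the negative statistics; doing the same with the visible units clamped to data gives the positive statistics.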

Â Now it's very difficult to know how many times you need to repeat it, but certainly

Â in the negative phase you expect the energy landscape to have many different

Â minima, but are fairly separated and have about the same energy.

Â The reason you expect that is we're going to be using Boltzmann machines to do

Â things like model a set of images. And you expect there to be reasonable

Â images, all of which have about the same energy.

Â And then very unreasonable images, which have much higher energy.

Â And so you expect a small fraction of the space to be these low energy states.

Â And a very large fraction of the space to be these bad high energy states.

Â If you have multiple modes, it's very unclear how many times you need to repeat

Â this process to be able to sample those modes.
