0:00

In this video, I'm going to explain how adding noise can help systems escape from local minima, and I'm going to show what you have to do to the units in a Hopfield net to add noise in the appropriate way. I'm now going to introduce the idea that we can find better minima by using noise.

So, a Hopfield net always makes decisions that reduce the energy, or, if it doesn't change the state of the unit, leave the energy the same.

This makes it impossible to climb out of a local minimum. So, if you look at the landscape here: if we get into the local minimum A, there's no way we're going to get over the energy barrier to the better minimum B, because we can't go uphill in energy. If we add random noise, we can escape from poor minima, especially minima that are shallow, that is, ones that don't have big energy barriers around them.

It turns out that, rather than using a fixed noise level, the most effective strategy is to start with a lot of noise, which allows you to explore the space on a coarse scale and find the generally good regions of the space, and then to decrease the noise level. With a lot of noise, you can cross big barriers. As you decrease the noise level, you start concentrating on the best nearby minima. Slowly reducing the noise so that the system ends up in a deep minimum is called simulated annealing.
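The annealing schedule just described can be sketched in code. This is a minimal illustration, not anything from the lecture: the one-dimensional energy landscape (a shallow well near x = -1 and a deeper well near x = +1), the Gaussian move size, and the geometric cooling schedule are all made-up choices.

```python
import math
import random

def energy(x):
    # Toy 1-D landscape (an assumption for illustration): a double well,
    # tilted so the minimum near x = +1 is deeper than the one near x = -1.
    return 0.25 * (x**2 - 1.0)**2 - 0.3 * x

def accept_prob(delta_e, temperature):
    # Metropolis rule: always accept downhill moves; accept uphill moves
    # with probability exp(-delta_e / T), so high T crosses big barriers.
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / temperature)

def anneal(steps=5000, t_start=5.0, t_end=0.01, seed=0):
    # Start with a lot of noise (high T) to explore coarsely, then cool
    # geometrically so the system settles into a deep minimum.
    rng = random.Random(seed)
    x = -1.0                                  # begin in the shallow minimum
    cooling = (t_end / t_start) ** (1.0 / steps)
    t = t_start
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 0.5)   # propose a random move
        if rng.random() < accept_prob(energy(candidate) - energy(x), t):
            x = candidate
        t *= cooling                          # gradually reduce the noise
    return x
```

With a fixed low temperature instead of a schedule, the walker would rarely escape the shallow well; starting hot and cooling slowly is the compromise the lecture describes.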

And this idea was propagated by Kirkpatrick at around the same time as Hopfield nets were proposed. The reason simulated annealing works is that the temperature, in a physical system, or in a simulated system with an energy function, affects the transition probabilities.

So, in a high-temperature system, the probability of going uphill from B to A is lower than the probability of going downhill from A to B, but it's not much lower. In effect, the temperature flattens the energy landscape. The little black dots are meant to be particles, and what we're imagining is particles moving about according to the transition probabilities that you get with an energy function and a temperature.

This might be a typical distribution if you run the system at a high temperature, where it's easier to cross barriers, but it's also hard to stay in a deep minimum once you've got there. If you run the system at a much lower temperature, the probability of crossing barriers gets much smaller, but the ratio gets much better. That is, the ratio of the probability of going from A to B versus the probability of going from B to A is much better in the low-temperature system. So, if we ran it long enough, we would expect all of the particles to end up in B. But if we just run it for a long time at low temperature, it will take a very long time for particles to escape from A. It turns out a good compromise is to start at a high temperature and gradually reduce it.

The way we get noise into a Hopfield net is to replace the binary threshold units by binary stochastic units that make biased random decisions. The amount of noise is controlled by something called temperature, which you'll see in a minute in the equation. Raising the noise level is equivalent to decreasing all the energy gaps between configurations.

Â 4:09

As we lower the temperature, depending on the sign of delta E, the unit will become either more and more firmly on or more and more firmly off. At zero temperature, which is what we'd be using in a Hopfield net, the sign of delta E determines whether the right-hand side goes to zero or to one. With T at zero, it will be either zero or one on the right-hand side, so the unit behaves deterministically, and that's a binary threshold unit. It will always adopt whichever of the two states has the lower energy.
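The equation referred to here appears on the slide rather than in the transcript. The standard form for a binary stochastic unit is p(s_i = 1) = 1 / (1 + exp(-ΔE_i / T)), where ΔE_i is the energy gap and T the temperature; a minimal sketch of how temperature changes its behavior:

```python
import math

def p_on(energy_gap, temperature):
    # p(s_i = 1) = 1 / (1 + exp(-delta_E_i / T)), where the energy gap
    # delta_E_i = E(s_i = 0) - E(s_i = 1) is the drop in global energy
    # when unit i turns on, and T is the temperature (noise level).
    return 1.0 / (1.0 + math.exp(-energy_gap / temperature))

# High T flattens the landscape: decisions become nearly unbiased coin flips.
print(round(p_on(2.0, 100.0), 3))   # → 0.505
# At T = 1 this is the standard logistic function of the energy gap.
print(round(p_on(2.0, 1.0), 3))     # → 0.881
# As T -> 0 the sigmoid sharpens into a step, recovering the deterministic
# binary threshold unit: adopt whichever state has the lower energy.
print(round(p_on(2.0, 0.01), 3))    # → 1.0
print(round(p_on(-2.0, 0.01), 3))   # → 0.0
```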

The energy gap is the one we saw on a previous slide: it's just the difference in the energy of the whole system depending on whether unit i is off or on. Although simulated annealing is a very powerful method for improving searches that get stuck in local optima, and although it was influential in leading Terry Sejnowski and me to the ideas behind Boltzmann machines, it's actually a big distraction from understanding Boltzmann machines. So, I'm not going to talk about it any more in this course, even though it's a very interesting idea.

And, from now on, I'm going to use binary stochastic units that have a temperature of one. That is, the probability of turning on is the standard logistic function of the energy gap.

So, one concept that you need to understand, in order to understand the learning procedure for Boltzmann machines, is the concept of thermal equilibrium. And because we're setting the temperature to one, this is the concept of thermal equilibrium at a fixed temperature. It's a difficult concept. Most people think it means the system has settled down and isn't changing any more; that's normally what equilibrium means. But it's not the states of the individual units that settle down.

Â 6:10

The individual units are still rattling around at thermal equilibrium, unless the temperature is zero. The thing that settles down is the probability distribution over configurations. That's a difficult concept the first time you meet it, so I'm going to give you an example. The probability distribution settles to a particular distribution called the stationary distribution. The stationary distribution is determined by the energy function of the system: in fact, in the stationary distribution, the probability of any configuration is proportional to e to the minus its energy.

A nice intuitive way to think about thermal equilibrium is to imagine a huge ensemble of identical systems that all have exactly the same energy function. So, imagine a very large number of stochastic Hopfield nets, all with the same weights. Now, in that huge ensemble, we can define the probability of a configuration as the fraction of the systems that are in that configuration.

So, now we can understand what's happening as we approach thermal equilibrium. We can start with any distribution we like over all these identical systems. We could make them all be in the same configuration; that's a distribution with a probability of one on one configuration and zero on everything else. Or we could start them off with an equal number of systems in each possible configuration.

Â So that's a uniform distribution. And then, we're going to keep applying our

Â stochastic update rule. Which, in the case of a stochastic

Â Hopfield net would mean, You pick a unit, and you look at its

Â energy gap. And you make a random decision based on

Â that energy gap about whether to turn it on or turn it off.

Â Then, you go and pick another unit, and so on.

Â We keep applying that stochastic rule. And after we've run systems stochastically

Â in this way, We may eventually reach a situation where

Â the fraction of the systems in each configuration remains constant.

Â In fact, that's what will happen if we have symmetric connections.

Â That's the stationary distribution that physicists call thermal equilibrium.
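This ensemble picture can be simulated directly. The sketch below assumes the logistic update rule at T = 1; the two-unit weights, ensemble size, and number of update sweeps are arbitrary choices for illustration, and the energy-gap formula ΔE_i = b_i + Σ_j w_ij s_j follows from the usual energy convention.

```python
import math
import random

def energy_gap(i, states, weights, biases):
    # delta_E_i = E(unit i off) - E(unit i on) = b_i + sum_j w_ij * s_j.
    return biases[i] + sum(weights[i][j] * states[j]
                           for j in range(len(states)) if j != i)

def update_step(states, weights, biases, rng, temperature=1.0):
    # The stochastic update rule: pick one unit, look at its energy gap,
    # and make a biased random decision about whether to turn it on.
    i = rng.randrange(len(states))
    p_on = 1.0 / (1.0 + math.exp(-energy_gap(i, states, weights, biases)
                                 / temperature))
    states[i] = 1 if rng.random() < p_on else 0

def ensemble_fractions(weights, biases, n_systems=2000, n_steps=200, seed=0):
    # Run many identical stochastic Hopfield nets from random starts and
    # report the fraction of systems in each configuration - the quantity
    # that settles down at thermal equilibrium.
    rng = random.Random(seed)
    n = len(biases)
    systems = [[rng.randrange(2) for _ in range(n)] for _ in range(n_systems)]
    for _ in range(n_steps):
        for s in systems:
            update_step(s, weights, biases, rng)
    counts = {}
    for s in systems:
        counts[tuple(s)] = counts.get(tuple(s), 0) + 1
    return {c: k / n_systems for c, k in counts.items()}
```

Each individual net keeps flipping its units forever, but the fractions converge toward the stationary distribution, in which the lowest-energy configuration is the most populated.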

Any given system keeps changing its configuration: as we apply the update rule, the states of its units keep flipping between zero and one. But the fraction of systems in any particular configuration doesn't change, and that's because we have many, many more systems than we have configurations.

So, here's an analogy just to help with the concept. Imagine a very large casino in Las Vegas with lots of card dealers; in fact, we have many more than 52 factorial card dealers. We start with all the card packs in the standard order in which they come from the manufacturer. Let's suppose that order has the ace of spades, then the king of spades, then the queen of spades, and so on. Then, the dealers all start shuffling.

And they do random shuffles; they don't do fancy shuffles that bring the packs back to the same order again. After a few shuffles, there's still a good chance that the king of spades will be next to the queen of spades in any given pack. So, the packs have not yet forgotten where they started; their initial order is still influencing their current order. If we keep shuffling, eventually the initial order will be irrelevant. The packs will have forgotten where they started, and, in fact, in this example, there will be an equal number of packs in each of the 52 factorial possible orders.

Once this has happened, if we carry on shuffling, there will still be an equal number of packs in each of the 52 factorial orders. That's why it's called equilibrium: the fraction in any one configuration doesn't change, even though the individual systems are still changing. The thing that's wrong with this analogy is that once we've reached equilibrium here, all configurations have equal energy, and so they all have the same probability. In general, we're interested in reaching equilibrium for systems where some configurations have lower energy than others.