0:00

In this video, I'm going to talk about some recent work on learning a joint model of captions and feature vectors that describe images. In the previous lecture, I talked about how we might extract semantically meaningful features from images, but we were doing that with no help from the captions. Obviously the words in a caption ought to be helpful in extracting appropriate semantic categories from images. And similarly, the images ought to be helpful in disambiguating what the words in the caption mean. So the idea is that we're going to train a great big net that gets as its input standard computer vision feature vectors extracted from images, and bag-of-words representations of captions, and learns how the two input representations are related to each other.
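As a hedged illustration of the two input representations (the vocabulary, sizes, and data here are all made up, not taken from the actual system), a bag-of-words caption vector might be built like this:

```python
import numpy as np

# Hypothetical toy vocabulary; the real system uses a much larger one.
vocabulary = {"dog": 0, "grass": 1, "running": 2, "ball": 3}

def caption_to_word_counts(caption, vocab):
    """Turn a caption into a bag-of-words count vector over the vocabulary."""
    counts = np.zeros(len(vocab))
    for word in caption.lower().split():
        if word in vocab:
            counts[vocab[word]] += 1
    return counts

image_features = np.random.randn(128)   # stand-in for the CV feature vector
caption_counts = caption_to_word_counts("dog running on grass", vocabulary)
```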

At the end of the video, I'll show you a movie of the final network using words to create feature vectors for images and then showing you the closest image in its database, and also using images to create bags of words.

I'm now going to describe some work by Nitish Srivastava, who's one of the TAs for this course, and Ruslan Salakhutdinov, that will appear shortly. The goal is to build a joint density model of captions and of images, except that the images are represented by the features standardly used in computer vision, rather than by the raw pixels. This needs a lot more computation than building a joint density model of labels and digit images, which we saw earlier in the course.

So what they did was they first trained a multi-layer model of images alone. That is, it's really a multi-layer model of the features they extracted from images using the standard computer vision features. Then, separately, they trained a multi-layer model of the word-count vectors from the captions. Once they had trained both of those models, they added a new top layer connected to the top layers of both of the individual models. After that, they used further joint training of the whole system, so that each modality could improve the earlier layers of the other modality.

Instead of using a deep belief net, which is what you might expect, they used a deep Boltzmann machine, which has symmetric connections between all adjacent pairs of layers. The further joint training of the whole deep Boltzmann machine is then what allows each modality to change the feature detectors in the early layers of the other modality. That's the reason they used a deep Boltzmann machine.
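To make the procedure concrete, here is a minimal, hedged sketch in numpy. It is not the authors' code, and it glosses over important details of the real system (which uses Gaussian units for the image features and a replicated-softmax model for the word counts); the data and layer sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.01):
    """One-step contrastive divergence for a toy RBM; biases omitted."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(epochs):
        h0 = sigmoid(data @ W)                 # positive phase
        v1 = sigmoid(h0 @ W.T)                 # one-step reconstruction
        h1 = sigmoid(v1 @ W)                   # negative phase
        W += lr * (data.T @ h0 - v1.T @ h1) / len(data)
    return W

# Toy stand-ins for the real inputs: image feature vectors and
# word-count vectors from the captions.
image_feats = rng.random((100, 128))
word_counts = rng.random((100, 2000))

# Pretrain a separate multi-layer model for each modality.
W_img1 = train_rbm(image_feats, 64)
W_img2 = train_rbm(sigmoid(image_feats @ W_img1), 32)
W_txt1 = train_rbm(word_counts, 64)
W_txt2 = train_rbm(sigmoid(word_counts @ W_txt1), 32)

# A new top layer is then connected to the top layers of both models.
img_top = sigmoid(sigmoid(image_feats @ W_img1) @ W_img2)
txt_top = sigmoid(sigmoid(word_counts @ W_txt1) @ W_txt2)
W_joint = train_rbm(np.hstack([img_top, txt_top]), 48)
# The real system then jointly fine-tunes the whole thing as a deep
# Boltzmann machine, which this sketch does not reproduce.
```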

They could also have used a deep belief net and done generative fine-tuning with contrastive wake-sleep, but the fine-tuning algorithm for deep Boltzmann machines may well work better. This leaves the question of how they pretrained the hidden layers of a deep Boltzmann machine, because what we've seen so far in the course is that if you train a stack of restricted Boltzmann machines and combine them together into a single composite model, what you get is a deep belief net, not a deep Boltzmann machine. So I'm now going to explain how, despite what I said earlier in the course, you can actually pre-train a stack of restricted Boltzmann machines in such a way that you can then combine them to make a deep Boltzmann machine.

The trick is that the top and the bottom restricted Boltzmann machines in the stack have to be trained with weights that are twice as big in one direction as in the other. So the bottom Boltzmann machine, the one that looks at the visible units, is trained with the bottom-up weights being twice as big as the top-down weights. Apart from that, the weights are symmetrical. So this is what I call scale symmetric: the bottom-up weights are always exactly twice as big as their top-down counterparts. This can be justified, and I'll show you the justification in a little while.

The next restricted Boltzmann machine in the stack is trained with genuinely symmetrical weights. I've called them 2W2 here, rather than W2, for reasons you'll see later. We can keep training restricted Boltzmann machines like that, with genuinely symmetrical weights. But then the top one in the stack has to be trained with the bottom-up weights being half of the top-down weights. So again these are scale-symmetric weights, but now the top-down weights are twice as big as the bottom-up weights. That's the opposite of what we had when we trained the first restricted Boltzmann machine in the stack.

After having trained these three restricted Boltzmann machines, we can then combine them to make a composite model, and the composite model looks like this. For the restricted Boltzmann machine in the middle, we simply halve its weights. That's why they were 2W2 to begin with.

Â 5:01

For the one at the bottom, we've halved the up-going weights but kept the down-going weights the same. And for the one at the top, we've halved the down-going weights and kept the up-going weights the same.
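Here is a tiny sketch of just that combination step, with made-up weight matrices and layer sizes standing in for the pretrained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((128, 64))   # V  <-> H1 weights
W2 = rng.standard_normal((64, 32))    # H1 <-> H2 weights
W3 = rng.standard_normal((32, 16))    # H2 <-> H3 weights

# How the three RBMs in the stack were trained:
rbm_bottom = {"up": 2 * W1, "down": W1}       # bottom-up twice top-down
rbm_middle = {"up": 2 * W2, "down": 2 * W2}   # genuinely symmetric, at 2*W2
rbm_top    = {"up": W3, "down": 2 * W3}       # top-down twice bottom-up

# Combining them into a deep Boltzmann machine: halve the bottom RBM's
# up-going weights, halve the middle RBM's weights, and halve the top RBM's
# down-going weights. Every connection then ends up symmetric.
dbm_weights = [
    rbm_bottom["up"] / 2,   # == W1 == the bottom RBM's down-going weights
    rbm_middle["up"] / 2,   # == W2
    rbm_top["down"] / 2,    # == W3 == the top RBM's up-going weights
]
```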

Now the question is: why do we do this funny business of halving the weights? The explanation is quite complicated, but I'll give you a rough idea of what's going on. If you look at the layer H1, we have two different ways of inferring the states of its units in the stack of restricted Boltzmann machines on the left. We can either infer the states of H1 bottom-up from V, or we can infer the states of H1 top-down from H2. When we combine these Boltzmann machines together, what we're going to do is take an average of those two ways of inferring H1. And to take a geometric average, what we need to do is halve the weights. So we're going to use half of what the bottom-up model says, that's half of 2W1, and we're going to use half of what the top-down model says, that's half of 2W2. And if you look at the deep Boltzmann machine on the right, that's exactly what's being used to infer the state of H1. In other words, if you're given the states in H2, and you're given the states in V, those are the weights you'll use for inferring the states of H1.
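In code, that inference rule looks something like this (a sketch with toy layer sizes, ignoring biases):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

W1 = rng.standard_normal((128, 64))   # symmetric DBM weights V  <-> H1
W2 = rng.standard_normal((64, 32))    # symmetric DBM weights H1 <-> H2

v  = rng.integers(0, 2, 128)          # current states of the visible units
h2 = rng.integers(0, 2, 32)           # current states of H2

# H1 gets half of the bottom RBM's bottom-up input (half of 2*W1) and half
# of the middle RBM's top-down input (half of 2*W2), i.e. W1 and W2:
p_h1 = sigmoid(v @ W1 + W2 @ h2)
```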

The reason we need to halve the weights is so that we don't double count. You see, in the Boltzmann machine on the right, the state of H2 already depends on V, at least it does after we've done some settling down in the Boltzmann machine. So if we were to use the full bottom-up input coming from the first restricted Boltzmann machine in the stack, and the full top-down input coming from the second Boltzmann machine in the stack, we'd be counting the evidence twice, because we'd be inferring H1 from V, and we'd also be inferring it from H2, which itself depends on V. In order not to double count the evidence, we have to halve the weights.

That's a very high-level, and perhaps not totally clear, description of why we have to halve the weights. If you want to know the mathematical details, you can go and read the paper. But that's what's going on, and that's why we need to halve the weights: so that the intermediate layers can be doing geometric averaging of the two different models of that layer, coming from the two different restricted Boltzmann machines in the original stack.
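For anyone who wants the missing step, a short calculation (assuming standard logistic units) shows why halving the weights amounts to a geometric average of the two models' predictions for a unit in H1:

```latex
% Bottom-up and top-down predictions for a unit in H1:
\[
p_1 = \sigma(a_1), \quad a_1 = 2W_1 v
\qquad\text{and}\qquad
p_2 = \sigma(a_2), \quad a_2 = 2W_2 h_2 .
\]
% Since p/(1-p) = e^{a} whenever p = \sigma(a), the normalized geometric mean
\[
p \propto \sqrt{p_1 p_2}, \qquad 1 - p \propto \sqrt{(1-p_1)(1-p_2)}
\]
% has odds e^{(a_1+a_2)/2}, so
\[
p = \sigma\!\left(\tfrac{a_1 + a_2}{2}\right) = \sigma(W_1 v + W_2 h_2),
\]
% which is exactly the inference rule with the halved weights.
```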
