0:00

The next few videos are about using the backpropagation algorithm to learn a feature representation of the meaning of a word.

I'm going to start with a very simple case from the 1980s, when computers were very slow. It's a small case, but it illustrates the idea of how you can take relational information and use the backpropagation algorithm to turn that relational information into feature vectors that capture the meanings of words.

This diagram shows a simple family tree in which, for example, Christopher and Penelope are married and have children Arthur and Victoria. What we'd like is to train a neural network to understand the information in this family tree.

We've also given it another family tree, of Italian people, which has pretty much the same structure as the English tree. Perhaps when it tries to learn both sets of facts, the neural net will be able to take advantage of that analogy.

The information in these family trees can be expressed as a set of propositions, if we make up names for the relationships depicted by the trees.

So we're going to use the relationships son, daughter, nephew, niece, father, mother, uncle, aunt, brother, sister, husband, and wife. Using those relationships, we can write down a set of triples such as: Colleen has-father James, Colleen has-mother Victoria, and James has-wife Victoria.
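A concrete way to hold these facts, sketched in Python (only the three example triples from above are included, and the hyphenated relationship names are just the made-up labels listed earlier):

```python
# Each fact is a (person1, relationship, person2) triple.
# Only the three example facts mentioned above are listed here.
triples = [
    ("Colleen", "has-father", "James"),
    ("Colleen", "has-mother", "Victoria"),
    ("James", "has-wife", "Victoria"),
]

def complete(triples, person1, relation):
    """Return every person2 for which (person1, relation, person2) is a known fact."""
    return [p2 for (p1, r, p2) in triples if p1 == person1 and r == relation]

print(complete(triples, "Colleen", "has-father"))  # -> ['James']
```

Completing the third term of a triple from the first two, as this helper does by lookup, is exactly the task the network will later be trained to do by prediction.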

In the nice, simple families depicted in the diagram, the third proposition follows from the previous two. Similarly, the third proposition in the next set follows from the previous two. So the relational learning task, that is, the task of learning the information in those family trees, can be viewed as figuring out the regularities in a large set of triples that express the information in those trees.

Now, the obvious way to express regularities is as symbolic rules. For example: X has-mother Y, and Y has-husband Z, implies X has-father Z. We could search for such rules, but this would involve a search through quite a large space, a combinatorially large space, of discrete possibilities.

A very different way to try to capture the same information is to use a neural network that searches through a continuous space of real-valued weights to try to capture the information.
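For contrast with that neural approach, the symbolic rule just mentioned can be sketched directly as a join over stored triples (the two facts below are illustrative, using the relationship names made up earlier):

```python
def father_rule(facts):
    """X has-mother Y and Y has-husband Z  =>  X has-father Z."""
    fact_set = set(facts)
    derived = set()
    for (x, r1, y) in fact_set:
        if r1 != "has-mother":
            continue
        for (y2, r2, z) in fact_set:
            if y2 == y and r2 == "has-husband":
                derived.add((x, "has-father", z))
    return derived

facts = [("Colleen", "has-mother", "Victoria"),
         ("Victoria", "has-husband", "James")]
print(father_rule(facts))  # -> {('Colleen', 'has-father', 'James')}
```

Writing one rule is easy; the combinatorial explosion comes from searching over all possible rules of this form.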

The way it's going to do that is we're going to say it has captured the information if it can predict the third term of a triple from the first two terms. So at the bottom of this diagram, we're going to put in a person and a relationship, and the information is going to flow forward through the neural network. What we're going to try to get out of the network, after it has learned, is the person who is related to the first person by that relationship.

The architecture of this net was designed by hand; that is, I decided how many layers it should have.

I also decided where to put bottlenecks to force it to learn interesting representations. What we do is encode the information in a neutral way. Because there are 24 possible people, the block at the bottom of the diagram that says "local encoding of person 1" has 24 neurons, and exactly one of those will be turned on for each training case.
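That local (one-of-N) encoding is trivial to write down; here is a sketch, with a shortened people list standing in for the full 24:

```python
def local_encoding(index, size):
    """A vector with exactly one unit on: no accidental similarity between symbols."""
    v = [0.0] * size
    v[index] = 1.0
    return v

# Illustrative subset; the real net has 24 people and 12 relationships.
people = ["Christopher", "Penelope", "Arthur", "Victoria", "James", "Colleen"]
code = local_encoding(people.index("James"), len(people))
print(code)  # -> [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```

Every pair of such vectors has the same distance from every other pair, which is what makes the encoding "neutral".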

Similarly, there are twelve relationships, and exactly one of the relationship units will be turned on. Then, for a relationship that has a unique answer, we would like exactly one of the 24 output people at the top to turn on to represent the answer.

By using a representation in which exactly one of the neurons is on, we don't accidentally give the network any similarities between people: all pairs of people are equally dissimilar. So we're not cheating by giving the network information about who is like whom. The people, as far as the network is concerned, are uninterpreted symbols.

But now, in the next layer of the network, we've taken the local encoding of person 1 and connected it to a small set of neurons, actually six neurons in this case. Because there are 24 people, it can't possibly dedicate one neuron to each person.

It has to re-represent the people as patterns of activity over those six neurons. What we're hoping is that when it learns these propositions, the way it encodes a person in that distributed pattern of activity will reveal structure in the task, or structure in the domain.

So what we're going to do is train it up on 112 of these propositions. We go through the 112 propositions many times, slowly changing the weights as we go, using backpropagation.
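As a rough sketch of that setup, not the original 1980s code: the layer sizes below follow the description (24-way and 12-way local encodings, 6-unit bottlenecks), but the 12-unit central layer, the learning rate, the initialization, and the made-up training facts are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PEOPLE, N_REL, BOTTLE, CENTRAL = 24, 12, 6, 12

W_p = rng.normal(0, 0.1, (N_PEOPLE, BOTTLE))   # person 1 -> distributed encoding
W_r = rng.normal(0, 0.1, (N_REL, BOTTLE))      # relationship -> distributed encoding
W_c = rng.normal(0, 0.1, (2 * BOTTLE, CENTRAL))
W_d = rng.normal(0, 0.1, (CENTRAL, BOTTLE))    # -> distributed encoding of output person
W_o = rng.normal(0, 0.1, (BOTTLE, N_PEOPLE))   # -> 24-way choice of output person

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(p, r):
    hp = sigmoid(W_p[p])                 # a one-hot input just selects a row of weights
    hr = sigmoid(W_r[r])
    h = np.concatenate([hp, hr])
    c = sigmoid(h @ W_c)
    d = sigmoid(c @ W_d)
    logits = d @ W_o
    e = np.exp(logits - logits.max())    # softmax over the 24 output people
    return e / e.sum(), hp, hr, h, c, d

# Made-up (person, relationship, target-person) index triples for illustration.
facts = [(0, 1, 3), (1, 2, 0), (4, 0, 2), (7, 5, 9), (3, 4, 4)]

def epoch(lr=1.0):
    """One pass through the facts, updating all weights by backpropagation."""
    global W_p, W_r, W_c, W_d, W_o
    total = 0.0
    for p, r, t in facts:
        probs, hp, hr, h, c, d = forward(p, r)
        total -= np.log(probs[t])            # cross-entropy loss
        dlog = probs.copy()
        dlog[t] -= 1.0                       # d(loss)/d(logits)
        dd = (W_o @ dlog) * d * (1 - d)
        dc = (W_d @ dd) * c * (1 - c)
        dh = W_c @ dc
        dhp = dh[:BOTTLE] * hp * (1 - hp)
        dhr = dh[BOTTLE:] * hr * (1 - hr)
        W_o -= lr * np.outer(d, dlog)
        W_d -= lr * np.outer(c, dd)
        W_c -= lr * np.outer(h, dc)
        W_p[p] -= lr * dhp                   # gradient only touches the active row
        W_r[r] -= lr * dhr
    return total / len(facts)

first = epoch()
for _ in range(500):
    last = epoch()
print(first, last)  # the average cross-entropy should fall substantially
```

After training, the rows of `W_p` play the role of the learned feature vectors for each person, which is what the next part of the video inspects.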

After training, we're going to look at the six units in the layer labeled "distributed encoding of person 1" to see what they are doing.

So here are those six units, shown as the big gray blocks. I've laid out the 24 people with the twelve English people in a row along the top and the twelve Italian people in a row underneath. Each of these blocks, you'll see, has 24 blobs in it, and the blobs show you the incoming weights for one of the hidden units in that layer. So, going back to the previous slide.

If you look at the layer that says "distributed encoding of person 1", there are six neurons there, and we're looking at the incoming weights of each of those six neurons.

If you look at the big gray rectangle on the top right, you'll see an interesting structure in the weights. The weights along the top, which come from English people, are all positive, and the weights along the bottom are all negative. That means this unit tells you whether the input person is English or Italian. We never gave it that information explicitly.

But obviously it's useful information to have in this very simple world, because in the family trees we're learning about, if the input person is English, the output person is always English. So just knowing that someone is English already allows you to predict one bit of information about the output, which is to say it halves the number of possibilities.

If you look at the gray blob immediately below that, the second one down on the right, you'll see that it has four big positive weights at the beginning. Those correspond to Christopher and Andrew, or their Italian equivalents. Then it has some smaller weights.

Then it has two big negative weights, which correspond to Colin or his Italian equivalent. Then there are four more big positive weights, corresponding to Penelope and Christine or their Italian equivalents. And right at the end, there are two big negative weights, corresponding to Charlotte or her Italian equivalent.

By now you've probably realized that this neuron represents what generation somebody is in. It has big positive weights to the oldest generation, big negative weights to the youngest generation, and intermediate weights, roughly zero, for the intermediate generation. So it's really a three-valued feature, and it's telling you the generation of the person.

Finally, if you look at the bottom gray rectangle on the left-hand side, you'll see that it has a different structure. If we look at the negative weights in the top row of that unit, it has negative weights to Andrew, James, Charles, Christine, and Jennifer. And if you look at the English family tree, you'll see that Andrew, James, Charles, Christine, and Jennifer are all in the right-hand branch of the family tree. So that unit has learned to represent which branch of the family tree someone is in.

Again, that's a very useful feature to have for predicting the output person, because if you know it's a close family relationship, you expect the output to be in the same branch of the family tree as the input.

So the units in the bottlenecks have learned to represent features of people that are useful for predicting the answer. And notice, we didn't tell it anything about what features to use. We never mentioned things like nationality, branch of the family tree, or generation.

It figured out that those are good features for expressing the regularities in this domain.

Of course, those features are only useful if the other bottlenecks, the one for relationships and the one near the top of the network before the output person, use similar representations, and the central layer is able to say how the features of the input person and the features of the relationship predict the features of the output person. So, for example, if the input person is in generation three, and the relationship requires the output person to be one generation up, then the output person is in generation two. But notice that to capture that rule, you have to extract appropriate features at the first hidden layer and the last hidden layer of the network, and you have to make the units in the middle relate those features correctly.

Another way to see that the network works

is to train it on all but a few of the triples, and see if it can complete those triples correctly. So does it generalize? There are 112 triples; I trained it on 108 of them and tested it on the remaining four. I did that several times, and it got either two or three of those four right. That's not so bad for a 24-way choice. It's true it makes mistakes, but it didn't have much training data; there aren't enough triples in this domain to really nail down the regularities very well. And it does much better than chance.
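The hold-out test described here is just a random split of the triples; a sketch follows, with stand-in triples, since the full set of 112 facts isn't listed in the video:

```python
import random

def holdout_split(triples, n_test=4, seed=0):
    """Shuffle the triples, hold out n_test for testing, train on the rest."""
    rng = random.Random(seed)
    shuffled = triples[:]
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in triples; the real experiment used the 112 family-tree facts.
triples = [("p%d" % i, "rel%d" % (i % 12), "p%d" % ((i * 7) % 24))
           for i in range(112)]
train, test = holdout_split(triples)
print(len(train), len(test))  # -> 108 4
```

Repeating the split with different seeds and averaging, as described above, gives a less noisy estimate of generalization.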

If you train it on a much bigger data set, it can generalize from a much smaller fraction of the data. So if you have thousands and thousands of relationships, you only need to show it a small percentage before it can start guessing the other ones correctly.

That research was done in the 1980s, and it was a way of showing that backpropagation could learn interesting features. It was a toy example. Now we have much bigger computers, and we have databases of millions of relational facts.

Many of these facts are of the form A R B, meaning A has relationship R to B. We could imagine training a net to discover feature vector representations of A and R that allow it to predict the feature vector representation of B. If we did that, it would be a very good way of cleaning a database. It wouldn't necessarily be able to make perfect predictions, but it could find things in the database that it thought were highly implausible. So if the database contained a fact such as "Bach was born in 1902", it could probably realize that was wrong, because Bach is a much older kind of person, and everything else he's related to is much older than 1902.

Instead of actually using the first two terms to predict the third term, we could use the whole set of terms, three of them in this case but possibly more, and predict the probability that the fact is correct. To train a net to do that, we'd need examples of a whole bunch of correct facts, for which we'd ask it to give a high output. We'd also need a good source of incorrect facts, for which we'd ask it to give a low output.
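One cheap source of incorrect facts, in the spirit of what's described, is to corrupt true triples by swapping in a random third term. The scheme below is my illustration, not something specified in the video:

```python
import random

def make_negatives(true_facts, entities, seed=0):
    """For each true triple, corrupt its third term with a random entity,
    skipping corruptions that happen to be true facts. During training,
    negatives get target output 0 and true facts get target output 1."""
    rng = random.Random(seed)
    true_set = set(true_facts)
    negatives = []
    for (a, r, b) in true_facts:
        while True:
            fake = (a, r, rng.choice(entities))
            if fake not in true_set:
                negatives.append(fake)
                break
    return negatives

facts = [("Bach", "born-in", "1685"), ("Colleen", "has-father", "James")]
entities = ["1685", "1902", "James", "Victoria", "Bach", "Colleen"]
print(make_negatives(facts, entities))
```

A net trained on this mix of high-target true facts and low-target corruptions is exactly the plausibility scorer described above.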
