In this video, I'm going to talk about applying deep autoencoders to document retrieval.

There was a method developed some time ago called latent semantic analysis, which amounts to applying principal components analysis to vectors of word counts extracted from documents. The codes produced by latent semantic analysis can then be used for judging similarity between documents, so they can be used for document retrieval. Obviously, if deep autoencoders work much better than PCA, we would expect to be able to extract much better codes using a deep autoencoder than using latent semantic analysis. And Ruslan Salakhutdinov and I showed that that was indeed the case. Using a big database of documents, we showed that ten components extracted with a deep autoencoder are actually worth more than 50 components extracted with a linear method like latent semantic analysis. We also showed that if you make the code very small, with just two components, you can use those two components for visualizing documents as points in a two-dimensional map. And this, again, works much better than just extracting the first two principal components.

To find documents that are similar to a query document, the first thing we do is convert each document into a big bag of words. In other words, we have a vector of word counts that ignores the order of the words. This clearly throws away quite a lot of information, but it also retains a lot of information about the topic of the document. We ignore words like "the" or "over", which are called stop words, because they don't carry much information about the topic.

So if you look on the right, I've done the counts for various words, and they're actually the counts for the document on the left. If you look at which words have nonzero counts ("vector", "count", "query", "reduce", "bag", "word"), that tells you quite a lot about what the document is about.
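As a rough sketch of this preprocessing step (the stop-word list and the tiny vocabulary here are made-up stand-ins, not the ones actually used):

```python
from collections import Counter

# A tiny stand-in stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "or", "over", "and", "is", "to"}

def bag_of_words(text, vocabulary):
    """Convert a document to a vector of word counts, ignoring word order
    and dropping stop words. `vocabulary` fixes the vector's dimensions."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    return [counts.get(word, 0) for word in vocabulary]

vocab = ["vector", "count", "query", "reduce", "bag", "word"]
doc = "reduce the query vector to a bag of word count entries"
print(bag_of_words(doc, vocab))  # [1, 1, 1, 1, 1, 1]
```

In the real system the vocabulary would be the 2,000 most common non-stop words, giving one 2,000-dimensional count vector per document.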

Now, we could compare the word counts of the query document with the word counts of millions of other documents, but that would involve comparing quite big vectors. In fact, we use vectors of size 2,000, so that would be slow. Alternatively, we could reduce each query vector to a much smaller vector that still contains most of the information about the content.

Here's how we do the reduction. We take a deep autoencoder and compress the 2,000 word counts down to ten real numbers, from which we can reconstruct the 2,000 word counts, although we can't reconstruct them very well. We train the neural network to reproduce its input vector as its output vector as well as possible, and that forces it to put as much information about the input into those ten numbers as possible. We can then compare documents using just ten numbers, which is going to be much faster.

There's one problem with this: word counts aren't quite the same as pixels or real values.

What we do is divide the counts in a bag of words by the total number of non-stop words, which converts the vector of counts into a probability vector whose entries add up to one. You can think of it as the probability of getting a particular word if you picked a word at random from the document, as long as it's not a stop word. At the output of the autoencoder, we use a great big 2,000-way softmax, and our target values are the word probabilities we get when we reduce the count vector to a probability vector.
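A minimal sketch of that conversion, assuming the stop words have already been removed from the counts:

```python
def counts_to_probabilities(counts):
    """Divide each count by the total number of non-stop words, turning the
    count vector into a probability vector that sums to one. This is the
    target for the 2,000-way softmax at the autoencoder's output."""
    n = sum(counts)  # N = total number of non-stop words in the document
    return [c / n for c in counts]

probs = counts_to_probabilities([3, 1, 0, 4])
print(probs)       # [0.375, 0.125, 0.0, 0.5]
print(sum(probs))  # 1.0
```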

Â There's one further trick we have to do. We treat the word counts as probabilities

Â when we're reconstructing them. But when we're using them to activate the

Â first hidden layer, we multiply all the weights by N.

Â And that's because we have N different observations from that probability

Â distribution. If we left them as probabilities, the

Â input units would have very small activities and wouldn't provide much

Â input to the first hidden layer. So, we have this funny property that for

Â the first restricted Boltzmann machine, the bottom up weights, are N times bigger

Â than the top down weights. So, how well does this work?
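A toy illustration of why this works (made-up weights, not the actual RBM code): scaling the bottom-up weights by N is the same as driving the hidden layer with the raw counts.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([3.0, 1.0, 0.0, 4.0])  # bag-of-words counts
N = counts.sum()                          # number of non-stop words
probs = counts / N                        # probability vector (softmax target)

W = rng.normal(size=(4, 10))              # top-down weights (visible x hidden)

# Bottom-up: multiply the weights by N, because the probability vector
# stands for N observations from the word distribution. Without this,
# the input units' activities would be tiny.
hidden_input = probs @ (N * W)

# Equivalent to driving the hidden layer with the raw counts:
assert np.allclose(hidden_input, counts @ W)
```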

We trained using bags of 2,000 words on 4,000 business documents from the Reuters dataset. These documents had been hand-labelled with about 100 different categories. We first trained a stack of restricted Boltzmann machines and then fine-tuned with backpropagation, using a 2,000-way softmax as the output. We then tested on a different set of 4,000 documents.

Â 5:09

To test, you pick one of the test documents to be the query, and then you rank-order all the other test documents using the cosine of the angle between the ten-dimensional vectors that the autoencoder gives you. You repeat this for each of the 4,000 possible test documents, and then you plot the number of documents you retrieve (that is, how far down the ranked list you go) against the proportion that are in the same hand-labelled class as the query document. This is not a very good measure of the quality of the retrieval, but we're going to use the same measure for comparing with LSA, so at least it's a fair comparison.
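The ranking step might be sketched like this, with random ten-dimensional codes standing in for the real autoencoder outputs:

```python
import numpy as np

def rank_by_cosine(query_code, codes):
    """Rank documents by the cosine of the angle between their codes and
    the query's code, most similar first."""
    q = query_code / np.linalg.norm(query_code)
    c = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = c @ q                 # cosine similarity to the query
    return np.argsort(-sims)     # document indices, highest cosine first

rng = np.random.default_rng(1)
codes = rng.normal(size=(5, 10))  # ten-dimensional codes for 5 test documents
order = rank_by_cosine(codes[0], codes)
print(order[0])  # 0 -- the query document is most similar to itself
```

The retrieval curve then comes from checking, at each cut-off down this ranked list, what fraction of the retrieved documents share the query's hand-labelled class.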

So, here's the accuracy of the retrieval as a function of the number of retrieved documents. You can see that an autoencoder using a code of just ten real numbers does better than latent semantic analysis using 50 real numbers. And, of course, it's five times less work per document once you've got the code. Latent semantic analysis with ten real numbers is much worse.

We can also do the same thing where we reduce to two real numbers, and then, instead of doing retrieval, we just plot all the documents in a map, coloring the two-dimensional point for each document by the document's class. So we took the major classes of the documents and gave each of them a different color. Then we applied PCA to the log of one plus the count. The point of taking the log is that it suppresses very big counts, which tends to make PCA work better.
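A minimal sketch of that PCA baseline (random stand-in counts, plain NumPy rather than whatever was actually used):

```python
import numpy as np

rng = np.random.default_rng(2)
counts = rng.poisson(3.0, size=(100, 20)).astype(float)  # fake word counts

# log(1 + count) squashes very large counts, which tends to help PCA
x = np.log1p(counts)

# PCA via SVD: center the data, then project onto the top two components
x_centered = x - x.mean(axis=0)
_, _, vt = np.linalg.svd(x_centered, full_matrices=False)
coords = x_centered @ vt[:2].T  # (100, 2): one 2-D point per document

print(coords.shape)  # (100, 2)
```

Each row of `coords` is the two-dimensional point for one document; in the visualization, each point is colored by the document's class.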

Â 7:11

This is the distribution you get. As you can see, there is some separation of the classes: the green class is in one place, the red class is in a slightly different place, but the classes are very mixed up.

Then we did the same thing using a deep autoencoder to reduce the documents to two numbers and, again, plotted the documents in a two-dimensional space using those two numbers as the coordinates. And here's what we got. It's a much better layout; it tells you much more about the structure of the data set. You can see the different classes, and you can see that they're quite well separated. We assume that the documents in the middle are ones that didn't have many words in them, so it was hard to tell which class they belonged to.

A visual display like this could be very useful. If, for example, you saw that one of the green dots was the accounts and earnings report from Enron, you probably wouldn't want to buy shares in a company with a green dot nearby.