0:00

In this video, we're going to look at the issue of training deep autoencoders.

People thought of these a long time ago, in the mid-1980s. But they simply couldn't train them well enough for them to do significantly better than principal components analysis.

There were various papers published about them, but no good demonstrations of impressive performance. After we developed methods of pre-training deep networks one layer at a time, Russ Salakhutdinov and I applied these methods to pre-training deep autoencoders, and for the first time we got much better representations out of deep autoencoders than we could get from principal components analysis.

Deep autoencoders always seemed like a really nice way to do dimensionality reduction, because it seemed like they should work much better than principal components analysis. They provide flexible mappings in both directions, and the mappings can be non-linear. Their learning time should be linear, or better, in the number of training cases. And after they've been learned, the encoding part of the network is fairly fast, because it's just a matrix multiply for each layer.
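As a rough sketch of why encoding is cheap: each layer is one matrix multiply followed by a nonlinearity, so the cost is linear in the number of cases. The layer sizes and random weights below are purely illustrative (the weights stand in for whatever training would produce):

```python
import numpy as np

def logistic(x):
    """Elementwise logistic (sigmoid) nonlinearity."""
    return 1.0 / (1.0 + np.exp(-x))

def encode(pixels, weights, biases):
    """Run the encoder: one matrix multiply (plus nonlinearity) per layer.

    The central code layer is kept linear, giving real-valued activities.
    """
    h = pixels
    for W, b in zip(weights[:-1], biases[:-1]):
        h = logistic(h @ W + b)
    return h @ weights[-1] + biases[-1]   # linear code layer

# Illustrative sizes: 784 pixels, three hidden layers, a 30-unit code.
rng = np.random.default_rng(0)
sizes = [784, 1000, 500, 250, 30]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

code = encode(rng.random((5, 784)), weights, biases)   # 5 images -> 5 codes
```

Doubling the number of images doubles the work of each matrix multiply, which is what makes the trained encoder fast to apply.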

Unfortunately, it was very difficult to optimize deep autoencoders using back propagation. Typically, people tried small initial weights, and then the back-propagated gradient died, so for deep networks they never got off the ground. But now we have much better ways to optimize them: we can use unsupervised layer-by-layer pre-training, or we can simply initialize the weights sensibly, as in echo state networks.

The first really successful deep autoencoders were learned by Russ Salakhutdinov and me in 2006.

We applied them to the MNIST digits. So we started with images with 784 pixels, and we then encoded those, via three hidden layers, into 30 real-valued activities in a central code layer. We then decoded those 30 real-valued activities back to 784 reconstructed pixels.

We used a stack of restricted Boltzmann machines to initialize the weights used for encoding, and we then took the transposes of those weights and initialized the decoding network with them. So initially, the 784 pixels were reconstructed using a weight matrix that was just the transpose of the weight matrix used for encoding them. But after the four restricted Boltzmann machines had been trained and unrolled to give the transposes for decoding, we then applied back propagation to minimize the reconstruction error of the 784 pixels. In this case we were using a cross-entropy error, because the pixels were represented by logistic units. So that error was back-propagated through this whole deep net.

And once we started back-propagating the error, the weights used for reconstructing the pixels became different from the weights used for encoding them, although they typically stayed fairly similar. This worked very well.
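The "unrolling" step can be sketched in a few lines of numpy. This is not the authors' code: the weights below are random stand-ins for what the stack of RBMs would actually produce, and only the forward pass and the cross-entropy error are shown, not the back propagation itself.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
sizes = [784, 1000, 500, 250, 30]        # pixels -> three hidden layers -> code

# Stand-ins for the encoding weights a stack of RBMs would provide.
enc_W = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes, sizes[1:])]
enc_b = [np.zeros(n) for n in sizes[1:]]

# "Unrolling": the decoder starts out as the transposes of the encoder weights.
dec_W = [W.T.copy() for W in reversed(enc_W)]
dec_b = [np.zeros(n) for n in reversed(sizes[:-1])]

def forward(pixels):
    h = pixels
    for W, b in zip(enc_W[:-1], enc_b[:-1]):
        h = logistic(h @ W + b)
    code = h @ enc_W[-1] + enc_b[-1]      # linear 30-unit code layer
    h = code
    for W, b in zip(dec_W, dec_b):
        h = logistic(h @ W + b)           # logistic output units for the pixels
    return code, h

def cross_entropy(pixels, recon, eps=1e-12):
    """Reconstruction error for logistic pixel units."""
    return -np.mean(pixels * np.log(recon + eps)
                    + (1 - pixels) * np.log(1 - recon + eps))

x = rng.random((5, 784))                  # stand-in for 5 MNIST images
code, recon = forward(x)
err = cross_entropy(x, recon)
# Back-propagating err through all eight weight matrices is what then lets
# the decoder weights drift away from the exact transposes.
```

Initially every decoding matrix is exactly the transpose of its encoding partner; fine-tuning with back propagation breaks that tie, though as noted above the two usually stay fairly similar.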

Â 3:24

So if you look at the first row, that's one random sample from each digit class. If you look at the second row, that's the reconstruction of the random sample by the deep autoencoder that uses 30 linear hidden units in its central layer. So the data has been compressed to 30 real numbers and then reconstructed.

If you look at the eight, you can see that the reconstruction is actually better than the original eight: it's got rid of the little defect in the eight, because it doesn't have the capacity to encode it. If you compare that with linear principal components analysis, you can see the autoencoder is much better. A linear mapping to 30 real numbers cannot do nearly as good a job of representing the data.
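For reference, the linear baseline being compared against can be sketched as 30-component PCA via the SVD. The random matrix here is just a stand-in for the image data; the point is that both the code and the reconstruction are purely linear functions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 784))                # stand-in for 100 MNIST images

# PCA: project onto the top 30 principal components, then map back.
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:30]                      # 30 directions: a purely linear code
codes = Xc @ components.T                 # each image -> 30 real numbers
recon = codes @ components + mean         # best linear reconstruction from 30 numbers
```

Because both mappings are single matrix multiplies, PCA cannot exploit any non-linear structure in the digits, which is why the deep autoencoder's reconstructions look so much better from the same 30 numbers.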
