
Another very important family of unsupervised learning methods that fall into the transformation category are known as dimensionality reduction algorithms. As the name suggests, this kind of transform takes your original dataset that might contain, say, 200 features and finds an approximate version of the dataset that uses, say, only 10 dimensions.

One very common need for dimensionality reduction arises when first exploring a dataset, to understand how the samples may be grouped or related to each other by visualizing it using a two-dimensional scatterplot.

One very important form of dimensionality reduction is called principal component analysis, or PCA. Intuitively, what PCA does is take your cloud of original data points and find a rotation of it so that the dimensions are statistically uncorrelated. PCA then typically drops all but the most informative initial dimensions, which capture most of the variation in the original dataset.

Here's a simple example of what I mean with a synthetic two-dimensional dataset. If we have two original features that are highly correlated, represented by this cloud of points, PCA will rotate the data so that the direction of highest variance (called the first principal component, which lies along the long axis of the cloud) becomes the first dimension.

It will then find the direction at right angles to the first that maximally captures the remaining variance. This is the second principal component. In two dimensions, there's only one possible such direction at right angles to the first principal component, but in higher dimensions there would be infinitely many.

With more than two dimensions, the process of finding successive principal components at right angles to the previous ones continues until the desired number of principal components is reached.

One result of applying PCA is that we now know the best one-dimensional approximation to the original two-dimensional data. In other words, we can take any data point that used two features before, x and y, and approximate it using just one feature: its location when projected onto the first principal component.
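As a rough sketch of this idea in code, here's what that one-dimensional approximation looks like on a small synthetic dataset of two correlated features (the data and variable names are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2D data: two highly correlated features
rng = np.random.RandomState(0)
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=500)])

# Keep only the first principal component: the best
# one-dimensional approximation to the 2D point cloud
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

print(X_1d.shape)                     # (500, 1)
print(pca.explained_variance_ratio_)  # most of the variance is captured
```

Because the two features are so strongly correlated, the first principal component alone accounts for the large majority of the variance here.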

Here's an example of using scikit-learn to apply PCA to a higher-dimensional dataset: the breast cancer dataset. To perform PCA, we import the PCA class from sklearn.decomposition. It's important to first transform the dataset so that each feature's range of values has zero mean and unit variance. We can do this using the fit and transform methods of the StandardScaler class, as shown here.

We then create the PCA object, specify that we want to retain just the first two principal components to reduce the dimensionality to just two columns, and call the fit method on our normalized data. This sets up PCA so that it learns the right rotation of the dataset. We can then apply this fitted PCA object to project all the points in our original input dataset into this new two-dimensional space. Notice that since we're not doing supervised learning and evaluating a model against a test set, we don't have to split our dataset into training and test sets.

You can see that if we take the shape of the array returned from PCA, it has transformed our original dataset with 30 features into a new array that has just two columns, essentially expressing each original data point in terms of two new features representing its position in this new two-dimensional PCA space.
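As a sketch of the steps just described (the variable names are my own; the lecture's notebook may differ slightly):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)

# Normalize each feature to zero mean and unit variance first
X_normalized = StandardScaler().fit_transform(X)

# Learn the rotation, keeping only the first two principal components
pca = PCA(n_components=2)
pca.fit(X_normalized)

# Project every point into the new two-dimensional PCA space
X_pca = pca.transform(X_normalized)

print(X.shape)      # (569, 30)
print(X_pca.shape)  # (569, 2)
```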

We can then create a scatterplot that uses these two new features to see how the data forms clusters. In this example, we've used a dataset that has labels for supervised learning, namely the malignant and benign labels on cancer cells, so we can see how well PCA serves to find clusters in the data.

Here's the result of plotting all the 30-feature data samples using the two new features computed with PCA. We can see that the malignant and benign cells do indeed tend to cluster into two groups in this space.

In fact, we could now apply a linear classifier to this two-dimensional representation of the original dataset, and we can see that it would likely do fairly well. This illustrates another use of dimensionality reduction methods like PCA: to find informative features that can then be used in a later supervised learning stage.
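To make that claim concrete, here's one possible sketch; the choice of LogisticRegression and the train/test split are my own additions, not from the lecture:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_normalized = StandardScaler().fit_transform(X)

# Reduce the 30 features to the first two principal components
X_pca = PCA(n_components=2).fit_transform(X_normalized)

# Train a simple linear classifier on just those two features
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data
```

Even with only two of the original 30 dimensions, the classes separate well enough for a linear boundary to do a reasonable job.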

We can create a heat map that visualizes the first two principal components of the breast cancer dataset to get an idea of what feature groupings each component is associated with. Note that we can get the arrays representing the two principal component axes that define the PCA space using the components_ attribute, which is filled in after the PCA fit method is used on the data.

We can see that the first principal component is all positive, showing a general correlation between all 30 features. In other words, they tend to vary up and down together. The second principal component has a mixture of positive and negative signs; in particular, we can see a cluster of negatively signed features that co-vary together, in the opposite direction from the remaining features.

Looking at the names, it makes sense that this subset would co-vary together. We see the pair mean texture and worst texture, and the pair mean radius and worst radius, varying together, and so on.
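A sketch of how to inspect those component weights (the heat map itself would be drawn with a plotting library such as matplotlib; here we just examine the array):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)
X_normalized = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_normalized)

# components_ has one row per principal component and one
# column (weight) per original feature: shape (2, 30)
print(pca.components_.shape)  # (2, 30)

# Each row's weights show how strongly each original feature
# contributes to that component -- this is what the heat map shows
first_pc, second_pc = pca.components_
```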

PCA is a good initial tool for exploring a dataset, but it may not be able to find the more subtle groupings that produce better visualizations for more complex datasets.

There is a family of unsupervised algorithms called manifold learning algorithms that are very good at finding low-dimensional structure in high-dimensional data and are very useful for visualization. One classic example of a low-dimensional subset in a high-dimensional space is this dataset in three dimensions, where the points all lie on a two-dimensional sheet with an interesting shape. This lower-dimensional sheet within a higher-dimensional space is called a manifold. PCA is not sophisticated enough to find this interesting structure.

One widely used manifold learning method is called multi-dimensional scaling, or MDS. There are many flavors of MDS, but they all have the same general goal: to take a high-dimensional dataset and project it onto a lower-dimensional space (in most cases, a two-dimensional page) in a way that preserves information about how close the points in the original data space are to each other. In this way, you can find and visualize clustering behavior in your high-dimensional data.

Using a technique like MDS in scikit-learn is quite similar to using PCA. As with PCA, each feature should be normalized so that its feature values have zero mean and unit variance. After importing the MDS class from sklearn.manifold and transforming the input data, you create the MDS object, specifying the number of components (typically set to two dimensions for visualization). You then fit the object using the transformed data, which learns the mapping, and apply the MDS mapping to the transformed data.

Here's an example of applying MDS to the fruit dataset. You can see it does a pretty good job of visualizing the fact that the different fruit types do indeed tend to cluster into groups.
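The fruit dataset used in the lecture isn't bundled with scikit-learn, so this sketch substitutes the built-in wine dataset; the steps are otherwise the same:

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import MDS

X, y = load_wine(return_X_y=True)

# As with PCA, normalize each feature first
X_normalized = StandardScaler().fit_transform(X)

# Map the 13-dimensional data down to 2 dimensions for plotting
mds = MDS(n_components=2, random_state=0)
X_mds = mds.fit_transform(X_normalized)

print(X_mds.shape)  # (178, 2)
```

The two columns of X_mds would then serve as the x and y coordinates of the scatterplot, colored by class label.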

An especially powerful manifold learning algorithm for visualizing your data is called t-SNE. t-SNE finds a two-dimensional representation of your data such that the distances between points in the 2D scatterplot match as closely as possible the distances between the same points in the original high-dimensional dataset. In particular, t-SNE gives much more weight to preserving information about distances between points that are neighbors.

Here's an example of t-SNE applied to the images in the handwritten digits dataset. You can see that this two-dimensional plot preserves the neighbor relationships between images that are similar in terms of their pixels. For example, the cluster for most of the digit eight samples is closer to the clusters for the digits three and five, whose handwritten forms can appear more similar, than to, say, the digit one, whose cluster is much farther away.
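A sketch of the digits example (plotting omitted; the random_state is my addition for reproducibility):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 1,797 images of handwritten digits, each 8x8 = 64 pixels
X, y = load_digits(return_X_y=True)

# t-SNE embeds the 64-dimensional pixel vectors into 2D,
# preserving neighbor relationships between similar images
tsne = TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X)

print(X_tsne.shape)  # (1797, 2)
```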

And here's an example of applying t-SNE to the fruit dataset. The code is very similar to applying MDS; it essentially just replaces MDS with t-SNE. The interesting thing here is that t-SNE does a poor job of finding structure in this rather small and simple fruit dataset, which reminds us that we should try at least a few different approaches when visualizing data using manifold learning, to see which works best for a particular dataset. t-SNE tends to work better on datasets that have more well-defined local structure; in other words, more clearly defined patterns of neighbors.
