Another very important family of unsupervised learning methods that falls into the transformation category is known as dimensionality reduction algorithms.

As the name suggests, this kind of transform takes your original dataset, which might contain, say, 200 features, and finds an approximate version of the dataset that uses, say, only 10 dimensions.

One very common need for dimensionality reduction arises when first exploring a dataset,

to understand how the samples may be grouped or related to each

other by visualizing it using a two-dimensional scatterplot.

One very important form of dimensionality reduction is called

principal component analysis, or PCA.

Intuitively, what PCA does is take your cloud of original data points and find a rotation of it so that the dimensions are statistically uncorrelated.

PCA then typically drops all but

the most informative initial dimensions that

capture most of the variation in the original dataset.

Here's a simple example of what I mean with a synthetic two-dimensional dataset.

Here, if we have two original features that are highly correlated, represented by this cloud of points, PCA will rotate the data so that the direction of highest variance, called the first principal component, which lies along the long axis of the cloud, becomes the first dimension.

It will then find the direction at right angles

that maximally captures the remaining variance.

This is the second principal component.

In two dimensions, there's

only one possible such direction at right angles to the first principal component,

but for higher dimensions,

there would be infinitely many.

With more than two dimensions,

the process of finding successive principal components at right angles to

the previous ones would continue until

the desired number of principal components is reached.

One result of applying PCA is that we now know

the best one-dimensional approximation to the original two-dimensional data.

In other words, we can take any data point that used two features

before - x and y - and approximate it using just one feature,

namely its location when projected onto the first principal component.
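As a rough sketch of that idea (my own synthetic example, not code from the lecture), here's how projecting correlated two-dimensional data onto its first principal component might look in scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2D cloud: y is strongly correlated with x
rng = np.random.RandomState(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.3, size=200)
X = np.column_stack([x, y])

# Keep only the first principal component
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)             # each point's position along the first PC
X_approx = pca.inverse_transform(X_1d)  # best 1D approximation, back in (x, y) space

print(X_1d.shape)                     # (200, 1): one feature instead of two
print(pca.explained_variance_ratio_)  # the first PC captures most of the variance
```

Because the two features are so strongly correlated, almost all of the variation survives the projection down to one dimension.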

Here's an example of using scikit-learn to apply PCA to a higher-dimensional dataset;

the breast cancer dataset.

To perform PCA, we import the PCA class from sklearn.decomposition.

It's important to first transform the dataset so that each feature's range of values has zero mean and unit variance. We can do this using the fit and transform methods of the StandardScaler class, as shown here.

We then create the PCA object,

specify that we want to retain just the first two principal components to reduce

the dimensionality to just two columns and call the fit method using our normalized data.

This will set up PCA so that it learns the right rotation of the dataset.

We can then apply this properly prepared PCA object to project

all the points in our original input dataset to this new two-dimensional space.

Notice that since we're not doing supervised learning and evaluating a model against a test set,

we don't have to split our dataset into training and test sets.

You see that if we take the shape of the array that's returned from PCA,

it's transformed our original dataset with

30 features into a new array that has just two columns,

essentially expressing each original data point in terms of

two new features representing the position of

the data point in this new two-dimensional PCA space.
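The steps just described might look like the following sketch (variable names are my own, not necessarily those from the lecture's notebook):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target  # 569 samples, 30 features

# Normalize each feature to zero mean and unit variance first
X_normalized = StandardScaler().fit(X).transform(X)

# Learn the rotation, keeping just the first two principal components
pca = PCA(n_components=2).fit(X_normalized)

# Project every point into the new two-dimensional PCA space
X_pca = pca.transform(X_normalized)

print(X.shape, X_pca.shape)  # (569, 30) (569, 2)
```

The two columns of X_pca can then be used directly as the axes of a scatterplot, colored by the malignant/benign label.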

We can then create a scatterplot that uses

these two new features to see how the data forms clusters.

In this example, we've used a dataset that has labels for supervised learning;

namely, the malignant and benign labels on cancer cells.

So we can see how well PCA serves to find clusters in the data.

Here's the result of plotting all of the 30-feature data samples using the two new features computed with PCA.

We can see that the malignant and benign cells do

indeed tend to cluster into two groups in the space.

In fact, we could now apply a linear classifier to

this two-dimensional representation of

the original dataset and we can see that it would likely do fairly well.

This illustrates another use of dimensionality reduction methods like PCA to find

informative features that could then be used in a later supervised learning stage.

We can create a heat map that visualizes the first two principal components of

the breast cancer dataset to get an idea of

what feature groupings each component is associated with.

Note that we can get the arrays representing the two principal component axes that define the PCA space using the components_ attribute of the PCA object, which is filled in after the PCA fit method is used on the data.
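Here's a sketch of inspecting those axes directly; the heatmap in the lecture is essentially just a plot of this 2-by-30 array against the feature names:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
X_normalized = StandardScaler().fit_transform(cancer.data)
pca = PCA(n_components=2).fit(X_normalized)

# components_ holds one row per principal component,
# one column per original feature
print(pca.components_.shape)  # (2, 30)

# Features with negative loadings on the second component: these co-vary
# together, in the opposite direction of the remaining features.
# (The overall sign of a component is arbitrary; the two-group split is
# what's meaningful.)
second = pca.components_[1]
negative_group = [name for name, w in zip(cancer.feature_names, second) if w < 0]
print(negative_group)
```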

We can see that the first principal component is all positive,

showing a general correlation between all 30 features.

In other words, they tend to vary up and down together.

The second principal component has a mixture of

positive and negative signs; but in particular,

we can see a cluster of negatively signed features that co-vary

together and in the opposite direction of the remaining features.

Looking at the names, it makes sense that this subset would co-vary together.

We see the pair mean texture and worst texture and

the pair mean radius and worst radius varying together and so on.

PCA gives a good initial tool for exploring a dataset,

but may not be able to find more subtle groupings that

produce better visualizations for more complex datasets.

There is a family of unsupervised algorithms called Manifold Learning Algorithms that are

very good at finding low dimensional structure in

high dimensional data and are very useful for visualizations.

One classic example of

a low dimensional subset in

a high dimensional space is this data set in three dimensions,

where the points all lie on a two-dimensional sheet with an interesting shape.

This lower-dimensional sheet within a higher-dimensional space is called a manifold.

PCA is not sophisticated enough to find this interesting structure.

One widely used manifold learning method is called multi-dimensional scaling, or MDS.

There are many flavors of MDS,

but they all have the same general goal;

to visualize a high dimensional dataset and project

it onto a lower dimensional space - in most cases,

a two-dimensional page - in a way that preserves

information about how the points in the original data space are close to each other.

In this way, you can find and visualize

clustering behavior in your high dimensional data.

Using a technique like MDS in scikit-learn is quite similar to using PCA.

Like with PCA, each feature should be normalized so that its values have zero mean and unit variance.

After importing the MDS class from sklearn.manifold and transforming the input data,

you create the MDS object,

specifying the number of components - typically set to two dimensions for visualization.

You then fit the object using the transformed data,

which will learn the mapping and then you can apply

the MDS mapping to the transformed data.
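The fruit dataset used in the lecture isn't bundled with scikit-learn, so here's the same pattern sketched on the iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import MDS

X, y = load_iris(return_X_y=True)

# As with PCA, normalize each feature to zero mean and unit variance
X_normalized = StandardScaler().fit_transform(X)

# Learn a mapping from the 4-dimensional data down to 2 dimensions
mds = MDS(n_components=2, random_state=0)
X_mds = mds.fit_transform(X_normalized)

print(X_mds.shape)  # (150, 2): ready for a 2D scatterplot colored by y
```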

Here's an example of applying MDS to the fruit dataset.

And you can see it does a pretty good job of

visualizing the fact that the different fruit types

do indeed tend to cluster into groups.

An especially powerful manifold learning algorithm

for visualizing your data is called t-SNE.

t-SNE finds a two-dimensional representation of your data,

such that the distances between points in the 2D scatterplot match as closely

as possible the distances between

the same points in the original high dimensional dataset.

In particular, t-SNE gives much more weight to preserving

information about distances between points that are neighbors.

Here's an example of t-SNE applied to the images in the handwritten digits dataset.

You can see that this two-dimensional plot preserves

the neighbor relationships between images that are similar in terms of their pixels.
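A sketch of that digits example might look like this (I subsample to keep the run quick; the parameters used in the lecture's notebook may differ):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X, y = digits.data[:500], digits.target[:500]  # subsample for speed

# t-SNE embeds the 64-pixel images into 2D, giving extra weight to
# preserving the distances between neighboring points
tsne = TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X)

print(X_tsne.shape)  # (500, 2): plot and color by digit label to see clusters
```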

For example, the cluster for most of the digit eight samples is closer to the clusters for the digits three and five, whose handwritten forms can appear more similar, than to, say, the digit one, whose cluster is much farther away.

And here's an example of applying t-SNE on the fruit dataset.

The code is very similar to applying MDS and essentially just replaces MDS with t-SNE.

The interesting thing here is that t-SNE does a poor job of

finding structure in this rather small and simple fruit dataset,

which reminds us that we should try at least a few different approaches when

visualizing data using manifold learning to

see which works best for a particular dataset.

t-SNE tends to work better on datasets that have more well-defined local structure;

in other words, more clearly defined patterns of neighbors.