As you can see, finding clusters using XLMiner is straightforward.

However, before we go, we need to address the question of how many clusters to

use in a cluster analysis.

There is basically no theory about how to find the right number of clusters.

In fact, in some settings it might not be completely clear what the right number

of clusters means.

We know that the two extremes are to either put every observation

in its own cluster or to put all observations in a single cluster.

With the first option we have no predictive power,

because we will not have a cluster in which to put a new observation.

The second option results in a trivial model

that tells us nothing about new observations.

Most analysts would agree with applying parsimony to cluster analysis.

Under this principle, we choose the smallest number of clusters

that generalize best to a new observation.

There are a few ways in which generalization could be measured.

For instance, we could measure how much the centroids will move if we

re-run the clustering method with the addition of new data.

We could also measure how much the cluster assignments

would change in the presence of new data.

Or, we could also check how much the total sum of squares would

change when assigning new data to the existing clusters.
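
As a rough illustration of these three measures, here is a minimal sketch in Python. It does not use XLMiner. The data are synthetic, and the helper names (`nearest`, `kmeans`, `mean_ss`, `blobs`) are hypothetical, written only for this example. The K-means routine is a plain Lloyd's algorithm, which may differ from what XLMiner does internally.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, init, max_iter=100):
    """Plain Lloyd's algorithm starting from the centroids in `init`."""
    centroids = [tuple(c) for c in init]
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids

def mean_ss(points, centroids):
    """Average squared distance from each point to its nearest centroid."""
    return sum(math.dist(p, centroids[nearest(p, centroids)]) ** 2 for p in points) / len(points)

def blobs(rng, n_per_blob):
    """Synthetic 2-D data: three well-separated groups (illustrative only)."""
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for cx, cy in centers for _ in range(n_per_blob)]

rng = random.Random(0)
old_data = blobs(rng, 50)   # data used to build the clusters
new_data = blobs(rng, 20)   # data that arrives later

# Fit on the original data, seeding with one point from each group.
fitted = kmeans(old_data, [old_data[0], old_data[50], old_data[100]])

# 1) Centroid movement after re-running with the new data added.
refitted = kmeans(old_data + new_data, fitted)
centroid_shift = max(math.dist(a, b) for a, b in zip(fitted, refitted))

# 2) Share of original observations whose cluster assignment changes.
changed = sum(nearest(p, fitted) != nearest(p, refitted) for p in old_data) / len(old_data)

# 3) Average squared distance on old versus new data, using the existing centroids.
ss_old = mean_ss(old_data, fitted)
ss_new = mean_ss(new_data, fitted)

print(f"max centroid shift: {centroid_shift:.3f}")
print(f"share of changed assignments: {changed:.3f}")
print(f"mean squared distance, old vs new: {ss_old:.2f} vs {ss_new:.2f}")
```

Small values for the first two measures, and similar values for the two averages in the third, would suggest that the chosen number of clusters generalizes well to the new data.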

In addition to these measures, we can also use a procedure where

we start with a small number of clusters and then we add one cluster at a time.

After each addition, we decide whether the new clusters have a more

meaningful interpretation than the clusters in the previous iteration.

Let's try this approach on our online retailer example.

At the moment we have a solution with three clusters.

We have interpreted these three clusters and determined that there

are three identifiable markets, namely families, professionals and students.

If we use XLMiner to re-run K-means to find a solution with four clusters,

we find the normalized centroids shown in this table.

The first cluster is the one corresponding to families,

and the second cluster corresponds to what we labeled as professionals.

So far nothing new has emerged compared to our solution with three clusters.

In fact, the centers for clusters 1 and 2 in this solution

are identical to the centers of clusters 1 and 2 in the solution with three clusters.

Now we examine clusters 3 and

4 and we observe that they are nearly identical to each other.

We compare the data summary tables for the solution with three clusters and

the solution with four clusters.

We can now clearly see what has happened.

Because we wanted a solution with four clusters,

we forced the third cluster in the solution with three clusters to be split.

The 100 observations in this cluster were split into 49 and

51 when we asked K-means to produce a solution with four clusters.

For this analysis it seems clear that the data

do not support the hypothesis of the existence of a fourth market.

The right number of clusters in this case seems to be three.
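
The same kind of forced split can be reproduced outside XLMiner. The sketch below is only an illustration under stated assumptions: the data are synthetic (three groups of 100 observations each, not the online retailer data), the helpers `nearest` and `kmeans` are hypothetical, and the fourth starting centroid is deliberately placed inside the third group. It fits three and then four clusters and checks which pair of centroids in the four-cluster solution lies closest together.

```python
import math
import random
from collections import Counter

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, init, max_iter=100):
    """Plain Lloyd's algorithm starting from the centroids in `init`."""
    centroids = [tuple(c) for c in init]
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids

# Synthetic data with exactly three groups (an assumption made for this example).
rng = random.Random(1)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for cx, cy in centers for _ in range(100)]

# Solution with three clusters: one starting centroid per group.
three = kmeans(data, [data[0], data[100], data[200]])
# Solution with four clusters: the extra starting centroid sits in the third group.
four = kmeans(data, [data[0], data[100], data[200], data[201]])

sizes_three = Counter(nearest(p, three) for p in data)
sizes_four = Counter(nearest(p, four) for p in data)

# Find the two centroids of the four-cluster solution that are closest together.
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
closest = min(pairs, key=lambda ij: math.dist(four[ij[0]], four[ij[1]]))
closest_gap = math.dist(four[closest[0]], four[closest[1]])

print("cluster sizes, k=3:", sorted(sizes_three.items()))
print("cluster sizes, k=4:", sorted(sizes_four.items()))
print("closest pair of centroids, k=4:", closest, f"distance {closest_gap:.2f}")
```

When the data contain only three groups, the fourth cluster has nowhere to go except inside an existing group, so two of the four centroids end up close to each other while the remaining two stay where they were in the solution with three clusters.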

The goal of this module was to provide you with a good understanding of what

cluster analysis is, how this analysis is used in business and

how the analysis is performed.

You're now ready to apply this knowledge and the tools that we have discussed.

The assignment at the end of this module is a good starting point to put cluster

analysis into action.