
Let's start with hierarchical clustering.

This technique is conceptually simple.

The process begins with placing each object in its own cluster.

For example, suppose that we have five objects labeled A, B, C, D, and E.

Then, in the first step, each of these objects is placed in its own cluster.

In subsequent steps, the method calculates the distance between each pair of clusters and then merges the two that are closest to each other.

Let's suppose that objects A and B are the closest to each other of all the possible pairs.

These objects are merged into a single cluster.

In the third step, the procedure merges the AB cluster with C.

And finally, the procedure merges D and E and then stops.

The final result shows two clusters: one with A, B, and C, and the other one with D and E.

The procedure is simple, but it needs to know how to measure the distance from one cluster to another.

It also needs to know when to stop merging clusters; otherwise, it will always end with a single cluster containing all the objects.
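The merging loop just described can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the five one-dimensional values and the choice of single linkage (smallest pairwise distance) are assumptions, since the lecture specifies neither.

```python
# Minimal agglomerative clustering sketch. The data values and the
# single-linkage distance are illustrative assumptions.

def single_linkage_distance(c1, c2, points):
    """Smallest pairwise distance between members of two clusters."""
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

def agglomerate(points, target_k):
    """Merge the two closest clusters until target_k clusters remain."""
    clusters = [[name] for name in points]          # each object starts alone
    while len(clusters) > target_k:
        # find the closest pair of clusters
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: single_linkage_distance(
            clusters[p[0]], clusters[p[1]], points))
        clusters[i] = clusters[i] + clusters[j]     # merge cluster j into i
        del clusters[j]
    return clusters

# Five objects with one made-up numeric feature each.
points = {"A": 1.0, "B": 1.2, "C": 2.0, "D": 8.0, "E": 8.5}
print(agglomerate(points, 2))   # A and B merge first, then D and E, then AB with C
```

With these values the merges happen in the order described above: A with B, then D with E, then the AB cluster with C, leaving the two clusters {A, B, C} and {D, E}.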

You might recall that, in the previous video, we introduced five measures to calculate the distance between a pair of clusters.

These measures were single linkage, complete linkage, average linkage, average group linkage, and Ward's method.

Any of these measures can be used in hierarchical clustering.

Software packages allow you to choose which measure to use.

In terms of when to stop, this is determined by the analyst.

Cluster analysis requires experimentation, and often there is no single correct answer.

Experimenting with several distance measures and with different numbers of clusters is common practice for finding meaningful groupings of data.

We will address some of these issues in the last video of this module.
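As one example of choosing the measure in software, SciPy's hierarchical clustering routines take the linkage method by name. The toy 2-D dataset below is made up, and only four of the five measures are shown (SciPy's `"centroid"` method is the closest analogue of average group linkage):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: three points near (1, 1) and two near (8, 8).
X = np.array([[1.0, 1.0], [1.2, 1.1], [2.0, 1.9],
              [8.0, 8.0], [8.5, 8.4]])

# SciPy exposes the linkage choices by name; "ward" is Ward's method.
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```

On data this well separated, every measure recovers the same two groups; on messier data the choice of measure can change the result, which is exactly why experimenting with several is worthwhile.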

Hierarchical clustering may be represented by a two-dimensional diagram called a dendrogram.

This diagram has the form of an inverted tree.

The horizontal axis represents the observations, and the vertical axis represents distance.

The lengths of the branches, represented by the vertical lines, indicate the distance between the two clusters that are merging.

The diagram is read from the bottom to the top by sweeping a horizontal line across the entire tree.

If we start at a distance of zero and move up, the first merger that we find is the one of observations A and B.

At this point, we have a solution with six clusters: one with A and B in it, and the other five with one observation each, that is, C, D, E, F, and G.

The next merger, of the AB cluster with observation C, occurs at a distance of 1.5.

Then observations D and E merge at a distance of 2, and observations F and G merge at a distance of 2.5.

At this level, the dendrogram shows a solution with three clusters: one with A, B, and C; another one with D and E; and a third one with F and G.

A two-cluster solution is found at a distance of three.

Since the dendrogram contains all possible solutions, it is only practical to examine it for datasets with relatively few observations.
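The sweep described above reduces to a simple count: with n observations, a horizontal line at height t crosses a tree with n minus (merges below t) clusters. The sketch below uses the merge heights quoted in the lecture; the height of the first merger (A with B, taken here as 1.0) and the identity of the merge at height 3 are assumptions, since the lecture does not state them.

```python
# Dendrogram sweep as a count. Heights follow the lecture's example;
# the first height (1.0) is an assumed value.
n = 7
merge_heights = [1.0, 1.5, 2.0, 2.5, 3.0]  # A+B, AB+C, D+E, F+G, final merge

def clusters_at(t):
    """Number of clusters when the sweep line sits at distance t."""
    return n - sum(1 for h in merge_heights if h <= t)

print(clusters_at(0.5))   # 7: nothing has merged yet
print(clusters_at(1.0))   # 6: A and B have merged
print(clusters_at(2.6))   # 3: {A,B,C}, {D,E}, {F,G}
print(clusters_at(3.0))   # 2: the two-cluster solution
```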

Now let's talk about k-means.

The k-means method is very well known and widely used.

It is scalable, meaning that it can deal with very large datasets.

The overall idea is to assign objects to the nearest cluster, where distance is measured from the object to the centroid of the cluster.

Recall that the centroid is the cluster average.

The value of k indicates the number of clusters.

The process starts with the selection of k observations as the centroids of the initial clusters.

This selection is somewhat arbitrary, but the procedure works better if the initial centroids are as far apart as possible.

Then, the rest of the observations are assigned to the closest centroid.

Once the assignment is completed, the cluster averages are recalculated.

This recalculation could change the position of the centroids and cause a reassignment of the observations.

These two steps are repeated until the centroids do not change.
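The assign-recompute loop can be sketched directly with NumPy. This is a bare-bones illustration under simplifying assumptions (seed observations given explicitly, no cluster ever becomes empty), not production code:

```python
import numpy as np

def kmeans(X, k, seeds, max_iter=100):
    """Plain k-means sketch: assign to nearest centroid, recompute, repeat.
    `seeds` gives the indices of the observations used as initial centroids.
    Assumes no cluster ever becomes empty during the iterations."""
    centroids = X[seeds].astype(float)
    for _ in range(max_iter):
        # distance from every observation to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                  # nearest-centroid assignment
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):            # centroids stopped moving
            break
        centroids = new
    return labels, centroids

# Toy data with two visible groups (values made up).
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2, seeds=[0, 3])
print(labels)   # the first three points form one cluster, the last three the other
```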

So let's take a look at a simple example.

We want to divide seven individuals into two groups based on their age and annual income.

To initialize the procedure, we select David and Clara as the initial members of the two clusters.

Sometimes these initial objects are referred to as seeds.

Since there is only one member in each cluster, the centroids of these initial clusters are the values corresponding to the age and income of David and Clara.

Since income and age have values on very different scales, we use the normalized values to calculate Euclidean distances.
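The normalization step might look like the following z-score sketch. The ages and incomes here are invented for illustration; they are not the values from the lecture's seven individuals.

```python
import numpy as np

# Age and income live on very different scales, so we z-score each
# column before computing Euclidean distances. Values are made up.
data = np.array([[25.0, 30000.0],   # columns: age, annual income
                 [52.0, 80000.0],
                 [38.0, 55000.0]])

z = (data - data.mean(axis=0)) / data.std(axis=0)

# Euclidean distance between the first two individuals, on the z scale.
dist = np.linalg.norm(z[0] - z[1])
print(z.round(2))
print(round(dist, 2))
```

After this step each column has mean 0 and standard deviation 1, so a difference of one unit in age counts the same as a difference of one unit in income.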

According to these calculations, we assign Ann and George to David to form the red cluster.

The second cluster, colored in blue, has Clara, Bob, Erin, and Frank.

The cluster centers are then recalculated, and they move to a new position.

The new centroid positions cause Bob to be a little bit closer to the red centroid than to the blue centroid, and therefore Bob is reassigned to the red cluster.

The centroids are recalculated once again, and this time no reassignments are needed.

The final clusters seem to be characterized by age, with the red cluster consisting of all the people under 40 and the blue cluster consisting of all the people over 40.

The beauty of k-means is that it is a very simple procedure.

The weakness is that the final clusters depend on the initial choices.

This is why cluster analysis software gives you the option of running the procedure multiple times from seeds that are randomly chosen.
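A random-restart strategy can be sketched as follows: run k-means several times from random seed observations and keep the run with the lowest within-cluster sum of squared errors. The data and number of restarts are made-up illustrations of the idea, not a specific package's implementation.

```python
import numpy as np

def kmeans_once(X, k, rng):
    """One k-means run from k randomly chosen seed observations."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(100):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid; keep the old one if its cluster emptied
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, sse

# Three well-separated pairs (values made up); restart 10 times and
# keep the best run, mirroring the "multiple random starts" option.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
              [5.1, 4.9], [9.0, 1.0], [8.8, 1.2]])
rng = np.random.default_rng(0)
best_labels, best_sse = min((kmeans_once(X, 3, rng) for _ in range(10)),
                            key=lambda run: run[1])
print(best_sse)
```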

The process of creating clusters from a dataset can be viewed as an optimization problem.

Both hierarchical clustering and k-means are procedures that find approximate solutions to the problem of maximizing the similarity of the objects in each cluster.

The maximization must take into consideration that there must be k clusters and that every object must belong to one cluster and only one cluster.

This is not an easy optimization problem, and this is the reason business analytics software employs approximation procedures such as hierarchical clustering and k-means.
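One common way to state the objective both procedures approximate is minimizing the within-cluster sum of squares (equivalently, maximizing within-cluster similarity), subject to having exactly k clusters with every object in exactly one of them. A small sketch of that objective, on made-up data:

```python
import numpy as np

def wcss(X, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Two visible pairs of points (values made up).
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
good = wcss(X, np.array([0, 0, 1, 1]))   # partition matching the pairs
bad = wcss(X, np.array([0, 1, 0, 1]))    # partition mixing the pairs
print(good, bad)
```

The partition that respects the two visible pairs scores far lower than the one that mixes them, which is what "more similar within each cluster" means in this objective.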
