By now, you should have an understanding of what cluster analysis is, as well as a good grasp of the types of business questions this technique can answer. Before we review the methods for performing the analysis, we need to discuss data preparation and how clustering methods measure the similarity between observations. While datasets can contain a wide variety of data types, for the purposes of this discussion we will focus on the two most common: numerical and categorical. Numerical variables include quantities that may be continuous, such as time, or integer-valued, such as the number of purchases or the number of dependents. Categorical variables may be ordinal or nominal. An ordinal variable implies some sort of ranking. For instance, a customer satisfaction rating stated as high, medium, or low implies an order. Therefore, a transformation to a numerical variable would be to make high equal to 3, medium equal to 2, and low equal to 1. Note that this transformation implies that the difference between high and medium is the same as the difference between medium and low. Nominal variables, on the other hand, can be thought of as representing choices. Political party affiliation is nominal data. In the United States, for example, this nominal data would indicate Democrat, Republican, or Independent voters. These choices do not imply any particular order, and therefore they cannot be transformed into a single numerical variable. The transformation requires binary variables. A binary variable has two possible values, 0 and 1. The number of binary variables needed for the transformation is equal to the number of categories minus 1. Note that in the transformation for political affiliation, we use two binary variables because there are three categories. A Democrat is transformed into a value of 1 for variable 1 and a value of 0 for variable 2.
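Both transformations can be sketched in a few lines of Python. The satisfaction mapping and the party categories come from the example above; the helper function names are my own, and the choice of which category becomes the all-zeros baseline is an illustrative assumption.

```python
# Ordinal variable: the ranking is preserved by mapping labels to integers.
SATISFACTION = {"high": 3, "medium": 2, "low": 1}

def encode_ordinal(values):
    """Map ordered labels (high/medium/low) to 3/2/1."""
    return [SATISFACTION[v] for v in values]

# Nominal variable: k categories become k - 1 binary (dummy) variables.
def encode_nominal(values, categories):
    """One-hot encode, dropping the last category as the all-zeros baseline."""
    kept = categories[:-1]  # k - 1 indicator columns
    return [[1 if v == c else 0 for c in kept] for v in values]

parties = ["Democrat", "Republican", "Independent"]
print(encode_ordinal(["high", "low", "medium"]))  # [3, 1, 2]
print(encode_nominal(parties, parties))           # [[1, 0], [0, 1], [0, 0]]
```

With Independent as the baseline, a Democrat is (1, 0), a Republican is (0, 1), and an Independent is (0, 0), exactly as described above.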
A Republican is transformed into a value of 0 for variable 1 and a value of 1 for variable 2, and an Independent has a value of 0 for both variables. A special case of nominal data occurs when there are only two categories, for instance, yes-or-no options. Yes is typically given the value of 1, and no is given the value of 0. Most software packages for data analysis include tools that can transform categorical data given in the form of text into numerical variables. Although software can perform these transformations automatically, it is always good to verify how categorical variables are being transformed, to avoid problems such as the software treating nominal data as ordinal. Datasets may contain variables with values on very different scales. Therefore, it is recommended to perform data analysis such as clustering on normalized, also called standardized, data instead of the original data. Normalization takes care of differences in scale by transforming each original value to its standard value. The operation consists of subtracting the mean and dividing by the standard deviation. In this example, we have age and income data for five people. The average age of the sample is 42.20 years, and the average income is 105,000. The last two columns of the table show the normalized values. For instance, the normalized age of Ann is -0.4948, which is the result of subtracting the average age of the group (42.20 years) from Ann's age (35 years) and then dividing by the standard deviation of 14.55. The normalized value means that Ann's age is 0.4948 standard deviations below the mean. The normalized values for age and income are now on the same scale, with a mean of 0 and a standard deviation of 1. Normalized values allow us to identify outliers in the dataset, and they eliminate biases from variables with relatively large original values.
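A minimal normalization sketch using Python's standard library. Only Ann's age (35) and the group statistics (mean 42.20, standard deviation 14.55) come from the example above; the list of ages passed to `normalize` is hypothetical, since the full table is not reproduced here.

```python
from statistics import mean, stdev

def normalize(values):
    """Z-scores: subtract the mean, divide by the (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Ann's z-score can be checked directly from the stated statistics:
z_ann = (35 - 42.20) / 14.55
print(round(z_ann, 4))  # -0.4948, i.e. 0.4948 standard deviations below the mean

# Hypothetical ages (not the table from the example): after normalization
# the values have mean 0 and standard deviation 1.
zs = normalize([35, 28, 61, 44, 43])
print(round(mean(zs), 6), round(stdev(zs), 6))
```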
Normalized values also enable an easier interpretation of cluster analysis results. Since the mean of a normalized variable is zero, we can easily detect values that are above the mean and those that are below it, and we can tell how far a value is from the mean in terms of standard deviations. A normalized value of 1.7, for instance, means that the original value is 1.7 standard deviations above the mean. To do cluster analysis, we need a measure of the distance, or similarity, between observations. The Euclidean distance is the most commonly used measure of similarity between two observations. It is equivalent to the straight-line distance between two objects in two-dimensional space. Using the normalized age and income values in our previous example, we can compute the distance between each pair of persons in our set, for instance, the distance between David and Ann, or the distance between David and Clara. A scatter plot can then be used to create a graphical representation of the distance between each pair of observations. The plot shows that, in terms of age and income, David is at least three times closer, or more similar, to Ann than he is to Clara. This makes sense, since David is closer in both age and income to Ann than he is to Clara. Now that we have a way to measure distances between observations, we need to establish a measure of distance between clusters. There are five distance measures between clusters: single linkage, complete linkage, average linkage, average group linkage, and Ward's method. In single linkage, the distance between two clusters is the minimum distance between any pair of objects that are not in the same cluster. Complete linkage uses the maximum distance between objects that are not in the same cluster. Average linkage calculates the average of all distances across the two clusters.
Average group linkage is the distance between the center of one cluster and the center of the other. Ward's method uses a sum-of-squares criterion. The sum of squares refers to the squared distance from each observation to the centroid of the cluster to which it is assigned. Let's go through the calculations using the data in our simple example. Suppose that Clara and Erin form a cluster. The centroid of this cluster is the average of the normalized values, so in our example we have 0.9828 and 0.0906 as the averages of the normalized age and the normalized income for the members of the cluster. We then calculate the squared distance from each cluster member to the centroid. The squared distance is calculated by adding the squared differences between each variable value and the centroid. For example, Clara has a normalized age of 1.567 and the centroid is 0.9828, so we square the difference between these two numbers, and we do the same for the normalized income. The result is a squared distance of 0.3495. We then go through the same calculations for Erin. Because the cluster has only two members, the centroid is exactly halfway between them. The sum of squares for this cluster is 0.699. The sum of squares for the complete solution is the aggregate of the sums of squares for all the clusters. Since each clustering method may generate a different outcome, it is generally recommended to experiment and compare results. In this slide, we show how single linkage does better than k-means and Ward's method on these two-dimensional problems. However, no single method will always outperform the others. We have reviewed three concepts that are critical to performing a valid cluster analysis. First, the data should be in the correct form, taking into consideration what each variable represents. Second, a proper metric should be established to measure the distance between every pair of observations.
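The sum-of-squares calculation can be sketched as follows. Only Clara's normalized age (1.567), the centroid (0.9828, 0.0906), and the resulting squared distances (0.3495 each, 0.699 total) are stated in the example above; the remaining coordinates used here are back-derived from those figures purely for illustration.

```python
def centroid(points):
    """Component-wise average of a list of points."""
    n, d = len(points), len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(d))

def sum_of_squares(points):
    """Ward's criterion: total squared distance from each member to the centroid."""
    c = centroid(points)
    return sum(sum((p[i] - c[i]) ** 2 for i in range(len(c))) for p in points)

# Normalized (age, income). Clara's age and the centroid match the example;
# Erin's age and both income values are back-derived assumptions.
clara, erin = (1.567, 0.1812), (0.3986, 0.0000)
print([round(x, 4) for x in centroid([clara, erin])])    # [0.9828, 0.0906]
print(round(sum_of_squares([clara, erin]), 3))           # 0.699
```

With only two members, each point sits the same distance from the midpoint centroid, so both squared distances are 0.3495 and the cluster's sum of squares is 0.699.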
Third, we must decide how the distance between clusters is going to be measured. We are now ready to review how the most common clustering methods operate.
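The cluster-to-cluster distance measures reviewed above (all except Ward's method) can be sketched as simple functions over pairwise Euclidean distances. The two small clusters are hypothetical points on a normalized (age, income) scale, not values from the earlier table.

```python
from itertools import product
from math import dist  # Euclidean distance (Python 3.8+)

def single_linkage(a, b):
    """Minimum distance over all cross-cluster pairs."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Maximum distance over all cross-cluster pairs."""
    return max(dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Mean distance over all cross-cluster pairs."""
    ds = [dist(p, q) for p, q in product(a, b)]
    return sum(ds) / len(ds)

def average_group_linkage(a, b):
    """Distance between the two cluster centroids."""
    ca = tuple(sum(p[i] for p in a) / len(a) for i in range(len(a[0])))
    cb = tuple(sum(p[i] for p in b) / len(b) for i in range(len(b[0])))
    return dist(ca, cb)

# Two hypothetical clusters of normalized (age, income) points.
A = [(-1.0, -0.5), (-0.8, -0.7)]
B = [(0.9, 0.6), (1.2, 1.0)]
for f in (single_linkage, complete_linkage, average_linkage, average_group_linkage):
    print(f.__name__, round(f(A, B), 3))
```

By construction, single linkage is never larger than average linkage, which is never larger than complete linkage, which is one reason the different methods can merge clusters in different orders and produce different final solutions.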