Welcome back for the second example. Let's now review the HR example that you explored in class with Professor Gladdy. As a quick reminder, we are the HR department of a big consulting company, and we're worried about the high number of employees leaving the firm. We want to use HR analytics to understand why employees leave, and to discover the actions we can take in order to retain our best employees. Let's turn to R. The first thing you always do is set your working directory. Once that is done, you can clean up the memory of your current R session by running this line. We are now ready to load our dataset, and we can do so using the read.table function that we already know from the SKU example. As always, let's explore our dataset using the str function. What we discover here is that our dataset contains 2,000 observations and has 6 variables. What are these variables? First, the S variable, which is numeric. As you may remember from the lectures, the S variable is the satisfaction level on a scale of 0 to 1. Then we have the LPE variable, which is also numeric, and which is the last project evaluation by a client, again on a scale of 0 to 1. Then the NP variable, an integer variable representing the number of projects worked on by the employee in the last 12 months. Then the ANH variable, an integer variable representing the average number of hours worked by the employee in the last 12 months. The TIC variable, again an integer variable, represents the time spent in the company, in years, by the employee. The last variable is the Newborn variable: it takes the value 1 if the employee had a newborn within the last 12 months and 0 otherwise. Now, as we did in the SKU example, and as we will often do, let's check out some summary statistics for our variables by using the summary function.
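The setup and exploration steps just described can be sketched as follows. Since the course data file is not bundled with this transcript, the sketch builds a small synthetic data frame with the same six variables; in the course you would instead load the real file with read.table (the file name and separator below are assumptions, not the course's actual file).

```r
# A minimal sketch of the setup and exploration steps, using a small
# synthetic stand-in for the HR dataset (the real one has 2,000 rows).
# In the course you would instead run something like:
#   data <- read.table("your_hr_file.csv", header = TRUE, sep = ",")

rm(list = ls())   # clean up the memory of the current R session

set.seed(1)
n <- 20
data <- data.frame(
  S       = runif(n),                            # satisfaction level, 0 to 1
  LPE     = runif(n),                            # last project evaluation, 0 to 1
  NP      = sample(2:7, n, replace = TRUE),      # projects in the last 12 months
  ANH     = sample(130:300, n, replace = TRUE),  # average hours, last 12 months
  TIC     = sample(2:6, n, replace = TRUE),      # time in company, in years
  Newborn = rbinom(n, 1, 0.15)                   # 1 if newborn in last 12 months
)

str(data)      # 20 observations of 6 variables here (2,000 of 6 in the course data)
summary(data)  # summary statistics for each variable
```

Note that with this synthetic stand-in the summary statistics will of course differ from the ones quoted in the lecture.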
As you can see, the mean value for the satisfaction level is 0.44, and you can check out some other statistics that might help guide you through your analysis. Now, in order to make the variables comparable to one another, we need to normalize them. To do that, we create a copy of our dataset and call it testdata. To normalize the dataset, we use the scale function. As was explained in the lectures, the scale function subtracts the mean and divides by the standard deviation of each variable. We can now compute the distances between our data points using the dist function. As the first argument, we pass our dataset, testdata. As a second argument, we set method equal to "euclidean", don't forget the quotes, in order to compute the distances using the Euclidean method. We store the results in d. Let's run this line of code. As we did previously, let's perform hierarchical clustering using the hclust function. We pass it the distances as a first argument and set the method equal to "ward.D", in quotes, as a second argument. We store the results in hcward. Let's decide, as you did in class, to have four different clusters. As in the SKU example, we assign our points to our clusters using the cutree function, creating a new variable in our original dataset called groups, which for each observation indicates the group to which that observation belongs. We can now run this line. In order to get a nice output table, we compute the mean values of each variable for each group using the aggregate function. Its first argument is . ~ groups: the dot means we want to use all of our variables, and groups is the grouping we want to aggregate by. The second argument is the data we want to perform this on, and note here that it's the original dataset, so data=data, data being the name of our original dataset.
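The normalization and clustering pipeline above can be sketched like this, again on a small synthetic stand-in for the HR data so the snippet is self-contained:

```r
# A sketch of the normalization and clustering steps, assuming a small
# synthetic stand-in for the HR data (the real dataset has 2,000 rows).

set.seed(1)
n <- 20
data <- data.frame(S = runif(n), LPE = runif(n),
                   NP  = sample(2:7, n, replace = TRUE),
                   ANH = sample(130:300, n, replace = TRUE),
                   TIC = sample(2:6, n, replace = TRUE),
                   Newborn = rbinom(n, 1, 0.15))

testdata <- data
testdata <- scale(testdata)   # subtract each variable's mean, divide by its sd

d <- dist(testdata, method = "euclidean")  # pairwise Euclidean distances
hcward <- hclust(d, method = "ward.D")     # hierarchical clustering, Ward's method

data$groups <- cutree(hcward, k = 4)       # assign each point to one of 4 clusters

# mean of every variable within each group, computed on the ORIGINAL
# (unscaled) data so the table is readable in the variables' natural units
aggdata <- aggregate(. ~ groups, data = data, FUN = mean)
aggdata
```

Aggregating on the original data rather than on testdata is deliberate: the scaled values are useful for distance computations but hard to interpret in a summary table.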
And the FUN argument is the function that we want to use to aggregate the variables; here it's the mean function. We can now run this line. One thing that would be really nice to have is the proportion of our data that falls in each cluster. To compute that, we're going to add it to aggdata. We create a variable called proptemp which counts the number of observations in each group, using the S variable here, though we could have used any variable. The function we aggregate with is length, which counts the number of observations. Let's run this line. Let's type proptemp in the command line to see the output. What we see here is that there are 793 observations in group 1, 626 in group 2, 476 in group 3, and 105 in group 4. We are now ready to add the proportion variable to aggdata, and we can do that by computing the ratio between the number of observations in each group, which is proptemp$S, and the total number of observations, which is 2,000 if you remember, though you could also have used sum(proptemp$S). Let's run this line. Now, to better organize our output table, we order the groups from the one with the largest number of observations to the one with the smallest by using the order function. We run the line. To see our output table, we type aggdata in the command line. We can now see our aggdata output table, with the proportion as the last variable. If you run View(aggdata), you can see that one segment has 100% of employees with newborns, while the other segments have no newborns at all. The Newborn variable is a 0-or-1 outcome, and what we observe is an artifact of including a binary variable in a clustering: such variables will typically drive the results. On top of that, as discussed in the videos, we could wonder whether the Newborn variable is really relevant in this context. So, as discussed in the videos, let's remove the Newborn variable. We create a new dataset called testdata and set it equal to data.
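The proportion and ordering steps can be sketched as follows. The snippet rebuilds the small synthetic clustering from scratch so it runs on its own; on the course data the group counts would be the 793 / 626 / 476 / 105 quoted above.

```r
# A sketch of the proportion computation and ordering, on a small
# synthetic stand-in for the HR data.

set.seed(1)
n <- 20
data <- data.frame(S = runif(n), LPE = runif(n),
                   NP  = sample(2:7, n, replace = TRUE),
                   ANH = sample(130:300, n, replace = TRUE),
                   TIC = sample(2:6, n, replace = TRUE),
                   Newborn = rbinom(n, 1, 0.15))
testdata <- scale(data)
data$groups <- cutree(hclust(dist(testdata, method = "euclidean"),
                             method = "ward.D"), k = 4)
aggdata <- aggregate(. ~ groups, data = data, FUN = mean)

# count observations per group; S is arbitrary here, any column would do,
# because length() simply counts the rows in each group
proptemp <- aggregate(S ~ groups, data = data, FUN = length)

# share of all observations in each cluster (sum(proptemp$S) == nrow(data))
aggdata$proportion <- proptemp$S / sum(proptemp$S)

# order the clusters from the largest to the smallest
aggdata <- aggdata[order(aggdata$proportion, decreasing = TRUE), ]
aggdata
```

Using sum(proptemp$S) instead of a hard-coded 2,000 keeps the line correct even if the dataset size changes.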
We include all the rows of the dataset, and only columns one through five, which is every column except the Newborn column. We then rerun the code used above. If you take some time to read it, you'll see that it's exactly the same thing, only run on the reduced dataset: assign our observations to the groups, aggregate the values again, compute the proportion, and order the result again. Let's see the output by calling aggdata. At this point, you should get the same output as what Professor Gladdy showed in class. A good idea is to save the output in a CSV file to work on it later in a spreadsheet tool like Microsoft Excel, Google Sheets, or the like. You can do that with the write.csv function that you see right here. You pass it the dataset as the first argument. The second argument is the name of the file that you want to save. And our third argument here just indicates that we do not want the row names displayed in the output file. If for any reason you encounter an issue with write.csv, it is most likely due to regional settings for the separator, and I recommend that you use the write.csv2 function instead. This wraps up our second example, and I will see you in the next video.
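The final steps can be sketched end to end like this, again on the small synthetic stand-in; the output file name below is just an example, not the one used in the course.

```r
# A sketch of the final steps: drop the Newborn column, rerun the
# clustering on the reduced data, and export the result to CSV.

set.seed(1)
n <- 20
data <- data.frame(S = runif(n), LPE = runif(n),
                   NP  = sample(2:7, n, replace = TRUE),
                   ANH = sample(130:300, n, replace = TRUE),
                   TIC = sample(2:6, n, replace = TRUE),
                   Newborn = rbinom(n, 1, 0.15))

testdata <- data[, 1:5]   # all rows, columns 1 through 5: everything but Newborn
testdata <- scale(testdata)

d <- dist(testdata, method = "euclidean")
hcward <- hclust(d, method = "ward.D")
data$groups <- cutree(hcward, k = 4)

aggdata  <- aggregate(. ~ groups, data = data, FUN = mean)
proptemp <- aggregate(S ~ groups, data = data, FUN = length)
aggdata$proportion <- proptemp$S / sum(proptemp$S)
aggdata <- aggdata[order(aggdata$proportion, decreasing = TRUE), ]

# save the table for later work in a spreadsheet tool; row.names = FALSE
# suppresses the row names in the file. If your regional settings use ";"
# as the separator, use write.csv2 instead.
write.csv(aggdata, "HR_example_output.csv", row.names = FALSE)
```

Note that the Newborn column is only excluded from the distance computation; it is still averaged per group in aggdata, which lets you check how newborns distribute across the clusters found without it.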