Let's now discuss the situation where we need to understand what makes the difference between two categories. Let's take again the HR analytics example we've seen previously: there, we can investigate the question of what's driving the employees' attrition. First, to understand what's driving this attrition, we need to add, to the dataset used in the last module, the employees who didn't leave the company! We cannot understand what's making a difference between two categories if we observe only one of them. So let's add 10,000 employees who didn't leave to the sample. We use the same variables as before, namely: the employee's satisfaction, her project evaluation, the number of projects done, utilization, time with the company, and whether the employee had a baby recently or not. We can compute that around 17% of employees left this year. This is huge! As I'll explain in a future module, this type of information should come at the very beginning of a presentation on HR analytics for this company. It is a huge issue to solve, and reporting this value would raise awareness of the relevance of conducting an HR analytics investigation. Then, one might first think of computing correlations with the cor command in R again, to see which variables are correlated with the fact that an employee left the company. During the recital, you will see how we can obtain these correlations. We see that satisfaction is negatively correlated with the departure of the employees, that the evaluation is weakly positively correlated, and so on. Here, however, each variable is considered separately. It may be that, given a certain level of satisfaction, the number of projects done has a different effect than the positive correlation we see here. Actually, we'll see later that, in this case, everything else being equal, the number of projects done DOES have the opposite sign.
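As a minimal sketch of this first step, the pairwise correlations can be obtained with a single call to cor on the data frame. The data frame `hr`, its column names, and the simulated values below are all hypothetical stand-ins, used only so the snippet runs end to end; the real course dataset is not reproduced here.

```r
# Hypothetical stand-in for the HR dataset: one row per employee,
# a 0/1 column `left`, and made-up names for the other variables.
set.seed(1)
n  <- 1000
hr <- data.frame(
  satisfaction      = runif(n),
  evaluation        = runif(n),
  n_projects        = sample(2:7, n, replace = TRUE),
  time_with_company = sample(1:10, n, replace = TRUE)
)
# Simulated attrition, driven here mainly by satisfaction.
hr$left <- rbinom(n, 1, plogis(2 - 4 * hr$satisfaction))

# The "left" row/column of this matrix shows, for each variable taken
# separately, its correlation with the departure of the employees.
round(cor(hr), 2)
```

Note that each entry of this matrix is a correlation between one pair of variables only, which is exactly the "each variable considered separately" limitation discussed above.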
It has a significantly positive effect on retention, hence a negative relation to attrition. So, to address this issue rigorously, we can use a logistic regression, for instance. We'll do it in R by using the glm function. A logistic regression produces a similar output to a linear regression, except that we're interested in an outcome that is 0 or 1, not continuous. Once we've estimated the relationship between the event "left", which indicates whether an employee left (that is a 1) or stayed (that is a 0), and the other variables, we can first assess the accuracy of this model. We can compute the proportion of correctly classified observations, for instance, and you will see that we managed to correctly classify 95% of the loyal employees, but only 19% of those who left. There are few employees who left, so they are more difficult to detect. As we'll discuss during the recital, you can decide to decrease the probability threshold above which an employee is considered likely to leave; that threshold is called the cutoff. But let's focus on the interpretations for the moment. Overall, we correctly classified 82% of the observations. This can be seen as acceptable, and we can consider the model reliable. In practice, deciding what counts as an "acceptable" classification accuracy is very often relative: if you have no clue, even something a bit better than random would already be an improvement. Let's look at which effects are significant, using the last column of the summary command. Remember that it reports the level of statistical significance with three stars when it's very strongly significant, two when it's significant, and only one when it's weakly significant; no star means it's not significant at all. Here, we see that everything is significant! This is probably due to the fact that, with 12,000 employees in the dataset, we have a lot of observations.
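The glm fit and the accuracy check can be sketched as follows, again on hypothetical simulated data (the column names and the data frame `hr` are stand-ins, not the course dataset). Setting family = binomial is what makes glm estimate a logistic regression.

```r
# Hypothetical stand-in data so the sketch runs on its own.
set.seed(1)
n  <- 1000
hr <- data.frame(satisfaction = runif(n),
                 n_projects   = sample(2:7, n, replace = TRUE))
hr$left <- rbinom(n, 1, plogis(2 - 4 * hr$satisfaction))

# Logistic regression: family = binomial turns glm into a logit model.
fit <- glm(left ~ satisfaction + n_projects,
           data = hr, family = binomial)
summary(fit)  # estimates, z values, and the significance stars

# Predicted probabilities of leaving, classified with a 0.5 cutoff.
# Lowering the cutoff flags more employees as likely to leave.
p_hat <- predict(fit, type = "response")
pred  <- as.integer(p_hat > 0.5)
table(observed = hr$left, predicted = pred)  # confusion matrix
mean(pred == hr$left)                        # overall accuracy
```

The confusion matrix produced by table is where the per-category accuracies come from: reading along its rows gives the share of loyal employees, and of leavers, that the model classifies correctly.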
And something that statisticians know very well is that, as the number of observations increases, all the effects tend to become significant. As a matter of fact, when you're dealing with very "big data", say millions of observations, statistical significance usually becomes meaningless, because even for effects that have no impact in practice, a regression will probably find statistical significance. Anyway! Since all of the effects are significant, let's focus instead on: one, how important they are; and two, whether each effect is positive or negative. We can use the absolute value of the z-value column to assess the importance of the variables. As with the t-value in a linear regression, in a logistic regression the larger the absolute z-value, the more important the effect. And we see that the most important variable is satisfaction first, the time with the company second, then the number of projects, and so on. Let's now take those effects and report them in a self-explanatory way, as for instance with this table. Again, I decided to report the most important effects first and to use colors, where green is for a positive effect, since green is usually seen as positive, and red for a negative one. You probably noticed that, when I say positive, I mean it in the business sense: it has a positive impact on the retention of the employee, and that's what you want. If you look at the estimate itself, it is, as a matter of fact, negative. But since the statistical estimate is probably meaningless for most of your audience (because, like it or not, most business practitioners are not that interested in statistical methodology), we replace a statistical indicator by something that will convey our message better. That's why here we use colors that are generally considered positive for a positive impact on your business. Let's wrap up what we've done until now. What did we do to report the results of the regression?
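The ordering by importance described above can be sketched by sorting the coefficient table on the absolute z value. As before, the simulated data frame `hr` and its column names are hypothetical stand-ins so the snippet is self-contained.

```r
# Hypothetical stand-in data and the same kind of glm call as above.
set.seed(1)
n  <- 1000
hr <- data.frame(satisfaction      = runif(n),
                 n_projects        = sample(2:7, n, replace = TRUE),
                 time_with_company = sample(1:10, n, replace = TRUE))
hr$left <- rbinom(n, 1, plogis(2 - 4 * hr$satisfaction))
fit <- glm(left ~ ., data = hr, family = binomial)

# z values from the summary table, intercept dropped, sorted so that
# the most important effect (largest absolute z) comes first.
z <- summary(fit)$coefficients[-1, "z value"]
z[order(abs(z), decreasing = TRUE)]
```

The sign of each estimate then tells you whether to color the effect red or green in the business-facing table, while this sorted vector gives the order in which to list them.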
One, we provided some numbers clearly reporting the accuracy of the model. Here we distinguished between two types of errors: when our model estimated that someone should have left but she didn't (that's a false positive), and failing to estimate correctly that someone left (that's a false negative). But we didn't use statistical terms and focused on the business interpretations instead. We also provided the overall rate of correctly classified observations. In a second step, we reported a table focusing on the effects. In red, we report the "bad effects": those negatively impacting what we want the outcome to be; here, we want to retain the employees. And in green, we have the "good effects". We assess "good or bad" in the business sense, not the statistical sense. Now, in reality, we should be really careful when interpreting the result of a regression, because very often the relationship between the outcome variable and an explanatory variable is not as unidirectional as it may seem. Satisfaction certainly explains whether you want to stay with the company or not. But the relationship between your evaluations and staying at the company, for instance, may go two ways. The company may decide to fire an employee if she has poor evaluations. But if another employee has already decided to leave, and she is looking for a job, it may impact her performance. And there may be a delayed effect, where the cause is the anticipated departure, leading to a decline in motivation, and the final consequence is a poor evaluation. This "loop of causality" causes what we call "endogeneity" in statistics. And while I won't explain what it is, because it's clearly out of the scope of this training, I can tell you that it will result in your effects being estimated inaccurately. So be careful when interpreting the effects you estimate, and see your results as clues leading you to your destination, not as decisive facts.
In the next video, let's investigate further how the effects we identified are really related to the employees leaving the company.