[MUSIC] As a manager, you want to be able to anticipate what the consequences of your actions will be, in order to maximize your effectiveness and efficiency. Nowadays, the idea that one needs to rely on data to make a decision, instead of gut feelings or one's opinion, is generally accepted. But too often, people still believe that correlation is the same as causation. It's not. The fact that two events happen at the same time does not allow you to conclude that one is the cause and the other the consequence. As a matter of fact, very often both are consequences of the same cause. And if you fail to see it, you won't be efficient, because you will try to change something that will have no effect on the outcome you want to obtain. As a first example, let's imagine that we are dealing with credit risk, and that we're in charge of selecting the customers that may or may not be granted a loan. What we are looking for is a way to understand how to measure the "credit score" of those applicants, and then make a decision based on this scoring. The first step is hence to understand what is driving credit scores in a quantitative and robust way. If we have hundreds of applicants, we cannot just guess. It's better to have a rigorous approach that will be statistically efficient. Let's take an example that is provided in the book "An Introduction to Statistical Learning with Applications in R", by James, Witten, Hastie and Tibshirani. In this data set, we have 300 individuals who were assigned a credit score. Actually, the original data set had some more individuals, but I'll keep those aside for the moment, and I'll explain why later. We also have information about the yearly income, the number of credit cards owned, their age, their level of education, their gender, whether they are a student or not, whether they are married, their ethnicity, and the average credit card debt, which we'll call balance. 
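As a sketch of how this data can be loaded in R, assuming the book's companion package ISLR is installed and its Credit data set is used (where the Rating column plays the role of the credit score; the 300-row subset below is purely illustrative, since the exact rows kept are not specified here):

```r
# Load the Credit data set from the book's companion package.
# install.packages("ISLR")   # if not already installed
library(ISLR)

data(Credit)
str(Credit)  # Income, Cards, Age, Education, Gender, Student, Married, Ethnicity, Balance, Rating, ...

# The full data set has 400 individuals; the lecture keeps 300 of them.
# The exact subset is not specified, so this split is only illustrative:
credit_subset <- Credit[1:300, ]
nrow(credit_subset)
```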
A regression models the relationships between potential causes (the independent variables) and an outcome (the dependent variable). As usual, we leave the technical considerations aside and focus on how this tool can help us. In practice, we should focus on three aspects. One: what is the explanatory power of the model? Or, said otherwise, is the model really helpful in describing the situation? This can be measured with the correlation between the values recovered by the model and the actual ones, or with the number of correctly classified observations if we use a logistic regression, as we'll see later. If the model's explanatory power is weak, we cannot expect the conclusions to be decisive. But if the model describes the situation well, we can then focus on two: what are the significant factors? The first outcome of a regression will be to separate the statistically significant factors from those that are not. This will be assessed using the p-value of the effects. Again, in this MOOC I'll not explain what the p-value is, but you can simply see it as an inverse measure of significance: the smaller the p-value, the more significant the effect is estimated to be. We can only act on an effect that is statistically significant, but if it is statistically significant, we can then look at three: the sign of the estimated effect. Is it positive? Is it negative? And which effects are the most important ones? Note that, as is often the case in statistics, we will first be interested in knowing whether an effect is significant or not, before wondering whether the effect is positive or negative. Because at the end of the day, knowing that an effect is significant, even without knowing its sign, already tells us as managers that we should pay attention to that driver. Whereas knowing only whether an effect is positive or negative, without knowing whether it is really significant, is not actually actionable in practice. 
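These three aspects can all be read from a regression summary. As a minimal, self-contained sketch on simulated data (the variable names, scales, and coefficients are invented purely for illustration):

```r
# Simulated example: a "credit score" driven by income and balance.
set.seed(1)
n       <- 200
income  <- rnorm(n, mean = 45, sd = 10)    # invented scale
balance <- rnorm(n, mean = 500, sd = 150)  # invented scale
score   <- 300 + 4 * income + 0.3 * balance + rnorm(n, sd = 20)

fit <- lm(score ~ income + balance)
summary(fit)  # aspect 1: R-squared; aspect 2: p-values; aspect 3: signs of the estimates
```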
So it's indeed significance that is of primary interest! All this information can be provided by our R tools, such as the lm function, for instance. So let's apply it to the credit scoring situation. The first thing we need to assess is whether the model does a good job or not at estimating the credit scores. As we'll see during the module dealing with predictions, the most rigorous way to assess the quality of our estimator would be to take out-of-sample data and compare estimations with actual results for observations that have not been used to estimate the model parameters. This would assess whether our estimator does a good job at "predicting" the outcome in an unknown situation. But here I'll just look at the "in-sample" accuracy. For each observation, the regression provides a credit score that is the one "estimated by the model". Those are what we call the fitted values. We can then compute the correlation between those "fitted values" and the actual credit scores. You will see that you obtain a correlation of 99%, which is amazingly good! But we need to be careful with correlations, because they may be driven by the most extreme observations. Imagine that you had only one accurate prediction, but that it was a very extreme one, say 1,000 times bigger than the other ones. In that case, we would have a correlation close to one as well. So let's find a visual way to assess the quality of predictions by plotting the fitted values as a function of the actual credit score. And we see then that the recovery is indeed very good, even if, as I just mentioned, the smaller values are not as well estimated as one could believe from a correlation of 99%. Now that we are confident in the quality of our model, let's look at the significance of the effects. The summary function provides all the relevant information about the model. 
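These steps can be sketched with lm, again assuming the ISLR Credit data set with Rating taken as the credit score; the exact model formula used in the lecture is not shown, so the one below, which includes the drivers listed earlier, is an assumption:

```r
library(ISLR)  # companion package of the book; assumed installed
data(Credit)

# Assumed model: credit score (Rating) regressed on the drivers listed earlier.
fit <- lm(Rating ~ Income + Cards + Age + Education + Gender +
            Student + Married + Ethnicity + Balance, data = Credit)

# In-sample accuracy: correlation between fitted values and actual scores.
cor(fitted(fit), Credit$Rating)

# Visual check: fitted values as a function of the actual credit score.
plot(Credit$Rating, fitted(fit),
     xlab = "Actual credit score", ylab = "Fitted credit score")
abline(0, 1, col = "red")  # perfect-recovery line

summary(fit)  # significance of each effect
```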
In particular, it reports the level of statistical significance, with three stars when an effect is very strongly significant, two when it's significant, and only one when it's weakly significant. A dot means that the significance level is very low, and no symbol means that it's not significant at all. The intercept is not really relevant here. And we see that income, balance, and the fact that the applicant is a student or not are very significant. The age is only weakly significant. Now let's focus on the significant drivers and look at their effects. What are the most important ones, and is the effect positive or negative? The problem is that the drivers are not directly comparable: some variables are zero or one, some are categories, and some are continuous variables with different ranges. So let's look at the absolute value of the t-value column. This column allows us to assess the strength of the significance of a driver: the larger the absolute value, the better. We see that the most important driver is balance, then income, then the fact that the applicant is a student, and only then the age, which is only weakly significant. We can then look at the sign of the estimates, which indicates whether the effect on the credit score is positive or negative. Finally, let's report only the relevant information for a manager, and let's do it in a visual way, by focusing on how important the factors are, ranking them, and color-coding them according to whether they are positive or negative; we then get this table. A table is nice, but can't we think of an even more visual way to report those effects? We could use a plot. For instance, we could report the dependent variable, here the credit score, as a function of the most important factors we've identified thanks to the regression. If you do it in R, you will see that you can obtain this plot, which nicely reports the relationship between the dependent variable and the balance. 
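The ranking of the drivers by the absolute t-value can be sketched as follows, again under the assumption that the model is the one fitted on the ISLR Credit data with Rating as the credit score:

```r
library(ISLR)  # assumed installed
data(Credit)

fit <- lm(Rating ~ Income + Cards + Age + Education + Gender +
            Student + Married + Ethnicity + Balance, data = Credit)  # assumed formula

coefs  <- summary(fit)$coefficients[-1, ]  # drop the intercept row
ranked <- coefs[order(abs(coefs[, "t value"]), decreasing = TRUE), ]
ranked[, c("Estimate", "t value", "Pr(>|t|)")]  # sign, strength, and significance of each driver
```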
Balance is, as we have seen, the driver with the largest impact on the credit score. As long as it's consistent with the effects estimated in the regression, you can always produce as many plots as needed to report in a visual way how the causes impact the consequence you are interested in. People will always understand a plot like this more easily than a story about p-values and t-tests. For instance, you could also do the same thing with the income, like this. But never forget that this is only an "unconditional" result, meaning that here you do not take the other effects into account, whereas the regression does. But as long as it's consistent with rigorous statistical approaches, this type of representation should always be preferred in practice, since it helps your audience understand the message you want to convey.
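Such plots can be sketched in base R, once more assuming the ISLR Credit data with Rating as the credit score:

```r
library(ISLR)  # assumed installed
data(Credit)

# Credit score as a function of its most important driver, the balance...
plot(Credit$Balance, Credit$Rating,
     xlab = "Average credit card debt (balance)", ylab = "Credit score")

# ...and, as a second example, as a function of income.
plot(Credit$Income, Credit$Rating,
     xlab = "Yearly income", ylab = "Credit score")
```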