So far, we have discussed predictions using sample means, that is, using means of the variable we want to predict. We now enter the world of predictors: we look at how we can use other variables, correlated with the variable we are interested in, to make the prediction. In this and the next sessions, we start with simple correlations and multiple regressions. Multiple regression is itself a more elaborate form of correlation. The real difference is the analysis of causality, which we tackle in the final sessions of this course.
We first need to clarify the distinction between correlation and causality. The definition of correlation is: if two or more variables are correlated, when one increases, the other one increases or decreases. The definition of causality is: if one variable causes another, when the first variable increases, the other one increases or decreases. What is the difference then? An example may help. Consider the following statement: startups with more qualified employees make more profits. Is it a causal statement or a correlation? Causality means that having qualified employees makes the startup more profitable, for example because they are more productive, more innovative, or more motivated.
However, there are two main reasons why there could be correlation without causality. The first reason is called omitted variables. In this case, you think one variable causes the other, but there is a third variable causing both. In our example, high-tech sectors could be more profitable, and they demand more qualified employees. The second reason is reverse causality: more profitable startups attract more qualified employees, who seek higher salaries that only more profitable companies can provide. Confusing correlation with causality produces wrong decisions. For example, you hire more qualified employees because you think they make your startup more profitable, but they do not raise the profits of the company.
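The omitted-variable case can be made concrete with a small simulation. What follows is only a sketch in Python (not the Stata used in this course), on synthetic data: the sector shares, profit levels, and noise sizes are all made-up assumptions. In this toy world the share of qualified employees has no effect on profits at all; both are driven by whether the startup is in a high-tech sector.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is reproducible
startups = []
for _ in range(2000):
    high_tech = 1 if rng.random() < 0.5 else 0  # the omitted variable
    # Both outcomes depend on the sector only (assumed numbers).
    qualified_share = 0.2 + 0.4 * high_tech + rng.gauss(0, 0.05)
    profit = 10 + 20 * high_tech + rng.gauss(0, 5)  # no term in qualified_share
    startups.append((high_tech, qualified_share, profit))


def share_profit_corr(rows):
    """Correlation between qualified share and profit for the given rows."""
    return pearson_r([r[1] for r in rows], [r[2] for r in rows])


r_all = share_profit_corr(startups)
r_high_tech = share_profit_corr([r for r in startups if r[0] == 1])
r_other = share_profit_corr([r for r in startups if r[0] == 0])

print(f"all startups:   r = {r_all:.2f}")
print(f"high-tech only: r = {r_high_tech:.2f}")
print(f"other sectors:  r = {r_other:.2f}")
```

Across all startups the correlation is strongly positive, yet within each sector it is close to 0. Hiring more qualified employees in this simulated world would not raise profits, even though the share of qualified employees is a good predictor of them.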
Correlations are nonetheless useful because they generate predictions. If two variables are correlated, we can use one to make predictions about the other. You can also use them to falsify theories. For example, suppose a manager theorizes that qualified employees are more innovative and raise the profits of the company. If the manager finds that companies with more qualified employees are not more profitable, the theory is not correct. But the important distinction is that while correlations make predictions, only causal statements inform actions. Only if you find that qualified employees cause increases in company profits can you expect to raise the performance of your firm by hiring more qualified employees. Otherwise, you have a predictive instrument, but if you take the action, that is, hire qualified employees, you do not obtain the outcome you expect, which is the increase in profits.
Correlation analysis measures the strength of the linear association between two variables x and y. It measures how much the two variables are related in the sense of moving in the same direction (both increase or both decrease) or in opposite directions (when one increases, the other decreases). Linear correlation does not capture nonlinear relationships, for example when the variables are related in a U-shaped form.
Scatter plots, or XY diagrams, help to see correlations. For example, consider a sample of companies whose profits and share of employees with a university degree are represented by dots, with profits on the x-axis and the share of employees on the y-axis. The dots can be aligned positively, when the two variables increase or decrease together, or negatively, or they can be scattered with no pattern. In this last case, the two variables are uncorrelated: changes in one variable are not systematically associated with changes in the other variable. The correlation coefficient is a number between -1 and 1 that measures the strength of the linear correlation.
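To see what "linear" means here, the following minimal Python sketch computes the standard Pearson correlation coefficient by hand on three tiny made-up datasets. The function name and the numbers are illustrative assumptions, not part of the course material.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]
y_increasing = [2 * v + 1 for v in x]  # dots on an increasing line
y_decreasing = [-v for v in x]         # dots on a decreasing line
y_u_shaped = [v ** 2 for v in x]       # perfect U-shaped relationship

r_increasing = pearson_r(x, y_increasing)
r_decreasing = pearson_r(x, y_decreasing)
r_u_shaped = pearson_r(x, y_u_shaped)

print("increasing line:", r_increasing)
print("decreasing line:", r_decreasing)
print("U shape:        ", r_u_shaped)
```

The U-shaped data are perfectly predictable from x, yet their linear correlation is zero. This is why it is worth looking at the scatter plot before concluding that two variables are unrelated.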
A coefficient closer to 1 means that the dots tend to be aligned along an increasing line in the x-y space, and they are perfectly aligned on the line when the coefficient is exactly equal to 1. A coefficient closer to -1 means that the dots tend to be aligned along a decreasing line, and they are perfectly aligned when the coefficient is exactly -1. The correlation coefficient only captures the alignment along the line, not the slope of the line. This is the difference from regression analysis, which we discuss in the next session. If the dots are scattered, the correlation coefficient is close to 0, indicating no linear association between the variables. You can easily find the formula for computing the correlation coefficient on the web or in basic statistics textbooks.
To show how to use correlation coefficients in practice, load again our Italian startups dataset in Stata. You can check the correlation between total funding and the number of letters in the company's name, another variable included in the dataset. The Stata command for the correlation coefficient is pwcorr, followed by the names of the two variables. If you launch this command, you will see that the correlation coefficient is -0.124, suggesting that companies with shorter names tend to raise more funds.
The correlation coefficient is itself a random variable because different samples from a population yield different correlation coefficients. This means that the correlation coefficient comes with measures of statistical significance that tell us whether the estimate is different from 0. In Stata, you add a comma and the option sig after pwcorr and the variables. This command provides you with the p-value of the estimated coefficient. This p-value relates to the type I error, that is, concluding that the correlation is different from 0 when it is not. It measures the probability that, if the true coefficient was 0, you'd obtain an estimate at least as far from 0 as the one you actually obtain.
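One way to get a feel for what this p-value means is to simulate a world in which the true correlation is 0. The sketch below does this in Python with a permutation test: it shuffles one variable many times, which breaks any link with the other, and counts how often the shuffled data give a correlation at least as far from 0 as the observed one. This is a different technique from the t-based test that Stata reports, and the data are synthetic; the variable names, the sample size of 200, and all the numbers are assumptions made up for illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    r_observed = pearson_r(xs, ys)
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # shuffling removes any association with xs
        if abs(pearson_r(xs, shuffled)) >= abs(r_observed) - 1e-12:
            extreme += 1
    # The +1 terms count the observed sample itself, so p is never exactly 0.
    return (extreme + 1) / (n_shuffles + 1)


# Synthetic stand-in data, NOT the course dataset.
data_rng = random.Random(42)
name_length = [data_rng.randint(4, 15) for _ in range(200)]
funding = [100 - 2 * k + data_rng.gauss(0, 30) for k in name_length]

r = pearson_r(name_length, funding)
p = permutation_p_value(name_length, funding)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```

A small p-value says that shuffled, truly unrelated data almost never produce a correlation as strong as the one observed.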
You can check that in our case we obtain a p-value of 0.03 percent for our estimated coefficient of -0.124. This says that if the true correlation coefficient was 0, we would obtain an estimate as far from 0 as -0.124 with an incredibly small probability. In turn, this suggests that the true correlation is very unlikely to be 0. pwcorr does not show confidence intervals, but another way to think of this level of significance is that the 95 percent confidence interval around -0.124 is narrow enough to exclude 0.
Correlation or causality then? The immediate intuition is that this is a correlation. However, it may be that a shorter name makes it easier to remember the company, which facilitates fundraising. A manager can only answer this question with a theory and a test of it, in particular using the causal tools to be discussed in future sessions. However, this correlation helps prediction. If in our sample we see companies with shorter names, we predict they have raised more funds. But unless we establish causality, we cannot expect that shortening the name of our company will help it raise more funds.
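Since pwcorr does not report a confidence interval, here is a rough Python sketch of how one could be approximated with the Fisher z-transformation, a standard textbook method. The sample size used below is a made-up assumption: the number of startups in the dataset is not stated here, so the printed interval is illustrative only.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z-transform."""
    z = atanh(r)                 # transform r to an approximately normal scale
    se = 1 / sqrt(n - 3)         # approximate standard error on that scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95 percent
    return tanh(z - crit * se), tanh(z + crit * se)


r = -0.124            # coefficient reported in the lecture
n_hypothetical = 800  # ASSUMPTION: illustrative sample size, not from the dataset

low, high = fisher_ci(r, n_hypothetical)
print(f"95% CI for r: [{low:.3f}, {high:.3f}]")
```

With this assumed sample size, the whole interval lies below 0, which matches the message of a small p-value. With a much smaller sample, the same coefficient of -0.124 would give a wider interval that could include 0.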