Welcome! The previous lectures have shown that ordinary least squares is a great tool to uncover relationships in economics and business. In this lecture, I'm going to make you aware that this tool does not always work. There are circumstances where OLS breaks down. These circumstances relate to the difference between correlation and causality. Luckily, econometrics also has the solution. But before we discuss this, let's consider a motivating example. Suppose we want to explain the monthly number of departing flights at an airport using the number of travel insurances sold in the month before. What kind of relationship would you expect if you regress flights as the variable y on a constant and insurances as the variable x? Most likely we will obtain a positive relationship. For example, like this. Now, what do these estimates really mean? I invite you to think about this by answering a test question. It is correct to use the estimates to make predictions. The positive coefficient indicates that many insurances sold go together with many flights. Note that this statement merely relies on a correlation. The positive correlation that we found can be used to make adequate predictions. It is incorrect to interpret the coefficient in a causal way. Selling additional insurances does not cause an increase in flights. There is another variable that causes both the insurances and the flights. This variable is simply the demand for travel. This example shows that we cannot always interpret least squares estimation results as causal effects. However, identifying causal effects is one of the main goals of econometrics. Ordinary least squares requires some assumptions for it to correctly estimate causal effects. One important assumption is that explanatory variables are exogenous. The violation of this assumption is called endogeneity. In this lecture, and the upcoming ones, you will learn to understand and recognize endogeneity. 
You will get to know the consequences of this and you will learn how to come up with an alternative estimator. You will learn how this new estimator works and the conditions that are necessary for it to work properly. Finally, you will learn how to test these assumptions. Let us start by studying the source of endogeneity. The formal assumption that we violate is the assumption that the explanatory variables X in the linear model are non-stochastic. So what does non-stochastic really imply? Literally speaking, non-stochastic means that if you were to obtain new data, only the y values would be different and the values for X would stay the same. This is like a controlled experiment where the researcher determines the experimental conditions coded in X. This assumption is crucial for the OLS estimator to be consistent. Consistent means that the estimator b converges to the true coefficient beta when the data set grows larger and larger. In economics, however, controlled experiments are rare. X variables are often the consequence of an economic process, or of individual decision making. In our example, the travelers together determine the number of insurances sold. From the researcher's point of view, the X variables should therefore be seen as stochastic. Once we allow X to be stochastic, we acknowledge that we would get different X values in a new data set. And if variables are stochastic, they can also be correlated with other variables, even with variables that are not included in the model! In the context of our example, the number of insurances will be correlated with the travel demand. Although travel demand is difficult to observe and not included in the model, it does influence the number of flights. In the model, travel demand is therefore part of the error term epsilon. As a consequence, the X variable, insurances sold, is correlated with epsilon. If an explanatory variable X is correlated with epsilon, we say that X is endogenous. 
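To make the flights example concrete, here is a minimal simulation sketch in Python with NumPy. Everything in it is an assumption made for illustration and not part of the lecture: the variable names, the sample size, and all numerical values are invented. By construction, a hypothetical unobserved `demand` variable drives both `insurances` and `flights`, and insurances have no causal effect on flights.

```python
import numpy as np


def simulate_flights(n, seed=0):
    """Simulate made-up data in which travel demand drives both variables."""
    rng = np.random.default_rng(seed)
    demand = rng.normal(100.0, 10.0, size=n)  # unobserved travel demand
    insurances = 2.0 * demand + rng.normal(0.0, 5.0, size=n)
    # Flights depend on demand only; the causal effect of insurances is zero.
    flights = 3.0 * demand + rng.normal(0.0, 5.0, size=n)
    return demand, insurances, flights


def ols(y, *regressors):
    """Least squares coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


demand, insurances, flights = simulate_flights(100_000)

# Regression that omits demand: the slope picks up the common cause.
b_short = ols(flights, insurances)

# Regression that includes demand: the insurance coefficient is close to zero.
b_long = ols(flights, insurances, demand)

# Error term of the short model under the true causal effect of zero
# (up to a constant): it contains demand, so it is correlated with insurances.
epsilon = flights - 0.0 * insurances
corr_x_eps = np.corrcoef(insurances, epsilon)[0, 1]

print("slope without demand:", b_short[1])
print("slope with demand:   ", b_long[1])
print("corr(insurances, epsilon):", corr_x_eps)
```

In this setup the slope of the short regression is clearly positive and useful for prediction, yet it says nothing about the causal effect, which is zero here by design.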
Usually, this correlation is due to an omitted factor. We will later see that this leads to inconsistency of the OLS estimator. That is, OLS does not properly estimate beta. If X is uncorrelated with epsilon, X is called exogenous and OLS is consistent. Now let's consider three possible sources of endogeneity in more detail. Endogeneity is often due to an omitted variable. In our example, the omitted variable was travel demand. Let's consider this situation formally. Suppose that the true model for a variable y contains two blocks of explanatory variables, X1 and X2, and that in this true model, all assumptions are satisfied. However, when we estimate beta, we omit X2. That is, we regress y only on X1. The error term epsilon in this second model now contains the original error, eta, as well as the omitted effect of X2. From this relationship we can see that in the second model X1 will be correlated with epsilon if X1 and X2 are correlated and beta2 does not equal 0. The derivation at the bottom of this slide proves this. When thinking about whether certain variables in a model are endogenous, it is good to think about potential omitted variables. If you can think of an omitted variable that is related to both the included variables and the dependent variable, you will have endogeneity. Let's practice this a bit in a test question. Suppose we run a regression to explain a student's grade using only the number of attended lectures. What omitted variable leads to endogeneity here? Neither the difficulty of the exam nor the introduction of compulsory attendance leads to endogeneity. The first variable cannot affect attendance, while the second does not affect the grade. The omission of the motivation of students does lead to endogeneity. Highly motivated students are likely to attend many lectures and obtain high grades. So a regression of grades on attendance will not show the true impact of attendance. It will partly capture the unobserved motivation as well. 
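The slide derivation for the omitted variable case is not part of this transcript. The following is a sketch of the standard argument, written with the symbols used above; the sample size n and the probability limit notation are assumptions added here for illustration.

```latex
% True model and the model that omits X_2
y = X_1 \beta_1 + X_2 \beta_2 + \eta ,
\qquad
y = X_1 \beta_1 + \varepsilon ,
\qquad
\varepsilon = X_2 \beta_2 + \eta .

% Sample moment between X_1 and the new error term
\frac{1}{n} X_1' \varepsilon
  = \frac{1}{n} X_1' X_2 \, \beta_2 + \frac{1}{n} X_1' \eta .

% In the true model X_1 is unrelated to \eta, so the last term vanishes:
\operatorname*{plim}_{n \to \infty} \frac{1}{n} X_1' \varepsilon
  = \Bigl( \operatorname*{plim}_{n \to \infty} \frac{1}{n} X_1' X_2 \Bigr) \beta_2 .

% Consequence for the OLS estimator in the short regression
b_1 = (X_1' X_1)^{-1} X_1' y
    = \beta_1 + (X_1' X_1)^{-1} X_1' X_2 \, \beta_2 + (X_1' X_1)^{-1} X_1' \eta .
```

The limit in the third line is nonzero only if X1 and X2 are related and beta2 differs from 0, which matches the two conditions stated above. The last line shows where the extra term in b1 comes from.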
A second cause of endogeneity is strategic behavior. Consider a model in which you explain the demand for a product using only its price. If the salesperson strategically sets high prices when a high demand is expected, high demand will often go together with high prices! A simple regression may then yield a positive price coefficient. This is of course not the true impact of price. Price is endogenous in this regression as it correlates with the market information, which in turn determines demand. A third reason for endogeneity is measurement error. Suppose that we have a variable y, say, salary, that depends on a factor that is difficult to measure. For example, intelligence. Let's denote the intelligence by x*. We can obtain a noisy measurement of intelligence, for example through an IQ test. The test score is called x and is equal to the true intelligence plus the measurement error. In the training exercise, you will be asked to show that such measurement error leads to endogeneity in a model that explains y using the test score x. To summarize, endogeneity is a common and serious challenge in econometrics, as OLS does not consistently estimate the coefficients under endogeneity. In the next lectures, we will consider solutions and tests for endogeneity. Now I invite you to do the training exercise, to train yourself with the topics of this lecture. You can find this exercise on the website. This concludes the lecture.
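As a numerical companion to the measurement error case, the following Python (NumPy) sketch simulates a salary-like variable y that depends on an unobserved x*, and a test score x equal to x* plus noise. It is an illustration only, not the solution to the training exercise: the coefficient `beta`, the variances, and the sample size are all invented for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta = 2.0  # hypothetical true effect of x* on y

x_star = rng.normal(0.0, 1.0, size=n)  # true factor, not observed in practice
v = rng.normal(0.0, 1.0, size=n)       # measurement error of the test
eta = rng.normal(0.0, 1.0, size=n)     # error term of the true model

x = x_star + v           # observed noisy test score
y = beta * x_star + eta  # outcome depends on the true factor only


def slope(y, x):
    """Slope coefficient from a regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


b_true = slope(y, x_star)  # infeasible regression on the true factor
b_noisy = slope(y, x)      # feasible regression on the test score

# Writing the model in terms of x gives y = beta * x + epsilon with
# epsilon = eta - beta * v, which shares the term v with x.
epsilon = y - beta * x
corr_x_eps = np.corrcoef(x, epsilon)[0, 1]

print("slope using x*:", b_true)
print("slope using x: ", b_noisy)
print("corr(x, epsilon):", corr_x_eps)
```

With these made-up settings, the regression on the test score gives a slope that is noticeably smaller than `beta`, and x is negatively correlated with the error term of the model written in terms of x.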