In this video, we'll review the linear regression model assumptions. Then we'll start to describe a set of methods that will allow us to detect violations of those assumptions. Recall that statistical models in general, and linear regression models in particular, require statisticians and data scientists to make certain assumptions about their data. In the context of regression, we made four statistical assumptions about our data. The first one was the linearity assumption, and that was really the most important of the statistical assumptions. In the linearity assumption, we assume that the relationship between the response variable and the parameters is linear. Now, often there will also be a linear relationship between the response and the predictor variables, but that's actually not required in order to use the methods we've described so far, namely least squares and maximum likelihood estimation. We require a linear relationship between the response and the parameter values. The second assumption that we make is that the response measurements are independent of one another. When we also have normality, which is the fourth assumption, this is equivalent to saying that the response measurements are uncorrelated. Another way of saying this is that the error terms in the model are independent, or uncorrelated. What this really means is that the probability distribution placed on y_i doesn't change if we know something about, say, y_k, where k is not equal to i. Knowing one of the other response measurements doesn't tell us anything about the response measurement in hand in terms of its probability of occurrence. The third assumption that we make in linear regression is homoskedasticity, or constant variance.
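To make the linearity-in-parameters point concrete, here's a minimal sketch (all data simulated, not from the lecture): the model below is quadratic in the predictor x, so it's nonlinear in the predictor, yet it's still linear in its coefficients, and ordinary least squares applies without modification.

```python
import numpy as np

# Simulated example: y depends on x through a quadratic, but the model
# y = b0 + b1*x + b2*x^2 + error is linear in the parameters b0, b1, b2.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.5, size=200)

# Design matrix with columns 1, x, x^2; least squares treats x^2 as just
# another predictor, since the fit is linear in the coefficients.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [1, 2, -3]
```

The same idea covers any transformed predictors (logs, interactions, and so on): as long as the parameters enter linearly, least squares and maximum likelihood estimation work as described.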
We assume that every response measurement has the same variance, so we don't see low variability for some response measurements and high variability for others. Finally, our last assumption is that the error terms, or equivalently the response measurements, are normally distributed. Here we place a stricter criterion on the response measurements or the error terms, namely that they follow a normal distribution and not just some other distribution. So far in the course, we've taken for granted that these regression assumptions have been met. But in this unit, we'll dig a little deeper and consider techniques that will allow us to decide whether or not the assumptions have been met. What we'll typically do is fit a model to the data, study some properties of that model, and see if those properties seem a bit off. If they're off, we might have some evidence against one of the assumptions being met. If we don't see anything off in the properties of the model, then we don't have evidence of a violation of the assumptions. Now, it's worth noting that there are broadly two different kinds of techniques for diagnosing violations of our assumptions: numerical techniques, which can be formal statistical tests, and graphical techniques. It might seem like the numerical techniques would be better, since they appear in some ways more objective than the graphical techniques. But in fact, that's typically not the case. Statisticians and data scientists will usually favor the graphical techniques, because they give a bit more insight into the models that we fit and into their potential misfit or misspecification. It's also worth noting that in this coming unit we'll really learn how to assess the statistical modeling assumptions and whether they hold for a particular set of data.
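As a minimal sketch of the fit-then-diagnose workflow (simulated data again, and a deliberately crude numerical check rather than any formal test from the course): fit a model, extract the residuals, and compare residual variability across the range of fitted values. A variance ratio far from 1 would hint at a violation of the constant-variance assumption.

```python
import numpy as np

# Simulated data with homoskedastic errors, so the check should pass.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=300)

# Fit by least squares and compute residuals from the fitted values.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Crude heteroskedasticity check: split residuals by fitted value and
# compare the sample variances of the two halves.
order = np.argsort(fitted)
lo, hi = resid[order[:150]], resid[order[150:]]
ratio = hi.var(ddof=1) / lo.var(ddof=1)
print(round(ratio, 2))  # near 1 here, since the errors share one variance
```

In practice, as noted above, a plot of residuals against fitted values usually reveals the same thing more informatively than a single summary number.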
For example, we'll learn to fit a model to data and then use the fit to diagnose violations of the key assumptions. But if you remember back to module one, we introduced an additional fifth assumption, called validity. Validity applies to a dataset: we would say that the validity of a dataset or a measurement tool is the extent to which it measures what it claims to measure. It's not always the case that our data measure the thing that we claim to measure. This can be pretty tricky, especially in the social sciences, where we might have data that operationalize something but don't do it very well. There's a nice passage from a regression book called Regression and Other Stories, by Gelman, Hill, and Vehtari. These authors write that we can define the validity of a measurement process as the property of giving the right answer on average across a wide range of plausible scenarios. To study validity in an empirical way, ideally you want settings in which there is an observable true value and multiple measurements can be taken. That would mean you could measure your variables many times, to get a sense of how variable the measurements are and whether you're really capturing the thing that you want to capture. You should note that this is something that must be assessed during the measurement process; it's not necessarily something that can be assessed once the data have been collected to be analyzed for explanatory or predictive purposes. Further, Gelman, Hill, and Vehtari go on to say that in social science, validity can be really difficult to assess. When the truth is not available, measurements can be compared to expert opinion or to another gold standard of measurement.
For instance, a set of survey questions designed to measure depression in a new population could be compared to the opinion of an experienced psychiatrist for a set of patients, or to a well-established depression inventory. The idea here is that if you develop some new tool to try to measure something like depression, then in order to know that it's a valid measure of depression, you should compare it to something that you already know measures depression well. That might be expert judgment, like a psychiatrist's, or it might be another measurement tool that has already been validated. Either way, there should be some way to assess the validity of your new tool. It's important to note that, because of these issues, validity really should be assessed in collaboration with a domain expert, someone who knows the science or business application well. Ideally you would assess validity before you collect data, if possible, or at least have that conversation with a domain expert: do they really think that the data in front of them are valid for measuring whatever it is they want to measure? Lastly, we should note that the diagnostic methods we'll cover in this module don't necessarily assess validity. A dataset might fail to be valid for answering the research questions at hand, and yet the diagnostic techniques in the lessons to follow may not reveal any problem, because they only look for deviations from the assumptions of linearity, independence, constant variance, and normality. Just something to keep in mind: validity is important, but it's not something that we will cover much in this module.