Hi everyone, and welcome to our lecture on linear modeling. In this section, we're going to study some more examples of how, given a bunch of data points, we can use a linear model to predict one variable from another, and do better than just guessing randomly. We also want to distinguish, and this is important, between the ability to predict one variable from the other and the question of whether changes in one variable are actually caused by changes in the other. And of course, we're going to keep coming back to this idea that just because two variables are correlated does not mean there is a causal relationship. Correlation does not imply causation. Okay, so let's start with regression lines. If a scatterplot shows a roughly straight-line relationship, give or take, like the one we see on screen here, with a bunch of points scattered around, generally, a line (some off the line, of course), we would like to summarize this pattern by drawing a line on the graph. A regression line summarizes this relationship between the two variables. One of the variables helps predict the other, and that's the idea of a regression line. There are plenty of regression lines we could draw. Maybe there's one that goes over here, or another that goes a little above or below. It just depends on where the line is. But we often use these regression lines to predict the y value given a value of x. So for example, in this data set where each blue dot represents a data point, let's pick a value that I don't actually have on my graph. Say there's no particular data point at x = 19.5. If I have some x value at 19.5, what prediction will I actually get for my y value, given all these other data points that are there? What you can do is use the regression line, whichever one you want, to make your prediction.
Depending on your regression line, you can get different y values. For the three regression lines I drew here, you can imagine three corresponding predictions, y1, y2, and y3. Depending on which line you pick, you get a higher prediction or a lower prediction. You can always draw a line by hand that passes close to the points, somewhere in the middle, but you'd really like, of course, a mathematical way to do this so that you get the best line. First you have to define what it means to be the best. What we're going to do is minimize the vertical distances between the points and the line. If I grab any point, I can drop a straight line down (or up, depending on where it is) to the line, and I get some distance to the line: d1, d2, and so on. These distances will vary from point to point. So you find all these vertical distances, one di for every single point. You square each one, and then you add them all up; using sigma notation again, you sum up all the di squared. This is called the sum of the squares of the vertical distances. It's a bit of a mouthful, but that's what it is. The goal is to make this sum as small as it can be. Again, this is best left to a computer when there are many, many data points. Even with a small number of data points, doing this by hand is pretty rough, and it actually involves some techniques from calculus which we don't have yet. So we'll let the computer handle it. This best line, not one we draw by hand but the one that is actually calculated via this process, is called the least-squares regression line.
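To make this concrete, here's a small Python sketch of the quantity being minimized. The data points and the two candidate lines are made up for illustration; they are not the ones on the lecture's graph.

```python
def sum_of_squares(points, m, b):
    """Sum of squared vertical distances d_i from each point to the line y = mx + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

# Illustrative data points scattered around a rough line (made up).
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]

# Two hand-drawn candidate lines: one close to the points, one farther off.
close_line = sum_of_squares(points, 2.0, 0.0)  # small sum of squares
far_line = sum_of_squares(points, 1.0, 3.0)    # much larger sum of squares
```

The least-squares line is, by definition, the choice of m and b that makes this sum as small as it can possibly be.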
So we say the least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. We want to minimize that quantity. When you use a computer or a calculator to find it, you get back the good old equation of a line, y = mx + b. Your m and b, though, are found by the computer, and they're usually three or four decimals long; they're usually not pretty numbers. That's one of the differences you'll see between nice textbook examples that are just helping you understand what slope and intercept mean, versus real numbers found from real data sets. So speaking of data, let's get right to it with our small data set. We have brain size, measured in units of 10,000 pixels: 100, 90, 95, 92, 88, 106. And intelligence, measured as IQ: 140, 90, 100, 135, 80, 106. Obviously not a very useful data set, just a sample so we can play around with it. I have the same scatterplot as before, with our six points on the graph. What I want to do now, and I used Microsoft Excel to get this (we'll see how to do this in the next video), is find the equation of the regression line. You can see the line goes right through the points, and I get back, writing it a little bigger, y = 1.3547x + (-20.922). As expected, our numbers are kind of gross decimals. These are the rounded versions; they actually go on for more digits. So our slope m is 1.3547, our intercept b is -20.922, and there it is, mx + b, the equation of the line. Get used to these nasty numbers when you're working with least-squares regression lines. Let's think about how to interpret them.
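Although we're letting the computer do the minimizing, there is a standard closed-form formula for the least-squares slope and intercept, and it's short enough to sketch in Python. The data set here is made up so the answer comes out exact; the lecture's brain-size data would of course produce messier decimals.

```python
def least_squares(xs, ys):
    """Closed-form least-squares slope m and intercept b for y = mx + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of (x - x_bar)(y - y_bar) divided by sum of (x - x_bar)^2.
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    b = y_bar - m * x_bar  # the least-squares line always passes through (x_bar, y_bar)
    return m, b

# Illustrative data (made up): points lying exactly on y = 2x + 1.
m, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```

With real data the slope and intercept come back as long decimals, which is why software like Excel reports rounded values such as 1.3547 and -20.922.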
The slope is, as always, the change in the y value over the change in the x value, rise over run. It measures how the y variable changes when x changes by one unit. If it's positive, the variables tend to be positively correlated; if it's negative, negatively correlated. So it tells you the general shape and direction of your scatterplot. This number is actually more important, I think, than the intercept. The intercept, remember, is the value of y when x is 0: plug in 0 and you get b. In our picture back here, think for a second about whether this makes sense. It says that if I extend the line all the way back to x = 0, I get a negative intercept, corresponding to the fairly meaningless data point (0, -20.922). We normally don't care much about the intercept. If I'm asking you a general math question and say, find the intercept, sure, it's a point on the graph. But when these values have real-world meanings, the data value 0 tends not to be useful very often. Think about what this says: if I have a brain size of 0 pixels, meaning there's literally nothing on the screen, then my IQ is about -21. This is where interpretation of the values comes into play. The intercept is a value needed to describe the line, but as a data point it is not very useful. We normally stay inside the range of our data. In this particular case, our smallest x value is 88 and our largest is 106, so we're going to stay between 88 and 106 for our x values. What's nice about having the equation of the line (again, a computer will get it for us) is that I can use it to make predictions. Let's say I wanted to predict what the intelligence, my y variable, would be if a new brain was measured to have size 93. That's within my range of 88 to 106.
It is not one of my data points, so I can't just look it up. So what prediction can I make if x = 93? Well, y = 1.3547(93) - 20.922. Work that out and you get 105.0651, so about 105 if we round to the nearest integer. If we locate that on our graph, we go to about 93 on the x axis and about 105 on the y axis, and that point sits on our line of best fit. So the line of best fit is used to make predictions when you're inside the range of your data, and it gives a systematic way to do so: any value of x can be plugged into the equation. One thing to be cautious of, and I'll put this in red because it's a bit of a warning: what if I asked for x = 120? The reason I'm giving a warning here is that this is beyond the maximum value of our data set. We are past the border, the range, of our data. When you get out this far, you're in dangerous territory. Think of it as making a prediction about the weather too far in advance. If you go out too far, the model tends to get worse; it's difficult enough even close by. For some economic model trying to predict an indicator five or ten years out, the further out you go, normally, the worse your predictions are. This has a name: it's called extrapolation. So always be wary of extrapolation. If someone tells you they know what the weather is going to be six months from now, you probably wouldn't believe them. But if someone says they can probably tell you what the weather will be tomorrow, you might believe that. So stay within your model; if you go too far outside your range, your predictions get very risky, and usually they're not supported by the data that you have. In this process of modeling something with linear data, any number can be plugged in and the computer will spit an answer back.
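The prediction at x = 93 and the extrapolation warning at x = 120 can both be sketched with the lecture's fitted equation. The coefficients and the data range [88, 106] are taken straight from the lecture; the warning message is just one way to flag extrapolation.

```python
M, B = 1.3547, -20.922   # slope and intercept from the lecture's least-squares fit
X_MIN, X_MAX = 88, 106   # range of the observed brain-size data

def predict(x):
    """Predict IQ from brain size, flagging values outside the data range."""
    if not (X_MIN <= x <= X_MAX):
        print(f"warning: x = {x} is outside [{X_MIN}, {X_MAX}]; "
              "this is extrapolation, so the prediction is risky")
    return M * x + B

inside = predict(93)    # interpolation: about 105
outside = predict(120)  # extrapolation: the computer answers anyway
```

Notice that the computer happily returns a number for x = 120 too; nothing in the formula itself tells you the second prediction is far less trustworthy than the first.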
That's the role of the computer: it does whatever you tell it to do. You, the human, have to decide how valuable that number is and what to report back, okay? That is a decision the computer cannot make. This is extremely important, because this is the human side of things. Calculating is best left to the computer; interpreting the data is best left to the human. So I want to give you some guidelines on how to interpret these predictions, these outputs from the least-squares line, and in general, when we do linear modeling. First, predictions are based on some model of the data set. Just because I'm showing you in this video how to do a linear model, don't run off and fit a linear model to everything. Remember, we said there are times when a linear model applies and times when it doesn't. Even if the relationship is linear, who knows if there are other variables we haven't graphed. So a prediction is just the output of this particular model. A prediction gets stronger if you try different models and they keep giving you the same answer; then you have stronger and stronger evidence that your prediction is actually a valid one. But it's always based on some model, that's the key. Every model is built on assumptions and has certain parameters and inputs, so be careful: whatever the prediction is, it's based on that particular model. Second, predictions work best when the model fits the data closely. Predictions are more trustworthy, you can believe in them, when the data points live close together, not very spread out. Yes, you can run through the process; yes, you can tell the computer to compute a line of best fit; and yes, you can throw any number into that equation. But you want that line to fit the data well. If the data is only loosely correlated, not strongly correlated, there is some correlation, but it may not be a good enough model to give good predictions.
It's not so easy to see these patterns when there are many variables, in particular. And especially if you don't have a strong correlation, the prediction may just be very inaccurate. So watch out for that if someone is handing you, or you are creating, a prediction based on these models. And I said this before, but I'll say it again: beware of extrapolation. First of all, realize that these are separate issues. I could have a strongly correlated linear model, a beautiful scatterplot, a beautiful line of best fit, and then I grab some value outside my range, either a little too low or a little too high, run it through the model, and out comes a number. This is the danger of making predictions too far out, okay? You see this all the time, especially with weather and economic predictions. People are trying to predict the future, and it's just really hard to do; it's hardly ever correct. You just don't have enough relevant information, and things change too much. So just be aware of extrapolation. One other thing I want to point out: beware of outliers. As we said before, they have a strong effect on correlation. What I have here are the same six data points for brain size and intelligence, but I added one more, a completely made-up number. I wanted to pick something a little smaller than my range; remember the smallest brain size before was 88. Now I'm putting in a smaller number, 80, and an intelligence value of 145, which is larger than my old maximum of 140. So I have this new point at (80, 145). Asking the computer, again, to compute the least-squares regression line, we get a new equation, not surprisingly, since we have a new data point. I'll write it out since it's a little small. The equation is y = -0.4014x + 150.62. So our slope all of a sudden went from positive to negative. Now it's -0.4014.
Our intercept went from negative to positive. So adding one point, and this point is clearly an outlier, with an x value smaller than our smallest x and a y value larger than our largest y, completely changed the slope of our line. Where the data before was saying the variables were positively correlated, now we're getting that they're negatively correlated. Adding one point, especially an outlier, can really change your predictions. If I now take this equation and start plugging in the same x values I used before, I will get different predictions. So, same as always, beware of the effect of outliers. The usefulness of this regression line for prediction depends completely on the strength of the association. The stronger the association, the better the predictions will be; the weaker the association, the worse the predictions will be, okay? So the usefulness of the regression line depends completely on the correlation between the variables. And it turns out the way to measure this is called the square of the correlation, denoted r squared. Sometimes it's written with a capital R; different software and technologies will use capital or lowercase letters, but they're the same thing. It's the proportion of the variation in the y variable that is predicted, or explained, by the x variable. It provides a measure of how well the observed outcomes are replicated by the model. The idea is that when there's a straight-line relationship, some of the variation in y, how high or how low it slides along its axis, is accounted for by the fact that x is changing. As x changes, it pulls y along with it. That's the idea.
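The sign flip from the outlier can be checked directly. This sketch refits the slope with and without the extra point; since the exact decimals depend on how the data was transcribed, it only checks the direction of the effect, which is the point of the example.

```python
def slope(xs, ys):
    """Least-squares slope via the usual closed-form formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)

brain = [100, 90, 95, 92, 88, 106]  # brain size (units of 10,000 pixels)
iq = [140, 90, 100, 135, 80, 106]   # IQ

m_before = slope(brain, iq)                # positive: upward-sloping line
m_after = slope(brain + [80], iq + [145])  # one outlier added: slope flips negative
```

One made-up point out of seven is enough to reverse the apparent direction of the relationship, which is exactly why outliers deserve so much caution.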
So when you report on a regression, it's pretty common to give r squared, the square of the correlation (this is our new term here), also called the coefficient of determination; again, lowercase r or capital R, it goes by both names. It's given as a measure of how successful the regression was in explaining the response. Is this the right model? I know I can make predictions, but is it doing a good job of explaining the response variable? And remember, when you see a correlation r, we can square it to get a better feel for the strength of the association. When the correlation is perfect, when all the data points lie on a straight line, r is 1 or -1. Again, this hardly ever happens, but if it's a perfect line, what does r squared become? Well, negative one squared is one, and one squared is just one. So perfect correlation says the model is perfect: everything is explained by the linear relationship. If instead you have r around 0.7 or so, squaring it gives about 0.49, which is about 50%. What that's saying is that about half the variation in y is accounted for by the straight-line relationship. So you can imagine that half of the reason y is changing is the x variable, but it also tells you that half of the reason y is changing is not the x variable. There's something else going on there. Going back to our brain size versus intelligence data set: if I take the original data without that extra outlier, I had the equation y = 1.3547x - 20.922. Again, we can let the computer work this out: r squared here comes out to 0.377, a fairly weak positive association. So what does that say? How do we interpret it?
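The relationship between r and r squared is easy to check numerically. This sketch uses a made-up data set lying exactly on a line to show the perfect-correlation case, plus the r = 0.7 arithmetic from above.

```python
def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of the spreads."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up points lying exactly on y = -2x + 10: r = -1, so r squared = 1.
r_perfect = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

# A moderate correlation of 0.7 explains only about half the variation in y.
r_sq = 0.7 ** 2  # 0.49, roughly 50%
```

Note that squaring throws away the sign: r = -1 and r = 1 both give r squared = 1, which is why r squared measures strength but not direction.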
It says that about 37.7%, call it 38%, of the variation in y (remember, y is intelligence) is explained by brain size. That's the way to interpret this number. Now obviously, my data set is so small that you shouldn't take this number to mean anything, but this is how you would interpret it if you had a real, large data set and you were looking at the value of r squared. It says the contribution of the x variable to the prediction of the y variable is, and it's usually given as a percent rather than a decimal, about 37.7%. That's how you interpret r squared. We'll end on this note, because you can't repeat it enough: correlation does not imply causation. It's so important, let me say it again: correlation does not imply causation. A few points to highlight here. First, a strong relationship, where r is close to 1 or -1, does not always mean that changes in one variable cause changes in the other. I've seen people come back with a graph where r is like 0.999 and say, aha, x causes y. That's not true, okay? It just shows that they are correlated. Watch the language that you use to present this information. Second, the relationship between two variables is often influenced by other variables not accounted for. Remember, this is all based on whatever model you have, whatever variables you've included, whatever assumptions you've made. There are usually other variables out there, sometimes called lurking variables, and other models you could have fit, other variables you could have tested. Your conclusion is just based on your particular model, some model. Last but not least, if you do want to get at causation, no statistical analysis alone is ever going to get that for you.
You want experimental evidence with control groups, and that gets into a whole other [LAUGH] branch of science on how to do it, but just remember: you're looking at correlation, not causation. And be wary of anyone who claims, based on some scatterplot, that x causes y. Just to show you a kind of comical example, if you google spurious correlations, you'll find tylervigen.com, and there are some hilarious graphs there that go through this. You can imagine that with all the data sets in the world, you can find two that just kind of overlap. Here's one on the screen: per capita cheese consumption correlates with the number of people who die by becoming tangled in their bed sheets, which sounds pretty bad. We have a graph here, and the two curves look almost the same. This is real data, sourced from the US Department of Agriculture and the Centers for Disease Control and Prevention. If you graph this and run it through the computer (remember, the computer will do whatever you tell it to do), it creates a nice pretty picture, and you can show that the correlation, your r value, is about 0.947. That's really, really close to one, so the graphs are going to track each other closely. Does that mean that if we eat more cheese, more people will die by being tangled in their bed sheets? I hope not; I think a reasonably intelligent person would say they're probably not related. You just happened to find two random pieces of data. And again, there's more data than the years on the screen here, which run only from 2000 to 2009; you're purposely zooming in on a window where there is some overlap. So just be careful. This is where the computer will never tell you, hey, this is a really bad graph, I would not show this to anybody.
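You can see how easy it is to manufacture a "correlation" like this: any two series that both happen to drift in the same direction over a short window will show a high r, whether or not they have anything to do with each other. The yearly numbers below are made up purely for illustration (they are not the actual USDA or CDC figures).

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Two unrelated, made-up yearly series over a 10-year window (2000-2009),
# both drifting upward with small wiggles.
cheese = [29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8]
deaths = [327, 456, 509, 497, 596, 573, 661, 741, 809, 717]

r = pearson_r(cheese, deaths)  # very high, despite no causal link whatsoever
```

The shared upward trend does all the work here: the high r reflects two series moving in the same direction over a cherry-picked window, not any connection between cheese and bed sheets.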
I wouldn't present this, and I wouldn't make predictions based on it. But you, the human, have to interpret these numbers and understand what they mean in context. That is honestly one of the more challenging pieces of modeling: not just spitting numbers back and becoming the mouthpiece for the computer, but really interpreting them and making sure they make sense to you. If you want to see some more funny correlations, just to really drive home the idea that correlation does not imply causation, google spurious correlations and check out some more of these pretty funny graphs. All right, great job on this video, I'll see you next time.