What about the questions that we can answer once we've run a regression? Well, perhaps the most used aspect of a regression model is as a methodology for predictive analytics. Businesses have really embraced predictive analytics in the last few years, always trying to predict outcomes: predicting, for example, a product that an individual might buy on a website, the rating that somebody gives to a movie that they watch on a streaming service, or the price of a stock tomorrow. Prediction is a very common task that we face in business, and we call our approaches to prediction, in general, predictive analytics. And if you have a regression, you certainly have a tool for prediction, because once you've got that regression line there, the prediction is pretty straightforward: take a value of X, go up to the line, and read off the value in the Y direction. So an example question would be, based on our regression model for the diamonds data set, what do you expect to pay for a diamond that weighs 0.3 of a carat? The answer would be to take 0.3 on the X-axis, go up to the line, and read off the value. Or, equivalently, you can plug 0.3 into the regression equation to work out that expected value.

One of the other things that regression will do for you, beyond just giving you a prediction: with suitable assumptions, which we will have a look at in a little while in this module, we're able to get a prediction interval as well. That prediction interval gives us a range of feasible values for where we think the outcome or forecast is going to lie, and that in practice tends to be much more realistic than just trying to give a single best guess.

Another thing that we do with these regression models is interpret the coefficients coming out of the model. The coefficients themselves can tell us things; they can give us information. And so I might ask the question, how much on average do you expect to pay for diamonds that weigh 0.3 of a carat versus diamonds that weigh 0.2 of a carat? Well, that's a change in X of 0.1, and given a linear regression with a slope that happens to equal 3,720, what we can say is that, looking at diamonds weighing 0.3 of a carat versus 0.2 of a carat, we can anticipate paying an additional $372 for them, given the underlying regression equation. So we're essentially interpreting the slope in the regression. Likewise, intercepts sometimes have interpretations: an intercept might be interpreted as a fixed cost, or it might be interpreted as a start-up time. So we often want to interpret coefficients.

Another thing that regression is able to do for us is to provide a numerical measure of the amount of variability in the outcome, here price, that is explained by the predictor variable. So how much variability in the outcome can we explain with the model? Typically, we like to be able to explain a lot of variability, but quantifying that can be a useful activity in its own right. We will see, in due course, a numerical measure that tells us the proportion of variability explained by the regression model. Prediction, interpretation, and explaining variability calculations: those are key things that a regression model can do for us.
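To make that concrete, here's a minimal sketch in Python of prediction from a regression equation. The slope of 3,720 dollars per carat is the value quoted above; the intercept is a purely illustrative assumption, not a number from the course.

```python
# Minimal sketch: prediction from a fitted regression equation.
# The slope of 3,720 dollars per carat comes from the lecture;
# the intercept below is a purely illustrative assumption.

INTERCEPT = -260.0   # hypothetical intercept, dollars (assumed for illustration)
SLOPE = 3720.0       # dollars per carat, slope quoted in the lecture

def predicted_price(weight_in_carats: float) -> float:
    """Plug a weight into the regression equation to get the fitted value."""
    return INTERCEPT + SLOPE * weight_in_carats

# "Take 0.3 on the X-axis, go up to the line, read off the value":
print(predicted_price(0.3))                          # about 856

# Interpreting the slope: a 0.1 carat change moves the expected
# price by SLOPE * 0.1, i.e. about $372.
print(predicted_price(0.3) - predicted_price(0.2))   # about 372
```

With suitable assumptions, regression software (for example, the statsmodels package in Python) will also report a prediction interval around each of these point forecasts.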
So let's beam in a little bit more on this prediction idea and see one of the ways that you can immediately put a regression to work. And this harks back to a discussion that we had in another module, when we were talking about prospecting for other opportunities. Now, in this particular example I'm looking at diamonds, imagining that I'm a diamond merchant or a diamond speculator, but the same ideas could easily work for finding new customers or new investment opportunities. Let's say we've collected some data and we've fit our linear regression model, finding the best-fitting line through the data, and then we come across a diamond. That diamond weighs 0.25 of a carat and it's being sold for $500. I've added that point to the graph here; it's the big red dot. Now, if I see a point like that, a long, long way beneath the regression line, then it's potentially of great interest to me. Because if I believe my model, and that's a huge caveat here, given that I believe my model, then there's something going on with this particular diamond. One of the possibilities is that it's being mispriced by the market, and if it's being mispriced by the market, then it's potentially a great investment opportunity. There is another explanation, though: maybe there's some flaw associated with this diamond, and that's why it's going for such a low price. I don't know which of those two is the explanation until I've gone to have a look at the diamond. The point that I'm making here is that this activity of looking at how far the points are away from the regression line is a technique for ranking potential candidates, what some people would call prospecting, to come up with the set of candidates that look the most interesting to me. And so that's one of the uses that you can put a regression model to. In summary, points a long way from the line can be of great interest.

I've shown you some regression lines, but I haven't yet told you how they're calculated. So where does this regression line, sometimes called the line of best fit, come from? There's a methodology called the method of least squares, and it's the one most frequently used to calculate these best-fitting lines. It's not the only way of putting a line through the data, but it's a very commonly used one, and if you pick up a typical spreadsheet program, it's the one that's going to be implemented when you run your regressions there. So the optimality criterion, because we're going to fit the best line, is known as the method of least squares. In words, the least squares line is the line, amongst the infinite number of lines that you could potentially draw through the data, that minimizes the sum of the squares of the vertical distances from the points to the line. I've illustrated that idea by beaming in on the diamonds data: I've taken a small range, drawn a line there and the points around it, and the red lines pick up the vertical distance from each point to the line. We want to find the line that minimizes the sum of the squares of those vertical distances, and we're going to call such a line the least squares line, or the line of best fit. So basically, what you're trying to do is find the line that most closely follows the data; that's another way of thinking about it.
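As a rough sketch of that criterion, the following Python snippet uses made-up (weight, price) pairs, not the course's diamonds data. It computes the sum of squared vertical distances for a candidate line and uses NumPy's least squares fit to find the line that minimizes it.

```python
# Sketch of the least squares criterion, using made-up (weight, price)
# pairs rather than the course's actual diamonds data.
import numpy as np

weights = np.array([0.15, 0.20, 0.22, 0.25, 0.30, 0.35])   # carats (made up)
prices  = np.array([325., 480., 560., 655., 850., 1080.])  # dollars (made up)

def sum_of_squared_residuals(intercept: float, slope: float) -> float:
    """The quantity least squares minimizes: the sum of squared
    vertical distances from each point to the candidate line."""
    fitted = intercept + slope * weights
    return float(np.sum((prices - fitted) ** 2))

# np.polyfit with degree 1 returns the (slope, intercept) pair that
# minimizes that sum -- the least squares line, or line of best fit.
slope, intercept = np.polyfit(weights, prices, deg=1)
print(slope, intercept)

# Any other line does worse on the least squares criterion:
print(sum_of_squared_residuals(intercept, slope))        # the minimum
print(sum_of_squared_residuals(intercept + 50, slope))   # strictly larger
```

Perturbing the fitted intercept or slope can only increase the sum of squared residuals; that's exactly what makes it the least squares line.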
But there is a formal criterion, that criterion is implemented in software, and you'll use that software to actually calculate a least squares line, a regression, for any particular data set that you might have. So the least squares criterion is a line-fitting criterion.

We've now seen how these lines of best fit are derived through the least squares criterion. What these lines allow us to do is decompose the data into two parts; that's one of the key insights with a regression. A regression line can be used to decompose the data, in our case, when we're looking at diamonds, the prices, into two components. One component is called the fitted values, those are the predictions, and the other component is known as the residuals. In terms of the picture on the previous slide, for any given value of X, the fitted value goes up to the blue line, and then the residual is the vertical distance from the blue line to the point. So you can see that you can, ultimately, get to any one of those points in two steps: take your X value, first step up to the line, and then, once you're on the line, add on the little red line, the residual, and you'll get to the data point. That says that the data point can be expressed in two components: one, the line, and two, the residual about that line.

That decomposition of the data into two parts mirrors a basic idea that we bring to fitting these regression models: the idea that the data we see is made up of two parts, which we often call the signal and the noise. The regression line is our model for the signal, and the residuals encode the noise in the problem. Both of the components that come out of the regression, the fitted values and the residuals, are useful. The fitted values become our forecasts: if you bring me a new diamond of a given weight, let's say 0.25 of a carat, what do I think its price is going to be? I simply go up to the regression line, the so-called fitted values, and read off the value of Y, the price. The residuals are useful as well, because they allow me to assess the quality of fit of the regression model. Ideally, all our residuals would be zero; that would mean the line went through all the points. In practice, that is simply not going to happen, but we will often examine the residuals from a regression, because by examining the residuals we can potentially gain insight into that regression. Typically, when I run regressions, one of the very first things I do is take all the residuals out of the regression, sort that list of residuals, and look at the most extreme ones. The points with the biggest residuals are, by definition, those points that are not well fit by the current regression. If I'm able to look at those points and explain why they're not well fit, then I have typically learned something that I can incorporate in a subsequent iteration of the regression model.
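Continuing the made-up example from the previous sketch, this snippet shows the decomposition directly: the observed values split exactly into fitted values plus residuals, and sorting by absolute residual puts the worst-fit points, the ones worth drilling into, at the top.

```python
# Sketch: the least squares line decomposes the data into fitted values
# ("signal") and residuals ("noise"). Same made-up data as before.
import numpy as np

weights = np.array([0.15, 0.20, 0.22, 0.25, 0.30, 0.35])   # carats (made up)
prices  = np.array([325., 480., 560., 655., 850., 1080.])  # dollars (made up)

slope, intercept = np.polyfit(weights, prices, deg=1)

fitted = intercept + slope * weights   # step one: go up to the line
residuals = prices - fitted            # step two: the red line to the point

# The decomposition is exact: observed = fitted + residual.
assert np.allclose(prices, fitted + residuals)

# Sort by absolute residual, biggest first: the top of this list is,
# by definition, the point the current model fits worst.
for i in np.argsort(-np.abs(residuals)):
    print(f"weight={weights[i]:.2f}  price={prices[i]:.0f}  "
          f"residual={residuals[i]:+.1f}")
```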
Now, that all sounded a little bit abstract, so I've got an example to show you right now. Here's another data set that lends itself to a regression analysis, and in this data set I've got two variables. The outcome variable, or the Y variable, is the fuel economy of a car, and to be more precise, it's the fuel economy as measured by gallons per 1,000 miles in the city. So let's say you live in the city and you only drive in the city: how many gallons are you going to have to put in the tank to be able to drive your car a thousand miles over some course of time? That's the outcome variable. Clearly, the more gallons you have to put in the tank, the less fuel efficient the vehicle is; that's the idea. Now, we might want to create a predictive model for fuel economy as a function of the weight of the car, so here I've got weight as the X variable, and I'm going to look for the relationship between the weight of a car and its fuel economy. We collect a set of data; that's what you can see in the scatter plot, the bottom left-hand graph on this slide. Each point is a car, and for each car we've found its weight and its fuel economy and plotted the variables against one another. And we have run a regression through those points, through the method of least squares, and that regression gives us a way of predicting the fuel economy of a vehicle of any given weight.

Now, why might you want to do that? Well, one of the things that many vehicle manufacturers are thinking about these days is creating more fuel-efficient vehicles, and one approach to doing that is to change the materials that vehicles are manufactured from, for example, moving from steel to aluminum. That will reduce the weight of the vehicle, and if the vehicle's weight is reduced, I wonder how that will impact the fuel economy? That's the sort of question we'll be able to start addressing through such a model.

So that's the setup for this problem, but I want to show you why looking at the residuals can be such a useful thing. When I look at the residuals from this particular regression, I find the biggest residual in the whole data set; that's the point that I have identified in red in the scatter plot. It's a big positive residual, which means that in reality this particular vehicle needs a lot more gas going into the tank than the regression model would predict. The regression model predicts the value on the line; the red data point is the actual observed value. It's above the line, so it's less fuel efficient than the model predicts; it needs more gas to go in the tank than the model predicts. So is there anything special about that vehicle? Well, at that point, I go back to the underlying data set and I drill down; when I see big residuals, I'm going to drill down on them. Drilling down on this residual actually identifies the vehicle, and the vehicle turns out to be something called a Mazda RX-7. This particular vehicle is somewhat unusual because it had what's termed a rotary engine, a different sort of engine from every other vehicle in this data set. Every other vehicle had a standard engine, but the Mazda RX-7 had a rotary engine, and that actually explains why its fuel economy is bad in the city. And so, by drilling down on the point, by looking at the residuals, I've identified a feature that I hadn't originally incorporated into the model: the type of engine. The residual, and the exploration of the residual, has generated a new question for me that I didn't have prior to the analysis, and that question is, I wonder how the type of engine impacts the fuel economy as well? So that's one of the outcomes of regression that can be very, very useful.
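Here's a sketch of that drill-down step. The little table below is a toy: its column names and numbers are invented for illustration, and only the idea of the Mazda RX-7 being flagged by its residual comes from the lecture.

```python
# Sketch of the drill-down step: attach residuals to the original
# records so the extreme ones can be inspected. All column names and
# values here are invented for illustration.
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "model":  ["Sedan A", "Wagon B", "Mazda RX-7", "Coupe C", "Hatch D"],
    "weight": [2800, 3400, 2900, 2600, 2300],               # pounds (made up)
    "gal_per_1000_miles": [48.0, 58.0, 64.0, 45.0, 40.0],   # made up
    "engine": ["standard", "standard", "rotary", "standard", "standard"],
})

# Fit fuel economy on weight, then compute each car's residual.
slope, intercept = np.polyfit(cars["weight"], cars["gal_per_1000_miles"], deg=1)
cars["residual"] = cars["gal_per_1000_miles"] - (intercept + slope * cars["weight"])

# The biggest positive residual needs more gas than the model predicts;
# drilling down on that row reveals the unusual engine type.
print(cars.sort_values("residual", ascending=False).head(1))
```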
It's not the regression model directly talking to you; it's the deviations from the underlying model that can sometimes be the most insightful part of the model itself, or of the modeling process. I remember in one of the other modules I talked about the benefits of modeling, and one of them is serendipitous outcomes, things that you find that you hadn't expected at the beginning. I would put this out there as an example of that: by exploring the residuals carefully, I've learned something new, something that I hadn't anticipated, and I might be able to subsequently improve my model by incorporating this idea of engine type into the model itself. So the residuals are an important part of a regression model.