0:00

What about the questions that we can answer once we've run a regression? Well, perhaps the most used aspect of a regression model is as a methodology for predictive analytics. Businesses have really embraced predictive analytics in the last few years, always trying to predict outcomes: predicting, for example, a product that an individual might buy on a website, the rating that somebody gives to a movie they watch on a streaming service, or the price of a stock tomorrow. Prediction is a very common task that we face in business, and we call our approaches to prediction predictive analytics in general. And if you have a regression, you certainly have a tool for prediction, because once you've got that regression line there, the prediction is pretty straightforward: take a value of x, go up to the line, and read off the value in the y direction.

So an example question would be: based on our regression model for the diamonds data set, what price do you expect to pay for a diamond that weighs 0.3 of a carat? The answer would be to take 0.3 on the x-axis, go up to the line, and read off the value. Or, equivalently, you can plug 0.3 into the regression equation to work out that expected value.
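Plugging a weight into the fitted equation is one line of arithmetic. In this sketch, the slope of 3,720 is the value quoted later in this lecture for the diamonds regression, but the intercept of -260 is a hypothetical number chosen purely for illustration.

```python
# Predicted price from a fitted simple regression: y_hat = b0 + b1 * x.
# The slope 3720 comes from the lecture's diamonds example; the intercept
# -260 is a hypothetical value used only for illustration.
def predict_price(carats, intercept=-260.0, slope=3720.0):
    return intercept + slope * carats

print(predict_price(0.3))  # "go up to the line" at x = 0.3
```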

Now, one of the other things that regression will do for you: it won't just give you a prediction. With suitable assumptions, which we will have a look at in a while in this module, we're able to get a prediction interval as well. That prediction interval gives us a range of feasible values for where we think the outcome, or forecast, is going to lie. And in practice that tends to be much more realistic than just trying to give a single best guess.

Â 1:51

Another thing that we do with these regression models is interpret the coefficients coming out of the model. The coefficients themselves can tell us things; they can give us information. And so I might ask a question: how much on average do you expect to pay for diamonds that weigh 0.3 of a carat versus diamonds that weigh 0.2 of a carat? Well, that's a change in x of 0.1. And given a linear regression with a slope that happens to equal 3,720, what we can say is that if we look at diamonds weighing 0.3 of a carat versus 0.2 of a carat, we can anticipate paying an additional $372 for them, given the underlying regression equation. So we're essentially interpreting the slope in the regression.
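The $372 figure is just the slope times the change in x:

```python
# Difference in expected price for a change in weight:
# delta_y = slope * delta_x. The slope 3720 is from the lecture.
slope = 3720.0
delta_x = 0.3 - 0.2           # a change of 0.1 carat
extra_cost = slope * delta_x
print(round(extra_cost))      # about $372
```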

Likewise, intercepts sometimes have interpretations. An intercept might be interpreted as a fixed cost, or it might be interpreted as a start-up time. So we often want to interpret coefficients.

Â 4:20

And then we come across a diamond, and that diamond weighs 0.25 of a carat and is being sold for $500. I've added that point to the graph here; it's the big red dot. Now, if I see a point like that, which is a long, long way beneath the regression line, then it's potentially of great interest to me. Because if I believe my model, and this is a huge caveat here, then there's something going on with this particular diamond. One of the possibilities is that it's been mispriced by the market, and if it's been mispriced by the market, then it's potentially a great investment opportunity. There is another explanation, though: maybe there's some flaw associated with this diamond, and that's why it's going for such a low price. I don't know which of those two explanations holds until I've gone to have a look at the diamond. The point that I'm making here is that this activity of looking to see how far the points are away from the regression line is a technique for ranking, and some people use the word triaging, potential candidates, to come up with a set of candidates that look the most interesting to me. And so that's one of the uses that you can put a regression model to.

Â 5:43

In summary, points a long way from the line can be of great interest.

I've shown you some regression lines, but I haven't yet told you how they're calculated. So where does this regression line, sometimes called the line of best fit, come from? Well, there's a methodology, and that methodology is called the method of least squares. It is the most frequently used way to calculate these best-fitting lines. It's not the only way of calculating a line to go through the data, but it's a very commonly used one, and if you pick up a typical spreadsheet program, it's the one that's going to be implemented when you run your regressions there. So the optimality criterion, because we are going to fit the best line, is known as the method of least squares. In words, what the least squares line is doing is finding, among the infinite number of lines that you could potentially draw through the data, the line that minimizes the sum of the squares of the vertical distances from the points to the line. I've illustrated that idea by zooming in on the diamonds data: I've taken a small range, drawn a line there, and drawn the points around it. The red lines are picking up the vertical distance from each point to the line, and what we want to do is find the line that minimizes the sum of the squares of those vertical distances. We're going to call such a line the least squares line, or the line of best fit. So basically what you're trying to do is find the line that most closely follows the data; that's another way of thinking about it. But there is a formal criterion, that criterion is implemented in software, and you will use that software to actually calculate a least squares line, a regression, for any particular data set that you might have.
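For a straight line, the least squares criterion has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. A minimal from-scratch sketch, using a tiny made-up data set:

```python
# Fit y = b0 + b1 * x by least squares: the closed-form solution to
# minimizing the sum of squared vertical distances from points to line.
def least_squares(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # line passes through (x_bar, y_bar)
    return intercept, slope

# Tiny made-up data: y is exactly 2x + 1, so the fit recovers that line.
b0, b1 = least_squares([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(b0, b1)  # 1.0 2.0
```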

Â 8:23

For any given value of x, the fitted value is found by going up to the blue line. The residual is then the vertical distance from the blue line to the point, so you can ultimately get to any one of those points in two steps. Take the x value beneath it: first, take a step up to the line, and then, once you're on the line, add on the little red line, the residual, and you'll get to the data point. So the data point can be expressed in two components: one, the line; and two, the residual about that line. That decomposition of the data into two parts mirrors a basic idea that we bring to fitting these regression models. The idea is that the data we see is made up of two parts, which we often call the signal and the noise. The regression line is our model for the signal, and the residuals are encoding the noise in the problem.
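The decomposition is exact: each observation equals its fitted value plus its residual. A small sketch, assuming a line has already been fitted (the data and coefficients below are hypothetical):

```python
# Each data point splits exactly into fitted value (signal) + residual (noise).
xs = [1.0, 2.0, 3.0]
ys = [2.5, 3.9, 6.1]
intercept, slope = 0.5, 1.8   # hypothetical fitted line, for illustration

fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

for y, f, r in zip(ys, fitted, residuals):
    print(y, "=", f, "+", r)  # y_i = fitted_i + residual_i, exactly
```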

Both of the components that come out of the regression, the fitted values and the residuals, are useful. The fitted values become our forecasts: if you bring me a new diamond of a given weight, let's say 0.25 of a carat, and ask what I think its price is going to be, I simply go up to the regression line, the so-called fitted values, and read off the value of y, the price. The residuals are useful as well, because they allow me to assess the quality of fit of the regression model. Ideally, all our residuals would be zero; that would mean the line went through all the points. In practice, that is simply not going to happen, but we will often examine the residuals from a regression, because by examining the residuals we can potentially gain insight into that regression. Typically, when I run a regression, one of the very first things I'm going to do is take all of the residuals out of the regression, sort that list of residuals, and look at the most extreme ones. The points with the biggest residuals are, by definition, those points that are not well fit by the current regression.
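Sorting the points by the size of their residuals takes only a few lines. The data and fitted line below are hypothetical, loosely echoing the diamonds example:

```python
# Rank points by |residual|: the biggest ones are worst fit by the line.
# Data points (carats, price) and the fitted line are hypothetical.
points = [(0.20, 500.0), (0.25, 500.0), (0.30, 860.0), (0.35, 1040.0)]
intercept, slope = -260.0, 3720.0

def residual(point):
    x, y = point
    return y - (intercept + slope * x)   # observed minus fitted

worst_first = sorted(points, key=lambda p: abs(residual(p)), reverse=True)
print(worst_first[0])  # the most extreme point -- the one to drill down on
```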

Â 10:36

If I'm able to look at those points and explain why they're not well fit, then I have typically learned something that I can incorporate in a subsequent iteration of the regression model. Now, if that all sounded a little bit abstract, I've got an example to show you right now.

So here's another data set that lends itself to a regression analysis. In this data set I've got two variables. The outcome variable, or y variable, is the fuel economy of a car; to be more precise, it's the fuel economy as measured by gallons per thousand miles in the city. So let's say you live in the city and you only drive in the city: how many gallons are you going to have to put in the tank to be able to drive your car 1,000 miles over some course of time? That's the outcome variable. Clearly, the more gallons you have to put in the tank, the less fuel efficient the vehicle is. That's the idea. Now, we might want to create a predictive model for fuel economy as a function of the weight of the car, and so here I've got weight as my x variable, and I'm going to look for the relationship between the weight of a car and its fuel economy. We collect a set of data; that's what you can see in the scatter plot, the bottom left-hand graph on this slide. Each point is a car, and for each car we've found its weight and its fuel economy and plotted the variables against one another. And we have run a regression through those points using the method of least squares. That regression gives us a way of predicting the fuel economy of a vehicle of any given weight.

Now, why might you want to do that? Well, one of the things that many vehicle manufacturers are thinking about these days is creating more fuel-efficient vehicles, and one approach is to change the materials that vehicles are manufactured from: for example, moving from steel to aluminum. That will reduce the weight of the vehicle, and if the vehicle's weight is reduced, I wonder how that will impact the fuel economy? That's the sort of question we'd be able to start addressing through such a model. So that's the setup for this problem, but I want to show you why looking at the residuals can be such a useful thing.

So when I look at the residuals from this particular regression, I have found the biggest residual in the whole data set, and that's the point I have identified in red on the scatter plot. It is a big positive residual, which means that in reality this particular vehicle needs a lot more gas going in the tank than the regression model would predict. The regression model predicts the value on the line; the red data point is the actual observed value. It's above the line, so it's less fuel efficient than the model predicts: it needs more gas in the tank than the model says. So is there anything special about that vehicle? Well, at that point I go back to the underlying data set and drill down. When I see big residuals, I'm going to drill down on them, and drilling down on this residual actually identifies the vehicle. It turns out to be a Mazda RX-7, and this particular vehicle is somewhat unusual, because it had what's termed a rotary engine, a different sort of engine from every other vehicle in this data set. Every other vehicle had a standard engine, but the Mazda RX-7 had a rotary engine, and that actually explains why its fuel economy is bad in the city. And so by drilling down on the point, by looking at the residuals, I've identified a feature that I hadn't originally incorporated into the model, and that would be the type of engine. So the residual, and the exploration of the residual, has generated a new question for me that I didn't have prior to the analysis. And that question is: I wonder how the type of engine impacts the fuel economy as well?

So that's one of the outcomes of a regression that can be very, very useful. It's not the regression model directly talking to you; it's the deviations from the underlying model that can sometimes be the most insightful part of the model itself, or of the modeling process. Remember, in one of the other modules I talked about the benefits of modeling, and one of them is serendipitous outcomes: things that you find that you hadn't expected to at the beginning. I would put this up there as an example of that. By exploring the residuals carefully, I've learned something new, something that I hadn't anticipated, and I might subsequently be able to improve my model by incorporating this idea of engine type into the model itself. So the residuals are an important part of a regression model.
