0:10

Hello. This lesson is going to introduce ordinary linear regression.

Ordinary linear regression computes a best-fit linear model between two dimensions under certain assumptions. This linear model can be used, with the caveats I've mentioned, to perform predictive analytics, as well as to visually understand relationships between different dimensions of a data set.

OLS (ordinary least squares) regression is a very simple idea and it can be applied in a lot of different situations, but sometimes it's misapplied, and we have to be careful about that. Future modules in this course will introduce ideas that might be more powerful and be able to provide a better model.

This lesson includes two things: first, using a visual website that explores ordinary linear regression, and second, looking at our notebook introduction to ordinary linear regression.

So what is this visual website? The idea here is that you can play with the points and see what happens. So for instance, here are some points and here's a best-fit line, and you might say, "What happens if this point was down here?" And you can see, as you drag it, the line changes. Not only that, but the fit parameters over here change as well.

So we can move the points up so that they are all very nicely co-aligned, and you can see that the fit parameters get pretty tight, and the uncertainties on those parameters tighten as well. We can also make changes here; we can of course play with the website and see how this affects things.

This actually demonstrates how the idea works. Basically, we draw little squares whose sides are the deviation of each point from the line (the same length in both delta Y and delta X, so they form squares), and we want to minimize the total area of these squares. That's what we mean when we say regression: we're regressing the points to this line, and it's ordinary least squares, the OLS I mentioned earlier, because the squares are what we're trying to minimize.

You can do this by playing with both of these controls. You can change the intercept and see some of the squares get really small (the squares are listed over here), and we can also change the slope of our line. By adjusting these together, we may get a really good fit or we may get a really bad fit. So hopefully that shows you what's going on with simple linear regression.

Now, our notebook talks about this in a little more detail. We're going to analyze a data set. First, we'll use a data set that's included in Seaborn called Anscombe, so we just grab it. There are four different subsets in this particular dataset. Here's the first part of one of them, and you can see there's X and Y.

So we're going to make a correlation measurement to see what our correlation is, and it's not too bad: a Pearson's r of 0.816.
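As a sketch of that measurement, the same correlation can be computed with NumPy alone; the values below are the well-known first Anscombe data set, hard-coded so the example needs no download:

```python
import numpy as np

# Anscombe's quartet, data set I (the same values Seaborn's
# "anscombe" example data set ships with)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

# Pearson correlation coefficient, read off the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # 0.816
```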

We can then ask: well, what does the data look like? Here's our data, and it looks like there's some sort of nice relationship present, so we can actually fit a linear model using Seaborn. We use the regression plot, regplot, with fit_reg set to true, and it displays our line, and there you go. That does look pretty good.
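A minimal sketch of that call, again with the Anscombe I values hard-coded and an off-screen Matplotlib backend so it runs anywhere:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import numpy as np
import seaborn as sns

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

# fit_reg=True (the default) draws the best-fit line over the scatter
ax = sns.regplot(x=x, y=y, fit_reg=True)
```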

Now, what about the residuals about this line? You can plot these with Seaborn's residplot, and there you see that they're not too bad; they're scattered around the line pretty nicely. We can put it all together and plot the best-fit line with our Pearson correlation coefficient by using the joint plot.

That's great. The problem is that Anscombe includes four different data sets that all have the exact same regression coefficients, the same variance, all of the same summary statistics, but when you look at them visually, you can clearly see differences. Here's the original one we looked at. But if we simply looked at the numerical or analytic results, these four would be indistinguishable. It's not until we actually view them visually that we see: here is a line with a clear outlier; here's a bunch of points that all sit at the same X value with one point way off by itself; and these points clearly don't have a linear relationship.
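You can check that sameness directly. This sketch hard-codes all four well-known quartet data sets and shows that the fitted slope, intercept, and Pearson r come out essentially identical for every one:

```python
import numpy as np

# Anscombe's quartet (hard-coded so this sketch needs no download)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                   7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                   6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
             5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (xs, ys) in quartet.items():
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    slope, intercept = np.polyfit(xs, ys, 1)
    r = np.corrcoef(xs, ys)[0, 1]
    # every data set: slope ~0.50, intercept ~3.00, r ~0.82
    print(f"{name}: slope={slope:.2f} intercept={intercept:.2f} r={r:.2f}")
```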

I'm showing this to demonstrate that you must visualize your data, view your data, to ensure that a linear model is even appropriate. If we simply computed a linear regression, we'd get the correlation coefficient out and think, hey, life is good, we've got a pretty good correlation, we can go forward. But when you visualize it, you can see that these are not good fits to the data set.

So what else can we do? We can actually perform the regression and calculate the parameters. Here we're fitting a line and getting a Pearson correlation coefficient by using NumPy. We can also do the same thing by using Seaborn: here's our Seaborn plot, and then we can fit a line to that. It's the exact same thing.
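A sketch of the NumPy version: np.polyfit with degree 1 returns the slope and intercept, and np.poly1d turns them into a prediction function:

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

# Degree-1 polynomial fit: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
predict = np.poly1d([slope, intercept])  # y_hat = slope * x + intercept

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(f"y = {slope:.2f} x + {intercept:.2f}, r = {r:.3f}")
```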

Notice what we're trying to do here: here are different points, and these epsilons are what we're trying to minimize in our equation. If I scroll back up, you'll see them here. This is what we're trying to do: we have a series of observations x, and we're trying to fit a linear model to them, where we model y with a beta, or slope, parameter and an intercept, plus epsilon terms which account for the difference between the model and each data point. We want to minimize those epsilons by performing our regression. So you can see, that's what we want to do: the best-fit line will minimize the sum of the squares of those differences.
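Minimizing the summed squared epsilons in the model y_i = alpha + beta * x_i + eps_i has a closed-form solution, which this sketch computes by hand: beta is the covariance of x and y over the variance of x, and alpha follows from the means:

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

# Closed-form OLS: beta = cov(x, y) / var(x), alpha = ybar - beta * xbar
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()

# Residuals (the epsilons) at the optimum; they sum to zero
eps = y - (alpha + beta * x)
print(round(beta, 2), round(alpha, 2))
```

This is exactly the line np.polyfit or Seaborn would return; the formulas are just the explicit solution of the least-squares minimization.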

We can look at this visually, as we saw in the previous website. The idea is that we want to minimize the squared differences, and there are different ways we could do it, which this notebook walks you through.

One other thing I wanted to talk about here is that we can use linear regression with a categorical variable to see differences. We've seen this before with the tips data set. We noticed that lunch has a stronger slope, but there's a lot more scatter at dinner time for the total bill and the tip. We can also look at these and get the correlation coefficients; you see that, just as we saw visually, there's a stronger correlation at lunch and a much weaker one at dinner.
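A sketch of that per-category comparison: this uses sns.load_dataset, which fetches Seaborn's example tips data over the network on first use, and the standard column names time, total_bill, and tip from that data set:

```python
import seaborn as sns

tips = sns.load_dataset("tips")  # fetched and cached by Seaborn

# Pearson's r between total bill and tip, separately for Lunch and Dinner
for time, group in tips.groupby("time", observed=True):
    r = group["total_bill"].corr(group["tip"])
    print(f"{time}: r = {r:.3f}")
```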

There's also the ability to look at nonlinear data sets. Here we're generating a compound interest data set and then visually exploring it, so we can perform regression on it. When you do this with linear regression, you see that the slope doesn't look too bad, that fit doesn't look too bad; but if we increase the number of years even further, you'll note that this data just continues to spiral upwards as the exponential compounding continues. So instead, what we can do is change our variable to its logarithm, and then you notice that we get a perfect straight-line fit.

This again demonstrates that it's important to look at your data and to understand your data, and that linear regression doesn't require the variables to be linear; it's the parameters that must be linear. So we can take the logarithm of our data and compute the regression on that, and get a really good fit when the data grows in this exponential manner.
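A sketch of that trick, using a hypothetical compound-interest series ($100 principal at 5% per year; both values are made up for illustration). Since log(balance) = log(100) + years * log(1.05), the transformed data is exactly linear and the fit recovers the growth rate:

```python
import numpy as np

# Hypothetical compound-interest series: $100 at 5% per year
years = np.arange(0, 31, dtype=float)
balance = 100.0 * 1.05 ** years

# A straight line on the raw balance misses the exponential growth,
# but log(balance) is exactly linear in years, so fit that instead
slope, intercept = np.polyfit(years, np.log(balance), 1)

print(round(np.exp(slope), 4))      # recovered growth factor: 1.05
print(round(np.exp(intercept), 2))  # recovered principal: 100.0
```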

So I hope this has given you a bit of an introduction to ordinary linear regression. It's a powerful technique that's often useful for getting a feel for what's happening in your data. You need to be careful to ensure that your data reflect the properties that are required for ordinary linear regression. As we go forward in this course sequence, we'll see other techniques that are more powerful, and we'll see ways to minimize the other issues that we might be concerned about, such as bias and variance when fitting models, or overfitting. If you have any questions, let us know in the course forums, and good luck.
