0:02

Okay, let's go through yet a third derivation of least squares, but this time we're going to demonstrate why least squares is thought of as a sort of adjustment mechanism. So I'm going to write the criterion out a little bit differently, as the squared norm of y - x1 beta1 - ... - xp betap, where y and each xj are vectors. So when I estimate the beta1 coefficient, in what sense is it adjusted for the presence of all the other variables in the model?

Before we begin, let me define my residual function for two vectors as e(a, b) = a - b <a, b> / <b, b>, where <., .> is the inner product. The term b <a, b> / <b, b> is merely the prediction of a from b based on a regression through the origin, and e(a, b) is the residual left over after subtracting that prediction.
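As a concrete sketch, this residual function is one line of NumPy (the function name `e` and the example vectors here are just for illustration):

```python
import numpy as np

def e(a, b):
    """Residual of a after regression through the origin on b:
    e(a, b) = a - b * <a, b> / <b, b>."""
    return a - b * (a @ b) / (b @ b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 0.0, 1.0])
r = e(a, b)
# The residual is orthogonal to the regressor b.
print(np.isclose(r @ b, 0.0))  # prints True
```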

Â 1:11

Now hold beta2 up to betap fixed, as if they were known.

So then the vector y - x2 beta2 - ... - xp betap can be considered a single outcome, and x1 beta1 can be thought of as the predictor. So I've simply rewritten the starred equation as a regression through the origin with a single predictor.

So we know that the starred criterion has to be larger than or equal to its value when we plug in the optimal beta1, where that beta1 depends on beta2 up to betap. The beta1 that satisfies this criterion is

beta1 = <y - x2 beta2 - ... - xp betap, x1> / <x1, x1>,

the inner product of the working outcome with x1, over the inner product of x1 with itself.
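As a hedged numerical check, with invented data and p = 3, we can confirm that this formula for beta1 minimizes the criterion over beta1 when the other coefficients are held fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
y = rng.normal(size=n)
x1, x2, x3 = rng.normal(size=(3, n))
b2, b3 = 0.5, -1.2  # hold beta2 and beta3 fixed at arbitrary values

# Optimal beta1 given the others: <y - x2 b2 - x3 b3, x1> / <x1, x1>
target = y - x2 * b2 - x3 * b3
b1 = (target @ x1) / (x1 @ x1)

# The squared-error criterion as a function of beta1 alone.
loss = lambda b: np.sum((target - x1 * b) ** 2)

# Perturbing beta1 in either direction can only increase the criterion.
print(loss(b1) <= loss(b1 + 0.1), loss(b1) <= loss(b1 - 0.1))  # prints True True
```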

Â 2:39

Well, before I plug that in, notice that by linearity of the inner product this is equal to <y, x1> / <x1, x1>, minus <x2, x1> / <x1, x1> times beta2, minus all the way up to <xp, x1> / <x1, x1> times betap.

Â 3:20

If you plug that optimal beta1 in there, then I'd like you to churn through the calculations. What you get is the squared norm of e(y, x1) - e(x2, x1) beta2 - ... - e(xp, x1) betap. This has to have gotten smaller, because we've plugged in the optimal beta1 holding the other coefficients fixed.
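Here's that calculation churned through numerically, a sketch with invented data and p = 3, using the residual function e from before:

```python
import numpy as np

def e(a, b):
    # residual of a from regression through the origin on b
    return a - b * (a @ b) / (b @ b)

rng = np.random.default_rng(1)
n = 30
y = rng.normal(size=n)
x1, x2, x3 = rng.normal(size=(3, n))
b2, b3 = 0.7, -0.3  # arbitrary fixed values for beta2 and beta3

# Left side: plug the optimal beta1 (given beta2, beta3) into the criterion.
b1 = ((y - x2 * b2 - x3 * b3) @ x1) / (x1 @ x1)
lhs = np.sum((y - x1 * b1 - x2 * b2 - x3 * b3) ** 2)

# Right side: the same criterion with x1 regressed out of y and the other predictors.
rhs = np.sum((e(y, x1) - e(x2, x1) * b2 - e(x3, x1) * b3) ** 2)

print(np.isclose(lhs, rhs))  # prints True
```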

Â 4:02

So now we're back at the exact same form of equation with one fewer coefficient. Instead of having p coefficients, we now have p - 1, since we've gotten rid of beta1. And instead of y, we have the residual from regressing x1 out of y; instead of x2, the residual from regressing x1 out of x2; and so on. So in essence, we've made the criterion smaller by taking the linear association with x1 out of the outcome and out of every other predictor. And we could repeat this process, because notice, this is just the same exact starting equation with one fewer.

Â 4:45

One fewer regressor, that is. Our vectors now look a little bit weirder, because they're the output of this residual function, but ostensibly we're back at the same problem. So we know we can make it smaller again by repeating the exact same process. So this is going to be greater than or equal to the following.

If I take the residual of the residual, where now I've held beta3 up to betap fixed and I'm getting rid of the beta2 term, then I would get the squared norm of e(e(y, x1), e(x2, x1)) - e(e(x3, x1), e(x2, x1)) beta3 - ... - e(e(xp, x1), e(x2, x1)) betap, now regressing e(x2, x1) out of every term.

So now we've gotten rid of this second regressor and its coefficient by regressing it out of every remaining term. And you can see that as you iterate through this process, regressing out one variable at a time until you get to betap, the betap estimate ends up being a regression through the origin with just that variable

Â 6:43

and the outcome, where we've iteratively regressed the linear association of every other regressor out of both. So we took x1 and regressed it out of everything. Then we took the residual of x2 and regressed it out of all the other residuals. Then we took the residual of x3, having already regressed out the first two, and so on. In that sense, when we get to the betap estimate, it's just the remaining regression through the origin.

So you can see this would actually be an easy way to do linear regression, where all you needed to know how to do was evaluate this residual function. No matrix inversion required. So that's kind of a neat result.
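Here is a sketch of that procedure in NumPy, on made-up data: sequentially residualize the outcome and the remaining regressors (essentially Gram-Schmidt), recover the last coefficient with a single regression through the origin, and compare it against an off-the-shelf least squares fit:

```python
import numpy as np

def e(a, b):
    # residual of a from regression through the origin on b
    return a - b * (a @ b) / (b @ b)

rng = np.random.default_rng(2)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Iteratively regress each column out of y and out of the remaining columns.
cols = [X[:, j].copy() for j in range(p)]
yr = y.copy()
for j in range(p - 1):
    base = cols[j]
    yr = e(yr, base)
    for k in range(j + 1, p):
        cols[k] = e(cols[k], base)

# The last coefficient is a regression through the origin on the final residuals.
beta_p = (yr @ cols[-1]) / (cols[-1] @ cols[-1])

# It agrees with the last coefficient from ordinary least squares.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.isclose(beta_p, beta_ols[-1]))  # prints True
```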

But a couple of other things come to mind. The first is that we did this in an arbitrary order: we just happened to work toward the last coefficient. We could have worked toward the first coefficient, or gone in any order, so you can see that it doesn't matter which order we take the residuals in. All that matters is that we're iteratively taking residuals.

And I find this process really helps me understand in what sense linear regression is adjusting for the other variables: it's taking the linear association with all the other variables out of everything else. And I should note that

Â 8:12

thinking this way is not restricted to separating out the regressors one vector at a time. Suppose, for example, I had y - X1 beta1 - X2 beta2, where now X1 is n by p1 and X2 is n by p2. So I've broken my design matrix into two blocks, X1 and X2, with corresponding coefficient vectors beta1 and beta2.

Â 8:51

Okay, so now the term y - X1 beta1 is a single outcome, and I know what my solution for beta2 would have to be. My beta2 hat, as it depends on beta1, would simply have to be (X2' X2)^{-1} X2' times that outcome. Because X1 and beta1 are fixed, that works out to be

beta2 hat = (X2' X2)^{-1} X2' y - (X2' X2)^{-1} X2' X1 beta1.

Â 9:52

The first term is the coefficient I would get if I only regressed y on X2. And the second term involves the collection of coefficients I would get if I regressed every single column of X1, treated as an outcome, on X2 as the predictor. When I plug this estimate for beta2 back into the criterion, what do I get?

I get y minus the hat matrix for X2 applied to y, that is, y - X2 (X2' X2)^{-1} X2' y, which I'm going to write in a more convenient form as (I - X2 (X2' X2)^{-1} X2') y, and then minus (I - X2 (X2' X2)^{-1} X2') X1 beta1. Okay. So what is the first piece? It's the residual of y having regressed out X2, and this is of course smaller than our original criterion.

Â 11:13

And the second piece is the residual from having regressed X2 out of every column of X1. Okay, so one way to think about our estimate for beta1 is: first get rid of all the effect associated with X2 out of both y and X1, then perform the regression with just those two sets of residuals. And again, to me this really helps explain in what sense regression is doing adjustment. But notice this is the same argument as above; we're just doing it with matrices now.
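The matrix version can be sketched the same way (data and block sizes invented for illustration): build the residual-maker for X2, apply it to y and to the columns of X1, and regress the residuals on the residuals; the result matches the X1 block of the full least squares fit:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p1, p2 = 60, 2, 3
X1 = rng.normal(size=(n, p1))
X2 = rng.normal(size=(n, p2))
y = rng.normal(size=n)

# Residual-maker for X2: M2 = I - X2 (X2' X2)^{-1} X2'
H2 = X2 @ np.linalg.solve(X2.T @ X2, X2.T)
M2 = np.eye(n) - H2

# Regress the residual of y on the residuals of X1's columns.
y_res = M2 @ y
X1_res = M2 @ X1
beta1 = np.linalg.lstsq(X1_res, y_res, rcond=None)[0]

# Matches the X1 block of coefficients from the full regression.
beta_full = np.linalg.lstsq(np.hstack([X1, X2]), y, rcond=None)[0]
print(np.allclose(beta1, beta_full[:p1]))  # prints True
```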

Okay, so even though it's a little confusing, and you would never actually program a computer to fit least squares this way, I find this a very useful way to think about linear regression and what it's accomplishing.

Â 11:59

And conversely, there's nothing special about holding beta1 fixed first; we could have held beta2 fixed first. So what we see is that every coefficient in least squares is obtained this way: by regressing all the other regressors out of both y and the predictors associated with that coefficient.
