0:00

In this video, we'll talk about the normal equation, which for some linear regression problems gives us a much better way to solve for the optimal value of the parameters Theta. Concretely, so far the algorithm that we've been using for linear regression is gradient descent, where in order to minimize the cost function J of Theta, we take an iterative algorithm that needs many steps, multiple iterations of gradient descent, to converge to the global minimum. In contrast, the normal equation gives us a method to solve for Theta analytically, so that rather than running this iterative algorithm, we can instead just solve for the optimal value of Theta all in one go: in basically one step, you get to the optimal value right there.

0:49

It turns out the normal equation has some advantages and some disadvantages, but before we get to that and talk about when you should use it, let's get some intuition about what this method does. As a motivating example, let's take a very simplified cost function J of Theta that's just a function of a real number Theta. So for now, imagine that Theta is just a scalar value, just a real value. It's just a number, rather than a vector. Imagine that we have a cost function J that's a quadratic function of this real-valued parameter Theta, so J of Theta looks like that. Well, how do you minimize a quadratic function? For those of you who know a little bit of calculus, you may know that the way to minimize a function is to take its derivative and set the derivative equal to zero. So you take the derivative of J with respect to the parameter Theta, you get some formula which I am not going to derive, you set that derivative equal to zero, and this allows you to solve for the value of Theta that minimizes J of Theta. That was the simpler case, when Theta was just a real number.
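To make that concrete, here is a worked example with a hypothetical quadratic (the specific numbers are mine; the lecture doesn't give any):

```latex
J(\theta) = (\theta - 5)^2 + 1,
\qquad
\frac{dJ}{d\theta} = 2(\theta - 5) = 0
\quad\Longrightarrow\quad
\theta = 5.
```

Setting the derivative to zero picks out the bottom of the parabola in one step; the normal equation is the vector-valued version of this same move.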

2:04

In the problem that we are interested in, Theta is no longer just a real number but is instead this (n+1)-dimensional parameter vector, and the cost function J is a function of this vector value, Theta 0 through Theta n. The cost function looks like this: some squared cost function on the right.

How do we minimize this cost function J? Calculus tells us that one way to do so is to take the partial derivative of J with respect to every parameter Theta j in turn, and then set all of these to zero. If you do that and solve for the values of Theta 0, Theta 1, up to Theta n, this would give you the values of Theta that minimize the cost function J. If you actually work through the calculus and work through the solution for the parameters Theta 0 through Theta n, the derivation ends up being somewhat involved.

What I am going to do in this video is actually not go through the derivation, which is kind of long and kind of involved. What I want to do is just tell you what you need to know in order to implement this process, so you can solve for the values of Theta that correspond to where the partial derivatives are equal to zero, or equivalently, the values of Theta that minimize the cost function J of Theta. I realize that some of the comments I made may make sense only to those of you who are familiar with calculus. But if you're less familiar with calculus, don't worry about it; I'm just going to tell you what you need to know in order to implement this algorithm and get it to work. For the example that I want to use as a running example, let's say that I have m = 4 training examples.

3:50

In order to implement this normal equation method, what I'm going to do is the following. I'm going to take my data set, so here are my four training examples. In this case, let's assume that these four examples are all the data I have. What I am going to do is take my data set and add an extra column that corresponds to my extra feature, x0, which always takes on the value of 1. Then I'm going to construct a matrix called X that basically contains all of the features from my training data: here are all my features, and we're going to take all those numbers and put them into this matrix X. So just copy the data over, one column at a time. Then I am going to do something similar for the y's: I am going to take the values that I'm trying to predict and construct a vector, like so, and call that the vector y.

4:59

So X is going to be an m by (n+1)-dimensional matrix, and y is going to be an m-dimensional vector, where m is the number of training examples and n is the number of features; the matrix has n+1 columns because of this extra feature x0 that I added. Finally, if you take your matrix X and your vector y, and you just compute theta equal to X transpose X inverse times X transpose y, this would give you the value of Theta that minimizes your cost function. There was a lot that happened on that slide, and I worked through it using one specific example of one dataset. Let me just write this out in a slightly more general form, and later on in this video let me explain this equation a little bit more.
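As a sketch of this whole recipe in code, here is a minimal Python/NumPy version; the numbers are illustrative stand-ins, not the lecture's actual four training examples:

```python
import numpy as np

# Illustrative data only: one feature (say, house size) and the target price.
sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])   # m = 4 examples
prices = np.array([460.0, 232.0, 315.0, 178.0])

# Add the extra column x0 = 1, giving the m x (n+1) matrix X (here 4 x 2),
# and stack the targets into the m-dimensional vector y.
X = np.column_stack([np.ones_like(sizes), sizes])
y = prices

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(theta.shape)  # (2,): one parameter per column of X
```

The resulting theta holds the intercept (the parameter multiplying x0) and the slope on the size feature, computed in one shot with no iterations.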

6:16

The way I'm going to construct the matrix X, which is also called the design matrix, is as follows. Each training example gives me a feature vector like this, say an (n+1)-dimensional vector. The way I am going to construct my design matrix X is to build the matrix like this: I take the first training example, which is a vector, take its transpose so it ends up being this long, flat thing, and make x1 transpose the first row of my design matrix. Then I take my second training example, x2, take the transpose of that, and put that as the second row of X, and so on, down to my last training example; take the transpose of that, and that's the last row of my matrix X. And so that makes my matrix X an m by (n+1)-dimensional matrix.

As a concrete example, let's say I have only one feature, really only one feature other than x0, which is always equal to 1. So if my feature vectors x(i) are equal to this 1, which is x0, and then some real feature, like maybe the size of the house, then my design matrix X would be equal to this. For the first row, I'm basically going to take this and take its transpose, so I end up with 1 and then x1(1). For the second row, we end up with 1 and then x1(2), and so on, down to 1 and then x1(m). And thus, this will be an m by 2-dimensional matrix.
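A quick sketch of this row-stacking construction in Python (hypothetical feature values; each vector already carries x0 = 1 in front):

```python
import numpy as np

# Each training example's feature vector: x0 = 1 followed by one real
# feature (e.g. house size).  The values are made up for illustration.
x1 = np.array([1.0, 2104.0])
x2 = np.array([1.0, 1416.0])
x3 = np.array([1.0, 1534.0])
x4 = np.array([1.0, 852.0])

# Each transposed example becomes one row of the design matrix X.
X = np.vstack([x1, x2, x3, x4])
print(X.shape)  # (4, 2): m by (n+1), with m = 4 examples and n = 1 feature
```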

So, that's how to construct the matrix X. And the vector y, sometimes I might write an arrow on top to denote that it is a vector, but very often I'll just write this as y, either way. The vector y is obtained by taking all the labels, all the correct prices of houses in my training set, and just stacking them up into an m-dimensional vector, and that's y. Finally, having constructed the matrix X and the vector y, we then just compute Theta as (X transpose X) inverse times X transpose y. I just want to make sure that this equation makes sense to you and that you know how to implement it. So, concretely, what is (X transpose X) inverse? Well, it is the inverse of the matrix X transpose X. Concretely, if you were to set A to be equal to X transpose times X, so X transpose is a matrix, and X transpose times X gives you another matrix, and we call that matrix A, then (X transpose X) inverse is just the inverse of that matrix A.

9:26

And so that's how you compute this thing: you compute X transpose X, and then you compute its inverse. We haven't yet talked about Octave; we'll do so in a later set of videos, but in the Octave programming language, or similarly in the MATLAB programming language, which is very similar, the command to compute this quantity, X transpose X inverse times X transpose y, is as follows. In Octave, X prime is the notation that you use to denote X transpose. And so this expression that's boxed in red is computing X transpose times X. pinv is a function for computing the inverse of a matrix, so this computes (X transpose X) inverse, and then you multiply that by X transpose, and you multiply that by y. So you end up computing that formula, which I didn't prove, but it is possible to show mathematically, even though I'm not going to do so here, that this formula gives you the optimal value of Theta, in the sense that if you set Theta equal to this, that's the value of Theta that minimizes the cost function J of Theta for linear regression.
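The Octave command described here is `pinv(X' * X) * X' * y`. A rough NumPy translation (my own sketch, not code from the lecture) looks like this:

```python
import numpy as np

def normal_equation(X, y):
    """theta = pinv(X'X) X'y, mirroring the Octave one-liner.

    Using the pseudo-inverse (pinv) instead of the plain inverse means a
    result is still returned even when X'X happens to be non-invertible.
    """
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Tiny made-up dataset: x0 = 1 plus one feature, with targets y = 1 + x.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(normal_equation(X, y))  # close to [1. 1.]
```

Because the data here fit the line y = 1 + x exactly, the recovered parameters are the intercept 1 and the slope 1.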

One last detail. In an earlier video, I talked about feature scaling and the idea of getting features to be on similar ranges of values as each other. If you are using this normal equation method, then feature scaling isn't actually necessary. It is actually okay if, say, some feature x1 is between zero and one, some feature x2 ranges from zero to one thousand, and some feature x3 ranges from zero to ten to the minus five; if you are using the normal equation method, this is okay, and there is no need to do feature scaling. Although of course, if you are using gradient descent, then feature scaling is still important.

Finally, when should you use gradient descent, and when should you use the normal equation method? Here are some of their advantages and disadvantages. Let's say you have m training examples and n features. One disadvantage of gradient descent is that you need to choose the learning rate alpha. And often, this means running it a few times with different learning rates alpha and then seeing what works best, and so that is sort of extra work and extra hassle. Another disadvantage of gradient descent is that it needs many more iterations. So, depending on the details, that could make it slower, although there's more to the story, as we'll see in a second. As for the normal equation, you don't need to choose any learning rate alpha. So that makes it really convenient, makes it simple to implement; you just run it and it usually just works. And you don't need to iterate, so you don't need to plot J of Theta or check for convergence or take all those extra steps. So far, the balance seems to favor the normal equation.

12:27

Here are some disadvantages of the normal equation, and some advantages of gradient descent. Gradient descent works pretty well even when you have a very large number of features. So, even if you have millions of features, you can run gradient descent and it will be reasonably efficient; it will do something reasonable. In contrast, with the normal equation, in order to solve for the parameters Theta, we need to compute this term, (X transpose X) inverse. This matrix X transpose X is an n by n matrix, if you have n features.

13:00

Because, if you look at the dimension of X transpose and the dimension of X, and figure out what the dimension of the product is, the matrix X transpose X is an n by n matrix, where n is the number of features. And for most implementations, the cost of inverting a matrix grows roughly as the cube of the dimension of the matrix. So, computing this inverse costs roughly order n cubed time. Sometimes it's slightly faster than n cubed, but it's close enough for our purposes. So if n, the number of features, is very large,

13:37

then computing this quantity can be slow, and the normal equation method can actually be much slower. So if n is large, then I might usually use gradient descent, because we don't want to pay this order n cubed time. But if n is relatively small, then the normal equation might give you a better way to solve for the parameters.

What does small and large mean? Well, if n is on the order of a hundred, then inverting a hundred-by-hundred matrix is no problem by modern computing standards. If n is a thousand, I would still use the normal equation method; inverting a thousand-by-thousand matrix is actually really fast on a modern computer. If n is ten thousand, then I might start to wonder. Inverting a ten-thousand-by-ten-thousand matrix starts to get kind of slow, and I might then start to lean in the direction of gradient descent, but maybe not quite: with n equal to ten thousand, you can still sort of invert a ten-thousand-by-ten-thousand matrix. But if it gets much bigger than that, then I would probably use gradient descent. So, if n equals ten to the sixth, with a million features, then inverting a million-by-million matrix is going to be very expensive, and I would definitely favor gradient descent if you have that many features. So exactly how large the number of features has to be before you switch to gradient descent is hard to pin down with a strict number. But for me, it is usually around ten thousand that I might start to consider switching over to gradient descent, or maybe some other algorithms that we'll talk about later in this class.

To summarize: so long as the number of features is not too large, the normal equation gives us a great alternative method to solve for the parameter Theta. Concretely, so long as the number of features is less than a thousand, I would usually use the normal equation method rather than gradient descent. To preview some ideas that we'll talk about later in this course: as we get to more complex learning algorithms, for example when we talk about classification algorithms, like the logistic regression algorithm,

15:32

we'll see that for those algorithms, the normal equation method actually does not work; for those more sophisticated learning algorithms, we will have to resort to gradient descent. So, gradient descent is a very useful algorithm to know, both for linear regression with a large number of features and for some of the other algorithms that we'll see in this course, because for them, the normal equation method just doesn't apply and doesn't work. But for this specific model of linear regression, the normal equation can give you an alternative