0:31

This happens to be the weight of an individual here on the left-hand side, and

the number of days on the right-hand side.

So what we'd like to do is to fit a straight line through that data.

For example, the red line shown here.

And this line, generally speaking, the equation of a straight line,

a linear regression line, is ŷ = a + bx,

where the little hat over the top of the y here is sometimes called a circumflex.

And this is the terminology they use in the reference handbook.

So, the equation of the line here is ŷ = a + bx.

1:08

And, to find that line,

the equation of that line, we usually do this by the method of least squares.

And here's the extract from the reference

handbook that explains this. In that equation,

the y-axis intercept is a = ȳ − b x̄,

where ȳ and x̄ are the mean or average values of y and x respectively.

Â 2:13

Sxx is the sum of the squares of the x values, as defined here.

And then later on, we'll also need Syy,

which is the sum of the squares of the y values, as shown here.
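Collecting the quantities just described in one place, in standard notation (a summary of the formulas the lecture refers to; check the handbook for its exact symbols):

```latex
\hat{y} = a + b x, \qquad b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x},

S_{xy} = \sum_i x_i y_i - \frac{1}{n}\Bigl(\sum_i x_i\Bigr)\Bigl(\sum_i y_i\Bigr),

S_{xx} = \sum_i x_i^2 - \frac{1}{n}\Bigl(\sum_i x_i\Bigr)^{2}, \qquad
S_{yy} = \sum_i y_i^2 - \frac{1}{n}\Bigl(\sum_i y_i\Bigr)^{2}.
```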

So, when we do this, what this method is actually doing is this:

if I look at any individual data point and compute its distance,

say delta, from the line here,

what we're doing is computing delta squared. The deltas can be plus or

minus, but squaring them makes them all positive.

And then we want to find the equation of the line which minimizes

the sum of these squares.

In other words, it minimizes the sum of the delta squared values.

So that's why this is called the method of least squares.

And these equations give us the line which accomplishes that.
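As a quick sketch of how those formulas produce the least-squares line, here is a small computation with made-up illustrative data (not the data from the lecture):

```python
# Least-squares fit of y ≈ a + b*x using the handbook formulas.
# The data points here are made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

n = len(xs)
sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n

b = sxy / sxx                       # slope, b = Sxy / Sxx
a = sum(ys) / n - b * sum(xs) / n   # intercept, a = y_bar - b * x_bar

print(a, b)
```

This is the line that minimizes the sum of the squared deltas described above.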

3:14

A related question is: how good a fit is this line to the data?

So, in the handbook, they give some measures of this.

For example, the standard error of estimate, a confidence interval for

the intercept, a confidence interval for the slope, etc.

But what's usually more useful is the correlation coefficient

between the variables, R,

which is given by Sxy divided by the square root of Sxx times Syy.

Or the square of that, which is commonly

called the R squared value.

And in here, they denote this as the coefficient of determination, but

normally we just call it the R squared value.
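As a small sketch of that formula, using the Sxy, Sxx, and Syy values that come out of the worked example later in this lecture:

```python
import math

# R = Sxy / sqrt(Sxx * Syy); the S-values below are the ones
# computed in this lecture's four-point worked example.
sxy, sxx, syy = 34.0, 18.75, 72.0

r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
r_squared = r ** 2              # coefficient of determination

print(round(r, 4), round(r_squared, 4))
```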

So, going back to that previous example, the red line here is the linear regression

line, and this I obtained just by fitting a line in an Excel spreadsheet.

And this automatically gives me the equation of the line and

also the R squared value.

So, if I take the square root of that, I can see that

the correlation coefficient between those two variables is approximately 0.89.

In other words, the straight line in this case is quite a good fit to that data.

4:35

But what about other possibilities?

This data also has the same line through it.

But here, obviously, the data is much more scattered and the fit is not as good.

So, the equation of the line here is the same, but now you see that the R squared

value is much smaller, or the correlation coefficient is much smaller.

It's 0.47, which is a much poorer fit.

5:02

Here the scatter of the data is much less, and

this data also has exactly the same line, the same linear regression line.

And this I obtained just by adding random values to the data.

And the equation is the same, and

the R squared value in this case is 0.7823,

so the correlation coefficient,

the square root of that, is 0.88.

Quite a good fit.
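The lecturer's point here, that adding random values to the data lowers the R squared value, can be sketched as follows; the line, the noise levels, and the seed are all made up for illustration:

```python
import random

def r_squared(xs, ys):
    """R^2 = Sxy^2 / (Sxx * Syy) for a least-squares line through (xs, ys)."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sxy ** 2 / (sxx * syy)

random.seed(1)
xs = [float(x) for x in range(50)]
exact = [3.0 + 2.0 * x for x in xs]                # points exactly on y = 3 + 2x
mild = [y + random.gauss(0, 2) for y in exact]     # a little scatter
heavy = [y + random.gauss(0, 30) for y in exact]   # a lot of scatter

print(r_squared(xs, exact), r_squared(xs, mild), r_squared(xs, heavy))
```

More scatter gives a smaller R squared; with no scatter at all, R squared is exactly 1, as in the lecturer's final example.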

And in the final example, the data are so closely correlated that you can't even

see the difference between them.

So, if I fit a line through the data, again I get the same equation, but

now the agreement is perfect, so

R squared is equal to 1 and the correlation coefficient is equal to 1.

6:09

So, if the correlation coefficient is zero,

that means that the two variables are completely uncorrelated.

On the other hand, if the two variables are perfectly correlated,

like in the last example here, then the correlation coefficient is equal to 1.

Let me do a numerical example on that.

I have this data series consisting of four points, x and y, as shown here.

And we can only do a numerical example with a few points, because

the computation rapidly becomes too cumbersome.

But here we're going to answer two questions about this data.

First of all, what is the correlation coefficient between those two variables?

Which of those alternatives is it?

And secondly, what is the equation of the linear regression line through the data?

Which of those four alternatives is it?

7:04

So, first of all, I'll plot out the data and take a look at it.

And this is a graph of the data, x versus y.

So, x versus y here.

And just looking at it,

it seems like, yes, those two variables are reasonably correlated.

And if I compute the line through there, just plugging those data into Excel,

this is the linear regression line I get, and

this is the equation of the line and the R squared value.

So, we've already answered both of those questions.

It's given there.

But let's go through and calculate that and see how those values arise.

7:42

So, to do that, we have to compute a number of quantities,

for example, Sxy, which involves the sum of the xy products, and others.

So, I've added some extra columns and rows to the table here.

The first added column is the product xi yi.

The second is xi squared.

So, for example, 2 squared is 4, and xy here is 2 times 8, which is 16.

And the last column is yi squared.

For example, 8 squared is 64.

And the additional row here is the summation of each of those columns.

So, this number is the summation of x, this is the summation of y, this is the summation of xy, etc.

And finally, the additional row on the bottom here contains the average values.

So, x bar is 4.75, and y bar is 12.00.

So, now that we have all these quantities,

we can compute these numbers here.

So, Sxy is equal to the summation of xi yi,

which is equal to 262, minus 1 over n,

where there are four samples, so that's a quarter,

multiplied by the summation of xi, which is 19,

multiplied by the summation of yi, which is 48,

which gives me 34.00 for Sxy.

Next, I compute the sum of the x squared terms, Sxx.

So, that is equal to the summation of xi squared,

which is equal to 109, minus 1 over n

multiplied by the summation of xi, all squared,

in other words, 19 squared over 4.

And the answer is 18.75.

Next, we'll compute the sum of the y squared terms, Syy,

which is given by this expression here.

So, that is equal to the summation of yi squared, which is 648,

minus 1/4 multiplied by the summation of yi, all squared,

in other words, 48 squared, and that is equal to 72.
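These sums can be checked numerically. The individual (x, y) pairs aren't read out in the lecture, but the four points (2, 8), (4, 8), (5, 14), (8, 18) reproduce every quoted total, so I'll treat them as a reconstruction of the table:

```python
# Reconstructed data points consistent with the sums quoted in the lecture:
# sum x = 19, sum y = 48, sum xy = 262, sum x^2 = 109, sum y^2 = 648.
xs = [2.0, 4.0, 5.0, 8.0]
ys = [8.0, 8.0, 14.0, 18.0]
n = len(xs)

sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
syy = sum(y * y for y in ys) - sum(ys) ** 2 / n

print(sxy, sxx, syy)  # → 34.0 18.75 72.0, matching the hand calculation
```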

10:52

Next, we want to find the equation of the linear regression line.

And here is the general expression for

the linear regression line in the notation of the reference handbook:

y is equal to a plus bx, where the slope of the line, b, is equal to Sxy divided by Sxx.

So that is equal to 34 divided by 18.75,

so the slope is 1.81.

The intercept a is given by y bar minus b x bar, where y bar and

x bar are computed as normal, and

we already have those values over here.

They are 12.00 and 4.75.

So, substituting in, we find that a is equal to 12 minus the slope,

which is 1.81, times 4.75, which tells us that the intercept is 3.39.

So, substituting those values of a and b back into the equation for y,

we see that y is equal to 3.39 + 1.81x,

so the correct answer is A.

And again, if we see the equation here, which Excel gave us,

we see that indeed it agrees with that equation.
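As a final check on this answer, the slope and intercept follow in two lines from the S-values and means already computed:

```python
# Slope and intercept from the worked example's values.
sxy, sxx = 34.0, 18.75
x_bar, y_bar = 4.75, 12.0

b = sxy / sxx          # slope = 34 / 18.75
a = y_bar - b * x_bar  # intercept = 12 - b * 4.75

print(round(a, 2), round(b, 2))  # → 3.39 1.81, i.e. y = 3.39 + 1.81x, answer A
```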

12:20

Now, I'll just make a couple of final comments about probability and statistics.

There are a great many other topics which are covered in the reference handbook.

For example, one thing that I haven't covered here is hypothesis testing by

means of one-way analysis of variance, or ANOVA, and

the corresponding tables here for one-way analysis and two-way analysis.

But I'm not going to cover those, because I don't think that

they're very likely to occur in the exam.

I also haven't covered the fundamental

definitions of sets, and I would also mention that there is

a table given in the handbook here of probability density functions.

For example, the first two of these would probably be useful.

The binomial coefficient and the binomial distribution occur in

13:16

topics that we've looked at previously, so those are useful.

However, all the rest of these, for example, the hypergeometric,

Poisson, and geometric distributions, are again

given in the reference handbook,

but I think they are quite unlikely to actually occur

in the actual exam, so I won't cover them here.
