As I mentioned in the previous video, correlation only measures the strength of association between two variables. In practice, however, we often need to evaluate the association between one variable and a set of several variables, for example between the house price and factors such as RM and LSTAT. Moreover, in many applications we need more than the strength of association. In high-frequency stock trading, for example, we may need to estimate the price change over the next five seconds using the price changes and volume history of the last 10 minutes. That is, we need to build an equation between one variable we try to estimate, called the response, and the other variables we use to estimate it, called the predictors. Such an equation is called a model, and the most widely used one is the linear regression model. We use this model to estimate the response variable; the estimated value is also called a prediction. In this video, we discuss the simplest version, the simple linear regression model, which has only one predictor.

From our previous exploration, we have the scatter matrix showing the mutual patterns between variables. We can see that RM, the number of rooms, has a very strong linear pattern with MEDV, the median value of house prices, so this pair of variables is a good example for introducing the simple linear regression model.

Just as in confidence interval estimation we assume the population is normal, in linear regression we also make assumptions about the response and predictors in the population. This is our starting point for building the model. We assume that the response values in the population are all normal with equal variance sigma squared, but that the mean of the response is determined by the predictor in linear form: mu_i = beta_0 + beta_1 * x_i. Just as in confidence interval estimation, we use samples to estimate a population parameter, mu.
Similarly, we use samples to estimate the population parameters beta_0, beta_1, and sigma. In conclusion, when we apply a linear regression model, we assume such a real pattern exists in the population. More specifically: Linearity, the mean of y is linearly determined by the predictors. Independence, the responses at different values of x are independent. Normality, the random noise, and hence y, follows a normal distribution. Equal variance, the variances of y are all equal even when the values of the predictor differ. These assumptions need to be validated if you want to make inferences using a linear regression model, but in most cases you do not need to be strict about them if you only use the model to make predictions.

In real applications, we know neither beta_0, called the intercept, nor beta_1, called the slope coefficient, nor sigma. So we cannot identify the exact position of the line given by the mean equation. Instead, we have samples consisting of pairs of y and x, as shown in this scatter plot. They do not lie on a straight line, since the median price of a house is determined not only by the number of rooms but also by location and other conditions of the house; collectively, we put all these other factors into the noise term. That does not mean they are truly noise, only that we do not model their pattern with the house price in the simple linear regression model. The mean equation is a straight line, and the sample pairs scatter around this line because of the noise term.

Using the sample data, our target is to find a straight line, y-hat = b_0 + b_1 * x, that approximates the mean equation as closely as possible, where b_0 is the estimated value of beta_0 and b_1 is the estimated value of beta_1. This equation is called the prediction equation. The idea is straightforward, but there is a problem: we cannot tell which line is closest to the mean equation, because we do not know the location of the real mean equation. Hence, we need a different criterion.
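The population assumptions above can be sketched with a small simulation. This is only an illustration with made-up parameter values (beta_0 = 1, beta_1 = 2, sigma = 0.5 are assumptions, not values from the housing data): for each x_i, the response is normal with mean beta_0 + beta_1 * x_i and the same variance everywhere.

```python
import numpy as np

# Hypothetical population parameters, chosen only for illustration
beta_0, beta_1, sigma = 1.0, 2.0, 0.5

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 5000)

# LINE assumptions: the mean of y is linear in x (Linearity), the noise terms
# are drawn independently (Independence) from a normal distribution (Normality)
# with one common standard deviation sigma (Equal variance).
mu = beta_0 + beta_1 * x
y = mu + rng.normal(0.0, sigma, size=x.size)

noise = y - mu
print(noise.mean())  # close to 0
print(noise.std())   # close to sigma
```

With many simulated pairs, the observed noise has mean near zero and standard deviation near sigma, which is exactly the pattern the model assumes in the population.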
Suppose we pick a line with b_0 = 1 and b_1 = 2. We can evaluate this prediction equation using Python. This prediction equation is our guess of the mean equation, and GuessResponse is our estimated mean of the response, which is also our guessed predicted value of the response. Then we can compute the difference between the real median price and our guess, which we call observederror. We name it observederror to distinguish it from the noise term in the population model; observederror can be regarded as an observation of the noise term. Here we print the observederror for the 7th, 20th, and 100th pairs from the sample. Using the sum method of a DataFrame, we can compute the sum of squared errors, SSE for short. If SSE is large, the observed pairs lie far from your prediction equation. Ideally, we want a line whose SSE is very small; in other words, all sample pairs are close to the prediction line you found.

Now the task turns into a minimization problem: choose the estimates of beta_0 and beta_1 that minimize SSE. The prediction equation found with this criterion is called the best-fit line, and this estimation process is called ordinary least squares estimation. With some mathematical computation, explicit formulas exist for b_0 and b_1. In the formula for b_1, the estimated slope of the model, the denominator measures the variation of the predictor and the numerator measures the association between x and y. The ratio therefore measures the sensitivity of the change in the response with respect to a change in the predictor.

To estimate the model, we use the statistical package Statsmodels. We call the fit method of the OLS model of statsmodels, where OLS stands for ordinary least squares. The data parameter gives the DataFrame holding the sample, and the formula parameter tells which columns of the DataFrame are the response and the predictors. We can get the estimated coefficients b_0 and b_1 from the params attribute of the fitted model.
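The steps above can be sketched as follows. Since the lecture's housing DataFrame is not reproduced here, the snippet generates stand-in RM and MEDV columns (an assumption for illustration only); the guessed line, the observederror column, the SSE, and the closed-form least squares formulas match the procedure described in the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing sample (columns named as in the lecture)
rng = np.random.default_rng(1)
df = pd.DataFrame({"RM": rng.uniform(4, 8, 100)})
df["MEDV"] = 1.0 + 2.0 * df["RM"] + rng.normal(0, 1, 100)

# A guessed prediction equation: y_hat = b_0 + b_1 * x with b_0 = 1, b_1 = 2
df["GuessResponse"] = 1 + 2 * df["RM"]

# Observed error: observed response minus our guess, an observation of the noise
df["observederror"] = df["MEDV"] - df["GuessResponse"]
sse_guess = (df["observederror"] ** 2).sum()  # sum of squared errors (SSE)

# Closed-form ordinary least squares estimates:
# b_1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  b_0 = y_bar - b_1 * x_bar
x, y = df["RM"], df["MEDV"]
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

sse_best = ((y - (b0 + b1 * x)) ** 2).sum()
print(sse_guess >= sse_best)  # True: no line has smaller SSE than the best fit
```

Note how the formula for b1 is the ratio of the x-y co-variation (numerator) to the variation of the predictor alone (denominator), as described above.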
Finally, we can compute the predicted values using the prediction equation. We can plot the best-fit line along with the sample data and our initial guess. The yellow line is the best-fit line. Obviously, the sum of squared errors is much smaller with the best-fit line, so the least-squares criterion is reasonable.

Statsmodels also provides a statistical evaluation of the model through the summary method. First, we need to pay attention to the p-value of the slope. This is the p-value of the two-tailed test on the slope. As we discussed in topic three for the population mean, in a two-tailed test the p-value is the probability that the statistic takes more extreme values in the two tails. Here, the statistic is (b_1 - beta_1) / S_b1, where beta_1 is the hypothesized slope (zero under the null); because the population standard deviation of the estimator b_1 is unknown, we use the sample estimate S_b1 as a replacement. Hence this statistic follows a t-distribution with n - 2 degrees of freedom. The degrees of freedom equal n - 2 because we need x-bar and y-bar to compute S_b1, which costs two degrees of freedom. If we can reject the null hypothesis that beta_1 equals zero, that is equivalent to saying the predictor RM is useful in predicting the median house value, and at this significance level we have only a five percent chance of wrongly including RM in the model.

Second, we can get a confidence interval for the slope. By default the confidence level is 95 percent, and the 95 percent confidence interval for the slope is (8.279, 9.925). This interval lies entirely on the positive side of the real line, so the slope is positive with high probability, which is consistent with the correlation and the scatter plot of our data.

Finally, we need to pay attention to the R-square in the summary output. R-square is an important measure of a model's performance. First, we compute the variation of y without any model, which is the sum of squared deviations of the observed y from the mean of y, denoted SST.
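A minimal sketch of reading these quantities from statsmodels (assuming statsmodels is installed; the data here is again a made-up stand-in for the housing sample, so the numeric coefficients will differ from the 8.279 to 9.925 interval quoted above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data; the lecture fits MEDV on RM from the housing sample
rng = np.random.default_rng(2)
df = pd.DataFrame({"RM": rng.uniform(4, 8, 200)})
df["MEDV"] = 1.0 + 2.0 * df["RM"] + rng.normal(0, 1, 200)

# The formula names the response (left of ~) and the predictor (right of ~)
model = smf.ols(formula="MEDV ~ RM", data=df).fit()

b0, b1 = model.params["Intercept"], model.params["RM"]
p_slope = model.pvalues["RM"]    # p-value of the two-tailed t-test on the slope
ci = model.conf_int(alpha=0.05)  # 95% confidence interval, one row per parameter
r2 = model.rsquared

print(model.summary())           # full statistical evaluation of the model
```

The summary table collects all three quantities discussed above: the slope's p-value, its confidence interval, and the R-square.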
Then we compute the sum of squared deviations of the predicted y from the mean of y. This is the part of the variation of the response that can be explained by the model, which we denote SSR. If SSR is large, the difference between our predictions and the mean of y is large, meaning our prediction is significantly different from the mean of the response, which is the estimate of the response without a model. There is another variation, SSE, which we mentioned in ordinary least squares: the sum of squared errors. It is the variation of the response that cannot be explained by the model or the predictor. With some mathematical calculation, the identity SST = SSR + SSE holds.

We want a good model in the sense that most of the variation in our target, the response variable, can be explained by the model. In other words, we hope SSE, the unexplained variation, is relatively small. In simple linear regression, we use R-square to measure the percentage of explained variation. For our model of the median price, R-square equals 0.484, meaning that about 48.4 percent of the variation of MEDV can be explained by our model. Some may ask: is an R-square of 48 percent too low, so that the derived model is not a good model? We will discuss this in more detail later. Here, please notice two facts. First, in this example R-square is less than 50 percent, which implies the median price is not determined solely by the number of rooms; a big portion of it is explained by other variables, for which we need multiple linear regression. Second, if the response is very noisy, like stock returns, an R-square of 48 percent is already high enough to generate profit in trading. We will cover these two points later, when we apply linear regression to stock market data.
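The variation decomposition above can be checked numerically. This sketch uses made-up data in place of the housing sample (an assumption for illustration), fits the least-squares line with the closed-form formulas, and verifies that SST = SSR + SSE and that R-square is the explained share SSR/SST.

```python
import numpy as np

# Hypothetical data standing in for the MEDV ~ RM sample
rng = np.random.default_rng(3)
x = rng.uniform(4, 8, 150)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 150)

# Least-squares fit via the closed-form estimates
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = ((y - y.mean()) ** 2).sum()      # total variation of the response
ssr = ((y_hat - y.mean()) ** 2).sum()  # variation explained by the model
sse = ((y - y_hat) ** 2).sum()         # variation left unexplained

print(np.isclose(sst, ssr + sse))      # True: the decomposition SST = SSR + SSE
r_square = ssr / sst                   # equivalently, 1 - sse / sst
```

Because SST is fixed by the data, minimizing SSE (the least-squares criterion) is the same as maximizing SSR, and hence R-square.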