Hi everyone. In this video we're going to talk about multi-linear regression. Previously we talked about simple linear regression, where we have only one variable. Now we're going to add more variables, whether that's higher-order terms of that single variable or other features. The key idea we're going to discuss is that when model complexity increases by adding more features, we can fit the data better, but it can also introduce other problems. We'll introduce the concept of the bias-variance trade-off. Then we'll talk about how to select the features that contribute most to the model. Last time we talked about single-variable linear regression, which takes the form y = a_0 + a_1 x_1. Here x_1 is the one feature we care about, a_1 is the slope, and a_0 is the intercept. The example we had was predicting house sale price as a function of house size: size would be x_1 and price would be y. Now let's say we want to add another feature, say the size of the lot. When a house has a big lot, maybe it's more expensive than the same house with a smaller lot. We can capture this by adding a new term, a new feature, to our model: a_2 x_2. Similarly, we can add more features, such as the number of bedrooms, and the model becomes more complex, and so on. This is still linear regression; specifically, it's called multi-linear regression because it has multiple features. But we can also build a model with higher-order terms of the house size. For example, we can have a squared term of the house size. In that case we'll have a_1 x_1 plus a_2 x_1 squared, and that could also be a good model. If we want to add more complexity, or more higher-order terms of the same feature, we could add a third term, a cubic term of the house size, and we can keep adding more.
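To make this concrete, here is a minimal sketch of multi-linear regression with two features, solved by ordinary least squares in NumPy. The house sizes, lot sizes, and prices are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical data: house size (sq ft), lot size (sq ft), and sale price ($).
size = np.array([1400.0, 1600.0, 1700.0, 1875.0, 2350.0])
lot = np.array([5000.0, 6500.0, 4800.0, 9000.0, 12000.0])
price = np.array([245000.0, 312000.0, 279000.0, 308000.0, 405000.0])

# Design matrix [1, x_1, x_2] for the model y = a_0 + a_1*x_1 + a_2*x_2.
X = np.column_stack([np.ones_like(size), size, lot])

# Least-squares solve for the coefficients (a_0, a_1, a_2).
coeffs, *_ = np.linalg.lstsq(X, price, rcond=None)
a0, a1, a2 = coeffs

# Predictions from the fitted model.
pred = X @ coeffs
```

Adding the number of bedrooms would just mean appending one more column to `X` and one more coefficient, which is why the same machinery scales to any number of features.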
In this case it's called polynomial regression, and it's still a form of multi-linear regression: each higher-order term acts as its own feature. We can also engineer features. We're not restricted to squared and cubic terms; we can create other variables or features from the existing ones. For example, suppose we are predicting the probability of getting diabetes based on a person's height, weight, and some other features measured in the lab. Instead of the model a_1 times height plus a_2 times weight plus and so on, we can construct another variable, say x prime, which is BMI, proportional to weight divided by height squared. This BMI is a function of x_1 and x_2, and it becomes a new feature x prime. Our model becomes a_0 plus a_1 x prime, plus the other terms we wanted to add [inaudible], instead of having height and weight as separate features. There are many possibilities for engineering relevant features, depending on your domain knowledge or your intuition about the problem. A linear model can become really flexible this way. Now we're going to talk about what happens as we start adding more complexity to the model; there are some things we need to be careful about. Let's start with polynomial regression. Here m represents the order of the maximum term: m = 1 is simple linear regression, a_0 plus a_1 x; m = 2 is a_0 plus a_1 x plus a_2 x squared, and so on. This m controls the complexity of our model. The simple linear regression fit looks like a straight line, which is okay, but maybe still a little too simple for this data. Add a squared term, and maybe it fits a little better. We can add a cubic term. As you add more higher-order terms, the fitted line becomes more flexible and can take on different shapes of the curve.
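As a sketch of this kind of feature engineering, here is how a BMI feature could be built from height and weight with NumPy. The measurements are invented for illustration:

```python
import numpy as np

# Hypothetical measurements: height in meters, weight in kilograms.
height = np.array([1.60, 1.75, 1.68, 1.82])   # x_1
weight = np.array([55.0, 80.0, 72.0, 95.0])   # x_2

# Engineered feature x' = weight / height^2 (BMI), a function of x_1 and x_2.
bmi = weight / height**2

# Design matrix for the model y = a_0 + a_1 * x', replacing the two raw features.
X = np.column_stack([np.ones_like(bmi), bmi])
```

The point is that `bmi` enters the least-squares fit like any other column; the "engineering" happens entirely before the linear model sees the data.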
At some point, though, the fitting actually fails. What happened here is that I wasn't careful about the scaling of the feature x. In my simple linear regression model, house size is on the order of thousands, and my y, the price, is on the order of millions, so this coefficient could be on the order of a thousand or less, and so on. The squared term is then on the order of millions, and by the time I raise the house size to the sixth power, it could be 10^18, a really big number, and the coefficient that matches it has to be very small. That means the computer has a hard time calculating all these coefficients, so the fitting may not work very well. One way to prevent this disaster is to scale the feature to be on the order of one instead of a thousand. If you divide your feature by 1,000, the sizes become something like one to six or seven, and then even the sixth-power term stays around 10^6 at most instead of 10^18. That's much more manageable, so you can add more higher-order terms if you want to. However, you will see shortly that we don't want to add higher-order terms indefinitely, which leads to a question: where do we want to stop adding higher-order terms? Obviously, the model fitness on the training data will go up and up as you add more model complexity. You can have some data like this and a model with really high order that fits every point. That model is not very good. First, it's not very interpretable; second, it's more vulnerable to new data points. At a new data point like this one, the wiggly model will have a huge error, while a simpler model will have a smaller error there. That's the motivation. How do we determine where to stop when we add model complexity? We want to monitor the error that's introduced when we introduce new data points.
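One way to see the numerical problem is to compare the conditioning of the polynomial design (Vandermonde) matrix before and after scaling. The house sizes below are hypothetical, and the degree-6 cutoff is just the example from above:

```python
import numpy as np

# Hypothetical house sizes, on the order of thousands of square feet.
sizes = np.linspace(1000.0, 3000.0, 20)

# Degree-6 design matrix on the raw scale: entries reach ~3000**6, about 7e20.
V_raw = np.vander(sizes, 7)

# Same matrix after dividing the feature by 1,000, so entries stay near one.
V_scaled = np.vander(sizes / 1000.0, 7)

# The condition number measures how hard the least-squares solve is
# numerically; scaling shrinks it by many orders of magnitude.
cond_raw = np.linalg.cond(V_raw)
cond_scaled = np.linalg.cond(V_scaled)
```

A solver working in double precision struggles with `V_raw`, which is exactly the "fitting fails" behavior in the lecture; with `V_scaled`, the same degree-6 fit is routine.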
Remember we talked about how to measure test error and training error. We have a dataset with both features and labels, and we set aside some portion of it as test data. Another name for test data used while training is the validation set. The terms are often used interchangeably, but in the machine learning community, "validation error" is the more common term for data set aside for testing purposes while you're still training the model. Anyway, with this split, we can measure errors for training and testing. Say we pick MSE. As mentioned before, we train a model, get predictions from the training data, and with the training labels we calculate the mean squared error, or any other metric of your choice. That becomes the training error. We do the same for the test data: predictions on x_te compared against y_te give the error for the test data. Each f corresponds to a different model with different higher-order terms, that is, different model complexity: this one is m = 1, this one is m = 2, etc. When you plot them, the exact shapes of the training-error and test-error curves will differ depending on the number of data points, the particular random sample, the model complexity, and so on. In general, though, you're going to see this type of error curve: the training error goes down as you increase model complexity, but the test error goes down in the beginning and then, at some point, starts going up again as complexity increases. We can find the sweet spot where the test error is minimized, so here we'd pick model complexity m = 2. You can also see that the m = 3 model is comparably good.
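Putting the pieces together, here is a sketch of sweeping the polynomial degree m and recording training and validation MSE on synthetic data. The data-generating function (a quadratic), the noise level, and the split sizes are all assumptions made just for this demo:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, 60)
# True relationship is quadratic, plus noise.
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.2, 60)

# Hold out part of the data as a validation (test) set.
x_tr, y_tr = x[:40], y[:40]
x_te, y_te = x[40:], y[40:]

def mse(pred, truth):
    return np.mean((pred - truth) ** 2)

train_err, test_err = [], []
for m in range(1, 9):                      # model complexity: polynomial degree
    coeffs = np.polyfit(x_tr, y_tr, m)     # fit f_m on the training split only
    train_err.append(mse(np.polyval(coeffs, x_tr), y_tr))
    test_err.append(mse(np.polyval(coeffs, x_te), y_te))
```

Plotting `train_err` and `test_err` against m reproduces the curves from the lecture: training error keeps falling with m, while validation error bottoms out near the true complexity.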
In some cases, depending on the data draw, the m = 3 model may actually show slightly better results than m = 2. However, if they are similar, you still want to choose the simpler model, and this principle is called Occam's razor. It simply says that if a simpler model and a more complex model perform similarly, we prefer the simpler one.
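As a sketch, this "prefer the simpler model when performance is similar" rule can be written as picking the smallest m whose validation error is within some tolerance of the best one. The error values and the tolerance below are made-up numbers for illustration:

```python
# Hypothetical validation errors for m = 1..5 (invented for the example).
val_err = [0.90, 0.31, 0.30, 0.33, 0.45]

tol = 0.05  # how close counts as "similar" is a modeling choice
best = min(val_err)

# Occam's razor: smallest complexity whose error is close to the minimum.
m_chosen = next(m for m, e in enumerate(val_err, start=1) if e <= best + tol)
# m_chosen is 2 here, even though m = 3 has the strictly lowest error.
```

The tolerance encodes how much accuracy you're willing to trade for a simpler, more interpretable model; there's no universal value for it.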