Welcome. In this video, we will learn how to use XLMiner to perform linear regression. I assume that you already have XLMiner properly installed on your computer. Throughout the video, I will use the Boulder housing data set. I will first demonstrate how to build a simple regression model with one predictor variable, and then proceed to show you how to build a multiple regression model with more than one predictor variable. We will also talk about interaction terms and transformations as ways to improve regression models. Finally, I will show how to use XLMiner to partition the data set and perform cross validation.

Here is the Boulder housing data set we discussed before. The data has many columns. To perform a simple regression, we are going to use two of them: list price is used as the target variable, and square feet is used as the predictor variable. If you have XLMiner properly installed, you should see the XLMiner ribbon when you bring up Excel. To perform regression, click Predict and then Multiple Linear Regression. Note that all variables are listed here. Choose square feet and move it to the selected variables, and set list price as the output variable. Click Next. You will see a number of options. We skip them for now and click Finish. This creates two new output sheets in the Excel workbook.

Let's take a closer look at the output sheet named MLR output. At the top of the window is the output navigator, which can take you to the different sections of the output. Let's scroll down to the regression model section. As we can see here, the coefficient estimates are -124 and 0.4268. This table also gives us the p-values, which are quite small for both coefficients, indicating that the model is statistically significant. There are some additional summary statistics in the table on the right, in particular the R-squared of 0.64, meaning that 64% of the variation in list prices is explained by the model. The RSS, or residual sum of squares, is 97901628.
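If you want to check what XLMiner is computing here, the same simple regression can be reproduced outside Excel. Below is a minimal Python/NumPy sketch using small made-up numbers (not the actual Boulder data), showing how the coefficient estimates, R-squared, and residual sum of squares are obtained:

```python
import numpy as np

# Illustrative data only -- not the Boulder housing data set from the video.
# Prices are in $1,000s; sizes are in square feet.
sqft = np.array([1100.0, 1500.0, 1800.0, 2400.0, 3000.0])
price = np.array([350.0, 480.0, 620.0, 830.0, 1050.0])

# Fit price = b0 + b1 * sqft by ordinary least squares.
X = np.column_stack([np.ones_like(sqft), sqft])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1 = beta

# Residual sum of squares (XLMiner's "sum of squared errors") and R-squared.
fitted = X @ beta
rss = float(np.sum((price - fitted) ** 2))
tss = float(np.sum((price - price.mean()) ** 2))
r2 = 1.0 - rss / tss

print(b0, b1, r2, rss)
```

The R-squared printed here plays the same role as the 0.64 reported by XLMiner: the fraction of variation in the target explained by the model.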
This number is also called the sum of squared errors, and it is also reported in the training data scoring section.

Building a multiple regression model follows almost exactly the same steps. Let's first return to the worksheet with the data. Click Predict and then Multiple Linear Regression. XLMiner actually remembers our previous selections. Let's add parking type to the selected variables and click Finish. XLMiner creates another two sheets with the updated output. Scroll down to the regression model section in the output sheet. Here we can see the multiple regression output. Because we have two predictor variables, we have one additional coefficient estimate. The coefficient estimate for parking type is -99.56, indicating that the value of garage parking is around negative $99,000. This is quite surprising and is likely caused by multicollinearity. Recall that multicollinearity occurs when two predictor variables are correlated. For our data set, it just so happens that larger houses are more likely to have garage parking.

There are several different ways to improve a linear regression model. We discuss two of them here. The first is interaction terms, or mathematically, simply the product of two variables. To introduce an interaction term between parking type and square feet, we add one more column to our data set. Let's return to the worksheet with our original data. We will call our new data column "square feet times parking type"; type the name into the first empty cell in column K. Move the cursor to the cell on the next row in the same column and type a formula that multiplies square feet and parking type. Fill down the formula in the column to create the value of the interaction term for all rows. To build a new model with this interaction term, click Predict and then Multiple Linear Regression. In the pop-up window, our previous selections are still there. However, the interaction term does not appear in the available variables. This is because the data range ends at column J.
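The interaction column built in Excel is just an elementwise product of the two predictors. Here is a minimal sketch of the same idea in Python/NumPy, again with made-up numbers and a hypothetical 0/1 coding for parking type (1 = garage, 0 = street):

```python
import numpy as np

# Illustrative data only; parking coding (1 = garage, 0 = street) is an assumption.
sqft = np.array([1100.0, 1500.0, 1800.0, 2400.0, 3000.0, 1300.0])
parking = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
price = np.array([350.0, 480.0, 650.0, 880.0, 1120.0, 410.0])

# The interaction term is the elementwise product, like the
# "square feet times parking type" column filled down in Excel.
interaction = sqft * parking

# Fit with intercept, sqft, parking, and the interaction term:
# four coefficient estimates, as in the XLMiner output.
X = np.column_stack([np.ones_like(sqft), sqft, parking, interaction])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
print(beta)
```

With the interaction included, the effect of square footage is allowed to differ between garage-parking and street-parking houses.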
Go to the data range input and change J to K. Observe that the interaction term is now available. Move this new variable to the selected variables and click Finish to get the results. Now there are four coefficient estimates. In particular, the coefficient for the interaction term is 0.0169.

Next, let's briefly discuss data transformations. We apply a logarithm transformation to the list price and square footage variables. This can be done by adding two data columns. Go back to the worksheet with the data. In column L, create a variable called Log Price: type the column name Log Price in the first empty cell in column L, and in the cell on the next row, type a LOG formula that references the list price cell. In column M, create a variable called Log Square Footage: type Log Square Footage in the first empty cell, and in the cell on the next row, type a LOG formula that references the square feet cell. Select the two formula cells in the same row, and fill down to create the variables for all rows. To build a new model with the variables we just created, click Predict and then Multiple Linear Regression again. In the pop-up window, clear the selected variables. Now select Log Price as the output variable, and move Log Square Footage to the selected variables. If you do not see the variables we just created, remember to update the data range. We obtain a substantially improved model, in the sense that R-squared is substantially larger at 0.78, meaning that 78% of the variation in the target variable is explained by the model. Note that the residual sum of squares, RSS, in the output is not directly comparable to the one we obtained before, because the variables are now on the log scale.

Our next topic in this video is cross validation. The first step in cross validation is to divide the data into training and validation sets. This can be done easily with XLMiner. To do this, click Partition and then Standard Partition. In the pop-up window, we can choose which variables to include.
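Before moving on, here is a minimal sketch of the log-transformation step in Python/NumPy with made-up numbers. Note that Excel's LOG function defaults to base 10, so `np.log10` is the matching choice; the comment about the log-scale RSS mirrors the caveat above:

```python
import numpy as np

# Illustrative data only (prices in $1,000s).
sqft = np.array([1100.0, 1500.0, 1800.0, 2400.0, 3000.0])
price = np.array([350.0, 480.0, 620.0, 830.0, 1050.0])

# Log-transform both variables, mirroring the LOG() columns added in Excel
# (Excel's LOG defaults to base 10).
log_sqft = np.log10(sqft)
log_price = np.log10(price)

# Fit log_price = b0 + b1 * log_sqft by ordinary least squares.
X = np.column_stack([np.ones_like(log_sqft), log_sqft])
beta, *_ = np.linalg.lstsq(X, log_price, rcond=None)

fitted = X @ beta
rss_log = float(np.sum((log_price - fitted) ** 2))
tss = float(np.sum((log_price - log_price.mean()) ** 2))
r2 = 1.0 - rss_log / tss
# rss_log is on the log scale, so it is NOT directly comparable
# to the RSS from the untransformed model.
print(r2, rss_log)
```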
We choose to include all variables by moving them to the selected variables. At the bottom, we can set the percentage in the training set. We accept the default 60/40 split. Note that you can also choose to set your own percentages, in which case you can divide the data into three sets instead of two, where the last set is called the test set. Clicking OK brings up a new worksheet with the partitioned data. As you can see, there are 188 rows in the training set and 126 rows in the validation set. Now we can bring up the multiple linear regression window as before. Observe that the worksheet field is filled with the data partition, which is the worksheet we just created. Again we will build a simple regression model with list price as the output variable and square footage as the predictor variable. Compared to the case without cross validation, the output now contains a summary report for validation data scoring, which is substantially different from the results from training data scoring. The coefficient estimates are also slightly different from before. This is because when we perform cross validation, only the training data is used to build the model.

I hope you enjoyed the video. Even though we discussed relatively simple regression models, we can already see the power of XLMiner. XLMiner also has many advanced features and options not covered here. I encourage you to explore them on your own time.
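For reference, the 60/40 partition and validation scoring described above can be sketched in Python/NumPy. The synthetic data below is a stand-in for the housing worksheet; only the 314-row count matches the partitioned sheet in the video:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data; 314 rows, like the partitioned worksheet.
n = 314
sqft = rng.uniform(800.0, 3500.0, size=n)
price = 50.0 + 0.4 * sqft + rng.normal(0.0, 60.0, size=n)

# 60/40 train/validation split, like XLMiner's standard partition.
idx = rng.permutation(n)
n_train = int(0.6 * n)
train, valid = idx[:n_train], idx[n_train:]

# Fit on the training rows only -- the key point of cross validation.
Xtr = np.column_stack([np.ones(train.size), sqft[train]])
beta, *_ = np.linalg.lstsq(Xtr, price[train], rcond=None)

# Score on the held-out validation rows (the "validation data scoring").
Xva = np.column_stack([np.ones(valid.size), sqft[valid]])
rss_valid = float(np.sum((price[valid] - Xva @ beta) ** 2))
print(n_train, valid.size, rss_valid)
```

Because the model never sees the validation rows during fitting, the validation score is the more honest estimate of how the model will do on new houses.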