Welcome. In this video we'll learn how to use XLMiner to perform linear regression. I assume that you already have XLMiner properly installed on your computer. Throughout this video I will use the Boulder Housing data set. I'll first demonstrate how to build a simple regression model with one predictor variable. I will then show you how to build a multiple regression model with more than one predictor variable. We will also talk about interaction terms and transformations as ways to improve regression models. Finally, I will show how to use XLMiner to partition the data set and to perform cross validation.

Here is the Boulder Housing data set we discussed before. The data has many columns. To perform a simple regression we're going to use two of them: list price as the output variable, and square feet as the predictor variable. If you have XLMiner properly installed, you should see the XLMiner ribbon when you bring up Excel. To perform regression, click Predict and then Multiple Linear Regression. Note that all variables are listed here. Choose square feet, move it to the selected variables, and set list price as the output variable. Click Next. You will see a number of options; we'll skip them for now and click Finish.

This creates two new output sheets in the Excel workbook. Let's take a closer look at the output sheet named MLR output. At the top of the sheet is the Output Navigator, which can take you to the different sections of the output. Let's scroll down to the Regression Model section. As we can see here, the coefficient estimates are -124 and 0.4268. This table also gives us the p-values, which are quite small for both coefficients, indicating that the model is statistically significant. There are some additional summary statistics in the table on the right. In particular, the R-squared is 0.64, meaning that 64% of the variation in list prices is explained by the model. The RSS, or residual sum of squares, is 97,901,628.
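The least-squares fit that XLMiner performs behind this dialog can be sketched outside of Excel. Here is a minimal illustration in plain NumPy; the `sqft` and `price` arrays are made-up stand-ins, not the actual housing values, so the numbers will not match the video's output.

```python
import numpy as np

# Hypothetical stand-in data (not the actual housing values):
# square footage as the predictor, list price (in $1000s) as the target.
sqft = np.array([1100.0, 1500.0, 1800.0, 2200.0, 2600.0])
price = np.array([350.0, 520.0, 610.0, 790.0, 980.0])

# Design matrix with an intercept column, solved by ordinary least
# squares -- the same computation a regression tool carries out.
X = np.column_stack([np.ones_like(sqft), sqft])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = coef

fitted = X @ coef
rss = np.sum((price - fitted) ** 2)          # residual sum of squares
tss = np.sum((price - price.mean()) ** 2)    # total sum of squares
r_squared = 1 - rss / tss                    # fraction of variation explained

print(f"intercept={intercept:.2f}, slope={slope:.4f}, R^2={r_squared:.3f}")
```

The R-squared printed here is computed exactly as described above: one minus the ratio of the residual sum of squares to the total sum of squares.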
This number is also called the sum of squared errors, and it is also reported in the Training Data Scoring section below.

Building a multiple regression model follows almost exactly the same steps. Let's first return to the worksheet with the data. Click Predict and then Multiple Linear Regression. XLMiner actually remembers our previous selections. Let's add parking type to the selected variables and click Finish. XLMiner creates another two sheets with the updated output.

Scroll down to the Regression Model section in the output sheet. Here we can see the multiple regression output. Because we now have two predictor variables, we have one additional coefficient estimate. The coefficient estimate for parking type is -99.56, indicating that garage parking reduces the predicted list price by around $99,000. This is quite surprising, and it is likely caused by collinearity. Recall that collinearity occurs when two predictor variables are correlated. For our data set, it just so happens that larger houses are more likely to have garage parking.

There are several ways to improve a linear regression model; we discuss two of them here. The first is interaction terms, or mathematically, simply the multiplication of two terms. To introduce an interaction term between parking type and square feet, we add one more column to our data set. Let's return to the worksheet with the original data. We'll call the new column square feet times parking type, and type that name in the first empty cell in column K. Move the cursor to the next cell in the same column and type in a formula that multiplies square feet and parking type. Fill the formula down the column to create the value of the interaction term for all rows.

To build a new model with this interaction term, click Predict and then Multiple Linear Regression. In the pop-up window, our earlier selections are still there. However, the interaction term does not appear among the available variables. This is because the data range ends at column J.
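The interaction column built in column K is just an elementwise product of the two predictors, which a short sketch makes concrete. The data here is again invented for illustration; only the construction of the interaction term mirrors the spreadsheet step.

```python
import numpy as np

# Hypothetical data (not the actual housing values): square footage,
# parking type coded 1 = garage / 0 = no garage, list price in $1000s.
sqft = np.array([1100.0, 1500.0, 1800.0, 2200.0, 2600.0, 1300.0])
garage = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
price = np.array([350.0, 520.0, 640.0, 820.0, 1010.0, 430.0])

# The interaction term is literally the product of the two predictors,
# just like the formula filled down column K.
interaction = sqft * garage

# Four columns: intercept, square feet, parking type, interaction --
# so the fit returns four coefficient estimates, as in the video.
X = np.column_stack([np.ones_like(sqft), sqft, garage, interaction])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print("intercept, sqft, parking, sqft*parking =", np.round(coef, 4))
```

Note that because the square-footage and parking-type columns are correlated in data like this, the individual coefficients can take surprising values, which is the collinearity effect described above.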
Go to the data range input and change J to K. Observe that the interaction term is now available. Move this new variable to the selected variables and click Finish to see the results. Now there are four coefficient estimates; in particular, the coefficient for the interaction term is 0.0169.

Next, let's briefly discuss data transformation. We'll apply a logarithm transformation to the list price and square footage variables. This can be done by adding two data columns. Go back to the worksheet with the data. In column L, create a variable called log price: type the name log price in the first empty cell in column L, and type the formula =LOG(A2) in the next cell down. In column M, create a variable called log square footage: type log square footage in the first empty cell and type the formula =LOG(B2) in the cell on the next row. Select the two cells on the same row and fill down to create the variables for all rows.

To build a new model with the variables we just created, click Predict and then Multiple Linear Regression again. In the pop-up window, clear the selected variables. Now set log price as the output variable and move log square footage to the selected variables. If you do not see the variables we just created, remember to update the data range.

We obtain a substantially improved model, in the sense that the R-squared is substantially larger at 0.78, meaning that 78% of the variation in the target variable is explained by the model. Note that the residual sum of squares (RSS) in this output is not directly comparable with what we obtained before, because the variables are now on the log scale.

Our last topic in this video is cross validation. The first step in cross validation is to divide the data into training and validation sets. This can be done easily with XLMiner. To do this, click Partition and then Standard Partition. In the pop-up window we can choose which variables to include.
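The log-transformed model can be sketched the same way. Excel's LOG function defaults to base 10, so the sketch uses `np.log10`; the data values are again hypothetical stand-ins, and the point is that the fit runs on the transformed columns exactly as it ran on the raw ones.

```python
import numpy as np

# Hypothetical stand-in data (not the actual housing values).
sqft = np.array([1100.0, 1500.0, 1800.0, 2200.0, 2600.0])
price = np.array([350.0, 520.0, 610.0, 790.0, 980.0])

# Excel's LOG() is base 10 by default; np.log10 matches that choice.
log_price = np.log10(price)
log_sqft = np.log10(sqft)

# Ordinary least squares on the transformed variables.
X = np.column_stack([np.ones_like(log_sqft), log_sqft])
coef, *_ = np.linalg.lstsq(X, log_price, rcond=None)

fitted = X @ coef
rss_log = np.sum((log_price - fitted) ** 2)
tss_log = np.sum((log_price - log_price.mean()) ** 2)
r2_log = 1 - rss_log / tss_log

# rss_log lives on the log scale, so it is NOT comparable to the RSS
# of the untransformed model -- only R-squared is a fair comparison.
print(f"R^2 on log scale = {r2_log:.3f}")
```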
We choose to include all variables by moving them to the selected variables. At the bottom we can set the percentages in the training and validation sets. We accept the default 60-40 split. Note that you can also set your own percentages, in which case you can divide the data into three sets instead of two, where the last set is called the test set. Clicking OK brings up a new worksheet with the partitioned data. As you can see, there are 188 rows in the training set and 126 rows in the validation set.

Now we can bring up the multiple linear regression window as before. Observe that the worksheet field is filled with the data partition worksheet that we just created. Again, we'll build a simple regression model with list price as the output variable and square footage as the predictor variable. Compared to the case without cross validation, the output now contains a summary report for Validation Data Scoring, which is substantially different from the results in Training Data Scoring. The coefficient estimates are also a little different from before. This is because when we perform cross validation, only the training data is used to build the model.

I hope you enjoyed the video. Even though we only discussed relatively simple regression models, you can already see the power of XLMiner. XLMiner also has many advanced features and options not covered here. I encourage you to explore them on your own time.
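As a closing sketch, the partition-and-score workflow can be mirrored in a few lines of NumPy. This is a hypothetical illustration, not XLMiner's internals: the data is randomly generated, though the 60/40 split of 314 rows does reproduce the 188/126 partition sizes from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 314 partitioned rows: price depends
# linearly on square footage, plus noise.
n = 314
sqft = rng.uniform(800, 3000, n)
price = 0.4 * sqft - 100 + rng.normal(0, 50, n)

# 60/40 random split, mirroring the default standard partition.
idx = rng.permutation(n)
n_train = int(0.6 * n)            # 188 training rows
train, valid = idx[:n_train], idx[n_train:]  # 126 validation rows

# Fit on the training rows only ...
X_train = np.column_stack([np.ones(n_train), sqft[train]])
coef, *_ = np.linalg.lstsq(X_train, price[train], rcond=None)

# ... then score both partitions. The validation error is the honest
# estimate, since those rows played no part in fitting the model.
def sse(rows):
    X = np.column_stack([np.ones(len(rows)), sqft[rows]])
    return np.sum((price[rows] - X @ coef) ** 2)

print(f"training SSE = {sse(train):.0f}, validation SSE = {sse(valid):.0f}")
```

Because only the training rows enter the fit, the two scoring reports generally differ, just as the Training and Validation Data Scoring sections do in the XLMiner output.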