Welcome back to our notebook. Here we'll cover question number 4, where we will perform our train and test splits for both of our datasets: the original data frame with the string categorical features dropped, as well as the version that contains the one-hot encoded versions of each of those string variables. So we're going to have one where there are only 37 different features and one where we have 294 features, as we saw before. We're going to split both using the train_test_split functionality so that they are split in exactly the same fashion. Then we're going to see, given that holdout set, which one performs better: the one-hot encoded version or the original version. The first thing that we do is import train_test_split from sklearn.model_selection. We set y_col equal to the string SalePrice. We're going to use this in just a second in order to separate out our x variable and our y variable, that is, our features and our target column. So our feature_cols are going to be x for x in data.columns if x is not equal to y_col. That's a list of all of our columns except for SalePrice, which we defined up here. So that'll be all of our feature columns. We can then isolate all of our features using that feature_cols list: from our data, pull out just the feature_cols and set that equal to x_data. We then set y_data equal to our target column, which is just data selecting that y_col that we defined above. Then we're going to use x_data and y_data in order to create our x_train, x_test, y_train, and y_test, using the train_test_split function. So how does the train_test_split function work? We saw this briefly in another notebook. We pass in x_data and y_data, and we pass in the test size. The test size says: of the number of rows we have, what percentage do we want to hold out for testing and not train on?
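The steps just described can be sketched as follows. This is a minimal sketch using a small hypothetical stand-in for the notebook's housing data (the real frame has far more rows and columns); the column names and the 30 percent test size are assumptions based on the discussion here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's housing data frame
data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

y_col = "SalePrice"

# Every column except the target is a feature
feature_cols = [x for x in data.columns if x != y_col]

x_data = data[feature_cols]
y_data = data[y_col]

# Hold out 30% of the rows; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=42
)
```

Because a fixed random_state is passed, rerunning the cell always reproduces the same row assignment.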
Then we pass a random state to ensure that we get the same split every time, and that you'll be getting the same split that I'm getting now. We get four outputs from train_test_split. Assuming that we passed in both an x and a y, those four values will be, in this order: x_train, x_test, y_train, then y_test, where x_train and y_train are paired together in order to fit our model. We then use that fitted model to see how well we can predict on x_test, taking those predictions and comparing how well they perform against y_test, which was held out. Then we're going to do the same thing for data_ohc, which is just our data with the one-hot encoded variables. We get x_data_ohc using the same feature_cols process as before, except this time we say x for x in data_ohc.columns, assuming it's not SalePrice, so that we get all 294 columns that we discussed before. Then we set the y_data in the same way, and we use the same train_test_split to get our x_train, x_test, y_train, and y_test in the same fashion as above, but this time for the one-hot encoded data, naming them appropriately with the _ohc suffix. Once we run that, we can look at the shape of x_train_ohc: that's going to be 965 rows, which should be 70 percent of the original data frame. If we were to look at y_train_ohc, what do you think that shape would be? Hopefully you guessed correctly: 965 rows, matching each of the rows in x_train_ohc, because to fit a model there has to be the same number of rows. Then if we look at x_train without ohc, we should have many fewer columns, namely 36 columns, because we don't have all those one-hot encoded columns. We then check that x_train_ohc and x_train from the original dataset both have the same indices, so that we know we got the exact same split, and we see that that is true.
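The one-hot encoded split works the same way. Here is a minimal sketch on a hypothetical toy frame with a single string column (the real data_ohc has 294 feature columns); pd.get_dummies stands in for however the notebook built its one-hot encoded frame, and the matching test_size and random_state are what guarantee the identical row split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with one string column to one-hot encode
data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# String columns become 0/1 indicator columns (Street_Grvl, Street_Pave)
data_ohc = pd.get_dummies(data)

y_col = "SalePrice"
feature_cols = [x for x in data_ohc.columns if x != y_col]
x_data_ohc = data_ohc[feature_cols]
y_data_ohc = data_ohc[y_col]

# Same test_size and random_state as before, so the row split matches
x_train_ohc, x_test_ohc, y_train_ohc, y_test_ohc = train_test_split(
    x_data_ohc, y_data_ohc, test_size=0.3, random_state=42
)

# Splitting the non-encoded frame with the same settings yields the
# same row indices, which is exactly the .all() check described above
x_train, _, _, _ = train_test_split(
    data[["LotArea"]], data[y_col], test_size=0.3, random_state=42
)
print((x_train.index == x_train_ohc.index).all())
```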
Just to make clear how the .all() functionality works: if we compare two arrays to see whether they're equal to one another, the comparison outputs true or false for every single value according to their one-by-one match. If we then call .all(), every single value in that array has to be true for the whole expression to evaluate to true. Now let's run our linear regression to see how well we're able to perform on our train and test sets. We import LinearRegression from sklearn.linear_model, and we also import mean_squared_error from sklearn.metrics. We initiate that linear regression model as LR, and create an empty list called error_df, to which we will append each of our error terms so that we can evaluate them later. The first thing that we're going to do is fit our model on the original x_train and y_train, without any one-hot encoding: we say LR.fit(x_train, y_train). LR is now fit to our data, so it has come up with its parameters, and then we can use LR.predict to see first how well we do on x_train, setting that equal to y_train_pred. When we predict on x_train we should be able to get a much lower error, because the model has already seen all of this data before. Then we're going to call LR.predict on x_test, which is our holdout set. That should give us a better picture of reality, namely how well we're able to generalize when we see new data that wasn't trained on. We then append to error_df a Series. The Series just takes a dictionary; you can think of a Series as a column again, where each of these keys will be an index for that column. The first index value is train, where we get the mean squared error between the training set and the prediction for the training set; recall that that should be the lower number.
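Both pieces, the .all() check and the first regression fit, can be sketched like this. The tiny train/test arrays here are hypothetical stand-ins for the real splits, and y_train_pred is an assumed variable name based on the description above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Element-wise comparison yields an array of booleans; .all() is True
# only if every single element matched
a = np.array([1, 2, 3])
b = np.array([1, 2, 3])
print(a == b)          # [ True  True  True]
print((a == b).all())  # True

# Hypothetical tiny splits standing in for the notebook's data
x_train = pd.DataFrame({"LotArea": [8450, 9600, 11250, 9550]})
y_train = pd.Series([208500, 181500, 223500, 140000])
x_test = pd.DataFrame({"LotArea": [14260, 14115]})
y_test = pd.Series([250000, 143000])

LR = LinearRegression()
error_df = []  # will collect one labeled Series per model variant

# Fit on the training data, then predict on both splits
LR = LR.fit(x_train, y_train)
y_train_pred = LR.predict(x_train)
y_test_pred = LR.predict(x_test)

# Store train and test errors as one named Series (a future column)
error_df.append(pd.Series(
    {"train": mean_squared_error(y_train, y_train_pred),
     "test": mean_squared_error(y_test, y_test_pred)},
    name="no enc",
))
```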
Then we get the mean squared error for the test set and the test predictions, which should be a somewhat higher error. We're then going to name it "no enc", standing for no encoding; since we're creating columns, we're eventually going to be creating a data frame, so that will be the column name, with train and test as the two indices. We then do the same thing for the one-hot encoded version of our data. It's all the same steps: LR.fit(x_train_ohc, y_train_ohc), this time looking at the one-hot encoded versions of x_train and y_train. We come up with a prediction on our training set, which again should do better than it does on the test set, and then we see how well we're able to predict on the actual test set, our holdout set that we didn't use in training. We then append to the original error_df another pd.Series, again with the indices train and test, taking the mean squared error for y_train versus its prediction as well as y_test versus its prediction. This one we name "one-hot enc". We should now have two columns, one for no encoding and another for one-hot encoding, each with the same indices, train and test. We use pd.concat in order to concatenate these Series together and create our data frame, and we end up with error_df, which is this data frame here. Now, let's think about which one ended up having a lower error. Looking at the training set, the same data that was used for fitting, we got a lower mean squared error for the one-hot encoded version than for the no-encoding version. In that sense, one-hot encoding did better on the training set.
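The whole comparison can be sketched end to end. Again this uses hypothetical tiny splits rather than the real data, and the loop over a dict is a compact restructuring of the two fit/predict/append passes described above, not the notebook's literal code.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical tiny splits standing in for the two datasets; the
# _ohc frames just carry one extra dummy column for illustration
x_train = pd.DataFrame({"LotArea": [8450, 9600, 11250, 9550]})
x_train_ohc = x_train.assign(Street_Pave=[1, 1, 0, 1])
y_train = pd.Series([208500, 181500, 223500, 140000])
x_test = pd.DataFrame({"LotArea": [14260, 14115]})
x_test_ohc = x_test.assign(Street_Pave=[0, 1])
y_test = pd.Series([250000, 143000])

error_df = []
for name, (xtr, xte) in {"no enc": (x_train, x_test),
                         "one-hot enc": (x_train_ohc, x_test_ohc)}.items():
    LR = LinearRegression().fit(xtr, y_train)
    # One named Series per variant, indexed by train/test
    error_df.append(pd.Series(
        {"train": mean_squared_error(y_train, LR.predict(xtr)),
         "test": mean_squared_error(y_test, LR.predict(xte))},
        name=name,
    ))

# Concatenate the Series side by side into the train/test error table
error_df = pd.concat(error_df, axis=1)
print(error_df)
```

Each Series name becomes a column of the final frame, which is why naming them "no enc" and "one-hot enc" up front pays off here.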
But then looking at the holdout set, we see that we performed much worse with the one-hot encoding. Now, what could be the reason for this? If you think about the idea we talked about in terms of complexity and overfitting, the more parameters you're able to use, the more likely you are to overfit your data: you're able to come up with coefficients that exactly fit your training set but may not replicate out in the real world. In this case, using the one-hot encoding, it seems that we overfit our data, and a good indicator of that is the big gap between the training set error and the test set error. You can see here that we're closer to just right when the two aren't as widely separated. Now, this is all relative of course, because these values are on the order of 10 to the power of 9. This may look like a large difference, but compared to the train and test errors of the one-hot encoded version, you see that we vastly overfit and came up with far too complex a model. That closes out question number 4. In question number 5, we will talk about scaling our data, and then in question number 6 we'll quickly plot our predictions versus the actual values, and that will close out this notebook. I'll see you there. Thank you.