Welcome back to our notebook. In this video, we're just going to focus on question 3. Not because question 3 is incredibly complicated, but because it introduces a lot of important concepts that you'll want to keep in your toolkit as you move along on your machine learning journey. This question asks us to create a new dataset where all of the categorical features that we just discussed are one-hot encoded. We can then fit this data and see how the results compare to not including these one-hot encoded categorical variables. So we're going to take our Pandas DataFrame and use its .copy method to create a completely separate copy that we'll use for one-hot encoding. On this copy, we will one-hot encode each of the appropriate columns and add the encoded columns back to the DataFrame. As we do this, we also have to make sure to drop each original column, which was just an object column whose values are all strings. Then, going back to the original data that are not one-hot encoded, we also have to drop all the string categoricals, because we can't pass strings into any of our models. For the one-hot encoding step, we're going to use OneHotEncoder. We introduced pd.get_dummies earlier; OneHotEncoder is a different way of doing the same thing, but it does it more efficiently once you're expanding your dataset beyond just 100 or 200 variables. Often the encoding will blow up into a matrix with 1,000 or 2,000 columns or more, and that can eat up a lot of memory. So here, within one-hot encoding, we'll use a sparse matrix rather than outputting the full DataFrame; we'll see that in practice in just a little bit. We're going to start off by importing OneHotEncoder from sklearn.preprocessing. We're then going to make a copy of the data with data.copy, using that method we just discussed.
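The copy step above can be sketched as follows. The DataFrame here is a small, made-up stand-in for the notebook's housing data; the point is just that `.copy()` gives a fully independent DataFrame, so encoding `data_ohc` never touches the original `data`.

```python
import pandas as pd

# Toy stand-in for the notebook's housing data (hypothetical values)
data = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "SalePrice": [208500, 181500, 223500],
})

# .copy() creates a fully independent DataFrame, so modifying
# data_ohc will never change the original data
data_ohc = data.copy()
data_ohc["Neighborhood"] = "changed"

print(data["Neighborhood"].tolist())  # original is untouched
```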
Now we have data_ohc as our new Pandas DataFrame, a copy that won't affect the original data. We're then going to instantiate our OneHotEncoder object, similar to how we do it with all of our sklearn objects. What's important to note here is that you can pass in certain arguments. One important argument, for those who want to do linear regression and ensure that their model is interpretable (we talked about prediction versus interpretability), lets you drop the first value from your dummy variables. I'll show you why in a second. We're not going to use it here, but I want to show you that it's available. Imagine that with our housing data we had a column that's either beachfront or not beachfront. It starts as one column, and we expand it to two dummy columns, with the price as our target. Say the first house is beachfront, so its dummies are one and zero, and its price is five; the second house is not beachfront, so its dummies are zero and one, and its median value is four. The first column indicates that a house is beachfront; the second, that it is not. When we do linear regression, if we remember the model, and we don't drop one of these columns, we can end up with an infinite number of different coefficient combinations for the first and second columns. Imagine we're trying to get to the numbers 5 and 4. We could start with the intercept equal to 1; then the coefficient for the first column would be 4, so 1 plus 4 equals 5, and the coefficient for the second column would be 3, so 1 plus 3 equals 4, and that would work.
But on the same note, we could have started with the intercept equal to 5 and the first coefficient equal to 0, and we'd still end up with five; then for the second one, we'd just set the second coefficient to negative one. What I'm showing here is that if you don't drop one of these columns, you can end up with an infinite number of choices for your intercept, your first coefficient, and your second coefficient, because the two columns are completely dependent on one another. There's perfect multicollinearity: when the first column is one, the second column is always zero, and vice versa. If we drop one of these columns — say, the not-beachfront column — then there's only one option. Our intercept will have to be 4, because we only have one coefficient left, for whether the house is beachfront: the not-beachfront house gets 4 plus 0 equals 4, and the beachfront coefficient is 1, so the beachfront house gets 4 plus 1 equals 5. That tells us that being beachfront adds an extra dollar of value to our median value. That's interpretable: the coefficients actually make sense, and we don't have to worry about multicollinearity messing with the interpretability of our coefficients. So that's a danger of multicollinearity. It probably will not affect your predictions, but it will affect interpretability. If you want high interpretability when you look back at the coefficients you've learned, make sure you use this drop-first option, which is also available in pd.get_dummies. Now I'm going to remove this argument. Here we're just focusing on predictability, so it won't have much of an effect, and I want to match up with the number we came up with earlier, 215, which doesn't take into account dropping one of the columns. But best practice will generally be to drop one of these columns. Now, for each column in our values here.
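The drop-first idea can be sketched like this, using a hypothetical beachfront column rather than the notebook's actual data. Without dropping, the two dummy columns always sum to one (perfect multicollinearity); with `drop="first"`, one category's column is removed and the regression coefficients have a single interpretable solution.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical beachfront example from the discussion above
df = pd.DataFrame({"beachfront": ["yes", "no", "yes", "no"]})

# Without drop: two perfectly collinear dummy columns (each row sums to 1)
full = OneHotEncoder().fit_transform(df[["beachfront"]]).toarray()

# With drop="first": the first category's column ("no") is removed,
# leaving a single "is beachfront" indicator
dropped = OneHotEncoder(drop="first").fit_transform(df[["beachfront"]]).toarray()

print(full.shape, dropped.shape)  # (4, 2) (4, 1)
```

The equivalent with pandas is `pd.get_dummies(df["beachfront"], drop_first=True)`.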
So, if you recall, the variable num_ohc_cols was just our Pandas Series with all of our categorical columns and their unique-value counts. When we say .index, we're just looping through each one of those column names. To make clear what's happening inside the for loop, let's set everything up as we go through it. We'll set col equal to 'Neighborhood', just to walk through a single example. The first thing we do is use the fit_transform method of the OneHotEncoder we defined. Now I'm going to have to pull in both of those earlier cells in order for this to work, because I have to create the copy of the dataset as well as the OneHotEncoder object. Then I use the OneHotEncoder to fit and transform just this individual column. Here's another key concept. We learned about the importance of dropping one of the columns; the next one is the double brackets you see here. Rather than pulling out the column as something with only one dimension, double brackets output a DataFrame, which has two dimensions: one dimension equal to one, and the other equal to the length of the DataFrame. To make that clear before I tell you why it's so important: data['Neighborhood'], with just one bracket (I misspelled it at first), has a .shape with only one value, 1379 — it's a one-dimensional array. With two brackets, we get a DataFrame, and if I look at the shape, it's 1379 by 1: 1,379 rows by 1 column.
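The single-bracket versus double-bracket difference looks like this on a small made-up DataFrame (the notebook's column has 1,379 rows; this one has 3):

```python
import pandas as pd

df = pd.DataFrame({"Neighborhood": ["CollgCr", "Veenker", "NoRidge"]})

# Single brackets return a 1-D Series ...
print(df["Neighborhood"].shape)    # (3,)

# ... double brackets return a 2-D DataFrame with one column — the
# shape that most sklearn transformers and estimators expect
print(df[["Neighborhood"]].shape)  # (3, 1)
```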
The reason this is important is that when you pass data into your different sklearn transformers or predictors — like linear regression in just a bit — you'll often get an error if you don't have the appropriate number of dimensions. You need to ensure that you have two dimensions whenever you pass data into many of these sklearn objects. That's the idea behind the double brackets, and it will be important throughout as you continue to use sklearn. So I'm going to run this. I hadn't imported OneHotEncoder, so we have to import that as well. Now we have our new data, which is a sparse matrix. This is another important concept. The idea, which I mentioned earlier, is that a sparse matrix saves a lot of memory when the encoded matrix blows up. To show you what that looks like, I'm going to run pd.DataFrame on it, and rather than a normal DataFrame it outputs these funky values you see here. What they mean is this: we've blown up what was initially one column into, I believe, around 25 columns, depending on how many unique neighborhoods we have — and we saw that Neighborhood had 25. Rather than creating 25 new dense columns, the sparse matrix ignores the zeros (that's what the zeros here are for) and only records where each one is. So here the one is within column five, here it's column 24, and here it's column five again. So the rows at index zero and index two should have the same value, and if we look back at data['Neighborhood'], you'll see that rows zero and two are both CollgCr — that's why both show five. That sparse matrix is going to allow you to save a lot of memory.
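A minimal sketch of the sparse output, using a toy column with just two unique neighborhoods instead of the notebook's 25. Printing the sparse matrix lists only the (row, column) coordinates of the nonzero entries; every zero is implicit and costs no memory.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Three rows, two unique neighborhoods; rows 0 and 2 share a category
df = pd.DataFrame({"Neighborhood": ["CollgCr", "NoRidge", "CollgCr"]})

ohe = OneHotEncoder()  # sparse output is the default
new_dat = ohe.fit_transform(df[["Neighborhood"]])

# Only the coordinates of the ones are stored and shown
print(new_dat)

# Dense view: rows 0 and 2 have their 1 in the same column
print(new_dat.toarray())
```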
Now, we just wanted to show you how that works; you can pass these sparse matrices directly into your sklearn models, so you won't have to put them back into DataFrames. But they're a little more difficult to look at, so later on we're actually going to change this back to a dense array just to see how it looks. It would look like this: our 25 columns, with a 1 in column five and 0s otherwise. This, again, takes up a lot more memory, but for our current dataset it won't be too much. Think about any type of natural language processing, where you're one-hot encoding every single word, and how big that matrix can get. So we have new_dat, which is just a sparse matrix. We're then going to drop the original column from our copy of the dataset, because we don't want it anymore — we now have the one-hot encoded version of that column. We're then going to pull out the names of each of the values, and this will just be for interpretability, to show you what the encoding looks like. These are all the unique values available within Neighborhood, and they align with columns 0, 1, 2, 3, 4, and so on, column by column, so you'll know the name of each one of those columns. We also want to join each category with the original column name: we join 'Neighborhood' with the category — Blmngtn, Blueste, and so on — for every single category in cats[0]. The reason we index cats[0] is that the categories come back as a container holding one array per encoded column; here is that container, so you can see we're still working with an array wrapper, and we index its first element to get the categories themselves. Then we create our new DataFrame: pd.DataFrame, which transforms an array into a DataFrame, using that .toarray method we saw up above.
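The naming step can be sketched like this, again on a toy column. `ohe.categories_` is a list with one array per encoded input column, which is why we index `cats[0]` to reach the unique Neighborhood values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

col = "Neighborhood"
df = pd.DataFrame({col: ["Blmngtn", "Blueste", "Blmngtn"]})

ohe = OneHotEncoder()
new_dat = ohe.fit_transform(df[[col]])

# categories_ is a list with one array per encoded input column,
# so cats[0] is the array of unique Neighborhood values
cats = ohe.categories_
new_cols = ["_".join([col, cat]) for cat in cats[0]]
print(new_cols)  # ['Neighborhood_Blmngtn', 'Neighborhood_Blueste']

# Dense DataFrame with readable, prefixed column names
new_df = pd.DataFrame(new_dat.toarray(), columns=new_cols)
print(new_df.shape)
```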
We'll translate this into an actual DataFrame, and then we set columns equal to the new column names we defined up above. I'll run this so we can see what it looks like. Our new_df outputs a DataFrame with Neighborhood_Blmngtn, Neighborhood_Blueste, and so on and so forth, so we can correctly identify which of the unique values from the original column we're looking at. Finally, we take our copy of the data and simply concatenate it with the new_df we created up here. That creates all of our new columns. When we do the concat on axis=1, we're just appending new_df to the right of the DataFrame. If you look above, every piece has the same number of rows every single time, so we can keep adding on to the right. Running the for loop, data_ohc just keeps equaling itself plus all of these new columns. So I'm going to run all of that. Then we can ask: what is the difference, in terms of columns, between the original DataFrame and our new DataFrame? When we run .shape we get rows and columns; we only want the number of columns, so we take index 1 and subtract the column count of the original data.shape, and you see we have 215 new columns, which matches up with the 215 we predicted early on. Then finally, because we also want to test with our original DataFrame, we have to drop all those original string columns, otherwise we won't be able to pass the original data into our models. So we do data.drop with num_ohc_cols.index — .index gives us just the names of those object columns — and axis=1, meaning we're dropping columns rather than rows. You can see that takes us from 80 columns down to 37, because we're subtracting the 43 object columns you see here.
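Putting the whole loop together, here is a self-contained sketch on toy data with two categorical columns. The real notebook works on the full housing DataFrame, so its counts (215 new columns, 80 down to 37) differ; the structure of the loop is the same.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: two string (object) columns plus one numeric column
data = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "HouseStyle": ["2Story", "1Story", "1.5Fin"],
    "SalePrice": [208500, 181500, 223500],
})

# Unique-value counts per categorical column, mirroring num_ohc_cols
num_ohc_cols = data.select_dtypes("object").nunique()

data_ohc = data.copy()
ohe = OneHotEncoder()

for col in num_ohc_cols.index:
    new_dat = ohe.fit_transform(data_ohc[[col]])      # sparse matrix
    data_ohc = data_ohc.drop(col, axis=1)             # drop the string column
    cats = ohe.categories_
    new_cols = ["_".join([col, str(cat)]) for cat in cats[0]]
    new_df = pd.DataFrame(new_dat.toarray(), columns=new_cols)
    data_ohc = pd.concat([data_ohc, new_df], axis=1)  # append to the right

# Net gain in columns: encoded widths minus the dropped originals
print(data_ohc.shape[1] - data.shape[1])

# The non-encoded version simply drops all string columns
data_no_str = data.drop(num_ohc_cols.index, axis=1)
print(data_no_str.shape)
```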
We dropped those 43 object columns. So that's question 3, really in depth. Hopefully you now understand why we may want to drop certain dummy columns, what a sparse matrix is, and how to use OneHotEncoder to add new columns to our DataFrame in a similar fashion to pd.get_dummies. In the next lesson, we will get into our train-test split, and I look forward to seeing you there.