Welcome to our second notebook in our second course. Here we'll be learning about the train-test split, specifically in order to evaluate a linear regression model. Again, we'll be working with the data set on housing prices in Ames, Iowa, which we were introduced to in the first notebook.

The first thing we're going to do is import os, which is just a library for accessing our operating system, and set the variable data_path equal to a list with the single item 'data', which, as we'll see in a second, is going to allow us to access the folder where our CSV file lives.

The first question asks us to import the data using Pandas and examine the shape. There should be 79 feature columns plus the predictor, the predictor being the sale price. The next part of the question notes that there are three different types, integers, floats, and strings, and asks us to examine how many columns of each type there are in our data frame.

As usual, we import pandas as pd and numpy as np. We're going to build our filepath using os.sep.join. To show you how this works: os.sep just tells you, given your operating system, which separator you use to reach a folder, a subfolder within a folder, or a CSV file within a folder. I'm on the Mac operating system, so we use a forward slash. We then want to join together the items of a list using the .join method. Our data_path is one list; we add it to ['Ames_Housing_Sales.csv'], which is another list, to get a list with two items. When we run .join on these values, we end up with the string we want, which is just our folder followed by the CSV file we want to access. Then all we have to do is run pd.read_csv and pass in the file path, and we end up with our Pandas data frame, which we set equal to data.

Next we print out the shape. Remember, we asked in the beginning what the shape of our data frame is. The Pandas data frame has the attribute .shape, and we see that there are 1,379 rows and, as we said earlier, 80 columns: 79 features plus the predictor column.

Now, for the second portion of Question 1, we were asked to find the number of integer, float, and string columns among those 80. When I run data.dtypes on its own, it returns a Pandas series; you can think of that as a single column within your data frame. When you have just a single column, you can run the method .value_counts on your Pandas series, and that will give you the count of every unique value within that column. Here our unique values are float, object, and integer, so value_counts gives us the count of object, float, and integer columns.
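Here's a minimal sketch of the steps we just walked through, assuming the CSV lives in a folder named data alongside the notebook (the folder and file names follow this notebook's setup):

```python
import os
import pandas as pd
import numpy as np  # imported up front; used later in the notebook

# Build the path to the CSV in an OS-agnostic way: join the folder name
# and the filename with whatever separator this operating system uses.
data_path = ['data']
filepath = os.sep.join(data_path + ['Ames_Housing_Sales.csv'])

# Load the data and confirm the shape: 1379 rows and 80 columns
# (79 features plus the SalePrice predictor).
data = pd.read_csv(filepath)
print(data.shape)

# Count how many columns there are of each dtype:
# float64, int64, and object (strings).
print(data.dtypes.value_counts())
```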
Now, moving on to Question 2. A significant challenge, particularly when dealing with data that has many columns, is ensuring that each column gets encoded correctly. This is particularly true for columns that are ordered categoricals versus unordered categoricals, the ordinal versus nominal distinction we talked about in the first lecture. Unordered categoricals should be one-hot encoded. However, one-hot encoding can significantly increase the number of features, and it creates features that are highly correlated with one another. So, with our starting point of 80 columns, we want to determine how many total features would be present, relative to what currently exists, if all object-type features were one-hot encoded.

Recall that the number of new one-hot encoded columns is going to be n minus 1, where n is the number of categories. The reason is that if you take one column in your original data frame and one-hot encode it, and say that creates three new columns because there are three unique values, we also drop the original column, so we only end up with two new columns overall. That's why it's n minus 1.

The first thing we do is use data.dtypes to see which columns are of type object. Once we do that, we have a mask, a Boolean Pandas series the same length as our list of columns, which we can use to filter down to just the object columns. Quickly showing you what the mask looks like: it's a Boolean series where Alley is an object, whereas BedroomAbvGr is not; that one holds numerical values.

Now that we have our categorical columns (we'll glance at those really quickly as well; these are all the columns we're now looking at), we filter our data down to just those columns. We use the .apply method, which applies a function to each individual column, and we run lambda x: x.nunique(), which gives us, for each column, the number of unique values. That's what we're looking for, because when we one-hot encode, we create a new column for every single unique value in the column. That's going to be our num_ohc_cols. We can sort the values; that will matter later on, and it makes things a little cleaner to look at.

Now, we don't need to encode a column if it only has one value. That's not the case for any of these, but just in case, you can use .loc to locate any counts equal to 1 and filter them out.

The next thing we do is take each of these counts and subtract 1, because, as we mentioned, we want n minus 1; we lose the original column after we create the new columns for each unique value. So we subtract 1 from each count using -= 1. Finally, we take the sum of the remaining counts with .sum, and we should see 215 new columns being created. So if we started off with 79 or 80 columns, we should end up with 294 or 295, depending on whether you count the outcome variable. A sketch of this full computation follows below.

In the next section, when we come back, we'll get started on actually doing that one-hot encoding, this time not using the pd.get_dummies we used earlier, but scikit-learn's OneHotEncoder. Look forward to seeing you there.
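To recap, here's a minimal sketch of that counting exercise, assuming the data frame from the previous sketch is still loaded as data (the sort direction is a stylistic choice, not required):

```python
# Boolean mask over the dtypes: True for columns of type object (strings).
mask = data.dtypes == object

# Names of the categorical (object) columns.
categorical_cols = data.columns[mask]

# Number of unique values in each categorical column; .apply runs the
# lambda once per column. Sorting is optional but easier to read.
num_ohc_cols = (data[categorical_cols]
                .apply(lambda x: x.nunique())
                .sort_values(ascending=False))

# A column with only one category needs no encoding, so filter those out.
# (There are none in this data set, but this guards the edge case.)
num_ohc_cols = num_ohc_cols.loc[num_ohc_cols > 1]

# One-hot encoding a column with n categories adds n columns, but we then
# drop the original column, so the net gain is n - 1 per column.
num_ohc_cols -= 1

# Total new columns created: should print 215 for this data set.
print(num_ohc_cols.sum())
```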