In this video, we'll demonstrate how to do one-hot encoding of categorical features, as well as indexing of categorical features. Then we'll compare the results from the two methods by building and evaluating a decision tree with each set of features. Our first step is to prepare the data by aggregating it to the user level. Remember that one of our project objectives for this lesson is to predict a customer's BMI, or body mass index, based on their other recorded metrics. The data comes in every hour for each user, so we need a user-level aggregation that averages those hourly readings. To prepare the data set, we'll take our ht_daily_metrics table to the user level by grouping on device ID. Then we can view what that table looks like now. We see that we have our table grouped by device ID. For each device ID, which represents one user, we have the average resting heart rate, active heart rate, BMI, VO2, workout minutes, steps, and the lifestyle. You'll notice that, aside from device ID, which we can ignore because we're not going to use it, all of these features are numerical except for lifestyle, which is categorical. Before we deal with that, we're going to convert this Spark DataFrame to a pandas DataFrame so that we can easily work with it using scikit-learn. All right, now let's look at the values in the lifestyle column. We see that we have sedentary, weight trainer, athlete, and cardio enthusiast. We need to convert these to numerical values in order for machine learning algorithms to be able to deal with them. The first method we'll try is the LabelEncoder from scikit-learn. This is how we import it; then we instantiate a LabelEncoder object and apply it to the lifestyle column in our DataFrame by calling fit_transform on the column.
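The label encoding step just described might look like the following sketch. The sample values are hypothetical stand-ins for the aggregated health-tracker table, which isn't available here:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample standing in for the aggregated lifestyle column
df = pd.DataFrame({
    "lifestyle": ["Sedentary", "Weight Trainer", "Athlete",
                  "Cardio Enthusiast", "Sedentary"]
})

encoder = LabelEncoder()
# fit_transform learns the category-to-integer mapping and applies it;
# categories are assigned codes in sorted (alphabetical) order
df["lifestyle_cat"] = encoder.fit_transform(df["lifestyle"])

print(sorted(df["lifestyle_cat"].tolist()))  # integer codes 0 through 3
print(encoder.classes_)  # the original category labels, in code order
```

Note that LabelEncoder assigns the codes alphabetically, so the integer values have no meaningful order with respect to the lifestyles themselves.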
By doing that, we make a new column called lifestyle_cat in the same DataFrame. If we run this cell and look at the first five lines of the DataFrame, we see that we now have this additional column, lifestyle_cat, which has encoded the categories as numbers. If I look at the unique values, I see that we have the integers 0 to 3. Now we're going to proceed with our typical machine learning process, where we split our data set into our X and our y, the features and the target. Remember that we're trying to predict BMI, so that's what we'll use for the y. Let's just look at the shape of each. We see that we have 3,000 rows and six columns for the features. The target data set is, as expected, 3,000 rows and one column. We'll split the data into a training set and a test set. Now we're just going to fit a very simple decision tree. We import DecisionTreeRegressor from sklearn.tree, since we're specifically doing regression. We instantiate the decision tree object, which we'll call dt, and fit it on the X and y training data. We get a printout that shows all of the parameters specified for the DecisionTreeRegressor. We didn't set many of these ourselves, so these are mostly just the default values. Then we'll evaluate the decision tree by getting the R-squared value. We see that just on this first pass it did pretty well: it got 100 percent on the training set and 92 percent on the test set. Obviously, it's slightly overfit on the training set, because we could probably get those two values a little closer to each other, but we'll ignore that for now since accuracy isn't our focus here. The second method we'll try for the categorical lifestyle column is a built-in pandas function called get_dummies, which one-hot encodes categorical values. Scikit-learn also has a OneHotEncoder class that does the same thing.
But using the built-in pandas function is a little more straightforward in our case, and it just tends to be a little easier to use. We'll make a new version of the table so that we don't impact the original. Next, we import pandas as pd and make a new DataFrame by calling pd.get_dummies on our DataFrame, passing prefix equal to "lifestyle" and specifying which column to get dummies for. You'll see what that prefix does when we run this and display the new DataFrame. We now have four new columns on the right, filled with ones and zeros. Each column name has the prefix lifestyle, then an underscore, then whatever the value for that category was. Where previously the column had, say, athlete or cardio enthusiast as its value, get_dummies pulls that value into the name of a column and then fills in zeros and ones depending on which of the four categories each row falls into. The first row is the sedentary lifestyle, so we get a one in the lifestyle sedentary column and a zero for the other three categories. Now we split our DataFrame into X and y, our features and our target. We drop the target from the X feature data set, and we also drop the device ID, because we're not using it for this model and the model would have a problem with that string value. We'll run that. Here we see that instead of six columns as before, we have nine, because we added the four dummy-variable columns and dropped the device ID column. Then we do our train-test split, train our decision tree on the training data set, and evaluate how it did. It looks like it did slightly better this time than the previous round with the label encoding method: we got 100 percent on the training set again, but 93.1 percent on the test set, versus 92 percent last time.
It did slightly better, but this could just be a fluke; we can't really say that the one-hot encoding made a big difference. But now you know two different methods for handling categorical features: label encoding, or indexing, and one-hot encoding.
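As a quick recap, the get_dummies step on its own might look like this sketch, again with hypothetical sample values standing in for the real lifestyle column:

```python
import pandas as pd

# Hypothetical sample of the categorical lifestyle column
df = pd.DataFrame({
    "lifestyle": ["Sedentary", "Athlete", "Cardio Enthusiast", "Weight Trainer"]
})

# One-hot encode: each category becomes its own indicator column,
# named with the given prefix plus an underscore plus the category value
dummies = pd.get_dummies(df, prefix="lifestyle", columns=["lifestyle"])
print(list(dummies.columns))  # e.g. lifestyle_Athlete, lifestyle_Sedentary, ...
```

Each row has exactly one indicator set to one and the rest set to zero, which is why this representation avoids the artificial ordering that integer label encoding introduces.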