In this lesson, we'll look at the data preparation that the AutoAI tool performs when building a prototype. Here are the steps involved in building a prototype, and first up is this data preparation stage. We'll now take a closer look at the capabilities of the AutoAI experiment tool in this stage. Just a reminder here that data wrangling is more than 80% of the work, so these data preprocessing steps can be very time consuming depending on the project and on the state of the data when you get it.

First, let's talk about feature selection, because there are some checks that the AutoAI tool performs to discard features that don't contain a lot of information. First, if a column has a constant value for the entire data set, then the AutoAI experiment is not going to select that column for use in its model. Second, if a column has all unique values, so every row in the data set has a different value, maybe an ID number, and as long as it's not a date or a timestamp, that column is not going to be included in the model either; it will be automatically discarded by the AutoAI experiment.

So how is this accomplished? Here I have a code snippet from the notebook at the top of the slide, and a screenshot from the documentation for autoai_libs. As you can see, there is a function, NumpyColumnSelector, which selects particular columns. You'll observe in the notebook, when we get to the lab portion of this lesson, how particular columns are selected for inclusion in the model using this function.

Another consideration for data preprocessing is identifying missing or outlying values, and the AutoAI tool can identify both. How does it accomplish this? You'll see in the notebook that there is a list specified.
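To make the two feature-selection checks concrete, here is a rough pandas sketch of the logic described above, with a hypothetical toy data set. This is my own illustration, not the actual autoai_libs implementation, which lives behind NumpyColumnSelector.

```python
import pandas as pd

def select_informative_columns(df: pd.DataFrame) -> list:
    """Sketch of the two checks described above (not the real
    autoai_libs logic): drop constant columns, and drop all-unique
    columns unless they look like dates or timestamps."""
    keep = []
    for col in df.columns:
        n_unique = df[col].nunique(dropna=False)
        if n_unique <= 1:
            continue  # constant column: carries no information
        is_datetime = pd.api.types.is_datetime64_any_dtype(df[col])
        if n_unique == len(df) and not is_datetime:
            continue  # every row unique (e.g. an ID column), not a date
        keep.append(col)
    return keep

# Hypothetical example data set
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],       # all unique -> discarded
    "region": ["east", "west", "east", "west"],
    "constant_flag": [1, 1, 1, 1],             # constant -> discarded
})
print(select_informative_columns(df))  # ['region']
```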
These are values that indicate missing data, and you can add values to that list if your data set has some other indicator for missing data. The AutoAI tool can then go through and identify the missing values in the data set.

The next thing to think about for data preparation is what to do with missing data. There are some options here: you can remove rows from the data set that have missing data, or you can try to impute a value for the missing data. Both of these strategies are used by the AutoAI experiment, but under different circumstances.

First up, what if data is missing from the labels? That is, some rows in the data set don't have a value in the target variable column. The default behavior for the AutoAI experiment is to remove these rows from the data set, and here is the snippet from the notebook that performs that. If you wanted to change this default behavior, it's possible to do that too.

Now let's talk about data missing in the features: not the labels, but some of the feature columns have missing data. Here the AutoAI tool will try to impute values for the missing data, and it will use different strategies based on the type of variable in question. These strategies are all borrowed from the sklearn library. For categorical variables, the AutoAI experiment tool will impute missing values with the most frequent value for that feature, and for numerical variables, the AutoAI tool will impute with the median value for that feature.

So let's look at how this is accomplished. You can see the documentation here from autoai_libs: there's a function called CatImputer, which uses the imputation strategies from the sklearn library. At the top is a snippet from the notebook, and you can see that the argument specified for the strategy is most frequent.
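The two behaviors just described, dropping rows with missing labels and imputing categorical features with the most frequent value, can be sketched directly with pandas and sklearn. The data set and column names here are hypothetical, and this shows the underlying sklearn call rather than the CatImputer wrapper itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data set; "churn" plays the role of the target column.
df = pd.DataFrame({
    "plan": ["basic", np.nan, "premium", "basic", np.nan],
    "churn": ["no", "yes", np.nan, "no", "yes"],
})

# Default AutoAI behavior when the label itself is missing: drop the row.
df = df.dropna(subset=["churn"])

# For missing categorical features, CatImputer borrows from sklearn;
# SimpleImputer with strategy="most_frequent" is the equivalent call.
imputer = SimpleImputer(strategy="most_frequent")
df["plan"] = imputer.fit_transform(df[["plan"]]).ravel()
print(df["plan"].tolist())  # ['basic', 'basic', 'basic', 'basic']
```

Note that the "premium" row was removed because its label was missing, so the remaining gaps in "plan" are all filled with the mode, "basic".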
So that is the strategy here for categorical variable imputation, and you can change this if you wanted to use one of the other strategies. For example, you could replace missing values with a constant value, which would also be relevant for categorical variables.

Here we see how the data imputation is done for numerical features. Again there's a function, NumImputer, and it uses the strategies from the sklearn library. You can see the code snippet at the top, and the strategy argument there is set to median. That's the default for imputing numerical data, but you can change this in the notebook; you could also use the mean or the most frequent value. In our use cases these are small data sets and they don't have a lot of missing data, so we won't get to see this in play too much. But if you do have a data set with missing data, I think it would be interesting to see how this performs on it.

All right, so after we have identified missing data and decided how to handle it, another part of preprocessing is encoding and scaling our features. If we have categorical features, we should think about how these are going to be encoded, and by default, the AutoAI experiment will encode categorical features as ordinal. We'll see in the labs later, and in the demo here, how that works for our use cases. Numerical features, on the other hand, will not be scaled by default. If you think back to the algorithms that we are trying here, we have a lot of tree-based algorithms, so it makes some sense that we're not scaling the numerical features. But you're also going to get a chance in the notebook to scale some of the numerical features and see how that affects the performance of our prototypes.

All right, so for categorical encoding, again we have this function from autoai_libs called CatEncoder, which is essentially a wrapper for sklearn's encoder.
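The numerical imputation just described can also be sketched with sklearn's SimpleImputer, which is the library the AutoAI wrapper borrows from. The tiny array here is made up for illustration; it shows the median default alongside the mean alternative mentioned above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One hypothetical numerical feature with a missing entry.
X = np.array([[1.0], [2.0], [np.nan], [100.0]])

# "median" mirrors the AutoAI default for numerical features.
median_imp = SimpleImputer(strategy="median")
print(median_imp.fit_transform(X).ravel())  # nan -> 2.0, the median of 1, 2, 100

# "mean" is one of the alternatives you can switch to in the notebook.
mean_imp = SimpleImputer(strategy="mean")
print(mean_imp.fit_transform(X).ravel())    # nan -> the mean, about 34.33
```

Notice how the outlier (100.0) pulls the mean far from the typical values while the median stays robust, which is one reason median is a sensible default.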
And so if you look at this code snippet, you'll see that the encoding argument has been set to ordinal, so that's the default here. You can try out some of the other encoding strategies, and we'll do that in the labs: we'll try out one-hot encoding versus ordinal and see how that affects our prototype. For numerical scaling, again we have a function, the standard scaler, which is based on an sklearn function. Here you'll see that the default is set to false, so no scaling is applied. But we'll do some testing coming up to try scaling and see how that might affect our prototype's performance.

One very cool feature from IBM Research is this idea of preprocessing HPO, which was not yet built into the product when I last checked. This is the idea of hyperparameter optimization on the preprocessing part of the pipeline. The way this would work is that a grid search could be performed over all the preprocessing strategies, using different strategies for data imputation, encoding, and scaling. If that process is automated, then as a data scientist you wouldn't have to go in and change the encoding, try a different way of imputing values, or think about a different scaling strategy. You could just specify the candidate strategies, and the AutoAI tool would find the preprocessing strategies that give the best performance for your pipeline. I think this is a very neat idea, and I hope to see more of it and to see it make its way into our product soon.

All right, that wraps up the AutoAI capabilities on preprocessing. Next I will show you a demo where we change some of the strategies and observe the performance of our pipeline.
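To show what preprocessing HPO could look like in practice, here is my own sklearn sketch of the idea, not IBM's implementation: the preprocessing choices discussed in this lesson (imputation strategy, scaling on or off, ordinal versus one-hot encoding) are treated as hyperparameters and grid-searched alongside a model. The data set is synthetic and the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic data: two numerical features (with a few gaps) and one categorical.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
y = (x1 > 0).astype(int)                      # label depends on x1, so it's learnable
x1[rng.choice(100, 5, replace=False)] = np.nan  # inject some missing values
df = pd.DataFrame({"x1": x1,
                   "x2": rng.normal(size=100),
                   "cat": rng.choice(["a", "b", "c"], size=100)})

# Baseline preprocessing mirroring the defaults discussed in this lesson.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["x1", "x2"]),
    ("cat", OrdinalEncoder(), ["cat"]),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())])

# The grid enumerates the preprocessing strategies: imputation method,
# scaling on/off ("passthrough" disables the step), ordinal vs. one-hot.
search = GridSearchCV(pipe, {
    "pre__num__impute__strategy": ["median", "mean"],
    "pre__num__scale": [StandardScaler(), "passthrough"],
    "pre__cat": [OrdinalEncoder(), OneHotEncoder(handle_unknown="ignore")],
}, cv=3)
search.fit(df, y)
print(search.best_params_)  # the best-performing preprocessing combination
```

The data scientist only specifies the candidate strategies; the search itself decides which combination performs best, which is exactly the appeal of automating this part of the pipeline.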