The purpose of this unit is to illustrate some basic methods in Python for dealing with class imbalances when you're training machine learning models. This is a nuanced topic because not all machine learning models are sensitive to class imbalance. Decision trees, for example, often perform well on imbalanced data. But you should be able to compare different algorithms and assess the performance of each one. The level of imbalance also affects how you interpret the results and how you approach the problem. For example, if you have a fraud detection model where only one percent of the activity the model is analyzing consists of fraud, then that model can score 99 percent accuracy by always predicting "not fraud" and yet be completely useless.

We will introduce the imbalanced-learn package in Python. Imbalanced-learn has a simple interface and integrates easily with scikit-learn. It contains many algorithms for over-sampling and under-sampling, which is important because you will need to compare different algorithms and assess their effects on the performance of your models.

In these materials, we will use churn data from AAVAIL to showcase the handling of imbalanced classes during the AI workflow. In the AAVAIL data, which you can see here, there are two categorical variables and two continuous variables that we're going to use as predictors. The target variable is is_subscriber. The categorical variables are country and subscriber type, and the continuous variables are age and num_streams. The class imbalance we have is that there are many more subscribers than non-subscribers, so in the is_subscriber column there are many more ones than zeros.

First, we remove the target from the data frame, using the pop function to pull out the is_subscriber column. After we do that, we make churn the positive, minority class. We do this by populating the y array with a one for each non-subscriber (churn) and a zero for each subscriber. We then drop the columns that we don't plan on using, which are customer ID and customer name. Finally, we use the train_test_split function to generate training and testing data sets. The stratify parameter makes the split so that the class proportions in the train and test sets match the original sample.

Going forward, we will be making heavy use of pipelines, because we want to ensure that the examples are as close to reality as possible. In this example, we use a ColumnTransformer to combine two pipelines: one for the categorical variables and one for the continuous variables. The reason we split the data set into categorical and continuous variables is that we may have to deal with missing values in the future. The example we have here isn't missing any data, but in production you want to make sure that your pipeline doesn't break due to missing values. So we split the pipeline into categorical and continuous parts, and we use a SimpleImputer in each to fill in any missing values. After that, we hand both of those pipelines, the categorical one and the numeric one, to the ColumnTransformer, and it combines them into one single pipeline. Finally, we use logistic regression because it's a reasonable baseline model to begin with.
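The data preparation and pipeline described above might look like the following minimal sketch. The file name and exact column names (aavail-target.csv, is_subscriber, customer_id, customer_name, country, subscriber_type, age, num_streams) are illustrative assumptions, not taken verbatim from the course materials:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# hypothetical file and column names for the AAVAIL churn data
df = pd.read_csv("aavail-target.csv")

# pop the target out of the frame and make churn (non-subscriber)
# the positive, minority class: y = 1 for churn, 0 for subscriber
y = np.where(df.pop("is_subscriber") == 0, 1, 0)

# drop the columns we don't plan on using
X = df.drop(columns=["customer_id", "customer_name"])

# stratify so the class proportions match the original sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

categorical_features = ["country", "subscriber_type"]
numeric_features = ["age", "num_streams"]

# each branch imputes first, so missing values in production don't break it
categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# the ColumnTransformer combines the two branches into one preprocessor
preprocessor = ColumnTransformer([
    ("cat", categorical_pipe, categorical_features),
    ("num", numeric_pipe, numeric_features),
])

# logistic regression as a reasonable classification baseline
clf = Pipeline([
    ("pre", preprocessor),
    ("logreg", LogisticRegression(max_iter=1000)),
])
```

Keeping the imputation inside each branch of the ColumnTransformer means a missing country or a missing age is handled before encoding or scaling, so the fitted pipeline won't break on incomplete production data.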
In this example here, we go ahead and fit the model using the training data: the X_train data set contains the predictor variables and the y_train data set contains the labels we're going to use to train the predictor. We use the fit method to do that. We generate a prediction and then produce a classification report for the subscriber and churn classes, and as you can see here, the model is doing a pretty good job if you look at the precision, recall, and f1-scores, which summarize the performance of this particular model.

Imbalanced-learn is a package in Python that provides access to re-sampling techniques used to address between-class imbalances. It's compatible with scikit-learn and has a simple interface, shown in the code example on the next page. It provides utilities for working with imbalanced data in neural networks as well. Finally, it provides access to SMOTE and other related algorithms that generate synthetic samples to deal with class imbalances.

In this example here, we first generate a simulated classification problem using the make_classification function. This particular problem has three classes, and as you can see from the weights argument here, it is very heavily imbalanced: one of the classes, the third one, has far more entries in it than the other two. Then after that, we go ahead and generate a simple report showing the counts of the zeros, ones, and twos in the data set. So in the original target, we have 64 zeros, 262 ones, and 4,674 twos. Then we use RandomOverSampler, which randomly re-samples, with replacement, from the minority classes in the original data set to even out the imbalance between the classes. After RandomOverSampler does its job, we print the results and you can see that it worked: it generated additional data for the zero class and the one class to bring them up to the same count as the biggest class, class 2.

One of the things we really want to do is compare methods using the full pipeline. You want to compare different sampling techniques for dealing with class imbalance and missing data. One of the really handy things about imbalanced-learn is that it has its own function to create a pipeline, and that pipeline can use scikit-learn transformers, estimators, and even scikit-learn pipelines. In this example here, we're going to use the imbalanced-learn pipeline function to create three separate pipelines. The first one is just the original pipeline from our AAVAIL example, with no re-sampling. The second pipeline uses that very same data, except it adds the RandomOverSampler step to deal with the class imbalance. The third pipeline does the same thing, except that instead of RandomOverSampler it uses the SMOTE algorithm to even out the numbers between the classes. After we've created these three pipelines, we can fit each one using the training data and then compare the results. In the results, you see three separate entries: one for no sampling, one for random over-sampling, and one for the SMOTE algorithm.
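A minimal sketch of that simulation and re-sampling step is below. The make_classification parameter values are an assumption, chosen so the class counts come out at the 64 / 262 / 4,674 quoted above:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# simulate a heavily imbalanced three-class problem;
# the weights argument pushes ~94% of samples into class 2
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
print("original:", sorted(Counter(y).items()))
# original: [(0, 64), (1, 262), (2, 4674)]

# randomly re-sample the minority classes (with replacement)
# until every class matches the majority class
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print("resampled:", sorted(Counter(y_res).items()))
# resampled: [(0, 4674), (1, 4674), (2, 4674)]
```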
If you notice, none of these different techniques appears to make much of a difference here. If you look at the f1-score, for example, you'll notice that the f1-scores for subscriber and churn are roughly equivalent across the no-sampling pipeline, which did nothing to address the class imbalance, the random over-sampling pipeline, and the SMOTE pipeline. Similarly, all of the numbers that you see for precision, recall, accuracy, and so on are very similar. In real life, what you're going to want to do is compare these numbers and go with the method that gives you the best performance for your models.
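As a rough sketch of how such a comparison could be wired up, reusing the preprocessor and train/test splits assumed in the first sketch (imblearn's Pipeline accepts samplers as steps and applies them only during fit, never at predict time):

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE

# three pipelines: no re-sampling, random over-sampling, and SMOTE;
# the sampler runs after preprocessing, so SMOTE sees numeric features
samplers = {"no sampling": None,
            "random over-sampling": RandomOverSampler(random_state=42),
            "SMOTE": SMOTE(random_state=42)}

for name, sampler in samplers.items():
    steps = [("pre", clone(preprocessor))]
    if sampler is not None:
        steps.append(("sampler", sampler))
    steps.append(("logreg", LogisticRegression(max_iter=1000)))
    pipe = ImbPipeline(steps)

    # fit on the training data, then report metrics on the held-out test set
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(f"--- {name} ---")
    print(classification_report(y_test, y_pred,
                                target_names=["subscriber", "churn"]))
```

Cloning the preprocessor gives each pipeline its own independently fitted copy, so the three reports can be compared side by side.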