This week, we're talking about the many ways data can go wrong, sometimes through no fault of its own. One common theme in real-world classification problems is that the class you care about is under-represented. We've referred to this before, but now we're going to expand your tool set for dealing with it. In this video, we'll discuss imbalanced data, why it's a problem, and what you can do about it. By the end, you'll be able to recognize when it might cause problems, know how to adjust your evaluation metrics accordingly, and know what to try when learning from imbalanced classes.

A dataset with skewed class proportions, where the vast majority of your examples come from one class, is called an imbalanced dataset. In some classification problems, such as medical diagnosis or predictive maintenance, there's a very high chance that you'll run into this. It's probably a good thing when diseases and machine failures are the exception rather than the rule. Not surprisingly, having imbalanced classes in your training data affects the QuAM that results.

Imagine building a classification model that tries to predict whether or not someone has a rare disease. If only a small percentage of the population, let's say one percent, has this disease, we can easily build a highly accurate classifier: it always guesses "no disease." Then we have a model that is right on 99 percent of the data points. So, 99 percent accurate and zero percent useful.

One way you can handle the imbalanced classes problem is to change your evaluation metric. For example, in the medical diagnosis scenario, plain accuracy is not the best metric to use. You don't want overall accuracy; you want to correctly catch the examples of the disease. We've discussed false positive and false negative errors before. In most applications, like medical diagnosis, these two error types are not equally bad. In our example, if we measure the recall of the useless predictor, that is, the fraction of actual disease cases it catches, we get zero, and it's immediately clear that something is wrong with a model that always predicts "no disease." Precision, recall, the confusion matrix, and the F1 measure are all useful metrics in cases where classes aren't evenly distributed. You might want to review our lecture on classification assessment from Course 2.

You can also come up with a cost matrix or loss function that's a weighted combination of false positive and false negative errors, with different weights assigned to each type. Then you select as best the classifier that minimizes that cost or loss. You can also try to optimize something called Cohen's kappa. This measure adjusts for class imbalance by comparing the accuracy you observe against the accuracy you'd expect from chance agreement given the class proportions.

ROC curves are another useful evaluation tool for classification on imbalanced classes, as we mentioned in the previous course. Much as precision and recall look at performance one class at a time, ROC analysis splits accuracy into sensitivity (the true positive rate) and specificity (the true negative rate), and models can be chosen by trading off these values at different thresholds. To recap, an ROC plot is a two-dimensional plot with the misclassification rate of one class, the false positive rate, on the x-axis, and the accuracy of the other class, the true positive rate, on the y-axis. An ROC plot not only preserves all performance-related information about a classifier, it also allows key relationships between the performance of several classifiers to be identified instantly by visual inspection.
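To make these ideas concrete, here's a minimal sketch. It isn't part of the original lecture; it assumes scikit-learn and NumPy and uses a synthetic dataset with roughly one percent positives. It shows the accuracy paradox with a majority-class baseline, then scores a real model with the confusion matrix, precision, recall, F1, Cohen's kappa, and the points of an ROC curve.

```python
# Sketch only: synthetic "rare disease" data, not the lecture's own example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, cohen_kappa_score,
                             roc_curve, roc_auc_score)

# Roughly 99% negatives, 1% positives.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

# The "useless" baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_base = baseline.predict(X_test)
print("baseline accuracy:", accuracy_score(y_test, y_base))  # about 0.99
print("baseline recall:  ", recall_score(y_test, y_base))    # 0.0 -- catches no cases

# A real model, judged with metrics that respect the imbalance.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))

# Points on the ROC curve: false positive rate (x) vs. true positive rate (y).
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("ROC AUC:", roc_auc_score(y_test, y_score))
```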
Reading these plots is straightforward. If classifier C1 has better accuracy on both classes than classifier C2, C1's ROC plot will be above and to the left of C2's. If C1 is superior to C2 in some circumstances but not others, their ROC plots will cross. Interpreted correctly, ROC plots show the misclassification cost of a classifier over all possible class distributions and all possible assignments of misclassification costs.

Cost curves are an alternative to ROC curves for visualizing the performance of binary classifiers, and for most purposes they're the better tool. While they share many of ROC curves' desirable properties, they can also show confidence intervals on a classifier's performance and visualize whether the difference in performance between two classifiers is statistically significant. If you're interested in learning more about ROC curves and cost curves, check out the supplemental reading for this week.

Let's not overlook the obvious: you should also consider simply gathering more data. This is an important fix that's sadly overlooked all too often. It's possible that the imbalance can be addressed with a larger dataset. That's especially true if the imbalance isn't inherent in the problem, like classifying rare events, but is an artifact of how your data were collected. Even if it is inherent, more examples of the minority classes can be especially important for resampling techniques, which we'll discuss shortly.

Sometimes you can combine domain knowledge with data to improve classifier performance. For example, check out the advanced supplemental reading on fault diagnosis for chemical processes. If you combine process knowledge with historical data, you may get better performance, especially for rare faults.

In addition, when you're working with imbalanced data, you might want to try different learning algorithms, since some are better suited to handling class imbalance than others. For example, decision trees often perform well on imbalanced datasets, while other algorithms implicitly assume an even class distribution. You may recall that decision trees learn a hierarchy of if-else questions, and that hierarchy can carve out regions for both classes.

Another technique is to synthesize data samples. You can create synthetic samples by randomly sampling features from data points in the minority class.

We can also try resampling techniques to handle imbalanced datasets. One possibility is to add more copies of the minority class, which is called oversampling: you randomly replicate samples from the minority class. Depending on the nature of your data, it might make sense to add a small amount of noise to the copies. If you're oversampling, it's important to split your data into training and test sets before doing the oversampling. Otherwise, oversampling can put exactly the same observations into both the training and test sets. That lets your QuAM simply memorize specific data points, which leads to overfitting and poor generalization. Of course, you would never contaminate your test set that way.

Undersampling, another technique for handling imbalanced data, means randomly removing some observations of the majority class. The upside is that the classes are more balanced; the downside is that we're throwing away information that could have been valuable, which can cause underfitting and, consequently, poor generalization. After applying either resampling technique, you'll have about the same number of data points for each class.
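Here's a minimal sketch of these resampling ideas, again assuming scikit-learn and NumPy and a synthetic dataset rather than anything from the lecture. It splits the data first, balances only the training set by random oversampling (with a touch of noise) or undersampling, and then evaluates a decision tree on the untouched test set.

```python
# Sketch only: random over-/undersampling applied after the train/test split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score, precision_score

rng = np.random.default_rng(0)

# Synthetic imbalanced data: about 1% positives.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

def random_oversample(X, y, noise_scale=0.0):
    """Replicate minority-class rows (optionally with a little noise) until the classes balance."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    picks = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    X_new = X[picks] + rng.normal(0.0, noise_scale, size=(len(picks), X.shape[1]))
    return np.vstack([X, X_new]), np.concatenate([y, y[picks]])

def random_undersample(X, y):
    """Keep every minority row and an equal-sized random subset of the majority."""
    minority = np.flatnonzero(y == 1)
    keep = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]

# Resample only the training data, then evaluate on the untouched test set.
X_bal, y_bal = random_oversample(X_train, y_train, noise_scale=0.01)
tree = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
y_pred = tree.predict(X_test)
print("recall:   ", recall_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
```

For more sophisticated versions of these ideas, including SMOTE-style synthetic minority samples, libraries such as imbalanced-learn provide ready-made implementations.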
Again, it's important to apply these changes only to the training set, so your QuAM's performance on unseen data still means something. Last but not least, realize that resampling techniques can introduce bias into the data; after all, you're deliberately changing its distribution. Sometimes that's a good thing, sometimes it isn't. That's why a clean test set and solid domain knowledge are so important when you're deciding which technique to try.

So now you know a few techniques you can turn to when you're facing a classification problem with severely imbalanced classes. You're ready to predict rare events.