Introduction
Machine learning models’ performance heavily depends on the quality of the data you feed in for training. Therefore, your data must be clean and balanced. But there will be situations where we face an imbalanced data problem. In this article, we’ll look at different practices to handle the data imbalance problem.
Balanced Data vs. Imbalanced Data
If you’re reading this article, that means you’re already somewhat familiar with these terms. But still, let me give a basic overview of both.
Balanced Data
In the above image, you can see that for gender, the classes ‘males’ and ‘females’ have almost equal distributions. Data with such a distribution is termed perfectly balanced data, but such a situation is very rare. In most cases, we’ll see a class distribution closer to a 60%-40% ratio. For now, though, let’s stay on the current topic.
Imbalanced Data
In the above image, you can see that for transactions, the ‘normal’ class has a very high share, i.e., 99%, and the ‘fraudulent’ class has a low share, i.e., 1%. Here the ‘normal’ class is the dominating class, and the data is biased towards it. This is called imbalanced data.
Imbalanced data: A BIG Problem
When data is skewed or biased towards one class, that’s a big problem, because most machine learning models are designed around the assumption that the classes are equally distributed. Because of the data imbalance problem, models tend to have notably poorer performance on minority classes than on majority classes.
For example, say you trained a model to predict whether a given transaction is ‘fraud’ or ‘normal’. The training data was imbalanced and biased towards the ‘normal’ class, so the model will be biased towards predicting ‘normal’ too. Since the ‘fraud’ class is more important than the ‘normal’ class, misclassifying it is costly from a business perspective.
Handling Imbalanced Data Problem
We’ll be looking at different strategies to tackle the data imbalance problem. Let’s jump into that.
Choosing the right Performance Metric
Whenever you see an imbalanced class distribution in the dataset, don’t use accuracy as your KPI. Use another metric instead, such as ROC AUC; the right choice mostly depends on the business objective.
Accuracy tells us how many points have been classified correctly out of all samples. If the model is highly biased towards the majority class, this score is always going to be very high. Even a dumb model that always predicts the majority class could achieve more than 95% accuracy on such data.
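To make this concrete, here is a small sketch of the "dumb model" case. The 990/10 split and the helper names `accuracy` and `recall` are made up for illustration; recall on the ‘fraud’ class is just one possible alternative metric, not the only right choice.

```python
# Hypothetical labels: 990 'normal' transactions and 10 'fraud' transactions.
y_true = ["normal"] * 990 + ["fraud"] * 10

# A "dumb" model that ignores its input and always predicts the majority class.
y_pred = ["normal"] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all samples that were classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the actual `positive` samples that the model found."""
    actual = sum(t == positive for t in y_true)
    found = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return found / actual if actual else 0.0


acc = accuracy(y_true, y_pred)
fraud_recall = recall(y_true, y_pred, positive="fraud")

print(f"accuracy:     {acc:.2f}")           # 990 of 1000 correct
print(f"fraud recall: {fraud_recall:.2f}")  # 0 of 10 frauds found
```

The accuracy looks great, while the model does not catch a single fraudulent transaction. That is why a metric focused on the minority class tells us more here.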
Upsampling / Oversampling Minority Class
Upsampling is the process of creating duplicate copies of data points that belong to the minority class. By creating duplicates, we try to balance the imbalanced data. The above image is an excellent example of what oversampling is all about.
The disadvantage is that we’re just creating duplicates and not adding any extra value to the data. The advantage is that we’re not losing any information, unlike undersampling, where information loss is the biggest drawback.
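Here is a minimal sketch of random oversampling using only the standard library. The function name `oversample_minority`, the toy data, and the fixed seed are assumptions for illustration; libraries such as imbalanced-learn offer ready-made versions of this idea.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Draw with replacement; for the majority class this adds nothing.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: 95 'normal' rows and 5 'fraud' rows (row values are just ids).
rows = list(range(100))
labels = ["normal"] * 95 + ["fraud"] * 5

up_rows, up_labels = oversample_minority(rows, labels)
up_counts = Counter(up_labels)
print(up_counts)
```

Both classes end up with 95 rows, but every added ‘fraud’ row is a copy of one of the original five. It is usually safer to oversample only the training split, so that copies of the same row do not end up in both training and test data.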
Undersampling / Downsampling Majority Class
Downsampling is the complete opposite of upsampling. With downsampling, we remove data points belonging to the majority class to make the distribution of data points across both classes equal. The above image illustrates what undersampling is all about.
The biggest drawback is that we’re just throwing away data points to make the data balanced, which is crazy.
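A rough sketch of random undersampling could look like this. Again, the function name `undersample_majority`, the toy data, and the seed are assumptions made for the example.

```python
import random
from collections import Counter


def undersample_majority(rows, labels, seed=0):
    """Keep a random subset of each class so that every class has as many
    rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())

    keep = []
    for cls in counts:
        idx = [i for i, l in enumerate(labels) if l == cls]
        # Sample without replacement; the minority class is kept in full.
        keep.extend(rng.sample(idx, target))
    keep.sort()  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 95 'normal' rows and 5 'fraud' rows (row values are just ids).
rows = list(range(100))
labels = ["normal"] * 95 + ["fraud"] * 5

down_rows, down_labels = undersample_majority(rows, labels)
down_counts = Counter(down_labels)
print(down_counts)
```

The result is balanced at 5 rows per class, which also shows the cost: 90 of the 95 ‘normal’ rows are gone.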
Assigning higher weights to Minority Class
Instead of going with down- or upsampling, we can assign weights to the data points based on the class they belong to. Data points from the minority class get more weight, and data points from the dominating class get less weight. There are machine learning algorithms that take class weights into consideration to handle the data imbalance problem. You can refer to this article to know more about this approach.
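One common way to pick the weights is to make them inversely proportional to class frequency. The sketch below computes `n_samples / (n_classes * class_count)`, which, as far as I know, is the same heuristic scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` and the 990/10 split are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so that
    rare classes get large weights and frequent classes get small ones."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical labels: 990 'normal' and 10 'fraud' transactions.
labels = ["normal"] * 990 + ["fraud"] * 10

weights = balanced_class_weights(labels)
print(weights)

# Per-sample weights, as many training APIs expect them.
sample_weights = [weights[l] for l in labels]

# With these weights, each class contributes the same total weight.
total_normal = sum(w for w, l in zip(sample_weights, labels) if l == "normal")
total_fraud = sum(w for w, l in zip(sample_weights, labels) if l == "fraud")
print(total_normal, total_fraud)
```

Each ‘fraud’ point ends up weighing about 100 times as much as a ‘normal’ point, so a mistake on a fraud case costs the model far more during training, without duplicating or discarding any rows.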
Generate Artificial Samples
We generate artificial data points that are similar to the minority class data points in order to balance the data.
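The article does not name a specific method, so the sketch below swaps in one well-known idea: SMOTE-style interpolation, where a new point is placed somewhere on the line between a minority point and one of its nearest minority neighbours. The function name `synthesize_minority`, the toy points, and the choice of `k=3` are assumptions for illustration, and this is a simplified version, not the full SMOTE algorithm.

```python
import math
import random


def synthesize_minority(points, n_new, k=3, seed=0):
    """Create n_new artificial points by interpolating between a random
    minority point and one of its k nearest minority neighbours.
    Needs at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # The k nearest other minority points, by Euclidean distance.
        neighbours = sorted(
            (p for p in points if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two numeric features per point.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 2.4)]

new_points = synthesize_minority(minority, n_new=10)
for p in new_points[:3]:
    print(p)
```

Unlike plain duplication, the new points are not exact copies, but they always stay between existing minority points. This only makes sense for numeric features; categorical features would need a different treatment.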
Conclusion
I tried to cover the most commonly used approaches to handle the data imbalance issue. There are other advanced techniques that you can refer to, but in most cases, these basic approaches are enough to handle the problem.