Solving Imbalanced Data
Oversampling and Undersampling | Data Series | Episode 16.1
In this article we discuss the different methods we can use to over-sample and under-sample from imbalanced data.
What is imbalanced data?
Imbalanced data refers to datasets where the classes of the target variable occur with very different frequencies. What do I mean by this? Take, for example, a dataset that has two classes. This dataset would be considered imbalanced if one class made up the vast majority of the rows and the other only a small fraction.
Imbalanced data is common in problems such as detecting manufacturing defects, diagnosing rare diseases or detecting fraud, where the event of interest is rare.
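To make the idea concrete, here is a minimal sketch of how you might check a target variable for imbalance. The labels are made up for illustration (950 rows of class 0 and 50 rows of class 1); they are an assumption, not data from a real project.

```python
from collections import Counter

# Hypothetical binary target: 950 rows of class 0 and 50 rows of class 1.
y = [0] * 950 + [1] * 50

# Count how often each class appears.
counts = Counter(y)

# Convert the counts into proportions of the whole dataset.
total = len(y)
proportions = {label: n / total for label, n in counts.items()}

# Imbalance ratio: majority class count divided by minority class count.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts)
print(proportions)
print(imbalance_ratio)
```

With these made-up labels, class 1 accounts for only 5% of the rows, so there are 19 majority rows for every minority row.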
Impacts of imbalanced data on machine learning models
- Models can ignore the minority class during training
- Predictions become biased towards the majority class
- It can cause an accuracy paradox: a model that just predicts the majority class every time still gets a high accuracy score
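The accuracy paradox is easy to reproduce. The sketch below uses made-up labels (950 majority, 50 minority) and a hypothetical "model" that always predicts the majority class. It then compares accuracy with recall on the minority class, which is the share of minority rows the model actually finds.

```python
# Hypothetical ground truth: 950 majority rows (0) and 50 minority rows (1).
y_true = [0] * 950 + [1] * 50

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: the share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Minority recall: the share of true minority rows predicted as minority.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_total = sum(t == 1 for t in y_true)
minority_recall = true_positives / minority_total

print(accuracy)
print(minority_recall)
```

Here the accuracy is 95% even though the model never identifies a single minority row, so its minority recall is zero. This is why accuracy alone can be misleading on imbalanced data.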
Resampling
There are two techniques we can use to deal with imbalanced data. These are…