Solving Imbalanced Data

Oversampling and Undersampling | Data Series | Episode 16.1

Mazen Ahmed
4 min read · Dec 24, 2023

In this article we discuss the different methods we can use to over-sample and under-sample from imbalanced data.

What is imbalanced data?

Imbalanced data refers to datasets in which the classes of the target variable occur with very different frequencies. What do I mean by this? Take, for example, a dataset that has two classes. It would be considered imbalanced if the class frequencies looked something like this:

Image by Author
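
To make the idea of class frequencies concrete, here is a minimal sketch that counts how often each class appears in a target column. The 950/50 split is a made-up example, not the data behind the figure, and `class_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def class_frequencies(labels):
    """Return {class: (count, share of dataset)} for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.items()}


# Hypothetical two-class target: 950 majority samples, 50 minority samples.
labels = [0] * 950 + [1] * 50

freqs = class_frequencies(labels)
for cls, (n, share) in sorted(freqs.items()):
    print(f"class {cls}: {n} samples ({share:.1%})")
```

A quick count like this is usually the first check worth running on a new target variable.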

Imbalanced data is common in problems such as manufacturing defect detection, rare disease diagnosis and fraud detection.

Impacts of imbalanced data on machine learning models

  • Models can ignore the minority class during training
  • Predictions become biased towards the majority class
  • Can cause an accuracy paradox: a model that always predicts the majority class still achieves high accuracy
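
The accuracy paradox from the list above can be shown with a few lines of plain Python. This is a sketch on made-up labels (950 majority, 50 minority); `accuracy` and `recall` are small hypothetical helpers written here for illustration, not functions from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical imbalanced target: class 1 is the minority.
y_true = [0] * 950 + [1] * 50

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)

print(f"accuracy: {baseline_accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The accuracy looks strong even though not a single minority sample is found, which is why a metric such as recall on the minority class is worth reporting alongside it.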

Resampling

There are two techniques we can use to deal with imbalanced data. These are…
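
The introduction names over-sampling and under-sampling as the methods under discussion. As a rough sketch of the simplest random variants of each (an assumption on my part, since the specific methods the article goes on to cover are not shown here), the idea can be written with the standard library alone. The function names and the 90/10 toy data are hypothetical.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect rows of X under their label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical toy data: 90 majority rows, 10 minority rows, one feature each.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

print(Counter(y_over))
print(Counter(y_under))
```

In practice, resampling like this should be applied to the training split only, so that the evaluation data keeps the original class distribution.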
