Assessing a model’s performance on imbalanced data

Karen Pedraza
5 min read · Jun 29, 2021

If you are new to the world of machine learning and running your first models, you have probably used a library and module such as scikit-learn’s accuracy_score to check your results:
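For instance, the very first check usually looks something like this (a minimal sketch with a toy dataset, assuming scikit-learn):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Toy data and model, only to illustrate the usual first check.
    X, y = make_classification(n_samples=1000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
    print("Test accuracy: ", accuracy_score(y_test, model.predict(X_test)))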

If you are getting a high score, you might assume you have a working model; however, it is better to double-check a couple of things:

  1. If the score you get on your training data set is much higher than the one you get on your test data set, your model could be over-fitted; if both scores are low, it is likely under-fitted.
  2. If you are getting a high accuracy score on both, after predicting on both train and test, the question here is: did you check whether your dataset was balanced?

Here’s a picture that depicts the previous idea. Consider a dataset with a target feature whose purpose is to identify whether a product is defective or not.

Histogram of defective products classified into two categories

There are several things to consider in this image. It is a histogram displaying how the target feature is distributed: there is a majority class (good) and a minority class (bad). Clearly this is an imbalanced dataset, since one of the classes the model will try to predict is under-represented.
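A quick way to spot this before modelling is to count the values of the target column. This is a hypothetical sketch; the file name and the “status” column are made up for illustration:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical dataset and column name, only to illustrate the check.
    df = pd.read_csv("products.csv")
    counts = df["status"].value_counts()
    print(counts)                    # e.g. good: 9500, bad: 500
    print(counts / counts.sum())     # class proportions
    counts.plot(kind="bar", title="Defective vs. good products")
    plt.show()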

Now let’s look at an example after implementing a model using the previous dataset:
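Here is a sketch of what such an experiment can look like (the synthetic data and the choice of model are assumptions, not the original setup):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    # Heavily imbalanced synthetic dataset: roughly 95% "good" (class 0).
    X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                               class_sep=0.5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    # Overall accuracy looks high, but recall for the minority class is poor.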

Looking at the results, the model knows very well how to recognize the good products, since it was fed plenty of examples of the majority class, whereas for the bad ones it is almost as if it had not been trained at all (this will be covered in more detail below). In fact, the model predicted most of the bad ones as good ones. Why, then, is the overall accuracy of the model still so high?

According to this article:

the reason we get 90% accuracy on an imbalanced data (with 90% of the instances in Class-1) is because our models look at the data and cleverly decide that the best thing to do is to always predict “Class-1” and achieve high accuracy.
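A quick way to see this effect yourself is to compare against a baseline that always predicts the majority class (a sketch using scikit-learn’s DummyClassifier, not taken from the quoted article):

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score

    # Imbalanced toy labels: 90% "good" (1), 10% "bad" (0); features are irrelevant here.
    y = np.array([1] * 900 + [0] * 100)
    X = np.zeros((1000, 1))

    baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
    print(accuracy_score(y, baseline.predict(X)))   # 0.9 without learning anything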

So the lesson learned here is that accuracy by itself is not enough to assess how well the model is predicting, because what we want is a model capable of predicting both classes accurately, not just one. Therefore, some other important metrics are needed; meet the following concepts:

  • Confusion Matrix
  • Recall
  • Precision

Confusion Matrix: This is a table which allows us to get the big picture of the model’s final results in terms of our target class; in other words, it contrasts the predicted values against the actual values.

Source: https://glassboxmedicine.com/2019/02/17/measuring-performance-the-confusion-matrix/
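In code, the table can be obtained directly from the true and predicted labels (a sketch with made-up labels, using scikit-learn):

    from sklearn.metrics import confusion_matrix

    # Made-up labels: 1 = good, 0 = bad.
    y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
    y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]

    # Rows are the actual classes, columns the predicted ones.
    print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
    # [[7 0]
    #  [2 1]]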

Why is this table so important? Because it allows us to calculate different metrics and get a different perspective on what the model is predicting.

Accuracy of the model

Accuracy can be described as the ratio of correctly predicted values to the total number of observations, regardless of class: (TP + TN) / (TP + TN + FP + FN). However, as previously mentioned, this metric by itself does not tell us whether the model is working as expected or not.

On the other hand, there is Recall. This metric focuses on the True Positives. As a quick reminder, the positive class is simply the class you are ultimately interested in: keeping the previous example, if your goal is to predict whether a product is in perfect condition, then “good” is the positive class and everything else is negative. And why “True”? It means the model predicted that class correctly.

Now, coming to Recall, here’s the formula:

Recall = TP / (TP + FN)

This represents the ratio of True Positives that were accurately predicted; in other words, the percentage of the primary target class that was predicted correctly.

Precision = TP / (TP + FP)

Precision also measures the ratio of correctly predicted positives, but against all the observations the model believed were positive.

Specificity = TN / (TN + FP)

Specificity focuses on the other counterpart of the target classes, the Negatives: it is the percentage of that class the model was able to predict correctly, out of the total number of observations belonging to it.
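Reusing the made-up labels from the confusion matrix above, these three ratios can be computed like this (recall and precision have built-in scorers in scikit-learn; specificity is derived from the matrix):

    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
    y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]

    # "good" (1) is treated as the positive class.
    print("Recall:   ", recall_score(y_true, y_pred, pos_label=1))     # TP / (TP + FN) = 1.00
    print("Precision:", precision_score(y_true, y_pred, pos_label=1))  # TP / (TP + FP) ≈ 0.78

    # Specificity has no dedicated scorer, so compute it from the matrix.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    print("Specificity:", tn / (tn + fp))                              # TN / (TN + FP) ≈ 0.33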

Now the question is how these measures can help us get a better reading of what is going on with an imbalanced dataset. This is where the following formula comes in:

Balanced Accuracy = (Recall + Specificity) / 2 (this version of the formula only applies to binary classification)

Balanced Accuracy offers the overall picture for both classes, the Positive and the Negative: it is the average of how well the model predicted each of the classes. This is a much better measure to use when dealing with imbalanced datasets, and it helps to identify the accuracy of the model in terms of each target class.
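In scikit-learn this is available as balanced_accuracy_score; with the same made-up labels as before, it exposes what plain accuracy hides:

    from sklearn.metrics import accuracy_score, balanced_accuracy_score

    y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
    y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]

    print("Accuracy:         ", accuracy_score(y_true, y_pred))           # 0.80, looks fine
    print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # (1.00 + 0.33) / 2 ≈ 0.67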

And that is it! Hopefully, with the previous tips, it will now be easier to measure the performance of your models on imbalanced data.

Thanks for reading!

