## MACHINE LEARNING

## Measure Imbalanced Models the right way!

Rajneesh Tiwari

Feb 9, 2023

In the last blog (link), we put together various strategies for Imbalanced Learning. We also looked at their advantages and disadvantages in detail, along with code for more obscure techniques.

With the basics about Imbalanced Learning in place, we will now delve deeper into a specific aspect of ML training, i.e. choosing the right Metrics for an Imbalanced Data use case.

In this blog, we will focus mostly on Imbalanced Classification use cases, and the metrics will also be in the same context.

When you train Machine Learning models, there are specific objectives that you want to achieve with them. These objectives are business decisions that depend on factors such as risk appetite, the growth stage of the business, and growth strategy.

For example, in Fraud Detection use cases, we often work with established/large banks to flag Fraudulent Transactions. In most cases, large banks have a greater risk/loss appetite, meaning, they care more about customer experience and want to **limit false positives**.

This essentially means they want us to flag Fraud cases **only** when the model is highly confident in its predictions, even if this means the model incorrectly flags a few fraudulent cases as genuine ones. So a tradeoff is established based on business rules.

This business decision impacts how we train our Machine Learning models, and we often set this expectation at the start of our engagements by deciding on a Machine Learning model Metric.

The Confusion Matrix provides a good baseline view of how to construct a variety of Evaluation Metrics.

A confusion matrix is a performance measurement tool, often used for machine learning classification tasks where the output of the model could be 2 or more classes (i.e. binary classification and multiclass classification). The confusion matrix is especially useful when measuring recall, precision, specificity, accuracy, and the AUC of a classification model.

To conceptualize the confusion matrix better, it’s best to grasp the intuition of its use for a binary classification problem.

For a binary classification problem, the confusion matrix is a 2 x 2 matrix containing 4 types of outcomes:

- TP — True Positive
- TN — True Negative
- FP — False Positive
- FN — False Negative

Here TP and TN denote the number of examples classified correctly by the classifier as positive and negative respectively, while FN and FP indicate the number of misclassified positive and negative examples respectively.
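As a quick illustration, the four outcomes can be counted directly in Python (the function name, labels, and predictions below are hypothetical):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 4, 1, 1)
```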

The first important metric for Imbalanced Data cases is Weighted Balanced Accuracy (WBA). This metric adjusts the Accuracy metric using class weights, so that under-represented classes receive higher weightage.

Note that WBA is a thresholded metric, meaning, we have to first apply thresholds to get actual binarized predictions.

Let's look at an example below:

In this example, we have correctly classified 4 out of 8 samples, hence the accuracy is 50%.

However, the balanced accuracy is 58%, which takes the class imbalance into account. Balanced accuracy is the unweighted mean of the per-class recalls.

In general, we can formulate Weighted Balanced Accuracy as a generalized form of Balanced Accuracy:
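In symbols, a sketch consistent with the definitions above (Balanced Accuracy is the special case where every class weight equals 1/C):

```latex
\text{Balanced Accuracy} = \frac{1}{C}\sum_{c=1}^{C} \text{Recall}_c
\qquad
\text{WBA} = \sum_{c=1}^{C} w_c \cdot \text{Recall}_c, \quad \sum_{c=1}^{C} w_c = 1
```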

Typically, we want to select class weights such that each weight is within [0,1], and across classes these sum to 1.

In most imbalanced data use cases, the rare class is the more important class, hence, we generally choose weights to reflect this. One such formulation that is very commonly used is the Normalized Inverse Class Frequency.

The Weighted Balanced Accuracy reaches its optimal value at 1 and its worst value at 0.
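To make this concrete, here is a minimal sketch (function names and data are hypothetical) computing WBA with Normalized Inverse Class Frequency weights:

```python
from collections import Counter

def inverse_frequency_weights(y_true):
    """Normalized Inverse Class Frequency: rarer classes get larger
    weights, and the weights sum to 1."""
    counts = Counter(y_true)
    inverse = {c: 1 / n for c, n in counts.items()}
    total = sum(inverse.values())
    return {c: v / total for c, v in inverse.items()}

def weighted_balanced_accuracy(y_true, y_pred, weights=None):
    """Weighted mean of per-class recalls. With uniform weights this
    reduces to plain Balanced Accuracy."""
    classes = sorted(set(y_true))
    if weights is None:
        weights = {c: 1 / len(classes) for c in classes}
    score = 0.0
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recall_c = sum(1 for i in idx if y_pred[i] == c) / len(idx)
        score += weights[c] * recall_c
    return score

# Hypothetical imbalanced data: 2 positives, 6 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]
w = inverse_frequency_weights(y_true)                 # {1: 0.75, 0: 0.25}
print(weighted_balanced_accuracy(y_true, y_pred))     # balanced accuracy ~ 0.667
print(weighted_balanced_accuracy(y_true, y_pred, w))  # WBA ~ 0.583
```

Note how the inverse-frequency weights push the score toward the rare class's recall: the minority class (weight 0.75) dominates the weighted score.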

The F-beta score is a very robust scoring mechanism for both balanced and unbalanced data use cases. It is the generalized form of the commonly used F1 score, which corresponds to setting the β value to 1. F-beta takes into account both Precision and Recall, combining them through a weighted Harmonic mean.

Fbeta is also a thresholded metric, meaning, we have to first apply thresholds to get actual binarized predictions.

Note that:

- Precision is the percentage of positive predictions that are actually positive.
- Recall is the percentage of actual positive examples that the model correctly identifies.

The F1 Score is calculated as the simple Harmonic Mean of Precision and Recall. The generalized form i.e. the F-beta score weighs the contribution of Precision or Recall to the metric.
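In symbols, with P and R denoting Precision and Recall:

```latex
F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R},
\qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```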

Considering the same numerical example as above:

We can calculate the following based on TP, FP, TN, and FN scores

- Precision = 0.75
- Recall = 0.6
- F1 = 0.66
- F2 = 0.625

A larger value of beta gives more weight to Recall, while a smaller value of beta gives more weight to Precision.

The F-beta score reaches its optimal value at 1 and its worst value at 0.
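The numbers above can be checked with a minimal sketch (the counts TP=3, FP=1, FN=2 are assumed here so as to reproduce the stated Precision and Recall):

```python
def fbeta(precision, recall, beta):
    """F-beta: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Counts assumed so that Precision and Recall match the example values.
tp, fp, fn = 3, 1, 2
precision = tp / (tp + fp)               # 0.75
recall = tp / (tp + fn)                  # 0.6
print(fbeta(precision, recall, beta=1))  # F1 ~ 0.667
print(fbeta(precision, recall, beta=2))  # F2 = 0.625
```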

Owing to the obvious drawback of having to choose a thresholding function for traditional F-scores, this extension, the probabilistic F-score, accepts probabilities instead of binarized predictions.

It was first proposed in this (link) paper as an extension of Precision, Recall, and F1 scores for a more thorough evaluation of Classification Models.

It has a few advantages vis-a-vis the thresholded F-Scores.

- Lower likelihood of being NaN, robust to NaN values
- Sensitive to model’s prediction confidence scores
- Have lower variance making the point estimates generalize better to test datasets
- Provide the same ranking of the performance of candidate models as their threshold-based counterpart’s population value in the majority of cases

Here is the math behind the function:

Let M be a model that outputs class confidence scores for a C-class problem. For each observation, the class with the highest confidence score is the model’s prediction.

By applying the model M to the full dataset S, we obtain a confidence score M(x_i, C_j) for each i ∈ {1, …, n} and j ∈ {1, …, C}. Suppose S_j denotes the set of samples with true class C_j. We can build a probabilistic confusion matrix pCM as follows:
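One way to write this construction, using the notation above:

```latex
pCM_{j_{\text{ref}},\, j_{\text{hyp}}} = \sum_{x_i \in S_{j_{\text{ref}}}} M\!\left(x_i, C_{j_{\text{hyp}}}\right)
```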

Intuitively, each cell (j_ref, j_hyp) of the confusion matrix corresponds to the total confidence score assigned by the model to hypothesis j_hyp for samples whose true class is j_ref. It is very similar to the usual definition of a confusion matrix, except that we leverage all confidence scores as quantitative values rather than just the highest-scoring class as a qualitative value.

From this probabilistic confusion matrix, cRecall and cPrecision are calculated in the same way that Recall and Precision are from the non-probabilistic (regular) confusion matrix.
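For the multiclass case, this can be sketched as follows (data and function names are hypothetical illustrations):

```python
def probabilistic_confusion_matrix(y_true, scores, n_classes):
    """pcm[j_ref][j_hyp] = total confidence assigned to class j_hyp over
    samples whose true class is j_ref."""
    pcm = [[0.0] * n_classes for _ in range(n_classes)]
    for true_class, row in zip(y_true, scores):
        for j, confidence in enumerate(row):
            pcm[true_class][j] += confidence
    return pcm

def c_precision_recall(pcm, j):
    """cPrecision and cRecall for class j from the probabilistic matrix."""
    col_total = sum(row[j] for row in pcm)  # all confidence given to class j
    row_total = sum(pcm[j])                 # confidence mass of true-class-j samples
    return pcm[j][j] / col_total, pcm[j][j] / row_total

# Hypothetical confidence scores for a 2-class problem (each row sums to 1).
y_true = [0, 0, 1]
scores = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
pcm = probabilistic_confusion_matrix(y_true, scores, 2)  # [[1.4, 0.6], [0.3, 0.7]]
```

Since each row of scores sums to 1, the row totals of pCM equal the per-class sample counts, matching the `y_true_count` denominator used in the binary implementation below.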

Here is a Python implementation of the probabilistic F Score:

```python
def pfbeta(labels, predictions, beta):
    """Probabilistic F-beta: accepts confidence scores instead of
    binarized predictions."""
    y_true_count = 0   # number of positive samples
    ctp = 0.0          # confidence-weighted true positives
    cfp = 0.0          # confidence-weighted false positives
    for idx in range(len(labels)):
        prediction = min(max(predictions[idx], 0), 1)  # clip to [0, 1]
        if labels[idx]:
            y_true_count += 1
            ctp += prediction
        else:
            cfp += prediction
    beta_squared = beta * beta
    c_precision = ctp / (ctp + cfp)
    c_recall = ctp / y_true_count
    if c_precision > 0 and c_recall > 0:
        return (1 + beta_squared) * (c_precision * c_recall) / (
            beta_squared * c_precision + c_recall)
    return 0

# credits: https://www.kaggle.com/code/sohier/probabilistic-f-score/notebook
```

Consider a numerical example computed with the Python code above, which yields:

- pPrecision = 0.685
- pRecall = 0.659
- pF1 = 0.672

Note that pFScore is based on probabilities and is NOT a thresholded metric.

Precision-Recall Curve is a very commonly used measure of prediction efficacy when the classes are very imbalanced. Precision measures relevancy and is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that were retrieved.

The precision-recall curve depicts the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

Note that PR Curve is based on probabilities and is NOT a thresholded metric.
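A minimal sketch of how precision-recall points are traced out by sweeping the score threshold (labels, scores, and the function name are hypothetical; for simplicity, one point is emitted per sample rather than per distinct threshold):

```python
def precision_recall_points(y_true, y_score):
    """Precision and recall at each threshold, sweeping scores high to low."""
    pairs = sorted(zip(y_score, y_true), reverse=True)
    total_pos = sum(y_true)
    points, tp, fp = [], 0, 0
    for _, label in pairs:
        tp += label
        fp += 1 - label
        points.append((tp / (tp + fp), tp / total_pos))  # (precision, recall)
    return points

# Hypothetical imbalanced labels and model scores.
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
for precision, recall in precision_recall_points(y_true, y_score):
    print(f"precision={precision:.2f}  recall={recall:.2f}")
```

As the threshold drops, recall can only increase while precision fluctuates; the area under these points is the PR-AUC.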

There is a similar metric, the AUC-ROC curve, which works like AUC-PR but is based on TPR and FPR.

However, the AUC-ROC curve is not preferred under severe imbalance, as it produces overly optimistic results, especially when the number of rare-class samples is very small. Generally, for Imbalanced Scenarios, we prefer PR-AUC as the metric of choice.

At CueNex, we know there is no one right metric that can be applied to all cases. We often discuss with our clients their error preferences. Specifically, we try to understand whether the business is more sensitive toward False Positives or False Negatives.

Generally, if the positive/rare class is more important, then we use PR-AUC in combination with F1/F2/F0.5/pF scores.

- If both False Negatives and False Positives are equally important then we use F1-Score
- If False Positives are more important than False Negatives then we use F0.5-Score
- If False Negatives are more important than False Positives then we use F2-Score
- If the output is measured via Probabilities, then we use pF1/pF2/pF0.5 as per the scenarios above

Hope you liked this blog. We will delve much deeper into case studies in the coming set of blogs, where we take an in-depth look at the data and how we solve fraud at CueNex.