
WYNTK: Classifier Calibration

What You Need To Know: Some recent discussion in the community has centered around calibrating your classifiers; here's what you need to know.
Created on April 16 | Last edited on April 21

TL;DR

Calibration ensures that a model's predicted probabilities match the observed frequencies of the outcomes they predict. It isn't necessary for rank-based recommendations, but it's valuable whenever the output probabilities are used as confidence estimates in addition to the ranking.

Why Cover Classifier Calibration?

Two posts in the same week, one from Sebastian Raschka and one from Wenzhe Shi, brought questions about calibration into the discussion.

Let's break these two discussions down and talk a little about these concepts.

Classifier Calibration and Isotonic Regression

When building classifiers, binary or multiclass, the learner outputs a probability for each class. If your goal is simply to pick the right class in most cases, e.g. by minimizing cross-entropy, then the relative magnitude of the probabilities per class is all that matters. The same goes for ranking models; all that really matters is that the highest probability corresponds to the best option.
However, sometimes you want more out of your model than just the class recommendation. In applications where you care about the probability itself, it's worth asking whether the estimates reflect the actual confidence of the model. For example, if you care about risk estimates from misclassification, then you want the probabilities to behave as genuine confidence estimates. A model with this property is called calibrated: among the predictions it assigns probability p, roughly a fraction p turn out to be correct.
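One quick way to check this property empirically is a reliability curve: bin the predictions by their predicted probability and compare each bin's mean prediction to its observed positive rate. Here's a minimal sketch using scikit-learn's calibration_curve; the synthetic dataset and random forest are illustrative assumptions, not choices from the original posts.

```python
# Minimal reliability-curve sketch, assuming a binary task and a random
# forest base model (both illustrative choices).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# For a well-calibrated model, observed frequency ≈ predicted probability in every bin.
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)
for pred, true in zip(prob_pred, prob_true):
    print(f"predicted {pred:.2f} -> observed {true:.2f}")
```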
There's been research showing that different model types have their own tendencies for output probability distributions. From the scikit-learn docs:
[Figure: output probability distributions for various classifiers, from the scikit-learn calibration documentation]
So how does one take these various probability distributions and further ensure that they reflect the model's confidence? Well, you simply train a meta-learner. :-)
An isotonic regression learns a monotonic function (preserving the relative ordering the original learner is optimizing for) that transforms the output probabilities into proper posteriors. A few important technical details, with a code sketch after the list:
  • the fit is performed on a set of cross-validation folds, so the calibration map is learned on data the base classifier wasn't trained on
  • the meta-learner minimizes the squared error between the labels and the mapped output probabilities: $\sum_{i=1}^{n} (y_i - m(f_i))^2$, where $f_i$ is the base model's predicted probability and $m$ is the monotonic map
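In practice you rarely fit the isotonic map by hand; scikit-learn's CalibratedClassifierCV wraps both the cross-validation and the meta-learner. A minimal sketch follows; the dataset and base model are assumptions for illustration.

```python
# Sketch: calibrating a classifier with isotonic regression via
# scikit-learn's CalibratedClassifierCV (illustrative data and model).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learner: its raw predict_proba outputs may be poorly calibrated.
base = RandomForestClassifier(n_estimators=100, random_state=0)

# Meta-learner: fits a monotonic (isotonic) map from the base model's scores
# to calibrated probabilities, using 5-fold cross-validation so the map is
# learned on held-out predictions.
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

base.fit(X_train, y_train)
print("Brier score (raw):       ", brier_score_loss(y_test, base.predict_proba(X_test)[:, 1]))
print("Brier score (calibrated):", brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1]))
```

A lower Brier score after calibration is the usual signal that the mapped probabilities track the observed frequencies more closely.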

Effects on ROC

In Sebastian Raschka's original tweet, he asked whether using these calibrations improved ROC-AUC. A few respondents pointed out that these monotonic adjustments should not actually affect AUC: ROC-AUC is invariant to the probability magnitudes and depends only on their ordering.
Some further discussion added that the observed performance increase was actually due to simply adding a CV step to the model training, and that these calibration models were essentially acting like bagging.
Finally, it was pointed out that, for these reasons, people who are interested in risk estimates derived from the model's confidences often use the Kolmogorov-Smirnov statistic instead of ROC-AUC.
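As a concrete sanity check on the invariance claim, here's a tiny synthetic example (the data is made up purely for illustration): applying a strictly increasing transform to the scores changes their magnitudes but not their order, so ROC-AUC is unchanged.

```python
# Synthetic demonstration: a monotonic transform of the scores leaves
# ROC-AUC untouched, since AUC depends only on the ranking of the scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
# Noisy scores that are higher, on average, for the positive class.
scores = np.clip(0.5 * y + rng.normal(0.3, 0.2, size=1000), 0.0, 1.0)

transformed = scores ** 3  # strictly increasing on [0, 1], so order is preserved

print(roc_auc_score(y, scores))       # raw scores
print(roc_auc_score(y, transformed))  # identical AUC value
```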

Classifier Calibration For Recommender Systems

From the above, you can probably guess that for recommendations, the relative affinity scores are enough to pick the top-k for selection. However, Wenzhe Shi's post goes beyond this. They point out that for applications like ad recommendation, estimates of confidence are crucial to the application of these models. In those cases, should we fall back on the previously discussed isotonic regression?
The author makes the point that this will work (and we have plenty of data to do so), but it interacts poorly with another consideration for modern RecSys: continuous learning. Because many modern recommenders are active learning systems trained on streaming events, you'll need a meta-learner that can update in real time.
Some commenters replied with suggestions to do conditional calibration (with respect to a covariate); others encouraged frequent retraining for the calibration model.
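For a sense of what a streaming-friendly calibrator could look like, here's a hedged sketch (an illustrative assumption, not the approach from the post): a Platt-style logistic calibrator fit on the base model's raw scores and updated incrementally with partial_fit as new labeled events arrive.

```python
# Sketch of an online, Platt-style calibrator that can be updated from
# streaming events. The helper names and the mini-batch data are hypothetical.
import numpy as np
from sklearn.linear_model import SGDClassifier

# Logistic model on the raw score, trained incrementally.
calibrator = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.05)

def update_calibrator(raw_scores, labels):
    """Update the calibration map from a mini-batch of streamed events."""
    X = np.asarray(raw_scores).reshape(-1, 1)
    calibrator.partial_fit(X, labels, classes=[0, 1])

def calibrated_probs(raw_scores):
    """Map the base model's raw scores to calibrated probabilities."""
    X = np.asarray(raw_scores).reshape(-1, 1)
    return calibrator.predict_proba(X)[:, 1]

# Example mini-batch: raw model scores and observed labels (synthetic).
update_calibrator([0.9, 0.2, 0.7, 0.1], [1, 0, 0, 0])
print(calibrated_probs([0.8, 0.3]))
```

Frequent retraining of a batch isotonic model, as some commenters suggested, is the simpler alternative when your event stream isn't too fast-moving.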

Should you calibrate?

As always, it depends. If you're keen on using the output probabilities for other tasks or applications, you'll definitely want to consider calibration.
What do you think? What are some applications of probability estimates? Are there clever loss functions you can use to integrate calibration in the original training?
Let me know in the comments what you think, and whether you like these WYNTK posts.
Thanks to Sebastian Raschka and Wenzhe Shi for the thought-provoking posts.