Evaluating the Performance of the Model Selection with Average ECE and Naive Calibration in Out-of-Domain Generalization Problems for Binary Classifiers

Bachelor Thesis (2022)
Author(s)

A. Liu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Jesse Krijthe – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

R.K.A. Karlsson – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

S.R. Bongers – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

T. Höllt – Graduation committee member (TU Delft - Computer Graphics and Visualisation)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2022 Anxian Liu
Publication Year
2022
Language
English
Graduation Date
24-06-2022
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Out-of-domain (OOD) generalization refers to learning a model from one or more different but related source domains such that it can be used in an unseen test domain, which remains challenging for existing machine learning models. Several methods have been proposed to address this problem, and multi-domain calibration is one of them. Model selection with the average expected calibration error (ECE) across training domains and naive calibration are two approaches to implementing multi-domain calibration. However, neither approach is guaranteed to learn a genuinely well-calibrated model in the multi-domain setting. Hence, this paper evaluates how naive calibration and model selection with average ECE perform on the OOD generalization problem for binary classifiers. We generated a large number of synthetic datasets and set up three experiments to answer this question. The empirical results lead to the following conclusions: 1) Although naive calibration can improve the average accuracy across unseen domains (OOD accuracy) and the average area under the ROC curve across unseen domains (OOD AUROC) for some binary classifiers, it does not do so for all of them; at the very least, it does not make a model worse for OOD generalization. 2) On the synthetic datasets we generated, the OOD accuracy of most binary classifiers increases as the number of training domains increases. 3) Average ECE is a reasonable metric for model selection in the OOD generalization problem and is better than validation accuracy: the linear relationship between OOD accuracy and the average ECE across the training domains is stronger than that between OOD accuracy and validation accuracy.
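As an informal illustration of the selection criterion discussed in the abstract, the sketch below shows one common way to compute a binned ECE for a binary classifier and to pick, among several candidate models, the one with the lowest ECE averaged over the training domains. The binning scheme (10 equal-width bins over the predicted positive-class probability) and the helper names `expected_calibration_error` and `select_by_average_ece` are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned ECE for a binary classifier: the weighted average of
    |observed positive frequency - mean predicted probability| over
    equal-width probability bins. (10 bins is an assumption, not
    necessarily the setting used in the thesis.)"""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        confidence = y_prob[mask].mean()  # mean predicted P(y=1) in the bin
        frequency = y_true[mask].mean()   # observed fraction of positives in the bin
        ece += mask.mean() * abs(frequency - confidence)
    return ece

def select_by_average_ece(models, train_domains):
    """Select the model whose ECE, averaged over the training domains,
    is lowest. `models` maps a name to a fitted classifier exposing
    predict_proba(X) -> (n, 2) array; `train_domains` is a list of
    (X, y) pairs, one per training domain."""
    def avg_ece(model):
        return np.mean([expected_calibration_error(y, model.predict_proba(X)[:, 1])
                        for X, y in train_domains])
    return min(models, key=lambda name: avg_ece(models[name]))
```

For example, `select_by_average_ece({'lr': lr, 'rf': rf}, [(X1, y1), (X2, y2)])` would return the name of whichever fitted model is, on average, better calibrated over the two training domains.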
