When Do Deep Ensembles Improve Robustness to Spurious Correlations?
J. Hidayat (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J.W. Böhmer – Mentor (TU Delft - Sequential Decision Making)
D.M.J. Tax – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Abstract
Models trained with empirical risk minimization can rely on spurious features that are highly predictive during training but fail under distribution shift. We study deep ensembles as a simple baseline that does not require spurious-attribute labels. We construct a controlled dataset by placing MNIST digits on CIFAR-10 backgrounds. During training, each digit class is assigned its own disjoint set of N backgrounds, so background identity alone predicts the label. We evaluate in-distribution (ID) accuracy and accuracy under two out-of-distribution (OOD) shifts: (i) seen-shuffle, which keeps the same backgrounds but permutes their association with labels, and (ii) unseen-background, which draws backgrounds from a held-out pool and therefore combines shortcut breaking with a novel-background shift. Across N ∈ {1, 2, 4, 8, 16, 32, 64}, deep ensembles improve OOD accuracy most in an intermediate regime, with a maximum gain of 17.1 percentage points (pp) at N=8 for an ensemble of M=8 members (95% CI: 8.4 to 26.8 pp). For LeNet trained with mean-squared error (MSE), increasing the ensemble size from M=1 to M=8 improves seen-shuffle OOD accuracy from 71.3% to 88.4% at N=8. In the same setting, the background-follow rate, the fraction of predictions that match the label a background was paired with during training, drops from 26.9% (M=1) to 16.3% (M=8) at N=8. Under the unseen-background shift, we observe larger gains at smaller N (for example, 45.7% to 67.6% at N=4), but this setting changes both the shortcut and the background distribution, so the two effects cannot be separated. We also find that model capacity and loss choice affect where shortcut reliance breaks down, and that a parameter-matched larger single model can outperform a small ensemble.
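To make the construction concrete, below is a minimal sketch of the controlled dataset and the seen-shuffle split, assuming NumPy arrays for the images; function names such as composite, assign_backgrounds, and seen_shuffle are illustrative, not the thesis codebase.

```python
# Minimal sketch (not the thesis code): MNIST digits pasted onto CIFAR-10
# backgrounds, with each digit class tied to a disjoint pool of N backgrounds.
import numpy as np

rng = np.random.default_rng(0)

def composite(digit, background):
    """Paste a (28, 28) digit in [0, 1] onto a (32, 32, 3) uint8 background."""
    img = background.astype(np.float32).copy()
    canvas = np.zeros((32, 32), dtype=np.float32)
    canvas[2:30, 2:30] = digit          # center the digit on the background
    img[canvas > 0.5] = 255.0           # binarize and draw the digit in white
    return img.astype(np.uint8)

def assign_backgrounds(num_classes, n, pool_size):
    """Training split: class c gets its own disjoint set of n background ids;
    the remainder is held out for the unseen-background shift."""
    ids = rng.permutation(pool_size)
    train_pools = {c: ids[c * n:(c + 1) * n] for c in range(num_classes)}
    held_out = ids[num_classes * n:]
    return train_pools, held_out

def seen_shuffle(train_pools):
    """OOD shift (i): same backgrounds, class-to-background map permuted.
    Fixed points are rejected here (an assumption) so the training-time
    shortcut is broken for every class."""
    classes = np.array(sorted(train_pools))
    perm = rng.permutation(classes)
    while np.any(perm == classes):      # reject permutations with fixed points
        perm = rng.permutation(classes)
    return {int(c): train_pools[int(p)] for c, p in zip(classes, perm)}
```

The unseen-background shift would instead sample from held_out, which is why it confounds shortcut breaking with novelty of the backgrounds themselves.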
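And a sketch of the deep-ensemble prediction rule and the background-follow metric, assuming a PyTorch-style interface; the definition of background-follow rate used here (predictions matching the background's training-time label) is our reading of the metric, and ensemble_predict and background_follow_rate are hypothetical names.

```python
# Minimal sketch (assumed PyTorch interface): M independently trained members,
# outputs averaged before the argmax.
import torch

@torch.no_grad()
def ensemble_predict(members, x):
    """Average member outputs over the M models, then take the argmax. For
    MSE-trained nets the outputs are regressions to one-hot targets."""
    outs = torch.stack([m(x) for m in members])   # (M, batch, num_classes)
    return outs.mean(dim=0).argmax(dim=1)

def background_follow_rate(preds, bg_train_labels):
    """Assumed definition: fraction of predictions that match the label each
    example's background was paired with during training."""
    return (preds == bg_train_labels).float().mean().item()
```

Members would differ only in random initialization and data ordering, the standard recipe of retraining the same architecture with different seeds.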