When Do Deep Ensembles Improve Robustness to Spurious Correlations?
J. Hidayat (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J.W. Böhmer – Mentor (TU Delft - Sequential Decision Making)
D.M.J. Tax – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Abstract
Models trained with empirical risk minimization can rely on spurious features that are highly predictive during training but fail under distribution shift. We study deep ensembles as a simple baseline that does not require spurious-attribute labels. We construct a controlled dataset by placing MNIST digits on CIFAR-10 backgrounds. During training, each digit class is assigned its own disjoint set of N backgrounds, so background identity alone predicts the label. We evaluate in-distribution (ID) accuracy and accuracy under two out-of-distribution (OOD) shifts: (i) seen-shuffle, which keeps the same backgrounds but permutes their association with labels, and (ii) unseen-background, which draws backgrounds from a held-out pool and therefore combines shortcut breaking with a novel-background shift. Across N ∈ {1, 2, 4, 8, 16, 32, 64}, deep ensembles improve OOD accuracy most in an intermediate regime, with a maximum gain of 17.1 percentage points (pp) at N=8 for an ensemble of M=8 members (95% CI: 8.4 to 26.8 pp). For LeNet trained with mean-squared error (MSE), increasing the ensemble size from M=1 to M=8 improves seen-shuffle OOD accuracy from 71.3% to 88.4% at N=8. In the same setting, the background-follow rate, the fraction of predictions that match the label a background was paired with during training, drops from 26.9% (M=1) to 16.3% (M=8) at N=8. Under the unseen-background shift, we observe larger gains at smaller N (for example, 45.7% to 67.6% at N=4), but this setting changes both the shortcut and the background distribution, so the two effects cannot be separated. We also find that model capacity and loss choice affect where shortcut reliance breaks down, and that a parameter-matched larger single model can outperform a small ensemble.
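To make the construction concrete, below is a minimal sketch of the controlled dataset and the seen-shuffle split, assuming NumPy arrays for the images; function names such as composite, assign_backgrounds, and seen_shuffle are illustrative, not the thesis codebase.

```python
# Minimal sketch (not the thesis code): MNIST digits pasted onto CIFAR-10
# backgrounds, with each digit class tied to a disjoint pool of N backgrounds.
import numpy as np

rng = np.random.default_rng(0)

def composite(digit, background):
    """Paste a (28, 28) digit in [0, 1] onto a (32, 32, 3) uint8 background."""
    img = background.astype(np.float32).copy()
    canvas = np.zeros((32, 32), dtype=np.float32)
    canvas[2:30, 2:30] = digit          # center the digit on the background
    img[canvas > 0.5] = 255.0           # binarize and draw the digit in white
    return img.astype(np.uint8)

def assign_backgrounds(num_classes, n, pool_size):
    """Training split: class c gets its own disjoint set of n background ids;
    the remainder is held out for the unseen-background shift."""
    ids = rng.permutation(pool_size)
    train_pools = {c: ids[c * n:(c + 1) * n] for c in range(num_classes)}
    held_out = ids[num_classes * n:]
    return train_pools, held_out

def seen_shuffle(train_pools):
    """OOD shift (i): same backgrounds, class-to-background map permuted.
    Fixed points are rejected here (an assumption) so the training-time
    shortcut is broken for every class."""
    classes = np.array(sorted(train_pools))
    perm = rng.permutation(classes)
    while np.any(perm == classes):      # reject permutations with fixed points
        perm = rng.permutation(classes)
    return {int(c): train_pools[int(p)] for c, p in zip(classes, perm)}
```

The unseen-background shift would instead sample from held_out, which is why it confounds shortcut breaking with novelty of the backgrounds themselves.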
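And a sketch of the deep-ensemble prediction rule and the background-follow metric, assuming a PyTorch-style interface; the definition of background-follow rate used here (predictions matching the background's training-time label) is our reading of the metric, and ensemble_predict and background_follow_rate are hypothetical names.

```python
# Minimal sketch (assumed PyTorch interface): M independently trained members,
# outputs averaged before the argmax.
import torch

@torch.no_grad()
def ensemble_predict(members, x):
    """Average member outputs over the M models, then take the argmax. For
    MSE-trained nets the outputs are regressions to one-hot targets."""
    outs = torch.stack([m(x) for m in members])   # (M, batch, num_classes)
    return outs.mean(dim=0).argmax(dim=1)

def background_follow_rate(preds, bg_train_labels):
    """Assumed definition: fraction of predictions that match the label each
    example's background was paired with during training."""
    return (preds == bg_train_labels).float().mean().item()
```

Members would differ only in random initialization and data ordering, the standard recipe of retraining the same architecture with different seeds.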