Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations

Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Benchmarks for out-of-distribution (OOD) generalization often reveal a strong positive correlation between in-distribution (ID) and OOD accuracy across models, a phenomenon known as “accuracy-on-the-line.” This pattern is commonly interpreted as evidence that spurious correlations—relationships that improve ID but harm OOD performance—are rare in practice. We show that this positive correlation can be an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line breaks down. Across widely used distribution-shift benchmarks, OODSelect uncovers subsets—sometimes comprising more than half of the standard OOD set—where higher ID accuracy predicts lower OOD accuracy. These results suggest that aggregate metrics can mask critical failure modes in OOD robustness. We release code and the identified subsets to support further research.