FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Large-scale vision foundation models such as DINOv2 boast impressive performances by leveraging massive architectures and training datasets. The expense of large-scale pre-training puts such research out of reach for many, hence limiting scientific advancements. We thus propose a novel pretraining strategy for DINOv2 that simultaneously accelerates convergence–and strengthens robustness to common corruptions as a by-product. Our approach involves a frequency filtering curriculum–low-frequency being seen first–and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, while pre-training time is reduced by 1.6×–from 16.64 to 10.32 NVIDIA L40S days–and FLOPs by 2.25×, our method still achieves matching robustness in corruption benchmarks (ImageNet-C) and maintains competitive linear probing performance compared with the DINOv2 baseline. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration around data curriculum and augmentation as a means to improve self-supervised learning models robustness.