Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

Mikey Shechter, Yair Carmon

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from downstream tasks training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features, and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1\% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2\% absolute accuracy increase over all previous results and a 5.5\% increase over results that---like us---use only public resources. Our approach also yields 37.7\% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4\%.