OSKAR: Omnimodal Self-supervised Knowledge Abstraction and Representation

Mohamed Abdelfattah, Kaouther Messaoud, Alexandre Alahi

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

We present OSKAR, the first multimodal foundation model based on bootstrapped latent feature prediction. Unlike generative or contrastive methods, it avoids memorizing unnecessary details (e.g., pixels), and does not require negative pairs, large memory banks, or hand-crafted augmentations. We propose a novel pretraining strategy: given masked tokens from multiple modalities, predict a subset of missing tokens per modality, supervised by momentum-updated uni-modal target encoders. This design efficiently utilizes the model capacity in learning high-level representations while retaining modality-specific information. Further, we propose a scalable design which decouples the compute cost from the number of modalities using a fixed representative token budget—in both input and target tokens—and introduces a parameter-efficient cross-attention predictor that grounds each prediction in the full multimodal context. We instantiate OSKAR on video, skeleton, and text modalities. Extensive experimental results show that OSKAR's unified pretrained encoder outperforms models with specialized architectures of similar size in action recognition (rgb, skeleton, frozen, low-shot) and localization, video-text retrieval, and video question answering. Project website: https://multimodal-oskar.github.io