Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials

Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, Xiu Li

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for \emph{every} token pair, spending the bulk of its computation on visually weak or redundant correlations. We introduce \emph{Visual-Contrast Attention} (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from $\mathcal{O}(N^{2}C)$ to $\mathcal{O}(N n C)$ with $n\!\ll\!N$. VCA first distils each head's dense query field into a handful of spatially pooled \emph{visual-contrast tokens}, then splits them into learnable \emph{positive} and \emph{negative} streams whose differential interaction highlights what truly separates one region from another. The module adds fewer than $0.3$\,M parameters to a DeiT-Tiny backbone, incurs no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from $72.2\%$ to $\mathbf{75.6\%}$ ($+3.4$ points) and improves three strong hierarchical ViTs by up to $3.1\%$, while in class-conditional ImageNet generation it lowers FID-50K by $2.1$ to $5.2$ points across both diffusion (DiT) and flow-based (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at \href{https://github.com/LeapLabTHU/LinearDiff}{https://github.com/LeapLabTHU/LinearDiff}.
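For concreteness, below is a minimal PyTorch sketch of how such a visual-contrast layer could be wired up, reconstructed from the abstract alone rather than from the released code: the two-stage structure, the $7\times7$ pooling grid, the class and argument names (`VisualContrastAttention`, `pool_size`), and the exact point at which the positive/negative differential is taken are all assumptions for illustration, not the authors' reference implementation (see the repository linked above for that).

```python
# Illustrative sketch of a Visual-Contrast-Attention-style layer (not the official code).
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualContrastAttention(nn.Module):
    """Hypothetical VCA-style layer with O(N*n*C) cost instead of O(N^2*C)."""

    def __init__(self, dim, num_heads=3, pool_size=7):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.pool_size = pool_size
        n = pool_size * pool_size  # number of visual-contrast tokens per head

        self.qkv = nn.Linear(dim, dim * 3, bias=True)
        self.proj = nn.Linear(dim, dim)

        # Dual positional embeddings: the only thing that distinguishes the
        # positive stream from the negative one in this sketch.
        self.pos_embed_pos = nn.Parameter(torch.empty(1, 1, n, self.head_dim))
        self.pos_embed_neg = nn.Parameter(torch.empty(1, 1, n, self.head_dim))
        nn.init.trunc_normal_(self.pos_embed_pos, std=0.02)
        nn.init.trunc_normal_(self.pos_embed_neg, std=0.02)

    def forward(self, x):
        # x: (B, N, C); N is assumed to be a perfect square (no class token).
        B, N, C = x.shape
        H = W = int(math.isqrt(N))

        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        # Distil the dense query field into n spatially pooled contrast tokens.
        q_map = q.reshape(B * self.num_heads, H, W, self.head_dim).permute(0, 3, 1, 2)
        c = F.adaptive_avg_pool2d(q_map, self.pool_size)        # (B*heads, d, p, p)
        c = c.flatten(2).transpose(1, 2)                        # (B*heads, n, d)
        c = c.reshape(B, self.num_heads, -1, self.head_dim)     # (B, heads, n, d)

        c_pos = c + self.pos_embed_pos                          # positive stream
        c_neg = c + self.pos_embed_neg                          # negative stream

        # Stage 1 (n x N): each stream summarises the full token sequence.
        ctx_pos = torch.softmax(c_pos @ k.transpose(-2, -1) * self.scale, dim=-1) @ v
        ctx_neg = torch.softmax(c_neg @ k.transpose(-2, -1) * self.scale, dim=-1) @ v

        # Stage 2 (N x n): every token reads the summaries back; the
        # positive-minus-negative differential keeps only the contrasts.
        attn_pos = torch.softmax(q @ c_pos.transpose(-2, -1) * self.scale, dim=-1)
        attn_neg = torch.softmax(q @ c_neg.transpose(-2, -1) * self.scale, dim=-1)
        out = attn_pos @ ctx_pos - attn_neg @ ctx_neg           # (B, heads, N, d)

        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 196, 192)                # DeiT-Tiny-like: 14x14 patches, dim 192
    layer = VisualContrastAttention(dim=192, num_heads=3)
    print(layer(x).shape)                       # torch.Size([2, 196, 192])
```

Under these assumed shapes (pool_size of 7, head dimension 64), the two embeddings add roughly $2\times49\times64\approx6$K parameters per layer, i.e. well under $0.1$\,M for a 12-layer DeiT-Tiny, which is at least consistent with the sub-$0.3$\,M overhead quoted in the abstract.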