YOLOv12: Attention-Centric Real-Time Object Detectors

Yunjie Tian, Qixiang Ye, David Doermann

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Enhancing the network architecture of the YOLO framework has long been crucial, yet efforts have focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capability. This is because attention-based models have been unable to match the speed of CNN-based ones. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based frameworks while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses popular real-time object detectors in accuracy at competitive speed. For example, YOLOv12-N achieves 40.5% mAP with an inference latency of 1.62 ms on a T4 GPU, outperforming the advanced YOLOv10-N / YOLO11-N by 2.0% / 1.1% mAP at comparable speed. This advantage extends to the other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve upon DETR, such as RT-DETRv2 / RT-DETRv3: YOLOv12-X beats RT-DETRv2-R101 / RT-DETRv3-R101 while running faster with fewer computations and parameters. See more comparisons in Figure 1. Source code is available at https://github.com/sunsmarterjie/yolov12.