CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong Park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations, stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports systematic exploration of latency–memory–accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy, and further improves compute efficiency and memory-subsystem utilization.
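
To make the psumbook idea concrete, the sketch below contrasts a dequantize-then-GEMV baseline with a gather-and-accumulate over precomputed centroid–activation inner products. It is a minimal NumPy illustration under assumed details (a single shared codebook of K centroids of dimension d, one activation vector, per-group codes); the actual CodeGEMM kernel layout, tiling, and quantization scheme are not specified here.

```python
# Minimal sketch of the psumbook concept (assumptions noted above; not the paper's kernel).
import numpy as np

K, d = 256, 8                 # assumed codebook size and centroid dimension
out_dim, in_dim = 64, 512
groups = in_dim // d          # d-wide groups along the reduction axis

rng = np.random.default_rng(0)
codebook = rng.standard_normal((K, d)).astype(np.float32)   # centroids
codes = rng.integers(0, K, size=(out_dim, groups))          # quantized weights: one index per group
x = rng.standard_normal(in_dim).astype(np.float32)          # activation vector

# Baseline: reconstruct the weight matrix from centroids, then multiply.
w = codebook[codes].reshape(out_dim, in_dim)
y_dequant = w @ x

# Psumbook: precompute inner products between every centroid and every
# activation group once, then each output row just gathers and accumulates
# the partial sums selected by its code indices.
x_groups = x.reshape(groups, d)
psumbook = codebook @ x_groups.T                             # shape (K, groups)
y_psum = psumbook[codes, np.arange(groups)].sum(axis=1)      # gather + accumulate

assert np.allclose(y_dequant, y_psum, atol=1e-4)
```

In this toy form, the gather path touches only the (K x groups) psumbook and the code indices, never the reconstructed weights, which is the mechanism the abstract credits for removing per-element centroid lookups.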