Foundations of Top-$k$ Decoding for Language Models

Noarov, Georgy; Mallick, Soham; Wang, Tao; Joshi, Sunay; Sun, Yan; Xie, Yangxinyu; Yu, Mengxin; Dobriban, Edgar

Foundations of Top-$k$ Decoding for Language Models

Georgy Noarov, Soham Mallick, Tao Wang, Sunay Joshi, Yan Sun, Yangxinyu Xie, Mengxin Yu, Edgar Dobriban

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Abstract

Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the largest $k$ next-token-probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We introduce *Bregman decoders* obtained by minimizing a separable Bregman divergence (for both the *primal* and *dual* cases) with a sparsity-inducing $\ell_0$-regularization; in particular, these decoders are *adaptive* in the sense that the sparsity parameter $k$ is chosen depending on the underlying token distribution. Despite the combinatorial nature of the sparse Bregman objective, we show how to optimize it efficiently for a large class of divergences. We prove that (i) the optimal decoding strategies are greedy, and further that (ii) the objective is discretely convex in $k$, such that the optimal $k$ can be identified in logarithmic time. We note that standard top-$k$ decoding arises as a special case for the KL divergence, and construct new decoding strategies with substantially different behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).

Abstract

Name Change Policy