Value-Guided Search for Efficient Chain-of-Thought Reasoning

Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of ``step,'' which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (\texttt{VGS}) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-$n$. Moreover, \texttt{VGS} significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model, and codebase are open-sourced at \codeurl.
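
To make the comparison between the three aggregation rules concrete, the following is a minimal sketch and not the authors' implementation. It assumes each sampled response has already been reduced to a final answer string and a scalar score from a value model. The function names (`majority_vote`, `best_of_n`, `weighted_majority_vote`) and the toy scores are hypothetical, and the abstract does not specify how the weights in the final vote are computed, so summing raw scores per answer is an assumption.

```python
from collections import defaultdict


def majority_vote(answers):
    """Return the answer that appears most often (ties go to the first seen)."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)


def best_of_n(answers, values):
    """Return the answer of the single highest-scoring response."""
    best_answer, _ = max(zip(answers, values), key=lambda pair: pair[1])
    return best_answer


def weighted_majority_vote(answers, values):
    """Return the answer with the largest total score.

    Assumption: each response votes with its raw value-model score.
    """
    totals = defaultdict(float)
    for answer, value in zip(answers, values):
        totals[answer] += value
    return max(totals, key=totals.get)


# Toy data (hypothetical): six responses, three distinct answers.
answers = ["a", "a", "a", "b", "b", "c"]
values = [0.1, 0.1, 0.1, 0.4, 0.4, 0.5]

print(majority_vote(answers))                   # most frequent answer
print(best_of_n(answers, values))               # top-scored single response
print(weighted_majority_vote(answers, values))  # largest summed score
```

On this toy input the three rules pick different answers: plain majority voting ignores the scores, best-of-$n$ ignores agreement between responses, and the weighted vote combines both signals.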