Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

Liu, Runheng; Huang, Heyan; Xiao, Xingchen; Wu, Zhijing

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

Runheng Liu, Heyan Huang, Xingchen Xiao, Zhijing Wu

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their ability to generate human-like text has raised concerns about potential misuse. This underscores the need for reliable and effective methods to detect LLM-generated text. In this paper, we propose IRM, a novel zero-shot approach that leverages Implicit Reward Models for LLM-generated text detection. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Previous reward-based method relies on preference construction and task-specific fine-tuning. In comparison, IRM requires neither preference collection nor additional training. We evaluate IRM on the DetectRL benchmark and demonstrate that IRM can achieve superior detection performance, outperforms existing zero-shot and supervised methods in LLM-generated text detection.

Abstract

Name Change Policy