NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data

Haolong Qian, Xianliang Yang, Ling Zhang, Lei Song, Jiang Bian, Chun Yuan

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Group Relative Policy Optimization (GRPO) fine-tuning has demonstrated significant enhancements in reasoning tasks. However, it typically relies on high-quality labeled datasets, which are often difficult to obtain. To address this challenge, we introduce \textbf{N}oise-\textbf{A}ware \textbf{D}ual-\textbf{R}eward \textbf{O}ptimization (\textbf{NaDRO}), which effectively enhances the training of Large Language Models (LLMs) under noisy or ambiguous supervision. NaDRO operates through two key components: \textbf{(1) Preference-based Outcome Reward (POR)}, which makes a principled bias-variance tradeoff, reducing training variance by learning from robust preference rankings instead of overfitting to single-best estimates; and \textbf{(2) Context Perception Reward (CPR)}, which requires LLMs to conduct a qualitative assessment of the current problem state, fostering deeper situational understanding prior to decision-making. To validate our approach in a realistic decision-making testbed, we model classic combinatorial optimization problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) as Markov Decision Processes, generating training data via cost-limited exploration. Our results demonstrate that the fine-tuned Qwen 7B and Llama 3.1-8B models achieve statistically robust performance, significantly outperforming leading LLM baselines and standard fine-tuning methods on these complex benchmarks. Code is released at \url{https://github.com/microsoft/HeurAgenix/tree/NaDRO}.
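The abstract does not give the exact reward formulas, so the following is only a minimal Python sketch of how a dual-reward scheme of this kind could be wired into GRPO-style group advantages. The ranking scheme, the mixing weights, the keyword-based assessment check, and all function names here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the precise POR/CPR definitions, weights, and the
# "assessment" check below are assumptions made for demonstration purposes.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Rollout:
    text: str            # model completion (state assessment + chosen action)
    outcome_cost: float  # e.g., resulting tour length after applying the action


def preference_based_outcome_reward(rollouts: List[Rollout]) -> np.ndarray:
    """Score each rollout by its rank within the sampled group (lower cost is
    better), rather than rewarding only the single best completion."""
    costs = np.array([r.outcome_cost for r in rollouts])
    ranks = costs.argsort().argsort()                    # 0 = best outcome
    return 1.0 - ranks / max(len(rollouts) - 1, 1)       # rewards in [0, 1]


def context_perception_reward(rollouts: List[Rollout]) -> np.ndarray:
    """Reward completions that state a qualitative assessment of the problem
    state before the decision (the keyword check is a stand-in heuristic)."""
    return np.array(
        [1.0 if "assessment:" in r.text.lower() else 0.0 for r in rollouts]
    )


def dual_reward_advantages(
    rollouts: List[Rollout], w_por: float = 0.8, w_cpr: float = 0.2
) -> np.ndarray:
    """Combine both rewards and normalize within the group, as GRPO does."""
    reward = (
        w_por * preference_based_outcome_reward(rollouts)
        + w_cpr * context_perception_reward(rollouts)
    )
    return (reward - reward.mean()) / (reward.std() + 1e-8)
```

Under this reading, the group-relative normalization is unchanged from standard GRPO; only the scalar reward fed into it is replaced by the weighted combination of the preference-based outcome signal and the context-perception signal.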