Teaching AI Agents to Think Before They Act: A New Approach to Reinforcement Learning

  • Writer: Evandro Barros
  • Dec 12, 2025
  • 10 min read

The explosion of large language model capabilities has sparked intense interest in building AI agents—systems that can autonomously use tools, search for information, and execute multi-step plans to solve complex problems. Yet a fundamental challenge has held back progress: how do you train these agents when you can't verify whether their answers are correct? Research from Tencent's WeChat AI team introduces a solution that sidesteps this problem entirely by focusing on process rather than outcomes.





The Training Paradox in AI Agent Development


AI agents typically operate through a two-stage workflow. First comes the planning stage, in which the agent decides which tools to invoke and in what sequence to gather the necessary information. Second comes the summary stage, in which the agent synthesizes the collected information into a final response. The planning stage fundamentally determines agent performance: without complete and accurate information gathering, even perfect summarization cannot produce correct answers.
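In code, this workflow looks roughly like the sketch below. The Action type and the plan_next_action, summarize, and tool callables are illustrative placeholders, not interfaces from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str                        # tool to call, or "finish" to stop planning
    arguments: dict = field(default_factory=dict)

def run_agent(query, planner, summarizer, tools, max_steps=8):
    """Two-stage loop: plan and act until enough information is gathered, then summarize."""
    history = []                     # accumulated (action, observation) pairs
    for _ in range(max_steps):
        action = planner.plan_next_action(query, history)      # planning stage
        if action.name == "finish":
            break
        observation = tools[action.name](**action.arguments)   # execute the chosen tool
        history.append((action, observation))
    return summarizer.summarize(query, history)                # summary stage
```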


Current training approaches use end-to-end reinforcement learning that simultaneously optimizes both planning and summarization by providing a single reward signal based on final answer correctness. This approach seems intuitive: if the answer is right, reinforce the entire behavior that produced it. If wrong, discourage it. However, this paradigm faces critical challenges in real-world industrial deployment.

The first challenge stems from data scarcity. Effective reinforcement learning requires reliable reward signals. For questions with verifiable answers—mathematical problems, factual queries with ground truth—computing accurate rewards is straightforward. But in real-world industrial scenarios, such verifiable data represents less than one percent of total queries. The remaining ninety-nine percent consists of open-ended questions, requests for analysis, or tasks where correctness cannot be objectively determined.

For this vast majority of unverifiable data, current systems use reward models—other AI systems trained to judge answer quality. These models, however, prove susceptible to reward hacking, where agents learn to exploit the reward model's biases rather than genuinely improving performance. The result: most training data lacks effective reward signals, severely limiting reinforcement learning effectiveness.


The second challenge involves competing objectives and credit assignment difficulty. When training proceeds end-to-end, gradients for the planning module and summary module often point in opposite directions, creating optimization conflicts. More problematically, the reward structure creates a tight coupling: the quality of the final summary determines the reward for the entire trajectory, including all planning actions.

This mechanism makes credit assignment extraordinarily difficult. Consider an agent that executes five tool calls perfectly, gathering all necessary information, but then makes an error in the final summary. The negative reward applies to the entire trajectory, potentially causing the model to unlearn correct planning behaviors simply because the summarization failed. This phenomenon, where correct actions get penalized for errors elsewhere in the trajectory, fundamentally impedes planning capability optimization.


Decoupling Planning from Summarization


The research team's key insight is that these challenges stem from multi-objective optimization: trying to improve both planning and summarization simultaneously. By focusing exclusively on optimizing the agent's core planning component, the problem reduces to single-objective optimization, immediately mitigating the issues of competing objectives and credit assignment.

This focused approach enables a novel training framework called Reinforcement Learning with Tool-use Rewards (RLTR). Rather than judging final answer correctness—which requires either verifiable ground truth or unreliable reward models—RLTR evaluates whether the planning process itself is complete. Did the agent invoke all necessary tools? Did it gather sufficient information to answer the question?


This shift proves transformative. Assessing tool-use completeness requires no knowledge of the correct final answer. An evaluator (which can be another language model) simply examines the action sequence and determines whether the agent performed all necessary information-gathering steps. A question asking for the temperature difference between two cities requires searching weather data for both cities and calculating the difference. An incomplete plan might search only one city's weather, or search both but fail to compute the difference.
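To make the idea concrete, here is a purely illustrative, rule-based version of that check for the temperature example. The actual framework delegates this judgment to a verification language model, but in both cases only the action sequence is inspected, never the final answer.

```python
# Illustrative only: a hard-coded completeness check for the temperature-difference query.
def temperature_diff_plan_is_complete(actions):
    """actions: list of dicts like {"tool": "weather_search", "args": {"city": "Beijing"}}."""
    searched_cities = {
        a["args"].get("city")
        for a in actions
        if a["tool"] == "weather_search"
    }
    did_subtract = any(a["tool"] == "calculator" for a in actions)
    return {"Beijing", "Shanghai"} <= searched_cities and did_subtract

incomplete = [{"tool": "weather_search", "args": {"city": "Beijing"}}]
complete = incomplete + [
    {"tool": "weather_search", "args": {"city": "Shanghai"}},
    {"tool": "calculator", "args": {"expression": "29 - 32"}},
]
assert not temperature_diff_plan_is_complete(incomplete)   # missing Shanghai search
assert temperature_diff_plan_is_complete(complete)         # all required steps present
```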


The completeness evaluation provides a more direct and reliable training signal than final answer assessment. It focuses narrowly on the action sequence quality, decoupling tool invocation from summarization accuracy. This enables simpler, more effective agent assessment even when final answers cannot be verified—addressing the data scarcity challenge directly.


The RLTR Training Framework


The framework operates in three phases. First, cold-start initialization uses knowledge distillation from a state-of-the-art language model. The team samples multiple agent action trajectories from a teacher model, applies rejection sampling to select the highest quality examples, and uses these to perform supervised fine-tuning of the planner. This establishes basic tool-use capabilities and familiarizes the model with planning task formats.
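A rough sketch of this cold-start phase might look as follows, where teacher.sample_trajectory, score_trajectory, and the downstream supervised_finetune call are assumed placeholder helpers rather than the paper's exact pipeline.

```python
def build_cold_start_dataset(teacher, queries, score_trajectory,
                             samples_per_query=8, keep_top_k=1):
    """Distill planning trajectories from a teacher model via rejection sampling."""
    dataset = []
    for query in queries:
        # Sample several candidate action trajectories from the teacher model.
        candidates = [teacher.sample_trajectory(query) for _ in range(samples_per_query)]
        # Rejection sampling: keep only the highest-quality trajectories.
        candidates.sort(key=score_trajectory, reverse=True)
        dataset.extend((query, traj) for traj in candidates[:keep_top_k])
    return dataset

# The planner is then initialized by standard supervised fine-tuning on this dataset:
#   supervised_finetune(planner, build_cold_start_dataset(teacher, queries, score_trajectory))
```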


Second, the tool-use completeness calculation phase implements the core innovation. A completeness-checking function evaluates whether the action sequence at any given state is complete or incomplete. This function is implemented using a verification language model with specific instructions to assess whether all necessary tool calls have been executed. The invocation completeness score is computed by averaging results over multiple samples to ensure reliability.


The evaluation prompt instructs the model to act as an agent expert committed to fully meeting query needs through precise tool combinations. When tools return unsatisfactory results, the agent must adjust parameters and retry. The output is binary: 0 indicating missing invocations, 1 indicating completeness. This simple, focused evaluation proves far more reliable than attempting to judge final answer quality.
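Putting the two previous paragraphs together, a minimal sketch of the completeness reward could look like this. The judge_llm callable and the prompt wording are assumptions paraphrased from the description above, not the paper's exact template.

```python
JUDGE_PROMPT = """You are an agent expert committed to fully meeting the user's query
through precise tool combinations. Given the query and the executed tool calls, output
1 if every necessary tool invocation was performed, otherwise output 0.

Query: {query}
Tool calls: {actions}
Answer (0 or 1):"""

def completeness_reward(judge_llm, query, actions, num_samples=5):
    """Average several binary judgments to reduce the variance of a single LLM verdict."""
    prompt = JUDGE_PROMPT.format(query=query, actions=actions)
    votes = []
    for _ in range(num_samples):
        verdict = judge_llm(prompt).strip()
        votes.append(1.0 if verdict.startswith("1") else 0.0)
    return sum(votes) / len(votes)
```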

Third, multi-turn reinforcement learning optimizes the planner using the completeness rewards. The overall reward function first verifies trajectory format correctness, immediately assigning negative reward for malformed outputs. For correctly formatted trajectories, it computes the tool-use completeness reward and incorporates rule-based penalties for redundant tool calls and incorrect tool usage. These components combine into a total reward signal.
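Combining these pieces, a hedged sketch of the overall reward might read as follows, reusing the completeness_reward sketch above. The penalty magnitudes and the trajectory helper methods are illustrative assumptions.

```python
def total_reward(trajectory, query, judge_llm,
                 format_penalty=-1.0, redundancy_penalty=0.1, misuse_penalty=0.2):
    """Format check first, then completeness reward minus rule-based penalties."""
    if not trajectory.is_well_formatted():
        return format_penalty                      # malformed output: penalize immediately
    reward = completeness_reward(judge_llm, query, trajectory.actions)
    reward -= redundancy_penalty * trajectory.count_redundant_calls()
    reward -= misuse_penalty * trajectory.count_incorrect_tool_usage()
    return reward
```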


During training, the system masks tool-use results to exclude them from loss computation. This prevents gradient signal dilution and ensures the agent focuses specifically on optimizing tool invocation behavior rather than learning to memorize tool outputs. The training follows a standard generate-evaluate-optimize cycle: the current policy generates tool-use trajectories, rewards evaluate these trajectories, and policy updates increase the likelihood of high-reward behaviors.
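One common way to implement this masking, shown here as an assumption rather than the paper's exact recipe, is to set the label of every tool-output token to an ignore index (-100 in Hugging Face-style training loops) so those positions contribute nothing to the loss. The segmentation of the trajectory into (role, token_ids) pairs is also an assumption.

```python
def build_labels(segments, ignore_index=-100):
    """segments: list of (role, token_ids) pairs; roles are 'planner' or 'tool_result'."""
    labels = []
    for role, token_ids in segments:
        if role == "tool_result":
            labels.extend([ignore_index] * len(token_ids))   # excluded from the loss
        else:
            labels.extend(token_ids)                         # planner tokens are trained on
    return labels
```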


Experimental Results and Performance Gains


Testing on both proprietary industrial datasets and open-source Chinese QA datasets reveals substantial improvements. The research team evaluated models ranging from 1.7B to 8B parameters, comparing against leading systems including Qwen3-235B and DeepSeek-R1, as well as conventional end-to-end training approaches.

For planning performance, the planner trained with RLTR demonstrates an 8-12% improvement in tool-use completeness compared to end-to-end trained agents. This advantage proves particularly pronounced on difficult questions, underscoring the importance of targeted planning-capability optimization for complex problems. The improvement manifests in both the supervised fine-tuning and reinforcement learning phases.


Training dynamics analysis reveals why RLTR succeeds. During supervised fine-tuning, the planner achieves faster convergence and lower loss than end-to-end agents, reflecting more effective gradient optimization of the planning process. During reinforcement learning, RLTR more effectively activates language models to utilize tools, whereas end-to-end reinforcement learning fails to drive tool usage effectively. This advantage arises from RLTR's precise reward attribution to planner actions, while end-to-end approaches struggle with credit assignment when constrained by final answer rewards.


Perhaps most remarkably, the enhanced planning performance translates directly to improved final response quality even without dedicated summarization training. The planner-optimized system outperforms end-to-end agents in final response performance by 5-6% on average. Better tool use and comprehensive information gathering enable more accurate summarization even from untrained summarizers.


This result validates a key hypothesis: the planning stage fundamentally determines overall agent capability. When planning optimization enables complete information gathering, summarizers can produce accurate responses without requiring specialized training on summary generation. The quality of inputs matters more than the sophistication of the synthesis process.


Reward Function Accuracy and Reliability


To validate the superiority of tool-use completeness rewards over traditional answer-based rewards, the team manually annotated 925 planner trajectories and their corresponding final answers for correctness. Both reward functions evaluated these samples, with results compared against human annotations.


The tool-use completeness reward achieved 74.59% accuracy and 84.64% F1 score, substantially outperforming the answer-based reward at 65.30% accuracy and 76.17% F1. This confirms that completeness evaluation provides more reliable sample assessment and leads to more stable, effective training.


The accuracy advantage stems from the fundamental difference in what each reward measures. Answer correctness depends on both complete information gathering and accurate synthesis. A planning process might execute perfectly, gathering all necessary information, yet the summarizer could still produce an incorrect response due to reasoning errors or hallucinations. Conversely, an incomplete planning process might accidentally produce a plausible-sounding answer that happens to be correct.


Tool-use completeness, by contrast, evaluates only whether necessary information-gathering steps occurred. This focused assessment proves more reliable because it depends on objective, verifiable criteria—were the required tools called with appropriate parameters?—rather than subjective judgment of natural language quality or correctness.


Case Studies: Process Versus Outcome Evaluation


Concrete examples illustrate the advantage of process-focused rewards. Consider a question asking for the temperature difference between Beijing and Shanghai. An incomplete planner searches Beijing's weather, retrieving a forecast of 16-29°C and sunny conditions. However, it fails to search Shanghai's weather; instead it fabricates typical-looking data (24-32°C, cloudy) and calculates the difference from this fabrication.


When evaluated by answer correctness, this trajectory might receive positive reward—the response structure appears correct and the numbers seem plausible. The reward model, unable to verify actual Shanghai weather, judges the answer as reasonable. This incorrect positive reward reinforces the bad behavior of fabricating data rather than executing complete searches.


Tool-use completeness evaluation, however, immediately identifies the missing step: no search for Shanghai weather occurred. The negative reward correctly signals that the planning process is incomplete, requiring the agent to learn to search both cities' weather before calculating differences. This precise feedback drives correct behavior regardless of whether the fabricated answer happened to be plausible.


Another example involves identifying the author of a poetry collection. An unoptimized planner executes a single search, retrieving information about a novella collection by Wang Zongkun rather than the correct poetry collection by Ho Leng Seng. The planner, not recognizing the mismatch, generates a summary claiming Wang Zongkun as the author—incorrect.


An optimized planner, trained with completeness rewards, recognizes when initial search results seem inconsistent with the query. It executes additional searches with refined parameters, ultimately finding correct information about Ho Leng Seng and his poetry collection. The summarizer then generates an accurate response based on this complete information. The improved planning directly enables improved final answers without any summarizer training.


Broader Implications for Agent Development


The RLTR framework's success highlights several important principles for AI agent development. First, modular optimization focusing on core capabilities can outperform holistic end-to-end training when the latter faces fundamental challenges like competing objectives or unreliable reward signals. Decomposing complex systems into separately optimizable components reduces training complexity and improves final performance.


Second, process evaluation can substitute for outcome evaluation when outcomes cannot be reliably assessed. In many real-world domains—medical diagnosis, legal analysis, strategic planning—determining correct answers requires expertise and context that automated systems lack. Evaluating whether proper processes were followed provides an alternative training signal that remains reliable even when outcome verification proves impossible.


Third, the research demonstrates that enhanced capabilities in one component can improve overall system performance even when other components remain unoptimized. Better planning enables better summarization not through explicit summarization training but simply by providing higher quality inputs. This suggests that in multi-stage systems, optimizing bottleneck stages can yield disproportionate overall improvements.

For financial institutions developing AI agents, these insights prove particularly relevant. Trading strategy validation, risk assessment, regulatory analysis, and market research all involve agent-like workflows in which systems must gather information from multiple sources, synthesize findings, and generate actionable recommendations. Yet the correctness of these recommendations often cannot be immediately verified: market outcomes reveal themselves only over time, risk assessments prove valid or invalid only during crises, and regulatory interpretations are confirmed or contradicted only through enforcement actions.


Implementation Considerations and Limitations


Institutions considering RLTR adoption should understand both capabilities and constraints. The approach works best for scenarios where completeness criteria can be clearly defined—where there exists an objective specification of what information must be gathered to adequately address a query. Domains with well-defined information requirements, structured tool ecosystems, and clear task decompositions prove most amenable.


The method requires access to model internals for training, making it suitable for organizations that fine-tune their own models but less applicable to those relying exclusively on API-based large language models. The training process, while modest compared to training foundation models from scratch, still requires GPU infrastructure and machine learning engineering expertise.


The research focused primarily on Chinese language scenarios and did not extensively test across other major languages. While the core principles should generalize, language-specific characteristics around tool use, question formulation, and completeness evaluation may require adaptation. Organizations should plan for language-specific fine-tuning and validation when deploying across multiple linguistic contexts.


The framework optimizes planning but does not address summarization optimization. The research used untrained language models as summarizers, demonstrating that improved planning alone yields better final outputs. However, further gains might be achievable through dedicated summarization training. Organizations seeking maximum performance should consider whether summarization represents a remaining bottleneck worth addressing after planning optimization.


Future Directions and Open Questions


Several promising research directions emerge from this work. Multi-tool or multi-domain training could address the current limitation that separate models must be trained for different tools or domains. A unified model with tool-specific or domain-specific conditioning could leverage shared structure while maintaining flexibility for specialized behaviors, reducing training costs and potentially improving generalization.

Hierarchical completeness evaluation might enable more sophisticated assessment of complex multi-step plans. Rather than binary complete/incomplete judgments, a system could evaluate partial completeness, identify which specific steps are missing, or assess the quality of information gathered at each step. This richer feedback signal could enable faster learning and better handling of ambiguous scenarios.


Combining RLTR-style process rewards with outcome rewards when outcomes can be verified might achieve better performance than either approach alone. In scenarios where some fraction of data has verifiable answers, a hybrid training regime could use completeness rewards for all data while additionally leveraging answer correctness rewards where available, potentially accelerating learning while maintaining reliability.
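As a sketch of how such a hybrid might be wired up, reusing the completeness_reward sketch above; the check_answer helper and the mixing weight are illustrative assumptions, not a scheme proposed by the paper.

```python
def hybrid_reward(trajectory, query, judge_llm, check_answer,
                  ground_truth=None, outcome_weight=0.5):
    """Process reward for every sample; outcome reward mixed in only when verifiable."""
    process_r = completeness_reward(judge_llm, query, trajectory.actions)
    if ground_truth is None:
        return process_r                           # unverifiable data: process reward only
    outcome_r = 1.0 if check_answer(trajectory.final_answer, ground_truth) else 0.0
    return (1 - outcome_weight) * process_r + outcome_weight * outcome_r
```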


Extending the framework to multi-agent scenarios where multiple specialized agents collaborate poses interesting challenges. How should completeness be evaluated when no single agent sees the entire workflow? How can credit assignment function when rewards depend on information passed between agents? These questions become increasingly relevant as organizations deploy more sophisticated multi-agent systems.


Process Over Outcomes in Agent Training


The RLTR framework represents a paradigm shift in how we approach AI agent training. By decoupling planning from summarization and focusing optimization on planning completeness rather than answer correctness, it circumvents fundamental challenges that have hindered agent development in real-world scenarios—particularly the scarcity of verifiable training data and the difficulty of credit assignment in multi-stage systems.


The 8-12% improvement in planning performance and 5-6% improvement in final response quality demonstrate that this process-focused approach delivers practical advantages. More fundamentally, it offers a novel perspective: in many cases, teaching systems good processes matters more than training them to produce good outcomes. When processes are sound, good outcomes follow naturally.


For the AI and financial technology communities, this research suggests that when facing training challenges stemming from unverifiable outcomes or complex credit assignment, the solution may lie not in more sophisticated outcome evaluation but in refocusing on process evaluation. In domains from trading strategy development to risk analysis to research automation, teaching agents how to think systematically and gather information comprehensively may prove more tractable—and ultimately more valuable—than teaching them to generate perfect final answers.


 
 
 
