Perplexity discloses the post-training method for its search agent; its Qwen3.5-based model beats GPT-5.4 on both accuracy and cost

ME News reports that on April 23 (UTC+8), according to Dongcha Beating monitoring, the Perplexity research team published a technical article describing the post-training process for its web search agent. The process builds on the open-source models Qwen3.5-122B-A10B and Qwen3.5-397B-A17B and has two stages: supervised fine-tuning (SFT) first establishes behaviors required for deployment, such as instruction following and language consistency, and on-policy reinforcement learning (RL) then optimizes search accuracy and tool-use efficiency.

The RL stage uses the GRPO algorithm, with training data in two parts. The first is an in-house synthetic, multi-hop, verifiable QA dataset: starting from internal seed queries, questions requiring 2 to 4 hops of reasoning are built along entity chains, and multiple independent solvers verify that each answer is unique. The second is rubric-based general dialogue data, which turns deployment requirements such as instruction following and format constraints into objectively checkable atomic conditions; it is used during RL to keep the behaviors established by SFT from degrading.

The core of the reward design is gated aggregation: the preference score enters the reward only when the base correctness condition holds (the QA answer is correct, or all rubric criteria are met), so a high preference signal cannot mask a factual error. Efficiency penalties use intra-group anchoring: correct answers within the same group serve as the baseline, and excess tool calls and generation length are penalized smoothly.

In evaluation, the post-trained Qwen3.5-397B-SFT-RL performs best on multiple search benchmarks. On FRAMES with a single tool call it scores 57.3%, 5.7 percentage points above GPT-5.4 and 4.7 percentage points above Sonnet 4.6. With a medium budget (4 tool calls), it reaches 73.9% at a cost of 2.0 cents per query; under the same conditions, GPT-5.4 scores 67.8% at 8.5 cents and Sonnet 4.6 scores 62.4% at 15.3 cents. Costs are calculated from each vendor's public API pricing and do not include cache optimization. (Source: BlockBeats)
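
The gated reward and the intra-group efficiency penalty described above can be illustrated with a minimal Python sketch. This is not Perplexity's implementation: the `Rollout` fields, the weights `pref_weight` and `max_penalty`, the squared-tanh penalty shape, and the choice to penalize only correct rollouts are all assumptions made for illustration. The final function applies the group-normalized advantage commonly used with GRPO.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    correct: bool      # QA answer correct, or every rubric criterion met
    preference: float  # preference score, assumed to lie in [0, 1]
    tool_calls: int    # number of tool calls in the rollout
    tokens: int        # generation length


def gated_reward(r: Rollout, pref_weight: float = 0.5) -> float:
    """Preference counts only when the correctness gate is open."""
    if not r.correct:
        return 0.0  # a high preference score cannot rescue a wrong answer
    return 1.0 + pref_weight * r.preference


def efficiency_penalty(r: Rollout, group: list[Rollout],
                       max_penalty: float = 0.3) -> float:
    """Penalty relative to the mean cost of correct rollouts in the group.

    Assumption: only correct rollouts are penalized, and the penalty
    is a bounded squared tanh of the relative excess over the anchor.
    """
    anchors = [g for g in group if g.correct]
    if not r.correct or not anchors:
        return 0.0
    call_anchor = max(mean([g.tool_calls for g in anchors]), 1.0)
    token_anchor = max(mean([g.tokens for g in anchors]), 1.0)
    call_excess = max(0.0, r.tool_calls / call_anchor - 1.0)
    token_excess = max(0.0, r.tokens / token_anchor - 1.0)
    shape = math.tanh(call_excess) ** 2 + math.tanh(token_excess) ** 2
    return max_penalty * 0.5 * shape


def group_advantages(group: list[Rollout], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: rewards normalized within the group."""
    rewards = [gated_reward(r) - efficiency_penalty(r, group) for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# Hypothetical group of three rollouts for the same query.
group = [
    Rollout(correct=True, preference=0.8, tool_calls=2, tokens=400),
    Rollout(correct=True, preference=0.9, tool_calls=8, tokens=1600),
    Rollout(correct=False, preference=1.0, tool_calls=1, tokens=200),
]
advantages = group_advantages(group)
```

In this hypothetical group, the incorrect rollout receives zero reward despite having the highest preference score, and the correct rollout that uses four times the tool calls and tokens ranks below the leaner correct one even though its preference score is slightly higher.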