According to monitoring by Dongcha Beating, the Perplexity research team has published a technical article detailing the post-training process for its web search agent. The process builds on the open-source models Qwen3.5-122B-A10B and Qwen3.5-397B-A17B and runs in two stages: supervised fine-tuning (SFT) first establishes behaviors required for deployment, such as instruction following and language consistency; on-policy reinforcement learning (RL) then optimizes search accuracy and tool-use efficiency.

The RL stage uses the GRPO algorithm, and its training data has two parts. The first is an in-house multi-hop verifiable question-answer dataset, which builds questions requiring 2 to 4 reasoning hops from internal seed queries and uses multiple independent solvers to verify that each answer is unique. The second is rubric-based general dialogue data, which converts deployment requirements such as instruction following and format constraints into objectively checkable atomic conditions, so that behaviors established during SFT do not degrade during RL.

The core of the reward design is gated aggregation: the preference score enters the reward only when the base check passes (the answer is correct, or all rubric criteria are met), which prevents a high preference signal from masking factual errors. Efficiency penalties use in-group anchoring: the correct answers within the same group serve as the baseline, and smooth penalties are applied to excessive tool-call counts and generation lengths.

In evaluation, the post-trained Qwen3.5-397B-SFT-RL achieves the best results across multiple search benchmarks. On FRAMES it scores 57.3% with a single tool call, 5.7 percentage points above GPT-5.4 and 4.7 percentage points above Sonnet 4.6. Under a medium budget (4 tool calls) it reaches 73.9% at a cost of 2.0 cents per query; under the same conditions, GPT-5.4 reaches 67.8% at 8.5 cents and Sonnet 4.6 reaches 62.4% at 15.3 cents. Cost figures are calculated from each vendor's public API pricing and exclude cache optimization.
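The reward design described above (gated aggregation, in-group anchored efficiency penalties, and GRPO's group-relative advantages) can be illustrated with a minimal Python sketch. It is not Perplexity's implementation: the `Rollout` fields, the weights, the exponential penalty shape, the use of the group mean of correct rollouts as the anchor, and the choice to give incorrect rollouts a reward of zero are all assumptions made for illustration, since the article does not publish the exact formulas.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled answer for a prompt (hypothetical structure)."""
    correct: bool      # base check: answer correct, or all rubric items met
    preference: float  # preference score, assumed to lie in [0, 1]
    tool_calls: int    # number of tool invocations used
    tokens: int        # generation length


def smooth_excess_penalty(value, anchor, scale=1.0):
    """Return 0 at or below the anchor, rising smoothly toward 1 above it.

    The exponential shape is an assumption, chosen only because it is smooth.
    """
    excess = max(0.0, value - anchor)
    return 1.0 - math.exp(-scale * excess / max(anchor, 1.0))


def group_rewards(group, pref_weight=0.3, tool_weight=0.1, len_weight=0.1):
    """Score every rollout sampled for the same prompt."""
    correct = [r for r in group if r.correct]
    # In-group anchors: averages over the correct rollouts of this group only.
    tool_anchor = mean([r.tool_calls for r in correct]) if correct else 0.0
    len_anchor = mean([r.tokens for r in correct]) if correct else 0.0

    rewards = []
    for r in group:
        if not r.correct:
            # Gate closed: the preference score is ignored entirely.
            rewards.append(0.0)
            continue
        reward = 1.0 + pref_weight * r.preference
        reward -= tool_weight * smooth_excess_penalty(r.tool_calls, tool_anchor)
        reward -= len_weight * smooth_excess_penalty(r.tokens, len_anchor)
        rewards.append(reward)
    return rewards


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


group = [
    Rollout(correct=True, preference=0.5, tool_calls=2, tokens=100),
    Rollout(correct=True, preference=0.5, tool_calls=6, tokens=100),
    Rollout(correct=False, preference=1.0, tool_calls=1, tokens=50),
    Rollout(correct=True, preference=0.0, tool_calls=2, tokens=100),
]
rewards = group_rewards(group)
advantages = grpo_advantages(rewards)
```

In this toy group, the third rollout has the highest preference score but fails the base check, so its reward stays at zero, and the second rollout is correct but scores below the first because it uses more tool calls than the group's correct rollouts do on average.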


Perplexity Reveals Post-Training Method for Search Agent, Qwen3.5 Model Surpasses GPT-5.4 in Accuracy and Cost
