ME News, April 23 (UTC+8): According to Beating monitoring, the Perplexity research team has published a technical article describing the post-training process for its web search agent. The process builds on the open-source models Qwen3.5-122B-A10B and Qwen3.5-397B-A17B and has two stages. First, supervised fine-tuning (SFT) establishes the behaviors required for deployment, such as instruction following and language consistency. Then, on-policy reinforcement learning (RL) optimizes search accuracy and tool-use efficiency.

The RL stage uses the GRPO algorithm, and its training data has two parts. The first is an in-house synthetic dataset of multi-hop, verifiable Q&A: starting from internal seed queries, questions requiring 2 to 4 hops of reasoning are constructed along entity chains, and multiple independent solvers verify that each answer is unique. The second is rubric-based general conversational data, which turns deployment requirements such as instruction following and format constraints into objectively checkable atomic conditions. It is used during RL to keep the behaviors established by SFT from regressing.

The core of the reward design is gated aggregation: preference scores enter the reward only when the baseline check passes (the Q&A answer is correct, or all rubric criteria are satisfied), so a high preference signal cannot mask a factual error. Efficiency penalties are anchored within each group: the correct answers in the same group serve as the baseline, and tool calls or generation length in excess of that baseline receive a smooth penalty.

Evaluation shows that the post-trained Qwen3.5-397B-SFT-RL performs best on multiple search benchmarks. On FRAMES, it reaches 57.3% with a single tool call, 5.7 percentage points higher than GPT-5.4 and 4.7 percentage points higher than Sonnet 4.6. At a medium budget (4 tool calls), it reaches 73.9% at a cost of 2.0 cents per query; under the same conditions, GPT-5.4 scores 67.8% at 8.5 cents and Sonnet 4.6 scores 62.4% at 15.3 cents. Cost figures are calculated from each vendor's public API pricing and exclude caching optimizations. (Source: BlockBeats)
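The reward design described above can be illustrated with a minimal Python sketch. It is not Perplexity's implementation: the `Rollout` fields, the `pref_weight`, `max_penalty`, and `scale` values, the tanh penalty shape, and the choice to penalize only correct rollouts are all assumptions made for illustration. Only the gating rule, the use of correct rollouts in the same group as the efficiency baseline, and the group-relative normalization that GRPO is known for come from the description or from the algorithm itself.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled answer in a group (all field names are hypothetical)."""
    correct: bool      # baseline check: answer correct, or all rubric criteria met
    preference: float  # preference score, assumed here to lie in [0, 1]
    tool_calls: int    # number of tool calls the rollout made
    tokens: int        # generation length in tokens


def gated_reward(r: Rollout, pref_weight: float = 0.5) -> float:
    """Gated aggregation: preference counts only if the baseline check passes."""
    if not r.correct:
        return 0.0
    return 1.0 + pref_weight * r.preference


def efficiency_penalty(r: Rollout, group: list[Rollout],
                       max_penalty: float = 0.2, scale: float = 1.0) -> float:
    """Smooth penalty for exceeding the average cost of correct rollouts."""
    correct = [g for g in group if g.correct]
    # Assumption: only correct rollouts are penalized, and only when an
    # in-group anchor exists.
    if not r.correct or not correct:
        return 0.0
    anchor_calls = max(mean(g.tool_calls for g in correct), 1)
    anchor_tokens = max(mean(g.tokens for g in correct), 1)
    excess = (max(0.0, r.tool_calls / anchor_calls - 1.0)
              + max(0.0, r.tokens / anchor_tokens - 1.0))
    # tanh keeps the penalty smooth and bounded by max_penalty.
    return max_penalty * math.tanh(scale * excess)


def score_group(group: list[Rollout]) -> list[float]:
    """Final scalar reward per rollout: gated reward minus efficiency penalty."""
    return [gated_reward(r) - efficiency_penalty(r, group) for r in group]


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize rewards within the group."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(x - mu) / (sd + eps) for x in rewards]


if __name__ == "__main__":
    group = [
        Rollout(correct=True, preference=0.8, tool_calls=2, tokens=400),
        Rollout(correct=True, preference=0.8, tool_calls=6, tokens=1200),
        Rollout(correct=False, preference=1.0, tool_calls=1, tokens=200),
    ]
    rewards = score_group(group)
    print(rewards)
    print(group_advantages(rewards))
```

In this toy group, the incorrect rollout receives no credit despite having the highest preference score, and of the two correct rollouts the one that uses more tool calls and tokens than the in-group average ends up with the lower reward.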

Perplexity discloses the post-training method for its search agent; the Qwen3.5-based model surpasses GPT-5.4 in both accuracy and cost.
