ME News reports that on April 23 (UTC+8), according to Dongcha Beating monitoring, the Perplexity research team published a technical article disclosing the post-training process for its web search agent. The process builds on the open-source models Qwen3.5-122B-A10B and Qwen3.5-397B-A17B and runs in two stages: supervised fine-tuning (SFT) first establishes the behaviors required for deployment, such as instruction following and language consistency; on-policy reinforcement learning (RL) then optimizes search accuracy and tool-use efficiency.

The RL stage uses the GRPO algorithm, with training data in two parts. The first is an in-house synthetic, multi-hop, verifiable QA dataset: starting from internal seed queries, questions requiring 2 to 4 hops of reasoning are constructed along entity chains, and multiple independent solvers verify that each answer is unique. The second is rubric-based general dialogue data, which converts deployment requirements such as instruction following and format constraints into objectively checkable atomic conditions; it is used during RL to keep the behaviors established by SFT from degrading.

The core of the reward design is gated aggregation: the preference score enters the calculation only when the base correctness check passes (the QA answer is correct, or all rubric criteria are met), so a high preference signal cannot mask a factual error. Efficiency penalties use intra-group anchoring: correct answers within the same group serve as the baseline, and smooth penalties are imposed for excessive tool calls and generation length.

Evaluation shows that the post-trained Qwen3.5-397B-SFT-RL performs best on multiple search benchmarks. On FRAMES, it achieves 57.3% with a single tool call, 5.7 percentage points higher than GPT-5.4 and 4.7 percentage points higher than Sonnet 4.6. With a medium budget (4 tool calls), it reaches 73.9% at a cost of 2.0 cents per query; under the same conditions, GPT-5.4 scores 67.8% at 8.5 cents and Sonnet 4.6 scores 62.4% at 15.3 cents. Cost figures are calculated from each vendor's public API pricing and do not include cache optimization. (Source: BlockBeats)
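
To make the reward design described above more concrete, the following is a minimal Python sketch of gated aggregation, an intra-group anchored efficiency penalty, and GRPO-style group-normalized advantages. It is an illustration under stated assumptions, not Perplexity's implementation: the `Rollout` fields, the weight `w_pref`, the penalty `scale`, the tanh penalty shape, and the choice to penalize only correct rollouts are all hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from math import tanh
from statistics import mean, pstdev


@dataclass
class Rollout:
    correct: bool      # QA answer correct, or all rubric criteria met
    preference: float  # preference score, assumed here to lie in [0, 1]
    tool_calls: int    # number of tool calls used
    tokens: int        # generation length


def gated_reward(r: Rollout, w_pref: float = 0.5) -> float:
    """Gated aggregation: preference counts only if the correctness gate passes."""
    if not r.correct:
        return 0.0  # a high preference score cannot mask a factual error
    return 1.0 + w_pref * r.preference  # w_pref is a hypothetical weight


def efficiency_penalty(r: Rollout, group: list, scale: float = 0.2) -> float:
    """Bounded penalty for exceeding the average cost of correct rollouts
    in the same group (the intra-group anchor)."""
    anchors = [g for g in group if g.correct]
    if not r.correct or not anchors:
        return 0.0  # assumption: only correct rollouts are penalized
    call_anchor = max(mean(g.tool_calls for g in anchors), 1.0)
    len_anchor = max(mean(g.tokens for g in anchors), 1.0)
    # Relative excess over the anchor; zero when at or below the anchor.
    call_excess = max(0.0, (r.tool_calls - call_anchor) / call_anchor)
    len_excess = max(0.0, (r.tokens - len_anchor) / len_anchor)
    # tanh keeps the penalty below `scale` however large the excess is.
    return scale * 0.5 * (tanh(call_excess) + tanh(len_excess))


def total_reward(r: Rollout, group: list) -> float:
    return gated_reward(r) - efficiency_penalty(r, group)


def group_advantages(rewards: list, eps: float = 1e-6) -> list:
    """GRPO-style advantage: reward normalized by group mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# One group of rollouts sampled for the same query (made-up values).
group = [
    Rollout(correct=True, preference=0.8, tool_calls=2, tokens=400),
    Rollout(correct=False, preference=1.0, tool_calls=2, tokens=400),
    Rollout(correct=True, preference=0.8, tool_calls=6, tokens=1200),
]
rewards = [total_reward(r, group) for r in group]
advantages = group_advantages(rewards)
```

In this toy group, the incorrect rollout receives zero reward despite having the highest preference score, and of the two correct rollouts, the one that uses more tool calls and tokens than the group's correct-answer average ends up with the lower reward.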

Perplexity discloses the post-training method for its search agent; the Qwen3.5-based model beats GPT-5.4 on accuracy at a lower cost.
