ME News, April 23 (UTC+8): according to monitoring by Dongcha Beating, the Perplexity research team has published a technical article disclosing the post-training process for its web search agent.
The process builds on the open-source models Qwen3.5-122B-A10B and Qwen3.5-397B-A17B and takes a two-stage approach: first, supervised fine-tuning (SFT) to establish instruction following, language consistency, and other behaviors required for deployment; then, on-policy reinforcement learning (RL) to optimize search accuracy and tool-use efficiency.
The RL phase uses the GRPO algorithm, and its training data has two parts. The first is an in-house synthetic, verifiable multi-hop QA dataset: starting from internal seed queries, questions requiring 2 to 4 hops of reasoning are constructed along entity chains, and the uniqueness of each answer is verified with multiple independent solvers. The second is rubric-based general dialogue data, which converts deployment requirements such as instruction following and format constraints into objectively checkable atomic conditions; it is used during RL to keep the behaviors established by SFT from degrading.
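For readers unfamiliar with GRPO, its central step is to score each sampled answer relative to the other answers drawn for the same prompt, instead of relying on a learned value model. The following is a minimal sketch of that group-relative normalization only. The function name, the `eps` constant, and the example rewards are illustrative assumptions and are not taken from the Perplexity article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `eps` is an assumed stabilizer that avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every rollout scores the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In this sketch, rollouts that beat their group's average get a positive advantage and the rest get a negative one, which is why verifiable, per-question rewards fit the method well.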
The core of the reward design is gated aggregation: the preference score enters the calculation only when baseline correctness holds (the QA answer is correct, or all rubric criteria are satisfied), which prevents a high preference signal from masking factual errors. The efficiency penalty is anchored within each group: correct answers in the same group serve as the baseline, and a smooth penalty is applied for excessive tool calls and generation length.
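The gating and group anchoring described above can be sketched as follows. This is an illustration under stated assumptions and not Perplexity's formula: the weights, the `tanh` shape of the penalty, the use of the group mean as the anchor, the zero reward for incorrect rollouts, and all names (`Rollout`, `smooth_excess`, `group_rewards`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool      # QA answer right, or every rubric criterion met
    preference: float  # preference score, assumed to lie in [0, 1]
    tool_calls: int
    tokens: int


def smooth_excess(value, anchor):
    """Assumed penalty shape: 0 at or below the anchor, saturating above it."""
    excess = max(0.0, value - anchor) / max(anchor, 1.0)
    return math.tanh(excess)


def group_rewards(group, pref_weight=0.5, eff_weight=0.2):
    """Score one sampling group; weights are illustrative assumptions."""
    correct = [r for r in group if r.correct]
    if not correct:
        return [0.0] * len(group)
    # Anchor on the correct rollouts of this group only.
    call_anchor = sum(r.tool_calls for r in correct) / len(correct)
    token_anchor = sum(r.tokens for r in correct) / len(correct)
    rewards = []
    for r in group:
        if not r.correct:
            # Gate closed: the preference score is ignored entirely.
            rewards.append(0.0)
            continue
        penalty = 0.5 * (smooth_excess(r.tool_calls, call_anchor)
                         + smooth_excess(r.tokens, token_anchor))
        rewards.append(1.0 + pref_weight * r.preference - eff_weight * penalty)
    return rewards


group = [
    Rollout(correct=True, preference=0.9, tool_calls=2, tokens=400),
    Rollout(correct=True, preference=0.9, tool_calls=6, tokens=400),
    Rollout(correct=False, preference=1.0, tool_calls=1, tokens=100),
]
rewards = group_rewards(group)
```

In this example the third rollout has the highest preference score but earns nothing because it is wrong, and the second rollout scores slightly below the first because it used more tool calls than the correct answers in its group averaged.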
Evaluation shows that the post-trained Qwen3.5-397B-SFT-RL performs best on multiple search benchmarks. On FRAMES, it reaches 57.3% with a single tool call, 5.7 percentage points higher than GPT-5.4 and 4.7 percentage points higher than Sonnet 4.6. Under a moderate budget (4 tool calls), it reaches 73.9% at a cost of 2.0 cents per query; under the same conditions, GPT-5.4 scores 67.8% at 8.5 cents and Sonnet 4.6 scores 62.4% at 15.3 cents. Cost figures are calculated from each vendor's public API pricing and exclude cache optimization. (Source: BlockBeats)
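The gaps implied by the figures above can be worked out directly. The short script below uses only the numbers reported in this article; the variable names are illustrative, and the derived values are simple arithmetic, not additional reported results.

```python
# FRAMES results at a 4-tool-call budget: (accuracy %, cost in cents per query).
results = {
    "Qwen3.5-397B-SFT-RL": (73.9, 2.0),
    "GPT-5.4": (67.8, 8.5),
    "Sonnet 4.6": (62.4, 15.3),
}

base_acc, base_cost = results["Qwen3.5-397B-SFT-RL"]

# For each competitor: accuracy gap in percentage points and cost multiple.
comparison = {
    name: (round(base_acc - acc, 1), round(cost / base_cost, 2))
    for name, (acc, cost) in results.items()
    if name != "Qwen3.5-397B-SFT-RL"
}

# Single-tool-call scores implied by the stated leads over 57.3%.
single_call_implied = {
    "GPT-5.4": round(57.3 - 5.7, 1),
    "Sonnet 4.6": round(57.3 - 4.7, 1),
}

for name, (gap, multiple) in comparison.items():
    print(f"{name}: {gap} points behind, {multiple}x the cost per query")
```

By these figures, GPT-5.4 costs about 4.25 times as much per query and Sonnet 4.6 about 7.65 times as much, while trailing by 6.1 and 11.5 percentage points respectively.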

Perplexity publishes the post-training method for its search agent; its Qwen3.5-based model beats GPT-5.4 on both accuracy and cost.
