Perplexity publishes the training method for its web search agent, built on the Qwen3.5 models, which surpasses GPT-5.4 in both accuracy and cost.
According to Beating Monitoring, the Perplexity research team published a technical article detailing the post-training process of its web search agent. The process builds on the open-source models Qwen3.5-122B-A10B and Qwen3.5-397B-A17B and uses a two-stage approach: supervised fine-tuning (SFT) first establishes deployment-required behaviors such as instruction following and language consistency, and on-policy reinforcement learning (RL) then optimizes search accuracy and tool-usage efficiency.
The RL stage uses the GRPO algorithm. Its training data comes from two sources. The first is a self-built synthetic multi-hop verifiable question-answer dataset: starting from internal seed queries, questions requiring 2 to 4 hops of reasoning are constructed through entity chains, and multiple independent solvers verify that each answer is unique. The second is a rubric-based general dialogue dataset, which converts deployment requirements such as instruction following and format constraints into objectively checkable atomic conditions; it is used during RL to prevent degradation of the behaviors established in SFT.
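The answer-uniqueness check described above can be sketched as a simple voting filter. This is a hypothetical illustration, not Perplexity's implementation: `verify_unique_answer`, `normalize`, and the unanimity threshold are all assumed names and choices.

```python
from collections import Counter

def normalize(ans):
    # Minimal normalization so superficial variants compare equal.
    return " ".join(str(ans).lower().split())

def verify_unique_answer(question, solvers, min_agreement=1.0):
    """Keep a synthetic QA pair only if the independent solvers agree
    on a single answer (a loose reading of "multiple independent
    solvers verify the uniqueness of the answers")."""
    answers = [normalize(s(question)) for s in solvers]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return answer   # unique, verified answer -> keep the pair
    return None         # ambiguous -> discard the pair
```

A question would then be kept for training only when every solver converges on the same normalized answer; disagreement signals an ambiguous or under-specified chain.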
The core of the reward design is gated aggregation: the preference score enters the calculation only when the baseline is correct (all question-answer pairs, or all rubric criteria, are satisfied), which prevents high preference signals from masking factual errors. The efficiency penalty uses intra-group anchoring: taking the correct answers within the same group as the baseline, it applies smooth penalties to tool-call counts and generation lengths that exceed that baseline.
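The gating and intra-group anchoring could look roughly like the following. This is a minimal sketch under assumed structure: the record fields, the linear penalty shape, and the coefficients `alpha` and `beta` are illustrative, not Perplexity's actual formula.

```python
def gated_rewards(group, alpha=0.05, beta=0.001):
    """Compute per-rollout rewards for one GRPO group.

    group: list of dicts with keys
      'correct' (bool)  - did the rollout satisfy all checks?
      'pref'    (float) - preference score
      'tools'   (int)   - number of tool calls
      'length'  (int)   - generated tokens
    """
    correct = [r for r in group if r["correct"]]
    if correct:
        # Anchor the efficiency penalty on the cheapest correct rollout.
        min_tools = min(r["tools"] for r in correct)
        min_len = min(r["length"] for r in correct)
    rewards = []
    for r in group:
        if not r["correct"]:
            # Gate: incorrect rollouts get no preference credit at all.
            rewards.append(0.0)
            continue
        # Smooth penalty only for usage above the in-group baseline.
        penalty = (alpha * max(0, r["tools"] - min_tools)
                   + beta * max(0, r["length"] - min_len))
        rewards.append(r["pref"] - penalty)
    return rewards
```

Because the penalty is anchored to correct rollouts in the same group rather than to a global budget, a rollout is only penalized for being less efficient than a peer that also got the answer right.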
Evaluation shows that the post-trained Qwen3.5-397B-SFT-RL performs best across multiple search benchmarks. On FRAMES, it reaches 57.3% with a single tool call, 5.7 percentage points above GPT-5.4 and 4.7 percentage points above Sonnet 4.6. Under a moderate budget (4 tool calls) it reaches 73.9% at a per-query cost of 2.0 cents; under the same conditions, GPT-5.4 scores 67.8% at 8.5 cents and Sonnet 4.6 scores 62.4% at 15.3 cents. The cost figures are calculated from each vendor's publicly available API pricing and do not include caching optimizations.