Stanford and Berkeley propose LLM-as-a-Verifier, setting new records on the Terminal-Bench and SWE-Bench leaderboards
ME News Report, April 14 (UTC+8): according to 1M AI News monitoring, when an AI coding agent runs the same task multiple times, it often produces different solutions, some correct and some not. Automatically selecting the best one can push the overall success rate above that of any single run. The question is how to select. Using another model as a judge to score candidates (LLM-as-a-Judge) is the current mainstream approach, but its scoring granularity is too coarse: different solutions often receive the same score, making it hard to tell which is better.

Stanford AI Laboratory and the Berkeley Sky Computing Laboratory, in collaboration with NVIDIA, proposed LLM-as-a-Verifier to improve this selection step. Instead of taking only the judge's final score, it reads the model's probability distribution over the score levels and computes a continuous reward value. The judge also repeats each evaluation multiple times and averages the results to cancel out random bias, and the overall assessment is split into three independent dimensions, each verified separately: whether the task requirements are met, whether the output format is correct, and whether there are error signals.

In experiments with Gemini 2.5 Flash as the verifier, single-verification accuracy reached 74.7%, versus 57.0% for a traditional judge; with 16 repetitions, the Verifier reached 77.4% against the Judge's 70.2%. The traditional Judge tied on 26.5% of comparisons, while the Verifier's tie rate was 0% under all configurations.

In practice: on Terminal-Bench 2, running GPT-5.4 five times on the same task, randomly selecting one solution succeeded 81.8% of the time, rising to 86.4% after Verifier filtering. On SWE-Bench Verified, selecting among one solution each from Claude Opus 4.5, Claude Opus 4.6, and Gemini 3 Flash (3 solutions in total), the success rate rose from 76.1% to 77.8%. As of April 9, both results topped their respective leaderboards. The framework has been open-sourced. (Source: BlockBeats)
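The selection scheme described above (a continuous reward from the judge's probability distribution over score levels, averaged over repeated evaluations across three separate dimensions) can be sketched as follows. This is a minimal illustration, not the released framework: the `judge_distribution` stub stands in for reading token probabilities from a real judge model such as Gemini 2.5 Flash, and all names, rubric levels, and dimension labels here are assumptions.

```python
import random
from statistics import mean

SCORE_LEVELS = [1, 2, 3, 4, 5]  # assumed discrete scoring rubric
DIMENSIONS = ["requirements_met", "format_correct", "no_error_signals"]

def judge_distribution(solution: str, dimension: str) -> list[float]:
    """Stand-in for reading the judge model's probability distribution
    over score levels (e.g. from token logprobs). Returns probabilities
    summing to 1; here they are random placeholders."""
    raw = [random.random() for _ in SCORE_LEVELS]
    total = sum(raw)
    return [x / total for x in raw]

def continuous_reward(probs: list[float]) -> float:
    """Expected score under the distribution: a continuous reward rather
    than the single discrete level a Judge emits, so exact ties are rare."""
    return sum(p * s for p, s in zip(probs, SCORE_LEVELS))

def verify(solution: str, repeats: int = 16) -> float:
    """Repeat the evaluation per dimension and average to damp random
    bias, then average the three dimension rewards into one score."""
    dim_scores = []
    for dim in DIMENSIONS:
        runs = [continuous_reward(judge_distribution(solution, dim))
                for _ in range(repeats)]
        dim_scores.append(mean(runs))
    return mean(dim_scores)

def select_best(solutions: list[str]) -> str:
    """Best-of-n selection: keep the candidate with the highest reward."""
    return max(solutions, key=verify)
```

With a real judge, `judge_distribution` would come from the probabilities the model assigns to each score token; the rest of the pipeline (expectation, repetition, per-dimension decomposition, argmax over candidates) is what separates this Verifier-style selection from simply taking a Judge's single discrete score.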