Stanford and Berkeley propose LLM-as-a-Verifier, while also setting new records on the Terminal-Bench and SWE-Bench leaderboards


ME News Report, April 14 (UTC+8). According to 1M AI News monitoring, when an AI programming agent handles a single task, multiple runs often produce different solutions, some correct and some not. If the best one can be selected automatically, the overall success rate can exceed that of any single run. The challenge is the selection itself: the current mainstream approach is to have another model score each solution (LLM-as-a-Judge), but the scoring granularity is too coarse; different solutions frequently receive the same score, making the better one hard to identify.

Stanford AI Laboratory and Berkeley Sky Computing Laboratory, in collaboration with NVIDIA, propose LLM-as-a-Verifier to improve this selection step. Instead of taking only the judge's final score, it reads the model's probability distribution over the scoring levels and computes a continuous reward value. The judge also repeats each evaluation several times and averages the results to reduce random bias, and the overall assessment is decomposed into three independently verified dimensions: whether the task requirements are met, whether the output format is correct, and whether there are error signals.

In experiments using Gemini 2.5 Flash as the verifier, a single verification reached 74.7% accuracy, versus 57.0% for a traditional judge; after 16 repetitions, the Verifier reached 77.4% and the Judge 70.2%. The traditional Judge produced ties 26.5% of the time, while the Verifier's tie rate was 0% under all configurations.

Practical results: on Terminal-Bench 2, running GPT-5.4 five times on the same task and randomly selecting one solution yields an 81.8% success rate, which rises to 86.4% when the Verifier selects. On SWE-Bench Verified, selecting among one solution each from Claude Opus 4.5, Claude Opus 4.6, and Gemini 3 Flash (three solutions total) raised the success rate from 76.1% to 77.8%.
As of April 9, both results rank first on their respective leaderboards. The framework has been open-sourced. (Source: BlockBeats)
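The scoring mechanism described above can be sketched as follows. This is a minimal illustration, not the released framework's code: the function names, the 1–5 score scale, and the input format (per-dimension log-probabilities over score tokens) are all assumptions for the sake of the example.

```python
import math

def continuous_reward(score_logprobs: dict[str, float]) -> float:
    """Convert a judge's log-probabilities over discrete score tokens
    (here assumed to be "1".."5") into a continuous expected score,
    rather than keeping only the single most likely score."""
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(int(s) * p for s, p in probs.items()) / total

def verifier_reward(runs: list[dict[str, dict[str, float]]]) -> float:
    """Average the continuous rewards over repeated evaluations (to reduce
    random bias) and over the three dimensions named in the article:
    task requirements, output format, and error signals."""
    dims = ("requirements", "format", "errors")
    per_run = [
        sum(continuous_reward(run[d]) for d in dims) / len(dims)
        for run in runs
    ]
    return sum(per_run) / len(per_run)

# Hypothetical data: two repeated evaluations of one candidate solution.
runs = [
    {"requirements": {"4": math.log(0.6), "5": math.log(0.4)},
     "format":       {"5": math.log(0.9), "4": math.log(0.1)},
     "errors":       {"3": math.log(0.5), "4": math.log(0.5)}},
    {"requirements": {"4": math.log(0.5), "5": math.log(0.5)},
     "format":       {"5": math.log(0.8), "4": math.log(0.2)},
     "errors":       {"4": math.log(0.7), "3": math.log(0.3)}},
]
print(round(verifier_reward(runs), 3))  # prints 4.3
```

Because the reward is a continuous expectation rather than a discrete level, two candidate solutions almost never receive exactly the same value, which is consistent with the 0% tie rate the article reports for the Verifier.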
