According to Beating Monitoring, Stanford's Erica Zhang and colleagues have released TERMS-Bench, a benchmark for economic negotiation.
It drops the black-box "LLM judge," so evaluators can see directly whether a model lost because of its bids, its concessions, or rule violations.
In the standard tests, Claude Opus 4.6 and Zhipu's GLM 5.1 took the top two spots.
The paper found that both adopt a hard-line strategy of "bid high, concede nothing," which squeezes the counterparty in favorable scenarios where margins are wide.
But in the hardest scenarios, where profit margins are extremely narrow, the hard-line strategy backfires because negotiations frequently break down.
Here the leaderboard is turned upside down: Gemma 4 31B (an open-weight model) and Gemini 3.1 Pro, which know how to make moderate concessions to close a deal, jump to the top two,
while the former leaders fall back, Claude to fifth place and GLM to ninth.
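
The reversal between wide-margin and narrow-margin scenarios can be illustrated with a toy model. The sketch below is not TERMS-Bench's scoring method; it assumes a hypothetical rule in which the counterparty walks away whenever its own gain falls below a fixed floor. The names `deal_profit`, `demand_share`, and `counterparty_floor`, along with every dollar amount, are invented for illustration.

```python
def deal_profit(surplus, demand_share, counterparty_floor):
    """Agent's profit from one negotiation under a toy acceptance rule.

    surplus            -- total gain available to split (hypothetical dollars)
    demand_share       -- fraction of the surplus the agent insists on keeping
    counterparty_floor -- minimum gain the counterparty needs to accept (assumed)
    """
    counterparty_gain = surplus * (1 - demand_share)
    if counterparty_gain < counterparty_floor:
        return 0.0  # negotiation breaks down, nobody earns anything
    return surplus * demand_share


# Hypothetical strategies: a hard-line agent keeps 80%, a moderate one 50%.
STRATEGIES = {"hard-line": 0.8, "moderate": 0.5}

if __name__ == "__main__":
    for label, surplus in [("wide margin", 20.0), ("narrow margin", 6.0)]:
        for name, share in STRATEGIES.items():
            profit = deal_profit(surplus, share, counterparty_floor=2.0)
            print(f"{label:13s} {name:9s} profit = {profit:.2f}")
```

Under these assumed numbers, the hard-line share earns 16 against the moderate share's 10 when the surplus is 20, but earns nothing when the surplus shrinks to 6, where the moderate share still earns 3.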
Beyond the extreme-difficulty tests, the benchmark's most striking feature is Bankroll mode, which tests an agent's ability to survive.
A single negotiation is extended into continuous procurement: each agent starts with $100 and negotiates for 50 rounds, a fixed operating cost is deducted every round, and the agent goes bankrupt if its funds run out.
In this setting, even small negotiation mistakes compound into bankruptcy risk.
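
The Bankroll rules can be sketched as a short simulation. Only the $100 starting balance, the 50 rounds, the per-round fixed cost, and bankruptcy on running out of funds come from the description above; the cost of $5 per round, the per-round profits, and the function name `run_bankroll` are assumptions made for illustration, as is the order in which profit and cost are applied.

```python
def run_bankroll(profits, start_cash=100.0, fixed_cost=5.0):
    """Simulate one Bankroll-style run.

    profits    -- per-round negotiation profit (0 if the deal breaks down)
    start_cash -- starting balance ($100 per the article)
    fixed_cost -- operating cost deducted every round (value is hypothetical)
    """
    cash = start_cash
    for round_no, profit in enumerate(profits, start=1):
        cash += profit - fixed_cost
        if cash <= 0:  # funds have run out: the agent is bankrupt
            return {"survived": False, "rounds": round_no, "cash": cash}
    return {"survived": True, "rounds": len(profits), "cash": cash}


if __name__ == "__main__":
    ROUNDS = 50
    # Hypothetical agents: one never closes a deal, one earns $12 every round.
    print(run_bankroll([0.0] * ROUNDS))
    print(run_bankroll([12.0] * ROUNDS))
```

With these assumed values, an agent that never closes a deal runs out of cash in round 20, while one that earns a steady $12 per round finishes all 50 rounds with $450.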
The results show that GLM 5.1, Claude Opus 4.6, and Google's two models (Gemma 4 31B and Gemini 3.1 Pro), despite their different strategies, all keep their finances under control: each achieves a 100% survival rate and finishes with between $380 and $443 in cash.
By contrast, Grok 4.20 and GPT-4o-mini cannot absorb the cash-flow losses, with bankruptcy rates of 25% and 50%, respectively.
The point of TERMS-Bench is not the deal success rate but the translation of negotiation errors into cash losses and bankruptcy risk.
Whether a model can persuade its counterparty is only the first layer;
in continuous trading, what truly sets models apart is whether they can sustain profit and cash flow.

