Ramp releases private Ramp SWE-Bench benchmark: Claude Fable 5 leads with an 87.5% solve rate

CoinWorld News reports that Ramp has released Ramp SWE-Bench, a private benchmark for frontier AI coding agents. The benchmark consists of 80 backend development tasks drawn from Ramp’s real production environment. It is intended to address two problems with public evaluation sets: data leakage, where models have seen the test data during pretraining, and metric saturation. In the published side-by-side results for 14 models, Anthropic’s latest model, Claude Fable 5, ranks first with an 87.5% solve rate. Claude Opus 4.7 and GPT-5.5 tie for second place, each with a solve rate of 83.75%. The results also show trade-offs between price and performance: the Chinese models Kimi K2.6 and GLM 5.1 have similar solve rates of 72.5% and 71.25%, respectively, but Kimi K2.6’s average cost is $0.69, about 34% lower than that of GLM 5.1.
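As a rough sanity check on these figures, the sketch below converts each reported solve rate into a task count and backs out the GLM 5.1 cost implied by the "about 34% cheaper" claim. It assumes that every rate is measured over all 80 tasks with equal weight and that the 34% is relative to GLM 5.1's average cost; the article states neither explicitly, and the GLM 5.1 figure is derived here, not reported.

```python
# Assumption: each solve rate is a share of the full 80-task benchmark.
TOTAL_TASKS = 80

# Solve rates (percent) as reported in the article.
solve_rates = {
    "Claude Fable 5": 87.5,
    "Claude Opus 4.7": 83.75,
    "GPT-5.5": 83.75,
    "Kimi K2.6": 72.5,
    "GLM 5.1": 71.25,
}


def tasks_solved(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Convert a percentage solve rate into a whole number of tasks."""
    return round(rate_percent / 100 * total)


solved = {model: tasks_solved(rate) for model, rate in solve_rates.items()}

# Assumption: "about 34% cheaper" means kimi_cost = glm_cost * (1 - 0.34).
KIMI_COST = 0.69   # reported average cost in USD
DISCOUNT = 0.34    # reported approximate saving relative to GLM 5.1
implied_glm_cost = KIMI_COST / (1 - DISCOUNT)

if __name__ == "__main__":
    for model, count in solved.items():
        print(f"{model}: {count}/{TOTAL_TASKS} tasks")
    print(f"Implied GLM 5.1 average cost: ${implied_glm_cost:.2f}")
```

Under these assumptions every reported rate maps to a whole number of tasks, so a one-task difference corresponds to 1.25 percentage points. That is the entire gap between Kimi K2.6 and GLM 5.1.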
Comment
VolatilityOfToastingBread
· 2h ago
Data leaks are indeed a big problem; private testing is the only way to be convincing.
Lemon-FlavoredLiquidation
· 2h ago
How can Claude achieve this performance at this price? How does Anthropic keep their infrastructure costs down?
RetroRadioSignal
· 2h ago
Kimi offers pretty good value, only $0.69—what more could you want?