Zhipu GLM-5.2 Tops DeepSWE Open-Source Rankings: Solves 44% of Complex Development Tasks, Outperforms Several Mainstream Closed-Source Models

According to monitoring by Beating, Zhipu AI's open-source model GLM-5.2 has been added to DeepSWE, a long-horizon software engineering benchmark. In its maximum thinking-intensity mode, the model solves 44% of the benchmark's complex development tasks, ranking first among open-source models and 13 percentage points ahead of the previously listed Kimi K2.7 Code.

GLM-5.2 costs an average of $3.92 per task, $1.10 more than Kimi K2.7 Code's $2.82, but its success rate exceeds that of several mainstream closed-source models at specific thinking settings, including Claude Sonnet 4.6 [high] (30%), Gemini 3.5 Flash [medium] (37%), and Claude Opus 4.8 [low] (41%).
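
One way to read the cost and success figures together is cost per solved task. The sketch below is only an illustration built on the numbers quoted above: Kimi K2.7 Code's 31% rate is inferred from the 13-point gap rather than stated directly, and the function name `cost_per_solved_task` and the assumption that the average cost applies equally to failed and successful attempts are ours, not Datacurve's.

```python
# Figures quoted in the article. Kimi's success rate is derived from the
# stated 13-percentage-point gap (44 - 13 = 31), not reported directly.
results = {
    "GLM-5.2": {"success_pct": 44, "cost_per_task": 3.92},
    "Kimi K2.7 Code": {"success_pct": 44 - 13, "cost_per_task": 2.82},
}


def cost_per_solved_task(success_pct: float, cost_per_task: float) -> float:
    """Average spend needed to obtain one solved task.

    Assumption (hypothetical, for illustration): every attempted task costs
    the quoted average, whether or not it is solved.
    """
    return cost_per_task / (success_pct / 100)


summary = {
    name: round(cost_per_solved_task(r["success_pct"], r["cost_per_task"]), 2)
    for name, r in results.items()
}

for name, value in summary.items():
    print(f"{name}: ${value:.2f} per solved task")
```

Under these assumptions the two open-source models land close together (roughly $8.91 versus $9.10 per solved task), so the higher per-task price of GLM-5.2 is largely offset by its higher success rate.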

The DeepSWE benchmark, designed by Datacurve, specifically tests AI agents' ability to complete long-horizon tasks. It comprises 113 real-world programming problems across five languages. Unlike traditional benchmarks that require changing only a single piece of code, DeepSWE requires coordinated changes across multiple files, with fixes averaging more than 600 lines of code. Evaluations run in isolated containers with strict CPU and memory limits.