Agents need a "fuel gauge" and "brakes": a paper exposes the "confused accounting" of AI Agents
Imagine this scenario:
You ask an AI Agent to fix a code bug. It opens the project, reads 20 files, makes some changes, runs the tests, and fails. It patches a bit more, runs again, still no luck… After several rounds of back-and-forth, the bug still isn't fixed.
You shut down your computer and breathe a sigh of relief. Then the API bill arrives.
The numbers on that bill might make you gasp: when an AI Agent autonomously fixes bugs through overseas official APIs, a single unresolved task often burns through over a million tokens, costing anywhere from tens to over a hundred dollars.
In April 2026, a research paper jointly published by Stanford, MIT, Michigan, and other institutions systematically opened up the "black box" of AI Agent spending on coding tasks for the first time: where exactly does the money go, is it worth it, and can it be predicted in advance? The answers are startling.
Finding 1: The cost of AI Agent coding is 1,000 times higher than regular AI conversations
You might think: letting AI help you write code and chatting about code should cost about the same, right?
The paper shows a comparison:
The token consumption for Agentic coding tasks is about 1,000 times that of regular code Q&A and reasoning tasks.
A full three orders of magnitude difference.
Why is this? The paper points out a fact—money isn’t spent on “writing code,” but on “reading code.”
Here, "reading" doesn't mean humans reading code. It means the Agent constantly feeding the entire project context, history logs, error messages, and file contents back into the model as it works. Every additional dialogue round makes that context longer, and since models bill by token count, more input means higher cost.
An analogy: it's like hiring a repairman who insists on reading the entire building's blueprints aloud before turning a single wrench; you end up paying far more for the blueprint recital than for the screw turning.
The paper summarizes the phenomenon this way: an Agent's cost is driven by the explosive growth of input tokens, not output tokens.
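To make the mechanism concrete, here is a back-of-the-envelope sketch in Python; all the numbers are illustrative assumptions, not figures from the paper. If the full conversation history is re-sent every round, billed input tokens grow quadratically with the number of rounds, while output tokens grow only linearly:

```python
# Back-of-the-envelope: why input tokens, not output tokens, dominate the bill.

ROUNDS = 30                 # dialogue rounds in one debugging session (assumed)
CONTEXT_PER_ROUND = 8_000   # new tokens added each round: files, logs, diffs (assumed)
OUTPUT_PER_ROUND = 500      # tokens the model writes back each round (assumed)

total_input = 0
context = 0
for _ in range(ROUNDS):
    context += CONTEXT_PER_ROUND   # the history keeps growing...
    total_input += context         # ...and the FULL history is billed every round

total_output = ROUNDS * OUTPUT_PER_ROUND
print(f"input tokens billed:  {total_input:,}")    # 3,720,000
print(f"output tokens billed: {total_output:,}")   # 15,000
```

Under these toy assumptions, 30 rounds of debugging bill about 3.7 million input tokens against just 15,000 output tokens, which is roughly the "over a million tokens per task" scale the paper reports.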
Finding 2: The same bug, run twice, can cost twice as much; and the pricier the run, the less stable the cost
Even more frustrating is the randomness.
Researchers ran the same Agent on the same task four times and found:
Between different tasks, the most expensive task burned about 7 million more tokens than the cheapest (Figure 2a).
In multiple runs of the same model and task, the most expensive run was roughly twice the cheapest (Figure 2b).
And when comparing across models on the same task, the highest consumption can differ from the lowest by up to 30 times.
That last figure is especially noteworthy: it means the cost gap between choosing the right or wrong model isn’t just “a bit more expensive,” but “an order of magnitude more.”
Even more painfully—spending more doesn’t necessarily mean better results.
The paper found a “U-shaped” curve:
Cost correlates with accuracy, but not monotonically: low-cost runs tend to be less accurate (possibly because the model read too little context), medium-cost runs are often the most accurate, and high-cost runs stop improving and may even decline, entering a "saturation zone."
Why is this? The paper analyzes the Agent’s specific operations to give an answer—
In high-cost runs, the Agent spends a lot of time on “repetitive work.”
Analysis of the traces showed that in high-cost runs, about 50% of file viewing and editing operations were repeats: the Agent reads the same file again and again and modifies the same lines over and over, like someone pacing in circles, getting dizzier with every turn.
Money isn’t spent on solving the problem but on “getting lost.”
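For intuition, here is a toy sketch of how such "repeated work" could be measured from an agent's tool-call trace; the trace format and the numbers are hypothetical simplifications, not the paper's methodology:

```python
# Toy measurement of "repeated work" in an agent trace: the share of
# view/edit actions whose (action, file) pair has already occurred.
from collections import Counter

# Hypothetical trace format: one (action, file_path) tuple per tool call.
trace = [
    ("view", "src/parser.py"),
    ("edit", "src/parser.py"),
    ("view", "tests/test_parser.py"),
    ("view", "src/parser.py"),   # repeat: same file read again
    ("edit", "src/parser.py"),   # repeat: same file edited again
]

counts = Counter(trace)
repeats = sum(c - 1 for c in counts.values())   # occurrences beyond the first
print(f"repeated operations: {repeats}/{len(trace)} ({repeats / len(trace):.0%})")
# -> repeated operations: 2/5 (40%)
```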
Finding 3: The “efficiency ratio” between models varies wildly—GPT-5 is the most economical, some models burn 1.5 million tokens more
The paper benchmarked 8 frontier models as agents on the industry-standard SWE-bench Verified (500 real GitHub issues). Converted to dollars, a model with higher token efficiency can save dozens of dollars per task; for an enterprise running hundreds of tasks a day, the difference is real money.
An even more interesting discovery: token efficiency is an “inherent trait” of the model, not just task-dependent.
When the researchers restricted the comparison to the 230 tasks that every model solved, and separately to the 100 tasks that every model failed, the relative ranking of the models stayed almost unchanged.
This indicates: some models are inherently “talkative,” regardless of task difficulty.
Another sobering finding: models lack “stop-loss awareness.”
When facing hard tasks that no model can solve, an ideal Agent should give up early rather than keep burning money. In reality, models tend to consume even more tokens on failed tasks: they never "admit defeat," they just keep exploring, retrying, and re-reading context, like a car with no low-fuel warning light, driving until the tank runs dry.
Finding 4: A task humans find difficult isn't necessarily expensive for an Agent; the two perceive difficulty in completely misaligned ways
You might think: at least I can estimate costs based on task difficulty, right?
The paper brought in human experts to rate the difficulty of 500 tasks, then compared that to the actual token consumption by the Agent—
Result: only weak correlation.
In plain language: a task humans find extremely difficult might be handled by the Agent easily and cheaply, while a task humans consider trivial might burn through a huge number of tokens.
This is because humans and AI “perceive” difficulty differently:
Humans consider: logical complexity, algorithmic difficulty, business understanding barriers.
Agents consider: project size, number of files to read, exploration paths, whether they’ll repeatedly modify the same file.
A bug that a human expert thinks “just change one line” might require the Agent to understand the entire codebase structure first—just “reading” consumes a huge number of tokens. Conversely, an algorithmic problem that a human finds “complicated” might be straightforward for the Agent if it knows the standard solution, solving it quickly.
This leads to an awkward reality: developers can’t reliably estimate the running cost of an Agent based on intuition.
Finding 5: Even the model can’t accurately estimate its own costs
If humans can’t estimate, what about letting AI predict itself?
Researchers designed a clever experiment: have the Agent “inspect” the codebase before actually fixing the bug, then estimate how many tokens it will consume—without actually performing the fix.
What was the result?
All models failed.
The best result was Claude Sonnet 4.5's prediction of its own output tokens, with a correlation of 0.39 (where 1.0 is perfect). Most models' correlations ranged from 0.05 to 0.34, and Gemini 3 Pro scored only 0.04, which is basically guesswork.
Even more absurd: all models systematically underestimated their token consumption. In the scatter plot of Figure 11, almost all points fall below the “perfect prediction” line—models think they will spend fewer tokens than they actually do. And this underestimation bias is even worse without providing examples.
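For intuition, a toy version of that evaluation might look like the following; the setup is assumed and the numbers are made up to mimic the qualitative finding (weak correlation, systematic underestimation):

```python
# Toy version of the Finding 5 evaluation: correlate self-predicted token
# usage with actual usage, and measure the underestimation bias.
# All numbers are fabricated illustrations, not data from the paper.
import statistics

predicted = [200_000, 150_000, 400_000, 250_000]      # the model's own estimates
actual    = [900_000, 1_800_000, 1_100_000, 600_000]  # what it actually burned

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(f"correlation: {pearson(predicted, actual):.2f}")   # weak: -0.36
bias = statistics.fmean(a - p for a, p in zip(actual, predicted))
print(f"mean underestimate: {bias:,.0f} tokens")          # 850,000
```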
What’s more ironic—predicting costs also costs money.
For Claude 3.7 Sonnet and Claude Sonnet 4, the prediction run can cost more than twice as much as the task itself. That is, asking them to "estimate" the work before doing it is more expensive than just doing it.
The paper’s conclusion is straightforward:
Currently, state-of-the-art models cannot accurately predict their own token usage. Clicking “Run Agent” is like opening a blind box—only after the bill arrives do you know how much was spent.
Behind this “confused accounting” lies a bigger industry problem—
Reading this, you might ask: what does this mean for enterprises?
The paper notes that subscription models like ChatGPT Plus are feasible because ordinary conversations have relatively predictable token consumption. But agent tasks completely break this assumption—one task can cause the Agent to loop endlessly, burning huge amounts of tokens.
This means pure subscription pricing may be unsustainable for Agent scenarios, and pay-as-you-go will likely remain the practical choice for a long time. But pay-as-you-go has its own problem: usage is inherently unpredictable.
Traditionally, enterprises choose models based on two dimensions: capability (can it do the job) and speed (how fast). This paper introduces a third, equally important dimension: efficiency (how much does it cost to get the job done).
A slightly less capable but 3x more efficient model might be more economical at scale than the “best but most expensive” one.
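A rough sketch of that trade-off, with placeholder prices, token counts, and success rates (none of these are benchmark numbers from the paper): what matters at scale is the expected cost per solved task, since failed runs are billed too.

```python
# Illustrative "efficiency" math: expected cost per SOLVED task, combining
# price, average token usage, and success rate. All numbers are placeholders.

models = {
    # name: (USD per 1M tokens, avg tokens per task, success rate)
    "best-but-expensive": (10.0, 4_000_000, 0.75),
    "slightly-weaker":    ( 3.0, 2_500_000, 0.70),
}

for name, (usd_per_m, tokens, success_rate) in models.items():
    cost_per_task = usd_per_m * tokens / 1_000_000
    cost_per_solved = cost_per_task / success_rate   # failed runs are billed too
    print(f"{name:>20}: ${cost_per_task:6.2f}/task  ${cost_per_solved:6.2f}/solved")
# -> best-but-expensive: $ 40.00/task  $ 53.33/solved
# ->    slightly-weaker: $  7.50/task  $ 10.71/solved
```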
The paper mentions a promising future direction—Budget-aware tool-use policies. Simply put, equipping the Agent with a “fuel gauge”: when token consumption approaches the budget, force it to stop unproductive exploration instead of burning through to the end.
Currently, almost all mainstream Agent frameworks lack such mechanisms.
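To make the idea concrete, here is a minimal sketch of what such a "fuel gauge" could look like; the interface below is hypothetical, not an API from any real Agent framework.

```python
# A minimal "fuel gauge" for an agent loop: warn as spend approaches the
# budget, and hard-stop ("brakes") once it is exceeded.

class BudgetExceeded(Exception):
    pass

class TokenBudgetGuard:
    """Track token spend and stop the agent before it runs away."""

    def __init__(self, budget: int, warn_ratio: float = 0.8):
        self.budget = budget        # hard cap in tokens
        self.warn_ratio = warn_ratio
        self.spent = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.spent += input_tokens + output_tokens
        if self.spent >= self.budget:
            raise BudgetExceeded(f"spent {self.spent:,} of {self.budget:,} tokens")
        if self.spent >= self.warn_ratio * self.budget:
            # "Fuel warning light": a real agent might switch to a
            # wrap-up prompt here instead of continuing to explore.
            print(f"warning: {self.spent / self.budget:.0%} of budget used")

# Usage inside a (hypothetical) agent loop:
guard = TokenBudgetGuard(budget=1_000_000)
# for step in agent.run(task):
#     guard.record(step.input_tokens, step.output_tokens)  # raises at the cap
```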
The “money-burning” problem of Agents isn’t a bug but an industry-wide pain point.
This paper reveals that it’s not just a flaw of certain models but a structural challenge of the entire Agent paradigm—when AI evolves from “Q&A” to “autonomous planning, multi-step execution, iterative debugging,” the unpredictability of token consumption becomes almost inevitable.
The good news is, this is the first systematic effort to quantify this confusion. With this data, developers can make smarter choices about models, set budgets, and design stop-loss mechanisms; model providers have a new direction for optimization—not just to be stronger but to be more economical.
After all, before AI Agents truly enter thousands of industries’ production environments, spending every penny wisely is more important than writing every line of code perfectly.