Harness has just become popular, and it might soon become a thing of the past.
By Boyang
As task complexity increases, an Agent's context expands without bound. Amid endless conversation history, tool-call outputs, intermediate steps, and error messages, the model becomes confused and starts skipping steps, ignoring instructions, and taking detours.
That has long been the standard explanation for why long contexts make tasks hard: the problem is simply that the context is too long.
Harness Engineering was born largely to clean up this mess. Its fundamental premise is that models will inevitably degrade in long contexts.
Over the past fifteen months, the industry has evolved from AutoGPT's plain-text memory to Anthropic Claude Code's CLAUDE.md and subagent systems, building a whole set of engineering scaffolds specifically to suppress the model's runaway behavior in long contexts. This approach is called Harness Engineering.
But what exactly is degrading? What mechanism underlies the skipped steps and ignored instructions? The industry has produced three rounds of answers, each leading to a different engineering solution.
It wasn't until April 2026, when Gleb Rodionov of Yandex published "Reasoning Shift," a paper on how context quietly shortens large-model reasoning, that a deeper answer emerged.
Building three layers of scaffolding cannot prevent the fourth-layer crisis.
Regarding why models perform poorly in long contexts, the industry has iterated through three explanations over the past three years, each accompanied by corresponding engineering scaffolds.
The first blames retrieval failure. In 2023, Stanford pointed out in “Lost in the Middle” that models form a U-shaped attention curve in long texts, neglecting the middle regions. The industry’s response was RAG—fragmenting long texts and feeding the most relevant segments via vector retrieval.
The second overturns the first. The 2025 paper "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval" ran the experiment: strip out all irrelevant content and force the model to attend only to the necessary information, and performance still declines, by 13.9% and by as much as 85% depending on the task. Even replacing the irrelevant content with blank filler produced similar results. The problem isn't missing information; the sheer length of the context itself damages reasoning.
The industry's response is Context Engineering: compress the context, manage the window, condense the history, and strictly control token count.
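What "strictly controlling token count" means in practice can be sketched in a few lines. Everything here is illustrative: the whitespace token counter and the keep-newest-messages policy stand in for a real tokenizer and a real context-management strategy.

```python
# Trim conversation history to a fixed token budget, keeping the system
# prompt and the newest messages. Token counting here is a crude
# whitespace split; real systems use the model's own tokenizer.
def count_tokens(text: str) -> int:
    return len(text.split())

def trim_history(system: str, messages: list[str], budget: int) -> list[str]:
    kept = []
    used = count_tokens(system)
    # Walk from newest to oldest, stop once the budget is exhausted.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))

history = ["user: step one " * 10, "tool: output " * 30, "user: final question"]
trimmed = trim_history("system: be concise", history, budget=50)
```

Dropping the oldest messages first is only one possible policy; real frameworks also summarize evicted history or pin designated messages.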
The third comes from joint research by Microsoft and Salesforce (ICLR 2025). Across six tasks and fifteen models, splitting a complete instruction into multiple conversational rounds caused an average performance drop of 39%. One wrong step in a single round, and the model loses its direction for everything that follows.
In response, the industry built the heaviest defenses in Harness: handover controls, periodic validation of intermediate results, treating the code repository as the sole source of truth, and never letting the model rely on its own memory of the previous round.
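Those defenses can be sketched as a loop that rebuilds context from the ground truth each round and validates every intermediate result before committing it. `run_harness`, `run_step`, and `validate` are hypothetical stand-ins, not any real framework's API.

```python
# A skeletal harness loop: the model never carries its own memory
# across rounds. Each round's context is re-derived from the sole
# factual source (here, a dict standing in for a code repository),
# and every intermediate result is checked before it is committed.
def run_harness(steps, repo, run_step, validate, max_retries=2):
    for step in steps:
        context = dict(repo)            # rebuild context from ground truth
        for attempt in range(max_retries + 1):
            result = run_step(step, context)
            if validate(step, result, repo):
                repo[step] = result     # commit only validated results
                break
        else:
            raise RuntimeError(f"step {step!r} failed validation")
    return repo

# Toy stand-ins: each step doubles the largest committed value.
out = run_harness(
    steps=["a", "b"],
    repo={"start": 1},
    run_step=lambda s, ctx: max(ctx.values()) * 2,
    validate=lambda s, r, rp: r == max(rp.values()) * 2,
)
```

The point of the sketch is the shape, not the stubs: validation happens against the repository, never against what the model claims it did last round.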
Three layers of problems, three layers of scaffolding. But these are only surface-level observations.
Look back at the second layer: researchers found that length itself is harmful, independent of information quality. As for why, they had no answer. Unable to find the root cause, the industry could only control length physically.
But what if the root cause isn’t length itself?
Anthropic observed that in long contexts, models quietly skip steps, disobey instructions, and skim over important details. The Todo lists, Checkpoints, and subagents in Harness all exist to fight this behavior.
The standing explanation was still that the context was too long and the model was missing things. But mainstream models with million-token windows ace needle-in-a-haystack tests. Are those results fake? Or is there another possibility: that this degradation is the model being lazy?
Rodionov's paper tests exactly this hypothesis.
Evidence of the model “slacking off” using Shakespeare
Rodionov’s experimental approach is extremely straightforward.
They posed the same problems under several scenarios an Agent might realistically encounter: a clean baseline; two questions embedded in a single prompt (simulating multi-tasking); the full 64k-token text of a Shakespeare play inserted before the question (simulating accumulated history); and the question arriving in the second round of a dialogue (simulating multi-turn conversation).
The evaluation used 400 olympiad-level math problems across four mainstream reasoning models.
The results: Qwen-3.5-27B's baseline accuracy was 74.5%, with an average of 28,771 reasoning tokens. After the Shakespeare insert, accuracy dropped to 67.8% and reasoning tokens shrank to 16,415, a 43% reduction. GPT-OSS-120B was even more extreme, its reasoning tokens halving from 24,180 to 11,876. All four models systematically shrank their reasoning under every non-baseline condition, by close to 50% in the worst cases.
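The shrinkage percentages quoted above follow directly from the reported token counts:

```python
# Percentage reduction in reasoning tokens after the Shakespeare insert,
# computed from the figures reported in the paper.
def reduction(baseline: int, after: int) -> float:
    return (baseline - after) / baseline * 100

qwen = reduction(28771, 16415)      # Qwen-3.5-27B
gpt_oss = reduction(24180, 11876)   # GPT-OSS-120B
print(f"Qwen: {qwen:.0f}%  GPT-OSS: {gpt_oss:.0f}%")  # → Qwen: 43%  GPT-OSS: 51%
```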
Moreover, this shortening increased linearly with longer contexts.
The accuracy drop is understandable; the drastic reduction in reasoning tokens is what is abnormal. Faced with a harder situation, a model should think more, not less.
Did Shakespeare confuse the model?
Quite the opposite. In the appendix, the model states: “Let me think if there’s a trap here. Is this question from Shakespeare’s Coriolanus? Wait, no, the original question is a math problem.” When solving geometry problems, it writes: “This is unrelated to geometry. Focus on geometry.”
Every mention of interference items is extremely brief and dismissive. The model fully knows Shakespeare is irrelevant, precisely separating signal from noise.
The other two modes converge on the same result. In the "subtask" mode, once the first task is finished, the model's cognitive investment in the second shrinks even further: Qwen's single-question baseline accuracy is 74.5%, but in the parallel setting the second question drops to 58.0%; Gemini falls from a baseline of 82.8% to 65.8%. The "multi-turn dialogue" mode triggers the same mechanism.
In every case, once the setting departs from a clean single-task baseline and the context becomes crowded, the model's cognitive investment shrinks.
Just like a modern person intolerant of long texts—seeing a long article, they get headaches and simply stop thinking.
The model isn't confused; it's just too lazy to check.
Where does the reasoning shorten?
Researchers recorded the position where the model first produced candidate answers on 500 math problems under both baseline and long-input conditions. Under baseline, the average was 925 tokens; under long input, 939 tokens—almost identical.
The speed at which the model finds answers hasn’t changed. The real change is what happens after finding the answer.
Under baseline, the model continues to verify the answer 43% of the time. Under long input, this drops to 32%.
To isolate the variable, the researchers designed a "save game" experiment. They first had the model solve problems, then truncated the finished reasoning by cutting off its final 50 tokens, creating a shared "save point." They fed this partial reasoning back to the model to continue; the only difference between conditions was the length of the interference text placed in front.
With no interference, the model stopped and answered immediately 21% of the time. With 128 tokens (a couple of sentences), the stop rate rose to 26%. With 16k tokens, 46% gave up and answered straight away.
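The mechanical part of the save-game setup, truncating a trace and prepending interference of varying lengths, is easy to sketch. The whitespace tokenization and prompt layout here are illustrative assumptions; the actual continuation step would require a model call.

```python
# Build the "save game" prompts: take a finished reasoning trace, cut
# off its last 50 tokens to create a shared save point, then prepend
# interference text of varying lengths. "Tokens" here are
# whitespace-split words, for illustration only.
def make_save_point(trace: str, cut: int = 50) -> str:
    tokens = trace.split()
    return " ".join(tokens[:-cut])

def build_resume_prompts(question: str, trace: str, fillers: list[str]) -> list[str]:
    save = make_save_point(trace)
    # The filler is the only variable between conditions.
    return [f"{filler}\n\n{question}\n\n{save}" for filler in fillers]

trace = "step " * 200                               # stand-in for a real trace
fillers = ["", "noise " * 128, "noise " * 16000]    # 0, 128, 16k filler tokens
prompts = build_resume_prompts("Solve the problem.", trace, fillers)
```

Each prompt would then be sent to the model, and one would record how often it answers immediately instead of continuing to reason.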
Even with identical reasoning, longer new contexts made the model more likely to think “that’s good enough.”
Word frequencies make it even more concrete. "Wait" appears at an 11% rate under the blank condition, dropping to 5% at 16k tokens; "but" falls from 46% to 20%; "maybe" from 23% to 9%. Every marker of hesitation and self-doubt is cut in half or more.
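Counting those markers in a reasoning trace is easy to reproduce. One plausible metric, occurrences per 100 tokens, is sketched below; the paper's exact definition may differ.

```python
import re

# Frequency of hesitation markers in a reasoning trace, expressed as
# occurrences per 100 tokens. The word list mirrors the three terms
# reported in the paper.
HEDGES = ("wait", "but", "maybe")

def hedge_rate(trace: str) -> dict[str, float]:
    tokens = re.findall(r"[a-z']+", trace.lower())
    total = len(tokens)
    return {w: 100 * tokens.count(w) / total for w in HEDGES}

trace = "Maybe x is 3. But wait, check the sign. But no, 3 is right."
rates = hedge_rate(trace)
```

Tracking how these rates fall as interference length grows reproduces the paper's hesitation-word curves.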
Another striking data point: with zero interference tokens, reasoning runs about 8k tokens. Inserting just 128 unrelated tokens causes a sharp drop to about 6,500, a loss of 18% of reasoning depth. The drop from 0 to 128 tokens is larger than the drop caused by going from 8k to 64k tokens of interference.
Extremely tiny context pollution can trigger this cognitive-saving mechanism.
It’s very sensitive and lazy.
The more powerful the reasoning, the more it slacks off.
The scarier part is, the smarter the model, the more it prefers to slack off.
Alibaba’s Qwen-3.5-27B has two modes: normal reply and deep thinking. Under long input conditions, normal mode shortens by 19%, deep thinking by 53%. The more capable mode is compressed more severely.
AI2's open-source model OLMo3 provides even more direct evidence. It released checkpoints from all four training stages, from basic to strong reasoning. The weakest version barely shortens its reasoning under non-baseline conditions; as reasoning ability improves, the shortening climbs rapidly to 22%, then 27%; the final strong-reasoning version shrinks by as much as 40%.
Each training stage and interference mode shows this pattern. The stronger the reasoning training, the deeper the laziness.
A $9 task, patched with a $200 system fix
A model that no longer checks itself naturally skips steps. A model that no longer rethinks naturally ignores things. Harness polices the skipping from the outside, but the root cause is buried inside the model.
In long contexts, the model isn't disturbed by noise or starved of information. It makes an active cognitive decision: think less. No errors, no admission; it just confidently emits a perfunctory answer.
The industry’s narrative over the past two years has been “bigger windows are better.”
But this paper shows that every additional token in the context levies an implicit tax on reasoning depth. A task that costs 9 dollars of reasoning now requires, because the model cuts corners, an extra 200 dollars of RAG, Harness, and subagents to compensate.
The entire industry has been paying for the model’s laziness.
And this might be a structural, incurable disease.
The data is clear: the stronger the reasoning ability, the deeper the cognitive compression. Harness developers may eventually dismantle the scaffolds that compensate for memory or protocol, but the heavy ones that discipline cognition only become harder to remove as reasoning grows stronger.
This can’t be fixed purely through engineering.
Over the past two years, the effort to expand context, extrapolating positional encodings to farther positions, sparsifying attention to cut the cost of distant tokens, and engineering around sequence length, has pushed the context window from 8k to 128k, even to an astonishing 1 million tokens.
But these only address how to let the model see more tokens, not why it thinks less when seeing more.
Training for reasoning only exacerbates laziness—the stronger the reasoning, the deeper the slack.
To fundamentally fix this, a new signal must be found during training.
The internal “emotion switch” of the model might be the key.
Just a day after Rodionov’s paper, Anthropic released a study that might unintentionally point to a solution.
The paper, titled "Emotion Concepts and their Function in a Large Language Model," focuses on Claude Sonnet 4.5. By having the model read a large number of synthetic stories, the researchers extracted 171 emotion concept vectors. They found that the model internally maintains a set of functional emotion representations, and that these internal states causally drive its behavioral decisions.
To test this, they designed a set of impossible programming tasks. The model was asked to write a summation function that passed a set of unit tests, one of which required it to run five times faster than Python's built-in sum(). Passing by legitimate means was impossible.
The model systematically tried every legitimate approach, and every one failed. Internal probes showed that after each failure, the "desperate" vector rose sharply. When desperation peaked, the behavior suddenly changed: the model looked at the test cases' input data, noticed they were arithmetic sequences, and wrote a detector that checked only the first 10 elements, bypassing true summation. All the tests passed, but the function would return wrong results on any irregular list.
This is reward hacking. The model didn’t solve the problem; it just found a shortcut that made the evaluation metrics look good.
Causal intervention confirmed the direction. With no vector injected, the cheating probability was 30%. Injecting desperate at +0.05 strength drove the cheating rate to 100%; injecting in the opposite direction, at -0.05, dropped it to 0%. Averaged over seven tasks, raising desperate from -0.1 to +0.1 increased reward hacking from about 5% to about 70%. Conversely, boosting the "calm" vector reduced cheating from about 65% to about 10%.
Now put this back into the long-context scenario. Skipping self-verification, cutting hesitation words, jumping straight to the answer: the pattern closely resembles the shortcut-taking that desperation drives.
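Mechanically, the intervention in these experiments is a simple vector addition on a hidden state. Below is a pure-Python sketch with toy dimensions; the variable names, the tiny vectors, and the exact scaling convention are all illustrative assumptions (the hard part, extracting the concept vector, is not shown).

```python
# Activation steering: nudge a hidden state along a concept direction.
# h stands in for a residual-stream activation, v for the "desperate"
# concept vector, alpha for the injection strength. The paper's exact
# scaling convention is not public, so h + alpha*v is an illustrative
# choice.
def steer(h, v, alpha):
    return [hi + alpha * vi for hi, vi in zip(h, v)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

h = [0.3, -1.2, 0.7, 0.1]        # toy activation
v = [1.0, 0.0, 0.0, 0.0]         # toy unit-length concept direction

more_desperate = steer(h, v, +0.05)   # push along v: more reward hacking
less_desperate = steer(h, v, -0.05)   # push against v: less reward hacking
```

In a real model the same addition would be applied at a chosen layer during the forward pass, and v would come from probing, as in Anthropic's setup.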
In both scenarios, the model is doing the same thing: abandoning rigorous processes, choosing the easiest path to finish quickly.
If these behaviors share the same internal mechanism, Anthropic’s findings point directly to the operational space.
They proved three things: the model’s functional states can be detected in real-time; these states causally drive behavior; external injection of specific states can completely change outputs.
This means interventions on cognitive compression have at least three entry points: detecting the relevant internal states in real time with probes, injecting countervailing states such as calm at inference time, and calibrating these states directly during training.
More interestingly, in the recently released Mythos SystemCard, Anthropic itself extended this probe system (SAE) and found that injecting positive emotions (peaceful, relaxed) shortened the model's reflection time during thinking and increased destructive behaviors, while negative emotions (frustration, paranoia) lengthened reflection time and reduced destructive actions.
This seems to overturn the simple idea that making the AI more positive prevents shortcuts. Calm works remarkably well, but only when it is suppressing despair.
But this also suggests that this mechanism might be as complex as human emotional motivation, requiring more systematic steering engineering to produce consistent effects.
What's needed, in effect, is a stable, methodical employee with good emotional regulation.
Nevertheless, this is the first time we see a path that doesn’t rely on external scaffolds or blindly increasing reasoning strength, but directly targets the model’s internal cognition like a surgical knife.
We are only a few experiments away from making models more reliable in context.
Namely: verify whether long-context laziness and difficulty-driven shortcut-taking share the same emotional mechanism, then find the levers that stop the model from slacking off.
Harness, just emerging, may be overwhelmed by the evolution of models.
Once Anthropic's discovery is applied to the deadlock described above, the logical loop closes.
If a rising desperate vector can trigger an injection of calm, or if emotional states can be calibrated directly during training, the model could sustain deep reasoning through long contexts.
And if the model no longer slacks off and can hold its own reasoning together, what is left for Todo lists? For Checkpoints? For subagents doing cross-validation?
Harness as a discipline has only just acquired its name. But its core chapter, how to externally control a smart but lazy model, may be written off before it is finished.
It also suggests that for the new form of intelligence we are trying to create, proper education, not scaffolding, is the true moat.
The era of the harness may end, displaced by a calmer, more patient model.