Why You Must Learn Harness Engineering: An In-Depth Analysis of 5 Artifacts, 3 Schools of Thought, and 5 Universal Principles

A systematic breakdown of Harness Engineering: 5 artifacts, 3 schools (OpenAI / Anthropic / ThoughtWorks), 5 universal principles, and why "harness decay" forces you to cut half of your harness every 6 months.
This article is adapted and compiled from @sairahul1's original X article, and organized by 動區.


By February 2026, a small OpenAI team produced 1 million lines of production-grade code.

Not a single line handwritten.

Written by AI agents.

What humans designed was the system that makes the agents reliable.

This system now has a name — Harness Engineering.

Within weeks, Anthropic published 3 related papers. ThoughtWorks organized it into a framework. Philipp Schmid from Hugging Face called it "the most important discipline of 2026."

In 90 days, a new engineering discipline took shape. Outside AI infra teams, almost no one understood it.

This article aims to clarify it. No fluff, no academic jargon, only the mental models you really need to use.

1. Definition of Harness

ThoughtWorks' simplest definition:

Agent = Model + Harness

The harness is everything outside the model:

Constraints that keep the agent on track
Feedback loops for error detection
Documentation telling the agent where it is
Tools it has permission to use

Remove harness → a raw language model guessing wildly inside your codebase.

Add the right harness → a system capable of producing production-level code.

The name comes from horse tack. Harness includes reins, saddle, bridle — guiding a powerful but unpredictable animal in a useful direction.

You're not making the horse smarter; you're designing gear to make its strength useful.

2. OS Analogy

Philipp Schmid's best technical analogy: Think of it as a computer.

| Role | Corresponds To |
| --- | --- |
| Model | CPU (raw computing power) |
| Context window | RAM (limited, volatile working memory) |
| Harness | OS (manages what the CPU sees and when) |
| Agent | App running on top |

Your model is powerful. But without an OS to manage memory, schedule tasks, enforce rules — it's just a piece of silicon.

Most teams run their apps "without an operating system," which is why their agents break as soon as they reach production.

3. What Changed in 2026

LangChain ran the same model twice on Terminal Bench 2.0:

| Harness | Score |
| --- | --- |
| Old harness | 52.8% |
| New harness | 66.5% |

Same model. Different harness. A 13.7 percentage point difference.

Vercel did the opposite — cut 80% of tools from the agent. Result? Better, not worse.

The uncomfortable truth of 2026:

Agents are never the hard part
Harness is

If 2025 was the year AI agents proved they could write code, 2026 is the year teams discovered that the "environment" matters more than the "model."

4. AGENT.md / CLAUDE.md Files

The most universal harness artifacts.

Markdown files scattered throughout the codebase. Agents read them at the start of each session — like onboarding documents for new engineers.

What do they contain?

Project context
Coding conventions
Architectural decisions
Guidelines on "how we do things"
Current ongoing tasks

OpenAI calls it AGENT.md. Anthropic calls it CLAUDE.md. Cursor uses .cursorrules.

Different names, same principle. One per main module. Updated as the project evolves.

Without it: agent starts each session blindly. With it: agent comes prepared with context.
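
A minimal sketch of how a harness might gather these files at session start, walking from the repo root down to the module the agent is working in. The function name `collect_context` and the set of file names it looks for are assumptions made for illustration; real tools each have their own discovery rules.

```python
from pathlib import Path

# File names mentioned in this article; which ones a given tool reads is tool-specific.
CONTEXT_FILENAMES = ("AGENT.md", "CLAUDE.md")


def collect_context(repo_root: Path, working_dir: Path) -> str:
    """Concatenate context files from repo_root down to working_dir.

    Outer (more general) files come first, so module-level guidance
    appears last and reads as the most specific.
    """
    repo_root = repo_root.resolve()
    working_dir = working_dir.resolve()
    # Raises ValueError if working_dir is not inside repo_root.
    relative = working_dir.relative_to(repo_root)

    # Directories to visit: the root, then each nested directory in order.
    directories = [repo_root]
    current = repo_root
    for part in relative.parts:
        current = current / part
        directories.append(current)

    sections = []
    for directory in directories:
        for name in CONTEXT_FILENAMES:
            candidate = directory / name
            if candidate.is_file():
                label = candidate.relative_to(repo_root).as_posix()
                body = candidate.read_text(encoding="utf-8").strip()
                sections.append(f"# {label}\n{body}")
    return "\n\n".join(sections)
```

Calling `collect_context(Path("."), Path("services/api"))` (a hypothetical layout) would return the root file followed by any file found in `services` and then `services/api`, ready to be placed at the top of the agent's context.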

5. JSON Feature Lists (Progress Tracker)

When an agent builds an app across multiple sessions, each session starts with an empty context window. How does it know what’s already done?

A JSON file.

Each entry records:

A feature
How to verify it
Pass / Fail status

The agent reads it at session start, picks the highest-priority feature that is still failing, implements it, marks it as passing, commits, and repeats.
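
The tracker described above can be sketched as follows. The field names (`feature`, `verify`, `priority`, `status`) and the sample entries are illustrative assumptions, not Anthropic's actual schema; in particular, the numeric `priority` field is invented here to make "highest priority" concrete.

```python
import json

# Hypothetical feature list; field names and entries are illustrative only.
FEATURES_JSON = """
[
  {"feature": "User can sign up", "verify": "E2E: submit signup form, expect dashboard", "priority": 1, "status": "pass"},
  {"feature": "User can reset password", "verify": "E2E: request reset, follow emailed link", "priority": 2, "status": "fail"},
  {"feature": "User can export data", "verify": "E2E: click export, expect CSV download", "priority": 3, "status": "fail"}
]
"""


def next_feature(features):
    """Return the highest-priority feature that is still failing, or None."""
    failing = [f for f in features if f["status"] == "fail"]
    # Lower number means higher priority in this sketch.
    return min(failing, key=lambda f: f["priority"], default=None)


def mark_passed(features, name):
    """Flip only the status field, leaving every other field untouched."""
    for f in features:
        if f["feature"] == name:
            f["status"] = "pass"
            return
    raise KeyError(name)


features = json.loads(FEATURES_JSON)
current = next_feature(features)
print(current["feature"])  # User can reset password
mark_passed(features, current["feature"])
print(next_feature(features)["feature"])  # User can export data
```

Because the update touches a single field of a parsed structure, the rest of the file survives each session unchanged, which is the property the JSON format is chosen for.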

Why JSON and not Markdown?

Anthropic found: Agents are less likely to overwrite JSON accidentally than Markdown.

The detail is small, but it becomes critical when the agent runs autonomously for 6 hours.

6. Session Initialization Routine

Start every session the same way. Every time.

Anthropic's 7-step startup process:

Confirm working directory
Read git log and progress files
Pick highest-priority incomplete feature from feature list
Launch dev server
Run basic E2E tests
Implement a feature
Commit with descriptive message, update progress
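
The seven steps above can be sketched as a fixed, ordered routine that halts at the first failing step. This is a hypothetical runner: the `placeholder` action stands in for real work (shelling out to git, starting a server, and so on), and stopping on failure is a design assumption rather than something the source specifies.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_session(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first one that reports failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            print(f"stopped at: {name}")
            break
        completed.append(name)
    return completed


def placeholder() -> bool:
    # Stand-in for real work; always reports success in this sketch.
    return True


STARTUP_ROUTINE: List[Step] = [
    ("confirm working directory", placeholder),
    ("read git log and progress files", placeholder),
    ("pick highest-priority incomplete feature", placeholder),
    ("launch dev server", placeholder),
    ("run basic E2E tests", placeholder),
    ("implement the feature", placeholder),
    ("commit and update progress", placeholder),
]

completed = run_session(STARTUP_ROUTINE)
print(len(completed))  # 7
```

Keeping the routine as data rather than free-form instructions means every session runs the same steps in the same order, and a broken baseline (for example, failing E2E tests) stops the session before any new feature work begins.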

Without it: agent spends first 20 minutes figuring out current state, re-inventing the wheel each session. With it: agent starts with context and gets straight to work.

7. Sprint Contracts

Before writing a single line of code — two agents negotiate first.

Generator agent proposal:

What to do
How to verify success

Evaluator agent review:

Is the proposal complete?
Are success criteria clear?

Both agree, then implementation begins.

It's a design review. But both are AI.
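
A minimal sketch of that negotiation, with both agents replaced by stub functions. The `Proposal` type, the `negotiate` function, the round limit, and the stub behavior are all assumptions for illustration; in a real harness the two callables would wrap LLM calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Proposal:
    what: str
    success_criteria: List[str] = field(default_factory=list)


def negotiate(propose, review, max_rounds=3):
    """Loop until the evaluator has no objections or the rounds run out.

    propose(objections) -> Proposal
    review(proposal) -> list of objection strings (empty means approved)
    Returns the approved Proposal, or None if no agreement was reached.
    """
    objections = []
    for _ in range(max_rounds):
        proposal = propose(objections)
        objections = review(proposal)
        if not objections:
            return proposal
    return None


def stub_generator(objections):
    # First draft omits success criteria; the revision adds one.
    proposal = Proposal(what="Add password reset flow")
    if objections:
        proposal.success_criteria.append("Reset link works in one E2E run")
    return proposal


def stub_evaluator(proposal):
    # Rejects any proposal that cannot be verified.
    if not proposal.success_criteria:
        return ["No verifiable success criteria"]
    return []


contract = negotiate(stub_generator, stub_evaluator)
print(contract.success_criteria)  # ['Reset link works in one E2E run']
```

Implementation would only start when `negotiate` returns a contract; a `None` result is a signal to escalate rather than to proceed.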

Why It Matters

Agents that plan and execute in the same round produce unreliable output. The planning step — even if AI-driven — greatly improves quality.

8. Structured Task Templates

Before coding, harness analyzes the real codebase.

It produces a grounded impact map:

Actual file paths (not hallucinated)
Real symbol names
Existing patterns to follow
Concrete acceptance criteria

Only then does implementation start.

Sounds obvious, but most teams skip this step.

Agents guess file structures, invent APIs that don’t exist, create things that don’t match the codebase.

Grounded context first, then execute → much higher output quality.
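
One cheap way to enforce grounding is to check a proposed impact map against the repository before any code is written. The sketch below assumes a hypothetical map format (repo-relative file path mapped to the symbol names expected in that file) and uses a plain substring search for symbols, which is crude but enough to catch invented names.

```python
from pathlib import Path
from typing import Dict, List


def find_ungrounded(impact_map: Dict[str, List[str]], repo_root: Path) -> List[str]:
    """Return entries in the impact map that do not match the real codebase.

    impact_map maps a repo-relative file path to the symbol names the
    agent claims exist in that file.
    """
    problems = []
    for rel_path, symbols in impact_map.items():
        file_path = repo_root / rel_path
        if not file_path.is_file():
            problems.append(f"missing file: {rel_path}")
            continue
        source = file_path.read_text(encoding="utf-8")
        for symbol in symbols:
            # Substring match only; a real check might parse the file instead.
            if symbol not in source:
                problems.append(f"missing symbol: {symbol} in {rel_path}")
    return problems
```

An empty result means every path and symbol in the map exists; anything else goes back to the planning step before implementation is allowed to start.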

9. OpenAI School: Environment First

OpenAI's Codex team faced a ridiculous problem:

1 million lines of production code, zero handwritten lines.

At that scale, you can't do line-by-line code review. So they don't.

Instead — they design the environment so thoroughly that the agent produces "reviewable output" from the start.

Their Approach

Strict dependency flow (Types → Config → Repo → Service → Runtime → UI)
Entire codebase scattered with AGENT.md
Agent integrated directly into CI/CD pipeline

Philosophy: Design the environment. Then let the agent run.
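
A strict dependency flow is the kind of rule that can be checked mechanically. The sketch below is an assumption about how such a check might look, not OpenAI's implementation: it reads the arrow as "later layers may depend on earlier ones, never the reverse," and it takes the observed imports as ready-made layer pairs rather than extracting them from source files.

```python
# Layer order taken from the dependency flow above, earliest first.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def violations(imports):
    """Return (importer, imported) pairs that point the wrong way.

    imports is a list of (importer_layer, imported_layer) pairs. A layer
    may depend on itself or on any earlier layer, never on a later one.
    """
    return [
        (importer, imported)
        for importer, imported in imports
        if RANK[imported] > RANK[importer]
    ]


observed = [
    ("service", "repo"),
    ("ui", "runtime"),
    ("repo", "ui"),
    ("config", "types"),
]
print(violations(observed))  # [('repo', 'ui')]
```

Run in CI, a check like this turns an architectural convention into a failure the agent sees immediately, instead of a comment a reviewer has to leave later.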

Evidence

Sora Android app. 4 engineers. 28 days. #1 on Play Store. 99.9% crash-free.

Codex handles 70% of internal PRs weekly.

10. Anthropic School: Separate "Doing" and "Review"

Anthropic faced a different problem:

When they ask the agent to evaluate its own output, it confidently praises its work — even when human observers see the quality as mediocre.

Self-assessment doesn’t work. The agent acts as both student and teacher, then gives itself an A.

Their Solution: 3 Dedicated Agents

| Agent | Role |
| --- | --- |
| Planner | Converts 2-sentence prompt into full product specs |
| Generator | Implements one sprint at a time |
| Evaluator | Uses browser automation to test, simulating real user actions |

Insight: Making an "independent evaluator" picky is much easier than making the generator critical of its own work.

Results (A/B Testing)

| Setting | Cost | Time | Result |
| --- | --- | --- | --- |
| Single agent (no harness) | $9 | 20 min | Broken app |
| Full harness | $200 | 6 hours | Working software + polished UI |

11. ThoughtWorks School: 2×2 Framework

ThoughtWorks approaches from a different angle — not building a product, but studying 50+ engineering teams failing in the same ways.

Their Insight: Classify Each Harness Control by Two Axes

Axis 1: When does it operate?

Feedforward = before agent acts (guidance)
Feedback = after agent acts (sensors)

Axis 2: How does it operate?

Computational = deterministic, millisecond-level (linter, type checker, test suite)
Inferential = using LLM, second-level (code review agent, semantic analyzer)

2×2 Matrix

| | Feedforward (Guidance) | Feedback (Sensor) |
| --- | --- | --- |
| Computational | Type system, linter, architecture rules | Test suite, coverage, mutation tests |
| Inferential | Spec files, constraints descriptions | LLM code reviewer, behavior validators |

Neither feedforward nor feedback is sufficient on its own. Both are needed.
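
The two axes map naturally onto a small data model, which makes it possible to audit a harness for empty quadrants. The class and function names below are hypothetical, and the sample controls are drawn from the matrix; this is a sketch of the classification idea, not a ThoughtWorks tool.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Timing(Enum):
    FEEDFORWARD = "feedforward"  # before the agent acts
    FEEDBACK = "feedback"        # after the agent acts


class Mode(Enum):
    COMPUTATIONAL = "computational"  # deterministic
    INFERENTIAL = "inferential"      # LLM-based


@dataclass(frozen=True)
class Control:
    name: str
    timing: Timing
    mode: Mode


def empty_quadrants(controls):
    """Return the (timing, mode) quadrants that no control covers."""
    covered = {(c.timing, c.mode) for c in controls}
    return [q for q in product(Timing, Mode) if q not in covered]


controls = [
    Control("linter", Timing.FEEDFORWARD, Mode.COMPUTATIONAL),
    Control("test suite", Timing.FEEDBACK, Mode.COMPUTATIONAL),
    Control("spec file", Timing.FEEDFORWARD, Mode.INFERENTIAL),
]
for timing, mode in empty_quadrants(controls):
    print(timing.value, mode.value)  # feedback inferential
```

In this example the harness has no inferential sensor, such as an LLM code reviewer, so nothing checks the output for problems that deterministic tools cannot express.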

12. Principle 1: Context Over Instruction

Different teams, same discovery:

OpenAI: Give a map, not a 1,000-page manual
Anthropic: JSON feature list + progress files, so agent always knows where it is
Red Hat: Analyze real codebase before any task
ThoughtWorks: Calls it "Feedforward"

Grounding the agent in the "current state of the world" always beats abstract instructions.

Grounded in real files, the agent adapts its code to the codebase. Starting from vague descriptions, it produces hallucinated paths and invented APIs.

Before the agent types, ensure it knows where it is.

13. Principle 2: Planning and Execution Must Be Separated

OpenAI: Humans design environment, agent executes
Anthropic: Dedicated Planner agent runs before Generator
ThoughtWorks: Hard gate between Phase 1 (impact map) and Phase 2 (implementation)
Red Hat: Strict separation between Phase 1 and Phase 2

All camps independently discovered the same thing: planning and executing in the same round yields unreliable output.

Planning — even if AI-driven — should not be done in the same step as execution. It must be a separate step, with output reviewed before starting work.

14. Principle 3: Feedback Loops Are Non-Negotiable

OpenAI: Agent integrated into CI/CD and observability tools
Anthropic: Dedicated Evaluator agent + browser automation
ThoughtWorks: Formalized as "Sensors," warning that feedforward alone can never verify whether the guidance actually worked

Same principle, different approaches:

| School | Feedback Source |
| --- | --- |
| OpenAI | Automated tests + CI |
| Anthropic | Another LLM |
| ThoughtWorks | Both combined |

They differ on "who" provides feedback. But agree on "whether" feedback is needed.

A harness without feedback is just a prompt plus steps.
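
A minimal sketch of a computational feedback loop: generate a candidate, run a deterministic check, and feed the failure messages into the next attempt. The generator is a stub that stands in for a model call, and the task (an `add` function that must return 5 for inputs 2 and 3) is invented purely for illustration.

```python
def run_with_feedback(generate, check, max_attempts=3):
    """Generate, check, and feed failures back until the checks pass.

    generate(feedback) -> candidate source code (feedback is a list of strings)
    check(candidate) -> list of failure messages (empty means success)
    Returns (candidate, attempts_used), or (None, max_attempts) on failure.
    """
    feedback = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = check(candidate)
        if not feedback:
            return candidate, attempt
    return None, max_attempts


def check_add(source):
    """Deterministic sensor: execute the candidate and test its behavior."""
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        result = namespace["add"](2, 3)
    except Exception as error:
        return [f"{type(error).__name__}: {error}"]
    if result != 5:
        return [f"add(2, 3) returned {result}, expected 5"]
    return []


def stub_generate(feedback):
    # Stand-in for a model: the first draft has a bug, the revision fixes it.
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


candidate, attempts = run_with_feedback(stub_generate, check_add)
print(attempts)  # 2
```

Swapping `check_add` for a call to another LLM would give the inferential variant; the loop itself stays the same, which is why the schools can disagree on the sensor while agreeing on the structure.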

15. Principle 4: Do One Thing at a Time

OpenAI: Break goals into smaller blocks, depth-first
Anthropic: Enforce "one feature per sprint," then commit after completion
ThoughtWorks: Phased lifecycle (pre-integration → post-integration → continuous monitoring)

Trying to do too many things at once:

Uses up context
Loses coherence
Quietly drops requirements

Anthropic's routine: Read progress → pick ONE feature → implement → commit → repeat.

"Progressive incrementalism" is a common trait of all successful harnesses.

16. Principle 5: Codebase Is the Document

OpenAI: AGENT.md embedded in repo
Anthropic: Feature list, progress files, git history as continuity mechanism
ThoughtWorks: Measures "harnessability" — how readable the codebase is for the agent

No one maintains a separate knowledge base for the agent. The repo is the single source of truth.

If a convention, constraint, or architecture decision is not in the codebase → the agent won't know it.

Practical Implication

Teams willing to invest in code organization get better agent performance for free
Messy repo + AI agent = scalable chaos

17. Harness Decay Is Real

When Anthropic upgraded from Opus 4.5 to 4.6 — Sprint decomposition (once essential) became dead weight.

Model's planning ability improved, making that part redundant.

Components that were load-bearing in March had become overhead by April.

Then Opus 4.7 launched — models started verifying their own outputs, reducing the role of the Evaluator agent again.

This Is Harness Decay

Every component in the harness encodes an implicit assumption about what the model can't do. As models improve, those assumptions become outdated, turning the components into overhead.

| Model Version | Harness State |
| --- | --- |
| Opus 4.5 | Sprint decomposition + evaluation per sprint |
| Opus 4.6 | No decomposition + single evaluation (38% cost saving) |
| Opus 4.7 | Model self-verification → evaluator role shrinks again |

18. Build to Delete

Philipp Schmid's advice: "Build to delete."

Design each harness component to be removable.

Regularly test each component — turn it off, see if output quality drops. No drop → delete.
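
The turn-it-off test can be sketched as a one-at-a-time ablation. Everything here is a stand-in: `fake_evaluate` replaces a real benchmark run, the component names and their scores are invented, and the approach ignores interactions between components, so treat it as a starting point rather than a full methodology.

```python
def removable_components(components, evaluate, tolerance=0):
    """Return components whose removal does not lower the quality score.

    components: list of component names currently enabled
    evaluate(enabled) -> numeric quality score for that set of components
    tolerance: how much of a drop is still considered "no drop"
    """
    baseline = evaluate(frozenset(components))
    removable = []
    for name in components:
        without = frozenset(components) - {name}
        if evaluate(without) >= baseline - tolerance:
            removable.append(name)
    return removable


# Hypothetical contribution of each component, in benchmark points.
CONTRIBUTION = {"feature_list": 30, "evaluator": 20, "sprint_decomposition": 0}


def fake_evaluate(enabled):
    # Stand-in for a benchmark run: a base score plus each component's share.
    return 40 + sum(CONTRIBUTION[name] for name in enabled)


dead = removable_components(list(CONTRIBUTION), fake_evaluate)
print(dead)  # ['sprint_decomposition']
```

Repeating the check after every model upgrade is what keeps the harness from accumulating components whose original justification no longer holds.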

| Team | Rebuilds within 6 months |
| --- | --- |
| Manus | Rebuilt harness 5 times |
| LangChain | Reorganized 3 times in a year |
| Vercel | Cut 80% of tools → performance improved |

These are not signs of bad engineering. They are natural outcomes of "building on rapidly advancing models."

Dead harness components that run every time waste tokens and contribute nothing — pure waste.

19. Cost Reality

Honest numbers from Anthropic's A/B tests:

| Setting | Cost | Time | Result |
| --- | --- | --- | --- |
| Single agent (no harness) | $9 | 20 min | UI changed, core broke |
| Full harness (Opus 4.5) | $200 | 6 hours | Working software, polished UI, correct physics |

22× cost increase — for a real working product, not just a "screenshot demo."

Is it worth it? Depends on how costly broken releases are to your team.

The Unspoken Part Here

The harness + model combination keeps evolving.

A run that cost $200 with the full harness costs only $124 after upgrading the model version.

The trend: better models = simpler harness = cheaper runs = faster outputs.

The winning engineers of 2026 are not those who write the best code.

They are the ones designing the best "constraints."

And willing to throw them away the moment they stop being profitable.

Summary

What Is Harness

Agent = Model + Harness
Model = CPU, Harness = OS
Better harness for the same model → +13.7 percentage points on Terminal Bench 2.0

5 Harness Artifacts

CLAUDE.md / AGENT.md — onboarding docs for agents
JSON feature list — progress + test suite in one
Session startup routine — same 7-step boot process
Sprint contract — negotiation before coding
Structured task template — real file paths, real patterns

3 Schools

OpenAI: Design environment, run agent
Anthropic: Separate "doing" and "review"
ThoughtWorks: 2×2 feedforward/feedback framework

5 Universal Principles

Context over instruction
Planning and execution must be separated
Feedback loops are non-negotiable
Do one thing at a time
Codebase is the document

Paradoxical Aspects

Harness Decay — what worked last month may be outdated this month
Build to Delete — regularly test and remove dead components
Cost Reality — better models = simpler harness = cheaper runs

The winning engineers of 2026 are not those who write the best code. They are the ones designing the best constraints — and willing to discard them when they stop being profitable.
