Why You Must Learn Harness Engineering: An In-Depth Analysis of 5 Artifacts, 3 Schools of Thought, and 5 Universal Principles
A systematic breakdown of Harness Engineering: 5 artifacts, 3 schools (OpenAI / Anthropic / ThoughtWorks), 5 universal principles, and why "harness decay" forces you to cut half of your harness every 6 months.
This article is adapted and compiled from @sairahul1's original X article, and edited by 動區.
(Background: Introduction to Harness Engineering (AI Governance Engineering): OpenAI's Latest Programming Standards, Teaching You How to Reach Lv.1 Easily)
(Additional context: YC CEO shares AI secrets: The future belongs to those who build compound interest information systems)
By February 2026, a small OpenAI team produced 1 million lines of production-grade code.
Not a single line handwritten.
Written by AI agents.
What humans designed was the system that makes the agents reliable.
This system now has a name — Harness Engineering.
Within weeks, Anthropic published 3 related papers. ThoughtWorks organized it into a framework. Philipp Schmid from Hugging Face called it "the most important discipline of 2026."
In 90 days, a new engineering discipline took shape. Outside AI infra teams, almost no one understands it.
This article aims to clarify it. No fluff, no academic jargon, only the mental models you actually need.
1. Definition of Harness
ThoughtWorks' simplest definition:
Harness is everything outside the model.
Remove harness → a raw language model guessing wildly inside your codebase.
Add the right harness → a system capable of producing production-level code.
The name comes from horse tack. Harness includes reins, saddle, bridle — guiding a powerful but unpredictable animal in a useful direction.
You're not making the horse smarter; you're designing gear to make its strength useful.
2. OS Analogy
Philipp Schmid's best technical analogy: Think of it as a computer.
| Role | Corresponds To |
| --- | --- |
| Model | CPU (raw computing power) |
| Context window | RAM (limited, volatile working memory) |
| Harness | OS (manages what the CPU sees and when) |
| Agent | App running on top |
Your model is powerful. But without an OS to manage memory, schedule tasks, enforce rules — it's just a piece of silicon.
Most teams run their apps "without an operating system." That is why their agents break as soon as they go into production.
3. What Changed in 2026
LangChain ran the same model twice on Terminal Bench 2.0:
| Harness | Score |
| --- | --- |
| Old harness | 52.8% |
| New harness | 66.5% |
Same model. Different harness. A 13.7 percentage point difference.
Vercel did the opposite — cut 80% of tools from the agent. Result? Better, not worse.
The uncomfortable truth of 2026:
If 2025 was the year AI agents proved they could write code, 2026 is the year teams discovered that the "environment" matters more than the "model."
4. AGENTS.md / CLAUDE.md Files
The most universal harness artifacts.
Markdown files scattered throughout the codebase. Agents read them at the start of each session, like onboarding documents for new engineers.
What do they contain?
OpenAI calls it AGENTS.md. Anthropic calls it CLAUDE.md. Cursor uses .cursorrules.
Different names, same principle. One per main module. Updated as the project evolves.
Without it: agent starts each session blindly. With it: agent comes prepared with context.
5. JSON Feature Lists (Progress Tracker)
When an agent builds an app across multiple sessions, each session starts with an empty context window. How does it know what's already done?
A JSON file.
Each entry records:
The agent reads it at session start, picks the highest-priority failing feature, implements it, marks it as passing, commits, and repeats.
Why JSON and not Markdown?
Anthropic found: Agents are less likely to overwrite JSON accidentally than Markdown.
The detail is small, but it becomes critical when the agent runs autonomously for 6 hours.
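The read, pick, mark, repeat loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's actual format: the field names (`id`, `description`, `priority`, `status`) and the convention that a lower `priority` number is more urgent are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import List, Optional

# Hypothetical schema (an assumption for illustration): each entry has an id,
# a description, a priority (lower = more urgent) and a status of "fail" or "pass".
FEATURES = [
    {"id": "login", "description": "User can log in", "priority": 2, "status": "fail"},
    {"id": "signup", "description": "User can sign up", "priority": 1, "status": "fail"},
    {"id": "landing", "description": "Landing page renders", "priority": 3, "status": "pass"},
]


def load_features(path: Path) -> List[dict]:
    """Read the tracker at session start."""
    return json.loads(path.read_text(encoding="utf-8"))


def save_features(path: Path, features: List[dict]) -> None:
    """Write the tracker back so the next session sees the progress."""
    path.write_text(json.dumps(features, indent=2), encoding="utf-8")


def next_feature(features: List[dict]) -> Optional[dict]:
    """Return the most urgent feature that is still failing, or None if all pass."""
    failing = [f for f in features if f["status"] == "fail"]
    return min(failing, key=lambda f: f["priority"]) if failing else None


def mark_pass(features: List[dict], feature_id: str) -> None:
    """Flip exactly one feature to "pass" after its checks succeed."""
    for f in features:
        if f["id"] == feature_id:
            f["status"] = "pass"
            return
    raise KeyError(feature_id)
```

A session would call `load_features`, work on whatever `next_feature` returns, then call `mark_pass` and `save_features` before committing. Because each update touches a single field of a single entry, the rest of the file is left alone.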
6. Session Initialization Routine
Start every session the same way. Every time.
Anthropic's 7-step startup process:
Without it: agent spends first 20 minutes figuring out current state, re-inventing the wheel each session. With it: agent starts with context and gets straight to work.
7. Sprint Contracts
Before writing a single line of code — two agents negotiate first.
Generator agent proposal:
Evaluator agent review:
Both agree, then implementation begins.
It's a design review. But both are AI.
Why It Matters
Agents that plan and execute in the same round produce unreliable output. The planning step — even if AI-driven — greatly improves quality.
8. Structured Task Templates
Before coding, harness analyzes the real codebase.
It produces a grounded impact map:
Only then does implementation start.
Sounds obvious, but most teams skip this step.
Agents guess file structures, invent APIs that don’t exist, create things that don’t match the codebase.
Grounded context first, then execute → much higher output quality.
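One small piece of this grounding step can be sketched as a path check: before implementation starts, confirm that every file the agent says it will modify actually exists in the repository. The function name `split_grounded` and its inputs are hypothetical, and a real impact map would cover much more than file paths.

```python
from pathlib import Path
from typing import List, Tuple


def split_grounded(repo_root: Path, proposed_paths: List[str]) -> Tuple[List[str], List[str]]:
    """Split proposed relative paths into (existing files, missing files)."""
    existing: List[str] = []
    missing: List[str] = []
    for rel in proposed_paths:
        # A path counts as grounded only if it is a real file under the repo root.
        if (repo_root / rel).is_file():
            existing.append(rel)
        else:
            missing.append(rel)
    return existing, missing
```

If `missing` is non-empty, the harness can send the plan back to the agent instead of letting it write against a file it imagined. Files the agent intends to create are a legitimate exception and would need to be listed separately.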
9. OpenAI School: Environment First
OpenAI's Codex team faced a ridiculous problem:
At that scale, you can't do line-by-line code review. So they don't.
Instead — they design the environment so thoroughly that the agent produces "reviewable output" from the start.
Their Approach
Philosophy: Design the environment. Then let the agent run.
Evidence
Sora Android app. 4 engineers. 28 days. #1 on Play Store. 99.9% crash-free.
Codex handles 70% of internal PRs weekly.
10. Anthropic School: Separate "Doing" and "Review"
Anthropic faces another problem:
When they ask the agent to evaluate its own output, it confidently praises its work — even when human observers see the quality as mediocre.
Self-assessment doesn’t work. The agent acts as both student and teacher, then gives itself an A.
Their Solution: 3 Dedicated Agents
| Agent | Role |
| --- | --- |
| Planner | Converts 2-sentence prompt into full product specs |
| Generator | Implements one sprint at a time |
| Evaluator | Uses browser automation to test, simulating real user actions |
Insight: Making an "independent evaluator" picky is much easier than making the generator critical of its own work.
Results (A/B Testing)
| Setting | Cost | Time | Result |
| --- | --- | --- | --- |
| Single agent (no harness) | $9 | 20 min | Broken app |
| Full harness | $200 | 6 hours | Working software + polished UI |
11. ThoughtWorks School: 2×2 Framework
ThoughtWorks approaches from a different angle — not building a product, but studying 50+ engineering teams failing in the same ways.
Their Insight: Classify Each Harness Control by Two Axes
Axis 1: When does it operate?
Axis 2: How does it operate?
2×2 Matrix
| | Feedforward (Guidance) | Feedback (Sensor) |
| --- | --- | --- |
| Computational | Type system, linter, architecture rules | Test suite, coverage, mutation tests |
| Inferential | Spec files, constraints descriptions | LLM code reviewer, behavior validators |
Neither feedforward nor feedback is sufficient alone. Both are needed.
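The matrix can be used as a checklist: tag each control you run with its two axes, then see which quadrants are empty. The sketch below is one way to do that in Python. The class and function names are made up for illustration, and the sample controls are taken from the matrix cells.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Timing(Enum):
    FEEDFORWARD = "feedforward"  # guidance
    FEEDBACK = "feedback"        # sensor


class Mechanism(Enum):
    COMPUTATIONAL = "computational"
    INFERENTIAL = "inferential"


@dataclass(frozen=True)
class Control:
    name: str
    timing: Timing
    mechanism: Mechanism


CONTROLS = [
    Control("linter", Timing.FEEDFORWARD, Mechanism.COMPUTATIONAL),
    Control("test suite", Timing.FEEDBACK, Mechanism.COMPUTATIONAL),
    Control("spec file", Timing.FEEDFORWARD, Mechanism.INFERENTIAL),
    Control("LLM code reviewer", Timing.FEEDBACK, Mechanism.INFERENTIAL),
]


def empty_quadrants(controls: List[Control]) -> List[Tuple[Timing, Mechanism]]:
    """Return the (timing, mechanism) cells that no control covers."""
    covered = {(c.timing, c.mechanism) for c in controls}
    return [(t, m) for t in Timing for m in Mechanism if (t, m) not in covered]
```

A harness with only a linter and a test suite would report both inferential quadrants as empty, which makes the gap visible at a glance.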
12. Principle 1: Context Over Instruction
Different teams, same discovery:
Ground the agent in real files → code that fits the codebase. Start from vague descriptions → hallucinated paths and invented APIs.
Before the agent types, ensure it knows where it is.
13. Principle 2: Planning and Execution Must Be Separated
All three schools independently discovered the same thing: planning and executing in the same round yields unreliable output.
Planning — even if AI-driven — should not be done in the same step as execution. It must be a separate step, with output reviewed before starting work.
14. Principle 3: Feedback Loops Are Non-Negotiable
Same principle, different approaches:
| School | Feedback Source |
| --- | --- |
| OpenAI | Automated tests + CI |
| Anthropic | Another LLM |
| ThoughtWorks | Both combined |
They differ on "who" provides feedback. But agree on "whether" feedback is needed.
15. Principle 4: Do One Thing at a Time
Trying to do too many things at once:
Anthropic's routine: Read progress → pick ONE feature → implement → commit → repeat.
"Progressive incrementalism" is a common trait of all successful harnesses.
16. Principle 5: Codebase Is the Document
No one maintains a separate knowledge base for the agent. The repo is the single source of truth.
If a convention, constraint, or architecture decision is not in the codebase → the agent won't know it.
Practical Implication
17. Harness Decay Is Real
When Anthropic upgraded from Opus 4.5 to 4.6, sprint decomposition (once essential) became dead weight.
The model's planning ability had improved, making that part redundant.
Components that were load-bearing in March had become overhead by April.
Then Opus 4.7 launched — models started verifying their own outputs, reducing the role of the Evaluator agent again.
This Is Harness Decay
| Model Version | Harness State |
| --- | --- |
| Opus 4.5 | Sprint decomposition + evaluation per sprint |
| Opus 4.6 | No decomposition + single evaluation (38% cost saving) |
| Opus 4.7 | Model self-verification → evaluator role shrinks again |
18. Build to Delete
Philipp Schmid's advice: "Build to delete."
Design each harness component to be removable.
Regularly test each component — turn it off, see if output quality drops. No drop → delete.
| Team | Harness change |
| --- | --- |
| Manus | Rebuilt harness 5 times within 6 months |
| LangChain | Reorganized 3 times in a year |
| Vercel | Cut 80% of tools → performance improved |
These are not signs of bad engineering. They are natural outcomes of "building on rapidly advancing models."
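The "turn it off and see if quality drops" test is an ablation loop. Here is a minimal sketch, assuming you already have some `evaluate` function that runs the harness with a given set of components enabled and returns a quality score. That function, the component names, and the scores in `toy_evaluate` are all invented for illustration.

```python
from typing import Callable, FrozenSet, List


def removable_components(
    components: List[str],
    evaluate: Callable[[FrozenSet[str]], float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return components whose removal lowers the score by no more than tolerance."""
    full = frozenset(components)
    baseline = evaluate(full)
    removable = []
    for name in components:
        # Disable one component at a time and re-score the harness.
        score = evaluate(full - {name})
        if score >= baseline - tolerance:
            removable.append(name)
    return removable


def toy_evaluate(enabled: FrozenSet[str]) -> float:
    # Made-up scores: in this toy setup only the evaluator affects quality.
    return 0.5 + (0.2 if "evaluator" in enabled else 0.0)
```

With the toy scorer, `removable_components(["sprint_decomposition", "evaluator"], toy_evaluate)` flags only `sprint_decomposition`. In practice the scores are noisy, so a non-zero `tolerance` and repeated runs would be needed before deleting anything.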
19. Cost Reality
Honest numbers from Anthropic's A/B tests:
| Setting | Cost | Time | Result |
| --- | --- | --- | --- |
| Single agent (no harness) | $9 | 20 min | UI changed, core broke |
| Full harness (Opus 4.5) | $200 | 6 hours | Working software, polished UI, correct physics |
22× cost increase — for a real working product, not just a "screenshot demo."
Is it worth it? Depends on how costly broken releases are to your team.
But here is the part that usually goes unsaid:
The harness + model combination keeps evolving.
A harness run that cost $200 costs only $124 after the model upgrade.
The trend: better models = simpler harness = cheaper runs = faster outputs.
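The figures in this section can be checked with a few lines of arithmetic. The sketch assumes the $124 figure corresponds to the 38% cost saving listed for Opus 4.6 in the harness decay table. The article does not state that link explicitly, but the numbers agree.

```python
single_agent_cost = 9.0      # USD, single agent with no harness
full_harness_cost = 200.0    # USD, full harness on Opus 4.5
saving_after_upgrade = 0.38  # assumed to be the 38% saving listed for Opus 4.6

# How many times more expensive the full harness is than the single agent.
multiplier = full_harness_cost / single_agent_cost

# Cost of the same harness run after the model upgrade.
cost_after_upgrade = full_harness_cost * (1 - saving_after_upgrade)

print(f"{multiplier:.1f}x")          # 22.2x
print(f"${cost_after_upgrade:.0f}")  # $124
```

So the "22×" in the text is 200 / 9 rounded down, and the post-upgrade run is still roughly 14 times the cost of the single-agent run.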
Summary
What Is Harness
5 Harness Artifacts
3 Schools
5 Universal Principles
Paradoxical Aspects
The winning engineers of 2026 are not those who write the best code. They are the ones who design the best constraints and are willing to discard them when they stop paying off.