Hammer, Shell, and Human: The Three-Body Problem of AI-Era Engineering Capability

2026-05-18 | Thinking | 18 min read

Reading guide: In the previous article, I argued that when AI drives implementation costs toward zero, the architect's core competitiveness shifts from breadth to depth — problem definition, business abstraction, multi-objective tradeoffs, migration design, and decision driving. But that article left an open question: how do these deep judgments systematically and continuously take effect? If judgment remains only in people's heads, it's neither scalable nor accumulable. This article attempts to answer that question — and address the flip side that the previous article didn't have room to explore: when judgment is encoded into machine-executable rules, will humans themselves slowly degrade in comfort?


I. The Delivery Problem of Judgment

The core conclusion of the previous article can be compressed into one sentence: When building becomes nearly free, "deciding what to build" becomes the only battleground.

I still stand by this judgment. But months of continued observation and practice have made me realize it implied a step I had skipped: judgment itself also needs to be delivered.

A senior architect's understanding of system boundaries, insight into business rules, anticipation of risk inflection points — if these deep capabilities exist only in their experience and intuition, they face several structural problems:

Not scalable. There's an upper limit to how many systems they can deeply engage with simultaneously. When they're not present, their judgment isn't present either.

Not accumulable. They leave, they transfer, they age, and the judgment goes with them. The organization hasn't distilled any reusable assets from it.

Not calibratable. Intuition cannot be quantitatively verified. They feel their judgment is improving, but no one knows if it's true — including themselves.

This isn't a new problem. In traditional software engineering, architects solved it by writing documents, drawing diagrams, conducting technical reviews. The essence of these practices is making tacit knowledge explicit. But historically this explicitness was coarse-grained — an architecture design document describes goals and principles, not executable rules. Final execution still depends on each developer's understanding of and voluntary compliance with the document.

The AI era pushes this problem to the extreme. When most of a system's code is autonomously generated by Agents, hundreds of PRs merge automatically daily, and human engineers no longer review line by line — your deep judgment must be "delivered" to the system in a more precise, rigid, and executable way.

It needs a carrier.


II. Harness: The Runtime Form of Judgment

In early 2026, a concept emerged in the AI engineering community: Harness Engineering.

A study by Nate B Jones provided compelling data: the same model, with a different runtime environment (Harness), saw programming benchmark success rates jump from 42% to 78%. Same model, same data, same prompts — only the "shell" wrapping the model changed.

What does this shell include? Current industry practice has largely converged on three pillars:

Context Engineering. Building the correct information environment for each Agent decision point — what files it needs to see, what history, what tool definitions. OpenAI's Codex team places an AGENTS.md file at the repository root, kept under 60 lines, serving as a "directory" rather than an "encyclopedia"; while also integrating a complete observability stack (logs, metrics, distributed tracing), letting Agents look up errors, view performance data, and locate problems themselves.

Architecture Constraints. Mechanically enforcing architectural rules through deterministic linters, type checks, and structural tests. Not writing "please follow layered architecture" in a prompt, but having CI fail when the rule is violated. Constraining the solution space actually makes Agents more productive: when an Agent can choose any direction, large numbers of tokens are wasted exploring dead ends; when boundaries are clear, it converges on correct answers faster.
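To make "having CI fail" concrete, here is a minimal sketch of a structural test for a layered architecture. The layer names (`domain`, `service`, `api`) and the rule that a layer may only import from layers below it are my own illustrative assumptions, not a description of any particular team's setup.

```python
import ast
from typing import List, Optional

# Hypothetical layer order, lowest first. A layer may import from itself
# or from layers below it, never from layers above it.
LAYERS = ["domain", "service", "api"]


def layer_of(module_name: str) -> Optional[str]:
    """Return the layer a dotted module name belongs to, or None if unlayered."""
    top = module_name.split(".")[0]
    return top if top in LAYERS else None


def find_violations(module_name: str, source: str) -> List[str]:
    """List absolute imports in `source` that reach into a higher layer."""
    own = layer_of(module_name)
    if own is None:
        return []
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue  # not an absolute import; relative imports are ignored here
        for target in targets:
            dep = layer_of(target)
            if dep is not None and LAYERS.index(dep) > LAYERS.index(own):
                violations.append(f"{module_name} imports {target} ({own} -> {dep})")
    return violations
```

In a CI job, a check like this would run over every source file and exit non-zero when the list is non-empty. The point is that the boundary lives in a deterministic check rather than in a prompt.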

Entropy Management. Periodically launching Agents to scan technical debt, outdated documentation, architectural drift. AI-generated code, after accumulating to a certain point, will have style drift, outdated docs, and architectural drift — like a house that, no matter how good the renovation, becomes a garbage dump without cleaning.
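One small piece of entropy management can be sketched deterministically: finding documentation that refers to files that no longer exist. The sketch below assumes a convention I am inventing for illustration, namely that docs mention source files as backticked relative paths. A real scan would need to handle more reference styles, and this pattern will also flag backticked version numbers such as `1.5`.

```python
import re

# Assumed convention: docs mention source files as backticked relative paths,
# for example `billing/invoice.py`.
PATH_PATTERN = re.compile(r"`([\w./-]+\.\w+)`")


def stale_references(doc_text, existing_paths):
    """Return backticked paths in doc_text that are not in existing_paths, sorted."""
    mentioned = set(PATH_PATTERN.findall(doc_text))
    return sorted(mentioned - set(existing_paths))
```

For example, given a doc that mentions `billing/invoice.py` and `billing/legacy_tax.py` in a repository where only the first file exists, the function returns the second path as a candidate for cleanup.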

These three pillars, viewed from another angle, are essentially the runtime form of architects' deep judgment.

Let me map them:

Architect's Deep Capability | Harness Implementation Form
--- | ---
Problem definition | AGENTS.md + Context Engineering: ensuring Agents see the correct problem space
Business abstraction | Layered architecture + linter rules: encoding abstraction boundary judgments as hard constraints
Multi-objective tradeoffs | CI gating + tool curation: solidifying tradeoff results as executable choices
Migration design | Entropy management + garbage collection: continuously correcting system drift
Decision driving and responsibility bearing | No direct counterpart

The first four can be encoded to varying degrees. The fifth cannot.

This "cannot" is not a technical limitation but an essential one: responsibility is a social act that requires a person to say before others, "I made this decision and I will bear the consequences." It is a question of governance structure, not of capability.

So the complete picture is: Harness can carry approximately 80% of deep judgment execution, but the most core 20% — about "should this requirement be done at all" and "what decisions am I willing to bear consequences for" — always requires direct human presence.


III. The Three-Body System

Integrating the previous discussion, the AI-era engineering capability structure can be understood as a three-body system.

Model provides capability. It's the engine; compute power keeps growing, covering more and more implementation layers. But the engine itself doesn't know where to drive.

Shell (Harness) provides constraints. It's the operating system, encoding human deep judgment into executable rule systems, constraining and guiding Agent behavior. But the operating system itself doesn't produce new judgments — it can only execute those already encoded.

Human provides judgment. Direction, responsibility, penetrative power into essential complexity. But human judgment is not a static asset — it can grow or degrade depending on how it's used.

The dependency relationships:

Human deep judgment → encoded as Harness → drives Agent execution → produces new system state
     ↑                                                    ↓
     └──── new failure modes / new essential complexity ←──┘

This is a continuously running loop. When healthy, each round makes the system stronger: human judgment is solidified in the Harness, Agents execute efficiently within the Harness, and new problems emerge that drive new rounds of deep thinking.

But this loop can also run in reverse and become a death spiral.


IV. Slow Collapse

In the previous article I mentioned the "automation paradox": after AI takes over 99% of routine operations, the remaining 1% requiring human intervention is precisely the most dangerous moment. This judgment applies not only to individual decisions but to the long-term evolution of entire capabilities.

Imagine this scenario:

An architect team has built a mature Harness system. Agents can autonomously complete 80% of development work, CI runs automatically, PRs merge automatically, and engineers' roles have become "Harness tuners." In daily work, they only need to fine-tune rules and review a small number of edge cases. The system runs well.

Then one day, a problem arises — an Agent generated code involving cross-domain data consistency, a scenario not covered in the Harness, and a bug slipped into production.

This should have been a problem an architect could quickly diagnose. But by this point:

  • The architect responsible hasn't deeply reviewed the relevant implementation code in six months; their understanding of this subsystem is frozen at its state six months ago
  • The Harness entropy management Agent has been "maintaining" the system, but its maintenance is based on existing rules — it can't discover blind spots in the rules themselves
  • The system's code volume has tripled compared to six months ago, most of it Agent-generated; even if they wanted to relearn it, the cost would be prohibitively high

This is not a fictional scenario. Similar things happen in every highly automated system — nuclear power, aviation, financial trading. Automation eliminates friction from daily operations, and also eliminates the feel for the system that humans maintain through those daily operations.

This is "slow collapse": not a sudden crash, but the continuous, invisible degradation of human deep judgment until some tipping point is triggered, exposing the extent of degradation.

More dangerously, slow collapse is often accompanied by surface prosperity: PR merge counts are growing, code coverage is improving, Agent automation rates are rising. All metrics are improving, only human capability is degrading — and human capability is precisely the one metric not on any dashboard.

Earlier I said Harness can carry approximately 80% of deep judgment. But this 80% isn't fixed — its quality depends on the level of human judgment. If human judgment degrades, Harness quality degrades with it, but the degradation process may lag months or even years before manifesting in system output.

This is why "a Harness for humans" is not a nice-to-have idea but a prerequisite for the entire loop to keep running.


V. A Harness for Humans

Agent Harnesses solve the "capable but directionally lost" problem. A Harness for humans must solve the opposite: "directionally clear but capability may degrade."

It's not one system but a set of nested mechanisms, with the core purpose of ensuring human deep judgment continuously grows during Agent collaboration rather than dissolving in comfort.

5.1 Redesigning Growth Tracks

Traditional engineer growth paths are designed around "implementation capability": starting with simple modules, progressively taking on more complex systems, eventually making architectural decisions. The underlying assumption: implementation capability is a prerequisite for architectural capability — you must first "do enough" before you can "think deeply enough."

This assumption is being shaken. When Agents can write better implementation code than most mid-level engineers, "doing enough" is no longer the necessary path to "thinking deeply enough."

But "thinking deeply enough" itself still needs a growth path, just with a different entry point.

Stage | Core Role | Training Focus
--- | --- | ---
Stage 1 | Reader and judge of Agent output | Understanding Agent production, judging right/wrong, comprehending decision chains
Stage 2 | Diagnostician and fixer of Agent failures | Locating why Agents made errors, supplementing Harness rules
Stage 3 | Harness designer | Designing new constraint rules and architecture boundaries
Stage 4 | System architecture and human-Agent division designer | Deciding what to delegate to Agents, what humans must bear

Note that Stage 1 is not "learn to write code" but "learn to read the Agent's code." The starting point has changed.

But there's a critical prerequisite: training at every stage must include mandatory "no-Agent" components. Not to return to past work styles, but for calibration: ensuring human understanding is real, not an illusion produced under Agent assistance.

5.2 Three Concrete Mechanisms

Mechanism 1: No-Agent Checkpoints.

Every 2-3 months, require engineers to complete a task without Agent assistance. It doesn't need to be large, but must cover the core capabilities of their current stage.

This isn't assessment, it's calibration. Like pilot simulator training — 95% of flight time is autopilot, but periodically you must land manually to maintain feel. The purpose isn't to prove humans are better than Agents, but to let people know where their true depth of understanding sits and where their blind spots are.
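The cadence itself is easy to make mechanical. As a rough sketch, the helper below flags when someone's last no-Agent checkpoint is more than three months old. The three-month ceiling comes from the "every 2-3 months" guideline above; the function names are made up for illustration.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def checkpoint_overdue(last_checkpoint: date, today: date, max_months: int = 3) -> bool:
    """True if more than `max_months` have passed since the last no-Agent checkpoint."""
    return today > add_months(last_checkpoint, max_months)
```

A reminder like this only guards the schedule. Whether the task actually covers the core capabilities of the engineer's current stage is still a human call.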

Mechanism 2: Personalized Error Logs.

Don't just record Agent errors — record human errors too: PRs you reviewed that were later proven problematic, Harness rules you designed that Agents bypassed, your performance trends on no-Agent checkpoints.

Traditional performance review feedback cycles are quarterly or annual. Personalized error logs can compress feedback cycles to daily — each "I should have caught this but didn't" record is a learning signal.
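What such a log might look like, in its simplest form: one record per miss, plus a way to see where the misses cluster. The field names and the three categories below are assumptions of mine that mirror the examples above; adapt them to whatever you actually want to track.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Miss:
    day: date
    kind: str  # hypothetical categories: "review", "harness_rule", "checkpoint"
    note: str  # what I should have caught, in one sentence


def misses_by_kind(log, since):
    """Count misses per kind recorded on or after `since`."""
    return Counter(m.kind for m in log if m.day >= since)
```

Even this much is enough to turn "I feel like I've been missing things in review lately" into a number you can watch over time.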

Mechanism 3: Reverse Teaching.

Require everyone to explain to more junior people why Harness rules were designed the way they were and how Agent decision logic works. If you can't clearly explain "why this rule exists," you've only memorized it, not understood it.

Feynman is often quoted as saying that if you can't explain something to a college freshman, you don't really understand it yourself. In the Agent era, this standard becomes: if you can't explain to someone new to the system "why the Agent made this error" and "what this rule prevented," your understanding of the system remains at the surface level.

5.3 What a Harness for Humans Is Not

Two things are easy to conflate but must be distinguished:

A Harness for humans is not "learn to use Agents" training. Tool training becomes outdated in three months. The core isn't learning to use tools, but cultivating judgment about when to trust Agent output, when not to, and when you must intervene personally.

A Harness for humans is not letting junior engineers "become architects sooner." Deep judgment ≠ title. An engineer with two years of experience can make accurate judgments in specific narrow domains, but that doesn't mean they can bear overall architectural decisions for a system. Accelerating the growth track ≠ skipping stages.


VI. The Complete Loop

Returning to the three-body diagram in Section III, adding "a Harness for humans," the complete loop becomes:

Human deep judgment → encoded as Agent Harness → drives Agent execution → produces new system state
     ↑              ↑                                        ↓
     │              │                              new failure modes
     │              │                                        ↓
     └─── Harness for Humans (calibration + feedback) ←── exposes capability blind spots ──┘

Three subsystems each running, yet interdependent:

  • Agent Harness needs humans to design and iterate
  • Human judgment needs continuous calibration mechanisms to maintain
  • Calibration mechanisms depend on actual Agent output to expose problems

This is no longer a one-way "humans design, machines execute" relationship, but a continuously co-evolving system.

Each loop cycle should leave both system and humans stronger than the previous one. That's the healthy state.


VII. Three Things You Can Do Tomorrow

As with the previous article, theory must ultimately land on actionable steps. These three things require no organizational change, no new tools — individuals can start.

First, write an AGENTS.md for the system you're maintaining.

It doesn't need to be perfect — under 60 lines. Make clear: where are this system's core boundaries, what rules absolutely cannot be violated, what are the common pitfalls. You'll find that the process of writing this document itself forces you to make tacit knowledge explicit — many constraints you thought you understood clearly, you discover you can't describe well when writing them down. If you can't describe them clearly, your understanding isn't deep enough.
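If you want the line budget to hold over time, it can be enforced the same way as any other constraint. Here is a minimal sketch; the function name and the messages are hypothetical, and only the 60-line budget comes from the text.

```python
MAX_LINES = 60  # the budget suggested in the text: keep the file under 60 lines


def check_agents_md(text: str, max_lines: int = MAX_LINES) -> list:
    """Return a list of problems; an empty list means the file is within budget."""
    lines = text.splitlines()
    problems = []
    if not any(line.strip() for line in lines):
        problems.append("file is empty")
    if len(lines) >= max_lines:
        problems.append(f"{len(lines)} lines; keep it under {max_lines}")
    return problems
```

The check says nothing about whether the content is any good. It only keeps the file from growing from a "directory" into an "encyclopedia."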

Second, set a "No-Agent Day" for yourself.

Choose one day per month, turn off all AI assistance tools, and complete the day's work by hand. Don't deliberately choose difficult tasks; normal daily work is fine. The purpose is to feel where your real speed and understanding stand once AI assistance is stripped away. This feeling itself is the best calibration signal.

Third, start keeping a "Judgment Log."

Not recording what you did, but what you judged: should this requirement be done, where should this boundary be drawn, is this risk acceptable, who bears the consequences of this decision. Review weekly, see if your judgment has been blurring in the comfort of AI assistance.
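A plain text file is enough for this, but if you prefer something structured, the sketch below shows one possible shape. The fields and the seven-day review window are assumptions that follow the "review weekly" suggestion above. The key idea is that every judgment eventually gets an outcome written next to it.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Judgment:
    day: date
    question: str  # e.g. "Should this requirement be done at all?"
    decision: str  # what I judged at the time
    outcome: Optional[str] = None  # filled in later: what actually happened


def due_for_review(log, today, max_age_days=7):
    """Return judgments at least `max_age_days` old that have no recorded outcome."""
    cutoff = today - timedelta(days=max_age_days)
    return [j for j in log if j.outcome is None and j.day <= cutoff]
```

Entries that sit in the "due" list for weeks are themselves a signal: either the judgment was too vague to ever be checked, or nobody went back to look.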


Conclusion

The previous article concluded: in an era where AI makes everything more "easy," deliberately choosing "not easy" is itself a competitive advantage.

What this article adds is: "Not easy" can't be sustained by willpower alone — it needs to be designed as a system.

The hammer (AI) grows stronger, the shell (Harness) grows more sophisticated, but humans — the only component in this three-body system that can produce new judgments — if not explicitly protected and calibrated in the system's design, will become the first variable to silently degrade.

Harness Engineering is not just about designing constraint systems for Agents. It should include a third dimension: a constraint and calibration system for humans, ensuring that in the process of empowering Agents, humans are not consuming themselves but continuously growing.

Brooks said there is no silver bullet. Forty years later that still holds. But now we have a good enough hammer, and are learning to give it a good shell. The remaining question is the oldest one: does the person holding the hammer still know where they're swinging it?

That question, no tool can answer for you.