
Code World Modeling: The Dark Thread Behind AI Reasoning Training

2026-05-25 · Thinking · 22 min read

Tracing a secret hinted at by an Anthropic researcher, we followed six papers to uncover a training paradigm bigger than expected: verifier-grounded process supervision, where code execution prediction becomes a universal reasoning pre-training ground — and the first path to intelligence improvement that doesn't depend on human data.

Two people, one thread, six papers, and a truth bigger than anyone guessed.


The Crime Scene

March 2026, Silicon Valley. Zhang Xiaojun interviews Yao Shunyu.

Yao's resume reads like fiction: undergraduate physics at Tsinghua, a Stanford PhD in theoretical physics, nine years studying non-Hermitian systems and quantum black holes. The year he graduated, he made a decision that stunned his peers: walk away from nine years of physics and dive headfirst into AI. Over the next two years, he trained models at Anthropic and Google DeepMind, contributing to Claude 3.7, 4, 4.5 and Gemini 3.

Four hours of conversation. Partway through, Zhang asked a seemingly simple question:

Why is Claude necessarily better at coding than GPT-4?

Yao's response was intriguing. It wasn't that he didn't want to answer — it was that saying it would give something away. His phrasing was precise:

There is a reason. It's a purely technical reason. But I genuinely can't tell whether it was stumbled upon randomly at first or was intentional — whether it emerged bottom-up and then became top-down strategy.

Three key phrases. Purely technical reason. Emerged bottom-up. Can't say.

Like three edges of a puzzle piece. They outline a shape, but you can't see the full picture.

He added:

The reward signal for coding is well-defined. What's the input, what's the output, does the test pass — match input to output and that's success.

He said this almost casually, as if describing something obvious. But we decided to treat it as a lead worth following.


Lead 1: Models Can't Actually "Run Code"

The first step in the investigation was understanding what this "reward signal" is actually rewarding.

If coding's reward is simply "write code, run tests, pass = score," anyone could think of that. It can't be the secret. So either the reward design has an elegant twist, or the training objective isn't simply "write code."
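The plain version of that reward, "write code, run tests, pass = score," fits in a few lines. The sketch below is only an illustration under stated assumptions: the function name generation_reward, the convention that the candidate defines a function called f, and the use of a bare exec (with no sandboxing) are all hypothetical choices, not any lab's actual setup.

```python
def generation_reward(candidate_code, tests, func_name="f"):
    """Reward for code generation: fraction of (input, expected) tests passed.

    Assumption: the candidate defines a one-argument function named func_name.
    exec() here is NOT a sandbox; a real pipeline would isolate execution.
    """
    if not tests:
        return 0.0
    namespace = {}
    try:
        exec(candidate_code, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that fails to load earns nothing
    passed = 0
    for arg, expected in tests:
        try:
            if func(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(tests)


tests = [(1, 2), (2, 4), (3, 7)]
good_enough = "def f(x):\n    return x * 2\n"
broken = "def f(x) return x\n"

print(generation_reward(good_enough, tests))  # passes 2 of 3 tests
print(generation_reward(broken, tests))       # 0.0
```

Nothing about this is subtle, which is the point of the paragraph above: if this were the whole story, it would not be a secret.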

In 2024, Meta released a benchmark called CRUXEval. 800 Python functions, each 3 to 13 lines. Not LeetCode hard — just basic loops, conditionals, and list operations. The task: given a function and an input, predict the output.

For example:

def f(x):
    result = []
    for i in range(len(x)):
        if x[i] % 2 == 0:
            result.append(x[i] * 2)
    return result

Input x = [1, 2, 3, 4, 5] — what's the output?

The answer is [4, 8]. You have to mentally execute the loop — i=0, x[0]=1, odd, skip; i=1, x[1]=2, even, append 4; i=2, x[2]=3, skip; i=3, x[3]=4, append 8; i=4, x[4]=5, skip. Return [4, 8].

Human programmers do this effortlessly. It's exactly what "running code in your head" during debugging feels like.

But models can't. GPT-4 with Chain-of-Thought only scored 75%. Code Llama 34B managed just 46%.

On 3 to 13 lines of code.

This was a slap in the face: models can write code but don't understand what happens when code runs. What they've learned is "what code typically looks like" — statistical patterns. They know sorted() is usually followed by a list, but don't truly understand that sorted() returns a new list while list.sort() modifies in place and returns None.
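The sorted() versus list.sort() distinction is easy to check directly. A minimal illustration using only Python built-ins:

```python
data = [3, 1, 2]

# sorted() builds and returns a new list; the original is left untouched.
new_list = sorted(data)
unchanged = list(data)

# list.sort() reorders the list in place and returns None.
returned = data.sort()

print(new_list)   # [1, 2, 3]
print(unchanged)  # [3, 1, 2]
print(returned)   # None
print(data)       # [1, 2, 3]
```

A model that only knows what such code usually looks like can easily write `x = x.sort()` and silently end up with None.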

Writing code relies on pattern matching; understanding execution requires causal reasoning. These are fundamentally different capabilities.

Here's a capability gap. And a gap means opportunity — if you can close it, model capability undergoes a qualitative leap.


Lead 2: Closing the Gap — Surprising Results

In early 2025, a team at DeepSeek did something remarkably simple.

They collected a large number of code functions, generated input-output pairs for each (just run them in a sandbox), and trained models to do one thing: given code and an input, use natural language Chain-of-Thought to predict the output.
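The data-generation half of that recipe can be sketched roughly as follows. This is a toy reconstruction, not the CodeI/O pipeline: the record fields, the helper name make_examples, and the in-process exec (standing in for a real sandbox) are assumptions made for illustration.

```python
def make_examples(func, source, inputs):
    """Run func on each input tuple and package (code, input, output) records."""
    examples = []
    for args in inputs:
        try:
            output = func(*args)
        except Exception:
            continue  # drop inputs that make the function crash
        examples.append({
            "code": source,
            "input": repr(args),
            "output": repr(output),  # ground truth comes from execution
        })
    return examples


SOURCE = '''def f(x):
    result = []
    for i in range(len(x)):
        if x[i] % 2 == 0:
            result.append(x[i] * 2)
    return result
'''

namespace = {}
exec(SOURCE, namespace)  # assumption: a real system would run this in a sandbox

examples = make_examples(
    namespace["f"],
    SOURCE,
    [([1, 2, 3, 4, 5],), ([],), (None,)],  # the last input crashes and is skipped
)
for example in examples:
    print(example["input"], "->", example["output"])
```

Each record becomes a prompt ("given this code and this input, reason step by step to the output") whose answer never needs a human label.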

That's it. So simple it could be a homework assignment.

The paper is called CodeI/O, accepted as ICML 2025 Oral.

But the results were not simple. Not only did coding improve. Mathematical reasoning, logical reasoning, scientific reasoning, symbolic reasoning, common-sense reasoning — all improved.

You make a model practice predicting code outputs, and its math skills get stronger?

DeepSeek's explanation: code execution prediction forces the model to learn four universal reasoning primitives — state tracking (what is this variable now, what does it become next), conditional branching (which path does the if-statement take), loop iteration (what changes each round), and modular decomposition (function A calls function B). These four operations aren't just used in programming — mathematical derivation, logical proofs, and engineering decomposition all need them.

And code happens to be the strictest, most verifiable training ground for these four operations. Every step is deterministic; every result can be run and compared. Math can't do this — "is this proof correct" has no sandbox for verification. Logic can't either — "is this reasoning chain valid" requires human judgment. Only code: run it and you know.

Code execution prediction isn't teaching models to program — it's teaching models to reason. Code is simply the strictest training ground.


Lead 3: You Don't Even Need Human Data

In 2025, a team led by Tsinghua University took things to the extreme.

They published a paper called AZR (Absolute Zero Reasoner), accepted at NeurIPS 2025. The idea: you don't even need code from outside. The model generates its own problems and solves them.

The same model plays two roles. The proposer generates code and inputs; the solver predicts outputs. Where does the reward come from? Run it in a sandbox.

But the proposer doesn't generate problems randomly. It is rewarded for proposing the problems most useful to the solver's learning progress: tasks the solver gets right only some of the time, rather than tasks it always solves or never solves. AZR also splits tasks into three types: given code and input, predict output (deduction); given code and output, infer input (abduction); given input and output, write code (induction).
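All three task types can be checked by the same executor, which is what makes the loop self-verifying. The sketch below is a simplified illustration, not AZR's implementation: the helper names, the one-argument function called f, and the unsandboxed exec are assumptions.

```python
def run(code, arg, func_name="f"):
    """Execute code and call func_name(arg). Illustration only: not sandboxed."""
    namespace = {}
    exec(code, namespace)
    return namespace[func_name](arg)


def check_deduction(code, arg, predicted_output):
    # code + input -> output: compare the prediction with real execution
    return run(code, arg) == predicted_output


def check_abduction(code, target_output, predicted_input):
    # code + output -> input: any input that reproduces the output is accepted
    return run(code, predicted_input) == target_output


def check_induction(candidate_code, io_pairs):
    # input/output pairs -> code: the candidate must reproduce every pair
    return all(run(candidate_code, x) == y for x, y in io_pairs)


CODE = "def f(x):\n    return [v * 2 for v in x if v % 2 == 0]\n"

print(check_deduction(CODE, [1, 2, 3, 4, 5], [4, 8]))  # True
print(check_abduction(CODE, [4, 8], [2, 4]))           # True
print(check_induction("def f(x):\n    return sorted(x)\n", [([3, 1], [1, 3])]))  # True
```

Note that abduction is graded by re-running the code, not by matching a reference input: [2, 4] is accepted even though the original input was [1, 2, 3, 4, 5].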

The result: zero human data, and the model outperforms those trained on tens of thousands of human-annotated examples in both coding and mathematical reasoning.

The entire chain — propose, predict, verify, reward — is a fully automated closed loop. Code can be self-generated, correctness can be self-verified, and difficulty can be self-adjusted. Theoretically, this loop can spin forever.


Lead 4: Step-by-Step Tracking Beats Guessing the Final Answer

In May 2026, StepCodeReasoner pushed things further.

All previous methods only checked whether the final output was correct. "This code produces [4, 8] when it finishes — did you guess right?" The entire reasoning process gets just one signal at the end.

StepCodeReasoner said: not enough. Don't just predict the final output — predict the variable state after every execution step.

The model needs to predict not a one-line answer but the entire execution process — what each variable is at every step, why a branch goes this way, what the state looks like at loop iteration k. Each step's variable state is compared against real execution; full marks only when everything matches.

This turns reward from "one signal" into "a signal at every step."
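To make "a signal at every step" concrete, here is a rough sketch of how ground-truth variable states could be recorded with Python's standard sys.settrace hook and compared against a predicted trace. It is not StepCodeReasoner's actual reward: trace_states and step_reward are hypothetical helpers, and scoring by the fraction of matching steps is an assumed simplification.

```python
import copy
import sys


def trace_states(func, *args):
    """Run func(*args) and snapshot its locals at every 'line' event.

    A 'line' event fires just before a line executes, so each snapshot
    shows the state as it stands right before that line runs.
    """
    states = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # ignore frames other than the function under test
        if event == "line":
            # Deep copy so later mutations do not rewrite earlier snapshots.
            states.append(copy.deepcopy(dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return states, result


def step_reward(predicted, actual):
    """Fraction of steps whose predicted state matches; 1.0 only if all match."""
    if not actual:
        return 0.0
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / max(len(predicted), len(actual))


def f(x):
    result = []
    for i in range(len(x)):
        if x[i] % 2 == 0:
            result.append(x[i] * 2)
    return result


actual_states, output = trace_states(f, [1, 2, 3, 4, 5])
guess = copy.deepcopy(actual_states)
guess[-1]["result"] = []  # one wrong step in an otherwise perfect trace

print(output)                                      # [4, 8]
print(step_reward(actual_states, actual_states))   # 1.0
print(step_reward(guess, actual_states) < 1.0)     # True
```

The comparison is a plain equality check against what the interpreter actually did, so there is no judge model to fool.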

They used an RL algorithm called Bi-Level GRPO to exploit this dense signal. A critical finding in StepCodeReasoner's appendix: using GPT-4o for intermediate step judgment achieves only ~73% accuracy, while interpreter-based rule-based reward is structurally 100% accurate. Reward doesn't rely on "looks reasonable" — it relies on ground truth from the environment.

Result: a 7B model hits 91.1% on CRUXEval, surpassing GPT-4o's 85.6%.

A 7B beats GPT-4o.


Lead 5: Scale Validation — But Read the Ablation

Papers are one thing; production is another. Meta FAIR performed the largest-scale validation in September 2025.

Code World Models (CWM): 120 million Python execution traces for mid-training, 32B parameters, 65.8% on SWE-bench. All three stage checkpoints fully open-sourced.

But CWM's ablation has a crucial detail we almost missed: function-level execution traces significantly improve CRUXEval but have no direct effect on SWE-bench-related metrics. What actually moves SWE-bench is ForagerAgent-style repo-level agent trajectories.

This means there's a gap between "mental execution" and "real engineering capability." Function-level execution prediction is the foundation, but it's not enough. You also need agent-environment interaction — the full trajectory of modifying code in a real project environment, executing, checking results, and iterating.


The Puzzle Comes Together (Roughly)

Lay all six leads side by side:

  1. Models can write code but can't "run code" (CRUXEval)
  2. Practicing "running code" unexpectedly improves math reasoning (CodeI/O)
  3. You don't even need external data (AZR)
  4. Step-by-step state tracking is far more effective than just checking the final answer (StepCodeReasoner)
  5. Function-level traces are the foundation, but connecting to agentic trajectories is what lifts real engineering capability (CWM ablation)
  6. Interpreter reward is more reliable than LLM judges (StepCodeReasoner appendix)

The picture that emerges can't be summarized as just "code execution prediction." It's closer to a complete training paradigm —

Using executable environments to turn reasoning training into verifiable state-transition learning.

Specifically, it's five things layered together:

  • Reframe reasoning as state-transition prediction — don't ask "what's the answer," ask "what does the variable become next"
  • Use trace anchors for process supervision — the interpreter verifies every step, no relying on LLM judging "does this look right"
  • Tasks are three-directional — deduction (code + input to output), abduction (code + output to input), induction (I/O to code)
  • Curriculum maximizes learning progress — not "harder is better," but problems in the sweet spot of some-right-some-wrong, high reward variance are most valuable
  • Eventually connect to agentic coding — function-level traces are the base; repo-level agent trajectories are what lifts real engineering capability
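The "sweet spot" in the curriculum bullet has a simple numerical reading. With a binary reward and pass rate p, the reward variance is p(1 - p), which is zero when the solver always succeeds or always fails and largest at p = 0.5. A minimal sketch, assuming this variance is used as the learning-progress proxy (the function name and the scaling by 4 are illustrative choices, not taken from any of the papers):

```python
def learnability(outcomes):
    """Scaled Bernoulli variance 4*p*(1-p) of binary rewards, in [0, 1]."""
    if not outcomes:
        return 0.0
    p = sum(outcomes) / len(outcomes)
    return 4.0 * p * (1.0 - p)


print(learnability([1, 1, 1, 1]))  # 0.0  (already mastered)
print(learnability([0, 0, 0, 0]))  # 0.0  (hopeless for now)
print(learnability([1, 0, 1, 0]))  # 1.0  (maximally informative)
print(learnability([1, 0, 0, 0]))  # 0.75
```

A proposer or sampler that prefers high-scoring problems drifts toward tasks at the edge of the solver's ability without anyone setting difficulty by hand.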

If forced into one sentence: The trick isn't code. The trick is verifier-grounded process supervision. Code is just currently the most powerful verifier playground.


How Right Did We Get It?

Having come this far, it's time to do something uncomfortable — grade ourselves.

Judgment on public research trends: 8/10. The line from CRUXEval to CodeI/O to AZR to StepCodeReasoner to CWM to Self-Execution Simulation clearly exists and is getting clearer. We're fairly confident on this part.

"Anthropic's secret is probably this": 4/10. This guess is reasonable but lacks direct evidence. The most direct reading of Yao's "coding's reward signal is well-defined" is actually code generation RL (write code, run tests, pass = reward), not execution prediction RL (read code, predict output, compare = reward). The two are related but not the same thing. True frontier labs probably do both — and more.

Engineering implementation plan: 6/10. The general direction is right, but many parameters are educated guesses, not verified production approaches. Function-level Python sandbox throughput might be close to our estimates, but repo-level agent RL execution costs are in a completely different league: Docker image builds, dependency installation, multi-file project setup. A single execution jumps from milliseconds to seconds or even tens of seconds.

A few pitfalls worth confessing:

First pitfall: coding reward does not equal code execution prediction reward. Early on, we mapped Yao's words directly to "execution prediction," skipping the more direct reading (code generation RL). In fact, CodeI/O is not purely sandbox-reward RL: it also relies on teacher-generated CoT and multi-turn revision. The paper shows that "code I/O tasks make good reasoning data"; it does not fully establish that "RL with only interpreter reward is sufficient."

Second pitfall: "code is the only signal satisfying these conditions" was too absolute. Formal mathematics (Lean/Coq proof states), SQL queries, game/physics simulators, compilers, theorem provers — all can provide verifiable reward. Code's standout advantage is data volume, task diversity, engineering value, and ease of automated execution, but it's not the only verifier playground.

Third pitfall: oversimplifying CWM's evidence. CWM didn't just do function-level execution prediction — it also included agentic Docker environment observation-action trajectories and test-time scaling. SWE-bench 65.8% is the result of all these combined, not execution prediction alone.


If You Were to Engineer This

Despite these uncertainties, if a company wanted to implement this line from scratch, the general direction can be sketched.

Problem generation system. Mix three data sources: GitHub function extraction (cold start, massive volume, low cost), competition problems (high difficulty, reasoning-dense), and model self-generation (the AZR route, adaptive, highest ceiling). Start with GitHub; increase self-generation ratio over time.

Sandbox system. Two-layer architecture: function-level uses lightweight microVM pools (Firecracker, fast startup, low overhead), engineering-level uses pre-built Docker image pools (project-level setup, dependency installation). The two face completely different execution latencies — the former in milliseconds, the latter in seconds to tens of seconds.
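As a very rough stand-in for the function-level tier, the sketch below runs each snippet in a fresh interpreter process with a timeout. This is an assumption-laden simplification: a subprocess is not a security boundary the way a microVM is, and run_isolated is a hypothetical helper, not part of Firecracker or Docker.

```python
import subprocess
import sys


def run_isolated(code, timeout_s=5.0):
    """Run code in a separate Python process; return (ok, stdout).

    Assumption: process isolation plus a timeout stands in for a real
    sandbox here. It limits hangs, not malicious behavior.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return proc.returncode == 0, proc.stdout.strip()


print(run_isolated("print(sum(range(5)))"))           # (True, '10')
print(run_isolated("while True: pass", timeout_s=0.5))  # (False, '')
```

Even this toy version pays process startup on every call, which hints at why pooling pre-warmed environments matters once the unit of execution grows from a function to a whole repository.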

Reward design. Three layers stacked: is the final output correct (sparse), is each step's state correct (dense), process quality (format reward). Plus difficulty weighting — downweight what the model already knows, also downweight what it completely fails at (it might be guessing), full weight in the sweet spot. Training algorithm uses GRPO, mixing 70% execution prediction + 20% code generation + 10% abduction tasks.
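One way the pieces of that reward design might fit together is sketched below. Every number except the 70/20/10 task mix is a made-up placeholder: the layer weights, the sweet-spot thresholds, and the floor value are hypothetical, and the function names are ours.

```python
import random

# Task mix from the text: 70% execution prediction, 20% generation, 10% abduction.
TASK_MIX = {"execution_prediction": 0.7, "code_generation": 0.2, "abduction": 0.1}


def sample_task(rng):
    """Draw a task type according to TASK_MIX."""
    names = list(TASK_MIX)
    return rng.choices(names, weights=[TASK_MIX[n] for n in names], k=1)[0]


def difficulty_weight(pass_rate, low=0.2, high=0.8, floor=0.25):
    """Full weight in the sweet spot, reduced weight outside it (assumed values)."""
    return 1.0 if low <= pass_rate <= high else floor


def total_reward(final_correct, step_matches, well_formatted, pass_rate,
                 w_final=1.0, w_step=0.5, w_format=0.1):
    """Stack sparse, dense, and format rewards, then apply difficulty weighting."""
    final = 1.0 if final_correct else 0.0                                   # sparse
    dense = sum(step_matches) / len(step_matches) if step_matches else 0.0  # dense
    fmt = 1.0 if well_formatted else 0.0                                    # format
    raw = w_final * final + w_step * dense + w_format * fmt
    return difficulty_weight(pass_rate) * raw


rng = random.Random(0)
print(sample_task(rng))
print(total_reward(True, [True, True, False, True], True, pass_rate=0.5))
print(total_reward(True, [True, True, False, True], True, pass_rate=1.0))
```

With these placeholder weights, the same rollout earns four times as much on a problem in the sweet spot as on one the model already solves every time.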

Integration with existing RLHF pipeline. Three-stage progression: Phase 1, code execution prediction (dense reward builds the foundation, learning "mental execution"); then Phase 2, code generation (test pass-rate reward, learning "writing correct code," with Phase 1's mental execution serving as self-verification); then Phase 3, general RLHF (human preference for safety alignment).


Looking Forward

Short-term (6-12 months): everyone catches on. CodeI/O, AZR, StepCodeReasoner have laid out the methodology openly. Coding gaps created by this approach will be quickly closed.

Mid-term (1-2 years): training paradigm shift. If cross-domain transfer continues to hold, the standard model training pipeline might become: regardless of what you ultimately want the model to do, first train basic reasoning through code execution. Code execution becomes "pre-training" for general reasoning — just as ImageNet was once the starting point for all vision tasks.

Long-term: the first intelligence improvement path that doesn't depend on human data. AZR proved that self-play + code execution can improve reasoning without any human data. Code can be self-generated, correctness can be self-verified, difficulty can be self-adjusted. Once this loop starts spinning, it can theoretically keep spinning forever.

Every other direction — RLHF, distillation, SFT — ultimately relies on human-annotated or human-produced data. But verifier-grounded process supervision doesn't need humans.

This prospect is far more important than any single company's single trick.


Epilogue

Looking back, we were like two detectives following a lead — starting from a "can't say" secret, tracking down six papers, and assembling a picture bigger than expected.

Chris, whose article was the starting point of this investigation, guessed the right direction but framed it too narrowly. We also framed it narrowly at first, reducing the trick to "code execution prediction," and only later realized it's "verifier-grounded process supervision," with code as merely the current strongest carrier.

And what Yao described as that "purely technical reason" is likely not a single trick but an entire execution-grounded coding RL pipeline: execution traces as foundation, test reward for performance, agent trajectories connecting to real engineering, test-time scaling amplifying capability.

This is more like what a frontier lab would actually bet on than any single trick.

Simple things have the most vitality. The greatest truths are the simplest.

ImageNet was also simple — "label images." Verifier-grounded process supervision is also simple — "make models run code in their heads." But these two things each changed (or are changing) an era.

Of course, this is still conjecture. But the papers along this dark thread — from CRUXEval to CodeI/O to AZR to StepCodeReasoner — are turning conjecture into evidence, one by one.


Core References:

  • Chris, "Let me guess what Yao Shunyu called Claude 3's winning move": the starting point of this article. https://mp.weixin.qq.com/s/z4m0Z3ulKUViYYPaQtY-yg
  • Scratchpad (Nye et al., 2021): the pioneering work on execution process tracing
  • CRUXEval (Gu et al., Meta, ICML 2024): quantifying "models can't run code"
  • CodeI/O (Li et al., DeepSeek, ICML 2025 Oral): cross-domain positive transfer from execution prediction
  • Absolute Zero Reasoner (Zhao et al., Tsinghua University, NeurIPS 2025): zero human data self-play
  • Code World Models (Synnaeve et al., Meta FAIR, 2025.09): 120M trace production-scale validation
  • Self-Execution Simulation (Maimon et al., Meta FAIR, 2026.03): execution simulation feeding back into code generation
  • StepCodeReasoner (Tang et al., 2026.05): step-by-step state tracking + Bi-Level GRPO, 7B surpasses GPT-4o