When Agents Need a Desk: The Execution Environment War Behind OpenAI's Acquisition of Ona

1. It Starts with a Real Sandbox Escape

The Ona team documented a telling detail on their blog: they had Claude Code executing tasks inside a sandbox, and Claude discovered that /proc/self/root/usr/bin/npx could bypass the blacklist. After Bubblewrap (a lightweight Linux sandbox) closed off that path, the Agent simply shut down the sandbox itself.

This is not a theoretical risk. In another real incident involving Claude Code, a user asked the Agent to clean up files. The Agent executed rm -rf ~/, wiping out the entire home directory, including irreplaceable family photos. Separately, Cline (a VS Code extension with 5M+ users) had its npm token stolen via a prompt injection attack.

Agent behavior is inherently probabilistic. Even the most capable models will occasionally produce commands that look plausible but are genuinely dangerous. The execution environment is not an infrastructure detail—it is the security boundary of any Agent system.

On June 11, 2026, OpenAI announced the acquisition of Ona—a startup most people had never heard of, doing something that sounds unglamorous on the surface: providing secure cloud execution environments for AI Agents. Ona CEO Johannes Landgraf put it most directly: "Agents need more than intelligence; they need a trusted workspace."

That statement is being validated across the entire industry right now.

2. Why Now: Why OpenAI Bought Ona

Two things happened simultaneously, pushing this acquisition to the forefront.

First: Codex weekly active users grew from 3 million in April to 5 million, an increase of roughly two-thirds in a matter of months. The pace of user adoption for Agent coding tools has far exceeded expectations.

Second: Anthropic's Claude Code hit 88.6% on SWE-bench Verified (Opus 4.8), while Codex sits at roughly 49%. Local execution has a substantial advantage on complex tasks: in a real environment the Agent can reproduce timing bugs from CI pipelines and fully manipulate git workflows, which ephemeral cloud sandboxes struggle to match.

OpenAI cannot compete on model capability alone. The execution environment is the gap they had to close.

What Ona brings:

Codex's original cloud sandbox was ephemeral—once a task completes, the container is destroyed. But enterprise scenarios demand persistent environments: an Agent might run for hours or even days, maintaining state, accessing internal networks, managing credentials. Ona's core capability is deploying persistent execution environments inside customer VPCs (self-hosted but not self-managed), with built-in credential management, audit logging, and network policies.

More importantly, Ona's most distinctive technical asset is the Veto kernel module (BPF LSM). Traditional sandboxes intercept by file path: "don't allow running /usr/bin/npx." But an Agent can bypass this through path aliases like /proc/self/root/usr/bin/npx, which is exactly the escape documented in Ona's blog.

Veto takes a fundamentally different approach: it operates at the BPF LSM (Linux Security Module) level, deciding whether to allow execution based on the SHA-256 hash of the binary content. Not "this path can't run," but "a binary with this hash can't run." This eliminates the TOCTOU (Time-of-check to time-of-use) vulnerability window—paths can change, symlinks can change, mount points can change, but the binary content stays the same, and so does the enforcement. Even if an Agent discovers a new path alias or procfs bypass technique, as long as the binary it tries to execute is on the blacklist, it gets blocked.
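
The difference between the two rules can be illustrated with a small user-space sketch. This is not Veto's implementation (which, per the description above, runs in the kernel via BPF LSM); the function names and the temporary files are hypothetical stand-ins, and a symlink plays the role of the /proc/self/root/... alias. A user-space check like this would still have a gap between hashing and execution, which is one reason the real enforcement sits in the kernel.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file's bytes; every alias of the same file yields the same digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def path_rule_blocks(candidate: Path, denied_paths: set[str]) -> bool:
    # Naive path rule: compares only the literal string the caller supplied.
    return str(candidate) in denied_paths


def hash_rule_blocks(candidate: Path, denied_hashes: set[str]) -> bool:
    # Content rule: the decision depends only on the bytes that would run.
    return sha256_of(candidate) in denied_hashes


with tempfile.TemporaryDirectory() as tmp:
    tool = Path(tmp) / "npx"              # stand-in for a denied binary
    tool.write_bytes(b"#!/bin/sh\necho pretend-npx\n")
    alias = Path(tmp) / "alias-npx"       # stand-in for a path alias
    os.symlink(tool, alias)

    denied_paths = {str(tool)}
    denied_hashes = {sha256_of(tool)}

    path_blocks_direct = path_rule_blocks(tool, denied_paths)
    path_blocks_alias = path_rule_blocks(alias, denied_paths)
    hash_blocks_alias = hash_rule_blocks(alias, denied_hashes)

print(path_blocks_direct, path_blocks_alias, hash_blocks_alias)
# prints: True False True
```

The path rule catches the direct path but misses the alias, while the hash rule catches both because the bytes are identical.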

Figure: Traditional sandbox path interception vs. Ona Veto hash interception

Layer	Codex Original	Ona Addition
Isolation model	Container-level	Container-level + kernel policy (Veto/BPF LSM)
Environment lifecycle	Ephemeral (per-task)	Persistent (day-level)
Deployment location	OpenAI infrastructure	Customer VPC
Credential management	None	Scoped credentials + audit
Execution control	Path-level blacklist	Binary hash-level blacklist
Target scenario	Individual developer parallel tasks	Enterprise-grade production deployment

OpenAI needs to convince large enterprises to hand their code repositories over to Agents—and without VPC-internal deployment and comprehensive auditing, enterprise security teams simply won't sign off. That's exactly the gap Ona fills.

3. Four-Vendor Strategy Comparison

Anthropic spins up a full Linux VM per agent inside Claude Cowork; Google built an Agent Sandbox CRD on GKE; Cloudflare uses V8 isolates for Dynamic Workers — every vendor is solving the same problem: where should agent-generated code actually run?

A summary table first, then vendor-by-vendor details:

Vendor	Core Strategy	Isolation Tech	Security Model	Target User
OpenAI	Cloud sandbox + Ona for VPC gap	Seatbelt/container + BPF LSM	Platform-managed first	Enterprise developers
Anthropic	Three-tier, scoped by use case	Seatbelt/bwrap/VM/third-party	Tiered by trust level	Full spectrum
Google	Four product lines, no unified stack	Varies by product	Fragmented	Google ecosystem users
Cloudflare	V8 isolate, extreme speed	V8 isolate	Language-level isolation	Global edge agents

OpenAI: Cloud-First Sandbox, Ona Fills the Enterprise Gap

Codex was designed as cloud-first from day one. Give it a task and it clones your repo, edits code, runs tests, and returns results — all inside an OpenAI-managed isolated container. Up to eight subagents execute in parallel, each in its own sandbox.

Local-mode sandboxing: macOS uses Seatbelt (Apple's system-level sandboxing framework); Linux uses Landlock + seccomp for syscall filtering. The core policy is "working directory writable, network denied by default."
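
As a rough illustration of what "working directory writable, network denied by default" means as a decision rule, here is a minimal sketch. The class and method names are hypothetical; Seatbelt, Landlock, and seccomp enforce such rules in the kernel, whereas this only models the allow/deny logic in plain Python (3.9+ for `is_relative_to`).

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SandboxPolicy:
    workdir: Path
    allow_network: bool = False  # denied unless explicitly enabled

    def may_write(self, target: str) -> bool:
        # Resolve ".." and symlinks first so "workdir/../etc" cannot slip through.
        # Joining with an absolute target yields that absolute path unchanged.
        resolved = (self.workdir / target).resolve()
        return resolved.is_relative_to(self.workdir.resolve())

    def may_connect(self, host: str) -> bool:
        # Default-deny: the host is irrelevant unless networking was enabled.
        return self.allow_network


policy = SandboxPolicy(workdir=Path("/tmp/project"))
print(policy.may_write("src/main.py"))    # inside the working directory
print(policy.may_write("../other/file"))  # escapes via ".."
print(policy.may_write("/etc/passwd"))    # absolute path outside
print(policy.may_connect("example.com"))  # network is off by default
```

Resolving the path before comparing is the important step: a check on the unresolved string would accept "../other/file" simply because it starts inside the working directory.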

The Ona acquisition plugged the enterprise gap: VPC-internal deployment, persistent environments, BPF LSM kernel-level execution control. OpenAI's strategy now runs two parallel tracks — individual developers get Codex cloud sandboxes, enterprises get Ona VPC sandboxes.

Anthropic: Three-Tier Architecture, Isolation Scoped by Use Case

Anthropic has the most complex strategy because it runs agents across three product lines simultaneously, each facing different users and security assumptions.

Tier 1: Claude Code (developer tool) — runs on the local machine with direct access to the real filesystem, terminal, and network. The security boundary relies on OS-level sandboxing (Seatbelt on macOS, bubblewrap on Linux) plus human-in-the-loop approval. Approval alone leads to "approval fatigue": an agent might request permissions dozens of times per hour, and humans start clicking "approve" without thinking. The sandbox is reported to reduce permission prompts by 84%, which is what makes the remaining approvals meaningful.

Tier 2: Claude Cowork (general knowledge work) — launches a full Linux VM using Apple Virtualization Framework on macOS, Hyper-V on Windows. The critical design difference: in Claude Code, a privileged process outside the sandbox can decide to allow a command; in Cowork, the agent loop runs inside the VM with no external process holding an "escape key." The agent doesn't know it's sandboxed, and no component has the authority to grant it exceptions.

Anthropic also ran an interesting internal test: they prepared a sandbox-escape prompt and posted it to an internal Slack channel for discussion. Someone pointed out that internal agents also read Slack — meaning the test prompt itself became a potential attack vector. They had to add a canary string to monitor whether any agent was triggered by it. In a world where agents can read everything, the testing tooling is itself part of the attack surface.
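
The canary idea is simple enough to sketch. The following is an assumption-laden illustration, not Anthropic's tooling: the function names, the token format, and the transcript dictionary are all hypothetical. It generates a random marker to embed in a test prompt and then scans agent transcripts for it.

```python
import secrets


def make_canary(prefix: str = "CANARY") -> str:
    # A random token that should never occur naturally in logs or output.
    return f"{prefix}-{secrets.token_hex(8)}"


def find_triggered(canary: str, transcripts: dict[str, str]) -> list[str]:
    # Return the names of agents whose transcript contains the canary.
    return sorted(name for name, text in transcripts.items() if canary in text)


canary = make_canary()
transcripts = {
    "agent-a": "summarized the channel, nothing unusual",
    "agent-b": f"following instructions found in message: {canary}",
}
print(find_triggered(canary, transcripts))
# prints: ['agent-b']
```

Any hit means an agent ingested the test prompt from a channel it was never meant to act on.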

Tier 3: Claude Managed Agents (enterprise) — GA'd May 2026. $0.08/session-hour (plus token costs). The design philosophy is "separation of execution plane and orchestration plane": Anthropic's servers handle scheduling and model inference, but code execution happens in customer-controlled sandboxes (Cloudflare, Daytona, Modal, Vercel as execution planes). Credentials live in a Vault and are never exposed to the model runtime.

Google: Four Product Lines, the Most Fragmented Strategy

Google's agent execution environment strategy is spread across at least four lines with no unified technology selection:

Jules V2 (coding agent): Async cloud execution, returns PRs, user doesn't need to stay online
Antigravity 2.0 (terminal agent): Replaces Gemini CLI, supports multi-agent orchestration, defaults to Gemini 3.5 Flash
Project Mariner (browser agent): Runs inside Chrome, operates only on web pages. Biggest advantage is Gemini's distribution channel of 750M+ users
GKE Agent Sandbox (infrastructure): A CRD from Kubernetes SIG Apps with three core abstractions — SandboxTemplate (blueprint), SandboxClaim (declarative request), SandboxWarmPool (warm pool, bringing creation time under 1 second)
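
To make the warm-pool idea concrete, here is a minimal in-memory sketch of why pre-created sandboxes cut claim latency. It does not use the GKE API or the actual CRD schema; `Sandbox`, `WarmPool`, `claim`, and `refill` are hypothetical names chosen to mirror the template / claim / warm-pool roles described above.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Sandbox:
    sandbox_id: int
    template: str


@dataclass
class WarmPool:
    template: str                 # plays the role of the blueprint
    size: int                     # how many sandboxes to keep ready
    cold_creations: int = field(default=0, init=False)
    _ids: count = field(default_factory=count, init=False, repr=False)
    _ready: deque = field(default_factory=deque, init=False, repr=False)

    def __post_init__(self) -> None:
        self.refill()

    def _create(self) -> Sandbox:
        # Stand-in for the slow path (image pull, boot, and so on).
        return Sandbox(next(self._ids), self.template)

    def refill(self) -> None:
        # Top the pool back up; a real controller would do this in the background.
        while len(self._ready) < self.size:
            self._ready.append(self._create())

    def claim(self) -> Sandbox:
        # Fast path: hand out a pre-created sandbox if one is ready.
        if self._ready:
            return self._ready.popleft()
        # Slow path: the pool is empty, so the caller pays for creation.
        self.cold_creations += 1
        return self._create()


pool = WarmPool(template="python-3.12", size=2)
a, b, c = pool.claim(), pool.claim(), pool.claim()
print(a.sandbox_id, b.sandbox_id, c.sandbox_id, pool.cold_creations)
# prints: 0 1 2 1
```

The first two claims are served from the pool; only the third pays the creation cost, which is the effect the warm pool is meant to minimize.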

GKE Agent Sandbox is clearly positioned for enterprises already running Kubernetes infrastructure. But Google's Computer Use offering only runs on the browser surface — Anthropic's reference architecture provides a full desktop environment (Docker running Xvfb + Firefox + LibreOffice), while Google gives the agent only a browser window.

Cloudflare: V8 Isolates as "Fast Sandboxes"

Cloudflare takes the most distinctive approach. No VMs, no containers — it uses V8 isolates, the isolation unit from Chrome's JavaScript engine. Dynamic Workers allow agents to dynamically generate and execute code at runtime, with Cloudflare claiming 100× speed over container-based solutions. The tradeoff: only JavaScript and WebAssembly can run; no native syscalls.

Cloudflare Sandbox GA'd in May 2026 in partnership with Anthropic, as one of Claude Managed Agents' execution planes. New additions include secure credential injection (agents never touch credentials), PTY support, persistent code interpreter, snapshot restore, and active-CPU pricing.

Fits: latency-sensitive, globally distributed agent workloads. Doesn't fit: anything requiring GPU or a full Linux environment.

4. Four Levels of Isolation

Strip away the product branding, and the current landscape settles into four isolation tiers. Understanding these tiers is the prerequisite for making sense of every sandbox platform's tradeoffs.

Tier 1: Language / OS-level Sandboxes

Seatbelt (macOS), bubblewrap (Linux), seccomp, Landlock. The agent still shares the host kernel — the OS merely filters which syscalls are permitted. Isolation is a deny-list, not a boundary.

Cold start: <10 ms
Security strength: Weakest. A single kernel vulnerability can breach containment. The Claude Code escape documented by Ona occurred at this level.
Representatives: Claude Code local mode, Codex CLI local mode.
Why it still exists: For coding agents that need full access to a real development environment, complete isolation means losing environmental fidelity. The design assumption is that the user is an engineer who can assess the risk.

Tier 2: Container-level Isolation

Docker, gVisor, Kata Containers. Stronger than OS-level sandboxes — at minimum, you get independent filesystem and network namespaces. Standard Docker containers still share the host kernel; gVisor adds a user-space syscall interception layer.

Cold start: 90–150 ms (Daytona claims <90 ms)
Security strength: Moderate. gVisor blocks most escape paths but isn't hardware-level isolation.
Representatives: Daytona (Docker), Modal (gVisor), Cloudflare Sandbox (V8 isolate).
Best fit: Short-lived agent tasks — run code, return results, destroy.

Tier 3: MicroVM-level Isolation

Firecracker (the technology behind AWS Lambda), Apple Virtualization Framework, Hyper-V. Each sandbox gets its own Linux kernel. Escaping requires finding a virtualization-layer vulnerability — the same difficulty as attacking a VM.

Cold start: typically 100–200 ms (Bunnyshell hopx ~100 ms, E2B ~150 ms); some platforms are slower (Blaxel 200–600 ms)
Security strength: Strongest within an acceptable cold-start budget.
Representatives: E2B, Ona, Claude Cowork, Blaxel, Bunnyshell hopx, Fly.io Sprites, Vercel Sandbox.
Core judgment: MicroVMs are the current sweet spot for security vs. speed.

Tier 4: Full VMs

QEMU/KVM, Windows HCS. Full virtualization, strongest isolation, multi-second cold starts. Rarely used for agent workloads — startup latency is prohibitive. The notable exception is Ramp (a fintech company), which chose VM-level isolation for its in-house agent system: each task gets a fully independent environment, driven by financial-industry compliance requirements.

Isolation Level	Independent Kernel	Cold Start	Cost per Hour (ref.)	Agent Use Case
OS-level (Seatbelt/bwrap)	No	<10 ms	$0 (local)	Local developer coding
Container (Docker/gVisor)	No	90–150 ms	~$0.08	Short tasks, CI/CD
MicroVM (Firecracker)	Yes	100–200 ms	~$0.08	Enterprise agent production
Full VM (QEMU/HCS)	Yes	Seconds	Varies	Finance / high-compliance

Figure: Four isolation levels for agent sandboxes

5. The Independent Sandbox Platform Landscape

Model providers are building their own execution environments, but a cohort of independent startups is competing for the same market: model-agnostic agent sandboxes.

Mainstream Platforms

E2B: $35M raised, Firecracker microVMs, Python/TypeScript SDK, 88% of the Fortune 100 signed on. ~150 ms cold start, sessions up to 24 hours. Open-source and self-hostable, but production-grade BYOC is enterprise-only (currently AWS and GCP). Perplexity used E2B to ship advanced data analysis for Pro users in one week; Hugging Face used it to reproduce DeepSeek-R1 — integration speed is the core selling point. Weaknesses: no GPU support, 24-hour session cap (1 hour on the free tier).

Daytona: 60k+ GitHub stars (open-source, AGPL), Docker container-level isolation, sub-90 ms cold start, unlimited session duration. BYOC supported. SOC 2 Type I + Type II certified. Weaknesses: shared kernel (Kata Containers available but not default), isolation weaker than microVM. Pivoted from developer environments to AI agent infrastructure in 2024.

Modal: gVisor container-level isolation, 50,000+ concurrent sessions, native GPU support (A100, H100). Suited for ML workloads. Weaknesses: primarily Python (JS/Go in beta), no self-hosting or BYOC.

Blaxel: Positioned around "perpetual sandboxes" — sandboxes can idle indefinitely, resume from standby in 25 ms, with zero compute cost during standby. MicroVM isolation, SOC 2 Type II + ISO 27001 + HIPAA BAA. Also offers Agents Hosting, Batch Jobs, and MCP Server Hosting. Weaknesses: newer entrant, limited community and case-study depth.

Cloudflare Sandbox: GA'd May 2026. V8 isolate-based rather than VM- or container-based, running on Cloudflare's global edge network. Strengths: worldwide coverage and active-CPU pricing. Weaknesses: no GPU, cannot run native syscalls.

Differentiated Entrants

Edera: The Only Platform with a Public Security Audit

Edera takes a path that no other sandbox platform follows. It doesn't use Firecracker, gVisor, or Docker — instead, it built a custom KVM-based "zone" abstraction where each agent runs in an independent kernel-isolated compartment, with <1% CPU overhead and 766 ms startup.

The real differentiator is verifiable security. Trail of Bits (a top-tier security audit firm) conducted a four-week public audit: zero critical findings. This is unique in the agent sandbox space — every other platform's security assurance boils down to "we use Firecracker/gVisor, so we should be safe." Edera can say "a third party confirmed we're safe."

Edera's core thesis: seccomp/AppArmor cannot sandbox agents, because these tools require pre-enumerating permitted behaviors — but agents are non-deterministic. The same prompt may produce different syscall sequences across runs. You can't policy-gate what you can't predict. So Edera doesn't intercept at the syscall layer; it isolates at the hypervisor layer — not restricting what the agent does, only limiting its blast radius.

Best fit: enterprises whose security teams require third-party audit reports. Not suited for cold-start-latency-sensitive scenarios.

Bunnyshell hopx: Not Just a Shell — a Full Development Environment Stack

Bunnyshell's hopx.ai represents an evolutionary direction for sandbox products: moving from "give the agent a Linux compute unit" to "give the agent a complete development environment." Firecracker microVM, ~100 ms cold start, but the sandbox ships with a full dev stack — databases, API services, background workers. It also supports native MCP Server — agents can interact with the environment directly via the MCP protocol.

This is genuinely valuable for certain workflows: "spin up a sandbox with a database, run migrations, write tests, verify, and submit a PR" — with E2B you'd assemble the database layer yourself; with hopx it's a sandbox preset.

Beam (Beta9): The Only Option Combining GPU + Self-hosted + Unlimited Duration

Beam's positioning addresses the respective pain points of E2B and Modal: E2B has strong isolation (microVM) but no GPU; Modal has GPU but can't be self-hosted. Beam uses gVisor isolation + broad GPU support (H100, H200, A100 80GB, B200, L40S, RTX 4090/5090) + BYOC (AWS, GCP, Azure, Hetzner) + no session duration limit.

More notably, Beta9 is fully open-source — teams can run the entire sandbox platform on their own infrastructure. Weaknesses: gVisor isolation is weaker than microVM, and cold start runs 1–3 seconds.

nono: Isolation + Credential Security + Config Integrity in One Package

nono doesn't solve "how to isolate" — it solves "what happens after isolation." Most sandbox solutions stop at containment: the agent is locked down, but how are credentials managed? How do you prevent config tampering?

nono uses Landlock for filesystem isolation, then layers on integrity-protected configuration and OS-native key management. The companion tool kubefence is a Kubernetes NRI plugin that can transparently inject nono sandboxes into existing containers and Kata VMs — no application code changes, no Dockerfile modifications. Particularly valuable for teams already running Kubernetes infrastructure.

Fly.io Sprites: The Middle Ground of Persistent Filesystems

Firecracker microVMs, each sandbox comes with 100 GB of persistent NVMe storage. Checkpoint/restore in ~300 ms, scale-to-zero. Positioned between E2B (purely ephemeral) and Blaxel (unlimited standby): agents don't burn money while idle, and when they resume from a snapshot, filesystem state is fully preserved.

Platform Comparison

Platform	Isolation Model	Independent Kernel	Cold Start	Standby Resume	Session Limit	GPU	BYOC / Self-hosted	MCP Native
E2B	Firecracker microVM	✓	~150 ms	~1s	24h	✗	Enterprise (AWS/GCP)	✗
Daytona	Docker (+ Kata optional)	Optional	~90 ms	—	Unlimited	✗	✓	✗
Modal	gVisor	✗	<1s	—	Configurable	✓ (A100/H100)	✗	✗
Blaxel	microVM	✓	200–600 ms	25 ms	Unlimited	✗	✗	✓
Cloudflare	V8 Isolate	✗	<10 ms	—	Configurable	✗	✗	✗
Edera	KVM zone	✓	766 ms	—	Configurable	✗	K8s native	✗
Bunnyshell hopx	Firecracker microVM	✓	~100 ms	—	Configurable	✗	✓	✓
Beam (Beta9)	gVisor	✗	1–3s	Snapshot restore	Unlimited	✓ (H100/A100/B200)	✓ (open-source)	✗
Fly.io Sprites	Firecracker microVM	✓	~300 ms	~300 ms	Configurable	✗	✗	✗
Vercel Sandbox	Firecracker microVM	✓	<1s	Snapshot restore	45min–5hr	✗	✗	✗

Three notable blank spots:

No single platform achieves microVM + GPU + BYOC simultaneously. Beam comes closest but uses gVisor (no independent kernel). "Strongest isolation + compute + data sovereignty" remains an impossible trinity — for now.
MCP native support is still sparse — only Blaxel and Bunnyshell hopx. MCP is becoming the standard Agent-to-tool protocol (97M monthly downloads, 5800+ servers), but sandbox platforms haven't caught up.
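Cold start and isolation strength pull against each other: <10 ms (V8 isolate) → ~100 ms (microVM) → 766 ms (Edera zone), with Beam's gVisor setup slower still at 1–3s despite having no independent kernel. No option in the table is both the fastest and the most strongly isolated.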
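
The first blank spot can be checked mechanically against the comparison table. The sketch below transcribes three columns from that table; collapsing cells such as "Optional", "Enterprise (AWS/GCP)", and "K8s native" into booleans is a simplifying assumption of this example, not something the table itself states.

```python
from typing import NamedTuple


class Platform(NamedTuple):
    name: str
    independent_kernel: bool
    gpu: bool
    byoc: bool


# Transcribed from the comparison table. Daytona's "Optional" kernel is
# recorded as False (Kata is not the default); "Enterprise" and
# "K8s native" BYOC cells are recorded as True.
PLATFORMS = [
    Platform("E2B", True, False, True),
    Platform("Daytona", False, False, True),
    Platform("Modal", False, True, False),
    Platform("Blaxel", True, False, False),
    Platform("Cloudflare", False, False, False),
    Platform("Edera", True, False, True),
    Platform("Bunnyshell hopx", True, False, True),
    Platform("Beam (Beta9)", False, True, True),
    Platform("Fly.io Sprites", True, False, False),
    Platform("Vercel Sandbox", True, False, False),
]

# Platforms offering all three: independent kernel, GPU, and BYOC.
trinity = [p.name for p in PLATFORMS if p.independent_kernel and p.gpu and p.byoc]
# Platforms offering GPU and BYOC, regardless of kernel isolation.
gpu_and_byoc = [p.name for p in PLATFORMS if p.gpu and p.byoc]

print(trinity)
print(gpu_and_byoc)
# prints: []
# prints: ['Beam (Beta9)']
```

Under these simplifications the full combination returns nothing, and relaxing the kernel requirement leaves only Beam, which matches the observation above.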

Figure: Sandbox platform positioning: 10 platforms and their key differentiators

Selection Decision Framework

The core decision variables for choosing a sandbox platform aren't "who's best" but "who do you trust, what does your agent do, and where does your data live."

Data compliance (must run in VPC): Ona (OpenAI ecosystem) / Beam Beta9 (open-source self-hosted) / Northflank (K8s BYOC) / nono (K8s transparent injection)
Need GPU: Modal (managed) / Beam (self-hosted option). No other platform in this comparison offers GPU support.
Long-running (monitoring / scheduled tasks): Blaxel (25 ms resume + unlimited standby) / Fly.io Sprites (100 GB persistent storage)
Need security team sign-off: Edera (Trail of Bits audit report) / Daytona (SOC 2 Type II)
Full-stack environments: Bunnyshell hopx (database + API + worker + MCP) / Northflank (full application platform)
Already on K8s: GKE Agent Sandbox (declarative CRD) / Edera (K8s zone) / nono (NRI transparent injection)
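
The framework above can be read as a lookup table, and combining requirements is then an intersection. The sketch below encodes the six rows as listed; the requirement keys (`vpc`, `gpu`, and so on) and the `shortlist` helper are hypothetical names introduced for illustration, and "Beam Beta9" is normalized to "Beam" so the rows can be compared.

```python
# Requirement -> platforms, transcribed from the decision framework.
RECOMMENDATIONS: dict[str, list[str]] = {
    "vpc": ["Ona", "Beam", "Northflank", "nono"],
    "gpu": ["Modal", "Beam"],
    "long_running": ["Blaxel", "Fly.io Sprites"],
    "audit": ["Edera", "Daytona"],
    "full_stack": ["Bunnyshell hopx", "Northflank"],
    "k8s": ["GKE Agent Sandbox", "Edera", "nono"],
}


def shortlist(requirements: list[str]) -> list[str]:
    """Return platforms that appear under every stated requirement."""
    if not requirements:
        return []
    first, *rest = requirements
    return [
        platform
        for platform in RECOMMENDATIONS[first]
        if all(platform in RECOMMENDATIONS[req] for req in rest)
    ]


print(shortlist(["vpc", "gpu"]))           # must run in VPC and needs GPU
print(shortlist(["k8s", "audit"]))         # on K8s and needs sign-off
print(shortlist(["gpu", "long_running"]))  # no single platform covers both
# prints: ['Beam']
# prints: ['Edera']
# prints: []
```

An empty result is the practical signal that a workload needs a combination of platforms rather than a single one.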

A reality check: heading into H2 2026, no single sandbox platform covers all scenarios. Most production systems will use combinations: Anthropic's Managed Agents already uses Cloudflare, Daytona, Modal, and Vercel as four separate execution planes. The core question isn't "which one," but "which quadrant does your agent workload fall into, and which platform is strongest in that quadrant."

Figure: Selection decision path: which quadrant is your agent in?

6. Three Trends Worth Watching

Credential Brokering

Agents need access to GitHub, AWS, internal databases — which means credentials. But credentials can't go into the sandbox (the model might leak them), and can't be handed directly to the agent (prompt injection could steal them).

Anthropic's Managed Agents support Vercel's credential brokering: agent initiates request → proxy layer injects credential → execution → credential immediately purged from memory. Ona built similar scoped credentials. Cloudflare Sandbox's secure credential injection follows the same pattern.
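
The request → inject → execute → purge flow can be sketched as a small proxy object. This is not any vendor's API: `AgentRequest`, `CredentialBroker`, `forward`, and the fake transport are hypothetical, and the "purge" step here only drops the broker's reference, since plain Python cannot guarantee that a secret is wiped from memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentRequest:
    service: str
    url: str
    # Deliberately no credential field: the agent never supplies or sees one.


class CredentialBroker:
    def __init__(self, vault: dict[str, str], allowed: set[str]) -> None:
        self._vault = vault        # secrets live only on the broker side
        self._allowed = allowed    # scoped: services this agent may call
        self.audit_log: list[str] = []

    def forward(self, req: AgentRequest,
                send: Callable[[str, dict[str, str]], str]) -> str:
        if req.service not in self._allowed:
            self.audit_log.append(f"DENY {req.service} {req.url}")
            raise PermissionError(f"{req.service} is out of scope")
        self.audit_log.append(f"ALLOW {req.service} {req.url}")
        # Inject the credential only for the duration of the outbound call.
        headers = {"Authorization": f"Bearer {self._vault[req.service]}"}
        try:
            return send(req.url, headers)
        finally:
            headers.clear()  # drop the reference; not a guaranteed memory wipe


seen: list[str] = []


def fake_send(url: str, headers: dict[str, str]) -> str:
    # Stand-in for a real HTTP client; records what the upstream would receive.
    seen.append(headers["Authorization"])
    return f"200 OK from {url}"


broker = CredentialBroker({"github": "s3cr3t-token"}, allowed={"github"})
response = broker.forward(
    AgentRequest("github", "https://api.github.com/user"), fake_send
)
print(response)
# prints: 200 OK from https://api.github.com/user
```

The upstream service receives the token, but the value returned to the agent contains only the response, and every decision lands in the audit log.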

But there's no industry standard here — every vendor rolls their own. This is likely the next layer that needs standardization — just as MCP standardized Agent-to-tool connections, credential brokering needs a similar protocol to unify the Agent-to-credential interface.

Kubernetes Native

The GKE Agent Sandbox CRD turns sandbox management into a declarative K8s API. If this pattern gains wide adoption, every agent framework could request execution environments through a unified interface, regardless of whether the underlying technology is Firecracker or Docker.

But the current version lacks a behavioral observability layer — you know the sandbox is running, but you don't know what the agent is doing inside it. Edera's K8s zone and nono's NRI injection address this gap from different angles, but neither has become a de facto standard. Behavioral observability is likely the next battleground for K8s-native sandboxes.

Persistence + Fast Resume

Blaxel's 25 ms standby resume is a signal. For long-running agent workloads (continuous monitoring, scheduled CI tasks, event-driven), creating a fresh sandbox via cold start every time is wasteful. Keeping sandboxes on standby and waking them on demand finds the balance between cost and performance.

Fly.io Sprites' 100 GB persistent NVMe + 300 ms restore offers another approach — instead of keeping sandboxes on indefinite standby, give them a high-capacity "memory" so that when restored from a checkpoint, state is fully intact. Both routes solve the same problem: agents need memory, not a fresh start every time.

7. Judgment

OpenAI's timing on this acquisition is sharp. Codex weekly active users grew from 3 million to 5 million in a matter of months, which is impressive growth. But Anthropic's Claude Code leads on SWE-bench Verified at 88.6% vs. Codex's ~49%, and local execution has a clear advantage on complex tasks. OpenAI cannot compete on model capability alone; the execution environment is a gap they had to close.

The bigger trend: agent execution environments are evolving from an infrastructure detail into a standalone product category. This category has four layers of players:

Model providers' integrated solutions (OpenAI/Anthropic/Google): end-to-end experience, but model lock-in
Independent sandbox platforms (E2B/Daytona/Modal/Blaxel/Cloudflare): model-agnostic, each occupying a different quadrant across cold start / isolation / persistence / GPU
Differentiated technology vendors (Edera/Ona Veto/nono): going deep on verifiable security, kernel-level control, and credential integrity
K8s-native solutions (GKE Agent Sandbox): suited for enterprises with existing infrastructure

For developers, the key decision variable for choosing agent tools is shifting as H2 2026 approaches. It's no longer just "which model scores highest on benchmarks" but also "which execution environment matches my security requirements and deployment topology." Model capability is the agent's brain; the execution environment is its hands and feet. The flexibility and safety of those hands and feet is becoming a more critical competitive dimension than raw brainpower.

When this judgment would be wrong: if model capability takes another generational leap in the next 12 months (say, inference accuracy jumping from 90% to 99.9%), execution environment differences could be overshadowed by raw model capability gaps. But based on current iteration speeds across vendors, the gap is narrowing, not widening — which continues to elevate the importance of execution environments.

Disclaimer: This article is based on OpenAI's official announcement (June 11, 2026), CNBC/Bloomberg acquisition coverage, Ona's official blog, Anthropic's engineering blog, public technical documentation and blog posts from E2B/Daytona/Modal/Cloudflare/Edera/Bunnyshell/Beam, Trail of Bits' public audit report, and open-source community resources including awesome-agent-runtime-security (GitHub). Not investment advice. Data as of June 12, 2026.