1. It Starts with a Real Sandbox Escape
The Ona team documented a telling detail on their blog: they had Claude Code executing tasks inside a sandbox, and Claude discovered that /proc/self/root/usr/bin/npx could bypass the blacklist. After Bubblewrap (a lightweight Linux sandbox) closed off that path, the Agent simply shut down the sandbox itself.
This is not a theoretical risk. In another real incident involving Claude Code, a user asked the Agent to clean up files. The Agent executed rm -rf ~/, wiping out the entire home directory—including irreplaceable family photos. Cline (a VS Code extension with 5M+ users) had its npm token stolen via a prompt injection attack.
Agent behavior is inherently probabilistic. Even the most capable models will occasionally produce commands that look plausible but are genuinely dangerous. The execution environment is not an infrastructure detail—it is the security boundary of any Agent system.
On June 11, 2026, OpenAI announced the acquisition of Ona—a startup most people had never heard of, doing something that sounds unglamorous on the surface: providing secure cloud execution environments for AI Agents. Ona CEO Johannes Landgraf put it most directly: "Agents need more than intelligence; they need a trusted workspace."
That statement is being validated across the entire industry right now.
2. Why Now: Why OpenAI Bought Ona
Two things happened simultaneously, pushing this acquisition to the forefront.
First: Codex weekly active users grew from 3 million in April to 5 million—doubling in roughly three months. The pace of user adoption for Agent coding tools has far exceeded expectations.
Second: Anthropic's Claude Code hit 88.6% on SWE-bench Verified (Opus 4.8), while Codex sits at roughly 49%. The advantage of local execution on complex tasks is substantial—real environments mean the Agent can reproduce timing bugs from CI pipelines and fully manipulate git workflows, something cloud sandboxes inherently cannot do.
OpenAI cannot compete on model capability alone. The execution environment is the gap they had to close.
What Ona brings:
Codex's original cloud sandbox was ephemeral—once a task completes, the container is destroyed. But enterprise scenarios demand persistent environments: an Agent might run for hours or even days, maintaining state, accessing internal networks, managing credentials. Ona's core capability is deploying persistent execution environments inside customer VPCs (self-hosted but not self-managed), with built-in credential management, audit logging, and network policies.
More importantly, Ona's most unique technical asset is the Veto kernel module (BPF LSM). Traditional sandboxes intercept by file path—"don't allow running /usr/bin/npx." But an Agent can bypass this through path aliases like /proc/self/root/usr/bin/npx, which is exactly the escape documented in Ona's blog.
Veto takes a fundamentally different approach: it operates at the BPF LSM (Linux Security Module) level, deciding whether to allow execution based on the SHA-256 hash of the binary content. Not "this path can't run," but "a binary with this hash can't run." This eliminates the TOCTOU (Time-of-check to time-of-use) vulnerability window—paths can change, symlinks can change, mount points can change, but the binary content stays the same, and so does the enforcement. Even if an Agent discovers a new path alias or procfs bypass technique, as long as the binary it tries to execute is on the blacklist, it gets blocked.

| Layer | Codex Original | Ona Addition |
|---|---|---|
| Isolation model | Container-level | Container-level + kernel policy (Veto/BPF LSM) |
| Environment lifecycle | Ephemeral (per-task) | Persistent (day-level) |
| Deployment location | OpenAI infrastructure | Customer VPC |
| Credential management | None | Scoped credentials + audit |
| Execution control | Path-level blacklist | Binary hash-level allowlist |
| Target scenario | Individual developer parallel tasks | Enterprise-grade production deployment |
OpenAI needs to convince large enterprises to hand their code repositories over to Agents—and without VPC-internal deployment and comprehensive auditing, enterprise security teams simply won't sign off. That's exactly the gap Ona fills.
3. Four-Vendor Strategy Comparison
Anthropic spins up a full Linux VM per agent inside Claude Cowork; Google built an Agent Sandbox CRD on GKE; Cloudflare uses V8 isolates for Dynamic Workers — every vendor is solving the same problem: where should agent-generated code actually run?
A summary table first, then vendor-by-vendor details:
| Vendor | Core Strategy | Isolation Tech | Security Model | Target User |
|---|---|---|---|---|
| OpenAI | Cloud sandbox + Ona for VPC gap | Seatbelt/container + BPF LSM | Platform-managed first | Enterprise developers |
| Anthropic | Three-tier, scoped by use case | Seatbelt/bwrap/VM/third-party | Tiered by trust level | Full spectrum |
| Four product lines, no unified stack | Varies by product | Fragmented | Google ecosystem users | |
| Cloudflare | V8 isolate, extreme speed | V8 isolate | Language-level isolation | Global edge agents |
OpenAI: Cloud-First Sandbox, Ona Fills the Enterprise Gap
Codex was designed as cloud-first from day one. Give it a task and it clones your repo, edits code, runs tests, and returns results — all inside an OpenAI-managed isolated container. Up to eight subagents execute in parallel, each in its own sandbox.
Local-mode sandboxing: macOS uses Seatbelt (Apple's system-level sandboxing framework); Linux uses Landlock + seccomp for syscall filtering. The core policy is "working directory writable, network denied by default."
The Ona acquisition plugged the enterprise gap: VPC-internal deployment, persistent environments, BPF LSM kernel-level execution control. OpenAI's strategy now runs two parallel tracks — individual developers get Codex cloud sandboxes, enterprises get Ona VPC sandboxes.
Anthropic: Three-Tier Architecture, Isolation Scoped by Use Case
Anthropic has the most complex strategy because it runs agents across three product lines simultaneously, each facing different users and security assumptions.
Tier 1: Claude Code (developer tool) — runs on the local machine with direct access to the real filesystem, terminal, and network. The security boundary relies on OS-level sandboxing (Seatbelt on macOS, bubblewrap on Linux) plus human-in-the-loop approval. This sandbox reduces permission prompts by 84%, but introduces "approval fatigue" — an agent might request permissions dozens of times per hour, and humans start clicking "approve" without thinking.
Tier 2: Claude Cowork (general knowledge work) — launches a full Linux VM using Apple Virtualization Framework on macOS, Hyper-V on Windows. The critical design difference: in Claude Code, a privileged process outside the sandbox can decide to allow a command; in Cowork, the agent loop runs inside the VM with no external process holding an "escape key." The agent doesn't know it's sandboxed, and no component has the authority to grant it exceptions.
Anthropic also ran an interesting internal test: they prepared a sandbox-escape prompt and posted it to an internal Slack channel for discussion. Someone pointed out that internal agents also read Slack — meaning the test prompt itself became a potential attack vector. They had to add a canary string to monitor whether any agent was triggered by it. In a world where agents can read everything, the testing tooling is itself part of the attack surface.
Tier 3: Claude Managed Agents (enterprise) — GA'd May 2026. $0.08/session-hour (plus token costs). The design philosophy is "separation of execution plane and orchestration plane": Anthropic's servers handle scheduling and model inference, but code execution happens in customer-controlled sandboxes (Cloudflare, Daytona, Modal, Vercel as execution planes). Credentials live in a Vault and are never exposed to the model runtime.
Google: Four Product Lines, the Most Fragmented Strategy
Google's agent execution environment strategy is spread across at least four lines with no unified technology selection:
- Jules V2 (coding agent): Async cloud execution, returns PRs, user doesn't need to stay online
- Antigravity 2.0 (terminal agent): Replaces Gemini CLI, supports multi-agent orchestration, defaults to Gemini 3.5 Flash
- Project Mariner (browser agent): Runs inside Chrome, operates only on web pages. Biggest advantage is Gemini's distribution channel of 750M+ users
- GKE Agent Sandbox (infrastructure): A CRD from Kubernetes SIG Apps with three core abstractions — SandboxTemplate (blueprint), SandboxClaim (declarative request), SandboxWarmPool (warm pool, bringing creation time under 1 second)
GKE Agent Sandbox is clearly positioned for enterprises already running Kubernetes infrastructure. But Google's Computer Use offering only runs on the browser surface — Anthropic's reference architecture provides a full desktop environment (Docker running Xvfb + Firefox + LibreOffice), while Google gives the agent only a browser window.
Cloudflare: V8 Isolates as "Fast Sandboxes"
Cloudflare takes the most distinctive approach. No VMs, no containers — it uses V8 isolates, the isolation unit from Chrome's JavaScript engine. Dynamic Workers allow agents to dynamically generate and execute code at runtime, with Cloudflare claiming 100× speed over container-based solutions. The tradeoff: only JavaScript and WebAssembly can run; no native syscalls.
GA'd May 2026 in partnership with Anthropic as one of Claude Managed Agents' execution planes. New additions include secure credential injection (agents never touch credentials), PTY support, persistent code interpreter, snapshot restore, and active-CPU pricing.
Fits: latency-sensitive, globally distributed agent workloads. Doesn't fit: anything requiring GPU or a full Linux environment.
4. Four Levels of Isolation
Strip away the product branding, and the current landscape settles into four isolation tiers. Understanding these tiers is the prerequisite for making sense of every sandbox platform's tradeoffs.
Tier 1: Language / OS-level Sandboxes
Seatbelt (macOS), bubblewrap (Linux), seccomp, Landlock. The agent still shares the host kernel — the OS merely filters which syscalls are permitted. Isolation is a deny-list, not a boundary.
- Cold start: <10 ms
- Security strength: Weakest. A single kernel vulnerability can breach containment. The Claude Code escape documented by Ona occurred at this level.
- Representatives: Claude Code local mode, Codex CLI local mode.
- Why it still exists: For coding agents that need full access to a real development environment, complete isolation means losing environmental fidelity. The design assumption is that the user is an engineer who can assess the risk.
Tier 2: Container-level Isolation
Docker, gVisor, Kata Containers. Stronger than OS-level sandboxes — at minimum, you get independent filesystem and network namespaces. Standard Docker containers still share the host kernel; gVisor adds a user-space syscall interception layer.
- Cold start: 90–150 ms (Daytona claims <90 ms)
- Security strength: Moderate. gVisor blocks most escape paths but isn't hardware-level isolation.
- Representatives: Daytona (Docker), Modal (gVisor), Cloudflare Sandbox (V8 isolate).
- Best fit: Short-lived agent tasks — run code, return results, destroy.
Tier 3: MicroVM-level Isolation
Firecracker (the technology behind AWS Lambda), Apple Virtualization Framework, Hyper-V. Each sandbox gets its own Linux kernel. Escaping requires finding a virtualization-layer vulnerability — the same difficulty as attacking a VM.
- Cold start: 100–200 ms (Bunnyshell hopx ~100 ms, E2B ~150 ms, Blaxel 200–600 ms)
- Security strength: Strongest within an acceptable cold-start budget.
- Representatives: E2B, Ona, Claude Cowork, Blaxel, Bunnyshell hopx, Fly.io Sprites, Vercel Sandbox.
- Core judgment: MicroVMs are the current sweet spot for security vs. speed.
Tier 4: Full VMs
QEMU/KVM, Windows HCS. Full virtualization, strongest isolation, multi-second cold starts. Rarely used for agent workloads — startup latency is prohibitive. The notable exception is Ramp (a fintech company), which chose VM-level isolation for its in-house agent system: each task gets a fully independent environment, driven by financial-industry compliance requirements.
| Isolation Level | Independent Kernel | Cold Start | Cost per Hour (ref.) | Agent Use Case |
|---|---|---|---|---|
| OS-level (Seatbelt/bwrap) | No | <10 ms | $0 (local) | Local developer coding |
| Container (Docker/gVisor) | No | 90–150 ms | ~$0.08 | Short tasks, CI/CD |
| MicroVM (Firecracker) | Yes | 100–200 ms | ~$0.08 | Enterprise agent production |
| Full VM (QEMU/HCS) | Yes | Seconds | Varies | Finance / high-compliance |

5. The Independent Sandbox Platform Landscape
Model providers are building their own execution environments, but a cohort of independent startups is competing for the same market: model-agnostic agent sandboxes.
Mainstream Platforms
E2B: $35M raised, Firecracker microVMs, Python/TypeScript SDK, 88% of the Fortune 100 signed on. ~150 ms cold start, sessions up to 24 hours. Open-source and self-hostable, but production-grade BYOC is enterprise-only (currently AWS and GCP). Perplexity used E2B to ship advanced data analysis for Pro users in one week; Hugging Face used it to reproduce DeepSeek-R1 — integration speed is the core selling point. Weaknesses: no GPU support, 24-hour session cap (1 hour on the free tier).
Daytona: 60k+ GitHub stars (open-source, AGPL), Docker container-level isolation, sub-90 ms cold start, unlimited session duration. BYOC supported. SOC 2 Type I + Type II certified. Weaknesses: shared kernel (Kata Containers available but not default), isolation weaker than microVM. Pivoted from developer environments to AI agent infrastructure in 2024.
Modal: gVisor container-level isolation, 50,000+ concurrent sessions, native GPU support (A100, H100). Suited for ML workloads. Weaknesses: primarily Python (JS/Go in beta), no self-hosting or BYOC.
Blaxel: Positioned around "perpetual sandboxes" — sandboxes can idle indefinitely, resume from standby in 25 ms, with zero compute cost during standby. MicroVM isolation, SOC 2 Type II + ISO 27001 + HIPAA BAA. Also offers Agents Hosting, Batch Jobs, and MCP Server Hosting. Weaknesses: newer entrant, limited community and case-study depth.
Cloudflare Sandbox: GA'd May 2026. Container-level (V8 isolate), global edge network. Strengths: worldwide coverage and active-CPU pricing. Weaknesses: no GPU, cannot run native syscalls.
Differentiated Entrants
Edera: The Only Platform with a Public Security Audit
Edera takes a path that no other sandbox platform follows. It doesn't use Firecracker, gVisor, or Docker — instead, it built a custom KVM-based "zone" abstraction where each agent runs in an independent kernel-isolated compartment, with <1% CPU overhead and 766 ms startup.
The real differentiator is verifiable security. Trail of Bits (a top-tier security audit firm) conducted a four-week public audit: zero critical findings. This is unique in the agent sandbox space — every other platform's security assurance boils down to "we use Firecracker/gVisor, so we should be safe." Edera can say "a third party confirmed we're safe."
Edera's core thesis: seccomp/AppArmor cannot sandbox agents, because these tools require pre-enumerating permitted behaviors — but agents are non-deterministic. The same prompt may produce different syscall sequences across runs. You can't policy-gate what you can't predict. So Edera doesn't intercept at the syscall layer; it isolates at the hypervisor layer — not restricting what the agent does, only limiting its blast radius.
Best fit: enterprises whose security teams require third-party audit reports. Not suited for cold-start-latency-sensitive scenarios.
Bunnyshell hopx: Not Just a Shell — a Full Development Environment Stack
Bunnyshell's hopx.ai represents an evolutionary direction for sandbox products: moving from "give the agent a Linux compute unit" to "give the agent a complete development environment." Firecracker microVM, ~100 ms cold start, but the sandbox ships with a full dev stack — databases, API services, background workers. It also supports native MCP Server — agents can interact with the environment directly via the MCP protocol.
This is genuinely valuable for certain workflows: "spin up a sandbox with a database, run migrations, write tests, verify, and submit a PR" — with E2B you'd assemble the database layer yourself; with hopx it's a sandbox preset.
Beam (Beta9): The Only Option Combining GPU + Self-hosted + Unlimited Duration
Beam's positioning addresses the respective pain points of E2B and Modal: E2B has strong isolation (microVM) but no GPU; Modal has GPU but can't be self-hosted. Beam uses gVisor isolation + broad GPU support (H100, H200, A100 80GB, B200, L40S, RTX 4090/5090) + BYOC (AWS, GCP, Azure, Hetzner) + no session duration limit.
More notably, Beta9 is fully open-source — teams can run the entire sandbox platform on their own infrastructure. Weaknesses: gVisor isolation is weaker than microVM, and cold start runs 1–3 seconds.
nono: Isolation + Credential Security + Config Integrity in One Package
nono doesn't solve "how to isolate" — it solves "what happens after isolation." Most sandbox solutions stop at containment: the agent is locked down, but how are credentials managed? How do you prevent config tampering?
nono uses Landlock for filesystem isolation, then layers on integrity-protected configuration and OS-native key management. The companion tool kubefence is a Kubernetes NRI plugin that can transparently inject nono sandboxes into existing containers and Kata VMs — no application code changes, no Dockerfile modifications. Particularly valuable for teams already running Kubernetes infrastructure.
Fly.io Sprites: The Middle Ground of Persistent Filesystems
Firecracker microVMs, each sandbox comes with 100 GB of persistent NVMe storage. Checkpoint/restore in ~300 ms, scale-to-zero. Positioned between E2B (purely ephemeral) and Blaxel (unlimited standby): agents don't burn money while idle, and when they resume from a snapshot, filesystem state is fully preserved.
Platform Comparison
| Platform | Isolation Model | Independent Kernel | Cold Start | Standby Resume | Session Limit | GPU | BYOC / Self-hosted | MCP Native |
|---|---|---|---|---|---|---|---|---|
| E2B | Firecracker microVM | ✓ | ~150 ms | ~1s | 24h | ✗ | Enterprise (AWS/GCP) | ✗ |
| Daytona | Docker (+ Kata optional) | Optional | ~90 ms | — | Unlimited | ✗ | ✓ | ✗ |
| Modal | gVisor | ✗ | <1s | — | Configurable | ✓ (A100/H100) | ✗ | ✗ |
| Blaxel | microVM | ✓ | 200–600 ms | 25 ms | Unlimited | ✗ | ✗ | ✓ |
| Cloudflare | V8 Isolate | ✗ | <10 ms | — | Configurable | ✗ | ✗ | ✗ |
| Edera | KVM zone | ✓ | 766 ms | — | Configurable | ✗ | K8s native | ✗ |
| Bunnyshell hopx | Firecracker microVM | ✓ | ~100 ms | — | Configurable | ✗ | ✓ | ✓ |
| Beam (Beta9) | gVisor | ✗ | 1–3s | Snapshot restore | Unlimited | ✓ (H100/A100/B200) | ✓ (open-source) | ✗ |
| Fly.io Sprites | Firecracker microVM | ✓ | ~300 ms | ~300 ms | Configurable | ✗ | ✗ | ✗ |
| Vercel Sandbox | Firecracker microVM | ✓ | <1s | Snapshot restore | 45min–5hr | ✗ | ✗ | ✗ |
Three notable blank spots:
- No single platform achieves microVM + GPU + BYOC simultaneously. Beam comes closest but uses gVisor (no independent kernel). "Strongest isolation + compute + data sovereignty" remains an impossible trinity — for now.
- MCP native support is still sparse — only Blaxel and Bunnyshell hopx. MCP is becoming the standard Agent-to-tool protocol (97M monthly downloads, 5800+ servers), but sandbox platforms haven't caught up.
- Cold start and isolation strength are in hard conflict: <10 ms (V8 isolate) → ~100 ms (microVM) → 766 ms (Edera zone) → 1–3s (Beam gVisor). There is no "fast and strong" option.

Selection Decision Framework
The core decision variables for choosing a sandbox platform aren't "who's best" but "who do you trust, what does your agent do, and where does your data live."
- Data compliance (must run in VPC): Ona (OpenAI ecosystem) / Beam Beta9 (open-source self-hosted) / Northflank (K8s BYOC) / nono (K8s transparent injection)
- Need GPU: Modal (managed) / Beam (self-hosted option). No other choices.
- Long-running (monitoring / scheduled tasks): Blaxel (25 ms resume + unlimited standby) / Fly.io Sprites (100 GB persistent storage)
- Need security team sign-off: Edera (Trail of Bits audit report) / Daytona (SOC 2 Type II)
- Full-stack environments: Bunnyshell hopx (database + API + worker + MCP) / Northflank (full application platform)
- Already on K8s: GKE Agent Sandbox (declarative CRD) / Edera (K8s zone) / nono (NRI transparent injection)
A reality check: in H2 2026, no single sandbox platform covers all scenarios. Most production systems will use combinations — Anthropic's Managed Agents already uses Cloudflare, Daytona, Modal, and Vercel as four separate execution planes. The core question isn't "which one," but "which quadrant does your agent workload fall into, and which platform is strongest in that quadrant."

6. Three Trends Worth Watching
Credential Brokering
Agents need access to GitHub, AWS, internal databases — which means credentials. But credentials can't go into the sandbox (the model might leak them), and can't be handed directly to the agent (prompt injection could steal them).
Anthropic's Managed Agents support Vercel's credential brokering: agent initiates request → proxy layer injects credential → execution → credential immediately purged from memory. Ona built similar scoped credentials. Cloudflare Sandbox's secure credential injection follows the same pattern.
But there's no industry standard here — every vendor rolls their own. This is likely the next layer that needs standardization — just as MCP standardized Agent-to-tool connections, credential brokering needs a similar protocol to unify the Agent-to-credential interface.
Kubernetes Native
The GKE Agent Sandbox CRD turns sandbox management into a declarative K8s API. If this pattern gains wide adoption, every agent framework could request execution environments through a unified interface, regardless of whether the underlying technology is Firecracker or Docker.
But the current version lacks a behavioral observability layer — you know the sandbox is running, but you don't know what the agent is doing inside it. Edera's K8s zone and nono's NRI injection address this gap from different angles, but neither has become a de facto standard. Behavioral observability is likely the next battleground for K8s-native sandboxes.
Persistence + Fast Resume
Blaxel's 25 ms standby resume is a signal. For long-running agent workloads (continuous monitoring, scheduled CI tasks, event-driven), creating a fresh sandbox via cold start every time is wasteful. Keeping sandboxes on standby and waking them on demand finds the balance between cost and performance.
Fly.io Sprites' 100 GB persistent NVMe + 300 ms restore offers another approach — instead of keeping sandboxes on indefinite standby, give them a high-capacity "memory" so that when restored from a checkpoint, state is fully intact. Both routes solve the same problem: agents need memory, not a fresh start every time.
7. Judgment
OpenAI's timing on this acquisition is sharp. Codex weekly active users doubled to 5 million in three months — impressive growth. But Anthropic's Claude Code leads on SWE-bench at 88.6% vs. Codex's ~49% — the advantage of local execution on complex tasks is undeniable. OpenAI cannot compete on model capability alone; the execution environment is a gap they had to close.
The bigger trend: agent execution environments are evolving from an infrastructure detail into a standalone product category. This category has four layers of players:
- Model providers' integrated solutions (OpenAI/Anthropic/Google): end-to-end experience, but model lock-in
- Independent sandbox platforms (E2B/Daytona/Modal/Blaxel/Cloudflare): model-agnostic, each occupying a different quadrant across cold start / isolation / persistence / GPU
- Differentiated technology vendors (Edera/Ona Veto/nono): going deep on verifiable security, kernel-level control, and credential integrity
- K8s-native solutions (GKE Agent Sandbox): suited for enterprises with existing infrastructure
For developers, the key decision variable for choosing agent tools has shifted in H2 2026. It's no longer just "which model scores highest on benchmarks" — it's "which execution environment matches my security requirements and deployment topology." Model capability is the agent's brain; the execution environment is its hands and feet. The flexibility and safety of those hands and feet is becoming a more critical competitive dimension than raw brainpower.
When this judgment would be wrong: if model capability takes another generational leap in the next 12 months (say, inference accuracy jumping from 90% to 99.9%), execution environment differences could be overshadowed by raw model capability gaps. But based on current iteration speeds across vendors, the gap is narrowing, not widening — which continues to elevate the importance of execution environments.
Disclaimer: This article is based on OpenAI's official announcement (June 11, 2026), CNBC/Bloomberg acquisition coverage, Ona's official blog, Anthropic's engineering blog, public technical documentation and blog posts from E2B/Daytona/Modal/Cloudflare/Edera/Bunnyshell/Beam, Trail of Bits' public audit report, and open-source community resources including awesome-agent-runtime-security (GitHub). Not investment advice. Data as of June 12, 2026.
