From CLOS to ZCube: Network Topology Evolution for AI Computing Clusters

From Charles Clos's non-blocking telephone switching network in 1953 to ByteDance's SIGCOMM 2025 Best Paper ZCube — topology design has evolved from expert intuition to automated search. ZCube's asymmetric architecture eliminates 60% of switches at 16K GPU scale while boosting training speed by 5%, and Zhipu AI's production deployment proves it works beyond training.

This article is one of two companion pieces to "AI Training Networks: A Technical Panorama," focusing on physical-layer topology design. Companion piece two, "From RoCE to MRC: AI Cluster Transport Protocols and Chip Rearchitecture", dives deep into transport protocols and chip implementations. For the full overview, see "AI Training Networks: From CLOS to ZCube, from RoCE to MRC".

A Counterintuitive Finding: Symmetry Is Not Optimal

One remark from the review of the SIGCOMM 2025 Best Paper is worth noting: all traditional topology designs assume switches are homogeneous — same port count, same specifications. This assumption seems self-evident: identical switches are easier to procure, easier to stock spares for, and easier to operate.

But ATOP (Automated Topology Optimization Pipeline) discovered through automated search that under AI training traffic patterns, asymmetric architectures consistently outperform symmetric ones on the performance-cost trade-off. Not "roughly the same" — at every GPU scale searched, the optimal solution at the Pareto front knee exhibits the same asymmetric characteristics.

The deeper implication: the human preference for "symmetry" is a cognitive bias under AI training workloads. ATOP eliminated this bias and discovered ZCube.

From Clos (1953) to Fat-Tree (2008) to BCube (2009) to Rail-Optimized (2020s) to ZCube (2025), the topology design goal has shifted from "general-purpose non-blocking" to "AI-training-specific," and the design method from "manual design" to "automated search." This evolutionary path is worth tracing in full.

The Starting Point: Three Assumptions of Clos Topology

In 1953, Charles Clos published his theory of non-blocking multistage switching networks. The modern datacenter Fat-Tree is a direct application of the Clos topology. Clos carries three implicit assumptions, each of which collapses under AI training workloads:

Assumption 1: Traffic patterns are unpredictable. Clos is designed to provide full bisection bandwidth for "any input to reach any output." But AI training's collective communication patterns are highly regular — every training step repeats the same AllReduce/All-to-All patterns. You don't need "any-to-any"; you need "optimal for specific patterns."

Assumption 2: Many independent small flows. ECMP hashing achieves statistical evenness with many independent small flows. But AI training produces a small number of elephant flows, and the probability of two large flows being hashed to the same link is non-trivial. Flow collisions cause tail latency — in synchronous training, the slowest transfer determines the entire step time.
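
A rough way to see why a handful of elephant flows collide is a birthday-style calculation. The sketch below assumes ECMP hashes each flow uniformly and independently onto one of the equal-cost links, which is an idealization; the function name and the 32-link example are illustrative, not taken from the paper.

```python
def ecmp_collision_probability(num_flows: int, num_links: int) -> float:
    """Probability that at least two flows land on the same link,
    assuming each flow is hashed uniformly at random (idealized ECMP)."""
    if num_flows > num_links:
        return 1.0  # pigeonhole: some link must carry two flows
    p_no_collision = 1.0
    for i in range(num_flows):
        # the (i+1)-th flow must avoid the i links already taken
        p_no_collision *= (num_links - i) / num_links
    return 1.0 - p_no_collision


# Hypothetical example: a few large flows over 32 equal-cost uplinks.
for flows in (2, 4, 8, 16):
    print(flows, round(ecmp_collision_probability(flows, 32), 3))
```

Even with far fewer flows than links, the collision probability grows quickly, which is the intuition behind the tail-latency concern above.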

Assumption 3: Three layers of switching suffice. A three-layer Fat-Tree built from 64-port switches can connect at most k³/4 = 65,536 endpoints. 100K GPUs require four layers or oversubscription, but every additional switch layer adds a hop of latency, and costs and power consumption escalate sharply.

Topology Evolution Timeline: Clos(1953) → Fat-Tree(2008) → BCube(2009) → Rail(2020s) → ZCube(2025)

Rail-Optimized: From "General Connectivity" to "Matching Communication Patterns"

Core Idea

A GPU server has 8 GPUs internally (e.g., DGX H100), each with its own NIC. Rail-Optimized's idea: instead of connecting all 8 NICs to the same ToR switch, connect GPUs in the same position across all servers to the same switch, forming 8 independent "Rails."

Why It Works for AI

In data parallelism, same-rank GPUs across servers communicate most frequently. Rail topology enables same-rank GPU communication in a single hop; most traffic is absorbed at the Leaf layer and never reaches the Spine.
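
The wiring rule can be sketched as a mapping from (server, GPU position) to a leaf switch. This is a minimal illustration: `tor_switch` and its `servers_per_tor` parameter are hypothetical and included only for contrast with the rail rule.

```python
GPUS_PER_SERVER = 8  # from the text: 8 GPUs per server, one NIC each


def rail_switch(server: int, local_gpu: int) -> int:
    """Rail-optimized wiring: the leaf is chosen by GPU position only,
    so GPU i of every server lands on rail i."""
    return local_gpu


def tor_switch(server: int, local_gpu: int, servers_per_tor: int = 4) -> int:
    """Conventional wiring for contrast (hypothetical grouping):
    all NICs of a server attach to that server's ToR switch."""
    return server // servers_per_tor


def same_leaf(a, b, wiring) -> bool:
    """True if endpoints a and b, given as (server, local_gpu), share a leaf switch."""
    return wiring(*a) == wiring(*b)


# Same-rank GPUs on distant servers share a leaf only under rail wiring.
print(same_leaf((0, 3), (17, 3), rail_switch))  # rail wiring
print(same_leaf((0, 3), (17, 3), tor_switch))   # ToR wiring
```

The flip side is also visible in the mapping: two GPUs in the same server sit on different rails, so cross-rail traffic has to use NVLink inside the server or an upper switch layer.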

Rail-Only vs Rail+Global (ROFT)

Dimension	Rail-Only	Rail+Global (ROFT)
Structure	Rail switches only, no upper-layer interconnect	Rail Leaf + Global Spine
Cost	Significantly lower (no top-layer switches)	Requires additional Spine layer
Use case	DP training with highly localized communication	General AI training (including All-to-All)

NVIDIA's DGX SuperPOD reference architecture uses ROFT. But ROFT's problem is that its three-layer structure means PP traffic needs 3-5 hops (Leaf→Spine→Aggregation→Spine→Leaf), causing severe latency stacking. The switch count is also high — 16K GPU ROFT requires 640 switches.

ZCube: Turning Topology Design into Hyperparameter Search

Paper Background

ZCube comes from the SIGCOMM 2025 Best Paper "From ATOP to ZCube." The first author is Zihan Yan (Tsinghua University); the corresponding author is Dan Li (李丹), who was also a co-author of the 2009 SIGCOMM BCube paper. Core industrial validation comes from ByteDance; co-author Haibin Lin is also a core author of MegaScale (NSDI'24) and ByteScale (SIGCOMM'25).

The name ZCube carries the lineage of BCube — B for Byte/Cube, Z for "the last letter, the ultimate Cube." From BCube to ZCube over 16 years, topology design moved from "manual" to "automated."

The ATOP Methodology

ATOP's core insight: you don't need to search topologies from scratch — instead, distill expert design intuition into a structured hyperparameter space, then search automatically.

Intuitively, what ATOP does: imagine you're designing a multi-room floor plan. The traditional approach is an architect drawing from experience — how many bedrooms, how many bathrooms, how large the living room, how wide the corridors. ATOP's approach: first define a set of tunable parameters ("bedroom count 1-8," "corridor width 1-3m," "living room faces south or east"...), then search across all parameter combinations for the optimum in "best lighting + highest area utilization + shortest circulation paths."

ATOP does exactly this for network topologies: it parameterizes design decisions like "how many switch layers," "how many per layer," "how to connect between layers," "how to connect within a layer," then searches the parameter space for the optimal balance across 14 objectives (training speed, cost, fault tolerance...).

The study surveyed nearly all mainstream topologies (CLOS, Fat-Tree, ROFT, Rail-only, HPN, BCube, DCell, HyperX, Torus, Dragonfly, etc.) and distilled them into 11 categories of searchable hyperparameters:

Inter-layer connections: GPU count, max layer count, nodes per layer, blocking parameter, connection count, bandwidth factor (1-4 corresponding to 200G-800G)
Intra-layer connections: number of dimensions, nodes per dimension, outward connection count, coordinate computation factor

This compresses the search space from O(2^(N²)) adjacency matrices to something searchable on a single CPU server in under 3 days.
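
To get a feel for the size of that compression, the sketch below compares the number of bits needed to describe a raw adjacency matrix with the bits needed for a short vector of bounded integers. The `TopologyGenome` class, its field names, and the assumed 256 values per gene are illustrative only; the paper's actual encoding and ranges may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyGenome:
    """Hypothetical container mirroring the hyperparameters listed above."""
    gpu_count: int
    max_layers: int
    nodes_per_layer: tuple
    blocking: int
    connections: int
    bandwidth_factor: int      # 1-4, i.e. 200G-800G per the text
    dimensions: int
    nodes_per_dimension: int
    outward_connections: int
    coordinate_factor: int


def adjacency_space_bits(num_nodes: int) -> int:
    """log2 of the number of raw adjacency matrices: one bit per node pair."""
    return num_nodes ** 2


def genome_space_bits(range_sizes) -> float:
    """log2 of the number of genomes when gene i can take range_sizes[i] values."""
    return sum(math.log2(r) for r in range_sizes)


print(adjacency_space_bits(16384))     # bits for a 16K-node adjacency matrix
print(genome_space_bits([256] * 10))   # assumed: 10 genes, 256 values each
```

Under these assumptions the structured encoding needs tens of bits where the raw matrix needs hundreds of millions, which is why an evolutionary search becomes tractable.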

NSGA-II evolutionary algorithm: 14 optimization objectives (9 DP/PP/Mixed JCT + 2 MoE JCT + ForestColl all-gather + APS fault tolerance + cost), with non-dominated sorting ensuring multi-objective fairness.
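
The core of non-dominated sorting is the dominance test. The sketch below extracts only the first (rank-1) front and omits the rest of NSGA-II (further fronts, crowding distance, selection, crossover, mutation). It uses two of the fourteen objectives, iteration time and network cost, with values taken from this article's 16,384-GPU comparison; both are minimized.

```python
def dominates(a, b) -> bool:
    """a dominates b if it is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def first_front(named_points):
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, p in named_points.items()
        if not any(dominates(q, p) for q in named_points.values())
    ]


# (GPT-3 175B iteration time in s, network cost in $M) from the article's table
candidates = {
    "ROFT": (5.19, 92.93),
    "Rail-only": (5.15, 76.38),
    "HPN": (5.10, 84.03),
    "ZCube(128,2)": (4.95, 57.28),
}
print(first_front(candidates))
```

With only these two objectives a single candidate dominates the rest; with all fourteen, the front would generally contain many mutually non-dominated topologies.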

Flow-level simulator: max-min fairness + SimGrid congestion modeling, achieving an average error of only 1.5% compared to NS-3 packet-level simulation. Two-stage evaluation reduces 100K full evaluations to ~5,000, a 20× speedup.
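
Max-min fairness at flow level can be sketched with the classic progressive-filling procedure: repeatedly find the most constrained link, freeze the flows crossing it at that link's fair share, and subtract their rate elsewhere. This is a generic textbook sketch, not the paper's simulator; it ignores the SimGrid congestion model and assumes every flow crosses at least one link.

```python
def max_min_rates(flows, capacity):
    """flows: {flow_id: [link, ...]}; capacity: {link: Gbps}.
    Returns {flow_id: rate} under max-min fairness."""
    rates = {}
    remaining = dict(capacity)
    active = {f: set(links) for f, links in flows.items()}
    while active:
        # Fair share each link can still offer to its unfrozen flows.
        share = {}
        for link, cap in remaining.items():
            users = [f for f, links in active.items() if link in links]
            if users:
                share[link] = cap / len(users)
        bottleneck = min(share, key=share.get)
        rate = share[bottleneck]
        # Freeze every flow that crosses the bottleneck at that rate.
        for f in [f for f, links in active.items() if bottleneck in links]:
            rates[f] = rate
            for link in active[f]:
                remaining[link] -= rate
            del active[f]
    return rates


# Hypothetical example: link A is 400G, link B is 200G.
example = max_min_rates(
    {"f1": ["A"], "f2": ["A", "B"], "f3": ["B"]},
    {"A": 400, "B": 200},
)
print(example)
```

In the example, link B is the first bottleneck, so the two flows crossing it are capped there and the flow that uses only link A picks up the capacity they leave behind.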

ATOP Pipeline: Hyperparameter Space → NSGA-II Evolution → Flow-Level Simulation → Pareto Front → ZCube

Search efficiency (single 256-core AMD EPYC server):

GPU Scale	Search Duration	Switch Throughput Limit
256	6.5 hours	6.4 Tbps
1,024	10.6 hours	12.8 Tbps
4,096	25.4 hours	25.6 Tbps
16,384	71.2 hours (~3 days)	51.2 Tbps

ZCube Recursive Definition

Across searches at four scales — 256, 1024, 4096, and 16384 — the architectures at the Pareto front knee were strikingly similar. The paper formalizes this as ZCube.

ZCube(n, 1) = 1 switch + n GPUs
ZCube(n, k+1) = n × ZCube(n, k) + n^k switches

ZCube(n, k) contains: n^k GPUs, k switch layers, n^(k-1) switches per layer, k × n^(k-1) total switches. Each GPU has k NIC ports.

Practical scales:

ZCube(128,2): 16,384 GPUs, 256 switches, each GPU has 2 NIC ports (1×400G NIC split into 2×200G)
ZCube(84,3)-partial: 592,704 GPUs, 84 pods interconnected via CLOS, network diameter 4
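
The recursive definition can be unrolled mechanically to check these scales. The sketch below only counts GPUs and switches; it does not model wiring, and the function name is illustrative. Starting from the base case ZCube(n, 1), the recursion yields n^k GPUs and k × n^(k-1) switches.

```python
def zcube_counts(n: int, k: int):
    """Return (gpus, switches) for ZCube(n, k) by following the recursion."""
    if k == 1:
        return n, 1  # ZCube(n, 1) = 1 switch + n GPUs
    gpus, switches = zcube_counts(n, k - 1)
    # ZCube(n, k) = n copies of ZCube(n, k-1) plus n^(k-1) new switches
    return n * gpus, n * switches + n ** (k - 1)


for n, k in ((128, 2), (32, 2), (84, 3)):
    print((n, k), zcube_counts(n, k))
```

For ZCube(84,3) the switch count printed here is for the full recursive structure; the "partial" variant above interconnects its pods via CLOS instead, so its actual count will differ.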

Asymmetry: Why ZCube Breaks the "Homogeneous Switch" Assumption

This is ZCube's core innovation. In ZCube(n, k):

Edge layers (level-0 and level-(k-1)) require switches with 2n ports
Middle layers (level-1 through level-(k-2)) require switches with 3n ports

Take ZCube(128,2): with k = 2 there are no middle layers, so level-0 and level-1 switches both have 2n = 256 ports (256×200G), but entirely different traffic patterns. Edge layers connect directly to GPUs and primarily carry collective communication traffic (AllReduce/AllGather); core layers primarily carry inter-pod PP traffic.

Why is asymmetric better? Different switch layers carry different traffic patterns with different port utilization. Forcing all layers to use the same port-count switch inevitably over-provisions one layer. Asymmetric design precisely matches each layer's actual requirements.

ZCube Asymmetric Structure: Edge Layers 2n Ports vs Middle Layers 3n Ports

Operational reality check: Asymmetric topology means different layers need different switch specifications (2n vs 3n ports), increasing procurement and spare-parts management complexity. The paper does not deeply discuss operational issues. This is one of the practical barriers to ZCube adoption in production — but if the hardware cost savings are large enough (60% fewer switches), the operational complexity may be worth it.

Diameter: What 2 Hops Means

The paper's Theorem 1 proves: the network diameter of ZCube(n,k) is k.

Topology	GPU Count	Diameter
ZCube(128,2)	16,384	2
3-layer ROFT	16,384	5
ZCube(42,4)	3,111,696	4
128-port 3-layer Fat-Tree	524,288	5

PP traffic requires 3-5 hops in ROFT but at most 2 hops in ZCube(128,2). In the worst case that is 3 fewer hops: 3 fewer switch forwarding latencies, 3 fewer SerDes retiming stages, and 3 fewer O-E-O conversions. For latency-sensitive PP traffic, this gap directly shows up in training speed.

The Sweet Spot: ZCube(32,2) at 1024 GPUs

The ZCube paper's flagship case is 16K GPU ZCube(128,2). But in practice, most training and inference clusters operate at 128-1024 GPU scale. Is ZCube worthwhile at these scales?

A hard constraint: ZCube(n, k) requires each GPU to have k NIC ports. k=2 means each GPU needs 2 ports (1×800G NIC broken out into 2×400G), which most servers can support. k=3 requires 3 ports, which most servers cannot accommodate. So k=2 is the only practical option for ZCube deployment.

Four-Scale Topology Comparison: Switch Count and Port Count for 128/256/512/1024 GPU Configurations

128 GPUs: Flat Is Optimal

A single 128×400G switch directly connecting 128 GPUs: 1 hop, full bisection bandwidth, 0 extra switches. ZCube(11,2), which holds only 11² = 121 GPUs, would already need 22 switches with only 22 ports each: roughly one switch for every 6 GPUs, completely uneconomical.

Conclusion: Use Flat for 128 GPUs, not ZCube.

256 GPUs: Transition Zone

ZCube(16,2) requires 32 switches with 32 ports each. Switch/GPU ratio = 1:8, too dense. Rail-only (8×32-port switches) is simpler — although only same-rail GPUs are 1 hop apart, 256-GPU AllReduce communication patterns are usually well-covered within rails.

A 2-tier Clos also needs only a handful of 128-port switches (4 leaves, plus 2 spines for full bisection bandwidth). At 256 GPU scale, ZCube's advantage is not significant enough.

512 GPUs: ZCube Starts to Show Advantages

ZCube(23,2) requires 46 switches with 46 ports each, with all GPU pairs at 2 hops and full bisection bandwidth. Rail-only uses 8×64-port switches, with cross-rail communication requiring longer paths.

46 switches sounds like a lot, but each has only 46 ports — on Tomahawk 5 (51.2T), this uses less than half the port density, leaving headroom. Switch/GPU = 1:11, entering the reasonable range.

1024 GPUs: ZCube(32,2) Is the Sweet Spot

Scheme	Switches	Switch Ports	Diameter	NIC Ports/GPU	Full Bisection BW
ZCube(32,2)	64	64	2	2	✅
Rail-only (8×128)	8	128	2	1	❌ Cross-rail degraded
2-tier Clos	24 (16 leaf + 8 spine)	128	2	1	✅

ZCube(32,2)'s parameters map almost perfectly to off-the-shelf switches:

64-port switches = Tomahawk 5 standard configuration (64×800G or 128×400G breakout)
2 ports per GPU = 1×800G NIC broken out into 2×400G
64 switches = 4 standard racks (16 switches per rack)
Switch/GPU = 1:16 = proportional scaling of ZCube(128,2)'s 1:64 at smaller scale

Why is 1024 the sweet spot rather than 512? 512-GPU ZCube(23,2) needs 46-port switches — not a standard spec. Real deployment would require 64-port switches with 18 wasted ports, or custom ordering. 1024-GPU ZCube(32,2) needs 64-port switches — exactly Tomahawk 5's native port configuration, zero waste.
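
The port-fit argument can be checked with a small sizing helper. This is a sketch under the article's own assumptions for k = 2 (2n switches, each with 2n ports) and picks the smallest n whose n² covers the requested GPU count; the function name and the default 64-port switch are illustrative.

```python
import math


def zcube2_sizing(gpus: int, switch_ports: int = 64):
    """Size the smallest two-level ZCube(n, 2) that holds `gpus` GPUs."""
    n = math.isqrt(gpus - 1) + 1       # smallest n with n * n >= gpus
    ports_needed = 2 * n               # 2n-port switches, as in the article
    return {
        "n": n,
        "switches": 2 * n,             # k * n^(k-1) with k = 2
        "ports_needed": ports_needed,
        # negative means the chosen switch is too small for this scale
        "unused_ports": switch_ports - ports_needed,
    }


for gpus in (256, 512, 1024):
    print(gpus, zcube2_sizing(gpus))
```

On 64-port hardware the 512-GPU case leaves ports idle on every switch, while the 1024-GPU case uses all of them, which is the sense in which 1024 is the better fit.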

Zhipu AI's ~1000-GPU inference cluster falls right in the 1024 GPU range. At this scale, ZCube saves 1/3 of optical modules and switches while boosting throughput by 15%, a figure that comes from production data rather than theoretical projection.

Sweet Spot Summary

GPU Scale	Recommended Topology	ZCube Worthwhile?
128	Flat (single switch)	❌ No
256	Rail-only or 2-tier Clos	⚠️ Marginally, advantage is small
512	ZCube(23,2) starts to pay off	✅ Worth considering
1024	ZCube(32,2)	✅ Sweet spot
4096	ZCube(64,2)	✅ Paper-validated
16384	ZCube(128,2)	✅ Paper's core case

NVLink Domain Expansion: When the Atomic Unit Changes from GPU to Rack

Impact of NVLink Domain Expansion on ZCube Parameters: Per-GPU vs Per-NVLink-Domain

All ZCube calculations use a single GPU as the atomic unit. But in 2025-2026, the Scale-Up domain is expanding from 8 GPUs (single server) to 72 GPUs (NVL72), and potentially to 256-576 GPUs. When the NVLink domain = 72 GPUs, the Scale-Out network's "atomic unit" shifts from GPU to NVLink domain.

Impact on ZCube parameters:

Parameter	Per GPU	Per NVLink Domain (72 GPUs)
1024 GPU cluster	ZCube(32,2), 64 switches	ZCube(4,2), 8 switches (1024/72 ≈ 14.2, so 15 domains; 4² = 16)
16K GPU cluster	ZCube(128,2), 256 switches	ZCube(16,2), 32 switches (16384/72 ≈ 227.6, so 228 domains; 16² = 256)
NIC ports/unit	2 per GPU	2 per NVLink domain (shared by 72 GPUs within)
Switch ports (2n)	64-256 ports	8-32 ports

Key change: The larger the NVLink domain, the fewer Scale-Out network nodes, the smaller ZCube's n value, and the fewer switches needed. But each NVLink domain has limited external NIC ports — NVL72 typically provisions 8-18 external NIC ports — which constrains the upper bound of k.

Open question: When the NVLink domain expands to 576 GPUs (NVSwitch), a 16K GPU cluster has only ~29 Scale-Out nodes (16384/576 ≈ 28.4). At that point, ZCube(6,2) needs only 12 switches, but 12-port switches are far too small; a 1-hop Flat topology might be better. Scale-Up domain expansion may shift ZCube's advantage zone upward, from 512-16K GPUs to 4K-64K GPUs.

This analysis has no paper support yet; it's simple arithmetic based on ZCube's recursive definition. When NVIDIA's NVLink 5 + Vera Rubin NVLink domain specs are published, it will be worth re-running ATOP search.

16,384 GPU Quantitative Comparison

Using Broadcom Tomahawk 5 (51.2T) switches, each server with 8 GPUs + 8×400G NICs:

Topology	Switches	Cables	GPT-3 175B Iteration	Network Cost
ROFT	640	49,152×400G	5.19s	$92.93M
Rail-only	384	32,768×400G	5.15s	$76.38M
HPN	384	16,384×400G + 32,768×200G	5.10s	$84.03M
ZCube(128,2)	256	49,152×200G	4.95s	$57.28M

Four-Topology Comparison: Switch Count, Network Cost, Training Iteration Time

Key findings:

60% fewer switches (256 vs 640), 33% fewer than Rail-only/HPN
Optical module cost reduced 25%-50%: ZCube uses 200G cables (vs ROFT's 400G)
3%-7% faster training, 26%-46% lower cost — winning on both performance and cost simultaneously
MoE-GPT: ZCube 6.06s vs ROFT 6.41s; BCube(128,2) degrades to 13.79s due to lacking full bisection bandwidth
ForestColl all-gather: all topologies achieve the same theoretical optimum

Key insight: ZCube is not trading off performance against cost — it simultaneously optimizes both on the Pareto front. Fewer switches, cheaper cables, faster training.
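
The relative savings for this one configuration can be recomputed directly from the table rows. The sketch below uses only the 16,384-GPU, GPT-3 175B figures shown above; the headline ranges in the findings list may aggregate other scales and models, so these single-row numbers need not match them exactly.

```python
# Rows copied from the comparison table (iteration time in s, cost in $M).
TABLE = {
    "ROFT": {"switches": 640, "iter_s": 5.19, "cost_musd": 92.93},
    "Rail-only": {"switches": 384, "iter_s": 5.15, "cost_musd": 76.38},
    "HPN": {"switches": 384, "iter_s": 5.10, "cost_musd": 84.03},
    "ZCube(128,2)": {"switches": 256, "iter_s": 4.95, "cost_musd": 57.28},
}


def reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


zcube = TABLE["ZCube(128,2)"]
for name, row in TABLE.items():
    if name == "ZCube(128,2)":
        continue
    print(
        name,
        {metric: round(reduction_pct(row[metric], zcube[metric]), 1) for metric in row},
    )
```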

Fault Tolerance: Fewer Switches = Higher Reliability

Single ToR failure (4K GPUs, GPT-3 175B):

Topology	Performance Degradation
ZCube(64,2)	2.8% (failed GPU traffic switches to alternate NIC port)
HPN	9.0%
Rail-only	15.0%
ROFT	46.9% (ToR failure forces GPU traffic through PXN via NVLink forwarding)

ZCube's fault tolerance advantage comes from two factors: fewer switches (lower failure probability) and each GPU having multiple NIC ports connected to different switches (rapid failover).

Failure-free probability (16,384 GPUs, per-switch failure rate 0.03%):

Topology	Switch Count	Failure-Free Probability
ROFT	640	83%
HPN/Rail-only	384	89%
ZCube	256	93%
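
These percentages follow from a simple model: if each switch fails independently with probability p, the chance that none of N switches fails is (1 - p)^N. The independence assumption is a simplification introduced here for illustration, and the function name is hypothetical.

```python
def failure_free_probability(num_switches: int, per_switch_failure: float = 0.0003) -> float:
    """Probability that no switch fails, assuming independent failures
    with the per-switch rate quoted above (0.03%)."""
    return (1.0 - per_switch_failure) ** num_switches


for name, count in (("ROFT", 640), ("HPN/Rail-only", 384), ("ZCube", 256)):
    print(name, f"{failure_free_probability(count):.1%}")
```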

Link failure degradation curves (1%-15% random link failures): ZCube degrades most gradually with the smallest standard deviation. Fewer switches don't just save money — they are a source of higher reliability.

Production Validation

16 GPU Physical Testbed

The paper built a 16 GPU physical testbed (4 servers × 4 H800 GPUs, 8 Mellanox QM9790 IB switches), comparing ZCube(4,2) with ROFT:

All-reduce: identical performance across all message sizes (1M-16G)
All-to-all: identical performance
Cost: ZCube uses 48×200G links vs ROFT's 32×400G — 25% cost reduction

Zhipu AI Inference Cluster

Zhipu AI (z.ai) upgraded its ~1000-GPU GLM-5.1 inference cluster from ROFT to ZCube. Without changing GPU hardware or modifying applications:

Saved 1/3 of optical modules and switches
15% inference throughput improvement

ZCube's advantage in inference scenarios is particularly pronounced: MoE expert all-to-all, KV Cache migration in prefill/decode (PD) disaggregated setups, and other traffic patterns align naturally with ZCube's asymmetric topology. This is direct evidence that ZCube is not limited to training scenarios.

ZCube and MRC Synergy: Why Two Revolutions Must Be Examined Together

ZCube and MRC are currently evolving independently, but in combination they have natural structural advantages. Understanding these synergies helps explain why each may benefit from the other.

2-Hop Diameter Simplifies Protocol-Layer Failure Management

ZCube's 2-hop diameter is one of its most important topological properties — it doesn't just reduce latency, it directly simplifies transport protocol complexity.

MRC's EV four-state machine (active → congested → suspected_failed → confirmed_failed) needs to determine path liveness within each RTT. In a 5-7 hop three-layer Clos, one RTT can be tens of microseconds, and any intermediate node on the path can be a failure point. In ZCube's 2-hop topology, paths are short with few intermediate nodes — the EV state machine converges faster, and failure determination has higher confidence.

SRv6 encoding also benefits from short paths: 2 hops require only 2-3 uSID segments, making header overhead almost negligible. A 5-7 hop Clos requires 5-7 uSID segments, which even with C-SID compression has cumulative overhead eating into payload space.

Asymmetric Topology Requires Source Routing

Traditional networks rely on ECMP for load balancing, and ECMP assumes multiple equal-cost paths — which implicitly assumes a symmetric topology. ZCube's asymmetric structure (edge layers with 2n ports vs middle layers with 3n ports) creates unequal path counts under ECMP: some links may be overutilized or idle due to port count differences.

MRC's SRv6 static source routing bypasses this problem: the NIC encodes the complete path at packet emission time, and switches forward based on the SRv6 header without needing to understand the topology. This transforms path management for asymmetric topologies from "switches need complex routing protocols" to "NIC-side pre-computation + static encoding."

k Ports per GPU = Natural Multipath Foundation

ZCube requires each GPU to have k NIC ports (k=2 means each GPU connects to 2 different switches), with each port providing an independent path. This is exactly the physical foundation for MRC's multipath transport — packet spraying across 128-256 paths requires GPUs to have sufficient physical egress. ZCube's k=2 configuration (2 independent paths) is the minimum viable configuration; at k=3 or higher, MRC's path diversity becomes more ample.

An Unverified Critical Question

The ZCube + MRC synergy has not been validated by papers or production data. The key unknowns: Does ZCube's asymmetric topology require special adaptation for MRC's source routing? How does standard ECMP actually perform on ZCube (the paper only verified full bisection bandwidth without deeply analyzing ECMP hash distribution on asymmetric topologies)? The answers may determine whether ZCube at scale requires MRC as a mandatory companion, or whether ECMP suffices.

→ For in-depth analysis, see companion piece two: "From RoCE to MRC: AI Cluster Transport Protocols and Chip Rearchitecture".

ATOP's Generality: Beyond Greenfield Datacenters

The paper demonstrates ATOP's applicability across five scenarios:

Greenfield (256-16K GPUs): The most cost-effective topology at every scale is ZCube
Retrofitting existing DCs (4K GPU ROFT rewiring): Replacing only optical modules and cables, ZCube remains optimal
Expansion (1K→4K GPUs): Retaining existing equipment, ZCube remains on the Pareto front
Multi-tenant (16K split across 4 tenants at 4K each): ZCube remains the knee point
Heterogeneous (H100+A100 mixed): Under strict constraints, ZCube falls outside the search space, but ATOP still finds a topology with 6% higher performance and 11% lower cost than ROFT

ATOP's value extends beyond discovering ZCube — it is a reusable topology optimization tool. When new hardware, new models, or new constraints emerge, simply re-run ATOP.

OCS: Another Direction for Topology Evolution

ZCube represents "optimizing topology within the electrical switching framework." There is another evolutionary path: replacing Spine-layer electrical switches with Optical Circuit Switching (OCS).

Google Apollo

Google has deployed OCS at scale since 2022 (codenamed Apollo), using 3D MEMS micromirror arrays to replace the Spine layer. SemiAnalysis estimates Apollo has saved Google over $3 billion in network equipment procurement.

InfiniteHBD (SIGCOMM 2025)

Going further — integrating connectivity and dynamic switching capability at the optical transceiver level. Each optical transceiver can dynamically change its connection target, constructing datacenter-scale high-bandwidth domains.

OCS Applicability Conditions

OCS is unsuitable for packet-level switching (millisecond-scale switching speed), but AI training traffic is coarse-grained — collective communication persists for milliseconds to hundreds of milliseconds. OCS's winning conditions are: large flows, predictable patterns, and sufficient scale for the Spine layer to become a bottleneck.

Dimension	Electrical Switching	OCS
Switching granularity	Packet-level (microseconds)	Circuit-level (milliseconds)
Latency	Multi-hop accumulation	All-optical path, extremely low
Power consumption	Electrical processing per hop	Optical path passthrough
Applicable scenario	General-purpose	Coarse-grained, predictable traffic

Topology Evolution Trend Analysis

Four Directions

From symmetric to asymmetric: ZCube proves asymmetric is superior under AI workloads
From three layers to two layers: OpenAI's multi-plane Clos (MRC) and Microsoft Fairwater both compress to two layers
From electrical to electro-optical hybrid: Google Apollo proves OCS is feasible at the Spine layer
From switch intelligence to endpoint intelligence: MRC shifts routing decisions from switches to NICs

Topology by Scale

GPU Scale	Recommended Topology	Rationale
< 500	1-hop (single 128×800G switch)	Simplest, lowest latency
500 - 4K	ZCube or Rail-only	Two layers suffice, asymmetric optimization yields real benefits
4K - 16K	ZCube(128,2)	Optimal cost/performance/fault-tolerance balance point
16K - 64K	Multi-Plane Clos + ZCube pod	Multi-plane + ZCube as intra-pod topology
64K+	Multi-Plane Clos (MRC) or Fairwater flat network	Requires multipath protocol cooperation

Open Questions

ZCube performance at 64K+ scale is unknown — the paper only validates up to 16K. ATOP search computational cost and simulation accuracy at larger scales are both challenges.
Operational cost of asymmetric topology — different layers require different switch specifications, increasing spare-parts management complexity. Will production environments accept the operational overhead for the performance/cost gains?
ZCube + MRC synergy — the two are currently evolving independently but have natural structural advantages when combined. ZCube requires each GPU to have k NIC ports; at k=2, each GPU has 2 independent paths — exactly the physical foundation for MRC's multipath transport. MRC's EV state machine could sense path state across ZCube's different layers for more precise path selection. But their synergy has not been validated by papers or production data. A key open question: does ZCube's asymmetric topology (2n vs 3n port switches) require special adaptation for MRC's source routing? Or is ECMP sufficient on ZCube?
OCS vs ZCube — competition or complement — ZCube eliminates 60% of electrical switches; OCS attempts to replace the Spine layer with optical switches. Will the two paths converge at some scale point?

Physical Constraints: The Real Ceiling on Topology Choice

The above discussion operates in the theoretical space of topology structure. But actual deployment faces three physical constraints that may limit the freedom of topology choice.

Optical module cost and power. ZCube's sweet spot uses 200G optical modules instead of 400G — lower per-port cost. But optical modules remain one of the largest cost items in the network. For 1024-GPU ZCube(32,2), 49,152 × 200G links at $50-100 each = $2.5-5M, accounting for 40-60% of total network cost. In the 1.6T era, optical module costs may double, compressing the cost savings from topology optimization.

Rack power limits. Each GPU server (8×H100/H200) draws approximately 10-12kW. A standard 42U rack is typically limited to 20-30kW (air-cooled), accommodating at most 2-3 servers. This means ToR switch GPU port density is naturally limited — even if the topology design allows denser connections, physical space and power constraints prevent it. 800V DC distribution can raise rack power limits to 120kW+, but requires entirely new power infrastructure.

Copper vs fiber distance constraints. GPU server to ToR switch cabling is typically 1-3m (direct attach copper, DAC) or 3-10m (active optical cable, AOC). ZCube's asymmetric structure means different layers may be in different physical locations, and inter-layer cabling distances may exceed the effective range of copper, forcing the use of optical modules and increasing cost. ATOP's search space currently assumes all links are equivalent, but physical distance is a hard constraint in deployment.

These three constraints mean: the "optimal topology" from ATOP search needs revalidation under physical constraints. The theoretical Pareto front may shrink due to physical limitations.

ATOP's True Contribution: Methodology > Specific Topology

If the ZCube paper only delivered the ZCube topology, its value would be limited — hardware generations change, models evolve, and optimal topologies shift.

But ATOP's contribution is at the methodology level:

Transforming topology design from "experts relying on intuition" to "automated optimization in a formalized search space"
Eliminating human cognitive biases (symmetry preference, homogeneity preference)
Multi-objective optimization (performance, cost, fault tolerance considered simultaneously)
Reusable — when new hardware, models, or constraints emerge, simply re-run

From BCube (2009) to ZCube (2025), Dan Li's group's 16 years of topology research traces a clear line: topology design is transitioning from "art" to "engineering."

Disclaimer: This article is based on the SIGCOMM 2025 Best Paper "From ATOP to ZCube," public technical materials from ByteDance and Zhipu AI, and public reports on Google Apollo, cross-validated before writing. This article does not constitute investment advice. Data herein is current as of June 1, 2026.
