Two Revolutions, One Network: The Co-Evolution of AI Training Cluster Topology and Protocol
AI training networks at 100K GPU scale are undergoing simultaneous paradigm shifts on two axes: physical-layer topology (ZCube's asymmetric design cuts 60% of switches) and logical-layer protocol (MRC pushes intelligence from switches to NICs). The real story is that these two revolutions are mutually dependent — they must be co-designed. This overview builds a dual-axis framework, covering RoCEv2→MRC→UET protocol evolution, Clos→Rail→ZCube→OCS topology innovation, chip/vendor/cloud landscapes, and a scale-by-scenario decision framework.
This is the overview article of a three-part series. Companion piece one, "From CLOS to ZCube: Network Topology Evolution for AI Computing Clusters", dives deep into physical-layer topology innovation — ATOP automated search, asymmetric structures, sweet-spot scale, physical constraints. Companion piece two, "From RoCE to MRC: AI Cluster Transport Protocols and Chip Rearchitecture", dives deep into transport protocol evolution and chip implementation — EV state machine, SRv6 uSID, NIC/switch chip rearchitecture, China industry gap.
One Problem, Two Dimensions

By late 2025, OpenAI's production cluster runs 131,072 GPUs. ByteDance's training cluster reaches 16,384 GPUs. Google, Meta, Alibaba, and Microsoft are all pushing toward 100K GPU scale.
At this scale, the design assumptions of traditional datacenter networks collapse entirely. But the collapse doesn't happen in one place — it happens simultaneously on two dimensions:
Physical layer: how to connect. Traditional three-tier Clos at 100K GPU requires four switch layers, explosive numbers of optical modules, and severe latency stacking. Topology design shifts from "how to connect more" to "how to connect smarter."
Logical layer: how to route packets. RoCEv2's ECMP hashing inevitably produces flow collisions at scale. PFC creates cascading storms in multi-tier topologies. Dynamic routing convergence takes seconds while training jobs need microsecond-level recovery. Protocol design shifts from "how to route reliably" to "how to route fast and tolerate failures."
Over the past three years, each dimension has produced a paradigm-level breakthrough:
- Topology: ByteDance's SIGCOMM 2025 Best Paper ZCube transformed topology design from "expert intuition" to "automated search," discovering that asymmetric structures consistently outperform symmetric ones in AI training — cutting 60% of switches at the sweet-spot scale.
- Protocol: OpenAI, jointly with NVIDIA/AMD/Broadcom/Cisco/Arista, introduced MRC, simultaneously overturning five consensus assumptions in datacenter networking — rewriting everything from ECMP to PFC to dynamic routing, with the core idea of pushing intelligence from switches to NICs.
But these two revolutions didn't happen independently. ZCube's 2-hop diameter dramatically simplifies MRC's failure detection; MRC's SRv6 static source routing makes asymmetric topology path management feasible. They are two faces of the same problem and must be understood together.
Topology Revolution: From General-Purpose Non-Blocking to AI-Load-Specific
Three Assumptions of Clos
Charles Clos's 1953 non-blocking multistage switching network forms the basis of modern datacenter Fat-Tree topologies. Charles Leiserson brought it into parallel computing in 1985. It carries three implicit assumptions:
- Traffic patterns are unpredictable — requiring full bisection bandwidth for "any input to any output"
- Many independent small flows — ECMP hashing distributes evenly statistically
- Switches are homogeneous — same port count, same specs, simplifying procurement and operations
Under AI training workloads, all three assumptions collapse: collective communication patterns are highly regular (AllReduce/All-to-All repeat the same pattern every step — "any-to-any" is unnecessary); elephant flows collide easily (ECMP is uneven with few large flows); and ATOP's automated search reveals asymmetric structures consistently outperform symmetric ones.
Scale ceiling: A three-tier Fat-Tree with 64-port switches supports ~32K endpoints; four tiers reach ~64K but with sharply higher latency and cost. Three-tier Clos at 100K GPU scale either needs four tiers (higher latency/cost) or oversubscription (no bandwidth guarantee).
Rail-Optimized: From "General Connectivity" to "Matching Communication Patterns"
NVIDIA DGX SuperPOD's Rail+Global design represents the first step in topology evolution — no longer pursuing full connectivity, but matching actual AI training communication patterns.
Core idea: a GPU server has 8 GPUs, each with its own NIC. Instead of connecting all 8 NICs to the same ToR switch, connect GPUs in the same position across all servers to the same switch, forming 8 independent "Rails." Same-rank GPU communication requires only one hop; most traffic is absorbed at the Leaf layer, reducing Spine pressure.
Rail-Only vs Rail+Global: Rail-Only eliminates top-tier switches for lowest cost, but only supports highly localized communication (e.g., pure data parallelism). Rail+Global adds a Spine layer for All-to-All and other global communication, but at higher cost. Rail-Optimized is effective for AI, but remains fundamentally a symmetric topology — it doesn't challenge the "switches must be homogeneous" assumption.
→ Detailed Rail-Optimized design and PD-disaggregated inference analysis in Companion Piece One.
ZCube: Turning Topology Design into Hyperparameter Search
ByteDance, jointly with Tsinghua University, did something simple but unprecedented with ATOP (Automated Topology Optimization Pipeline): encoded all topology design decisions into 11 hyperparameter classes and searched the Pareto front using NSGA-II evolutionary algorithm.
11 hyperparameter classes cover inter-layer connections (GPU count, layer count, nodes per layer, blocking parameters, connection count, bandwidth factor 200G-800G) and intra-layer connections (dimensions, nodes per dimension, outward connections, coordinate computation factors), compressing the search space from O(2^N²) adjacency matrices to something searchable on a single 256-core server in 3 days.
14 optimization objectives: 9 DP/PP/Mixed traffic JCT + 2 MoE JCT + ForestColl all-gather + APS fault tolerance + cost. The flow-level simulator achieves only 1.5% average error versus NS-3 packet-level simulation.
Asymmetric structure discovery: Across searches at 256/1024/4096/16384 GPU scales, the optimal solution at the Pareto front knee consistently exhibits the same asymmetric characteristic — 2n ports for first/last layers, 3n ports for middle layers. The paper formalizes this as ZCube.
ZCube(n,k) recursive definition:
- ZCube(n,1) = 1 switch + n GPUs
- ZCube(n,k+1) = n × ZCube(n,k) + n^k switches
- GPU count = n^(k+1), switch count = (k+1) × n^k, each GPU has k+1 NIC ports
Key property: Network diameter = k (ZCube(128,2) has diameter just 2, vs 5-7 hops for three-tier Clos). Low diameter directly reduces PP traffic completion time — the primary source of end-to-end training speedup.
16K GPU Quantitative Comparison
Using Broadcom Tomahawk 5 (51.2T) switches:
| Topology | Switches | Cables | Training Iteration | Network Cost |
|---|---|---|---|---|
| ROFT | 640 | 49,152×400G | 5.19s | $92.93M |
| Rail-only | 384 | 32,768×400G | 5.15s | $76.38M |
| HPN | 384 | 16,384×400G + 32,768×200G | 5.10s | $84.03M |
| ZCube(128,2) | 256 | 49,152×200G | 4.95s | $57.28M |
ZCube uses 60% fewer switches than ROFT, 200G cables (vs 400G) cutting optical module cost by 25-50%, training 3-7% faster, with 26-46% lower network cost.
Sweet-Spot Scale and Production Validation
ZCube(n,k) requires each GPU to have k NIC ports. k=2 means each GPU needs 2 ports (one 800G NIC breakout into 2×400G), which most servers support. k=3 is beyond most servers.
| GPU Scale | Optimal ZCube | Switches | Worth it? |
|---|---|---|---|
| <500 | — | — | ❌ Flat is optimal |
| 512 | ZCube(23,2) | 46 | ⚠️ Marginal advantage |
| 1024 | ZCube(32,2) | 64 | ✅ Sweet spot |
| 4096 | ZCube(64,2) | 128 | ✅ |
| 16384 | ZCube(128,2) | 256 | ✅ Paper's core case |
1024-GPU ZCube(32,2) is the sweet spot: 64-port switches perfectly map to Tomahawk 5's standard configuration, zero port waste. Zhipu AI's ~1000-GPU inference cluster benefits at this scale — saving 1/3 optical modules and switches, boosting inference throughput 15%.
Fault tolerance: at 16K GPU, single ToR failure causes only 2.8% performance degradation for ZCube (vs 46.9% for ROFT). Fault-free probability: ZCube 93% (vs ROFT 83%) — fewer switches is itself a source of higher reliability.
→ Complete ZCube analysis (ATOP methodology, NVLink domain expansion, fault tolerance, production validation details, OCS comparison) in Companion Piece One.
OCS: Another Path for Topology Evolution
ZCube optimizes topology "within the electrical switching framework." Another path: replace Spine-layer electrical switches with Optical Circuit Switching (OCS).
Google Apollo has deployed OCS at scale since 2022 (3D MEMS micro-mirror arrays), saving an estimated $3B+ (SemiAnalysis). SIGCOMM 2025's InfiniteHBD goes further — integrating dynamic connection capability at the optical transceiver level.
| Dimension | Electrical Switching | OCS |
|---|---|---|
| Switching granularity | Packet-level (μs) | Circuit-level (ms) |
| Latency | Multi-hop accumulation | All-optical path, minimal |
| Power | Electrical processing per hop | Optical path passthrough |
| Best for | General purpose | Coarse-grained, predictable traffic |
OCS wins when: traffic is large, patterns are predictable, and scale is sufficient to make the Spine layer a bottleneck. AI training's collective communication fits perfectly. ZCube and OCS aren't competing — ZCube saves electrical switches, OCS replaces the Spine layer; they complement at different scale points.
Topology Evolution Summary
Four parallel directions:
- From symmetric to asymmetric (ZCube) — ATOP search proves asymmetric is superior
- From three tiers to two tiers (MRC multi-plane Clos) — OpenAI/Microsoft compress to two tiers
- From electrical to optical-electrical hybrid (Google Apollo OCS) — optics replacing electrical in Spine
- From switch intelligence to endpoint intelligence (MRC) — routing decisions shift from switch to NIC
→ Complete topology evolution path, quantitative comparisons, NVLink domain expansion, fault tolerance analysis, physical constraints (optical module cost, cabinet power, cabling distance as hard limits on topology choice) in Companion Piece One.
Protocol Revolution: From Switch Intelligence to NIC Intelligence
RoCEv2: Status and Limitations
RoCEv2 (RDMA over Converged Ethernet v2) is the de facto standard for AI training networks. 2-5μs latency, mature operations, only 0.5%-3% performance gap versus InfiniBand. Meta deploys RoCEv2 + DCQCN + ECN at >30K GPU scale — the industry benchmark.
But as scale grows, three bottlenecks cascade:
ECMP hash collisions. Each flow is hashed to one path. AI training produces few elephant flows (collective communication); two large flows landing on the same link causes congestion. Larger scale = higher collision probability.
PFC storms. RoCEv2 relies on PFC for lossless transport — when receive buffers near full, pause frames tell senders to stop. In multi-tier topologies, pause frames propagate upstream forming head-of-line blocking; pause frames at different priorities can even deadlock each other.
Dynamic routing convergence. BGP/OSPF recalculation after link failure takes tens to hundreds of milliseconds. Training jobs are extremely latency-sensitive — one convergence event can cause AllReduce timeout.
Meta's Ghost paper (SIGCOMM 2024) reveals a deeper problem: link flapping causes topology knowledge invalidation, producing "ghost" nodes. This isn't a "fix RoCEv2" problem — it's systematic design assumption collapse at scale.
RFC 9800: SRv6 Source Routing Infrastructure
RFC 9800 (published June 2025) defines Compressed SRv6 Segment List Encoding (C-SID/uSID/micro-SID). Standard SRv6 SID occupies 128 bits; a 10-segment path needs 160 bytes overhead — unacceptable in large-scale datacenters. C-SID packs multiple 16-32 bit compressed SIDs into one 128-bit container, reducing SRH overhead by 50%+.
Two implementation approaches: NEXT-C-SID (Cisco/F5, shift-and-lookup) and REPLACE-C-SID (China Mobile, deployed in 2022 cloud backbone, 10+ vendor interop tested).
Significance for AI networks: source routing enables multipathing — the sender encodes the complete path in the packet header; switches don't need dynamic routing. C-SID compression makes encoding overhead manageable. Microsecond-level failure bypass — path information is in the packet header, not the forwarding table; NICs detect failures and immediately switch to backup paths.
MRC: Overturning Five Consensus Assumptions
In May 2026, OpenAI jointly with AMD/Broadcom/Intel/Microsoft/NVIDIA released MRC (Multipath Reliable Connection) at OCP. Not a patch on RoCEv2 — a systematic overturning:
| Consensus | Traditional | MRC | Core Change |
|---|---|---|---|
| Load balancing | ECMP hash (flow-level) | Entropy Value packet spraying (packet-level) | Eliminates flow collision |
| Lossless transport | PFC (pause frames) | Disable PFC + selective retransmission | Eliminates head-of-line blocking |
| Ordered delivery | Single-path ordered | Out-of-order direct write (per-packet virtual address) | Eliminates ordering latency |
| Routing | Dynamic (BGP/OSPF) | SRv6 uSID static source routing | Eliminates convergence latency |
| Congestion control | Switch + host cooperative | Switch only marks ECN | Eliminates control plane conflict |
Design philosophy: push intelligence from switches to NICs; let switches return to stateless forwarding.
MRC is a minimal extension to RoCEv2 RC transport, retaining only RDMA Write and Write-with-Immediate (AI workloads need only this subset), reusing existing RDMA Verbs/QP ecosystem. MRC explicitly "draws on multiple techniques from UET" (paper's own words).
Multi-plane two-tier Clos: Each 800G NIC breakouts to 8×100G connecting 8 T0 switches. A 51.2T switch goes from 64×800G to 512×100G; a single plane accommodates 131,072 GPUs. Versus three-tier: optical modules reduced to 2/3, switches to 3/5, longest path only 3 hops.
Production deployment: OpenAI's largest NVIDIA GB200 supercomputer (including Oracle/OCI's Abilene, Texas site), Microsoft Fairwater (Atlanta + Wisconsin). During training, hot-restarting 4 T1 switches required no coordination with the training team — jobs continued running.
UET: Parallel Evolution
UEC (Ultra Ethernet Consortium, 120+ members, Linux Foundation's fastest-growing working group) released UET Specification 1.0 in June 2025. Technical foundation ~75% from HPE Slingshot transport protocol.
UET and MRC share multiple core concepts: packet spraying, out-of-order placement, selective retransmission, packet trimming. Key differences:
| Dimension | MRC | UET |
|---|---|---|
| Design path | Minimal RoCEv2 RC extension | Entirely new transport stack |
| Software interface | RDMA Verbs (Write+WriteImm) | libfabric v2.0 |
| Flow control | Disable PFC | Credit-based |
| Source routing | SRv6 uSID | None (relies on switch routing) |
| Deployment barrier | Medium (MRC NIC + SRv6 switch) | High (entirely new software stack) |
| Production validation | OpenAI/MS 131K GPUs | Spec just released |
AMD's NSCC congestion control algorithm also became part of the UEC congestion control specification. MRC and UET are complementary, not competing — MRC takes the pragmatic fast-deployment path; UET takes the clean-slate long-term evolution path.
InfiniBand: The Last Bastion of Closed Ecosystems
NVIDIA dominates IB through its Mellanox acquisition. XDR (800 Gb/s) is being deployed (Quantum-X800 + ConnectX-8); GDR (1600 Gb/s) is on the roadmap.
IB technical advantages: native lossless (credit-based, no PFC storms), native multipathing, ultra-low latency (1-2μs), NVIDIA full-stack compatibility guarantee. Disadvantages: high cost, vendor lock-in (effectively only NVIDIA), scarce operations talent, closed ecosystem.
Trend: IB retains the high-end market in 2026, but Ethernet (MRC/UET) eroding it is the probable medium-term outcome. NVIDIA itself supports both paths (ConnectX-8 supports both RoCEv2 and MRC). Gartner predicts >65% of generative AI clusters will be Ethernet-based by 2029.
Protocol Comparison Matrix
| Dimension | RoCEv2 | MRC | UET | InfiniBand |
|---|---|---|---|---|
| Multipathing | None (ECMP flow-level) | ✅ Packet spraying 128-256 paths | ✅ Packet spraying | ✅ Adaptive routing |
| Loss recovery | Go-Back-N/selective retransmission | Selective retransmission + trimming | Selective retransmission + trimming | Link-level + transport-level retransmission |
| Flow control | PFC (lossless) | Disable PFC | Credit-based | Credit-based |
| Source routing | None | SRv6 C-SID | None | None |
| Failure recovery | Seconds (routing convergence) | Microseconds (NIC bypass) | Milliseconds (TBD) | Seconds (Subnet Manager) |
| Deployment complexity | Medium | Medium-high | High | Medium (NVIDIA integrated) |
| Cost | Low | Medium | Medium | High |
| Suitable scale | ≤64K GPU | 100K+ GPU | 100K+ GPU | ≤64K GPU (economic scale) |
Standardization Landscape
Three parallel tracks: IETF (SRv6/RFC 9800) provides underlying source routing infrastructure; OCP (MRC) takes the pragmatic path of minimally modifying RoCEv2 for rapid deployment; UEC (UET/UEC 1.0) takes the clean-slate transport stack path. The three aren't mutually exclusive: MRC draws on UET technology; SRv6 serves MRC's source routing needs.
→ Protocol core mechanisms (EV state machine details, SRv6 uSID forwarding process, Packet Trimming, NIC/switch chip quantitative analysis), protocol comparison details and standardization progress (IETF/OCP/UEC/IEEE/IBTA) in Companion Piece Two.
Why These Two Revolutions Are Mutually Prerequisite
This is the overview's core argument: topology and protocol revolutions are deeply coupled co-designs; they cannot be chosen independently.

Synergy 1: Short Diameter Reduces Protocol Complexity
ZCube's 2-hop diameter doesn't just reduce latency — it directly simplifies every key aspect of the transport protocol:
- Faster failure detection: MRC's EV four-state machine (active → congested → suspected_failed → confirmed_failed) judges path health every RTT. 2-hop topology RTT is far shorter than 5-7-hop three-tier Clos; convergence is faster, confidence higher.
- Lower SRv6 overhead: 2 hops need only 2-3 uSID segments; overhead after compression is negligible. 5-7 hops need more segments; accumulated overhead eats into payload.
- Simpler reordering: Fewer intermediate nodes = less out-of-order extent = lighter NIC reorder buffer and SACK logic.
Conversely, MRC works in traditional three-tier Clos, but 5-7-hop paths weaken packet spraying advantage, SRv6 overhead is larger, and failure detection is slower.
Synergy 2: Source Routing Makes Asymmetric Topology Feasible
Traditional ECMP requires multiple equal-cost paths — this implicitly assumes symmetric topology. Asymmetric topology (ZCube's 2n first/last layer vs 3n middle layer) creates unequal path counts under ECMP; some links may be overused or idle.
MRC's SRv6 static source routing bypasses this constraint: NICs encode complete paths at send time; switches forward per SRv6 headers without understanding topology. Path management shifts from "switches need complex routing protocols" to "NIC pre-computation + static encoding."
Without MRC (or similar source routing), ZCube's asymmetric structure would be much harder to manage in production. Conversely, without ZCube-class short-diameter topology, MRC's packet spraying and fast failure detection advantages are weakened.
Synergy 3: Dual Savings from Switch Simplification and Topology Cost
| Dimension | Traditional 3-tier Clos + RoCEv2 | ZCube + MRC |
|---|---|---|
| Switch count | Baseline | -60% |
| Per-switch complexity | Large TCAM/Buffer/PFC needed | Net reduction ~50MB buffer, no dynamic routing |
| Optical module count | Baseline | -40% (shorter paths = fewer modules) |
| Failure recovery | ~100ms (routing convergence) | ~10μs (NIC autonomous bypass) |
| Paths per flow | 1 | 128-256 |
| NIC overhead | Baseline | +16KB/QP (EV/SRv6/SACK/retransmit/OOO) |
Network cost reduction comes not from topology alone or protocol alone, but from both together. Switches become simpler, so reducing count doesn't sacrifice reliability; topology becomes shallower, so protocol failure detection windows shrink.
Not Two Independent Choices
- ZCube with RoCEv2: Asymmetric topology creates ECMP path management difficulties; PFC cascading effects persist
- MRC with traditional 3-tier Clos: 5-7 hops weaken packet spraying; SRv6 overhead larger; failure detection slower
- Only both together achieve 2-hop + source routing + no PFC + microsecond-level failure recovery
→ More detailed technical analysis: Companion Piece One's "ZCube and MRC Synergy" section (how 2-hop diameter simplifies EV state machine, why asymmetric topology needs source routing, relationship between k ports and multipathing), and Companion Piece Two's "MRC and Asymmetric Topology Synergy" section (source routing enables asymmetry, short paths amplify packet spraying, unexplored joint optimization space).
Chip Industry Landscape
NIC: From Accessory to Core
Traditional NICs are server peripherals — simple function, low differentiation. MRC makes NICs the core of network intelligence — EV state machine, SRv6 encoding, packet spraying scheduling, and out-of-order reassembly all happen on the NIC. Per-QP state balloons from 512 bytes to ~16KB (EV set 2KB + SRv6 mapping 4KB + SACK 0.5KB + retransmit 8KB + OOO tracker 1.5KB); at 2000+ QP scale, on-chip SRAM is insufficient, requiring DDR or HBM.
| Vendor | Product | Strategy | SRv6 | MRC |
|---|---|---|---|---|
| NVIDIA | ConnectX-8 | Firmware + DDR cache | ✅ | ✅ |
| AMD | Pollara 400 | Hardware + HBM cache | ✅ (via UEC) | ✅ (first compatible) |
| Broadcom | Thor Ultra | NPL programmable | ✅ native | ✅ native |
NIC die cost is trending up: MRC functionality occupies ~15-20% of die area. Acceptable for NVIDIA/Broadcom; a higher barrier for newcomers.
Switches: Simpler But Not Cheaper
MRC returns switches to stateless forwarding — net reduction ~50MB buffer + significant TCAM. But bandwidth demand grows exponentially: 102.4T → 204.8T requires more advanced SerDes (200G→400G/lane) and CPO (co-packaged optics).
| Chip | Generation | Bandwidth | MRC Support | Key Features |
|---|---|---|---|---|
| Broadcom TH6 | 102.4T | 64×800G / 128×400G / 512×100G | ✅ Hardware | Cognitive Routing 2.0, Packet Trimming (CSIG) |
| Cisco G300 | 102.4T | 64×800G | ✅ P4 | uSID acceleration, high programmability |
| NVIDIA Spectrum-6 | 102.4T | 64×800G | Limited/roadmap | Unified IB ops stack |
| Marvell Teralynx 10 | 51.2T | 64×800G | — | No multipathing support currently |
| Huawei CloudEngine | — | — | — | China market leader |
Chip × Protocol Matrix: Broadcom leads in MRC support with TH6 + Thor Ultra end-to-end; Cisco maintains flexibility through P4 programmability; NVIDIA keeps exclusivity in IB ecosystem. MRC/UET support is the key differentiator in the 102.4T generation — switch chips without multipath reliable transport will be marginalized in the AI market.
Path to 204.8T: 200G/lane SerDes design difficulty grows exponentially; CPO shifts from "optional" to "essential"; cross-die coherence in chiplet architectures is an unsolved problem.
Competitive Landscape
| Strategy | Players | Advantage | Disadvantage |
|---|---|---|---|
| Closed full-stack | NVIDIA (IB+ETH+GPU), Google (TPU+ICI+OCS) | Peak performance, tight integration | High cost, vendor lock-in |
| Chip + Device | Cisco (Silicon One+Nexus), Huawei (in-house+CloudEngine) | Differentiation + control | Narrower ecosystem |
| Device + Software | Arista (Broadcom chip+EOS) | Software differentiation | Chip dependency on Broadcom |
| Component Supply | Broadcom, Marvell | Horizontal platform | Lower margins |
The open ecosystem (MRC/OCP + UEC, 120+ members) is systematically challenging closed ecosystems. Most hyperscalers choose a hybrid strategy: NVIDIA IB for core training, open Ethernet for scale-out and inference.
Who Uses What
| Cloud Provider | Network Solution | Scale | Key Characteristic |
|---|---|---|---|
| OpenAI/MS | MRC + multi-plane 2-tier Clos | 131K GPU | Largest MRC production deployment |
| OCS + ICI + Virgo | TPU full-stack in-house | Only fully self-developed path | |
| Meta | RoCEv2 + large-scale tuning | >30K GPU | Largest RoCEv2 + ECN/DCQCN deployment; Ghost paper reveals reliability risks |
| ByteDance | ZCube + Rail-Optimized | 16K GPU | ZCube paper source |
| Alibaba | HPN + Stellar | >15K GPU | Most SIGCOMM 2025 contributions (11 papers) |
| AWS | EFA/SRD | In-house | Non-mainstream approach |
| xAI | Ethernet | — | Arista + Broadcom |
Software-side common trends:
- AI-driven network operations (AgenticOps / Intent-Based Networking) becoming the new management battlefield
- Shift from "manually tune PFC/ECN parameters" to "AI adaptive optimization"
- Dramatically increased observability investment — 100K GPU network state cannot be inspected manually
China market specifics:
- Domestic substitution is irreversible; Huawei + H3C + Ruijie dominate
- Domestic 800G switch shipments grew from 15K (2023) to 60K (2025), CAGR >100%
- DeepSeek and other domestic LLMs driving inference-side 200G/400G switch demand
- SIGCOMM 2025 Chinese institutional contributions are exceptionally prominent (Alibaba 11 papers, ByteDance 2 best papers, Tsinghua/PKU/HKUST) — hyperscale practice demands drive systematic academic innovation
Decision Framework

Topology by Scale × Protocol by Scenario
| GPU Scale | Recommended Topology | Recommended Protocol | Rationale |
|---|---|---|---|
| >50K | Multi-plane 2-tier Clos | MRC + SRv6 | Failure recovery and packet spraying are essential |
| 10K-50K | Multi-plane or Rail+Global | MRC or RoCEv2+tuning | Transition zone |
| 1K-10K | ZCube / Rail-Optimized | RoCEv2 or MRC | ZCube sweet spot |
| <1K | 1-hop / Flat | RoCEv2 | Protocol choice insensitive |
Scenario Differentiation
Large-scale synchronous pre-training (>10K GPU): Most sensitive to tail latency and failure recovery. MRC's packet spraying and microsecond-level failure bypass are optimal. Multi-plane 2-tier Clos provides shortest paths.
Medium-scale training (1K-10K): RoCEv2 + DCQCN + ECN tuning is manageable at current scale. ZCube provides better cost efficiency.
Inference services: More sensitive to cost and throughput; lower tail latency requirements. ZCube's advantage is most pronounced here (Zhipu AI: 15% throughput + 1/3 hardware savings).
Mixed workloads (training+inference+general): Consider RoCEv2 + UET evolution path, or zoned deployment (training zone MRC + inference zone ZCube).
China Industry: Gap and Opportunity
→ Full analysis in Companion Piece Two's China industry section. Core judgments:
- Hardware gap: Domestic NIC chips lag 1-2 generations in MRC support; cannot produce ConnectX-8/Thor Ultra equivalents short-term
- Software gap: MRC's open-source implementation (OCP) provides a catch-up window, but requires deep standards participation
- Opportunity: China's hyperscale deployment demands are driving original academic contributions. ZCube itself is a ByteDance+Tsinghua collaboration
Key Judgments and Risks
Core Judgments
Judgment 1: Ethernet will become the mainstream AI backend network within 3-5 years. RoCEv2 + MRC/UET systematically resolve Ethernet's three core shortcomings in AI (single path, PFC storms, slow failure recovery). InfiniBand won't disappear but retreats from mainstream to a latency-sensitive high-end niche.
Judgment 2: MRC is the most aggressive Ethernet solution today. MRC + SRv6 + multi-plane Clos represents the frontier: eliminate dynamic routing, replace with source routing, replace ECMP with packet spraying, replace multi-tier with multi-plane. OpenAI and Microsoft production validation provides the strongest practical endorsement.
Judgment 3: Topology design shifts from "engineering intuition" to "automated search." ATOP's methodological contribution is greater than ZCube itself — reusable, eliminates cognitive bias, supports multi-objective. When new hardware/models appear, just re-run ATOP.
Judgment 4: The open ecosystem is systematically challenging closed ecosystems. MRC (OCP open source), UET (UEC 120+ members), P4 programmable chips, multi-vendor devices — building high-performance AI networks without NVIDIA lock-in. But NVIDIA's top-end full-stack optimization remains irreplaceable.
Judgment 5: Chinese institutions are already globally leading in AI networking academic research. Not accidental — hyperscale practice demands drive systematic innovation.
Key Risks
- MRC interoperability: Not yet fully validated in multi-vendor heterogeneous environments. UET is still early; from spec to large-scale deployment takes 1-2 years.
- PFC alternative uncertainty: MRC disables PFC, UET uses credit-based, SIGCOMM 2025 DCP proposes a third path — which approach works reliably at the broadest scale needs more validation.
- Ghost problem: Link flapping causing topology knowledge invalidation may become a systemic risk at 100K GPU scale; faster failure detection alone cannot fundamentally solve it.
- 1.6T physical layer: 200G/lane SerDes difficulty grows exponentially; CPO serviceability and supply chain unresolved; 204.8T chiplet cross-die coherence is unknown.
- Supply chain geopolitics: Export controls and domestic substitution requirements affect equipment availability and cost.
Three-Year Roadmap
2026: MRC begins small-scale deployment beyond OpenAI/MS. ZCube trials by vendors beyond ByteDance/Zhipu. UEC 1.0 interoperability testing begins. 102.4T switch chips enter mass production.
2027: 800V power distribution + 102.4T + MRC "golden combination" becomes the default for new hyperscale training clusters. OCS pilots in non-Google environments. 1.6T ports and CPO begin small-scale deployment.
2028: UET ecosystem matures, complementing MRC. 204.8T chips in mass production. Topology-protocol co-design becomes mainstream methodology — ATOP-class tools integrate MRC constraints for joint search.
Advice by Role
AI infrastructure decision-makers:
- Short-term (2026): New training clusters should prioritize Ethernet + RoCEv2 with MRC-upgradable equipment; 1K-10K scale consider ZCube
- Medium-term (2027-2028): Evaluate MRC/UET migration as ecosystem matures; watch 1.6T and OCS deployment timing
- Avoid lock-in: Choose P4 programmable switch chips for protocol evolution headroom
Network equipment vendors:
- Differentiation shifts from "speed competition" to "architecture competition" — Buffer management, programmability, load balancing strategy
- Software value: AI-driven network operations (AgenticOps), Intent-Based Networking (IBN) are the new battlefield
- Chinese vendors: Domestic substitution window accelerating; Huawei full-stack + H3C DDC innovation has differentiation space
Chip vendors:
- MRC/UET support is the key differentiator in the 102.4T generation
- CPO capability is the entry ticket for the 204.8T generation
- P4 programmability provides "standards not yet set, chips first" insurance for customers
Researchers and investors:
- Watch: MRC adoption speed, UEC interoperability results, ZCube validation at 64K+, OCS deployment outside Google
- Investment directions: Optical interconnect (CPO/DSP/silicon photonics), open Ethernet ecosystem (UEC/MRC), AI network management, China domestic substitution
Appendix: Paper and Standards Tracking
SIGCOMM/NSDI Papers
SIGCOMM 2025 (core AI networking papers):
- ZCube / ATOP (ByteDance + Tsinghua) — Best Paper, automated topology search
- InfiniteHBD (OCS optical circuit switching new approach)
- DCP (de-PFC congestion control new approach)
SIGCOMM 2024:
- Ghost in the Datacenter (Meta) — link flapping causes topology knowledge invalidation
- MegaScale / ByteScale (ByteDance) — large-scale training systems engineering
IETF / OCP / UEC / IEEE / IBTA Standards
→ See Companion Piece Two's standardization landscape section for details. Core standards: RFC 9800 (SRv6 C-SID), MRC 1.0 (OCP), UET 1.0 (UEC), 802.3dj (1.6T Ethernet in progress), XDR (IBTA).
Three-Part Series Navigation
| Article | Focus | Core Content |
|---|---|---|
| This overview | Dual-axis synergy framework | Topology × Protocol cross-relationships, chip/device/cloud landscape, decision framework, risks and roadmap |
| Companion One: Topology | Physical-layer deep dive | ATOP methodology, asymmetric structures, sweet-spot quantitative comparison, NVLink domain expansion, fault tolerance, production validation, OCS, physical constraints |
| Companion Two: Protocol | Logical-layer deep dive | MRC five consensus overturns, EV state machine, SRv6 uSID forwarding, Per-QP state, three NIC comparison, switch resource add/subtract, China industry gap |
