Starting from ZCube: Network Topology Design for PD Disaggregated Inference

As inference replaces training as the primary battlefield for AI infrastructure, GPU cluster network topology needs to be rethought from scratch.

How do you connect tens of thousands of GPUs so that large language models can serve users quickly and efficiently? This article tackles that question. As AI workloads shift from training (building models) to inference (serving them), the assumptions underlying today's GPU network designs are breaking down. We start by examining ZCube, a recently deployed architecture that challenges conventional thinking, identify where it excels and where it falls short, and then propose two new approaches — RailFly for mixed workloads and Rail-Fabric for pure inference — that push further. The argument proceeds step by step: what the bottleneck is, how big it is, how to provision for it, and how to wire the network accordingly.

1. Why Network Topology Suddenly Matters

Before diving in, two quick definitions for readers who aren't AI infrastructure specialists. Training means using massive datasets to build a model — it's a batch computation that takes days or weeks. Inference means using a trained model to respond to user requests in real time. A growing trend in inference is PD disaggregation (Prefill-Decode Disaggregation): splitting each request into two stages handled by different GPUs. The Prefill (P) stage reads and processes the user's input prompt; the Decode (D) stage then generates the model's response, token by token. Separating these stages onto dedicated GPUs allows each to be optimized independently — but it means large amounts of intermediate data must travel across the network from P to D between stages. Understanding this data movement is the key to understanding the rest of this article.

As large model clusters evolve from thousands to tens of thousands of GPUs, raw GPU compute is no longer the sole bottleneck. The network links connecting that compute now directly determine a cluster's effective throughput.

Two landmark events occurred almost simultaneously:

May 5, 2026: OpenAI, together with NVIDIA, AMD, Intel, Microsoft, and Broadcom, released the MRC (Multipath Reliable Connection) protocol — solving the network communication bottleneck for ultra-large-scale GPU clusters at the transport protocol layer.
May 21, 2026: The ZCube networking architecture (ByteDance + Tsinghua University, SIGCOMM 2025 Best Paper) completed deployment validation on Zhipu AI's GLM-5.1 thousand-GPU inference production cluster — solving the same problem at the topology architecture layer.

These represent two complementary technical paths: protocol-layer optimization (MRC) and architecture-layer restructuring (ZCube).

This article uses ZCube as a baseline, analyzes its core value and limitations, derives P:D ratios and placement strategies for PD disaggregated inference, and then proposes RailFly — a pragmatic topology family covering mixed training+inference workloads, and its pure-inference evolution Rail-Fabric.

ROFT's Design Assumptions Are Breaking Down

Current mainstream GPU cluster networking is based on ROFT (Rail-Optimized Fat-Tree) — a variant of the classic Clos topology, which can be visualized as a tree of layered switches where data flows up from GPUs through lower-tier (leaf) switches to upper-tier (spine) switches, then back down to the destination. ROFT extends this idea by grouping GPUs onto the same "rail" (a set of same-rank GPUs sharing a switch) so that the most frequent communication stays within one rail and completes in a single hop. The design assumption is that same-rank GPUs communicate the most, so traffic naturally aligns with rails.

Prefill-Decode disaggregated inference breaks this assumption. KV Cache — the intermediate data that stores "already-processed context" during LLM inference, typically ranging from hundreds of megabytes to over 10 GB per request — must be transmitted cross-node from Prefill nodes to Decode nodes. This transfer traffic exhibits three characteristics:

Strong source-destination asymmetry: The KV Cache transfer load across different GPUs can differ by orders of magnitude.
Dynamic variation: Which P nodes pair with which D nodes is determined in real-time by the scheduler and is not predictable.
Burstiness: Long-context requests generate large KV Cache transfers with high instantaneous bandwidth demands.

ROFT's rail mapping fails under this traffic pattern — traffic gets concentrated onto a few switches and links, triggering PFC backpressure, creating a structural problem of "abundant total bandwidth, frequent local congestion."

2. ZCube's Core Value and Limitations

How ZCube Was Born: From Automated Search to Asymmetric Topology

ZCube was not hand-designed by human experts — it was discovered by ATOP (Automated Topology Optimization Pipeline) through automated search across 256 to 16,384 GPU configurations.

ATOP's key insight: model network topology as 11 searchable hyperparameter classes, distilled from expert knowledge of mainstream topologies like CLOS, BCube, and Dragonfly. This transforms topology design into a structured hyperparameter search problem. The search space is far smaller than the O(2^N²) of brute-force adjacency matrices, yet covers structural variants of all mainstream topologies. ATOP uses the NSGA-II evolutionary algorithm (best among four candidates), combined with a custom flow-level simulator (average error of only 1.5% vs. NS-3), searching for Pareto-optimal topologies across GPU training performance, fault tolerance, and network cost.

Key finding: Across four GPU scales (256, 1024, 4096, 16384), the architectures at the Pareto front's inflection point (best cost-effectiveness) were strikingly similar — the paper formalized this as ZCube. ATOP broke through human experts' "symmetry bias" — traditional topologies give all switches the same port count, but ATOP found asymmetric designs superior. From BCube (SIGCOMM 2009, corresponding author Dan Li) to ZCube (SIGCOMM 2025, same author), the naming lineage means "the last letter, the ultimate Cube."

Architecture Design

ZCube completely breaks away from Clos hierarchical thinking, restructuring the network around a recursively defined asymmetric topology family:

Recursive definition (paper Figure 8a):

ZCube(n, 1) = 1 switch + n GPUs (minimal unit)
ZCube(n, k+1) = n × ZCube(n, k) + n^k switches

Therefore ZCube(n, k) contains n^(k+1) GPUs, k+1 layers of switches (n^k per layer), and each GPU has (k+1) NIC ports. The paper proves network diameter = k (Theorem 1).

Core innovation: asymmetry. First and last layer switches require 2n ports, while intermediate layers require 3n ports. This differs fundamentally from traditional topologies (Fat-Tree, BCube, Dragonfly) where all switches use the same port count. ATOP's automated search found that different switch layers carry different traffic patterns, and asymmetric port allocation matches actual demand more precisely.

ZCube(128,2) concrete construction — connecting 16,384 GPUs with diameter of only 2 hops, using only 256 switches. This is the paper's primary validation scale and the reference architecture for Zhipu AI's GLM-5.1 deployment:

Two layers of switches, 128 per layer, 256 total. These can be thought of as odd-numbered and even-numbered groups.
Each GPU has 2 NIC ports (400G NIC split into 2×200G). Port 1 connects to one odd-layer switch, Port 2 to one even-layer switch.
Full bipartite graph interconnection between the two layers — this is the engineering implementation for k=2.

All GPUs reachable within 2 hops (vs. ROFT's 5 hops). The routing goal is to have GPU pairs traverse unique/deterministic preferred paths, reducing the local hotspots and PFC backpressure caused by ECMP or static rail mapping.

Production Validation Data

The thousand-GPU production cluster for GLM-5.1 coding inference was migrated from ROFT to ZCube, with GPU hardware, software stack, and application unchanged:

Switch and optical module CapEx reduced by ~33%
Average GPU inference throughput +15%
TTFT P99 reduced by 40.6%

The original paper also conducted a 32-GPU bandwidth ablation: increasing NIC available bandwidth from 100Gbps to 200Gbps improved throughput by ~19% and reduced TTFT by ~22%. This shows the bottleneck isn't purely GPU compute — it's KV Cache transfer and network hotspots under PD disaggregation that affect effective throughput.

Network cost-performance rough calculation: Network hardware cost becomes 0.67×, throughput becomes 1.15×, network hardware/throughput cost ≈ 0.67/1.15 ≈ 0.58, meaning network-layer cost-performance improves by ~70%.

Paper's 16,384 GPU quantitative comparison (Table 6):

Topology	Switches	Cables	Network Cost
ROFT	640	49,152×400G	~$93M
Rail-only	384	32,768×400G	~$76M
HPN	384	16,384×400G + 32,768×200G	~$84M
ZCube(128,2)	256	49,152×200G	~$57M

ZCube improves end-to-end training speed by 3%–7% and reduces network hardware costs by 26%–46% compared to ROFT, Rail-only, and HPN. It uses 60% fewer switches than ROFT (256 vs. 640) and uses 200G cables (vs. ROFT's 400G), further reducing optical module costs.

Four Core Advantages

First, it disperses PD traffic's source/destination/scale asymmetry. Multi-layer switches + multi-port access distributes dynamic P→D KV Cache traffic more evenly than Clos/ROFT.

Second, deterministic routing eliminates structural congestion. The topology itself doesn't assume any specific traffic pattern, eliminating topology-induced congestion hotspots at the root.

Third, gains come from reducing structural congestion, not adding GPUs. "Effective throughput per unit of network CapEx" improves significantly.

Fourth, structural fault tolerance. ZCube's fault tolerance comes from its multi-port structure itself — each GPU has (k+1) NIC ports connecting to different-layer switches. When a single switch fails, affected GPUs simply switch to another port without rerouting through NVLink. Paper data (Figure 3b): in a 4K GPU cluster with a single ToR failure, ZCube performance degrades only 2.8%, compared to 46.9% for ROFT and 9.0% for HPN. Furthermore, ZCube uses fewer switches (256 vs. ROFT's 640), meaning fewer failure points — the no-failure probability for a 16K cluster is 93% vs. ROFT's 83%. This is a more elegant fault tolerance path than HPN's "dual-ToR hardware redundancy": a better structure that simultaneously improves performance and reliability.

Four Practical Limitations

1. Physical topology restructuring, not a software patch. The original paper explicitly states that Clos-era cabling, IP addressing, routing policies, and switch configurations cannot be reused directly — they require redesign and automated validation.

2. It solves "avoidable congestion," not last-hop incast. When multiple Prefill nodes simultaneously push KV to the same Decode group, you still need scheduling, congestion control, and traffic shaping.

3. Full bipartite graph imposes high requirements on cabling and port counts. At larger scales, it increasingly depends on high-radix switches, breakout, automated cabling validation, and topology-aware scheduling.

4. Training scenarios lose rank locality. Same-rank GPUs no longer connect to the same switch, so DP/PP communication during training goes from 1 hop to 2 hops.

3. GLM-5.1's Inference Service Unit and KV Cache Bottleneck

ZCube's strengths and limitations are now clear. But to reason about which topology to adopt and how to configure it, we need to ground the discussion in concrete numbers. The core benefit of topology optimization depends on the scale and pattern of cross-node KV transfers — so quantifying them is the necessary next step before we can derive reasonable P:D ratios.

Before discussing topology choices, we need to understand where the bottleneck actually is.

GLM-5.1 official specs: 744B-A40B, available in BF16 and FP8 versions; official vLLM/SGLang examples use --tensor-parallel-size 8, so an 8-GPU node/TP group is the minimal complete inference unit in practice.

Official parameters: maximum context 200K, maximum output 128K; input $1.4/MTok, output $4.4/MTok, output/input price ratio ~3.14. Price doesn't equal true cost, but it's a usable proxy: the serving cost/resource pressure per generated token is roughly 3× that per input token.

KV Cache Per-Token Estimation

Using the public config, a rough estimate for GLM-5.1's MLA/DSA-style KV latent cache stored in BF16:

78 × (512 + 64) × 2 ≈ 89,856 Bytes/token

78 is the number of layers, 512 is kv_lora_rank, 64 is qk_rope_head_dim, 2 is BF16 bytes. If actual KV Cache uses FP8 quantization, this roughly halves.

Prompt Length	P→D Total KV/Request	Per-Card Share at TP=8
32K	~2.94 GB	~370 MB
128K	~11.5 GB	~1.4 GB
200K	~18.2 GB	~2.3 GB

This is the fundamental reason topology optimizations can affect TTFT P99: each inference request generates hundreds of MB to several GB of cross-node transfers, and the network topology determines whether these transfers are evenly distributed or concentrated on a few hotspot links.

4. PD Disaggregation P:D Ratio Estimation

With KV volumes now estimated, the next natural question is: in an inference cluster, how many Prefill GPUs versus Decode GPUs should be provisioned? Too many P nodes waste hardware; too few starve the D nodes of work. Getting this ratio right is essential for cost-efficiency.

Without real token distribution data, the ratio can only be an engineering estimate, not a definitive answer. But we can start with a first-order formula:

P/D ≈ (Ī_eff) / (3.14 × Ō) × h

Where Ī_eff is the average input tokens missing the cache, Ō is the average generated tokens (including reasoning, tool-call JSON, final answer), 3.14 comes from the official output/input price ratio, and h is the TTFT/SLO headroom (recommended 1.1–1.2 under ZCube, higher under ROFT).

With context caching:

Ī_eff = Ī × [(1-H) + H × 0.26/1.4]

H is the cache hit ratio, 0.26/1.4 comes from the official cached input/input price ratio. When coding agents heavily reuse context, the P pool becomes noticeably lighter, and the ratio shifts from Prefill-heavy toward closer to 1:1 or even Decode-heavy.

Workload Profile	Assumed Average Tokens	Formula-Estimated P:D	Recommended Starting Config
Deep thinking / long output	16K in / 8K out	0.64:1	1:1, or 2:3
Typical coding agent	24K in / 6K out	1.27:1	3:2
Repo-heavy long context	32K in / 4K out	2.55:1	2:1–5:2
32K/6K with 40% cache hit	effective input ≈21.6K	1.15:1	4:3–3:2

Default recommendation: start with P:D ≈ 3:2. If production shows long output/reasoning tokens, lower to 1:1; for repo ingest, heavy long-context, short-output scenarios, raise to 2:1.

ZCube's 15% throughput improvement doesn't directly change the P:D formula, but it lets you reduce network congestion headroom and run closer to the theoretical ratio.

The D Pool Isn't Just About Compute — KV Memory Matters Too

Don't size the P/D ratio based solely on prefill/decode FLOPs. Another critical responsibility of Decode nodes: long-term retention of active sessions' KV cache.

D pool size should be the maximum of three values:

N_D = max(N_D_decode_compute, N_D_KV_memory, N_D_KV_ingress)

If your workload includes many 64K, 100K, 200K context sessions, even if token compute models suggest more P, the D pool still can't be too small. Otherwise D-side gets bottlenecked on KV cache memory — P nodes idle, D nodes have full VRAM, requests can't be admitted, TTFT jitters.

Engineering recommendations:

Short output, long prompt: can go 5:3, even close to 2:1
General coding: ~3:2
Long reasoning / multi-turn agent: 1:1 to 4:3
Ultra-long context with high concurrency: prioritize D memory, even at the cost of fewer P

5. ZCube Cell Scale Design

With the P:D ratio determined, the next step is deciding how many GPUs and switches to place in a single ZCube cell — the basic building block of the cluster.

ZCube formula from the original paper: total GPU count n, each switch connects k GPUs, total switches 2n/k, complete bipartite graph between the two groups; each switch has approximately k + n/k logical ports. Port utilization is optimal when k ≈ n/k ≈ √n.

Note: The port formula above is a simplified description for the ZCube(n,2) special case. Per the paper, ZCube(n,k)'s precise port allocation is asymmetric — first and last layer switches require 2n ports, while intermediate layers require 3n ports (when k>2). The "logical ports" column in the table below uses the simplified formula and may differ slightly from actual ZCube asymmetric port allocation.

With 8-GPU nodes, starting P:D ≈ 3:2:

ZCube Cell Scale	8-GPU Nodes	Balanced k	Logical Ports/Switch	Switches	Recommended P/D Node Config	Equivalent ROFT GPU Throughput
1024 GPU	128	32	64	64	80P / 48D	~1178
4096 GPU	512	64	128	128	320P / 192D	~4710
16,384 GPU	2048	128	256	256	1280P / 768D	~18,842

"Equivalent ROFT GPU throughput" calculated at +15%. Conversely, to achieve the same throughput as an ROFT cluster, ZCube needs approximately N_roft/1.15 GPUs.

The 16,384 Port Problem

The original paper states that 51.2T, 128×400G switches can support 16,384 400Gbps NICs. Per the port formula, 16,384 corresponds to 256 logical ports per switch. A more practical engineering interpretation is that 400G NICs are split into 2×200G logical ports, with switches calculated at 256×200G breakout. If you insist on every logical port being full 400G with only 128 logical ports per switch, the clean single-plane scale is 4096 GPUs; 16K requires multi-plane or higher-radix switches.

6. Intra-Cell P/D Placement: Even Distribution, No Physical Partitioning

Cell scale is set, but where GPUs physically sit — clustered together or spread out — directly affects ZCube's ability to evenly disperse traffic.

The point of ZCube is to disperse the dynamic, asymmetric KV cache transfers under PD disaggregation. The core intra-cell strategy: don't artificially create a high-traffic cut from a P zone to a D zone.

Wrong approach:

Rack/Zone A: all Prefill
Rack/Zone B: all Decode

This would compress all KV cache transfers onto a few A→B links, directly undermining ZCube's design intent.

Correct approach: Every local topology segment contains both P and D, and P/D density along each odd/even switch dimension is close to the global ratio.

Affinity Tile Partitioning

Using 1024 GPUs / 128 nodes as an example, general 3:2 configuration:

25 tiles, each tile = 3P + 2D
3 flex nodes distributed evenly across topology
Total: 75P + 50D + 3 flex

Tiles are not physically contiguous rows of machines — they're evenly hashed across ZCube's two switch dimensions. The goal is for every switch to see a P/D port count close to the global ratio.

Configuration	Tile Composition	Tile Count	Flex	Use Case
General 3:2	3P + 2D	25	3	General coding agent
P-heavy 5:3	5P + 3D	16	0	Long input, short output
Balanced 1:1	4P + 4D	16	0	Long reasoning, multi-turn agent

Flex Pool Elasticity

The 3 flex nodes dynamically switch roles based on real-time load:

Long prompt peak: convert to P
Long output / D cache peak: convert to D
On failure: assume the failed node's role

When to Change P/D Ratio vs. When to Change Pairing Strategy

Signals to add P: Prefill queue P95/P99 high, P GPU utilization high and D low, KV transfer not congested, D memory sufficient, TTFT mainly bottlenecked on prefill wait.

Signals to add D: Decode queue high, D KV memory near limit, active session admission failures, output token backlog high, P nodes frequently waiting for D lease.

Signals that you should change pairing/routing, NOT the ratio: Both P and D GPUs are underutilized, but KV transfer P99 is high, certain inter-switch edges are hot, certain D ingress is overloaded, TTFT has sporadic long tails. This means the problem is in pairing and routing, not quantity — optimize scheduling first, don't add machines first.

7. P/D Pairing: Affinity Groups + Dynamic Selection, Not Fixed Binding

With P and D counts and positions determined, there remains a key operational question: when an inference request arrives, which P handles the input and which D generates the output? This "pairing" decision directly impacts network congestion. A naive approach would permanently bind each P to a specific D — but as we'll see, that creates more problems than it solves.

Don't Use Fixed P-D Pairs

Fixed binding (P1 always paired with D1) is simple but has three problems: one slow D drags down its bound P; long and short prompts can't dynamically stagger; D-side KV memory easily fragments — locally full, globally empty.

A better approach is affinity tile + dynamic selection:

Tile-17:
  P: P17-0, P17-1, P17-2
  D: D17-0, D17-1

P candidate D priority:
  primary:   same tile D
  secondary: adjacent topology tile D
  tertiary:  any D in same cell
  last:      cross-cell overflow

New Sessions vs. Existing Sessions: Different Pairing Directions

New sessions: joint P/D selection. No historical KV exists. First estimate prompt tokens / output tokens / KV bytes, then jointly select a (p,d) pair:

score(p,d) = α·Qp + β·Qd + γ·Tkv(p,d) + δ·Hpath(p,d) + ε·Md − ζ·Acache

Select the (p,d) with the lowest score.

Existing sessions: D sticky, P reverse-selected. A multi-turn agent or coding session already has KV cache on a particular D — don't casually switch Ds, or you'd need to migrate historical KV at high cost. Existing sessions prioritize continuing on D_home, selecting a P that's topologically close to D_home for incremental prefill, transmitting only delta KV.

This is critical for coding agents — they're often multi-turn long-context, not one-shot prompts.

KV Transfer Credit: Lease First, Then Send KV

The most dangerous scenario in PD disaggregation: multiple Prefills complete simultaneously, the scheduler finds an idle D, and multiple Ps send GB-scale KV to the same D at once — D ingress, last hop, D NIC, D KV allocator all overload, TTFT P99 jitters.

Solution: D-side maintains KV ingress credits. Only when both D and path have credit can P start sending KV. Recommend each D concurrently receive 2–4 large KV streams; long prompts must reserve path credit in advance, short prompts can go best-effort.

Leveraging ZCube's Dual Paths for Load-Aware Routing

Within a ZCube cell, each node has two topology coordinates: P = (A_p, B_p), D = (A_d, B_d). P to D typically has two two-hop paths:

path 1: P → A_p → B_d → D
path 2: P → B_p → A_d → D

The scheduler doesn't just check if D is free — it also checks which path is less loaded. Fine-grained per-packet ECMP is not recommended (KV transfers are typically large; out-of-order delivery hurts RDMA/TCP efficiency). Instead, split by TP rank or route at the request level. For example, with an 8-GPU TP group: rank 0,2,4,6 take path1, rank 1,3,5,7 take path2.

TP Rank Consistency

For large model inference like GLM-5.1 where both P and D use TP=8, enforce Prefill TP rank i → Decode TP rank i. Don't do P rank i → D rank j — unless you explicitly plan a KV reshape, this adds an all-to-all or KV reordering that directly negates ZCube's network benefits.

8. RailFly: A Pragmatic Choice for Mixed Training+Inference Workloads

ZCube is powerful but requires rebuilding the entire physical network, and its full bipartite interconnect complexity grows quickly with scale. A natural question arises: is there a more pragmatic approach that supports PD disaggregated inference without requiring a complete network rebuild?

Design Motivation

Rail-Only (MIT CSAIL + Meta, 2023) is an extreme cost-saving topology: think of it as "removing all upper-tier (spine) switches and keeping only independent switches per rail." It's the cheapest option, but cross-rail communication relies entirely on software forwarding through GPUs — which makes it impractical for PD disaggregated inference. Its core insight: in LLM training, 99%+ of cross-node communication occurs between same-rank GPUs, and the Spine is idle most of the time — it can be removed. Training-optimal (same rank in 1 hop), extremely low cost (38–77% cheaper than Clos). But it's basically unusable for PD disaggregated inference: cross-rail GPU software forwarding has unstable latency, P99 degrades severely.

The problem: P→D KV Cache flow is not stable same-rail communication — it's dynamic, asymmetric, with high source-destination variation. Pure Rail-only relies more on NVLink/HB domain forwarding, easily shifting TTFT tail latency elsewhere.

RailFly's approach: starting from Rail-Only, gain cross-rail hardware forwarding capability with minimal changes.

Architecture Definition

K rails, each served by one switch. Same-rank GPUs connect to the same rail switch (Rail-Only rule unchanged).
The K switches are fully interconnected (28 links when K=8).
GPU port allocation unchanged — zero modification.

Core property: inter-group links are fixed at 28, not growing with cluster scale.

Scale and Oversubscription Ratio

Configuration	Ports per Rail	Total GPUs	Inter-Group Links
8× fixed 51.2T	121	968	28
8× modular Arista 7800R3	569	4,552	28
8× modular H3C S12500R	761	6,088	28

Compared to ZCube's scaling complexity:

GPUs	RailFly Inter-Group Links	ZCube Bipartite Links
1,000	28	15,876
4,000	28	250,000
8,000	28	1,000,000

Oversubscription ratio = downlink GPU bandwidth / uplink inter-group bandwidth. High oversubscription doesn't mean congestion — if a rail-aware scheduler achieves 90% PD pairing within rail, only 10% crosses rails, and 10:1 oversubscription is sufficient. The oversubscription ratio is not a pure network parameter — it's a joint optimization problem of the network and the scheduler.

With the same 128 switches, RailFly uses 1 port per GPU, 5.5% for interconnection, connecting ~15,488 GPUs; ZCube uses 2 ports per GPU, 50% for interconnection, capping at 4,096 GPUs. Under equal hardware budgets, RailFly connects ~3.8× more GPUs.

9. From Rail-Only to Hybrid ZRail / Sparse-ZCube

RailFly is a topology family that can be adapted to different depths of modification. By adding more inter-switch connections to the base Rail-Only design, the topology evolves — and where you stop on that spectrum depends on your workload requirements.

Adding switch direct connections to Rail-only gradually turns it into ZCube or Sparse-ZCube, not traditional Rail-only. A better approach is PD-aware Hybrid ZRail:

1. Intra-cell ZCube/full-bipartite: Within a 1024–4096 GPU cell, a complete bipartite graph ensures any P→D is 2 hops, paths are deterministic, and load is dispersed.

2. Inter-cell sparse expander/dragonfly-like direct connections: Beyond 4K, no need for global full-bipartite. Scheduling keeps P and D in the same cell whenever possible; cross-cell only for overflow.

3. Dual-rail QoS isolation: One rail handles traditional rail/local collective traffic; the other handles ZCube/expander traffic, with QoS isolation between KV transfer, control, and collective.

4. Topology-aware scheduling: Decode selection considers not just GPU idleness, but also P-switch-to-D-switch edge load, D pool queue depth, and last-hop incast.

5. Sparse direct-connect degree d determined by traffic: Port budget R = k + d; full ZCube is d = S. When PD traffic has locality, using d << S for an expander may be more cost-efficient.

Decision criterion: For maximum robustness, random P→D, multi-tenant mixed workloads, use ZCube/full-bipartite. When you can control P/D placement, cache locality is high, and business traces are stable, use Sparse-ZCube / Hybrid ZRail. Pure Rail-only is not recommended as the primary network for PD disaggregated inference.

10. RailFly's Pure-Inference Evolution: Rail-Fabric

All the architectures discussed so far treat KV transfers as traffic to be fairly routed alongside everything else. But PD inference KV transfers have special characteristics: predictable sizes, unidirectional P→D flow, and a D-side that's vulnerable to being overwhelmed by multiple simultaneous arrivals. If these characteristics can be exploited to design a network layer purpose-built for KV transfers, the result can outperform even ZCube.

ZCube is an excellent static topology. But PD inference isn't purely arbitrary traffic — it has four characteristics that a stronger architecture can exploit:

P→D KV flows are large, and their size can be estimated when prefill begins;
D is the state residency point — multi-turn sessions should generally stick to the same D or D group;
Warm prefix / repo cache has clear locality;
D-side is prone to ingress incast, not uniform all-to-all.

So a stronger architecture shouldn't just ask "how many hops between any two points" — it should ask:

Can large KV flows reserve congestion-free paths?
Is D-side ingress oversubscribed?
Can cache-hit requests avoid cross-cell KV migration?
Can the scheduler determine D before prefill starts and pre-provision the path?

Rail-Fabric is RailFly's evolution for pure-inference scenarios: retaining standard Ethernet switches as the base, adding a sparse expander foundation + KV Fast Lane + Cache/Home logical plane to achieve stronger P99 TTFT and D-side incast control than ZCube — without introducing OCS or dedicated hardware.

Three-Layer Capability

Base Fabric (Sparse-ZCube / Expander): Handles control plane, small KV, failure fallback, and general RDMA. Doesn't pursue a complete bipartite graph. Each switch uses k GPU downlinks + d electrical packet expander links + q dedicated KV fast lane links. (An expander is a "sparse but highly connected" interconnect structure: each switch only needs a few cross-group links to ensure low-hop reachability between any two points in the network — think of it as a small-world graph applied to data center switching.)

KV Fast Lane (Pure Ethernet QoS + Dedicated Links): Handles GB-scale P→D KV cache migration. No OCS — instead uses dedicated Ethernet links + priority queues + scheduler reservation to achieve congestion-free transfer for large KV flows. Each switch reserves 8–12 high-priority links exclusively for KV bursts, physically isolated or strictly QoS-isolated from the base fabric's expander links. The scheduler pre-reserves D ingress credits and link bandwidth quotas based on KV size before prefill begins.

Cache/Home Logical Plane: Keeps long sessions, warm prefixes, and shared repo cache within the same D island whenever possible. Warm sessions preferentially return to their original D island; only cold long prompts use the global KV fast lane.

Why Rail-Fabric Outperforms ZCube

Better P99 TTFT. ZCube reduces the probability of topology-induced congestion; Rail-Fabric further reduces queuing and contention for large KV flows. The scheduler knows approximate KV size before prefill begins and pre-reserves D buffer, D ingress, and link bandwidth quotas.

Better D-side incast handling. Rail-Fabric can design the D side asymmetrically — D-bearing switches have more inter-switch links, higher ingress credits, and more dedicated KV fast lane links. This better matches the real traffic characteristics of PD disaggregation.

Better scalability beyond 16K. ZCube's full bipartite graph requires 256 logical ports per switch at 16K. Rail-Fabric's expander degree d ≈ 8–16, KV fast lane q ≈ 8–24, approximately constant or slowly growing as scale increases. The tradeoff is dependence on the scheduler and traffic locality.

Better for cache-heavy coding agents. ZCube optimizes "transmit more smoothly"; Rail-Fabric also optimizes "transmit less, transmit closer, transmit only deltas." For multi-turn coding sessions, the same session stays on the same D group, the same repo cache stays on the same cache island, and Prefill reverse-selects a P close to D/cache.

1024 GPU Cell Deployment

Using the base configuration of 128 8-GPU nodes, 75P/50D/3 flex, 25 3P+2D affinity tiles:

64 Ethernet switches
Each switch:
  32 GPU/NIC downlinks
  8 expander inter-switch links (base fabric)
  8–12 KV fast lane dedicated links (QoS isolated)

ZCube comparison:  32 downlink + 32 inter-switch = 64
Rail-Fabric:        32 downlink + 8 expander + 8–12 KV fast lane = 48–52

Request scheduling flow:

Warm session: D-first, prefer D_home, select a P close to and idle near D_home, transmit only delta KV
Cold long prompt: P/D joint select, reserve D ingress credit, reserve KV fast lane bandwidth quota, P pushes KV to D in write mode
Short prompt: Use base expander fabric, don't occupy KV fast lane

Core principle: large KV takes reserved KV fast lanes, small KV uses base expander, warm KV avoids long-distance transfer, D ingress must have credit.

Simplified Variant: D-centric Asymmetric Expander

If you don't want the complexity of KV fast lane + QoS isolation, a simpler pure Ethernet approach is possible:

Eliminate spine; inter-switch connections form a high-quality expander rather than a full bipartite graph
D-bearing switches have higher inter-switch degree than P-bearing switches
D nodes have more uplinks / higher credits
P/D placement is strictly cache-aware and tile-aware

Normal P switch:  32 downlink + 8 expander links
D-heavy switch:  24–28 downlink + 12–16 expander links

Fewer static inter-switch links than ZCube, better cross-rail P→D support than Rail-only, better Decode ingress accommodation than a plain expander. But if P→D is completely random with low cache locality, stability is weaker than ZCube. Suitable for cost-sensitive scenarios where the scheduler is controllable.

11. Positioning of Three Pure Ethernet Architectures

Architecture	Best For	Advantage over ZCube	Main Risk
ZCube	General-purpose, random P→D	Simple, stable, strong static topology; fault tolerance advantage (single ToR failure degrades only 2.8%); discovered via ATOP automated search	Doesn't exploit cache locality; large KV and small traffic share links
RailFly	Mixed training+inference, 10K+ scale	Inter-group complexity doesn't grow; 3.8× GPU capacity under equal hardware	Cross-rail bandwidth depends on scheduler pairing quality
Rail-Fabric (RailFly pure-inference evolution)	Long context, GB-scale KV, multi-turn agent	Better P99 TTFT, D incast, 16K+ scalability	Complex control plane; QoS isolation requires switch support

12. Scenario Recommendations and Deployment Guide

The following recommendations are organized from simplest to most powerful — three tiers corresponding to different starting points and goals.

Scenario	Recommended Architecture	Reason
Pure PD disaggregated inference (thousand-GPU scale)	ZCube cell	Best load balancing, +15% throughput
Pure training	Rail-Only	Rail locality is key
Mixed training + inference	RailFly	Balances both, simple scaling
10K+ GPU mixed workloads	RailFly + Hybrid ZRail	Inter-group complexity doesn't grow with scale
Long KV + multi-turn agent + controllable scheduling	Rail-Fabric	Large KV uses KV fast lane, cache-aware minimizes KV migration

Three-Tier Deployment Recommendation

Tier 1, thousand-GPU production validation: 1024 GPUs, 128 8-card nodes, ZCube cell, 75P/50D/3 flex, 25 affinity tiles. The most stable starting point.

Tier 2, scaled production: 4096 GPUs, 512 8-card nodes, single cell or 4×1024-GPU cells, default 320P/192D. For 16K GPUs, recommend splitting into 4×4096 cells.

Tier 3, next-generation dedicated inference cluster: Rail-Fabric = Sparse expander + Ethernet KV fast lane + cache-aware D island + D-centric endpoint oversubscription. All based on standard Ethernet switches, suited for long-context coding agent GB-scale KV scenarios.

One-Line Summary

ZCube is "better static Ethernet"; a pure-Ethernet PD inference network stronger than ZCube should be "sparse expander base + KV fast lane dedicated links + cache-aware scheduling" — all on standard Ethernet switches, no OCS or scale-up hardware. RailFly finds the pragmatic sweet spot for mixed training+inference workloads in this spectrum: fixed 28 inter-group links, cross-rail hardware forwarding, 3.8× GPU capacity under equal hardware budgets.

References

Yan, Z., et al. From ATOP to ZCube. ACM SIGCOMM 2025.
Ghobadi, M., et al. Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters. arXiv:2307.12169, 2023.
Araujo, J., et al. Resilient AI Supercomputer Networking using MRC and SRv6. OpenAI, 2026.
Zhipu AI Tech Blog. Next-Generation Large Model Inference Network Architecture: How ZCube Effectively Resolves Network Bottlenecks. z.ai/blog/zcube (Zhipu AI is a deployment validation user of ZCube, having validated the ZCube topology on their GLM-5.1 thousand-GPU inference cluster)
Wang, Z., et al. Opus: An Elastic, Reconfigurable Optical Interconnect for Compute Pods. NSDI 2026.
Together AI. Chunked Prefill-Decode Disaggregation for Long-Context Serving. 2026.
FlowKV: Global KV Cache Management for Disaggregated LLM Serving. 2026.

Starting from ZCube: Network Topology Design for PD Disaggregated Inference

Starting from ZCube: Network Topology Design for PD Disaggregated Inference

1. Why Network Topology Suddenly Matters

ROFT's Design Assumptions Are Breaking Down

2. ZCube's Core Value and Limitations

How ZCube Was Born: From Automated Search to Asymmetric Topology

Architecture Design

Production Validation Data

Four Core Advantages

Four Practical Limitations

3. GLM-5.1's Inference Service Unit and KV Cache Bottleneck

KV Cache Per-Token Estimation

4. PD Disaggregation P:D Ratio Estimation

The D Pool Isn't Just About Compute — KV Memory Matters Too

5. ZCube Cell Scale Design

The 16,384 Port Problem

6. Intra-Cell P/D Placement: Even Distribution, No Physical Partitioning

Affinity Tile Partitioning

Flex Pool Elasticity

When to Change P/D Ratio vs. When to Change Pairing Strategy

7. P/D Pairing: Affinity Groups + Dynamic Selection, Not Fixed Binding

Don't Use Fixed P-D Pairs

New Sessions vs. Existing Sessions: Different Pairing Directions

KV Transfer Credit: Lease First, Then Send KV

Leveraging ZCube's Dual Paths for Load-Aware Routing

TP Rank Consistency

8. RailFly: A Pragmatic Choice for Mixed Training+Inference Workloads

Design Motivation

Architecture Definition

Scale and Oversubscription Ratio

9. From Rail-Only to Hybrid ZRail / Sparse-ZCube

10. RailFly's Pure-Inference Evolution: Rail-Fabric

Three-Layer Capability

Why Rail-Fabric Outperforms ZCube

1024 GPU Cell Deployment

Simplified Variant: D-centric Asymmetric Expander

11. Positioning of Three Pure Ethernet Architectures

12. Scenario Recommendations and Deployment Guide

Three-Tier Deployment Recommendation

One-Line Summary

References

Related Essays