← Thinking Thinking

RailFly: Network Topology Design for Prefill-Decode Disaggregated Inference

Using ZCube as a baseline, we analyze the core value and limitations of network topologies for PD disaggregated inference, derive P:D ratios and placement…

2026-05-25Thinking35 min read

As inference replaces training as the primary battlefield for AI infrastructure, GPU cluster network topology needs to be rethought from scratch.

1. Why Network Topology Suddenly Matters

As large model clusters evolve from thousands to tens of thousands of GPUs, raw GPU compute is no longer the sole bottleneck. The network links connecting that compute now directly determine a cluster's effective throughput.

Two landmark events occurred almost simultaneously:

  • May 5, 2026: OpenAI, together with NVIDIA, AMD, Intel, Microsoft, and Broadcom, released the MRC (Multipath Reliable Connection) protocol — solving the network communication bottleneck for ultra-large-scale GPU clusters at the transport protocol layer.
  • May 21, 2026: Zhipu AI, together with Tsinghua University and Yuxun Network, announced the ZCube networking architecture deployed in the GLM-5.1 production cluster — solving the same problem at the topology architecture layer.

These represent two complementary technical paths: protocol-layer optimization (MRC) and architecture-layer restructuring (ZCube).

This article uses ZCube as a baseline, analyzes its core value and limitations, derives P:D ratios and placement strategies for PD disaggregated inference, and then proposes RailFly — a pragmatic topology family covering mixed training+inference workloads, and its pure-inference evolution PDX-Fabric.

ROFT/Clos Traditional Topology
ROFT/Clos Traditional Topology

ROFT's Design Assumptions Are Breaking Down

Current mainstream GPU cluster networking is based on ROFT (Rail-Optimized Fat-Tree): GPUs are grouped by rank into rails, with same-rank GPUs connecting to the same switch. DP/PP communication during training completes in 1 hop within a rail. The design assumption is that same-rank GPUs communicate the most, so traffic naturally aligns with rails.

Prefill-Decode disaggregated inference breaks this assumption. KV Cache is transmitted cross-node from Prefill nodes to Decode nodes, and this traffic exhibits three characteristics:

  1. Strong source-destination asymmetry: The KV Cache transfer load across different GPUs can differ by orders of magnitude.
  2. Dynamic variation: Which P nodes pair with which D nodes is determined in real-time by the scheduler and is not predictable.
  3. Burstiness: Long-context requests generate large KV Cache transfers with high instantaneous bandwidth demands.

ROFT's rail mapping fails under this traffic pattern — traffic gets concentrated onto a few switches and links, triggering PFC backpressure, creating a structural problem of "abundant total bandwidth, frequent local congestion."

2. ZCube's Core Value and Limitations

ZCube Bipartite Graph Topology
ZCube Bipartite Graph Topology

Architecture Design

ZCube completely breaks away from Clos hierarchical thinking, restructuring the network around fully flat bipartite graph interconnection:

  1. Eliminate the Spine layer. All switches are divided into odd-numbered and even-numbered groups.
  2. Each GPU's Port 1 connects to one odd switch, Port 2 connects to one even switch.
  3. A complete bipartite graph interconnection is formed between the odd and even groups.

All GPUs in the network are reachable within 2 hops. The routing goal is to have GPU pairs traverse unique/deterministic preferred paths, reducing the local hotspots and PFC backpressure caused by ECMP or static rail mapping.

Production Validation Data

The thousand-GPU production cluster for GLM-5.1 coding inference was migrated from ROFT to ZCube, with GPU hardware, software stack, and application unchanged:

  • Switch and optical module CapEx reduced by ~33%
  • Average GPU inference throughput +15%
  • TTFT P99 reduced by 40.6%

The original paper also conducted a 32-GPU bandwidth ablation: increasing NIC available bandwidth from 100Gbps to 200Gbps improved throughput by ~19% and reduced TTFT by ~22%. This shows the bottleneck isn't purely GPU compute — it's KV Cache transfer and network hotspots under PD disaggregation that affect effective throughput.

Network cost-performance rough calculation: Network hardware cost becomes 0.67×, throughput becomes 1.15×, network hardware/throughput cost ≈ 0.67/1.15 ≈ 0.58, meaning network-layer cost-performance improves by ~70%.

Three Core Advantages

First, it disperses PD traffic's source/destination/scale asymmetry. The bipartite graph + dual-port single-rail/multi-rail access distributes dynamic P→D KV Cache traffic more evenly than Clos/ROFT.

Second, deterministic routing eliminates structural congestion. The topology itself doesn't assume any specific traffic pattern, eliminating topology-induced congestion hotspots at the root.

Third, gains come from reducing structural congestion, not adding GPUs. "Effective throughput per unit of network CapEx" improves significantly.

Four Practical Limitations

1. Physical topology restructuring, not a software patch. The original paper explicitly states that Clos-era cabling, IP addressing, routing policies, and switch configurations cannot be reused directly — they require redesign and automated validation.

2. It solves "avoidable congestion," not last-hop incast. When multiple Prefill nodes simultaneously push KV to the same Decode group, you still need scheduling, congestion control, and traffic shaping.

3. Full bipartite graph imposes high requirements on cabling and port counts. At larger scales, it increasingly depends on high-radix switches, breakout, automated cabling validation, and topology-aware scheduling.

4. Training scenarios lose rank locality. Same-rank GPUs no longer connect to the same switch, so DP/PP communication during training goes from 1 hop to 2 hops.

3. GLM-5.1's Inference Service Unit and KV Cache Bottleneck

Before discussing topology choices, we need to understand where the bottleneck actually is.

GLM-5.1 official specs: 744B-A40B, available in BF16 and FP8 versions; official vLLM/SGLang examples use --tensor-parallel-size 8, so an 8-GPU node/TP group is the minimal complete inference unit in practice.

Official parameters: maximum context 200K, maximum output 128K; input $1.4/MTok, output $4.4/MTok, output/input price ratio ~3.14. Price doesn't equal true cost, but it's a usable proxy: the serving cost/resource pressure per generated token is roughly 3× that per input token.

KV Cache Per-Token Estimation

Using the public config, a rough estimate for GLM-5.1's MLA/DSA-style KV latent cache stored in BF16:

78 × (512 + 64) × 2 ≈ 89,856 Bytes/token

78 is the number of layers, 512 is kv_lora_rank, 64 is qk_rope_head_dim, 2 is BF16 bytes. If actual KV Cache uses FP8 quantization, this roughly halves.

Prompt Length P→D Total KV/Request Per-Card Share at TP=8
32K ~2.94 GB ~370 MB
128K ~11.5 GB ~1.4 GB
200K ~18.2 GB ~2.3 GB

This is the fundamental reason topology optimizations can affect TTFT P99: each inference request generates hundreds of MB to several GB of cross-node transfers, and the network topology determines whether these transfers are evenly distributed or concentrated on a few hotspot links.

4. PD Disaggregation P:D Ratio Estimation

Without real token distribution data, the ratio can only be an engineering estimate, not a definitive answer. But we can start with a first-order formula:

P/D ≈ (Ī_eff) / (3.14 × Ō) × h

Where Ī_eff is the average input tokens missing the cache, Ō is the average generated tokens (including reasoning, tool-call JSON, final answer), 3.14 comes from the official output/input price ratio, and h is the TTFT/SLO headroom (recommended 1.1–1.2 under ZCube, higher under ROFT).

With context caching:

Ī_eff = Ī × [(1-H) + H × 0.26/1.4]

H is the cache hit ratio, 0.26/1.4 comes from the official cached input/input price ratio. When coding agents heavily reuse context, the P pool becomes noticeably lighter, and the ratio shifts from Prefill-heavy toward closer to 1:1 or even Decode-heavy.

Workload Profile Assumed Average Tokens Formula-Estimated P:D Recommended Starting Config
Deep thinking / long output 16K in / 8K out 0.64:1 1:1, or 2:3
Typical coding agent 24K in / 6K out 1.27:1 3:2
Repo-heavy long context 32K in / 4K out 2.55:1 2:1–5:2
32K/6K with 40% cache hit effective input ≈21.6K 1.15:1 4:3–3:2

Default recommendation: start with P:D ≈ 3:2. If production shows long output/reasoning tokens, lower to 1:1; for repo ingest, heavy long-context, short-output scenarios, raise to 2:1.

ZCube's 15% throughput improvement doesn't directly change the P:D formula, but it lets you reduce network congestion headroom and run closer to the theoretical ratio.

The D Pool Isn't Just About Compute — KV Memory Matters Too

Don't size the P/D ratio based solely on prefill/decode FLOPs. Another critical responsibility of Decode nodes: long-term retention of active sessions' KV cache.

D pool size should be the maximum of three values:

N_D = max(N_D_decode_compute, N_D_KV_memory, N_D_KV_ingress)

If your workload includes many 64K, 100K, 200K context sessions, even if token compute models suggest more P, the D pool still can't be too small. Otherwise D-side gets bottlenecked on KV cache memory — P nodes idle, D nodes have full VRAM, requests can't be admitted, TTFT jitters.

Engineering recommendations:

  • Short output, long prompt: can go 5:3, even close to 2:1
  • General coding: ~3:2
  • Long reasoning / multi-turn agent: 1:1 to 4:3
  • Ultra-long context with high concurrency: prioritize D memory, even at the cost of fewer P

5. ZCube Cell Scale Design

ZCube formula from the original paper: total GPU count n, each switch connects k GPUs, total switches 2n/k, complete bipartite graph between the two groups; each switch has approximately k + n/k logical ports. Port utilization is optimal when k ≈ n/k ≈ √n.

With 8-GPU nodes, starting P:D ≈ 3:2:

ZCube Cell Scale 8-GPU Nodes Balanced k Logical Ports/Switch Switches Recommended P/D Node Config Equivalent ROFT GPU Throughput
1024 GPU 128 32 64 64 80P / 48D ~1178
4096 GPU 512 64 128 128 320P / 192D ~4710
16,384 GPU 2048 128 256 256 1280P / 768D ~18,842

"Equivalent ROFT GPU throughput" calculated at +15%. Conversely, to achieve the same throughput as an ROFT cluster, ZCube needs approximately N_roft/1.15 GPUs.

The 16,384 Port Problem

The original paper states that 51.2T, 128×400G switches can support 16,384 400Gbps NICs. Per the port formula, 16,384 corresponds to 256 logical ports per switch. A more practical engineering interpretation is that 400G NICs are split into 2×200G logical ports, with switches calculated at 256×200G breakout. If you insist on every logical port being full 400G with only 128 logical ports per switch, the clean single-plane scale is 4096 GPUs; 16K requires multi-plane or higher-radix switches.

6. Intra-Cell P/D Placement: Even Distribution, No Physical Partitioning

The point of ZCube is to disperse the dynamic, asymmetric KV cache transfers under PD disaggregation. The core intra-cell strategy: don't artificially create a high-traffic cut from a P zone to a D zone.

Wrong approach:

Rack/Zone A: all Prefill
Rack/Zone B: all Decode

This would compress all KV cache transfers onto a few A→B links, directly undermining ZCube's design intent.

Correct approach: Every local topology segment contains both P and D, and P/D density along each odd/even switch dimension is close to the global ratio.

Affinity Tile Partitioning

Using 1024 GPUs / 128 nodes as an example, general 3:2 configuration:

  • 25 tiles, each tile = 3P + 2D
  • 3 flex nodes distributed evenly across topology
  • Total: 75P + 50D + 3 flex

Tiles are not physically contiguous rows of machines — they're evenly hashed across ZCube's two switch dimensions. The goal is for every switch to see a P/D port count close to the global ratio.

Configuration Tile Composition Tile Count Flex Use Case
General 3:2 3P + 2D 25 3 General coding agent
P-heavy 5:3 5P + 3D 16 0 Long input, short output
Balanced 1:1 4P + 4D 16 0 Long reasoning, multi-turn agent

Flex Pool Elasticity

The 3 flex nodes dynamically switch roles based on real-time load:

  • Long prompt peak: convert to P
  • Long output / D cache peak: convert to D
  • On failure: assume the failed node's role

When to Change P/D Ratio vs. When to Change Pairing Strategy

Signals to add P: Prefill queue P95/P99 high, P GPU utilization high and D low, KV transfer not congested, D memory sufficient, TTFT mainly bottlenecked on prefill wait.

Signals to add D: Decode queue high, D KV memory near limit, active session admission failures, output token backlog high, P nodes frequently waiting for D lease.

Signals that you should change pairing/routing, NOT the ratio: Both P and D GPUs are underutilized, but KV transfer P99 is high, certain inter-switch edges are hot, certain D ingress is overloaded, TTFT has sporadic long tails. This means the problem is in pairing and routing, not quantity — optimize scheduling first, don't add machines first.

7. P/D Pairing: Affinity Groups + Dynamic Selection, Not Fixed Binding

Don't Use Fixed P-D Pairs

Fixed binding (P1 always paired with D1) is simple but has three problems: one slow D drags down its bound P; long and short prompts can't dynamically stagger; D-side KV memory easily fragments — locally full, globally empty.

A better approach is affinity tile + dynamic selection:

Tile-17:
  P: P17-0, P17-1, P17-2
  D: D17-0, D17-1

P candidate D priority:
  primary:   same tile D
  secondary: adjacent topology tile D
  tertiary:  any D in same cell
  last:      cross-cell overflow

New Sessions vs. Existing Sessions: Different Pairing Directions

New sessions: joint P/D selection. No historical KV exists. First estimate prompt tokens / output tokens / KV bytes, then jointly select a (p,d) pair:

score(p,d) = α·Qp + β·Qd + γ·Tkv(p,d) + δ·Hpath(p,d) + ε·Md − ζ·Acache

Select the (p,d) with the lowest score.

Existing sessions: D sticky, P reverse-selected. A multi-turn agent or coding session already has KV cache on a particular D — don't casually switch Ds, or you'd need to migrate historical KV at high cost. Existing sessions prioritize continuing on D_home, selecting a P that's topologically close to D_home for incremental prefill, transmitting only delta KV.

This is critical for coding agents — they're often multi-turn long-context, not one-shot prompts.

KV Transfer Credit: Lease First, Then Send KV

The most dangerous scenario in PD disaggregation: multiple Prefills complete simultaneously, the scheduler finds an idle D, and multiple Ps send GB-scale KV to the same D at once — D ingress, last hop, D NIC, D KV allocator all overload, TTFT P99 jitters.

Solution: D-side maintains KV ingress credits. Only when both D and path have credit can P start sending KV. Recommend each D concurrently receive 2–4 large KV streams; long prompts must reserve path credit in advance, short prompts can go best-effort.

Leveraging ZCube's Dual Paths for Load-Aware Routing

Within a ZCube cell, each node has two topology coordinates: P = (A_p, B_p), D = (A_d, B_d). P to D typically has two two-hop paths:

path 1: P → A_p → B_d → D
path 2: P → B_p → A_d → D

The scheduler doesn't just check if D is free — it also checks which path is less loaded. Fine-grained per-packet ECMP is not recommended (KV transfers are typically large; out-of-order delivery hurts RDMA/TCP efficiency). Instead, split by TP rank or route at the request level. For example, with an 8-GPU TP group: rank 0,2,4,6 take path1, rank 1,3,5,7 take path2.

TP Rank Consistency

For large model inference like GLM-5.1 where both P and D use TP=8, enforce Prefill TP rank i → Decode TP rank i. Don't do P rank i → D rank j — unless you explicitly plan a KV reshape, this adds an all-to-all or KV reordering that directly negates ZCube's network benefits.

8. RailFly: A Pragmatic Choice for Mixed Training+Inference Workloads

Rail-Only Topology
Rail-Only Topology

Design Motivation

Rail-Only (MIT CSAIL + Meta, 2023) has a core insight: in LLM training, 99%+ of cross-node communication occurs between same-rank GPUs, and the Spine is idle most of the time — it can be removed. Training-optimal (same rank in 1 hop), extremely low cost (38–77% cheaper than Clos). But it's basically unusable for PD disaggregated inference: cross-rail GPU software forwarding has unstable latency, P99 degrades severely.

The problem: P→D KV Cache flow is not stable same-rail communication — it's dynamic, asymmetric, with high source-destination variation. Pure Rail-only relies more on NVLink/HB domain forwarding, easily shifting TTFT tail latency elsewhere.

RailFly's approach: starting from Rail-Only, gain cross-rail hardware forwarding capability with minimal changes.

RailFly Topology
RailFly Topology

Architecture Definition

  1. K rails, each served by one switch. Same-rank GPUs connect to the same rail switch (Rail-Only rule unchanged).
  2. The K switches are fully interconnected (28 links when K=8).
  3. GPU port allocation unchanged — zero modification.

Core property: inter-group links are fixed at 28, not growing with cluster scale.

Scale and Oversubscription Ratio

Configuration Ports per Rail Total GPUs Inter-Group Links
8× fixed 51.2T 121 968 28
8× modular Arista 7800R3 569 4,552 28
8× modular H3C S12500R 761 6,088 28

Compared to ZCube's scaling complexity:

GPUs RailFly Inter-Group Links ZCube Bipartite Links
1,000 28 15,876
4,000 28 250,000
8,000 28 1,000,000

Oversubscription ratio = downlink GPU bandwidth / uplink inter-group bandwidth. High oversubscription doesn't mean congestion — if a rail-aware scheduler achieves 90% PD pairing within rail, only 10% crosses rails, and 10:1 oversubscription is sufficient. The oversubscription ratio is not a pure network parameter — it's a joint optimization problem of the network and the scheduler.

With the same 128 switches, RailFly uses 1 port per GPU, 5.5% for interconnection, connecting ~15,488 GPUs; ZCube uses 2 ports per GPU, 50% for interconnection, capping at 4,096 GPUs. Under equal hardware budgets, RailFly connects ~3.8× more GPUs.

Topology Comparison
Topology Comparison

9. From Rail-Only to Hybrid ZRail / Sparse-ZCube

Adding switch direct connections to Rail-only gradually turns it into ZCube or Sparse-ZCube, not traditional Rail-only. A better approach is PD-aware Hybrid ZRail:

1. Intra-cell ZCube/full-bipartite: Within a 1024–4096 GPU cell, a complete bipartite graph ensures any P→D is 2 hops, paths are deterministic, and load is dispersed.

2. Inter-cell sparse expander/dragonfly-like direct connections: Beyond 4K, no need for global full-bipartite. Scheduling keeps P and D in the same cell whenever possible; cross-cell only for overflow.

3. Dual-rail QoS isolation: One rail handles traditional rail/local collective traffic; the other handles ZCube/expander traffic, with QoS isolation between KV transfer, control, and collective.

4. Topology-aware scheduling: Decode selection considers not just GPU idleness, but also P-switch-to-D-switch edge load, D pool queue depth, and last-hop incast.

5. Sparse direct-connect degree d determined by traffic: Port budget R = k + d; full ZCube is d = S. When PD traffic has locality, using d << S for an expander may be more cost-efficient.

Decision criterion: For maximum robustness, random P→D, multi-tenant mixed workloads, use ZCube/full-bipartite. When you can control P/D placement, cache locality is high, and business traces are stable, use Sparse-ZCube / Hybrid ZRail. Pure Rail-only is not recommended as the primary network for PD disaggregated inference.

10. RailFly's Pure-Inference Evolution: PDX-Fabric

ZCube is an excellent static topology. But PD inference isn't purely arbitrary traffic — it has four characteristics that a stronger architecture can exploit:

  1. P→D KV flows are large, and their size can be estimated when prefill begins;
  2. D is the state residency point — multi-turn sessions should generally stick to the same D or D group;
  3. Warm prefix / repo cache has clear locality;
  4. D-side is prone to ingress incast, not uniform all-to-all.

So a stronger architecture shouldn't just ask "how many hops between any two points" — it should ask:

  • Can large KV flows reserve congestion-free paths?
  • Is D-side ingress oversubscribed?
  • Can cache-hit requests avoid cross-cell KV migration?
  • Can the scheduler determine D before prefill starts and pre-provision the path?

PDX-Fabric is RailFly's evolution for pure-inference scenarios: retaining standard Ethernet switches as the base, adding a sparse expander foundation + KV Fast Lane + Cache/Home logical plane to achieve stronger P99 TTFT and D-side incast control than ZCube — without introducing OCS or dedicated hardware.

Three-Layer Capability

 ┌─────────────────────────────┐
 │   Global PD Scheduler       │
 │   P/D load + KV size + cache│
 └──────────────┬──────────────┘
                │
 ┌──────────────┼──────────────────────┐
 │              │                      │
┌───────▼────────┐ ┌────────▼────────┐ ┌────────▼────────┐
│ Base Fabric    │ │ KV Fast Lane    │ │ Cache/Home Plane│
 │ sparse expander│ │ priority queues │ │ logical policy  │
 │ control/small  │ │ scheduled KV    │ │ session locality│
 └───────┬────────┘ └────────┬────────┘ └────────┬────────┘
        │                   │                   │
     P tiles / D tiles   P leaf ↔ D leaf    D island / cache

Base Fabric (Sparse-ZCube / Expander): Handles control plane, small KV, failure fallback, and general RDMA. Doesn't pursue a complete bipartite graph. Each switch uses k GPU downlinks + d electrical packet expander links + q dedicated KV fast lane links.

KV Fast Lane (Pure Ethernet QoS + Dedicated Links): Handles GB-scale P→D KV cache migration. No OCS — instead uses dedicated Ethernet links + priority queues + scheduler reservation to achieve congestion-free transfer for large KV flows. Each switch reserves 8–12 high-priority links exclusively for KV bursts, physically isolated or strictly QoS-isolated from the base fabric's expander links. The scheduler pre-reserves D ingress credits and link bandwidth quotas based on KV size before prefill begins.

Cache/Home Logical Plane: Keeps long sessions, warm prefixes, and shared repo cache within the same D island whenever possible. Warm sessions preferentially return to their original D island; only cold long prompts use the global KV fast lane.

Why PDX-Fabric Outperforms ZCube

Better P99 TTFT. ZCube reduces the probability of topology-induced congestion; PDX-Fabric further reduces queuing and contention for large KV flows. The scheduler knows approximate KV size before prefill begins and pre-reserves D buffer, D ingress, and link bandwidth quotas.

Better D-side incast handling. PDX-Fabric can design the D side asymmetrically — D-bearing switches have more inter-switch links, higher ingress credits, and more dedicated KV fast lane links. This better matches the real traffic characteristics of PD disaggregation.

Better scalability beyond 16K. ZCube's full bipartite graph requires 256 logical ports per switch at 16K. PDX-Fabric's expander degree d ≈ 8–16, KV fast lane q ≈ 8–24, approximately constant or slowly growing as scale increases. The tradeoff is dependence on the scheduler and traffic locality.

Better for cache-heavy coding agents. ZCube optimizes "transmit more smoothly"; PDX-Fabric also optimizes "transmit less, transmit closer, transmit only deltas." For multi-turn coding sessions, the same session stays on the same D group, the same repo cache stays on the same cache island, and Prefill reverse-selects a P close to D/cache.

1024 GPU Cell Deployment

Using the base configuration of 128 8-GPU nodes, 75P/50D/3 flex, 25 3P+2D affinity tiles:

64 Ethernet switches
Each switch:
  32 GPU/NIC downlinks
  8 expander inter-switch links (base fabric)
  8–12 KV fast lane dedicated links (QoS isolated)

ZCube comparison:  32 downlink + 32 inter-switch = 64
PDX-Fabric:        32 downlink + 8 expander + 8–12 KV fast lane = 48–52

Request scheduling flow:

  1. Warm session: D-first, prefer D_home, select a P close to and idle near D_home, transmit only delta KV
  2. Cold long prompt: P/D joint select, reserve D ingress credit, reserve KV fast lane bandwidth quota, P pushes KV to D in write mode
  3. Short prompt: Use base expander fabric, don't occupy KV fast lane

Core principle: large KV takes reserved KV fast lanes, small KV uses base expander, warm KV avoids long-distance transfer, D ingress must have credit.

Simplified Variant: D-centric Asymmetric Expander

If you don't want the complexity of KV fast lane + QoS isolation, a simpler pure Ethernet approach is possible:

  1. Eliminate spine; inter-switch connections form a high-quality expander rather than a full bipartite graph
  2. D-bearing switches have higher inter-switch degree than P-bearing switches
  3. D nodes have more uplinks / higher credits
  4. P/D placement is strictly cache-aware and tile-aware
Normal P switch:  32 downlink + 8 expander links
D-heavy switch:  24–28 downlink + 12–16 expander links

Fewer static inter-switch links than ZCube, better cross-rail P→D support than Rail-only, better Decode ingress accommodation than a plain expander. But if P→D is completely random with low cache locality, stability is weaker than ZCube. Suitable for cost-sensitive scenarios where the scheduler is controllable.

11. Positioning of Three Pure Ethernet Architectures

Architecture Best For Advantage over ZCube Main Risk
ZCube General-purpose, random P→D Simple, stable, strong static topology Doesn't exploit cache locality; large KV and small traffic share links
RailFly Mixed training+inference, 10K+ scale Inter-group complexity doesn't grow; 3.8× GPU capacity under equal hardware Cross-rail bandwidth depends on scheduler pairing quality
PDX-Fabric (RailFly pure-inference evolution) Long context, GB-scale KV, multi-turn agent Better P99 TTFT, D incast, 16K+ scalability Complex control plane; QoS isolation requires switch support

12. Scenario Recommendations and Deployment Guide

Scenario Recommended Architecture Reason
Pure PD disaggregated inference (thousand-GPU scale) ZCube cell Best load balancing, +15% throughput
Pure training Rail-Only Rail locality is key
Mixed training + inference RailFly Balances both, simple scaling
10K+ GPU mixed workloads RailFly + Hybrid ZRail Inter-group complexity doesn't grow with scale
Long KV + multi-turn agent + controllable scheduling PDX-Fabric Large KV uses KV fast lane, cache-aware minimizes KV migration

Three-Tier Deployment Recommendation

Tier 1, thousand-GPU production validation: 1024 GPUs, 128 8-card nodes, ZCube cell, 75P/50D/3 flex, 25 affinity tiles. The most stable starting point.

Tier 2, scaled production: 4096 GPUs, 512 8-card nodes, single cell or 4×1024-GPU cells, default 320P/192D. For 16K GPUs, recommend splitting into 4×4096 cells.

Tier 3, next-generation dedicated inference cluster: PDX-Fabric = Sparse expander + Ethernet KV fast lane + cache-aware D island + D-centric endpoint oversubscription. All based on standard Ethernet switches, suited for long-context coding agent GB-scale KV scenarios.

One-Line Summary

ZCube is "better static Ethernet"; a pure-Ethernet PD inference network stronger than ZCube should be "sparse expander base + KV fast lane dedicated links + cache-aware scheduling" — all on standard Ethernet switches, no OCS or scale-up hardware. RailFly finds the pragmatic sweet spot for mixed training+inference workloads in this spectrum: fixed 28 inter-group links, cross-rail hardware forwarding, 3.8× GPU capacity under equal hardware budgets.


References

  1. Yan, Z., et al. From ATOP to ZCube. ACM SIGCOMM 2025.
  2. Ghobadi, M., et al. Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters. arXiv:2307.12169, 2023.
  3. Araujo, J., et al. Resilient AI Supercomputer Networking using MRC and SRv6. OpenAI, 2026.
  4. Zhipu AI Tech Blog. Next-Generation Large Model Inference Network Architecture: How ZCube Effectively Resolves Network Bottlenecks. z.ai/blog/zcube
  5. Wang, Z., et al. Opus: An Elastic, Reconfigurable Optical Interconnect for Compute Pods. NSDI 2026.
  6. Together AI. Chunked Prefill-Decode Disaggregation for Long-Context Serving. 2026.
  7. FlowKV: Global KV Cache Management for Disaggregated LLM Serving. 2026.