Targeting NVIDIA B300 and Huawei Ascend 950 Supernode, this article answers three questions: how many P units to how many D units? How many GPUs/NPUs per P/D unit? How should internal parallelism be configured?
This article not only presents conclusions but fully shows how every key number is derived—readers can follow the derivation chain and verify for themselves.
Conclusions First, Derivations Follow
The core logic of PD disaggregated deployment has only three layers:
| Layer | Question | One-line Answer |
|---|---|---|
| 1 | How many P to how many D | It's not about picking a fixed ratio upfront. Instead, define minimum deployment units (P-unit / D-unit), then derive how many you need from business traffic (request arrival rate, context length, output length, cache hit rate) |
| 2 | How large is each unit | Determined by per-session speed SLA; B300 P-unit = 64 GPUs, D-unit = 128/192/384 GPUs |
| 3 | How to configure TP/EP/PP/DP | EP is the backbone, TP should be as small as possible, PP defaults to 1, DP equals replica count |
NVIDIA B300 baseline configuration (λ=5.86 req/s, O=512, h=0.56):
| Context | 30 tok/s | 100 tok/s | 400 tok/s |
|---|---|---|---|
| 4K–256K | P1:D1 = 192 GPUs | P1:D1 = 256 GPUs | P1:D1 = 448 GPUs |
| 1M | P6:D1 = 512 GPUs | P6:D1 = 576 GPUs | P6:D1 = 768 GPUs |
Huawei 950 Supernode (conservative estimate using pure 950DT; recommended to use 950PR for P):
| Context | 30 tok/s | 100/400 tok/s |
|---|---|---|
| 4K–128K | P1:D1 = 320 NPUs | P1:D1 = 512 NPUs |
| 256K | P3:D1 = 576 NPUs | P3:D1 = 768 NPUs |
| 1M | P13:D1 ≈ 2048 NPUs (capacity floor) | P16:D1 ≈ 2432 NPUs (production recommendation) |
Now, let's derive everything from scratch.
1. V4-Pro Structural Parameters: The Foundation of All Calculations
Before computing anything, we need to pin down V4-Pro's key parameters. These numbers are the foundation for all subsequent derivations.
V4-Pro is a 1.6T total parameter, 49B active parameter MoE (Mixture of Experts) model. The core idea of MoE: the model is large, but each inference only uses a small portion. For V4-Pro:
- 1.6T total parameters, but only 49B activated per token—activation rate ≈ 3%, which is the source of MoE's efficiency
- Each layer has 1 shared expert + 384 routed experts
- Each token activates 6 routed experts (top-6 routing)
- Expert intermediate dimension = 3072
- MTP depth = 1 (Multi-Token Prediction; while predicting the next token, it also attempts to predict one additional token, used to accelerate decode)
- Supports 1M context
Why are these numbers important? Because they directly determine:
- Per-token compute (FLOPs) → how much compute power is needed
- Per-token KV cache size → how much memory is needed
- Expert sharding strategy → how EP (Expert Parallelism) should be configured
2. Define Workload First, Then Configure
2.1 A Common Misconception
Many people, given a model, instinctively ask "how many GPUs does this model need?" But the right question isn't "how many GPUs does the model need"—it's "how many GPUs does my business need."
The same V4-Pro model:
- If your business is 4K short conversations at 30 tok/s standard speed → 192 B300 GPUs might suffice
- If your business is 1M long-document analysis at 100 tok/s → at least 576 B300 GPUs
- That's a 3× difference, entirely driven by workload
So we must define the workload first.
2.2 Four Key Business Parameters
Only four business parameters determine the deployment configuration:
- λ (request arrival rate): how many new requests per second
- O (average output length): how many tokens each session outputs on average
- L (average input context length): how long each session's prompt is
- h (prefix cache hit rate): what fraction of prefixes in the P tier can skip computation
Given these four parameters, P and D loads are fully determined:
D-tier total output throughput: Q_D = λ × O (tokens to output per second)
P-tier total input throughput: Q_P = λ × L × (1 − h) (new input tokens to process per second, minus cache hits)
Note the (1−h) factor on the P side—if cache hits, the P tier doesn't need to recompute that portion of the prompt.
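The two load formulas can be sketched in a few lines of Python. This is a minimal illustration; the function name `tier_loads` and its argument names are ours and do not come from any serving framework.

```python
def tier_loads(lam, out_len, ctx_len, hit_rate):
    """Return (Q_P, Q_D) in tokens/s for the P and D tiers."""
    q_d = lam * out_len                      # Q_D = lambda * O
    q_p = lam * ctx_len * (1.0 - hit_rate)   # Q_P = lambda * L * (1 - h)
    return q_p, q_d


# Baseline traffic at 128K context (decimal 128,000 tokens).
q_p, q_d = tier_loads(lam=5.86, out_len=512, ctx_len=128_000, hit_rate=0.56)
print(f"Q_P = {q_p:,.0f} tok/s, Q_D = {q_d:,.0f} tok/s")
```

Only Q_P depends on context length and cache hit rate; Q_D is fixed by λ and O alone, which is why the P tier is the one that grows at long context.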
2.3 What Baseline We Use
Throughout this article, we use this baseline:
O = 512, h = 0.56, λ = 5.86 req/s
λ = 5.86 is not a made-up number. It corresponds to a concrete scenario:
Imagine 100 concurrent users, each generating at 30 tok/s, with an average output of 512 tokens per session. A session takes 512 / 30 ≈ 17 seconds from start to finish, so keeping 100 sessions active means new requests must arrive at λ = 100 × 30 / 512 = 5.86 req/s.
At the same λ, the number of active sessions varies dramatically with different per-session speeds:
| Per-session Speed | TPOT (per-token latency) | Active Decode Sessions | How Derived |
|---|---|---|---|
| 30 tok/s | 33.3 ms | 100 | 5.86 × 512 / 30 = 100 |
| 100 tok/s | 10 ms | 30 | 5.86 × 512 / 100 = 30 |
| 400 tok/s | 2.5 ms | 7.5 | 5.86 × 512 / 400 = 7.5 |
This table reveals a counterintuitive fact: with the same business traffic (same req/s), faster per-session speed means fewer concurrent active decode sessions. At 400 tok/s, there are only 7.5 active sessions.
Fewer active sessions might sound good, but for MoE models it's a major problem—we'll expand on this later.
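The table above is an application of Little's law (average concurrency = arrival rate × time in system). A minimal sketch, with `active_sessions` as an illustrative name of our own:

```python
def active_sessions(lam, out_len, tok_per_s):
    """Average concurrent decode sessions = arrival rate * session duration."""
    session_seconds = out_len / tok_per_s
    return lam * session_seconds


for speed in (30, 100, 400):
    tpot_ms = 1000.0 / speed  # per-token latency implied by the speed tier
    sessions = active_sessions(5.86, 512, speed)
    print(f"{speed} tok/s: TPOT {tpot_ms:.1f} ms, {sessions:.1f} active sessions")
```

Because session duration shrinks as per-session speed rises, concurrency falls in proportion while λ stays fixed.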
3. How Much Compute Per Token (FLOPs)
Now let's calculate how much compute V4-Pro needs per token.
3.1 Base FLOPs: Why 0.098 TFLOP/token
V4-Pro activates 49B parameters per token. For transformer linear layers, one forward pass requires approximately 2 × parameter count FLOPs (one matrix multiply's FLOPs ≈ 2mn, where n is input dimension, m is output dimension, and the weight matrix has mn parameters).
So:
Base FLOPs = 2 × 49B = 98B FLOP = 0.098 TFLOP / token
This is the "base compute tax" that must be paid regardless of prefill or decode.
3.2 Attention Overhead Grows with Context Length
But the above only covers the context-independent weight-matrix compute. The attention-score compute depends on context length L:
- Prefill: the token at position i attends to positions 0 through i, so across a prompt of length L each token attends to about L/2 positions on average; per-token attention compute scales with L/2
- Decode: each new token's attention reads KV at all existing positions (length L), so per-token attention compute scales with L
So the attention FLOPs can be approximated as:
F_decode(L) = 0.098 + 0.202 × L_M TFLOP/output token
F_prefill(L) = 0.098 + 0.101 × L_M TFLOP/input token
Where L_M = L / 10⁶ (in millions of tokens).
Decode's attention term (0.202) is 2× that of prefill, because prefill accumulates from 0 to L and only processes half the length on average, while decode scans all L positions at every step.
Let's plug in some specific numbers to verify our intuition:
| Context L | L_M | Decode FLOPs/token | Prefill FLOPs/token | Intuition Check |
|---|---|---|---|---|
| 4K | 0.004 | 0.099 | 0.098 | Almost pure base compute; attention negligible |
| 128K | 0.128 | 0.124 | 0.111 | Attention starting to matter |
| 1M | 1.0 | 0.300 | 0.199 | 1M decode compute is ~3× the base compute; the attention term alone is ~2× base |
This means: at 1M context, decode compute per token is 3× that of 4K context; prefill compute per token is 2× that of 4K. Context length's impact on compute cannot be ignored.
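The per-token FLOPs model is simple enough to check in code. The sketch below hard-codes the article's coefficients (0.098, 0.202, 0.101); the function names are our own.

```python
BASE_TFLOP = 2 * 49e9 / 1e12  # 2 * active parameters = 0.098 TFLOP/token


def f_decode(ctx_len):
    """TFLOP per output token: base plus full-context attention."""
    return BASE_TFLOP + 0.202 * ctx_len / 1e6


def f_prefill(ctx_len):
    """TFLOP per input token: base plus attention over ~L/2 positions on average."""
    return BASE_TFLOP + 0.101 * ctx_len / 1e6


for ctx in (4_000, 128_000, 1_000_000):
    print(f"L={ctx:>9,}: decode {f_decode(ctx):.3f}, prefill {f_prefill(ctx):.3f}")
```

The decode-to-base ratio at 1M context is `f_decode(1_000_000) / BASE_TFLOP`, which is just over 3.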
4. KV Cache: How Much Memory Per Session
V4-Pro's KV cache differs from that of a conventional dense-attention model. It uses a heterogeneous KV structure: CSA/HCA compressed KV, SWA (Sliding Window Attention) KV, and state cache are managed separately. We focus on the compressed CSA/HCA portion:
K(L) ≈ 5.3 × L_M GB/sequence
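Specific numbers (the 4K–256K rows use binary token counts, e.g. 4K = 4,096 tokens; the 1M row uses L_M = 1.0):
| Context | KV/sequence | How Derived |
|---|---|---|
| 4K | 0.022 GB | 5.3 × 0.004096 |
| 32K | 0.174 GB | 5.3 × 0.032768 |
| 128K | 0.695 GB | 5.3 × 0.131072 |
| 256K | 1.389 GB | 5.3 × 0.262144 |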
| 1M | 5.300 GB | 5.3 × 1.0 |
An important detail: SWA KV is approximately 8× the compressed CSA/HCA KV. That is, if the system fully caches SWA as well, at 1M context each session's KV is 5.3 × 8 = 42.4 GB—a single B300 GPU can't hold many sessions. So the C-tier (Cache tier) cannot default to full SWA caching; selective caching is mandatory.
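A small sketch of the KV sizing rule, including the optional 8× SWA multiplier. The constants come from the text; the function name and the `full_swa` flag are illustrative. Note that the table values for 4K through 256K are reproduced when binary token counts are used (4K = 4,096).

```python
KV_GB_PER_M_TOKENS = 5.3  # compressed CSA/HCA KV per million tokens (from the text)
SWA_MULTIPLIER = 8        # approximate SWA-to-compressed ratio (from the text)


def kv_gb(ctx_tokens, full_swa=False):
    """Per-session KV size in GB; full_swa applies the 8x multiplier from the text."""
    gb = KV_GB_PER_M_TOKENS * ctx_tokens / 1e6
    return gb * SWA_MULTIPLIER if full_swa else gb


for ctx in (4_096, 32_768, 131_072, 262_144, 1_000_000):
    print(f"{ctx:>9,} tokens: {kv_gb(ctx):.3f} GB compressed, "
          f"{kv_gb(ctx, full_swa=True):.1f} GB with the 8x SWA multiplier")
```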
5. P-unit and D-unit: Define Units First, Then Derive Ratios
5.1 Why Not Just Say "P:D = 1:1"
Many articles directly state a P:D ratio, like "P:D = 1:2." But this ratio is meaningless unless you know "how large is P, how large is D."
An analogy: saying "there are as many kitchens as people" is meaningless unless you know how much food each kitchen can produce and how much each person eats. A kitchen that can cook 20 dishes simultaneously paired with 20 light eaters, versus a kitchen that can only make 5 dishes paired with 5 heavy eaters—both have a 1:1 ratio, but the meaning is entirely different.
So the correct approach is:
- First define P-unit (a minimum independently deployable prefill unit) and D-unit (a minimum independently deployable decode unit)
- Calculate each unit's goodput (effective throughput)
- Derive how many P-units and D-units are needed based on business traffic
- The P:D ratio is derived, not assumed
5.2 How to Calculate P-unit / D-unit Goodput
Goodput ≠ raw peak FLOPS. Effective throughput must account for:
- Hardware peak compute C_node
- Model's hardware utilization η under specific workload
- SLA constraints (TTFT for P tier, TPOT for D tier)
- Communication overhead (EP all-to-all communication)
We use this formula:
G_P = reserve × C_node × η_P × 1000 / F_P(L) tok/s
G_D = reserve × C_node × η_D × 1000 / F_D(L) tok/s
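These formulas give per-node goodput; a unit's goodput multiplies it by the unit's node count. F_P and F_D are F_prefill and F_decode from §3.2, and the factor 1000 converts PFLOPS to TFLOPS. reserve is the usable fraction after SLA headroom: 0.7 for the P tier (30% reserved for bursts) and 0.6 for the D tier (40% reserved for TPOT p90). η is the model's FLOPS utilization on that hardware.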
Parameters for both platforms:
| Hardware | Per-node Compute C_node | P-tier Utilization η_P | D-tier Utilization η_D |
|---|---|---|---|
| B300 8-GPU | 36 PFLOPS dense FP8 | 0.45 | 0.14 |
| 950DT 8-NPU | 8 PFLOPS FP8 | 0.45 | 0.12 |
Why is D-tier utilization so much lower than P-tier? P-tier prefill is compute-bound (processing the entire prompt in large batches), yielding high hardware utilization; D-tier decode is memory-bound (generating only 1 token per step, with most time spent reading weights and KV cache), so utilization is inherently low. D-tier's 0.14/0.12 are typical values for MoE decode, not anomalies.
These efficiency parameters are calibrated from DeepSeek V3/R1 production data on H800 (EP32 prefill + EP144 decode), not measured on V4-Pro. We'll repeat this caveat throughout.
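The goodput formula can be written as one function. This is a sketch under the article's assumptions (η values calibrated from other models, usable fractions of 0.7 and 0.6); `unit_goodput` and its parameter names are ours.

```python
def unit_goodput(c_node_pflops, eta, tflop_per_token, usable_fraction, nodes):
    """Tokens/s for a unit of `nodes` 8-accelerator nodes, after SLA headroom."""
    raw_per_node = c_node_pflops * 1000.0 * eta / tflop_per_token  # PFLOPS -> TFLOPS
    return usable_fraction * raw_per_node * nodes


# B300 at 128K, using the rounded per-token FLOPs from the text.
g_p = unit_goodput(36, 0.45, 0.111, usable_fraction=0.7, nodes=8)    # 64-GPU P-unit
g_d = unit_goodput(36, 0.14, 0.124, usable_fraction=0.6, nodes=24)   # 192-GPU D-unit
print(f"G_P,unit = {g_p / 1000:.0f}K tok/s, G_D,unit = {g_d / 1000:.0f}K tok/s")
```

Swapping in measured η values later only changes the `eta` arguments; the rest of the calculation stays the same.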
5.3 From Goodput to P-unit / D-unit Size
P-unit and D-unit sizes are not determined by average traffic, but by per-session speed SLA and TPOT (Time Per Output Token) constraints.
D-unit logic is the most intuitive:
Your SLA is 100 tok/s, equivalent to TPOT ≤ 10ms. In MoE models, each decode step requires an expert all-to-all communication. The larger the EP (fewer experts per GPU), the faster each step, the lower the TPOT.
So D-unit size is primarily determined by "what tok/s do I need to achieve":
| Target Speed | TPOT Requirement | B300 D-unit | 950DT D-unit | Why This Size |
|---|---|---|---|---|
| 30 tok/s | 33.3 ms | 128 GPUs (EP128) | 192 NPUs (EP192) | Standard tier; TPOT headroom is generous |
| 100 tok/s | 10 ms | 192 GPUs (EP192) | 384 NPUs (EP384) | Larger EP, lower TPOT |
| 400 tok/s | 2.5 ms | 384 GPUs (EP384) | 384 NPUs+ (EP384+) | Must have ≤ 1 expert per GPU |
What's the relationship between EP and GPU count? V4-Pro has 384 routed experts. EP128 means 384 experts distributed across 128 GPUs, each GPU holding 384/128 = 3 experts. Similarly:
| EP | experts/GPU | Weights per GPU | Fits? |
|---|---|---|---|
| EP64 | 6 | Complete weights for 6 experts | Requires large HBM |
| EP128 | 3 | Weights for 3 experts | Reasonable |
| EP192 | 2 | Weights for 2 experts | Comfortable |
| EP384 | 1 | Weights for 1 expert | Very comfortable, but high communication overhead |
P-unit size logic is similar, but since prefill is compute-bound and doesn't need ultra-low TPOT, P-units can be relatively smaller. B300 uses EP64 (64 GPUs, 6 experts per GPU); 950DT uses EP128 (128 NPUs, 3 experts per NPU).
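The EP-to-experts mapping in the table is plain integer division. A minimal sketch (the function name is ours; EP degrees that do not divide 384, such as EP512, would need expert replication and are deliberately rejected here):

```python
ROUTED_EXPERTS = 384  # V4-Pro routed experts per layer (from the text)


def experts_per_gpu(ep):
    """Routed experts hosted on each GPU for an EP degree that divides 384."""
    if ep <= 0 or ROUTED_EXPERTS % ep != 0:
        raise ValueError(f"EP{ep} does not evenly divide {ROUTED_EXPERTS} experts")
    return ROUTED_EXPERTS // ep


for ep in (64, 128, 192, 384):
    print(f"EP{ep}: {experts_per_gpu(ep)} experts per GPU")
```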
6. Full Derivation: From Workload to P:D Ratio
6.1 General Formula
With P-unit and D-unit goodput calculated, N_P and N_D follow naturally:
N_P = ⌈Q_P / G_{P,unit}(L)⌉ = ⌈λ × L × (1−h) / G_{P,unit}(L)⌉
N_D = max(⌈Q_D / G_{D,unit}(L)⌉, N_D^SLA)
Where N_D^SLA is the minimum number of D-units needed to satisfy the TPOT SLA.
Note N_D takes the max: even if traffic is low, the D tier needs at least 1 D-unit to guarantee TPOT.
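The two sizing formulas translate directly into code. In this sketch `unit_counts` and `n_d_sla` are illustrative names; `n_d_sla` is the TPOT-driven floor on D-units.

```python
import math


def unit_counts(q_p, q_d, g_p_unit, g_d_unit, n_d_sla=1):
    """Return (N_P, N_D): ceil of load over unit goodput, with an SLA floor on D."""
    n_p = math.ceil(q_p / g_p_unit)
    n_d = max(math.ceil(q_d / g_d_unit), n_d_sla)
    return n_p, n_d


# Loads and unit goodputs in tokens/s, rounded as in the worked examples.
print(unit_counts(q_p=330_000, q_d=3_000, g_p_unit=817_000, g_d_unit=585_000))
print(unit_counts(q_p=2_578_000, q_d=3_000, g_p_unit=455_800, g_d_unit=585_000))
```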
6.2 Worked Example: B300, 128K, 100 tok/s
Let's walk through a specific scenario step by step: B300 platform, 128K context, 100 tok/s, baseline business traffic.
Step 1: How many tokens/s does the P tier need to process?
Q_P = λ × L × (1−h) = 5.86 × 128000 × (1−0.56) = 5.86 × 128000 × 0.44
First: 5.86 × 0.44 = 2.578
Then: 2.578 × 128000 ≈ 330,000 token/s (330K token/s)
Step 2: How much can one P-unit (64 GPUs) process?
F_prefill(128K) = 0.098 + 0.101 × 0.128 = 0.098 + 0.013 = 0.111 TFLOP/token
Single B300 node (8 GPUs) raw throughput:
S_P = 36 PFLOPS × 0.45 × 1000 / 0.111 = 145,946 token/s ≈ 146K token/s
8 nodes = 64-GPU P-unit:
G_{P,unit} = 0.7 × 146K × 8 = 817K token/s
Step 3: How many P-units needed?
N_P = ⌈330K / 817K⌉ = ⌈0.40⌉ = 1
Step 4: How many tokens/s does the D tier need to process?
Q_D = λ × O = 5.86 × 512 = 3,000 token/s (~3000 tok/s)
Step 5: How many D-units at minimum?
100 tok/s corresponds to D100-unit = 192 GPUs (EP192), per D-unit goodput:
F_decode(128K) = 0.098 + 0.202 × 0.128 = 0.098 + 0.026 = 0.124 TFLOP/token
Per-node raw decode throughput:
S_D = 36 PFLOPS × 0.14 × 1000 / 0.124 = 40,645 token/s ≈ 40.6K token/s
24 nodes = 192-GPU D-unit:
G_{D,unit} = 0.6 × 40.6K × 24 = 585K token/s
Step 6:
N_D = max(⌈3000 / 585K⌉, 1) = max(⌈0.005⌉, 1) = 1
Result: P1:D1 = 64 + 192 = 256 GPUs ✓
You may notice that D-unit goodput (585K tok/s) far exceeds business demand (3K tok/s). P:D = 1:1 does not arise because P and D throughput happen to match; it arises because the D-unit's minimum deployable granularity already far exceeds the capacity floor. You can't build a D-unit satisfying a 100 tok/s SLA with fewer than 192 GPUs.
6.3 Why 1M Needs P6:D1—Full Derivation
The 1M scenario is the P tier's nightmare. Let's calculate exactly why.
Step 1: P-tier token/s
Q_P = 5.86 × 1,000,000 × 0.44 = 2,578,000 token/s ≈ 2.58M token/s
Compared to 128K's 330K token/s—the 1M P-tier load is 7.8× that of 128K.
Step 2: What's the P-unit goodput at 1M?
F_prefill(1M) = 0.098 + 0.101 × 1.0 = 0.199 TFLOP/token
Per-node raw prefill throughput:
S_P = 36 × 0.45 × 1000 / 0.199 = 81,407 token/s ≈ 81.4K token/s
P-unit (8 nodes = 64 GPUs) goodput:
G_{P,unit} = 0.7 × 81.4K × 8 = 455.8K token/s
Step 3:
N_P = ⌈2578K / 455.8K⌉ = ⌈5.66⌉ = 6
6 P-units = 384 GPUs.
So P6:D1 = 384P + 192D = 576 GPUs (100 tok/s scenario).
These 6 P-units can run independently or be merged into one 384-GPU P-supergroup using CP8 (Context Parallelism, 8-way parallel) to split a 1M prefill across 8 GPU groups for parallel computation, reducing TTFT.
Why not just use a larger P-unit? Because excessive EP increases all-to-all communication overhead. When running independently, each P-unit maintains EP64 (6 experts per GPU) with controlled communication overhead. When merged into a CP supergroup, only KV (not expert weights) is transmitted across groups. One caveat on the group shape: 8 groups on 384 GPUs means 48 GPUs per group (EP48, 8 experts per GPU); keeping EP64 inside each group gives 6-way CP on 384 GPUs, or 512 GPUs for 8-way CP.
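The whole B300 derivation chain (load, per-token FLOPs, unit goodput, unit counts) fits in one short function. This is a sketch under the article's assumptions: the constants are the baseline and B300 parameters from the text, and `plan_b300` is a name of our own. It uses unrounded FLOPs, so intermediate goodputs differ slightly from the hand-rounded figures while the unit counts agree.

```python
import math

LAM, OUT, HIT = 5.86, 512, 0.56            # baseline traffic from the text
C_NODE, ETA_P, ETA_D = 36.0, 0.45, 0.14    # B300 8-GPU node, PFLOPS dense FP8
P_NODES = 8                                # 64-GPU P-unit
D_NODES = {30: 16, 100: 24, 400: 48}       # 128 / 192 / 384-GPU D-units


def plan_b300(ctx_len, speed):
    """Return (N_P, N_D, total GPUs) for a context length and speed tier."""
    l_m = ctx_len / 1e6
    f_p = 0.098 + 0.101 * l_m
    f_d = 0.098 + 0.202 * l_m
    g_p = 0.7 * C_NODE * 1000 * ETA_P / f_p * P_NODES
    g_d = 0.6 * C_NODE * 1000 * ETA_D / f_d * D_NODES[speed]
    n_p = math.ceil(LAM * ctx_len * (1 - HIT) / g_p)
    n_d = max(math.ceil(LAM * OUT / g_d), 1)   # at least one D-unit for TPOT
    gpus = 8 * (n_p * P_NODES + n_d * D_NODES[speed])
    return n_p, n_d, gpus


print(plan_b300(128_000, 100))     # the 128K / 100 tok/s worked example
print(plan_b300(1_000_000, 100))   # the 1M / 100 tok/s worked example
```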
6.4 Huawei 950 Supernode: Why 1M Reaches P13:D1 or Even P16:D1
Same derivation logic, but 950DT's per-node compute is lower (8 PFLOPS vs B300's 36 PFLOPS).
P-unit goodput (128 NPUs):
F_prefill(1M) = 0.199 TFLOP/token
Per-node raw: 8 × 0.45 × 1000 / 0.199 = 18,090 token/s ≈ 18.1K token/s
P-unit (16 nodes = 128 NPUs) goodput: 0.7 × 18,090 × 16 = 202.6K token/s
N_P = ⌈2578K / 202.6K⌉ = ⌈12.7⌉ = 13
P13:D1 = 13 × 128P + 192D = 1664 + 192 = 1856 NPUs (rounded to 2048 NPUs with headroom).
But P13 doesn't lend itself to clean CP groupings—13 can't be evenly divided into reasonable CP × EP combinations. Production recommendation: P16:D1 = 2048P + 384D = 2432 NPUs, using EP128 × CP16 or EP256 × CP8.
This is why we repeatedly say: Huawei's 1M scenario should not use pure 950DT; use 950PR for P. The 950PR is purpose-built for prefill—if its prefill goodput is 2× that of 950DT (a reasonable expectation), P-tier NPU count could be cut in half.
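To see how sensitive the P-tier count is to prefill goodput, the sketch below recomputes N_P for the 128-NPU unit with a hypothetical `prefill_speedup` factor. The 2× value is the article's stated expectation for 950PR, not a measurement, and the function name is ours.

```python
import math


def p_units_950(ctx_len, prefill_speedup=1.0, lam=5.86, hit=0.56):
    """P-units (128 NPUs each) needed on 950DT-class hardware."""
    f_p = 0.098 + 0.101 * ctx_len / 1e6
    per_node = 8.0 * 1000 * 0.45 / f_p * prefill_speedup  # 8 PFLOPS node, eta_P = 0.45
    g_unit = 0.7 * per_node * 16                          # 16 nodes, 30% burst reserve
    return math.ceil(lam * ctx_len * (1 - hit) / g_unit)


print(p_units_950(1_000_000))                        # pure 950DT
print(p_units_950(1_000_000, prefill_speedup=2.0))   # hypothetical 2x prefill part
```

Under these assumptions the 1M count falls from 13 to 7 rather than exactly half, because the unit count is rounded up.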
7. 400 tok/s: Why This Speed Is Especially Difficult
Now let's return to the earlier table where 400 tok/s had only 7.5 active sessions.
7.1 The Expert Microbatch Disaster
V4-Pro activates 6 routed experts per token. With 384 routed experts and roughly uniform routing, any given expert has a 6/384 = 1/64 probability of being activated by a given token.
Assuming B concurrent tokens in a forward pass, a single expert's microbatch size is:
B_expert = B × 6 / 384 = B / 64
Plugging in active sessions at different speeds:
| Speed | Active Sessions | Expert Microbatch | Problem |
|---|---|---|---|
| 30 tok/s | 100 | 100/64 ≈ 1.56 | Barely enough for GEMM |
| 100 tok/s | 30 | 30/64 ≈ 0.47 | Less than 1; poor utilization |
| 400 tok/s | 7.5 | 7.5/64 ≈ 0.12 | Far less than 1 |
Expert microbatch = 0.12 means: in one forward pass, each expert receives only 0.12 tokens on average—not even 1. The GPU's GEMM units are completely starved.
This is the fundamental contradiction of MoE decode: you want faster per-session speed, which requires fewer concurrent streams to satisfy TPOT; but fewer concurrent streams mean smaller MoE expert batches and worse GEMM utilization.
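The microbatch arithmetic can be checked directly. The sketch assumes uniform routing, as the table does; `expert_microbatch` is an illustrative name.

```python
def expert_microbatch(concurrent_tokens, top_k=6, routed_experts=384):
    """Average tokens each routed expert sees per forward pass (uniform routing)."""
    return concurrent_tokens * top_k / routed_experts


for speed, sessions in ((30, 100), (100, 30), (400, 7.5)):
    print(f"{speed} tok/s: {expert_microbatch(sessions):.2f} tokens per expert")
```

Passing a larger `concurrent_tokens` (for example, sessions multiplied by the number of tokens verified per step under speculative decoding) scales the microbatch linearly.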
7.2 Mitigation Strategies
400 tok/s isn't impossible, but requires a combination of techniques:
- EP384 or EP512: Each GPU handles only 1 expert (or fewer), minimizing the expert weights each GPU must read per step; the cost is more all-to-all communication, so the gain has to be confirmed against measured TPOT
- MTP / Speculative Decoding: Instead of predicting only 1 token per step, use a draft model to guess 2–4 tokens, multiplying the expert microbatch by 2–4×
- Hot expert replication: Create multiple replicas of high-frequency experts to reduce per-expert load imbalance
- D-unit dedicated speed tiers: 400 tok/s users get dedicated D-units, not mixed with 30 tok/s users
- The deciding factor: V4-Pro's measured TPOT p90, which none of the techniques above can substitute for
CloudMatrix-Infer's measured data on DeepSeek-R1 already shows: the stricter the TPOT SLO, the smaller the batch size, the lower the decode throughput. This is a fundamental trade-off, not something that can be resolved by tuning parameters.
Recommendation: 30 tok/s as standard SLA, 100 tok/s as premium SLA, 400 tok/s only as an experimental SLA. Do not promise 400 tok/s to customers without measured V4-Pro data.
8. EP/TP/PP/DP Trade-offs in Practice
8.1 EP Is the Backbone
The core of the entire parallelism strategy is EP (Expert Parallelism). The reason is straightforward: V4-Pro has 384 routed experts, naturally suited for distribution across GPUs.
EP's trade-off is clear:
- Larger EP → fewer experts per GPU → faster per-step compute → lower TPOT
- Larger EP → more all-to-all communication → higher communication latency
In decode scenarios, compute is fast (memory-bound), so communication latency accounts for a large share. EP isn't "bigger is better"—you need to find the TPOT sweet spot:
| Stage | Recommended EP | Why |
|---|---|---|
| P / prefill | EP64/128 | Large prefill batch, compute-bound, low communication share |
| D / 30 tok/s | EP128/192 | Standard tier; 33ms TPOT headroom is generous |
| D / 100 tok/s | EP192/384 | 10ms TPOT, needs larger EP |
| D / 400 tok/s | EP384/512 | 2.5ms TPOT, must have ≤ 1 expert/GPU |
8.2 Keep TP Small
TP (Tensor Parallelism) splits one expert's weights across multiple GPUs. For MoE models, TP introduces two problems:
- Increased communication: every forward step requires all-reduce
- KV duplication: if attention layers also use TP, KV cache is redundantly stored across multiple GPUs
So start with TP1. Only consider TP2 + DCP (Decode Context Parallelism) when HBM is insufficient or when 1M scenario KV cache is too large, to reduce KV duplication.
8.3 PP Defaults to 1
PP (Pipeline Parallelism) places different model layers on different GPUs. For inference, especially decode, PP1 is the default—because PP introduces pipeline bubbles that directly increase per-token latency.
Consider PP only in one scenario: a single execution group's HBM can't hold the entire model. But V4-Pro's MoE portion already distributes expert weights via EP, and the dense/shared portion is relatively small, so PP isn't usually needed.
8.4 DP Is Simply P/D Replica Count
In PD disaggregation, DP's most practical meaning is:
DP_P = N_P (number of P-units)
DP_D = N_D (number of D-units)
When scaling traffic, prefer adding DP (adding units) over changing individual unit TP/EP/PP. Each unit's parallelism config is SLA-validated; changing it may break TPOT constraints.
8.5 CP/DCP: Critical for 1M Scenarios
CP (Context Parallelism, for prefill) and DCP (Decode Context Parallelism) aren't in the standard "TP/EP/PP/DP" quartet, but must be included in designs for 128K+ and especially 1M:
- P-tier CP: splits a long prefill's attention computation across multiple GPU groups in parallel, reducing TTFT
- D-tier DCP: shards KV cache across multiple GPUs, reducing per-GPU KV storage pressure
At 1M context, a single session's KV cache is ~5.3 GB. If a D-tier GPU serves 20 concurrent sessions, KV alone is 106 GB, about 40% of a B300 GPU's 262.5 GB HBM before weights and activations are counted. DCP can distribute KV across multiple GPUs, but the cost is cross-GPU KV reads during every attention step, increasing communication.
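A quick way to reason about DCP is to compute per-GPU KV footprint as a function of sessions and sharding degree. This is a simplified sketch: it counts only the compressed KV from §4, assumes an even split across `dcp` GPUs, and uses names of our own.

```python
def kv_per_gpu(sessions, ctx_tokens, hbm_gb=262.5, dcp=1):
    """Return (KV GB per GPU, fraction of HBM) with KV sharded evenly over dcp GPUs."""
    kv_gb = 5.3 * ctx_tokens / 1e6 * sessions / dcp
    return kv_gb, kv_gb / hbm_gb


for dcp in (1, 2, 4):
    gb, frac = kv_per_gpu(sessions=20, ctx_tokens=1_000_000, dcp=dcp)
    print(f"DCP{dcp}: {gb:.1f} GB per GPU ({frac:.0%} of HBM)")
```

With 20 sessions at 1M and no sharding, compressed KV takes 106 GB, about 40% of the 262.5 GB figure used in the text.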
9. B300 All-Scenario Configuration Quick Reference
Synthesizing all the above analysis, B300's all-scenario configuration:
P-unit / D-unit Definitions
| Unit | GPU Count | Internal Parallelism |
|---|---|---|
| P-unit | 64 | TP1 / EP64 / PP1 / CP1; 128K+ can merge multiple P-units for CP |
| D30-unit | 128 | TP1 / EP128 / PP1 |
| D100-unit | 192 | TP1 / EP192 / PP1 |
| D400-unit | 384 | TP1 / EP384 / PP1 + MTP/speculative |
All-Scenario P:D Configuration
Baseline: λ = 5.86, O = 512, h = 0.56
| Context | Speed | P:D | P GPUs | D GPUs | Total GPUs |
|---|---|---|---|---|---|
| 4K | 30 | P1:D1 | 64 | 128 | 192 |
| 4K | 100 | P1:D1 | 64 | 192 | 256 |
| 4K | 400 | P1:D1 | 64 | 384 | 448 |
| 32K | 30 | P1:D1 | 64 | 128 | 192 |
| 32K | 100 | P1:D1 | 64 | 192 | 256 |
| 32K | 400 | P1:D1 | 64 | 384 | 448 |
| 128K | 30 | P1:D1 | 64 | 128 | 192 |
| 128K | 100 | P1:D1 | 64 | 192 | 256 |
| 128K | 400 | P1:D1 | 64 | 384 | 448 |
| 256K | 30 | P1:D1 | 64 | 128 | 192 |
| 256K | 100 | P1:D1 | 64 | 192 | 256 |
| 256K | 400 | P1:D1 | 64 | 384 | 448 |
| 1M | 30 | P6:D1 | 384 | 128 | 512 |
| 1M | 100 | P6:D1 | 384 | 192 | 576 |
| 1M | 400 | P6:D1 | 384 | 384 | 768 |
Interpretation:
- 4K–256K P1:D1: Not because P/D throughput happens to be 1:1, but because the minimum deployment units already exceed the capacity floor. The 64-GPU P-unit and 128/192/384-GPU D-units all far exceed actual traffic requirements.
- 1M P6:D1: P-tier 1M prefill is the absolute bottleneck. Prefill work per request grows faster than linearly with context length: a 1M prompt has 250× the tokens of a 4K prompt at about 2× the FLOPs per token, roughly 500× the compute. The 384-GPU P-supergroup uses CP8 to complete a 1M prefill within reasonable time.
10. Huawei 950 Supernode All-Scenario Configuration
10.1 Recommended Architecture
The Huawei solution should not be a flat 950DT deployment. The correct architecture is:
P tier: 950PR (optimized for prefill)
D tier: 950DT (optimized for decode)
C tier: Atlas 950 SuperPoD / UnifiedBus / Kunpeng / EMS / NVMe cache
The numbers below use pure 950DT for conservative estimation. Once 950PR's measured prefill goodput is available, P-tier counts are expected to decrease significantly.
10.2 All-Scenario P:D Configuration (Pure 950DT)
Baseline: λ = 5.86, O = 512, h = 0.56
| Context | Speed | P:D | P NPUs | D NPUs | Total NPUs |
|---|---|---|---|---|---|
| 4K–128K | 30 | P1:D1 | 128 | 192 | 320 |
| 4K–128K | 100/400 | P1:D1 | 128 | 384 | 512 |
| 256K | 30 | P3:D1 | 384 | 192 | 576 |
| 256K | 100/400 | P3:D1 | 384 | 384 | 768 |
| 1M | 30 | P13:D1 | 1664 | 192 | 1856→2048 |
| 1M | 100/400 | P13:D1 | 1664 | 384 | 2048→2432 |
Cleaner Supernode production slices:
| Context | Production Recommendation |
|---|---|
| 256K | P4:D1 = 512P + 384D = 896 NPUs (more stable TTFT) |
| 1M | P16:D1 = 2048P + 384D = 2432 NPUs (EP128×CP16 or EP256×CP8) |
10.3 Internal Parallelism Recommendations
| Context | P Tier | D Tier 30 tok/s | D Tier 100/400 tok/s |
|---|---|---|---|
| 4K | TP1 / EP128 / PP1 / CP1 | TP1 / EP192 / PP1 | TP1 / EP384 / PP1 |
| 128K | EP128 / CP1; use CP2 for strict TTFT | EP192 | EP384 |
| 256K | EP128 × CP3; production recommends CP4 | EP192 | EP384 |
| 1M | EP128 × CP16 or EP256 × CP8 | EP192 / DCP | EP384 / DCP + MTP |
11. Cost: How Much Per Million Tokens
Finally, let's calculate cost. The derivation chain is straightforward:
$/MTok = Total GPUs × per-GPU-hour cost ÷ million tokens output per hour
Tokens output per hour = λ × O × 3600
$/MTok = N_gpu × $/GPU-hour / (λ × O × 3600 / 1,000,000)
Cost assumptions:
- B300: $5/GPU-hour (planning estimate, not official pricing)
- 950DT: $2/NPU-hour (planning estimate)
11.1 Full Derivation: B300 128K / 100 tok/s
N_gpu = 256 (P1:D1 = 64P + 192D)
Output per hour = 5.86 × 512 × 3600 = 10,801,152 tokens ≈ 10.80M tokens
Cost per hour = 256 × $5 = $1,280
$/MTok = $1,280 / 10.80 ≈ $119/MTok
11.2 Full Derivation: B300 1M / 100 tok/s
N_gpu = 576 (P6:D1 = 384P + 192D)
Output per hour = 10.80M tokens (same λ and O)
Cost per hour = 576 × $5 = $2,880
$/MTok = $2,880 / 10.80 ≈ $267/MTok
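The cost formula is a one-liner. The sketch below uses the article's planning prices, which are estimates and not official pricing; `usd_per_mtok` is an illustrative name.

```python
def usd_per_mtok(n_accel, usd_per_hour, lam=5.86, out_len=512):
    """Cost per million output tokens for a deployment of n_accel accelerators."""
    mtok_per_hour = lam * out_len * 3600 / 1e6
    return n_accel * usd_per_hour / mtok_per_hour


print(f"B300 128K/100: ${usd_per_mtok(256, 5):.0f}/MTok")
print(f"B300 1M/100:   ${usd_per_mtok(576, 5):.0f}/MTok")
```

Since output volume is fixed by λ and O, cost per token is directly proportional to accelerator count times hourly price.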
11.3 B300 All-Scenario Costs
| Context | 30 tok/s | 100 tok/s | 400 tok/s |
|---|---|---|---|
| 4K–256K | $89/MTok | $119/MTok | $207/MTok |
| 1M | $237/MTok | $267/MTok | $356/MTok |
11.4 950DT All-Scenario Costs
| Context | 30 tok/s | 100 tok/s | 400 tok/s |
|---|---|---|---|
| 4K–128K | $59/MTok | $95/MTok | $95/MTok |
| 256K | $107/MTok | $142/MTok | $142/MTok |
| 1M | $344/MTok | $379/MTok | $379/MTok |
11.5 Cost Interpretation
Several numbers worth noting:
- 4K–128K: 950DT has lower nominal cost ($59 vs $89), but this advantage rests entirely on the NPU-hour = $2 assumption. If actual NPU-hour pricing is higher, the advantage disappears.
- 1M: B300 is actually cheaper ($267 vs $379). The reason is that 950DT's 1M configuration balloons to 2,048 NPUs in total, 1,664 of them on the P tier. Lower per-node compute means more NPUs are needed for P, and more NPUs mean higher cost.
- For 950DT to match B300 at 1M and 100 tok/s, NPU-hour would need to drop to approximately $1.4 for the 2,048-NPU configuration costed above ($2,880 / 2,048), or about $1.2 for the 2,432-NPU production slice. That is 30–40% below the current assumption.
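Because both deployments serve the same output volume, cost parity reduces to equal hourly spend. A minimal sketch of the break-even price, using the article's planning figures (the function name is ours):

```python
def break_even_price(ref_accels, ref_price, other_accels):
    """Hourly price per accelerator at which `other` matches the reference spend."""
    return ref_accels * ref_price / other_accels


# B300 at 1M / 100 tok/s: 576 GPUs at $5/GPU-hour.
print(f"{break_even_price(576, 5, 2048):.2f}")  # vs the 2,048-NPU configuration
print(f"{break_even_price(576, 5, 2432):.2f}")  # vs the 2,432-NPU production slice
```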
12. C Tier: Cache and Bandwidth Between P and D
12.1 How Much P→D Bandwidth Is Needed
After prefill completes, the P tier must transfer KV cache to the D tier. Each session's KV size is K(L), and there are λ new sessions per second:
B_{P→D} = λ × K(L)
Plugging in baseline λ = 5.86:
| Context | KV/seq | P→D Floor | If Caching Full SWA (8×) |
|---|---|---|---|
| 4K | 0.022 GB | 0.13 GB/s | 1.0 GB/s |
| 32K | 0.174 GB | 1.02 GB/s | 8.1 GB/s |
| 128K | 0.695 GB | 4.07 GB/s | 32.6 GB/s |
| 256K | 1.389 GB | 8.14 GB/s | 65.1 GB/s |
| 1M | 5.300 GB | 31.06 GB/s | 248.5 GB/s |
31 GB/s is only the P→D KV transfer floor. Actual C-tier traffic also includes lookup, D→C write-back, eviction, replay, and other multiplexed flows. At 1M, actual C-tier bandwidth requirements may be in the 100–200 GB/s range.
This scale means:
- 4K–32K: standard Ethernet/RoCE suffices
- 128K–256K: RDMA or NVLink-domain needed
- 1M: RDMA + UB (UnifiedBus) + NVLink-domain KV transfer required
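The bandwidth floor is the arrival rate times per-session KV size. The sketch below reuses the 5.3 GB per million tokens and the 8× SWA multiplier from §4; the function name and `full_swa` flag are illustrative.

```python
def p_to_d_gb_per_s(ctx_tokens, lam=5.86, full_swa=False):
    """Minimum P->D KV transfer bandwidth in GB/s."""
    kv_gb = 5.3 * ctx_tokens / 1e6
    if full_swa:
        kv_gb *= 8  # approximate SWA-to-compressed ratio from the text
    return lam * kv_gb


for ctx in (4_096, 131_072, 1_000_000):
    print(f"{ctx:>9,} tokens: {p_to_d_gb_per_s(ctx):.2f} GB/s floor, "
          f"{p_to_d_gb_per_s(ctx, full_swa=True):.1f} GB/s with full SWA")
```

This is a floor only; lookup, write-back, eviction, and replay traffic come on top of it.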
12.2 What the C Tier Must Do at 1M
At 1M context, cache strategy isn't an optimization—it's a survival requirement:
- Prefix hashing: store only one copy of compressed KV for identical prefixes
- Compressed CSA/HCA KV priority caching: don't default to full SWA caching
- SWA periodic checkpointing: periodic snapshots of SWA, not real-time full persistence
- Cache-aware routing: route requests preferentially to D-units with cache hits
- SSD endurance management: on-disk KV cache generates massive SSD write volume; monitor write lifespan
- Cache hit rate as a scheduling objective: don't distribute requests evenly; prioritize leveraging existing cache
13. P/D/C Scheduling: Don't Use Static Binding
13.1 Three Independent Resource Pools
In production, don't permanently bind P1 to D1. Use three independent resource pools:
P Pool | D Pool | C Cache Pool
Request flow:
- Router performs prefix hashing
- Checks C tier for compressed KV hit
- Hit → reduced P load, only tail recompute needed
- Miss → allocate a P-unit for full prefill
- P completes → allocate D-unit based on D-tier load and cache locality
- During D decode, continuously append KV; partial write-back to C tier is possible
13.2 What P-Tier Scheduling Should Consider
P-tier scheduling shouldn't look only at request count. A 1M prompt carries 250× the tokens of a 4K prompt, and roughly 500× the prefill compute once the higher per-token attention cost is included. Long-context prefill should be scheduled by token budget: "this P-unit has 500K tokens of prefill budget remaining," not "this P-unit can accept 5 more requests."
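Token-budget admission can be sketched as follows. Everything here is hypothetical: `PUnit`, `pick_p_unit`, and the "most remaining budget" policy are illustrative choices of ours, not the behavior of any particular scheduler.

```python
from dataclasses import dataclass


@dataclass
class PUnit:
    name: str
    budget_tokens: int  # remaining prefill token budget in the current window


def pick_p_unit(units, prompt_tokens, hit_rate=0.0):
    """Admit a request to the unit with the most remaining budget, or return None."""
    cost = round(prompt_tokens * (1.0 - hit_rate))  # only uncached tokens cost prefill
    candidates = [u for u in units if u.budget_tokens >= cost]
    if not candidates:
        return None  # caller queues or sheds the request
    best = max(candidates, key=lambda u: u.budget_tokens)
    best.budget_tokens -= cost
    return best
```

For example, with units holding 500K and 50K tokens of budget, a 1M prompt at h = 0.56 costs 440K tokens and lands on the first unit, leaving it 60K; a second uncached 1M prompt is then rejected even though both units could still accept short prompts.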
13.3 What D-Tier Scheduling Should Consider
D-tier scheduling must simultaneously consider:
- Active streams (current number of active sessions)
- KV length (how long each session's KV is)
- Expert load (whether expert load is balanced)
- Per-user speed tier (30/100/400 tok/s cannot be mixed in the same D-unit)
- TPOT p90 (real-time monitoring; throttle if SLA is exceeded)
14. High-Utilization Scenario: How Many P to Saturate D
The configurations in §6–10 are SLA-safe—guaranteeing TPOT and TTFT, but with very low D-tier utilization (D-unit goodput far exceeds actual traffic). If the goal is to saturate D-units (maximizing throughput), P:D would be completely different.
P / D ≈ (L/O) × (1−h) × G_D / G_P
Intuitively: the longer the context (large L), the shorter the output (small O), and the lower the cache hit rate (small h), the larger P needs to be relative to D.
B300 (P-unit = 64 GPUs)
| Context | Saturate D30-unit (128 GPUs) | Saturate D100-unit (192 GPUs) | Saturate D400-unit (384 GPUs) |
|---|---|---|---|
| 4K | 2 P-units | 3 P-units | 6 P-units |
| 32K | 15 | 22 | 44 |
| 128K | 54 | 81 | 162 |
| 256K | 100 | 149 | 298 |
| 1M | 305 | 457 | 913 |
Conclusion: At 1M context, short output (O=512), and semi-cold prefixes (h=0.56), you should not pursue D-tier saturation; otherwise the P tier would balloon to unacceptable levels. The levers that shrink the ratio follow from the formula: raise h through caching, raise O by steering long-output tasks to these D-units, or raise P-unit goodput. Raising D goodput (for example with larger EP) moves the ratio the other way.
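The saturation table can be reproduced from the ratio formula with the B300 parameters used earlier. This is a sketch with names of our own; the 4K through 256K rows match when binary token counts are used (e.g. 128K = 131,072), and the 1M row when L = 1,000,000.

```python
import math


def p_units_to_saturate(ctx_len, d_nodes, out_len=512, hit=0.56):
    """64-GPU B300 P-units needed to keep one D-unit of d_nodes nodes saturated."""
    l_m = ctx_len / 1e6
    g_p = 0.7 * 36_000 * 0.45 / (0.098 + 0.101 * l_m) * 8        # P-unit goodput
    g_d = 0.6 * 36_000 * 0.14 / (0.098 + 0.202 * l_m) * d_nodes  # D-unit goodput
    return math.ceil(ctx_len / out_len * (1 - hit) * g_d / g_p)


for ctx in (4_096, 32_768, 131_072, 262_144, 1_000_000):
    row = [p_units_to_saturate(ctx, n) for n in (16, 24, 48)]  # D30 / D100 / D400
    print(f"{ctx:>9,} tokens: {row}")
```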
15. Final Recommendations
B300
| Scenario | Configuration | Total GPUs |
|---|---|---|
| 4K–256K / 30 tok/s | P1:D1 | 192 |
| 4K–256K / 100 tok/s | P1:D1 | 256 |
| 4K–256K / 400 tok/s | P1:D1 | 448 |
| 1M / 30 tok/s | P6:D1 | 512 |
| 1M / 100 tok/s | P6:D1 | 576 |
| 1M / 400 tok/s | P6:D1 | 768 |
P tier: TP1 / EP64 / PP1, add CP8 for 1M
D tier: EP128/192/384 by speed tier (30/100/400 tok/s); 400 tok/s requires MTP
Huawei 950 Supernode
Preferred architecture: 950PR for P, 950DT for D, SuperPoD for C/cache
| Scenario | Configuration (Pure 950DT Conservative Estimate) | Total NPUs |
|---|---|---|
| 4K–128K / 30 tok/s | P1:D1 | 320 |
| 4K–128K / 100 tok/s | P1:D1 | 512 |
| 256K | P4:D1 (production recommendation) | 896 |
| 1M | P16:D1 (production recommendation) | 2432 |
16. Uncertainties and Validation Checkpoints
Finally, we must be clear about what is verified and what is speculative:
| Category | Content |
|---|---|
| Verified | V4-Pro model structural parameters (1.6T/49B/384 experts/top-6), B300 hardware specifications, FLOPs calculation formulas |
| Calibrated but not V4-Pro measured | Hardware utilization η_P=0.45 / η_D=0.14, calibrated from DeepSeek V3/R1 H800 production systems |
| Speculative | B300's actual decode utilization on V4-Pro, 950DT's performance on V4-Pro, 950PR prefill goodput |
| Planning estimates | B300 $5/GPU-hour, 950DT $2/NPU-hour |
| Staleness risk | 950DT planned for 2026 Q4 availability; specs and availability may change; B300 pricing not yet published |
Next validation checkpoints:
- V4-Pro measured decode goodput and TPOT p90 on B300
- 950DT measured prefill/decode efficiency on V4-Pro
- 950PR prefill goodput
- Actual commercial pricing for B300 and 950DT
- V4-Pro's SWA KV strategy and actual cache overhead
Until measured data is available, all configurations in this article should be treated as capacity planning starting points and frameworks, not final answers. The framework itself holds: define workload first, then calculate goodput, then derive P:D. Once measured data arrives, only the efficiency parameters need to be replaced; the methodology remains unchanged.
