PDC Disaggregated Serving for DeepSeek V4-Pro: From Compute Principles to Deployment Configs

Targeting NVIDIA B300 and Huawei Ascend 950 Supernode, this article answers three questions: how many P units to how many D units? How many GPUs/NPUs per P/D unit? How should internal parallelism be configured?

This article not only presents conclusions but fully shows how every key number is derived—readers can follow the derivation chain and verify for themselves.

Conclusions First, Derivations Follow

The core logic of PD disaggregated deployment has only three layers:

Layer	Question	One-line Answer
1	How many P to how many D	It's not about picking a fixed ratio upfront. Instead, define minimum deployment units (P-unit / D-unit), then derive how many you need from business traffic (request arrival rate, context length, output length, cache hit rate)
2	How large is each unit	Determined by per-session speed SLA; B300 P-unit = 64 GPUs, D-unit = 128/192/384 GPUs
3	How to configure TP/EP/PP/DP	EP is the backbone, TP should be as small as possible, PP defaults to 1, DP equals replica count

NVIDIA B300 baseline configuration (λ=5.86 req/s, O=512, h=0.56):

Context	30 tok/s	100 tok/s	400 tok/s
4K–256K	P1:D1 = 192 GPUs	P1:D1 = 256 GPUs	P1:D1 = 448 GPUs
1M	P6:D1 = 512 GPUs	P6:D1 = 576 GPUs	P6:D1 = 768 GPUs

Huawei 950 Supernode (conservative estimate using pure 950DT; recommended to use 950PR for P):

Context	30 tok/s	100/400 tok/s
4K–128K	P1:D1 = 320 NPUs	P1:D1 = 512 NPUs
256K	P3:D1 = 576 NPUs	P3:D1 = 768 NPUs
1M	P13:D1 ≈ 2048 NPUs (capacity floor)	P16:D1 ≈ 2432 NPUs (production recommendation)

Now, let's derive everything from scratch.

1. V4-Pro Structural Parameters: The Foundation of All Calculations

Before computing anything, we need to pin down V4-Pro's key parameters. These numbers are the foundation for all subsequent derivations.

V4-Pro is a 1.6T total parameter, 49B active parameter MoE (Mixture of Experts) model. The core idea of MoE: the model is large, but each inference only uses a small portion. For V4-Pro:

1.6T total parameters, but only 49B activated per token—activation rate ≈ 3%, which is the source of MoE's efficiency
Each layer has 1 shared expert + 384 routed experts
Each token activates 6 routed experts (top-6 routing)
Expert intermediate dimension = 3072
MTP depth = 1 (Multi-Token Prediction; while predicting the next token, it also attempts to predict one additional token, used to accelerate decode)
Supports 1M context

Why are these numbers important? Because they directly determine:

Per-token compute (FLOPs) → how much compute power is needed
Per-token KV cache size → how much memory is needed
Expert sharding strategy → how EP (Expert Parallelism) should be configured

2. Define Workload First, Then Configure

2.1 A Common Misconception

Many people, given a model, instinctively ask "how many GPUs does this model need?" But the right question isn't "how many GPUs does the model need"—it's "how many GPUs does my business need."

The same V4-Pro model:

If your business is 4K short conversations at 30 tok/s standard speed → 192 B300 GPUs might suffice
If your business is 1M long-document analysis at 100 tok/s → at least 576 B300 GPUs
That's a 3× difference, entirely driven by workload

So we must define the workload first.

2.2 Four Key Business Parameters

Only four business parameters determine the deployment configuration:

λ (request arrival rate): how many new requests per second
O (average output length): how many tokens each session outputs on average
L (average input context length): how long each session's prompt is
h (prefix cache hit rate): what fraction of prefixes in the P tier can skip computation

Given these four parameters, P and D loads are fully determined:

D-tier total output throughput: Q_D = λ × O (tokens to output per second)

P-tier total input throughput: Q_P = λ × L × (1 − h) (new input tokens to process per second, minus cache hits)

Note the (1−h) factor on the P side—if cache hits, the P tier doesn't need to recompute that portion of the prompt.
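As a quick check, the two load formulas can be written as a small helper. This is a minimal sketch: the function and argument names are illustrative and not taken from any serving framework, and the example counts 128K as 128,000 tokens.

```python
def tier_loads(lam, ctx_len, out_len, hit_rate):
    """Return (Q_P, Q_D) in tokens/s for the P and D tiers."""
    q_p = lam * ctx_len * (1.0 - hit_rate)  # uncached input tokens per second
    q_d = lam * out_len                     # output tokens per second
    return q_p, q_d


# Baseline traffic at 128K context
q_p, q_d = tier_loads(lam=5.86, ctx_len=128_000, out_len=512, hit_rate=0.56)
print(f"Q_P = {q_p:,.0f} tok/s, Q_D = {q_d:,.0f} tok/s")
```

Note that Q_D does not depend on context length at all, while Q_P scales linearly with it.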

2.3 What Baseline We Use

Throughout this article, we use this baseline:

O = 512, h = 0.56, λ = 5.86 req/s

λ = 5.86 is not a made-up number. It corresponds to a concrete scenario:

Imagine 100 concurrent users, each generating at 30 tok/s, with an average output of 512 tokens per session. A session from start to finish takes 512 / 30 ≈ 17 seconds. During those 17 seconds, 100 users take turns generating new requests: λ = 100 × 30 / 512 = 5.86 req/s

At the same λ, the number of active sessions varies dramatically with different per-session speeds:

Per-session Speed	TPOT (per-token latency)	Active Decode Sessions	How Derived
30 tok/s	33.3 ms	100	5.86 × 512 / 30 = 100
100 tok/s	10 ms	30	5.86 × 512 / 100 = 30
400 tok/s	2.5 ms	7.5	5.86 × 512 / 400 = 7.5

This table reveals a counterintuitive fact: with the same business traffic (same req/s), faster per-session speed means fewer concurrent active decode sessions. At 400 tok/s, there are only 7.5 active sessions.

Fewer active sessions might sound good, but for MoE models it's a major problem—we'll expand on this later.
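The active-session column is an application of Little's law (sessions in flight = arrival rate × time each session spends decoding). A minimal sketch, with a hypothetical function name:

```python
def active_decode_sessions(lam, out_len, speed_tok_s):
    """Little's law: arrival rate times the seconds each session spends decoding."""
    seconds_per_session = out_len / speed_tok_s
    return lam * seconds_per_session


for speed in (30, 100, 400):
    tpot_ms = 1000.0 / speed  # per-token latency implied by the speed tier
    sessions = active_decode_sessions(lam=5.86, out_len=512, speed_tok_s=speed)
    print(f"{speed:>3} tok/s  TPOT {tpot_ms:5.1f} ms  active sessions {sessions:6.1f}")
```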

3. How Much Compute Per Token (FLOPs)

Now let's calculate how much compute V4-Pro needs per token.

3.1 Base FLOPs: Why 0.098 TFLOP/token

V4-Pro activates 49B parameters per token. For transformer linear layers, one forward pass requires approximately 2 × parameter count FLOPs (one matrix multiply's FLOPs ≈ 2mn, where n is input dimension, m is output dimension, and the weight matrix has mn parameters).

So:

Base FLOPs = 2 × 49B = 98B FLOP = 0.098 TFLOP / token

This is the "base compute tax" that must be paid regardless of prefill or decode.

3.2 Attention Overhead Grows with Context Length

But the above only accounts for the feed-forward / MoE compute. The attention component's compute depends on context length L:

Prefill: the token at position i attends only to the i tokens before it, so averaged over a prompt of length L each input token attends to about L/2 positions
Decode: each new token attends to the KV at all L existing positions, so attention compute per output token scales with L

Adding the base term, per-token FLOPs can be approximated as:

F_decode(L) = 0.098 + 0.202 × L_M TFLOP/output token

F_prefill(L) = 0.098 + 0.101 × L_M TFLOP/input token

Where L_M = L / 10⁶ (in millions of tokens).

Decode's attention term (0.202) is 2× that of prefill, because prefill accumulates from 0 to L and only processes half the length on average, while decode scans all L positions at every step.

Let's plug in some specific numbers to verify our intuition:

Context L	L_M	Decode TFLOP/token	Prefill TFLOP/token	Intuition Check
4K	0.004	0.099	0.098	Almost pure base compute; attention negligible
128K	0.128	0.124	0.111	Attention starting to matter
1M	1.0	0.300	0.199	Decode attention term (0.202) is ~2× the base; total decode compute is ~3× the base

This means: at 1M context, decode compute per token is 3× that of 4K context; prefill compute per token is 2× that of 4K. Context length's impact on compute cannot be ignored.
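The two per-token formulas are easy to evaluate directly. The sketch below simply encodes the coefficients given above; the function names are illustrative.

```python
BASE_TFLOP = 0.098                # 2 x 49B active parameters
DECODE_ATTN_TFLOP_PER_M = 0.202   # attention term per million tokens of context
PREFILL_ATTN_TFLOP_PER_M = 0.101  # half of decode: prefill averages ~L/2 positions


def decode_tflop_per_token(ctx_len):
    return BASE_TFLOP + DECODE_ATTN_TFLOP_PER_M * ctx_len / 1e6


def prefill_tflop_per_token(ctx_len):
    return BASE_TFLOP + PREFILL_ATTN_TFLOP_PER_M * ctx_len / 1e6


for label, ctx in (("4K", 4_000), ("128K", 128_000), ("1M", 1_000_000)):
    d = decode_tflop_per_token(ctx)
    p = prefill_tflop_per_token(ctx)
    print(f"{label:>4}: decode {d:.3f}  prefill {p:.3f} TFLOP/token")
```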

4. KV Cache: How Much Memory Per Session

V4-Pro's KV cache is not a single uniform structure as in a standard dense-attention model. It uses a heterogeneous KV structure: CSA/HCA compressed KV, SWA (Sliding Window Attention) KV, and state cache are managed separately. We focus on the compressed CSA/HCA portion:

K(L) ≈ 5.3 × L_M GB/sequence

Specific numbers:

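Context	KV/sequence	How Derived
4K	0.022 GB	5.3 × 0.004096
32K	0.174 GB	5.3 × 0.032768
128K	0.695 GB	5.3 × 0.131072
256K	1.389 GB	5.3 × 0.262144
1M	5.300 GB	5.3 × 1.0

These values take 4K–256K as binary token counts (4K = 4,096 tokens) and 1M as 1,000,000 tokens.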

An important detail: SWA KV is approximately 8× the compressed CSA/HCA KV. That is, if the system fully caches SWA as well, at 1M context each session's KV is 5.3 × 8 = 42.4 GB—a single B300 GPU can't hold many sessions. So the C-tier (Cache tier) cannot default to full SWA caching; selective caching is mandatory.
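To see what the 8× factor does to session density, the sketch below compares how many 1M-context sessions fit in a fixed KV budget with and without full SWA caching. The 100 GB budget is a hypothetical figure chosen for illustration, not a B300 specification, and the helper names are made up for this example.

```python
KV_GB_PER_M_TOKENS = 5.3  # compressed CSA/HCA KV per million tokens
SWA_MULTIPLIER = 8        # the article's approximation for full SWA caching


def compressed_kv_gb(ctx_len):
    return KV_GB_PER_M_TOKENS * ctx_len / 1e6


def full_swa_kv_gb(ctx_len):
    return SWA_MULTIPLIER * compressed_kv_gb(ctx_len)


def sessions_that_fit(kv_budget_gb, kv_gb_per_session):
    return int(kv_budget_gb // kv_gb_per_session)


budget_gb = 100.0  # hypothetical HBM left over for KV on one GPU
ctx = 1_000_000
print("compressed only:", sessions_that_fit(budget_gb, compressed_kv_gb(ctx)), "sessions")
print("full SWA cached:", sessions_that_fit(budget_gb, full_swa_kv_gb(ctx)), "sessions")
```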

5. P-unit and D-unit: Define Units First, Then Derive Ratios

5.1 Why Not Just Say "P:D = 1:1"

Many articles directly state a P:D ratio, like "P:D = 1:2." But this ratio is meaningless unless you know "how large is P, how large is D."

An analogy: saying "there are as many kitchens as people" is meaningless unless you know how much food each kitchen can produce and how much each person eats. A kitchen that can cook 20 dishes simultaneously paired with 20 light eaters, versus a kitchen that can only make 5 dishes paired with 5 heavy eaters—both have a 1:1 ratio, but the meaning is entirely different.

So the correct approach is:

First define P-unit (a minimum independently deployable prefill unit) and D-unit (a minimum independently deployable decode unit)
Calculate each unit's goodput (effective throughput)
Derive how many P-units and D-units are needed based on business traffic
The P:D ratio is derived, not assumed

5.2 How to Calculate P-unit / D-unit Goodput

Goodput ≠ raw peak FLOPS. Effective throughput must account for:

Hardware peak compute C_node
Model's hardware utilization η under specific workload
SLA constraints (TTFT for P tier, TPOT for D tier)
Communication overhead (EP all-to-all communication)

We use this formula:

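G_P = reserve_P × C_node × η_P × 1000 / F_prefill(L) tok/s per node

G_D = reserve_D × C_node × η_D × 1000 / F_decode(L) tok/s per node

Where C_node is in PFLOPS and F is in TFLOP/token (hence the factor of 1000), η is the model's FLOPS utilization on that hardware, and reserve is the fraction of capacity that remains usable after SLA headroom is set aside: reserve_P = 0.7 (30% held back for bursts) and reserve_D = 0.6 (40% held back for TPOT p90). A unit's goodput is the per-node value multiplied by the number of nodes in the unit.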

Parameters for both platforms:

Hardware	Per-node Compute C_node	P-tier Utilization η_P	D-tier Utilization η_D
B300 8-GPU	36 PFLOPS dense FP8	0.45	0.14
950DT 8-NPU	8 PFLOPS FP8	0.45	0.12

Why is D-tier utilization so much lower than P-tier? P-tier prefill is compute-bound (processing the entire prompt in large batches), yielding high hardware utilization; D-tier decode is memory-bound (generating only 1 token per step, with most time spent reading weights and KV cache), so utilization is inherently low. D-tier's 0.14/0.12 are typical values for MoE decode, not anomalies.

These efficiency parameters are calibrated from DeepSeek V3/R1 production data on H800 (EP32 prefill + EP144 decode), not measured on V4-Pro; we'll repeat this caveat throughout.
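The goodput formula and the hardware table can be combined into a few lines. This is a sketch under the article's assumptions: it treats the reserve term as the usable fraction of capacity (0.7 for P, 0.6 for D), which is how the worked examples apply it, and all names are illustrative.

```python
HARDWARE = {
    "B300 8-GPU":  {"c_node_pflops": 36.0, "eta_p": 0.45, "eta_d": 0.14},
    "950DT 8-NPU": {"c_node_pflops": 8.0,  "eta_p": 0.45, "eta_d": 0.12},
}
USABLE_P, USABLE_D = 0.7, 0.6  # capacity left after SLA headroom


def prefill_tflop(ctx_len):
    return 0.098 + 0.101 * ctx_len / 1e6


def decode_tflop(ctx_len):
    return 0.098 + 0.202 * ctx_len / 1e6


def node_goodput(c_node_pflops, eta, usable, tflop_per_token):
    """Tokens/s per node: PFLOPS x 1000 gives TFLOPS, divided by TFLOP per token."""
    return usable * c_node_pflops * eta * 1000.0 / tflop_per_token


ctx = 128_000
for name, hw in HARDWARE.items():
    g_p = node_goodput(hw["c_node_pflops"], hw["eta_p"], USABLE_P, prefill_tflop(ctx))
    g_d = node_goodput(hw["c_node_pflops"], hw["eta_d"], USABLE_D, decode_tflop(ctx))
    print(f"{name}: prefill {g_p:,.0f} tok/s/node, decode {g_d:,.0f} tok/s/node")
```

Because the efficiency values are calibrated rather than measured, the outputs are planning figures only.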

5.3 From Goodput to P-unit / D-unit Size

P-unit and D-unit sizes are not determined by average traffic, but by per-session speed SLA and TPOT (Time Per Output Token) constraints.

D-unit logic is the most intuitive:

Your SLA is 100 tok/s, equivalent to TPOT ≤ 10ms. In MoE models, each decode step requires an expert all-to-all communication. The larger the EP (fewer experts per GPU), the faster each step, the lower the TPOT.

So D-unit size is primarily determined by "what tok/s do I need to achieve":

Target Speed	TPOT Requirement	B300 D-unit	950DT D-unit	Why This Size
30 tok/s	33.3 ms	128 GPUs (EP128)	192 NPUs (EP192)	Standard tier; TPOT headroom is generous
100 tok/s	10 ms	192 GPUs (EP192)	384 NPUs (EP384)	Larger EP, lower TPOT
400 tok/s	2.5 ms	384 GPUs (EP384)	384 NPUs+ (EP384+)	Must have ≤ 1 expert per GPU

What's the relationship between EP and GPU count? V4-Pro has 384 routed experts. EP128 means 384 experts distributed across 128 GPUs, each GPU holding 384/128 = 3 experts. Similarly:

EP	experts/GPU	Weights per GPU	Fits?
EP64	6	Complete weights for 6 experts	Requires large HBM
EP128	3	Weights for 3 experts	Reasonable
EP192	2	Weights for 2 experts	Comfortable
EP384	1	Weights for 1 expert	Very comfortable, but high communication overhead

P-unit size logic is similar, but since prefill is compute-bound and doesn't need ultra-low TPOT, P-units can be relatively smaller. B300 uses EP64 (64 GPUs, 6 experts per GPU); 950DT uses EP128 (128 NPUs, 3 experts per NPU).
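The EP-to-experts mapping is plain integer division over the 384 routed experts. A minimal sketch with a hypothetical helper name; it rejects EP sizes that do not divide the expert count, since those would need replication, which is not modeled here.

```python
ROUTED_EXPERTS = 384  # V4-Pro routed experts per layer


def experts_per_gpu(ep_size):
    """Routed experts held by each GPU when experts are spread evenly over ep_size GPUs."""
    if ROUTED_EXPERTS % ep_size != 0:
        raise ValueError(f"EP{ep_size} does not divide {ROUTED_EXPERTS} experts evenly")
    return ROUTED_EXPERTS // ep_size


for ep in (64, 128, 192, 384):
    print(f"EP{ep}: {experts_per_gpu(ep)} experts/GPU")
```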

6. Full Derivation: From Workload to P:D Ratio

6.1 General Formula

With P-unit and D-unit goodput calculated, N_P and N_D follow naturally:

N_P = ⌈Q_P / G_{P,unit}(L)⌉ = ⌈λ × L × (1−h) / G_{P,unit}(L)⌉

N_D = max(⌈Q_D / G_{D,unit}(L)⌉, N_D^SLA)

Where N_D^SLA is the minimum number of D-units needed to satisfy the TPOT SLA.

Note N_D takes the max: even if traffic is low, the D tier needs at least 1 D-unit to guarantee TPOT.
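The two ceiling formulas translate directly into code. The sketch below takes unit goodputs as inputs; the function name is illustrative, and the example figures are the B300 values at 100 tok/s used in this section.

```python
import math


def units_needed(q_p, q_d, g_p_unit, g_d_unit, n_d_sla=1):
    """Return (N_P, N_D) from tier loads and per-unit goodputs, all in tokens/s."""
    n_p = math.ceil(q_p / g_p_unit)
    n_d = max(math.ceil(q_d / g_d_unit), n_d_sla)  # the SLA floor keeps at least one D-unit
    return n_p, n_d


# B300, 128K context: one 64-GPU P-unit and one 192-GPU D-unit are enough
print(units_needed(q_p=330_000, q_d=3_000, g_p_unit=817_000, g_d_unit=585_000))

# B300, 1M context: the P tier grows, the D tier stays at its SLA floor.
# g_d_unit here is 0.6 x (36 x 0.14 x 1000 / 0.300) x 24 nodes.
print(units_needed(q_p=2_578_000, q_d=3_000, g_p_unit=455_800, g_d_unit=241_920))
```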

6.2 Worked Example: B300, 128K, 100 tok/s

Let's walk through a specific scenario step by step: B300 platform, 128K context, 100 tok/s, baseline business traffic.

Step 1: How many tokens/s does the P tier need to process?

Q_P = λ × L × (1−h) = 5.86 × 128000 × (1−0.56) = 5.86 × 128000 × 0.44

First: 5.86 × 0.44 = 2.578

Then: 2.578 × 128000 = 330,000 token/s ≈ 330K token/s

Step 2: How much can one P-unit (64 GPUs) process?

F_prefill(128K) = 0.098 + 0.101 × 0.128 = 0.098 + 0.013 = 0.111 TFLOP/token

Single B300 node (8 GPUs) raw throughput:

S_P = 36 PFLOPS × 0.45 × 1000 / 0.111 = 145,946 token/s ≈ 146K token/s

8 nodes = 64-GPU P-unit:

G_{P,unit} = 0.7 × 146K × 8 = 817K token/s

Step 3: How many P-units needed?

N_P = ⌈330K / 817K⌉ = ⌈0.40⌉ = 1

Step 4: How many tokens/s does the D tier need to process?

Q_D = λ × O = 5.86 × 512 = 3,000 token/s (~3000 tok/s)

Step 5: How many D-units at minimum?

100 tok/s corresponds to D100-unit = 192 GPUs (EP192), per D-unit goodput:

F_decode(128K) = 0.098 + 0.202 × 0.128 = 0.098 + 0.026 = 0.124 TFLOP/token

Per-node raw decode throughput:

S_D = 36 PFLOPS × 0.14 × 1000 / 0.124 = 40,645 token/s ≈ 40.6K token/s

24 nodes = 192-GPU D-unit:

G_{D,unit} = 0.6 × 40.6K × 24 = 585K token/s

Step 6:

N_D = max(⌈3000 / 585K⌉, 1) = max(⌈0.005⌉, 1) = 1

Result: P1:D1 = 64 + 192 = 256 GPUs ✓

You may notice that D-unit goodput (585K tok/s) far exceeds business demand (3K tok/s). P:D comes out at 1:1 here not because P and D throughput happen to be equal, but because the D-unit's minimum deployable size already far exceeds what the traffic requires. You can't build a D-unit satisfying a 100 tok/s SLA with fewer than 192 GPUs.

6.3 Why 1M Needs P6:D1—Full Derivation

The 1M scenario is the P tier's nightmare. Let's calculate exactly why.

Step 1: P-tier token/s

Q_P = 5.86 × 1,000,000 × 0.44 = 2,578,000 token/s ≈ 2.58M token/s

Compared to 128K's 330K token/s—the 1M P-tier load is 7.8× that of 128K.

Step 2: What's the P-unit goodput at 1M?

F_prefill(1M) = 0.098 + 0.101 × 1.0 = 0.199 TFLOP/token

Per-node raw prefill throughput:

S_P = 36 × 0.45 × 1000 / 0.199 = 81,407 token/s ≈ 81.4K token/s

P-unit (8 nodes = 64 GPUs) goodput:

G_{P,unit} = 0.7 × 81.4K × 8 = 455.8K token/s

Step 3:

N_P = ⌈2578K / 455.8K⌉ = ⌈5.66⌉ = 6

6 P-units = 384 GPUs.

So P6:D1 = 384P + 192D = 576 GPUs (100 tok/s scenario).

These 6 P-units can run independently or be merged into one 384-GPU P-supergroup using CP8 (Context Parallelism, 8-way parallel) to split a 1M prefill across 8 GPU groups for parallel computation, reducing TTFT.

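Why not just use a larger P-unit? Because excessive EP increases all-to-all communication overhead. When running independently, each P-unit maintains EP64 (6 experts per GPU) with controlled communication overhead. When merged into a CP supergroup, expert weights stay sharded within each group and only KV (not expert weights) is transmitted across groups. One caveat on the arithmetic: 384 GPUs split eight ways gives 48 GPUs per group (EP48, 8 experts per GPU), so keeping EP64 inside every group means either CP6 on 384 GPUs or CP8 on 512 GPUs.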

6.4 Huawei 950 Supernode: Why 1M Reaches P13:D1 or Even P16:D1

Same derivation logic, but 950DT's per-node compute is lower (8 PFLOPS vs B300's 36 PFLOPS).

P-unit goodput (128 NPUs):

F_prefill(1M) = 0.199 TFLOP/token

Per-node raw: 8 × 0.45 × 1000 / 0.199 = 18,090 token/s ≈ 18.1K token/s

P-unit (16 nodes = 128 NPUs) goodput: 0.7 × 18,090 × 16 = 202.6K token/s

N_P = ⌈2578K / 202.6K⌉ = ⌈12.7⌉ = 13

P13:D1 = 13 × 128P + 192D = 1664 + 192 = 1856 NPUs (rounded to 2048 NPUs with headroom).

But P13 doesn't lend itself to clean CP groupings—13 can't be evenly divided into reasonable CP × EP combinations. Production recommendation: P16:D1 = 2048P + 384D = 2432 NPUs, using EP128 × CP16 or EP256 × CP8.

This is why we repeatedly say: Huawei's 1M scenario should not use pure 950DT; use 950PR for P. The 950PR is purpose-built for prefill—if its prefill goodput is 2× that of 950DT (a reasonable expectation), P-tier NPU count could be cut in half.
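The effect of a faster prefill part can be checked with the same arithmetic. In the sketch below, the 2× multiplier is the article's hypothetical for 950PR, not a measured figure, and the function name is illustrative.

```python
import math

Q_P_1M = 5.86 * 1_000_000 * (1 - 0.56)  # uncached prefill tokens/s at 1M context
F_PREFILL_1M = 0.199                    # TFLOP per input token at 1M


def p_units_needed(c_node_pflops, eta_p, nodes_per_unit,
                   goodput_multiplier=1.0, usable=0.7):
    per_node = c_node_pflops * eta_p * 1000.0 / F_PREFILL_1M
    unit_goodput = usable * per_node * nodes_per_unit * goodput_multiplier
    return math.ceil(Q_P_1M / unit_goodput)


n_dt = p_units_needed(8.0, 0.45, nodes_per_unit=16)
n_pr = p_units_needed(8.0, 0.45, nodes_per_unit=16, goodput_multiplier=2.0)  # hypothetical 950PR
print(f"pure 950DT: {n_dt} P-units = {n_dt * 128} NPUs")
print(f"2x prefill goodput: {n_pr} P-units = {n_pr * 128} NPUs")
```

Because of the ceiling, doubling goodput takes the count from 13 to 7 units rather than to exactly half.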

7. 400 tok/s: Why This Speed Is Especially Difficult

Now let's return to the earlier table where 400 tok/s had only 7.5 active sessions.

7.1 The Expert Microbatch Disaster

V4-Pro activates 6 routed experts per token. With 384 routed experts, each expert has a 6/384 = 1/64 probability of being activated.

Assuming B concurrent tokens in a forward pass, a single expert's microbatch size is:

B_expert = B × 6 / 384 = B / 64

Plugging in active sessions at different speeds:

Speed	Active Sessions	Expert Microbatch	Problem
30 tok/s	100	100/64 ≈ 1.56	Barely enough for GEMM
100 tok/s	30	30/64 ≈ 0.47	Less than 1; poor utilization
400 tok/s	7.5	7.5/64 ≈ 0.12	Far less than 1

Expert microbatch = 0.12 means: in one forward pass, each expert receives only 0.12 tokens on average—not even 1. The GPU's GEMM units are completely starved.

This is the fundamental contradiction of MoE decode: you want faster per-session speed, which requires fewer concurrent streams to satisfy TPOT; but fewer concurrent streams mean smaller MoE expert batches and worse GEMM utilization.
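The microbatch figures follow from uniform top-6 routing over 384 experts. A minimal sketch, assuming perfectly balanced routing; `tokens_per_step` is an illustrative knob for multi-token or speculative decoding and defaults to 1.

```python
ROUTED_EXPERTS = 384
TOP_K = 6


def expert_microbatch(active_sessions, tokens_per_step=1):
    """Average tokens each expert receives per forward pass under uniform routing."""
    return active_sessions * tokens_per_step * TOP_K / ROUTED_EXPERTS


for speed, sessions in ((30, 100), (100, 30), (400, 7.5)):
    print(f"{speed:>3} tok/s: {expert_microbatch(sessions):.2f} tokens/expert/step")
```

Real routing is not uniform, so individual experts will see larger or smaller batches than this average.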

7.2 Mitigation Strategies

400 tok/s isn't impossible, but requires a combination of techniques:

EP384 or EP512: Each GPU handles only 1 expert (or fewer), which minimizes per-GPU expert weight reads and per-step compute, at the cost of more all-to-all traffic (§8.1)
MTP / Speculative Decoding: Instead of predicting only 1 token per step, use a draft model to guess 2–4 tokens, multiplying the expert microbatch by 2–4×
Hot expert replication: Create multiple replicas of high-frequency experts to reduce per-expert load imbalance
D-unit dedicated speed tiers: 400 tok/s users get dedicated D-units, not mixed with 30 tok/s users
Measured TPOT p90 on V4-Pro: the critical gate, since every item above remains an estimate until it is measured

CloudMatrix-Infer's measured data on DeepSeek-R1 already shows: the stricter the TPOT SLO, the smaller the batch size, the lower the decode throughput. This is a fundamental trade-off, not something that can be resolved by tuning parameters.

Recommendation: 30 tok/s as standard SLA, 100 tok/s as premium SLA, 400 tok/s only as an experimental SLA. Do not promise 400 tok/s to customers without measured V4-Pro data.

8. EP/TP/PP/DP Trade-offs in Practice

8.1 EP Is the Backbone

The core of the entire parallelism strategy is EP (Expert Parallelism). The reason is straightforward: V4-Pro has 384 routed experts, naturally suited for distribution across GPUs.

EP's trade-off is clear:

Larger EP → fewer experts per GPU → faster per-step compute → lower TPOT
Larger EP → more all-to-all communication → higher communication latency

In decode scenarios, compute is fast (memory-bound), so communication latency accounts for a large share. EP isn't "bigger is better"—you need to find the TPOT sweet spot:

Stage	Recommended EP	Why
P / prefill	EP64/128	Large prefill batch, compute-bound, low communication share
D / 30 tok/s	EP128/192	Standard tier; 33ms TPOT headroom is generous
D / 100 tok/s	EP192/384	10ms TPOT, needs larger EP
D / 400 tok/s	EP384/512	2.5ms TPOT, must have ≤ 1 expert/GPU

8.2 Keep TP Small

TP (Tensor Parallelism) splits one expert's weights across multiple GPUs. For MoE models, TP introduces two problems:

Increased communication: every forward step requires all-reduce
KV duplication: if attention layers also use TP, KV cache is redundantly stored across multiple GPUs

So start with TP1. Only consider TP2 + DCP (Decode Context Parallelism) when HBM is insufficient or when 1M scenario KV cache is too large, to reduce KV duplication.

8.3 PP Defaults to 1

PP (Pipeline Parallelism) places different model layers on different GPUs. For inference, especially decode, PP1 is the default—because PP introduces pipeline bubbles that directly increase per-token latency.

Consider PP only in one scenario: a single execution group's HBM can't hold the entire model. But V4-Pro's MoE portion already distributes expert weights via EP, and the dense/shared portion is relatively small, so PP isn't usually needed.

8.4 DP Is Simply P/D Replica Count

In PD disaggregation, DP's most practical meaning is:

DP_P = N_P (number of P-units)

DP_D = N_D (number of D-units)

When scaling traffic, prefer adding DP (adding units) over changing individual unit TP/EP/PP. Each unit's parallelism config is SLA-validated; changing it may break TPOT constraints.

8.5 CP/DCP: Critical for 1M Scenarios

CP (Context Parallelism, for prefill) and DCP (Decode Context Parallelism) aren't in the standard "TP/EP/PP/DP" quartet, but must be included in designs for 128K+ and especially 1M:

P-tier CP: splits a long prefill's attention computation across multiple GPU groups in parallel, reducing TTFT
D-tier DCP: shards KV cache across multiple GPUs, reducing per-GPU KV storage pressure

At 1M context, a single session's KV cache is ~5.3 GB. If a D-tier GPU serves 20 concurrent sessions, KV alone is 106 GB, about 40% of a B300 GPU's 262.5 GB HBM before weights and activations are counted. DCP can distribute KV across multiple GPUs, but the cost is cross-GPU KV reads during every attention step, increasing communication.
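The KV pressure and the effect of DCP can be estimated with one function. This sketch assumes KV is split evenly across the DCP group and that the same set of sessions is shared by the whole group; the names are illustrative.

```python
def kv_gb_per_gpu(sessions_in_group, ctx_len, dcp=1, kv_gb_per_m=5.3):
    """Compressed KV held by each GPU when a DCP group of `dcp` GPUs shares the sessions."""
    per_session_gb = kv_gb_per_m * ctx_len / 1e6
    return sessions_in_group * per_session_gb / dcp


for dcp in (1, 2, 4):
    gb = kv_gb_per_gpu(sessions_in_group=20, ctx_len=1_000_000, dcp=dcp)
    print(f"DCP{dcp}: {gb:.1f} GB of KV per GPU")
```

The saving only materializes if the group does not take on proportionally more sessions; the extra cross-GPU reads are not modeled.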

9. B300 All-Scenario Configuration Quick Reference

Synthesizing all the above analysis, B300's all-scenario configuration:

P-unit / D-unit Definitions

Unit	GPU Count	Internal Parallelism
P-unit	64	TP1 / EP64 / PP1 / CP1; 128K+ can merge multiple P-units for CP
D30-unit	128	TP1 / EP128 / PP1
D100-unit	192	TP1 / EP192 / PP1
D400-unit	384	TP1 / EP384 / PP1 + MTP/speculative

All-Scenario P:D Configuration

Baseline: λ = 5.86, O = 512, h = 0.56

Context	Speed	P:D	P GPUs	D GPUs	Total GPUs
4K	30	P1:D1	64	128	192
4K	100	P1:D1	64	192	256
4K	400	P1:D1	64	384	448
32K	30	P1:D1	64	128	192
32K	100	P1:D1	64	192	256
32K	400	P1:D1	64	384	448
128K	30	P1:D1	64	128	192
128K	100	P1:D1	64	192	256
128K	400	P1:D1	64	384	448
256K	30	P1:D1	64	128	192
256K	100	P1:D1	64	192	256
256K	400	P1:D1	64	384	448
1M	30	P6:D1	384	128	512
1M	100	P6:D1	384	192	576
1M	400	P6:D1	384	384	768

Interpretation:

4K–256K P1:D1: Not because P/D throughput happens to be 1:1, but because the minimum deployment units already exceed the capacity floor. The 64-GPU P-unit and 128/192/384-GPU D-units all far exceed actual traffic requirements.
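1M P6:D1: P-tier 1M prefill is the absolute bottleneck. TTFT grows faster than linearly with context length: a 1M prompt has roughly 250× the tokens of a 4K prompt and about 2× the FLOPs per token (§3.2), so its total prefill compute is roughly 500× that of 4K. The 384-GPU P-supergroup uses CP8 to complete a 1M prefill within reasonable time.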

10. Huawei 950 Supernode All-Scenario Configuration

10.1 Recommended Architecture

The Huawei solution should not be a flat 950DT deployment. The correct architecture is:

P tier: 950PR (optimized for prefill)
D tier: 950DT (optimized for decode)
C tier: Atlas 950 SuperPoD / UnifiedBus / Kunpeng / EMS / NVMe cache

The numbers below use pure 950DT for conservative estimation. Once 950PR's measured prefill goodput is available, P-tier counts are expected to decrease significantly.

10.2 All-Scenario P:D Configuration (Pure 950DT)

Baseline: λ = 5.86, O = 512, h = 0.56

Context	Speed	P:D	P NPUs	D NPUs	Total NPUs
4K–128K	30	P1:D1	128	192	320
4K–128K	100/400	P1:D1	128	384	512
256K	30	P3:D1	384	192	576
256K	100/400	P3:D1	384	384	768
1M	30	P13:D1	1664	192	1856→2048
1M	100/400	P13:D1	1664	384	2048→2432

Cleaner Supernode production slices:

Context	Production Recommendation
256K	P4:D1 = 512P + 384D = 896 NPUs (more stable TTFT)
1M	P16:D1 = 2048P + 384D = 2432 NPUs (EP128×CP16 or EP256×CP8)

10.3 Internal Parallelism Recommendations

Context	P Tier	D Tier 30 tok/s	D Tier 100/400 tok/s
4K	TP1 / EP128 / PP1 / CP1	TP1 / EP192 / PP1	TP1 / EP384 / PP1
128K	EP128 / CP1; use CP2 for strict TTFT	EP192	EP384
256K	EP128 × CP3; production recommends CP4	EP192	EP384
1M	EP128 × CP16 or EP256 × CP8	EP192 / DCP	EP384 / DCP + MTP

11. Cost: How Much Per Million Tokens

Finally, let's calculate cost. The derivation chain is straightforward:

$/MTok = Total GPUs × per-GPU-hour cost ÷ million tokens output per hour

Tokens output per hour = λ × O × 3600

$/MTok = N_gpu × $/GPU-hour / (λ × O × 3600 / 1,000,000)

Cost assumptions:

B300: $5/GPU-hour (planning estimate, not official pricing)
950DT: $2/NPU-hour (planning estimate)
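The cost formula is short enough to encode directly. In this sketch the hourly prices are the planning estimates listed above, and the function name is illustrative.

```python
def dollars_per_mtok(n_accel, price_per_hour, lam, out_len):
    """Hourly fleet cost divided by millions of output tokens produced per hour."""
    mtok_per_hour = lam * out_len * 3600 / 1_000_000
    return n_accel * price_per_hour / mtok_per_hour


print(f"B300 256 GPUs:  ${dollars_per_mtok(256, 5.0, 5.86, 512):.0f}/MTok")
print(f"B300 576 GPUs:  ${dollars_per_mtok(576, 5.0, 5.86, 512):.0f}/MTok")
print(f"950DT 320 NPUs: ${dollars_per_mtok(320, 2.0, 5.86, 512):.0f}/MTok")
```

Since output volume is fixed by λ and O, cost per token at a given traffic level is directly proportional to accelerator count.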

11.1 Full Derivation: B300 128K / 100 tok/s

N_gpu = 256 (P1:D1 = 64P + 192D)

Output per hour = 5.86 × 512 × 3600 = 10,801,152 tokens ≈ 10.80M tokens

Cost per hour = 256 × $5 = $1,280

$/MTok = $1,280 / 10.80 = $119/MTok

11.2 Full Derivation: B300 1M / 100 tok/s

N_gpu = 576 (P6:D1 = 384P + 192D)

Output per hour = 10.80M tokens (same λ)

Cost per hour = 576 × $5 = $2,880

$/MTok = $2,880 / 10.80 = $267/MTok

11.3 B300 All-Scenario Costs

Context	30 tok/s	100 tok/s	400 tok/s
4K–256K	$89/MTok	$119/MTok	$207/MTok
1M	$237/MTok	$267/MTok	$356/MTok

11.4 950DT All-Scenario Costs

Context	30 tok/s	100 tok/s	400 tok/s
4K–128K	$59/MTok	$95/MTok	$95/MTok
256K	$107/MTok	$142/MTok	$142/MTok
1M	$344/MTok	$379/MTok	$379/MTok

11.5 Cost Interpretation

Several numbers worth noting:

4K–128K: 950DT has lower nominal cost ($59 vs $89), but this advantage rests entirely on the NPU-hour = $2 assumption. If actual NPU-hour pricing is higher, the advantage disappears.
1M: B300 is actually cheaper ($267 vs $379). The reason is that 950DT's 1M configuration balloons to 2048 NPUs (1664 of them on the P tier): lower per-node compute means more NPUs needed for P, and more NPUs mean higher cost.
For 950DT to match B300 at 1M, NPU-hour would need to drop to approximately $1.4 for the 2048-NPU floor ($2,880 / 2048), or about $1.2 for the 2432-NPU production recommendation, well below the current $2 assumption.

12. C Tier: Cache and Bandwidth Between P and D

12.1 How Much P→D Bandwidth Is Needed

After prefill completes, the P tier must transfer KV cache to the D tier. Each session's KV size is K(L), and there are λ new sessions per second:

B_{P→D} = λ × K(L)

Plugging in baseline λ = 5.86:

Context	KV/seq	P→D Floor	If Caching Full SWA (8×)
4K	0.022 GB	0.13 GB/s	1.0 GB/s
32K	0.174 GB	1.02 GB/s	8.1 GB/s
128K	0.695 GB	4.07 GB/s	32.6 GB/s
256K	1.389 GB	8.14 GB/s	65.1 GB/s
1M	5.300 GB	31.05 GB/s	248.4 GB/s

31 GB/s is only the P→D KV transfer floor. Actual C-tier traffic also includes lookup, D→C write-back, eviction, replay, and other multiplexed flows. At 1M, actual C-tier bandwidth requirements may be in the 100–200 GB/s range.

This scale means:

4K–32K: standard Ethernet/RoCE suffices
128K–256K: RDMA or NVLink-domain needed
1M: RDMA + UB (UnifiedBus) + NVLink-domain KV transfer required
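The P→D floor column is just λ multiplied by the per-sequence KV size. A minimal sketch using the KV/seq values from the table; the function name is illustrative and the 8× factor is the article's SWA approximation.

```python
KV_GB_PER_SEQ = {"4K": 0.022, "32K": 0.174, "128K": 0.695, "256K": 1.389, "1M": 5.300}
SWA_MULTIPLIER = 8  # full SWA caching, per the article's approximation


def p_to_d_floor_gb_s(lam, kv_gb_per_seq):
    """Minimum sustained P->D bandwidth: new sessions per second times KV per session."""
    return lam * kv_gb_per_seq


for label, kv in KV_GB_PER_SEQ.items():
    floor = p_to_d_floor_gb_s(5.86, kv)
    print(f"{label:>4}: {floor:6.2f} GB/s floor, {floor * SWA_MULTIPLIER:7.1f} GB/s with full SWA")
```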

12.2 What the C Tier Must Do at 1M

At 1M context, cache strategy isn't an optimization—it's a survival requirement:

Prefix hashing: store only one copy of compressed KV for identical prefixes
Compressed CSA/HCA KV priority caching: don't default to full SWA caching
SWA periodic checkpointing: periodic snapshots of SWA, not real-time full persistence
Cache-aware routing: route requests preferentially to D-units with cache hits
SSD endurance management: on-disk KV cache generates massive SSD write volume; monitor write lifespan
Cache hit rate as a scheduling objective: don't distribute requests evenly; prioritize leveraging existing cache

13. P/D/C Scheduling: Don't Use Static Binding

13.1 Three Independent Resource Pools

In production, don't permanently bind P1 to D1. Use three independent resource pools:

P Pool  |  D Pool  |  C Cache Pool

Request flow:

Router performs prefix hashing
Checks C tier for compressed KV hit
Hit → reduced P load, only tail recompute needed
Miss → allocate a P-unit for full prefill
P completes → allocate D-unit based on D-tier load and cache locality
During D decode, continuously append KV; partial write-back to C tier is possible

13.2 What P-Tier Scheduling Should Consider

P-tier scheduling shouldn't look only at request count. A 1M prompt carries roughly 250× the tokens of a 4K prompt and, because per-token prefill FLOPs roughly double at 1M (§3.2), about 500× the compute. Long-context prefill should be scheduled by token budget ("this P-unit has 500K tokens of prefill budget remaining"), not by request count ("this P-unit can accept 5 more requests").
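One way to picture token-budget scheduling is the sketch below. It is an illustration only: `PUnit`, `pick_p_unit`, and the budget figures are hypothetical and not part of any real scheduler, and a production version would also replenish budgets as prefills complete.

```python
from dataclasses import dataclass


@dataclass
class PUnit:
    name: str
    token_budget: int  # uncached prefill tokens this unit can still accept


def pick_p_unit(units, prompt_tokens, cached_tokens=0):
    """Place a request on the unit with the most remaining budget, or return None."""
    need = prompt_tokens - cached_tokens  # only uncached tokens cost prefill compute
    candidates = [u for u in units if u.token_budget >= need]
    if not candidates:
        return None
    chosen = max(candidates, key=lambda u: u.token_budget)
    chosen.token_budget -= need
    return chosen


units = [PUnit("p0", 500_000), PUnit("p1", 120_000)]
print(pick_p_unit(units, prompt_tokens=4_000))                            # small prompt
print(pick_p_unit(units, prompt_tokens=1_000_000, cached_tokens=560_000))  # long prompt, partly cached
print(pick_p_unit(units, prompt_tokens=128_000))                          # no unit has budget left
```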

13.3 What D-Tier Scheduling Should Consider

D-tier scheduling must simultaneously consider:

Active streams (current number of active sessions)
KV length (how long each session's KV is)
Expert load (whether expert load is balanced)
Per-user speed tier (30/100/400 tok/s cannot be mixed in the same D-unit)
TPOT p90 (real-time monitoring; throttle if SLA is exceeded)
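A simplified selection rule covering three of these signals (speed tier, TPOT p90, active streams) plus cache locality might look like the following. Everything here is hypothetical: the `DUnit` fields and `pick_d_unit` are invented for illustration, and KV length and expert load are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class DUnit:
    name: str
    speed_tier: int      # 30, 100 or 400 tok/s; tiers are never mixed
    active_streams: int
    tpot_p90_ms: float   # latest measured TPOT p90
    cached_prefixes: frozenset = frozenset()


def pick_d_unit(units, speed_tier, prefix_hash):
    """Filter by tier and SLA, then prefer cache hits, then the fewest active streams."""
    sla_ms = 1000.0 / speed_tier
    eligible = [u for u in units
                if u.speed_tier == speed_tier and u.tpot_p90_ms <= sla_ms]
    if not eligible:
        return None
    return min(eligible,
               key=lambda u: (prefix_hash not in u.cached_prefixes, u.active_streams))


units = [
    DUnit("d0", 100, active_streams=12, tpot_p90_ms=8.0, cached_prefixes=frozenset({"abc"})),
    DUnit("d1", 100, active_streams=5, tpot_p90_ms=9.0),
    DUnit("d2", 100, active_streams=1, tpot_p90_ms=12.0),  # over the 10 ms SLA, skipped
]
print(pick_d_unit(units, 100, "abc").name)  # cache hit wins despite higher load
print(pick_d_unit(units, 100, "zzz").name)  # no hit anywhere: least-loaded healthy unit
```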

14. High-Utilization Scenario: How Many P to Saturate D

The configurations in §6–10 are SLA-safe—guaranteeing TPOT and TTFT, but with very low D-tier utilization (D-unit goodput far exceeds actual traffic). If the goal is to saturate D-units (maximizing throughput), P:D would be completely different.

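N_P per saturated D-unit ≈ (L/O) × (1−h) × G_{D,unit} / G_{P,unit}

This follows from §6.1: a saturated D-unit emits G_{D,unit} tok/s, which corresponds to λ = G_{D,unit} / O requests per second, and each request brings L × (1−h) uncached prompt tokens to the P tier.

Intuitively: the longer the context (large L), the shorter the output (small O), and the lower the cache hit rate (small h), the larger P needs to be relative to D.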

B300 (P-unit = 64 GPUs)

Context	Saturate D30-unit (128 GPUs)	Saturate D100-unit (192 GPUs)	Saturate D400-unit (384 GPUs)
4K	2 P-units	3 P-units	6 P-units
32K	15	22	44
128K	54	81	162
256K	100	149	298
1M	305	457	913
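The table can be reproduced from the goodput model. The sketch below rests on several assumptions: the B300 efficiency values and usable fractions from §5.2, results rounded up to whole P-units, and token counts of 4K–256K taken as binary (4,096 to 262,144) with 1M as 1,000,000, which is the convention that matches the table entries. The function names are illustrative.

```python
import math

OUT_LEN, HIT_RATE = 512, 0.56
C_NODE, ETA_P, ETA_D = 36.0, 0.45, 0.14  # B300 8-GPU node
USABLE_P, USABLE_D = 0.7, 0.6
P_UNIT_NODES = 8                         # 64-GPU P-unit


def prefill_tflop(ctx):
    return 0.098 + 0.101 * ctx / 1e6


def decode_tflop(ctx):
    return 0.098 + 0.202 * ctx / 1e6


def p_units_to_saturate(ctx, d_unit_nodes):
    """P-units needed to keep one D-unit of d_unit_nodes nodes fully fed."""
    g_p = USABLE_P * C_NODE * ETA_P * 1000 / prefill_tflop(ctx) * P_UNIT_NODES
    g_d = USABLE_D * C_NODE * ETA_D * 1000 / decode_tflop(ctx) * d_unit_nodes
    return math.ceil((ctx / OUT_LEN) * (1 - HIT_RATE) * g_d / g_p)


CONTEXTS = {"4K": 4096, "32K": 32768, "128K": 131072, "256K": 262144, "1M": 1_000_000}
for label, ctx in CONTEXTS.items():
    row = [p_units_to_saturate(ctx, nodes) for nodes in (16, 24, 48)]  # D30, D100, D400
    print(label, row)
```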

Conclusion: At 1M context, short output (O=512), and semi-cold prefixes (h=0.56), you should not pursue D-tier saturation—otherwise the P tier would balloon to unacceptable levels. The correct approach: use cache to increase h, offload long-output tasks, or use larger EP for D-units to improve D goodput.

15. Final Recommendations

B300

Scenario	Configuration	Total GPUs
4K–256K / 30 tok/s	P1:D1	192
4K–256K / 100 tok/s	P1:D1	256
4K–256K / 400 tok/s	P1:D1	448
1M / 30 tok/s	P6:D1	512
1M / 100 tok/s	P6:D1	576
1M / 400 tok/s	P6:D1	768

P tier: TP1 / EP64 / PP1, add CP8 for 1M
D tier: EP128 (30 tok/s) / EP192 (100 tok/s) / EP384 (400 tok/s); 400 tok/s requires MTP

Huawei 950 Supernode

Preferred architecture: 950PR for P, 950DT for D, SuperPoD for C/cache

Scenario	Configuration (Pure 950DT Conservative Estimate)	Total NPUs
4K–128K / 30 tok/s	P1:D1	320
4K–128K / 100 tok/s	P1:D1	512
256K	P4:D1 (production recommendation)	896
1M	P16:D1 (production recommendation)	2432

16. Uncertainties and Validation Checkpoints

Finally, we must be clear about what is verified and what is speculative:

Category	Content
Verified	V4-Pro model structural parameters (1.6T/49B/384 experts/top-6), B300 hardware specifications, FLOPs calculation formulas
Calibrated but not V4-Pro measured	Hardware utilization η_P=0.45 / η_D=0.14, calibrated from DeepSeek V3/R1 H800 production systems
Speculative	B300's actual decode utilization on V4-Pro, 950DT's performance on V4-Pro, 950PR prefill goodput
Planning estimates	B300 $5/GPU-hour, 950DT $2/NPU-hour
Staleness risk	950DT planned for 2026 Q4 availability; specs and availability may change; B300 pricing not yet published

Next validation checkpoints:

V4-Pro measured decode goodput and TPOT p90 on B300
950DT measured prefill/decode efficiency on V4-Pro
950PR prefill goodput
Actual commercial pricing for B300 and 950DT
V4-Pro's SWA KV strategy and actual cache overhead

Until measured data is available, all configurations in this article should be treated as capacity planning starting points and frameworks, not final answers. But the framework is correct—define workload first, then calculate goodput, then derive P:D—once measured data arrives, only the efficiency parameters need to be replaced; the methodology remains unchanged.