DeepSeek V4 + Ascend: Full-Stack Validation of Domestic AI Inference

KADC 2026 Series Analysis · Part 4 · End-to-End Validation / Domestic AI Inference

2026-05-25 · Thinking · 15 min read


Why DeepSeek V4 Is the Ultimate Test for Ascend

When people talk about domestic AI chips running large model inference, the most common claim is "support for mainstream models." But "mainstream" is too vague. DeepSeek V4 finally establishes a hard benchmark: if you can run V4 inference end to end, there is almost no model you can't run.

The reason is simple: DeepSeek V4 is the most extreme MoE inference workload in existence.

Start with the architecture. V4 has 1.6T total parameters, 49B activated parameters, 61 Transformer layers, and a hidden dimension of 7,168. It uses 384 routed experts plus 1 shared expert—385 FFN modules in total. Each token activates only topK=6 routed experts, an activation ratio of just 1.56%. Expert intermediate dimension is 3,072. There are 128 attention heads with a head dimension of 512. The context window spans 1M tokens.
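The sparsity figures above follow directly from the quoted numbers. A minimal arithmetic check, using only the values stated in this article (the variable names are mine):

```python
# Figures quoted in the article for DeepSeek V4.
ROUTED_EXPERTS = 384
TOP_K = 6
TOTAL_PARAMS = 1.6e12
ACTIVE_PARAMS = 49e9

# Share of routed experts that fire for a single token.
expert_ratio = TOP_K / ROUTED_EXPERTS
# Share of all parameters touched for a single token.
param_ratio = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"routed experts active per token: {expert_ratio:.2%}")
print(f"parameters active per token:     {param_ratio:.2%}")
```

The two ratios differ because attention, embeddings, and the shared expert are always active, while only the routed FFN experts are sparse.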

This parameter profile places all-round pressure on the inference system:

Operator compatibility. MoE's Expert Dispatch and Expert Combine are non-standard operators with divergent implementation paths across hardware. Add V4's novel CSA (4× compression) and HCA (128× compression) attention mechanisms, and standard operator libraries may not provide coverage.

EP communication efficiency. 384 routed experts mean Expert Parallelism requires large-scale All-to-All communication. Each token must be dispatched to 6 experts and combined back from 6 experts. Individual packets are only 7–14KB, but they are sent at very high frequency, and the number of peer-to-peer flows in an All-to-All grows roughly quadratically with the number of EP ranks. This is extremely sensitive to interconnect bandwidth and communication latency.

KV Cache capacity and bandwidth. 1M-token context + 128 attention heads + 512 head dimension creates astronomical KV Cache storage requirements. V4 uses mixed-precision storage—BF16 for RoPE components, FP8 for the rest—to compress, with the indexer even dropping to FP4. This shows the V4 team themselves pushed KV Cache pressure to its limits during design.
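To see why compression is unavoidable, here is a rough upper-bound sketch. It assumes a plain multi-head attention cache that stores one full K and one full V vector per head, per layer, per token, with no CSA/HCA compression at all. That is deliberately not how V4 stores its cache; the function name and the uncompressed layout are my assumptions, used only to size the problem.

```python
def naive_kv_cache_bytes(tokens, layers=61, heads=128, head_dim=512,
                         bytes_per_elem=2):
    """Uncompressed KV cache size for one sequence (hypothetical layout).

    The factor 2 accounts for storing both K and V.
    bytes_per_elem: 2 for BF16, 1 for FP8.
    """
    return 2 * layers * heads * head_dim * bytes_per_elem * tokens


per_token = naive_kv_cache_bytes(1)
full_context = naive_kv_cache_bytes(1_000_000)
print(f"per token (BF16):    {per_token / 1e6:.1f} MB")
print(f"1M tokens (BF16):    {full_context / 1e12:.1f} TB")
print(f"1M tokens (FP8):     {naive_kv_cache_bytes(1_000_000, bytes_per_elem=1) / 1e12:.1f} TB")
```

Under these assumptions a single 1M-token sequence would need on the order of ten terabytes, which is far beyond per-device memory and explains why the design leans on attention compression and low-precision storage.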

Long-sequence attention computation. A sliding window of 128 mitigates global attention pressure, but sequence management across 1M tokens, prefix caching, and the memory access patterns of attention kernels remain engineering challenges.

In short: DeepSeek V4 is a ruler. Where Ascend falls on this ruler marks the true waterline of domestic inference compute.


I. Operator Layer: TileLang's Cross-Platform Validation

Adapting to any new hardware always hits the operator layer first.

The core question: can the hundreds of operators used in large model inference run on new hardware? Once running, is performance sufficient? And how high is the development cost?

Ascend's operator ecosystem has two main paths: the Triton frontend (600+ operators) and the TileLang frontend (300+). But the most significant data point from KADC comes from Professor Yang Zhi's team at Peking University, who presented TileLang validation results.

TileLang's Cross-Platform Mechanism

TileLang is an operator framework built on tile-level programming abstractions. Its core idea: decompose operators into tile-level operations—essentially block-by-block data operations—where tile sizes can be tuned to hardware characteristics. The underlying codegen generates micro-architecture-specific instructions for different hardware.

What does this mean? A single TileLang operator source can run on both NVIDIA GPUs and Ascend NPUs. Hardware differences are abstracted away at the codegen layer, so developers maintain one codebase.
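The tile-level idea can be illustrated without any hardware. The sketch below is plain Python, not TileLang syntax: it computes a matrix product block by block, with the `tile` parameter standing in for the hardware-tuned tile size. The function name and the pure-Python formulation are illustrative assumptions.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), one tile of the output at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # tile row of the output
        for j0 in range(0, m, tile):        # tile column of the output
            for k0 in range(0, k, tile):    # tile along the reduction axis
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(a, b, tile=2))
```

The result does not depend on `tile`; only the memory access pattern does. That separation is what lets a tile-based framework keep one operator source and retune tile sizes per target.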

Professor Yang's team concluded that TileLang demonstrates "high development efficiency and high performance" in DeepSeek V4 operator practice. In Developer mode, operators across platforms differ by only a small amount of code.

This conclusion matters, but it's important to understand precisely what it says—and what it doesn't.

What it says: writing the core operators required by DeepSeek V4 in TileLang incurs low adaptation cost on Ascend, with usable performance. This has been empirically validated—not by Huawei's own claims, but by an independent Peking University team report.

What it doesn't say: all operators used by V4 are already covered. 600+ Triton + 300+ TileLang covers "key operator examples for mainstream models," but "key operators" ≠ "all operators." Long-tail operators—particularly specialized normalization, quantization, or fine-grained operators in V4's unique CSA/HCA attention mechanisms—may still require individual verification.

Verified vs. Pending

For the most critical MoE inference operators—Expert Dispatch, Expert Combine, and Attention kernels—the evidence suggests they are verified. These are the scenarios TileLang explicitly demonstrated.

For Triton-frontend operators, the situation is more complex. Current data shows Triton operators on Ascend achieve 0.6–0.9× the performance of Ascend C (Ascend's native operator implementation). This range is wide: at 0.6× an operator takes about 1.7× as long as its native counterpart, while 0.9× is close to native. Which specific V4 inference operators fall at 0.6× and which at 0.9×? No per-operator public benchmark data exists yet.


II. Framework Layer: PyTorch Ecosystem Alignment

The operator layer answers "can it run?" The framework layer answers "how expensive is migration?"

The headline number: 2,300+ APIs aligned with the PyTorch ecosystem.

What the Number Actually Means

2,300+ API alignment means most PyTorch model code can, in theory, migrate to Ascend with little or no change: add import torch_npu alongside import torch, target the npu device instead of cuda, and most code runs without further modification.
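A minimal device-selection sketch of what that migration looks like in practice. It assumes the `torch_npu` plugin registers a `torch.npu` namespace once imported, which matches my understanding of the plugin but should be checked against the installed version. The helper name `pick_device` is hypothetical, and the fallbacks let the snippet run on hosts without Ascend hardware or without PyTorch.

```python
def pick_device():
    """Return 'npu', 'cuda', or 'cpu' depending on what this host exposes."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed; nothing to migrate

    try:
        # Assumption: importing torch_npu registers the NPU backend on torch.
        import torch_npu  # noqa: F401
        if torch.npu.is_available():
            return "npu"
    except Exception:
        pass  # plugin missing or driver stack not present

    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"selected device: {device}")
```

Model code that takes its device from a helper like this, instead of hard-coding "cuda", is the kind that benefits most from API-level alignment.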

But "alignment" needs unpacking. API signature alignment (matching function names, parameters, and return values) and performance alignment (similar performance for the same API across hardware) are two different things. 2,300+ API alignment refers to the former, with no guarantee on the latter.

A concrete example: torch.nn.functional.scaled_dot_product_attention may have an identical signature on Ascend, but the internal implementation path differs—NVIDIA routes through FlashAttention kernels, while Ascend uses its own attention kernels. The two may differ in performance characteristics, memory footprint, and numerical precision.

More substantive metrics include:

40+ models with graph-mode compilation. Graph mode (torch.compile and similar) compiles PyTorch's dynamic graphs into static computation graphs for better performance. 40+ models successfully entering graph compilation indicates Ascend's graph compiler covers mainstream model architectures.

20+ models with out-of-the-box FSDP2 support. FSDP2 (Fully Sharded Data Parallel v2) is PyTorch's sharded data-parallel API, used mainly for distributed training. 20+ mainstream large models working out of the box suggests the framework adaptation for distributed workloads is largely in place.

The verl Signal

verl is one of the most active open-source RL training frameworks, used for RLHF, GRPO, and other post-training processes. Ascend's collaboration with verl delivered a fully asynchronous mode that reportedly doubles RL training efficiency.

The significance of this signal: it shows Ascend's adaptation extends beyond inference into post-training (RLHF/GRPO). RL training demands more from the framework than pure inference—it involves complex loops of generation, reward model scoring, and policy updates. Any broken link in this chain kills the whole pipeline.

8+ RL community collaborations, with 10,000+ lines of code merged. This is tangible open-source contribution volume, not lip-service "compatibility" pledges.


III. Inference Engine Layer: vLLM + SGLang + xLLM

The framework layer is infrastructure. The inference engine is what users directly interact with. Adaptation quality at this layer directly determines the inference experience.

Ascend simultaneously landed three inference engines: vLLM, SGLang, and xLLM. This isn't random selection—each engine covers different inference needs.

vLLM: Native Integration

vLLM is the de facto standard for general-purpose LLM inference today. Ascend is the only domestic (Chinese-developed) hardware vendor natively integrated into the vLLM main branch.

"Native integration" is fundamentally different from an "adaptation layer." An adaptation layer is an after-the-fact patch—the hardware vendor maintains a fork, and users need a specific branch or an extra patch package. Native integration means Ascend is a first-class citizen in vLLM; every main-branch update includes Ascend adaptation code.

Specific performance data: first-token latency reduced by 30% in long-sequence scenarios. This number needs context—30% reduction relative to what? If relative to Ascend's previous implementation, it shows significant optimization from native integration. If relative to NVIDIA GPUs, the number would be far more striking. Based on KADC's phrasing, the former is more likely.

SGLang: Native Integration

SGLang is one of the fastest open-source inference frameworks, particularly strong in structured generation scenarios. Ascend is the only domestic non-GPU hardware vendor natively integrated into the SGLang main branch.

SGLang's performance edge comes from deep optimization of the inference process—radix attention, fine-grained continuous batching scheduling, and efficient KV Cache management. Ascend achieving native SGLang integration means these optimization patterns run correctly on Ascend's hardware characteristics.

xLLM: Full Modality

xLLM is a Chinese-team open-source inference engine positioned as "an operating system that connects underlying chips with large model applications." Its differentiator: native support for text, image, and video modalities, with deep adaptation to Ascend's supernode technology.

xLLM's significance: it wasn't built by bolting an Ascend backend onto a general-purpose inference framework. It was designed from the ground up with Ascend supernode interconnect characteristics in mind. Future plans include deep adaptation to the Ascend 950 supernode—a forward-looking bet.

What Does Triple-Engine Coverage Mean?

  • vLLM suits general-purpose LLM serving, with the largest user base
  • SGLang suits latency- and throughput-critical scenarios
  • xLLM suits multimodal and supernode scenarios

Ascend isn't picking sides—it's going for full coverage. This reduces users' switching costs: regardless of which inference engine you use, Ascend is an available platform.


IV. EP Communication Optimization: The Core Bottleneck in MoE Inference

The first three layers (operators, framework, inference engines) address "can it run" and "is it usable." EP communication optimization addresses "can it run fast"—and this is precisely the biggest performance bottleneck in MoE inference.

DeepSeek V4's EP Challenge

Expert Parallelism is the core distributed strategy for MoE model inference. 384 routed experts are distributed across multiple cards. Each token must be dispatched to the cards hosting its topK=6 experts, computed, then combined back.
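The dispatch and combine steps described above can be sketched in a few lines. This is a toy model: it assumes experts are placed contiguously across ranks (expert `e` lives on rank `e // experts_per_rank`) and that each expert returns a single scalar per token. Both assumptions, and all function names, are mine and are not taken from any Ascend or DeepSeek implementation.

```python
from collections import defaultdict


def dispatch(token_experts, experts_per_rank):
    """Group (token, expert) pairs by the rank that hosts the expert."""
    outbox = defaultdict(list)
    for token_id, experts in enumerate(token_experts):
        for expert_id in experts:
            rank = expert_id // experts_per_rank  # assumed contiguous placement
            outbox[rank].append((token_id, expert_id))
    return dict(outbox)


def combine(expert_outputs, gate_weights, num_tokens):
    """Gate-weighted sum of expert outputs back into one value per token."""
    out = [0.0] * num_tokens
    for (token_id, expert_id), value in expert_outputs.items():
        out[token_id] += gate_weights[(token_id, expert_id)] * value
    return out


# Two tokens, top-2 routing for brevity (V4 uses top-6), 4 experts per rank.
routing = [[0, 5], [5, 7]]
print(dispatch(routing, experts_per_rank=4))
```

Each entry in an outbox is one message that has to cross the interconnect on dispatch, and one more on the way back for combine.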

This process requires All-to-All communication. The problems:

  1. Extremely small packets. Each token sends only 7–14KB of data to a single expert (depending on hidden dim and batch configuration). For packets this small, traditional TCP/IP network stacks are highly inefficient—protocol header overhead is proportionally large, and interrupt handling is frequent.

  2. Extremely high frequency. If a batch has B tokens, each token sends 6 packets and receives 6 packets, so each MoE layer involves B × 6 × 2 messages per step (dispatch + combine), spread across up to 384 cards.

  3. Latency-sensitive. MoE's dispatch and combine happen in the middle of the forward pass, so the next computation depends on their result. Unlike gradient synchronization in data parallelism, they are hard to overlap with computation, and communication latency adds almost directly to inference latency.
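The 7–14KB and B × 6 × 2 figures in the list above can be reproduced with simple arithmetic. One plausible reading, which is my assumption and not stated by the source, is that each message carries one hidden-state vector of 7,168 elements, at 1 byte per element (FP8) or 2 bytes (BF16).

```python
HIDDEN_DIM = 7168  # hidden dimension quoted for V4
TOP_K = 6          # routed experts per token


def payload_bytes(bytes_per_elem):
    """Size of one token's hidden vector sent to one expert."""
    return HIDDEN_DIM * bytes_per_elem


def messages_per_moe_layer(batch_tokens):
    """Dispatch plus combine messages for one MoE layer in one step."""
    return batch_tokens * TOP_K * 2


print(f"FP8 payload:  {payload_bytes(1) / 1024:.0f} KiB")
print(f"BF16 payload: {payload_bytes(2) / 1024:.0f} KiB")
print(f"messages per MoE layer at B=128: {messages_per_moe_layer(128)}")
```

The payload is fixed by the hidden dimension, so larger batches raise the message count, not the message size, unless the runtime aggregates tokens bound for the same destination.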

Ascend's Solution

Based on KADC disclosures, Ascend's EP communication design has several key elements:

EP communication completed within the Scale-Up domain. Traffic stays off traditional networks (RoCE/InfiniBand) and completes directly on the supernode's internal Scale-Up interconnect. This avoids protocol stack overhead and reduces latency.

Load & Store semantics for small packets. 7–14KB packets bypass DMA (DMA has startup overhead for small packets) and use Load & Store semantics—essentially CPU-like memory access, completed directly by hardware. This dramatically reduces small-packet communication latency.

DMA for large packets. When batch size grows and packet size exceeds a threshold, the system switches to DMA mode, leveraging hardware DMA engines for high-bandwidth transfer.
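The small-packet versus large-packet split can be modeled as a choice between two transports with different fixed costs. The sketch below uses a simple "startup latency plus size over bandwidth" model. Every number in it is a hypothetical placeholder chosen to show the crossover, not a measured or published Ascend figure, and the function names are mine.

```python
def transfer_time_us(size_bytes, startup_us, bandwidth_gb_s):
    """Startup cost plus streaming time; 1 GB/s equals 1e3 bytes per microsecond."""
    return startup_us + size_bytes / (bandwidth_gb_s * 1e3)


def pick_transport(size_bytes, load_store=(0.5, 20.0), dma=(5.0, 200.0)):
    """Pick the faster path. Tuples are (startup_us, bandwidth_gb_s), hypothetical."""
    t_ls = transfer_time_us(size_bytes, *load_store)
    t_dma = transfer_time_us(size_bytes, *dma)
    return ("load_store", t_ls) if t_ls <= t_dma else ("dma", t_dma)


for size in (7 * 1024, 14 * 1024, 1_000_000):
    name, t = pick_transport(size)
    print(f"{size:>9} bytes -> {name} ({t:.2f} us)")
```

With these placeholder values the crossover sits near 100KB: below it the low startup cost of Load & Store wins, above it DMA bandwidth dominates. The real threshold depends on hardware parameters that have not been published.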

CCU-hardened collective communication. CCU (Collective Communication Unit) hardens collective communication operations (AllReduce, All-to-All, etc.) into silicon, reducing CPU involvement. This is especially important for MoE's high-frequency collective communication.

SSU + UB direct KV Cache connection. SSU (Shared Storage Unit) and UB (Unified Bus, the supernode's Scale-Up interconnect) connect directly to KV Cache, with claimed bandwidth improvements of "an order of magnitude."

From CloudMatrix 384 to Ascend 950

Current validation data comes from CloudMatrix 384 (based on Ascend 910C):

  • A cluster of 384 × 910C
  • DeepSeek-R1 inference validation (note: R1 is based on V3 architecture, simpler than V4)
  • EP320 configuration (expert-parallel degree of 320)
  • Decode performance: 1,943 tok/s/NPU

This data comes from an arXiv paper (2506.12708)—independently verifiable.

The Ascend 950 supernode claims EP8192 support (8,192 cards). If interconnect bandwidth doubles and the SSU KV Cache bandwidth improvement materializes, inference performance should theoretically far exceed CloudMatrix 384 levels.

But "theoretically" and "verified" are separated by actual delivery and testing.


V. Honest Assessment: Verified vs. Pending

Consolidating the analysis above into a single table:

| Component | Verification Status | Evidence Source |
| --- | --- | --- |
| TileLang cross-platform operator support | ✅ Verified | Public report, Prof. Yang Zhi team, Peking University |
| DeepSeek V4 inference runnable on Ascend | ✅ Verified | Multiple confirmations at KADC |
| vLLM native Ascend backend integration | ✅ Verified | Verifiable in vLLM open-source repository |
| SGLang native Ascend backend integration | ✅ Verified | Verifiable in SGLang open-source repository |
| xLLM deep Ascend adaptation | ✅ Verified | Verifiable in xLLM open-source repository |
| PyTorch 2,300+ API alignment | ⚠️ Partially verified | API signature alignment confirmed; performance characteristics lack public benchmarks |
| verl RL training 2× efficiency | ⚠️ Partially verified | Numbers from collaboration partner; no independent reproduction |
| Triton operator performance 0.6–0.9× vs. Ascend C | ⚠️ Partially verified | Wide range; no per-operator public data |
| EP320 (384-card) inference at 1,943 tok/s/NPU | ✅ Verified | arXiv:2506.12708 paper |
| EP8192 (8,192-card) inference performance | ❌ Unverified | No public data |
| SSU + UB direct KV Cache bandwidth improvement | ❌ Unverified | Only Huawei's own claims; no independent testing |
| MoE communication latency target of 1ms | ❌ Unverified | Target value only |
| Ascend 950 supernode actual inference performance | ❌ Unverified | Not yet delivered; no test data |

This table should be pinned to the wall of every conference room where people debate "is domestic compute usable?"


VI. Judgment

Confirmed Conclusions

DeepSeek V4 inference runs on Ascend. This is not a demo-level validation—from the operator layer (TileLang) to the framework layer (PyTorch) to the inference engine layer (vLLM/SGLang/xLLM), full-stack adaptation is complete. Independent verification from multiple parties (the Peking University team, the vLLM community, the SGLang community) cross-confirms this.

This is a milestone. Not because it proves Ascend is "invincible," but because it establishes a minimum viability threshold: the most extreme MoE inference workload, and Ascend caught it.

Native open-source community support indicates adaptation quality meets community acceptance standards. vLLM and SGLang maintainers would not accept a poorly performing or bug-ridden backend into their main branches. Native integration is itself a quality signal.

Key Questions Still Pending

Actual inference performance on 8,192-card supernodes. CloudMatrix 384's 1,943 tok/s/NPU was achieved under EP320 configuration. EP8192 means roughly 25.6× the expert parallelism, and the number of sender-receiver pairs in a full All-to-All grows roughly quadratically with rank count, not linearly. Whether the supernode's internal interconnect topology can actually sustain EP8192-level All-to-All communication is a question the Ascend 950 must answer with real data.
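A quick count of ordered sender-receiver pairs shows how the peer set scales from EP320 to EP8192. This is a worst-case sketch that assumes every rank may need to talk to every other rank; the function name is mine.

```python
def all_to_all_pairs(ranks):
    """Ordered (sender, receiver) pairs when every rank can reach every other."""
    return ranks * (ranks - 1)


ep_small, ep_large = 320, 8192
print(f"rank ratio: {ep_large / ep_small:.1f}x")
print(f"pairs at EP{ep_small}:  {all_to_all_pairs(ep_small):,}")
print(f"pairs at EP{ep_large}: {all_to_all_pairs(ep_large):,}")
print(f"pair ratio: {all_to_all_pairs(ep_large) / all_to_all_pairs(ep_small):.0f}x")
```

Per-token traffic does not grow the same way, since each token still goes to only topK=6 experts. What grows is the number of distinct destinations the fabric must route between with low latency, which is why topology matters more than raw bandwidth at this scale.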

Actual bandwidth of SSU + UB direct KV Cache. Huawei claims "an order of magnitude" bandwidth improvement, but no independent test data exists. If the claim is real, KV Cache access latency for 1M-token contexts would drop significantly—critical for long-sequence inference. But if "an order of magnitude" reflects the gap between theoretical peak and actual workload, the real improvement could fall far short.

Practical impact of the 0.6–0.9× Triton operator performance gap on V4 inference. The range is measured against Ascend C, not against GPU. If operators on V4's critical inference path happen to fall at the 0.6× end, overall inference performance could fall well below what native Ascend C operators would deliver. If they fall at 0.9×, the gap is small. Answering this requires per-operator benchmark data.
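How much a per-operator gap matters end to end depends on how much of the runtime those operators occupy. An Amdahl-style sketch, where the time fraction is a hypothetical input since no per-operator profile has been published:

```python
def end_to_end_slowdown(fraction_in_triton_ops, relative_perf):
    """Total runtime relative to an all-native baseline.

    fraction_in_triton_ops: share of baseline time spent in the affected operators.
    relative_perf: their speed relative to native (0.6 to 0.9 in the article).
    """
    unaffected = 1.0 - fraction_in_triton_ops
    affected = fraction_in_triton_ops / relative_perf
    return unaffected + affected


for perf in (0.6, 0.9):
    slowdown = end_to_end_slowdown(0.5, perf)
    print(f"50% of time in Triton ops at {perf}x -> {slowdown:.2f}x total runtime")
```

With half the runtime in affected operators, the 0.6× case costs about a third more time overall while the 0.9× case costs a few percent, which is why the per-operator breakdown is the missing piece.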

Key Judgment

If the CloudMatrix 384's 1,943 tok/s/NPU figure is reliable (it appears to be, backed by a published paper), then 950 supernode inference performance should improve substantially—doubled interconnect bandwidth, SSU KV Cache, more cards in parallel. Theoretically, decode performance could double or more.

But "should" ≠ "verified."

Actual delivery and large-scale validation of the Ascend 950 is the next critical milestone. Until that milestone arrives, all discussion of EP8192 and supernode inference performance remains speculation.

Recommendations for Decision-Makers

For technical leaders and investors tracking domestic inference compute:

Ascend has moved from "can't use" to "can use." This is the most important conclusion. DeepSeek V4's full-stack inference validation proves it. If you're evaluating whether domestic compute can serve as an inference backup, the answer is: yes, it can be a backup.

But "can use" ≠ "can compete with GPU on cost and performance." That's the next question to validate. Cost isn't just hardware procurement—it includes software adaptation costs, operational costs, and the hidden costs of ecosystem maturity. Performance isn't just peak throughput—it includes long-tail latency stability, efficiency across different batch sizes, and scalability in multimodal scenarios.

Watch the Ascend 950 delivery timeline and first-user real-world benchmarks. That will be the inflection point from "can use" to "good to use."


This is Part 4 of the KADC 2026 series analysis. Parts 1–3 covered the Ascend supernode architecture, CloudMatrix 384 validation, and the CANN operator ecosystem, respectively. The series is ongoing.