CANN Open Source: Ascend's Strategic Pivot from Building Ecosystem to Entering Ecosystem

KADC 2026 Series Analysis · Part 2 · Software Ecosystem / Developer Strategy

How Deep Is CUDA's Moat, Really?

In 2026, NVIDIA still commands an outsized share of the AI chip market by valuation. But if you look at silicon performance alone, that dominance isn't unassailable—AMD's MI400, Google's TPU v6, and Huawei's Ascend 910D are all approaching or even surpassing the H200 on specific benchmarks.

The real barrier isn't the chip. It's CUDA.

Since its release in 2007, CUDA has accumulated nearly two decades of developer ecosystem. The code, papers, tutorials, and Stack Overflow answers produced by AI researchers worldwide are overwhelmingly built around CUDA. The low-level operators in PyTorch, TensorFlow, and JAX are deeply bound to CUDA's programming model. From a newcomer's first lesson to writing an efficient training script, every step in an AI engineer's learning path is CUDA-aware.

This isn't a technical preference. It's muscle memory.

Ascend isn't challenging a chip; it's challenging an entire set of development habits. That is far harder than catching up on hardware. Hardware gaps can be closed with engineering resources and time, but once developer habits are formed, switching costs compound with every script, tutorial, and debugging reflex that has to be relearned.

Only with this understanding can you grasp the strategic significance of CANN's full open-source announcement at KADC 2026.

I. CANN's Three Strategic Pivots

CANN (Compute Architecture for Neural Networks) is Ascend's operator acceleration library and compute architecture—analogous to cuDNN + CUDA Toolkit in the NVIDIA ecosystem. Its evolution maps Huawei's evolving understanding of how to build a software ecosystem around a chip.

First Pivot (2020–2023): Build Your Own Ecosystem—Attempting to Replicate the CUDA Playbook

Ascend's earliest strategy was straightforward: use MindSpore + CANN to build a complete proprietary tech stack, following the path NVIDIA took—own framework, own operator library, own programming model, forming a closed loop.

The logic was sound on paper: if CUDA proved the value of an integrated ecosystem, why couldn't Ascend do it again?

Reality was less cooperative.

Developers faced an entirely new programming paradigm—Ascend C syntax, CANN's operator development workflow, MindSpore's graph mode—every step required relearning. Meanwhile, CUDA's ecosystem already covered virtually every common scenario. The only motivation to migrate was if Ascend offered a large enough performance advantage or low enough compute cost.

In practice through 2022–2023, that condition wasn't met. Community feedback centered on several pain points:

High migration cost: Moving a model that runs on CUDA to Ascend typically took weeks, involving operator adaptation, memory management differences, and immature debugging toolchains.
Documentation gaps: Official docs covered basic use cases, but edge cases left developers diving into source code or waiting for community responses.
Ecosystem flywheel wouldn't spin: Few users → few contributions → slow bug fixes → even fewer users—a classic cold-start problem.

By late 2023, the limitations were clear: competing against a mature ecosystem with an equivalent but incompatible system meant migration costs far outweighed performance gains.

Second Pivot (2024–2025): Layered Decoupling—From "Replace CUDA" to "Support PyTorch"

Starting in 2024, Huawei's strategy shifted noticeably. Two core changes:

First, embrace PyTorch at the framework layer. Rather than pushing MindSpore, Huawei provided PyTorch backend adaptation, making Ascend an optional acceleration device. Developers didn't need to change their training scripts—just swap the backend.
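What "just swap the backend" can look like in a training script is sketched below. This is a minimal illustration, not Huawei's documented procedure: it assumes the Ascend adapter is installed as the torch_npu package and that importing it makes torch.npu.is_available() usable, which should be checked against the adapter's own documentation. The helper name detect_backend is hypothetical.

```python
import importlib.util


def detect_backend() -> str:
    """Pick an accelerator name without hard-coding a vendor.

    Returns "npu" (Ascend), "cuda" (NVIDIA), or "cpu" as a fallback.
    """
    # If PyTorch itself is missing, there is nothing to accelerate.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    try:
        import torch
    except Exception:
        return "cpu"

    # ASSUMPTION: the Ascend adapter ships as the `torch_npu` package and,
    # once imported, exposes `torch.npu.is_available()`.
    if importlib.util.find_spec("torch_npu") is not None:
        try:
            import torch_npu  # noqa: F401  (the import registers the "npu" device)
            if torch.npu.is_available():
                return "npu"
        except Exception:
            # Adapter present but unusable (for example, no driver); fall through.
            pass

    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = detect_backend()
print(f"training would run on: {device}")
```

The rest of a script would then pass this string wherever it previously hard-coded "cuda", which is the sense in which the model code itself stays unchanged.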

Second, CANN internally began layered decoupling. Operator libraries, communication libraries, and the runtime moved from monolithic releases to independent upgrades, reducing coupling.

The direction was right, but execution exposed a deeper tension: compatibility and performance couldn't both be maximized.

PyTorch's operator calls ultimately need to map to CANN's underlying implementation. On CUDA, this mapping has been optimized over years with virtually zero overhead. On Ascend, the adaptation layer introduced non-trivial performance penalties in some scenarios. The developer experience was "it runs, but not fast enough."

This phase was still essentially "come write code on Ascend"—Ascend remained the destination, and developers still needed to do adaptation work. The migration cost was just lower than in the first phase.

But why wasn't it zero?

Third Pivot (KADC 2026): Full Open Source + Mainstream Ecosystem Compatibility—Bring Ascend to the Developers

CANN's announcements at KADC 2026 mark a fundamental strategy shift:

50+ repositories, 800+ operators fully open-sourced
100% Triton interface compatibility, 600+ operator coverage
100% TileLang interface compatibility, 300+ operator coverage
2,300+ PyTorch API alignment
Runtime and operator compilation interfaces open at every level
Operator and communication libraries independently upgradeable

Taken together, the core message is singular: Ascend no longer asks developers to migrate to it; Ascend adapts to the toolchains developers already use.

This is fundamentally different from the first two pivots. The first was "replace CUDA with my ecosystem." The second was "keep using PyTorch, but come to Ascend." The third is "use whatever you want—Ascend supports it underneath."

This shift isn't the natural evolution of technical maturity. It's Huawei's response to three converging conditions.

II. Why Now: Three Preconditions for CANN's Open Source

Precondition 1: Operator coverage has reached a practical threshold.

800+ open-source operators, 600+ Triton operators, 300+ TileLang operators, 2,300+ PyTorch APIs: these numbers aren't arbitrary. By Huawei's account, they cover over 90% of the operator needs in current mainstream large model training and inference. If that holds, CANN has technically crossed the "barely usable" threshold into "sufficient for most scenarios."

Open-sourcing a half-finished product would only expose shortcomings and accelerate trust collapse. That Huawei waited until operator coverage was sufficient signals confidence in current technical maturity—at least for mainstream use cases.

Precondition 2: Mature cross-platform programming frameworks provide a technical bridge.

Triton and TileLang aren't Ascend technologies; they're community-established cross-platform programming frameworks. Triton, led by OpenAI, is the kernel language that TorchInductor, the default backend of PyTorch 2.x's torch.compile, generates for GPU targets. TileLang, developed by a Peking University team, has demonstrated cross-platform capability through DeepSeek V4's operator work.

Ascend doesn't need to build a new programming paradigm from scratch to compete with CUDA. It just needs to support paradigms that already have community validation. This dramatically reduces the engineering complexity of strategic execution.

Precondition 3: The time window is narrowing.

NVIDIA is tied down by GPU production bottlenecks and tightening U.S. export control policies, and global demand (especially in China) for non-NVIDIA compute has peaked in 2025–2026. Meanwhile, teams like DeepSeek have demonstrated the feasibility of domestic compute in specific scenarios, establishing initial market confidence.

Huawei needs to close the software ecosystem gap within this window. Once NVIDIA's capacity recovers or policies loosen, users will flow back quickly—because the CUDA ecosystem is still there.

All three conditions coinciding enabled CANN's full open source. This isn't a technology showcase. It's a calculated timing decision.

III. The Deeper Implications of Triton/TileLang Compatibility

Triton: Not Another CUDA—The Post-CUDA Paradigm

Triton's significance in the AI programming ecosystem is underestimated by many observers.

It's not just another GPU programming language. Triton's design philosophy is "enable people who don't understand GPU architecture to write efficient operators"—it provides Tile-level programming abstractions. Developers describe "what computation to perform on data blocks" without managing shared memory, warp scheduling, or register allocation.

This positioning precisely targets the core pain point of AI operator development: most AI researchers don't know CUDA programming, but they need custom operators to experiment with new model architectures. Triton lets them bypass the steep GPU programming learning curve.
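The tile-level style described above can be illustrated without any accelerator. The sketch below is a plain-Python emulation, not Triton code: launch, add_kernel, and BLOCK_SIZE are hypothetical names chosen for illustration. It mirrors the usual shape of a Triton kernel, where each program instance is identified by an index, computes the offsets of its own tile, and masks out-of-range elements.

```python
from typing import Callable, List

BLOCK_SIZE = 4  # hypothetical tile width; real kernels tune this per device


def launch(kernel: Callable[..., None], n_elements: int, *args) -> None:
    """Run one kernel instance per tile.

    Instances run sequentially here; an accelerator would run them in parallel.
    """
    num_programs = (n_elements + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    for pid in range(num_programs):
        kernel(pid, n_elements, *args)


def add_kernel(pid: int, n_elements: int,
               x: List[float], y: List[float], out: List[float]) -> None:
    """Each instance handles exactly one tile: load, compute, store."""
    start = pid * BLOCK_SIZE
    for i in range(start, start + BLOCK_SIZE):
        # The bounds check plays the role of a mask: the last tile may
        # extend past the end of the data.
        if i < n_elements:
            out[i] = x[i] + y[i]


x = [float(i) for i in range(10)]
y = [10.0] * 10
out = [0.0] * 10
launch(add_kernel, len(x), x, y, out)
print(out)
```

Note what the kernel author never writes: nothing about shared memory, warps, or registers. Mapping tiles onto the hardware is left to the compiler backend, which is the layer a vendor such as Ascend has to supply.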

More critically, Triton sits underneath PyTorch 2.x's torch.compile: its default backend, TorchInductor, emits Triton kernels for GPU targets. This means PyTorch users who never write Triton code directly are still using Triton indirectly. Its penetration is far greater than it appears.

Ascend achieving 100% Triton interface compatibility with 600+ operator coverage sends a direct message: operators written in Triton run on Ascend without modification. This isn't simple "adaptation"—it's direct integration into the largest AI programming paradigm outside CUDA.

But one data point demands honest assessment: Triton operators on Ascend run at 0.6–0.9x the performance of native Ascend C.

What does this gap mean? In research and experimentation, 0.6–0.9x is perfectly acceptable: code runs, results validate, costs stay manageable. But in large-scale commercial inference, a 10–40% performance penalty can translate to millions in additional costs.
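To put rough numbers on the gap, here is a back-of-envelope calculation. It assumes that the compute needed for a fixed workload scales with the inverse of throughput, which ignores batching, utilization, and pricing differences; the function name is hypothetical.

```python
def extra_cost_fraction(relative_throughput: float) -> float:
    """Extra hardware (or time) needed to match a baseline's output.

    relative_throughput = 0.6 means the kernel delivers 60% of baseline speed.
    ASSUMPTION: cost scales inversely with throughput.
    """
    if not 0.0 < relative_throughput <= 1.0:
        raise ValueError("relative_throughput must be in (0, 1]")
    return 1.0 / relative_throughput - 1.0


for r in (0.6, 0.9):
    print(f"{r:.1f}x throughput -> {extra_cost_fraction(r):.0%} more compute")
```

Under that assumption, the slow end of the range costs noticeably more than the headline slowdown suggests, because a 40% throughput loss requires about two thirds more capacity to serve the same load.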

Huawei's bet is that as Ascend's Triton backend optimization iterates, this gap will steadily narrow. Triton's abstraction layer itself is evolving and may eventually expose more hardware-specific optimization interfaces. But for now, this is a real shortcoming—no marketing language can smooth it over.

TileLang: The Shadow of DeepSeek V4

TileLang is a Tile-level programming framework developed by the Peking University team, similar to Triton but with a different design philosophy. It was widely used in DeepSeek V4's operator work—and that's no coincidence.

DeepSeek V4's training and inference require numerous custom operators that must run efficiently across multiple hardware platforms. In TileLang's Developer mode, operator implementations for different platforms differ by only small amounts of code, so the same operator logic can be ported with little rework and is reported to run at near-native performance on each target.

Ascend supports TileLang with 300+ operators and 100% interface compatibility. This fact needs to be read alongside DeepSeek V4:

Ascend isn't "adapting to" DeepSeek V4—it's leveraging a cross-platform programming framework already validated by DeepSeek to minimize adaptation costs. Operators DeepSeek writes in TileLang inherently have cross-platform capability. Ascend just needs to optimize the TileLang backend, and DeepSeek's entire operator ecosystem becomes portable.

This is smart ecosystem strategy: don't build your own bridge—use one someone else has already proven.

CANNBot: The Automation Frontier of Operator Development

CANNBot is the operator agent Huawei demonstrated—3 hours to generate a Vector operator, 1 day from generation to deployment, 5x+ efficiency improvement.

This isn't a simple code generation tool. Operator development is one of the most labor-intensive parts of chip software ecosystems: each new operator requires understanding hardware architecture, optimizing memory access patterns, and handling edge cases. In a traditional workflow, an experienced engineer typically needs 1–2 weeks to develop and ship a new operator.

CANNBot compresses that cycle to 1 day. If this number is reproducible in real-world scenarios, the impact is structural:

The rate at which the operator ecosystem fills in is no longer bounded by human labor, but by the maturity of AI-assisted tools. NVIDIA's CUDA-Q is sometimes cited as a parallel, but it is a platform for hybrid quantum-classical programming rather than an AI operator generator, so the comparison is loose. Huawei has placed AI-assisted operator development in the context of general-purpose AI chips, with broader scope and greater urgency.

The risk: can auto-generated operators meet production-grade requirements? A Vector operator generated in 3 hours can run a demo, but will it hold up in large-scale training? This requires more third-party validation.

IV. Developer Experience: From "It Runs" to "It Works Well"

Running Your First Demo in 2 Minutes

The significance of this metric is severely underestimated.

What does it take to get a developer who has never touched Ascend running a demo in 2 minutes? Pre-configured community environments, instant compute resource allocation, zero-config sample projects, and immediate guidance from documentation and error messages. Every link in the chain needs to be engineered to near-invisibility.

This is no small undertaking. It requires Ascend's community to transform from "tool provider" to "experience provider." NVIDIA spent years optimizing CUDA Toolkit's out-of-box experience. Ascend needs massive engineering investment to catch up in a short time.

Compute Resource Deployment: 1,000+ Cards Free, 10,000 Cards Total

Huawei is offering 1,000+ Ascend cards of free compute, with an initial quota of 100 card-hours per person, and plans a total deployment of 10,000 cards. This scale directly competes with Google's TPU Research Cloud and NVIDIA's Digits program.

For comparison:

Google TPU Research Cloud: Free TPU compute for academic research, but long application processes and extended queue times.
NVIDIA Digits: Developer preview compute, limited coverage.
Ascend's approach: 100 card-hour initial quota means register and start—no approval process, instant access.

The strategy's sophistication lies in its target: not compute-hungry large teams (100 card-hours is far from enough for large model training), but individual developers and small teams who want to "try it out." Lowering the barrier to first contact matters more than providing massive free compute.

vLLM / SGLang Native Integration: Not "Second-Class Citizens"

These two data points deserve specific attention:

vLLM: Ascend is the only domestic (Chinese) hardware vendor with native integration into the main branch.
SGLang: Ascend is the only domestic (Chinese) non-GPU hardware vendor with native integration into the main repository.

"Native integration" and "adaptation layer" are fundamentally different. An adaptation layer means Ascend is an external add-on, with code maintained in forks or plugins. Native integration means Ascend is a first-class citizen in vLLM/SGLang, on the same code path as CUDA, receiving the same CI/CD, test coverage, and version synchronization.

The impact on developer experience is decisive: teams using vLLM for inference deployment don't need to change any configuration when switching to Ascend hardware—not "it can also run," but "it's natively supported."

Huawei also cites a 30% reduction in first-token latency for long sequences, a tangible performance highlight for Ascend in inference scenarios. In LLM deployment, first-token latency directly affects user experience, so a 30% improvement has immediate commercial value.

V. What Huawei Is Betting On: Strategic Wagers

Wager 1: The "Good Enough" Threshold for Operator Ecosystems

NVIDIA's CUDA ecosystem has tens of thousands of operators, covering nearly every parallel computing scenario from image processing to quantum simulation. Ascend doesn't need to match that scale.

The core operators for AI training and inference are concentrated in a limited number of categories: matrix multiplication, convolution, attention mechanisms, normalization, activation functions, communication primitives. Covering efficient implementations of these core operators covers 90%+ of real-world demand.

Huawei's bet: 800+ open-source operators + Triton/TileLang compatibility + community contributions can cross the "good enough" threshold. Once past that threshold, marginal returns from additional operators diminish sharply. The difference between CUDA's far larger operator catalog and CANN's 800+ may be negligible for mainstream large model training.

Wager 2: The Power of Open Source Community

Once CANN is open-sourced, community-contributed operators and optimizations may fill long-tail scenarios faster than Huawei's internal teams. This is open source's fundamental logic: users know best what they need.

But maintaining an open source community places new demands on Huawei's engineering capacity. Community expectations for response speed (issues addressed within 24–48 hours, PRs reviewed within a week) may exceed Huawei's traditional engineering cadence. Open source isn't "publish and forget"—it demands sustained community operations, transparent roadmaps, and respect and incentives for community contributions.

Huawei's open source experience comes primarily from MindSpore and openEuler, but CANN's developer community is more low-level, more specialized, and more demanding. This is a new test.

Wager 3: The Time Window

NVIDIA has faced two constraints since 2024–2025: GPU production bottlenecks extending delivery timelines, and tightening U.S. export control policies. These factors have created a demand window for non-NVIDIA compute in global markets (particularly China and Southeast Asia).

But the window won't stay open forever. NVIDIA's capacity is expanding, and the release cadence of Blackwell Ultra and Rubin is accelerating. Once supply-demand balance is restored, the cost for users to flow back is low, because the CUDA ecosystem is still there.

Huawei needs to accomplish two things within this window: first, elevate the software ecosystem from "usable" to "pleasant"; second, build enough user stickiness (through open source community, developer tools, compute cost advantages) that users don't leave en masse once the window closes.

VI. Risks: Questions That Can't Be Dodged

Risk 1: The 0.6–0.9x performance gap.

Triton operators on Ascend run at 0.6–0.9x the performance of native Ascend C. In research and experimentation, this is acceptable. In large-scale commercial inference, a 10–40% throughput gap means roughly 11–67% more compute for the same workload, since required capacity scales with the inverse of throughput. For companies already running optimized pipelines in the NVIDIA ecosystem, this gap is the biggest barrier to migration.

Huawei needs to demonstrate that this gap is narrowing rapidly—not through slide decks, but through reproducible public benchmarks.
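What "reproducible" means in practice is mostly procedural: fixed warmup, repeated runs, a robust statistic, and published scripts. The sketch below shows one minimal harness shape using only the Python standard library. The two workloads are stand-ins (hypothetical), not real operators; a real comparison would call the native and the portable operator builds instead.

```python
import statistics
import time
from typing import Callable, List


def bench(fn: Callable[[], object], warmup: int = 3, repeats: int = 10) -> List[float]:
    """Return per-run wall-clock times in seconds, after discarding warmup runs."""
    for _ in range(warmup):
        fn()  # warm caches and trigger any lazy compilation
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times


def relative_speed(baseline: List[float], candidate: List[float]) -> float:
    """Median-based ratio: 0.8 means the candidate runs at 0.8x baseline speed."""
    return statistics.median(baseline) / statistics.median(candidate)


# Stand-in workloads (hypothetical placeholders for two builds of one operator).
def native_op() -> int:
    return sum(i * i for i in range(20_000))


def portable_op() -> int:
    return sum(i * i for i in range(30_000))


ratio = relative_speed(bench(native_op), bench(portable_op))
print(f"portable path runs at {ratio:.2f}x native speed")
```

On an accelerator, the timed callable would also need to wait for the device to finish before the clock stops, since kernel launches are typically asynchronous. Reporting the median rather than the best run makes results harder to cherry-pick.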

Risk 2: Community governance post-open-source.

Open-sourcing 50+ repositories simultaneously means Huawei needs to maintain large-scale community operations. Issue floods, PR backlogs, version compatibility management—each is an organizational engineering challenge. If the community experience is poor, open source backfires: developers try once, have a bad experience, leave, and tell their peers.

Risk 3: NVIDIA won't sit still.

CUDA iterates fast. Every NVIDIA GPU architecture generation brings new software features and optimizations. While Ascend chases CUDA, CUDA itself is rapidly evolving. This isn't a static target—it's a fast-moving one.

Potential NVIDIA counter-strategies include: accelerating CUDA support for new model architectures (getting new operators into CUDA faster), tightening control over Triton (Triton is OpenAI-led, and NVIDIA has deep cooperation with OpenAI), and reinforcing CUDA's moat in developer communities.

VII. Key Validation Checkpoints

To assess the actual impact of CANN's open source, watch for three independent signals:

1. Public benchmark results for DeepSeek V4 inference on Ascend.

DeepSeek is one of the most influential domestic large models. If DeepSeek V4 achieves near-NVIDIA inference performance on Ascend—and the process is transparent and reproducible—it would be the strongest possible ecosystem validation.

2. Third-party independent verification of Triton/TileLang compatibility.

Huawei's published compatibility data needs independent validation. Specifically: for a team with no prior Ascend experience, how many Triton/TileLang operators run successfully on Ascend? What's the performance? What's the debugging experience when issues arise? This information can only come from third-party usage reports.

3. Actual community contribution volume and issue response speed in the open source repos.

Community activity in the first 3–6 months after open-sourcing is the key metric: how many PRs came from non-Huawei developers? What's the average issue response time? How many issues were closed? These numbers don't reflect technical capability—they reflect community operations ability and Huawei's sincerity of investment.
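These three metrics are simple to compute once PR and issue records are exported. The sketch below uses a handful of made-up records purely to show the calculation; the field names (author_org, opened, first_reply, closed) are hypothetical and would need to be mapped from whatever the hosting platform actually provides.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records for illustration only; not real repository data.
prs = [
    {"author_org": "huawei"},
    {"author_org": "external"},
    {"author_org": "external"},
    {"author_org": "huawei"},
]
t0 = datetime(2026, 1, 1, 9, 0)
issues = [
    {"opened": t0, "first_reply": t0 + timedelta(hours=5), "closed": True},
    {"opened": t0, "first_reply": t0 + timedelta(hours=30), "closed": False},
    {"opened": t0, "first_reply": None, "closed": False},  # still unanswered
]


def external_pr_share(prs):
    """Fraction of PRs authored outside Huawei."""
    return sum(p["author_org"] != "huawei" for p in prs) / len(prs)


def median_first_reply_hours(issues):
    """Median hours to first reply, over issues that received one."""
    waits = [
        (i["first_reply"] - i["opened"]).total_seconds() / 3600
        for i in issues
        if i["first_reply"] is not None
    ]
    return median(waits) if waits else None


def close_rate(issues):
    """Fraction of issues that have been closed."""
    return sum(i["closed"] for i in issues) / len(issues)


print(external_pr_share(prs), median_first_reply_hours(issues), close_rate(issues))
```

One design caveat: unanswered issues are excluded from the median, which flatters the response time. A fuller report would publish the unanswered count alongside it.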

Conclusion: From "Build Ecosystem" to "Enter Ecosystem"

CANN's three strategic pivots reflect Huawei's deepening understanding of the nature of competition in chip software ecosystems:

The first time, Huawei tried to replicate CUDA's success by building a closed-loop ecosystem. It didn't work, because CUDA's advantage isn't just technical: it's nearly 20 years of accumulated developer habits that can't be displaced by an equivalent but incompatible system.

The second time, Huawei pivoted from confrontation to compatibility—embracing PyTorch and lowering migration costs. The direction was right, but Ascend was still the destination, asking developers to "migrate here."

The third time, "come to me" became "I'll come to you." Open-sourcing CANN, supporting Triton/TileLang/PyTorch, letting developers use tools they already know while Ascend runs underneath. Ascend is no longer a platform you "migrate to"—it's an underlying accelerator that permeates developers' existing workflows.

From "build ecosystem" to "enter ecosystem"—one word's difference, but a complete restructuring of strategic logic.

Whether it works depends on execution. Open source is just the beginning. Community operations, performance optimization, developer experience, ecosystem trust: every link requires sustained investment and transparency. Ascend has taken the most critical step, but the road ahead is long.

CUDA's moat wasn't built in a day, and it won't be filled in a day. But CANN's full open source at least shows that Ascend has found the right direction: not building another castle outside the moat, but extending the bridge into the other city.

This article is a technical analysis based on publicly available information from KADC 2026. All cited data comes from Huawei's official releases; independent verification is pending. The judgments expressed represent the author's views and do not constitute investment advice.