Heart of the Super-Node: How the Lingqu Service Layer Weaves 8,192 Cards Together

Part 2 · Lingqu Software Deep Dive Series

The previous article analyzed the Lingqu kernel layer—how the UB OS Component enables Linux to understand the super-node. The kernel layer solves the problem of "the system can see the hardware," but it doesn't solve the problem of "how the hardware works together."

8,192 NPUs, each with its own HBM and firmware. Without a service layer, they are 8,192 isolated islands. The official name of the Lingqu service layer is UB Service Core (UBS Core), open-sourced in the openEuler community. Five core sub-components: Engine, Mem, Comm, IO, and Virt. These services operate on physical memory addresses, DMA descriptors, and NPU compute queues—not HTTP requests and JSON.

This article dissects each of the five sub-components in turn, then analyzes the full application pipeline from training to inference, and finally compares against NVIDIA's counterpart solutions.

1. UBS Engine: The Distributed Brain of the Super-Node

1.1 What Engine Manages

Resource allocation, fault isolation, and dynamic scheduling across 8,192 cards require a control plane. UBS Engine is that control plane—the "distributed OS kernel" for the entire super-node, managing the global view and dynamic scheduling of hardware resources.

1.2 Architecture

Distributed self-election, no dependency on external coordination services. No ZooKeeper, no etcd. Engine handles leader election on its own. This means it doesn't introduce additional operational components, but it also means the correctness and performance of the election algorithm must be entirely guaranteed by Huawei.

N-1 high availability. N management nodes, tolerating up to N-1 simultaneous failures. This is an extremely aggressive fault-tolerance level—it means as long as one management node survives, the entire super-node's control plane stays operational. The cost is the resource overhead of management nodes and the complexity of the consensus protocol.

Resource pool management. Memory, NPU compute, DPU accelerators, SSU storage units—all hardware resources are abstracted into schedulable pools within Engine. Upper-layer services (Mem, Comm, IO) request resources from Engine, which allocates them based on topology affinity and load.

Software-defined compute. On-demand composition and allocation of resources—you can combine memory from node A + NPU from node B + SSU from node C into a logical compute unit that appears to upper-layer applications as a single execution environment.

1.3 Division of Labor with UBM

The kernel-layer UBM (UB Manager) runs as firmware on a dedicated chip, managing hardware topology—physical-layer link discovery, routing configuration, and fault isolation.

UBS Engine manages the logical layer—on top of the physical topology discovered by UBM, it performs logical scheduling and allocation of resources.

The benefit of this two-layer separation: when physical topology changes (e.g., removing a UB Switch), UBM automatically reconfigures routing, and Engine only needs to update its resource view. Engine doesn't need to understand the physical details of the UB protocol.

1.4 Open Questions

Several key questions remain unanswered:

How long does leader election take? In an 8,192-card cluster, what is the consensus protocol latency between management nodes? If the current leader fails, how long does it take for a backup to take over?
Resource view update frequency. NPU utilization, memory occupancy, link bandwidth—how often does Engine collect these? Is the granularity per-node or per-device?
Cross-super-node scheduling. One Engine manages one 8,192-card super-node. If two super-nodes need to coordinate (e.g., cross-Pod AllReduce), who orchestrates that?

The answers to these questions directly impact Lingqu's reliability in production environments.

2. UBS Mem: Turning 128 Cabinets of Memory into One Pool

2.1 512TB Spread Across 128 Cabinets

8,192 NPUs, each with 64GB HBM, totaling approximately 512TB. Physically distributed across 128 cabinets, each with its own CPU and DDR. During training, model parallelism requires cross-cabinet parameter sharing; during inference, KV Cache requires cross-instance context sharing. RDMA can achieve this but with microsecond-level latency (2-5μs), and requires the application to explicitly manage memory registration, QP connections, and data movement.

Lingqu's approach: make remote memory look like local memory.

2.2 Three Layers of Service

SHMEM (Shared Memory). Multiple processes/nodes access the same physical memory block. The programming interface is similar to POSIX shared memory, but underlying data movement is handled by Lingqu hardware without OS involvement. Typical use cases: model parameter servers, embedding table sharing.

MemCache (Global Memory Cache). Similar to a distributed LRU cache, but the caching medium is remote memory rather than disk. When local HBM is insufficient, cold data is automatically migrated to idle HBM on remote nodes. Transparent to the application—the allocator thinks it's still getting data from local HBM, when it may actually come from an NPU three cabinets away.

MemStore (Persistent Memory Store). Similar to a distributed KV store, but the backend is Lingqu-interconnected memory rather than SSD. Use cases: high-frequency metadata, small indexes, inference request state information.

2.3 MemFabric: The Memory Weaving Layer

The foundation underlying all three services is MemFabric—a middleware layer that weaves physically distributed memory resources into a unified address space.

What MemFabric does:

Global address allocation. Assigns a globally unique address to each physical memory block (based on the UB virtual address format {NodeID, UASID, VA})
Access routing. When an application accesses a global address, MemFabric knows which physical node that address resides on and completes data movement via Lingqu hardware
Consistency management. Data ownership management in shared memory scenarios—at any given moment, only one owner can write to a shared memory block, but multiple nodes can read simultaneously

MemFabric is not a concept unique to Lingqu. CXL Type 3 devices are doing something similar (the CXL.mem protocol), but CXL's coverage is intra-cabinet (limited by CXL switch cascading), while Lingqu's coverage spans the entire super-node (128 cabinets).

2.4 Key Performance Data

Confirmed performance data:

Operation	Latency	Comparison
Cross-node memory access (Lingqu)	300-400ns	~2x slower than cross-NUMA, 5-10x faster than RDMA
Memory borrowing overhead	<5% (at 25% borrowing ratio)	Essentially transparent to applications
Shared memory mapping time	2-5s	First mapping only; subsequent access uses hardware path

300-400ns cross-node memory access is the core performance selling point of the Lingqu service layer. This means a process running in cabinet A can directly read data from an NPU's HBM in cabinet Z with a single load instruction, with latency only ~2x that of a cross-NUMA access.

But note: this is hardware-level latency and does not include MemFabric's software overhead. In real applications, you also need to add address translation, permission checking, and consistency protocol overhead. Effective latency under real workloads needs to be validated with production data.

2.5 Comparison with CXL Memory Pooling

Dimension	Lingqu MemFabric	CXL Type 3 Pooling
Coverage	Entire super-node (128 cabinets)	Intra-cabinet (CXL switch cascading)
Latency	300-400ns	100-200ns (intra-cabinet)
Programming model	Unified address space + NUMA extension	CXL.mem protocol + upper-layer software management
Ecosystem openness	Huawei hardware only	Broad support across chip/server vendors
Maturity	v26.03 just integrated into K8s	PCIe 6.0 CXL 3.1 spec published

Lingqu has wider coverage but higher latency; CXL has lower latency but narrower coverage. They are not directly competing—Lingqu solves "cross-cabinet memory unification," while CXL solves "intra-cabinet memory pooling." Theoretically they can coexist: CXL within cabinets, Lingqu between cabinets. But no practical implementation of such a hybrid approach has been seen yet.

3. UBS Comm: Replacing Ethernet and InfiniBand with the Lingqu Bus

3.1 HCCL: Huawei's Collective Communication Library

The counterpart to NVIDIA's NCCL. AllReduce, AllGather, Broadcast, ReduceScatter—these collective communication operations are the core communication patterns in large model training. HCCL's performance directly determines training MFU (Model FLOPs Utilization).

HCCL's communication path on the Lingqu super-node:

PyTorch/MindSpore → torch_npu/CANN → HCCL → URMA → Lingqu hardware

Compared to NVIDIA's path:

PyTorch → NCCL → cuMLIB → NVLink (intra-node) / IB (inter-node) → Ethernet/IP

The advantage of the Lingqu path: unified protocol stack. Both intra-node and inter-node communication goes through URMA, with no need to switch protocols between NVLink and IB as NVIDIA does. HCCL doesn't need to care whether the communication peer is in the same cabinet or three cabinets away.

But this advantage rests on the topology-agnostic nature of the Lingqu interconnect—uniform enough latency and bandwidth between any two UBPUs. If there are obvious hotspots (e.g., inter-cabinet links becoming bandwidth bottlenecks), HCCL's ring/tree algorithms would need topology-aware optimization, and the complexity would be no lower than NCCL's.

3.2 URPC and Socket over UB

URPC: High-performance RPC based on URMA. The programming model is similar to gRPC, but the underlying transport uses Lingqu shared memory instead of TCP. Suitable for low-latency cross-node calls in microservice architectures.

Socket over UB: Transparent acceleration for traditional TCP applications. Application-layer code remains unchanged; socket calls are redirected to the Lingqu bus. Latency drops from TCP's hundreds-of-microseconds range to 2-3μs.

Verbs over UB (RoUB): Traditional RDMA applications (MPI, NCCL, etc.) can run on the Lingqu bus without code modifications. This is a compatibility layer that enables existing RDMA ecosystem applications to migrate to Lingqu hardware with zero changes.

The significance of the compatibility layer lies in lowering migration barriers. But the performance ceiling of the compatibility layer depends on how efficiently Lingqu hardware simulates RDMA semantics. If RoUB merely "works" but underperforms compared to native URMA, users will eventually need to go through the SDK adaptation path.

3.3 Published Communication Performance Data

Published communication performance data:

UB container network latency: 1.7-2.5μs (90% lower than TCP)
Verbs over UB latency: close to native RDMA levels

Missing key data:

HCCL AllReduce actual bandwidth utilization at 8,192-card scale
Communication performance differences across topologies (ring vs tree vs adaptive)
Communication stability during long training runs (jitter, slowdowns)

4. UBS IO: NPUs Bypassing CPU to Access Storage Directly

4.1 NPU Direct: A Killer Capability for Inference

NPU Direct SSD/SSU is the most valuable capability of the Lingqu IO layer. NPUs bypass the CPU and system DRAM to directly read from and write to SSDs or Storage Service Units (SSUs).

During the decode phase of large model inference, the HBM occupied by KV Cache grows linearly with sequence length. A 70B model with 128K context and batch size 32 can have KV Cache consuming 40-60GB HBM—nearly the entire HBM of a single 910B card. If KV Cache can be flushed directly from HBM to SSD, the inference memory bottleneck is opened up.

Traditional path: NPU HBM → PCIe → CPU DDR → NVMe SSD. Two memory copies, two PCIe transfers, plus CPU DMA management overhead.

Lingqu path: NPU HBM → UB → SSU/SSD. One Lingqu transfer, bypassing the CPU entirely.

This capability is not unique to Lingqu. NVIDIA's GPUDirect Storage does the same thing. But GPUDirect Storage uses the NVMe-oF protocol (based on RDMA or TCP), requiring NVMe target configuration, network topology setup, and permissions management. Lingqu's NPU Direct uses the UB protocol, which requires no additional network configuration within the super-node—because the Lingqu bus already connects NPUs and SSUs together.

4.2 IO Cache and UBS OVS

IO Cache: Global read-write cache. Frequently accessed data blocks (e.g., model weight files, tokenizer vocabularies) can be cached in Lingqu-interconnected remote memory, reducing SSD reads.

UBS OVS: Virtual network switching. Similar to Open vSwitch, but the data plane runs on the Lingqu bus. Used for network isolation of containers and virtual machines.

These two components are infrastructure completeness features and do not constitute core differentiation.

5. UBS Virt: The Cross-Node "Super VM"

5.1 Combining Multiple Physical Machines into One VM

Traditional virtualization divides one physical machine into multiple VMs. Lingqu's virtualization does the opposite—combining multiple physical machines into one VM.

The "Super VM" leverages Lingqu's unified memory addressing to let a single VM span multiple physical nodes. Processes within the VM see a memory address space that crosses DDR and HBM across multiple cabinets.

5.2 Technical Approach

UB device passthrough to VM (similar to SR-IOV / vfio): VMs directly access Lingqu devices
Cross-node device passthrough: VMs can use NPUs/SSUs on other physical nodes
VirtAgent: Virtualization management agent responsible for cross-node device coordination
Super Container / Super Process: Similar to Super VM but at a finer granularity

5.3 Boundary: A Selling Point for General Compute, a Supporting Role for AI

The core scenario for Super VM is general computing—traditional databases and data warehouses that need extremely large memory spaces. GaussDB has already adapted to the TaiShan 950 super-node's pooled memory, using cross-node unified addressing to aggregate distributed DDR into a single address space, avoiding database sharding.

In AI scenarios, containers (K8s + openFuyao) are the mainstream deployment approach. The cross-node memory provided by Super VM offers limited benefit for inference—the bottleneck in inference is NPU HBM capacity and KV Cache management, not CPU-side memory size. In training scenarios, Super VM's isolation granularity is also too coarse. UBS Virt is a differentiated feature for general computing within the Lingqu software stack, not a core component for AI.

6. Application Integration: The Full Pipeline from Training to Inference

The value of the Lingqu service layer is ultimately reflected in whether applications can run, and run well. From public materials, two main application pipelines can be reconstructed.

6.1 Training Pipeline

PyTorch/MindSpore
    ↓
torch_npu / MindSpore CANN plugin
    ↓
CANN (Huawei Compute Acceleration Library)
    ↓
HCCL (collective communication)
    ↓
UBS Comm → URMA → Lingqu hardware

Verified: Huawei's collaboration with DeepSeek proved the feasibility of MoE architecture training on the Ascend platform. Specifically verified:

Communication efficiency of MoE's dynamic expert routing on the Lingqu super-node
Operator adaptation of DeepSeek's Multi-head Latent Attention on Ascend NPUs
End-to-end training MFU levels

Still missing:

Adaptation progress of third-party training frameworks (Megatron-LM, DeepSpeed)
Training efficiency of non-MoE architectures (dense models) on Lingqu
A complete case study of training a model from scratch (including data processing, checkpoint management, resumable training)

6.2 Inference Pipeline

Inference framework (vLLM-Ascend / MindIE)
    ↓
CANN
    ↓
UBS IO (NPU Direct SSD/SSU) + UBS Mem (pooled memory)
    ↓
Lingqu hardware

Inference scenarios depend more heavily on the Lingqu service layer:

KV Cache offloading. During the decode phase, KV Cache is flushed directly from NPU HBM to SSD/SSU, bypassing the CPU. This is a critical optimization for long-context inference.

Memory pooling for inference decode. Leveraging UBS Mem's shared memory, multiple inference instances can share the same model weights. Each instance only needs to store its own KV Cache and activation values.

Parallel inference for Speculative Decoding. Leveraging Lingqu's low-latency interconnect, multiple small models can perform speculation in parallel while the main model verifies.

6.3 Adaptation Status and Gaps

Already working:

PyTorch + torch_npu + CANN: training is basically functional
MindSpore + CANN: Huawei's in-house framework, deepest adaptation
GaussDB: already adapted to super-node pooled memory
openFuyao + K8s: container orchestration verified

In progress:

DeepSeek series: Huawei's priority adaptation target
vLLM-Ascend: inference engine adaptation underway (vllm-ascend 0.11~0.14)
Mooncake KVCache: distributed cache V3 architecture has partially merged capabilities

Notable gaps:

TensorFlow / JAX and other non-PyTorch frameworks
Non-openEuler operating systems (Ubuntu/Debian/RHEL)
Running Lingqu service layer on non-Huawei hardware
Third-party ISV application adaptation

7. Comparison with NVIDIA's Service Layer

Dimension	Lingqu UBS Core	NVIDIA (NCCL + GPUDirect + MIG)
Control plane	UBS Engine (distributed self-election)	No unified control plane; relies on upper-layer scheduling
Memory pooling	MemFabric (across 128 cabinets)	GPUDirect (intra-node NVLink, inter-node CX)
Communication library	HCCL + URMA	NCCL + NVLink/IB
Storage direct access	NPU Direct SSD/SSU	GPUDirect Storage (NVMe-oF)
Virtualization	UBS Virt (Super VM)	MIG (GPU partitioning)
Programming model	Unified (URMA full stack)	Stitched (NVLink + PCIe + IB + GPUDirect)
Ecosystem	Huawei ecosystem	Entire industry

Lingqu's advantage lies in "unification"—a single protocol stack spanning intra-node and inter-node, so applications don't need to care about the physical location of communication peers. NVIDIA's advantage lies in "ecosystem"—every segment has mature standards and open-source implementations, with adaptation costs far lower than Lingqu's.

The core judgment of this comparison is the same as for the kernel layer: the cost of unification is closedness; the cost of stitching is complexity. Lingqu has made correct technical choices at the software level (unified programming model, unified memory space), but the commercial value of these choices depends on ecosystem scale. If only Huawei's own hardware uses Lingqu, the advantage of unification is offset by the disadvantage of closedness.

8. Open-Source Substitution Assessment for the Service Layer

Drawing on the open-source substitution analysis materials from this series, here is the substitutability assessment for each service layer component:

Component	Substitution available	Key bottleneck
UBS Engine	No direct substitute (etcd/consul can cover some functions)	Deep binding with Lingqu hardware
MemFabric	Approximate (CXL Type 3, Lambda's pooled memory solution)	Coverage far smaller than Lingqu's
HCCL	Yes (NCCL, RCCL, oneCCL)	Backend needs adaptation to Lingqu URMA
NPU Direct	Comparable (GPUDirect Storage)	GPUDirect uses NVMe-oF, doesn't depend on Lingqu
UBS Virt	Approximate (Kata Containers, cross-node CXL)	Super VM has a narrow use case

The most irreplaceable component is MemFabric. Unified memory addressing across 128 cabinets at scale has no equivalent solution on the market today. CXL's coverage is insufficient, RDMA's latency isn't low enough—no other technology can deliver cross-cabinet memory pooling at 300-400ns latency.

The most easily substituted are HCCL and UBS IO. NCCL is already the de facto standard, and GPUDirect Storage is in widespread use. If third-party chip vendors want to bypass Lingqu, these two components have mature alternative solutions.

9. Service Layer Functional Specification Gaps

Part 1 derived 32 FRs from the kernel layer perspective. The service layer sits above the kernel layer, solving the problem of "how hardware works together." Based on the analysis in the preceding eight sections of this article, the service layer has its own independent functional specification gaps.

9.1 Control Plane

ID	Functional Specification	Derivation Source	Status
SL-1	UBS Engine leader election time is measurable; backup takeover after leader failure < SLA	§1.4 Leader election time not publicly disclosed	❓
SL-2	Resource collection granularity configurable: node-level vs device-level, collection frequency adjustable	§1.4 Collection granularity not publicly disclosed	❓
SL-3	Cross-super-node scheduling: coordination mechanism for multiple 8,192-card super-nodes	§1.4 Who coordinates cross-Pod communication	❓
SL-4	Engine global view consistency latency: time from resource state change to cluster-wide visibility	§1.2 Distributed self-election + N-1 HA consistency cost	❓

9.2 Memory Services

ID	Functional Specification	Derivation Source	Status
SL-5	MemFabric software overhead is observable: how much is added on top of the pure hardware 300-400ns	§2.4 Hardware-level latency excludes software overhead	❓
SL-6	Shared memory owner switch latency's software protocol overhead is quantifiable	§2.3 Consistency management—set_ownership is a software protocol	❓
SL-7	MemCache eviction policy is configurable (LRU/LFU/custom), eviction trigger conditions are transparent	§2.2 "Similar to distributed LRU cache" but policy not disclosed	❓
SL-8	MemStore persistence semantics: is data recoverable after power loss	§2.2 "Persistent memory store" but backend is memory, not SSD	❓

9.3 Communication Services

ID	Functional Specification	Derivation Source	Status
SL-9	HCCL AllReduce actual bandwidth utilization at 8,192-card scale	§3.3 Key data missing	❓
SL-10	HCCL topology awareness: selection logic and switchover conditions for ring/tree/adaptive strategies	§3.1 Topology-agnostic assumption needs verification	❓
SL-11	Communication stability during long training runs: conditions causing jitter and slowdowns	§3.3 No public data	❓
SL-12	RoUB compatibility layer performance overhead: latency penalty of mapping verbs API to URMA	§3.2 Compatibility layer performance ceiling	❓

9.4 IO Services

ID	Functional Specification	Derivation Source	Status
SL-13	NPU Direct transfer bandwidth and latency: actual HBM→SSU throughput	§4.1 Path description only, no data	❓
SL-14	NPU Direct concurrency: how many NPUs can simultaneously connect to the same SSU	§4.1 Multi-instance shared storage scenario	❓
SL-15	IO Cache hit rate is observable, cache capacity and eviction policy are transparent	§4.2 "Global read-write cache" but no details	❓

9.5 Service Layer Specification Summary

Status	Count	Percentage
✅ Implemented	0	0%
❓ Not publicly disclosed	15	100%
❌ Missing	0	0%

All 15 items are undisclosed. This pattern is consistent with Part 1's kernel layer—clear architectural blueprints, blank engineering details.

But the service layer's "undisclosed" status is more serious than the kernel layer's. The kernel layer's 11 "implemented" items at least prove core capabilities are running. For the service layer's five sub-components, the only public data points are latency benchmarks (300-400ns, 1.7-2.5μs) and the borrowing ratio (25%). Leader election time for the control plane, software overhead for memory services, bandwidth utilization for communication, and transfer throughput for IO—four sets of data essential for production deployment, none available.

Part 3 will consolidate the functional specifications (SR-1 through SR-41) from the cloudification layer perspective, merging the kernel layer FRs and service layer SLs into a full-stack view.

The next article will analyze the Lingqu cloudification layer—how openFuyao packages Lingqu capabilities for K8s users, the commercial value of InferNex inference cluster orchestration, and an overall strategic assessment of the Lingqu software stack.

Sources: 灵衢系统软件架构&部署公开课 PPT（牛涛）、openEuler UB Service Core 白皮书 2.0、APNet'21 "Huawei UB: Towards Compute-Native Networking" 技术报告（Bojie Li）、华为全联接大会 2025 徐直军主题演讲、comentropy 超节点产业链分析、KADC 2026 openFuyao 分论坛、openFuyao v26.03 Release Notes、ubs-core GitHub 仓库（atomgit.com）