Part 1 · Lingqu Software Deep Dive Series
The kernel software stack for the Lingqu (UnifiedBus) super-node must answer a hard question: Are 8,192 cards, 128 cabinets, and optical interconnects spanning over a hundred meters "one machine" or "128 machines" in the eyes of the Linux kernel?
If the super-node is one machine, Linux needs a brand-new bus type, a driver for cross-node address translation hardware, and memory management that goes beyond NUMA—none of which existing kernel subsystems can provide. If it's 128 machines, you just need a faster network and standard RDMA for the software.
Lingqu chose the former. It must implant thesis-level changes into the Linux kernel: the ub_bus_type bus, the UMMU address translation driver, cross-node memory borrowing/sharing, and URMA communication primitives. The prerequisite for doing this is a fully self-developed interconnect hardware stack—physical layer (112G/224G SerDes), protocol stack (10-part specification), and chips (UB Switch, UMMU, UDMA) are all self-developed. In the 2026 data center interconnect landscape, only two companies cover the full stack from physical layer to software: NVIDIA (NVLink+NVSwitch+IB) and Huawei (Lingqu UB).
NVIDIA takes a "multi-layer protocol splicing" approach—NVLink handles intra-node, PCIe handles devices, IB handles inter-node, and GPUDirect handles storage. Four sets of drivers piece together the complete path, with CUDA unifying the upper-layer API. Lingqu takes a "single-protocol convergence" approach—all interconnects use the UB protocol, and one set of drivers covers the entire path. The cost of splicing is complexity; the cost of convergence is lock-in.
This article focuses on the kernel layer: deployment design → UB OS Component module-by-module anatomy → NVIDIA comparison → capability boundary analysis. The next two articles will analyze the service layer and the cloudification layer.
1. Deployment Design: Two Paths, One ISO
1.1 General-Purpose Compute Super-Node (Kunpeng 950 SuperPoD)
Hardware installation. UB Switch and optical module physical connections.
OS installation. Standard openEuler ISO. Design decision: general-purpose servers and super-nodes share the same ISO. For general-purpose servers, installation is complete. For super-nodes, the UB OS Component is patched on top: seven kernel-mode drivers (UBMM/NPU/DPU/SSU/UDMA/UMMU/UNIC), the userspace toolchain ubctl, Lingqu network configuration (topology discovery, routing table initialization), and UBM boot.
On-demand service deployment. UBS Engine (control plane core), and on-demand UBS Mem/Comm/IO/Virt.
The engineering intent of sharing the ISO is clear: general-purpose servers and super-nodes can coexist in the same data center, and operations doesn't need to maintain two OS images. But the "extra patch" conceals the actual scale—UB OS Component is a systematic extension of the kernel bus model, memory management subsystem, and communication stack, not just installing a few .ko files. During a major kernel version upgrade (openEuler 24.03 → next LTS), the regression testing scope for UB OS Component adaptation is far larger than an ordinary driver update. Sharing the ISO reduces initial deployment risk, but does not reduce long-term maintenance complexity.
1.2 AI Compute Super-Node (Atlas 950 SuperPoD)
A higher degree of factory pre-installation: UBM and PodManager are built into standalone chips at the factory, and Atlas 850/850E ship with Lingqu bus devices pre-installed. Users don't install UB Switches and optical modules themselves.
OS is similarly openEuler ISO, with additional deployment of Lingqu data plane and development plane components + UBS Engine.
DeviceOS firmware-OS separation. NPU firmware (DeviceOS) includes its own Lingqu drivers; HostOS only installs userspace services and development toolchains. NPU firmware upgrades are independent of HostOS upgrades.
The cost is a three-layer version matrix:
| Layer | Content | Upgrade Frequency | Transparency |
|---|---|---|---|
| DeviceOS firmware | UB OS Component inside NPU | Follows NPU driver version | Least transparent |
| UBM firmware | Topology manager on standalone chip | Follows hardware firmware cycle | Upgrade/rollback not disclosed |
| HostOS software | Userspace services + development toolchain | Follows openEuler version | Most transparent |
Can the three layers be upgraded independently? Who tests cross-version compatibility? What happens when a DeviceOS upgrade introduces a new UB protocol feature that the UBM firmware doesn't support? What if the HostOS UBS Engine version doesn't match? The cost of Lingqu turning 128 cabinets into "one machine" is that this "machine" has three independently evolving firmware/software layers, and the complexity of the version compatibility matrix is O(n³).
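The O(n³) growth can be made concrete with a small sketch. The version labels below are hypothetical placeholders, not real Lingqu release names; the only point is that the number of combinations to validate is the product of the per-layer version counts.

```python
from itertools import product

# Hypothetical version labels for each independently upgraded layer.
device_os = ["d1", "d2", "d3"]
ubm_fw = ["u1", "u2", "u3"]
host_os = ["h1", "h2", "h3"]

def combinations_to_validate(*layers):
    """Return every cross-layer version combination."""
    return list(product(*layers))

matrix = combinations_to_validate(device_os, ubm_fw, host_os)
print(len(matrix))  # 27 combinations for 3 versions per layer
```

With n supported versions per layer the list has n³ entries, which is why a manually maintained matrix stops being practical after a few release cycles.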
1.3 Hidden Costs of Deployment
- Topology planning. 160 cabinets in total (the 128 compute cabinets plus dedicated interconnect cabinets), with full optical interconnect between cabinets. Topology structure directly affects communication performance. The communication pattern of 8,192-card AllReduce determines the optimal topology: fat trees favor all-to-all, mesh favors neighbor communication. Which topologies does the Lingqu UB Switch support? Is routing programmable? No details are available.
- UBM convergence scale. Topology discovery and routing convergence speed for 8,192 UBPUs—zero public data.
- Leader election latency. UBS Engine provides N-1 fault tolerance. How long does leader election take? This directly impacts operations SLA.
- Fault isolation granularity. When one UBPU fails, does UBM isolate the single device or the entire cabinet? Single-device isolation → how many routing table entries are affected by the update? Cabinet-level isolation → how much compute capacity is lost?
2. UB OS Component: Implanting Super-Nodes into the Linux Kernel
| Module | Kernel Subsystem Extended | Core Capability |
|---|---|---|
| Device Mgmt | bus/device/driver model | ub_bus_type, UBPU device discovery and hot-plug |
| Memory Mgmt | NUMA/DMA/SVA | Cross-node memory borrowing and sharing, UMMU address translation |
| Communication | — (new middleware) | URMA communication primitives (load/store/dma/rpc) |
| Virtualization | KVM/vfio | UB device passthrough to VMs, super VMs |
| UBM (firmware) | — (standalone chip) | Topology discovery, routing, fault isolation |
2.1 Device Mgmt: ub_bus_type
The Linux device driver model is a three-layer bus_type → device → driver structure. PCIe has pci_bus_type, USB has usb_bus_type. CXL doesn't have an independent bus type—it layers CXL.mem/CXL.cache on top of PCIe enumeration. Lingqu directly adds a new ub_bus_type.
Two design philosophies: CXL takes the incremental route (physical layer is PCIe, reuses PCIe enumeration, layers memory semantics on top). The benefit is upstream Linux acceptance—the kernel has been gradually merging CXL subsystem since 5.12, and by 6.13 it's quite complete. Lingqu takes the restructuring route (physical layer is not PCIe, directly defines an independent bus). The architecture is clean, but the cost is it will never get into Linux upstream.
Upstreaming is impossible. Kernel merge criteria: multi-vendor, multi-product, general value. drivers/accel/ was accepted because several vendors, including Intel, Qualcomm, and AMD, have hardware behind it. ub_bus_type only has Huawei hardware, and Lingqu hardware isn't sold retail (only shipped with Atlas systems). Third parties can't test or contribute.
Consequences:
- Kernel components will always live in an openEuler fork, maintained solely by Huawei
- Merge conflicts during major kernel version upgrades are Huawei's problem alone
- The window between CVE publication and fix in the openEuler fork depends on Huawei's response speed
- Could CXL subsystem updates conflict with ub_bus_type memory management code?
The openEuler fork is the hidden tax of the Lingqu kernel layer. NVIDIA's closed-source drivers also have maintenance costs, but they run on the standard kernel without needing a fork. Lingqu chose the fork.
UBPU peer model. All devices on the Lingqu bus are called UBPU—CPU, NPU, DPU, SSU all have equal status. Any UBPU can initiate communication, access other UBPU's memory, and borrow resources. The peer model has no single bottleneck point (unlike PCIe's root complex), but routing table scale is O(n²)—a fully connected routing table for 8,192 UBPUs would exceed 67 million entries. Lingqu uses a switching topology rather than full interconnection, aggregating routing tables by destination NodeID to around 8,192 entries, but multi-path routing (ECMP) multiplies this further.
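The routing-table arithmetic above can be checked with a short sketch. The ECMP fan-out of 4 is an assumed value for illustration only; the article has no data on how many equal-cost paths Lingqu actually keeps.

```python
def full_mesh_entries(n):
    """Every UBPU keeps a route to every UBPU: n * n directed entries."""
    return n * n

def aggregated_entries(n, ecmp_paths=1):
    """One entry per destination NodeID, multiplied by the ECMP fan-out."""
    return n * ecmp_paths

N = 8192
print(full_mesh_entries(N))      # 67108864
print(aggregated_entries(N))     # 8192
print(aggregated_entries(N, 4))  # 32768, assuming 4 equal-cost paths
```

Aggregation turns quadratic growth into linear growth, and ECMP only adds a constant factor on top.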
Kernel-mode drivers:
| Driver | Managed Object |
|---|---|
| UBMM driver | UB Memory Manager configuration |
| NPU driver | NPU accelerator |
| DPU driver | Data processing unit |
| SSU driver | Storage service unit |
| UDMA driver | UB DMA engine (cross-node data transfer) |
| UMMU driver | UB MMU (address translation) |
| UNIC driver | UB network interface |
The userspace ubctl is analogous to lspci—device discovery, hot-plug, topology queries, configuration management.
2.2 Memory Mgmt: Cross-Node Unified Addressing
This is the most technically dense module in the kernel layer. Standard Linux struct page corresponds to local physical memory. NUMA extends this to multiple sockets but stops at a single chassis. Lingqu wants remote NPU's HBM to become part of the local NUMA topology.
UMMU: Cross-Node Address Translation
The UMMU hardware unit translates cross-node addresses. Workflow:
- CPU load/store accesses a virtual address
- Local MMU translates to physical address
- Physical address falls within the Lingqu address space → UMMU takes over
- Translates to {NodeID, UASID, VA}
- Bus routes to the target node
- Target node executes the operation and returns data
The entire process occurs without OS involvement. The application cannot distinguish between remote and local access—only the latency differs.
The {NodeID, UASID, VA} three-field structure naturally supports isolation between multiple address spaces. UASID is similar to x86 PCID but extended across nodes. The translation tables must cover 8,192 UBPUs × N concurrent UASIDs, so UMMU's TLB capacity and miss penalty are key performance parameters; neither is publicly disclosed.
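As a rough mental model of the translation step, the sketch below maps a physical address to a {NodeID, UASID, VA} triple. Everything in it is an assumption made for illustration: the window base, the fixed per-node region size, and the arithmetic mapping are invented, and the real UMMU is table-driven hardware whose format is not public.

```python
from typing import NamedTuple, Optional

class UBAddress(NamedTuple):
    node_id: int
    uasid: int
    va: int

# Hypothetical layout: physical addresses at or above UB_WINDOW_BASE belong
# to the Lingqu address space, carved into fixed-size per-node regions.
UB_WINDOW_BASE = 1 << 44
NODE_REGION_SIZE = 1 << 36

def ummu_translate(phys_addr: int, uasid: int) -> Optional[UBAddress]:
    """Return a UB address for remote accesses, or None for local memory."""
    if phys_addr < UB_WINDOW_BASE:
        return None  # ordinary local access, the UMMU is not involved
    offset = phys_addr - UB_WINDOW_BASE
    node_id, va = divmod(offset, NODE_REGION_SIZE)
    return UBAddress(node_id=node_id, uasid=uasid, va=va)

remote = ummu_translate(UB_WINDOW_BASE + 3 * NODE_REGION_SIZE + 0x40, uasid=7)
print(remote.node_id, remote.uasid, hex(remote.va))  # 3 7 0x40
```

The useful property to notice is that the caller only ever issues a load or store; whether the result is local or a routed triple is decided entirely below the instruction.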
Latency benchmarks and capability boundaries.
| Operation | Latency | Notes |
|---|---|---|
| Local DDR (same NUMA) | 100-150ns | Baseline |
| Cross-NUMA (same chassis) | 200-300ns | x86 QPI/UPI |
| Lingqu cross-cabinet | ~400ns | OBMM measured |
| RDMA Read (same DC) | 2-5μs | RoCEv2/IB |
| NVLink (same chassis) | <100ns | NVLink 5.0 |
Capability boundary analysis. Where 400ns sits: roughly 1.3-2x slower than cross-NUMA, 5-12x faster than RDMA, and at least 4x slower than NVLink.
Consider a specific scenario: 8,192-card AllReduce reduce-scatter, where each card fetches a chunk from each of the other 8,191 cards. Synchronous load: 400ns × 8,191 = 3.3ms. NVLink equivalent in same chassis: <100ns × 8,191 = 0.8ms. A 4x gap.
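The estimate is a plain multiplication, reproduced here so the inputs can be varied. It models fully serialized synchronous loads, which is the pessimistic case the paragraph describes; the 100ns NVLink figure is the table's upper bound, not a measurement.

```python
def serial_fetch_time_ms(peers, latency_ns):
    """Total time for one synchronous load per peer, issued back to back."""
    return peers * latency_ns / 1e6

PEERS = 8192 - 1
ub = serial_fetch_time_ms(PEERS, 400)      # about 3.28 ms
nvlink = serial_fetch_time_ms(PEERS, 100)  # about 0.82 ms
print(round(ub, 2), round(nvlink, 2), round(ub / nvlink, 1))
```

Because both sides scale linearly with the peer count, the ratio stays at 4x regardless of cluster size; only overlapping or batching the loads changes it.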
Lingqu's response: UDMA bulk transfer + URMA async DMA. Multiple small loads are merged into one large block transfer, amortizing the 400ns startup latency. This is effective for large data blocks (for a 1MB KV Cache block, the transfer itself takes about 8μs at 128 GB/s, so the 400ns startup is roughly 5% of the total), but ineffective for fine-grained random access.
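A quick way to see where amortization starts to pay off is to compute the share of total transfer time spent on the fixed startup latency. The model below is a simplification (one startup cost plus size divided by bandwidth) using the 400ns and 128 GB/s figures from the text; it ignores queueing and protocol overhead.

```python
def startup_share(size_bytes, startup_ns=400, bandwidth_gb_s=128):
    """Fraction of total transfer time spent on the fixed startup latency."""
    # bytes / (GB/s) gives nanoseconds when 1 GB = 1e9 bytes
    wire_ns = size_bytes / bandwidth_gb_s
    return startup_ns / (startup_ns + wire_ns)

for size in (64, 4096, 1 << 20):
    print(size, round(startup_share(size), 3))
```

For a 64-byte cache line nearly all of the time is startup latency, while for a 1 MiB block the startup share falls below 5%.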
Conclusion: Lingqu's "single-machine semantics" is a simplification of the programming model, not an equivalence of physical performance. The hardware gives you a single-machine API, but 400ns tells you it's not actually a single machine. Performance-sensitive paths still need to be aware of data placement—hot data on local NUMA, warm data on cross-cabinet UB, cold data on remote storage.
Memory Borrow
Node A runs low on memory and temporarily borrows idle HBM/DRAM from Node B. Borrowed memory is mapped as local NUMA.remote through UMMU. Completely transparent to applications and memory allocators.
OBMM management policies: trigger threshold, borrowing ratio, reclamation timing. Measured: optimal borrowing ratio is approximately 25%, with performance loss <5%.
Why 25%? Analysis: borrowed remote memory is 400ns, local DDR is 100-150ns. Under a uniform distribution assumption (most applications aren't uniform, but this gives an upper bound), borrowing 25% means 25% of accesses hit 400ns and 75% hit 100-150ns. Weighted average ≈ 0.25 × 400 + 0.75 × 125 = 194ns, 55% slower than pure local 125ns. But real applications have locality (hot data tends to stay local), so actual loss is <5%. 25% is not a hard limit; it's an empirical optimum under a locality assumption. Applications with poor locality (large-scale graph computation random walks) should use a lower ratio.
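The weighted-average argument can be written as a two-line model. It is a sketch under the same simplification as the paragraph (a single local latency of 125ns and a single remote latency of 400ns); the second function inverts the model to show what fraction of accesses may go remote before a given loss budget is exceeded.

```python
def avg_latency_ns(remote_fraction, local_ns=125, remote_ns=400):
    """Average access latency when remote_fraction of accesses go remote."""
    return remote_fraction * remote_ns + (1 - remote_fraction) * local_ns

def max_remote_fraction(loss, local_ns=125, remote_ns=400):
    """Largest remote access fraction that keeps the slowdown within loss."""
    return loss * local_ns / (remote_ns - local_ns)

uniform = avg_latency_ns(0.25)
print(uniform, round(uniform / 125 - 1, 2))   # 193.75 0.55
print(round(max_remote_fraction(0.05), 4))    # 0.0227
```

Under this model a loss below 5% means that only about 2% of accesses actually reach the borrowed 25% of memory, which is the locality assumption stated in numbers.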
Borrowing is exclusive—the borrower has full data ownership. No multiple writers → no need for cache coherence → cacheable. The CC problem is confined to the sharing scenario; borrowing sidesteps it entirely. Simpler hardware (no cross-cabinet cache coherence circuitry needed), but borrowing is only suitable for exclusive use.
Memory Share
Multiple nodes access the same data (shared weights, distributed inference KV Cache). set_ownership controls write access: at any time, only one owner can write; all others are read-only. Write ownership transfer occurs through a software protocol.
Lingqu does not have cross-node hardware cache coherence. This is a deliberate choice. Hardware CC across 128 cabinets:
- Snoop filter scale: snoop state tracking for 128 nodes × tens of TB of memory—storage and query latency are uncontrollable
- Coherence traffic: every write broadcasts invalidation to all nodes that might cache that line, bandwidth grows quadratically with node count
- Unpredictable latency: invalidation ack depends on the slowest node
Lingqu chose software coherence + hardware transparent access. The cost: owner switching is not an atomic operation. When two nodes alternately write to the same memory region at high frequency (training gradient accumulation), set_ownership round-trips become a bottleneck.
Capability boundary: Lingqu shared memory is suitable for low-frequency-write, high-frequency-read patterns (inference: one node writes KV Cache, multiple nodes read), not for high-frequency-write, high-frequency-read patterns (training: multiple nodes alternately write gradients). Inference is the primary target for Lingqu super-nodes, so this tradeoff is reasonable.
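The difference between the two access patterns shows up clearly in a toy model of single-writer ownership. This is not the real set_ownership interface, whose signature is not public; the class, its methods, and the transfer counter are hypothetical and only count how many ownership hand-offs each pattern needs.

```python
class SharedRegion:
    """Toy model of single-writer, multi-reader ownership (not the real API)."""

    def __init__(self, owner):
        self.owner = owner
        self.data = 0
        self.transfers = 0  # each transfer stands for one software round-trip

    def set_ownership(self, new_owner):
        if new_owner != self.owner:
            self.owner = new_owner
            self.transfers += 1

    def write(self, node, value):
        if node != self.owner:
            raise PermissionError(f"node {node} is read-only for this region")
        self.data = value

    def read(self, node):
        return self.data  # any node may read


def alternating_writes(region, nodes, steps):
    """Let nodes take turns writing; return the ownership transfers used."""
    for step in range(steps):
        node = nodes[step % len(nodes)]
        region.set_ownership(node)
        region.write(node, step)
    return region.transfers


print(alternating_writes(SharedRegion(0), [0], 100))     # 0, inference-like
print(alternating_writes(SharedRegion(0), [0, 1], 100))  # 99, training-like
```

With one writer the protocol cost is zero after setup; with two alternating writers almost every write pays a hand-off.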
Comparison with CXL Type 3: Different Problem Domains
| Dimension | CXL Type 3 | Lingqu UBM |
|---|---|---|
| Physical range | PCIe link (1-2m) | Lingqu bus (cross-cabinet, >100m) |
| Device types | CXL memory expanders (homogeneous) | CPU/NPU/DPU/SSU (heterogeneous) |
| Coherence | Hardware CC (over CXL.mem; Type 3 devices do not use CXL.cache) | Software ownership management |
| Programming model | Must be aware of CXL memory type | Unified NUMA node, transparent |
| Ecosystem | CXL Consortium standard, multi-vendor | Huawei-exclusive |
CXL hardware CC is feasible within PCIe link distances. Lingqu abandoned hardware CC because the physical range differs by two orders of magnitude.
HMM / ZONE_DEVICE Limits
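Linux HMM provides struct page for device memory through ZONE_DEVICE + DEVICE_PRIVATE. Drivers use hmm_range_fault() to mirror CPU page tables into the device, and a CPU access to a device-private page faults into the driver's migrate_to_ram() callback, which migrates the page back to system memory. This is the foundation of NVIDIA/AMD's Unified Memory implementations.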
HMM's scope is a single machine. Cross-node extension requires: new page table entry types, new migration paths, new coherence models. Technically feasible, but requires multi-vendor collaboration—3-5 years, and no one is currently pushing it. HMM is the closest open-source counterpart to Lingqu UBM, and the single-machine ceiling is a wall it cannot cross.
2.3 Communication: URMA Communication Foundation
UMDK provides three types of primitives: URMA (RDMA analogy), RPC (gRPC analogy), and Message (MPI analogy).
URMA has three modes: synchronous load/store (pointer operations), async DMA (UDMA engine), and message passing.
URMA vs RDMA physical path: RDMA goes through NIC→switch→NIC, with IP/UDP encapsulation in the RoCEv2 case. URMA goes through UDMA→UB Switch→UDMA, with no network protocol stack. Removing the network encapsulation and the NIC protocol stack drops latency from microsecond-scale to hundred-nanosecond-scale.
Compatibility layers. URMA only runs on Lingqu hardware. RoUB wraps it as a libibverbs-compatible interface. Socket over UB wraps it as standard socket.
Analysis of the three-tier adaptation path assumptions.
| Adaptation Level | Benefit | Source | Implicit Assumption |
|---|---|---|---|
| Zero modification | ~10% | Faster underlying transport | Bottleneck is data transfer, not compute |
| SDK adaptation | ~30% | Remove verbs layer | Worth modifying code for 30% |
| Deep refactoring | 50%+ | Explicit transfer → pointer operations | Architecture suits pointer operations |
10% zero modification: Going through RoUB still passes through the verbs API. The verbs layer overhead hasn't disappeared; only the underlying transport changed from IB to Lingqu. If the application bottleneck is compute rather than communication, the benefit approaches zero.
30% SDK adaptation: xcopy(gva) replaces ibv_post_send + ibv_reg_mr + ibv_poll_cq, eliminating MR registration, CQ polling, and work request construction. The benefit depends on the communication ratio.
50%+ deep refactoring: Under unified addressing, explicit send/receive is replaced with direct pointer operations. This is an architectural-level change with a high ceiling on benefits but also the highest cost.
Cold start. 10% from zero modification isn't enough to drive migration—users with adequate IB have no motivation. The URMA API is much simpler than verbs, but the distance between "simpler" and "worth rewriting" is significant.
2.4 Virtualization: vfio + UMMU Coordination
KVM + vfio extensions. UB device passthrough to VMs: NPU passthrough, VM memory spanning multiple physical nodes ("super VM"), container live migration from 100ms → 50ms.
Challenge of vfio + UMMU coordination. Standard vfio maps device DMA addresses to Guest physical addresses through IOMMU. Lingqu device DMA addresses are in the UB address space {NodeID, UASID, VA}—not standard physical addresses. The vfio IOMMU mapping needs to be extended to UMMU format: either by adding a vfio_iommu_ub type that modifies vfio core code, or by wrapping through vfio platform.
The core issue: Guest physical address space includes remote node memory. Standard vfio only handles local IOMMU. Lingqu's vfio extension needs to handle both local IOMMU and cross-node UMMU simultaneously. During VM live migration, not only local memory pages but also UMMU mapping relationships must be migrated. These implementation details are not publicly disclosed.
2.5 UBM: Bus Manager on a Standalone Chip
UBM firmware is deployed on a standalone chip: topology discovery and maintenance, routing configuration updates, fault detection with physical isolation, and operations monitoring. HostOS crashes don't affect bus topology.
UBM scale limit analysis. Topology management for 8,192 UBPUs:
- Routing table scale. Full interconnection: 8,192 × 8,191 ≈ 67.1 million directed source-destination entries (about 33.5 million unordered pairs). Switching topology aggregates by destination NodeID to ~8,192 entries, with multi-path ECMP multiplying further.
- Fault detection. Heartbeat probing for 8,192 UBPUs at ~50μs per probe (UB bus round-trip): a single round takes ~410ms if probes are issued serially. With processing overhead, sub-second is reasonable.
- Fault isolation granularity. Single-device isolation → routing table update; cabinet-level isolation → loss of 64 cards (assuming 64 cards per cabinet).
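The heartbeat estimate in the list above assumes probes are sent one after another. The sketch below reproduces that number and adds a parallelism parameter; the parameter and the value 128 are assumptions for illustration, since the article has no information on whether UBM probes in parallel.

```python
def probe_round_ms(devices, probe_us=50, parallelism=1):
    """Duration of one heartbeat round with `parallelism` concurrent probes."""
    batches = -(-devices // parallelism)  # ceiling division
    return batches * probe_us / 1000

print(probe_round_ms(8192))                   # 409.6
print(probe_round_ms(8192, parallelism=128))  # 3.2, e.g. one lane per cabinet
```

If probing is serial, the detection period alone rules out millisecond-level failover; any stronger SLA has to come from parallel probing or from hardware link-down signals.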
Inference: thousand-card topology converges in seconds; 8,192 cards may take tens of seconds. Initialization and fault recovery are not within millisecond-level SLA. Training workloads (which can tolerate brief pauses) may be OK; online inference (sub-second failover) depends on the UBS Engine failover design.
3. Kernel Route Comparison: Lingqu vs NVIDIA
Interconnect Architecture
NVIDIA: multi-layer protocol splicing. NVLink (GPU-GPU, closed-source) + PCIe/Grace C2C + IB/Spectrum-X + GPUDirect Storage. Four sets of drivers, CUDA unifies the upper layer.
Lingqu: single-protocol convergence. All interconnects use UB, ub_bus_type + UMMU + URMA in one driver set.
Memory Semantics
NVIDIA ICMSP (CES 2026): three-tier KV cache. HBM (~100ns, ~$20/GB) → DRAM (~150ns, ~$3/GB) → NVMe SSD (~10μs, ~$0.3/GB). BlueField-4 DPU handles inter-tier transfer and prefetching. Storage semantics—runtime manages which tier data resides on.
Lingqu MemFabric: global virtual address space. Memory semantics—load instructions reach remote NPU HBM directly; hardware/OS decides data placement.
| Dimension | ICMSP (Storage Semantics) | MemFabric (Memory Semantics) |
|---|---|---|
| Data placement management | Runtime explicit | Hardware/OS implicit |
| Hotness matching | Naturally tiered | Relies on page migration |
| Programming complexity | High | Low |
| Suitable scenario | Predictable hotness (inference) | Unpredictable hotness (general-purpose) |
Boundary. In inference, KV Cache hotness is predictable—prefill tokens are hot, historical tokens have decreasing temperature. Storage semantics can precisely match cost/latency for each tier. Memory semantics relies on page migration—if migration granularity is too coarse (4KB pages) or trigger latency is too high, hot data may land on slow storage. Whether Lingqu's UCM (analyzed in the next article) adds an explicit management layer akin to storage semantics on top of memory semantics is worth tracking.
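To illustrate what "runtime explicit" placement means in practice, here is a minimal sketch of an age-based tiering rule together with the blended cost it produces. The latency and price figures are the ones quoted in this article; the window thresholds and the capacity split are hypothetical and not taken from any NVIDIA or Huawei implementation.

```python
# Tier figures quoted in this article: (latency in ns, cost in USD per GB).
TIERS = {"HBM": (100, 20.0), "DRAM": (150, 3.0), "NVMe": (10_000, 0.3)}

def place(token_age, hot_window=1_024, warm_window=16_384):
    """Pick a tier from token age; both window sizes are hypothetical."""
    if token_age < hot_window:
        return "HBM"
    if token_age < warm_window:
        return "DRAM"
    return "NVMe"

def blended_cost_per_gb(split):
    """split maps tier name -> capacity fraction (fractions must sum to 1)."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("capacity fractions must sum to 1")
    return sum(frac * TIERS[tier][1] for tier, frac in split.items())

print(place(10), place(5_000), place(100_000))  # HBM DRAM NVMe
print(blended_cost_per_gb({"HBM": 0.1, "DRAM": 0.3, "NVMe": 0.6}))
```

A memory-semantics system would have to reach the same placement through page migration heuristics rather than a rule the runtime states directly.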
4. Overall Capability Boundary Assessment
Irreplaceable: UB bus driver + UBM memory management. Direct dependency on Lingqu hardware. The closest open-source counterpart is the Linux CXL subsystem, but the physical and semantic scope is not on the same level. An equivalent solution would take 150-200 person-months, assuming you have your own interconnect hardware.
Approximable: URMA → UCX. Equivalent programming interface, but goes over the network (microsecond-scale) vs. over the bus (hundred-nanosecond-scale). When hundred-nanosecond latency isn't needed, UCX is perfectly usable.
Mature replacement: Virtualization → KVM/vfio/CRIU. Lingqu's acceleration comes from UB shared memory (100ms→50ms); standard 100ms is sufficient for most workloads.
Core judgment: the irreplaceability of the Lingqu kernel layer lies not in software, but in hardware. The value of UB drivers and UBM comes from the unique capabilities of the Lingqu physical interconnect—cross-cabinet unified addressing, hundred-nanosecond latency, and heterogeneous device peer access. Open-sourcing the code without opening the hardware ecosystem is like publishing blueprints without selling building materials.
The degree of openness itself is worth examining. The Lingqu kernel layer is open-sourced in openEuler, significantly more open than NVIDIA's closed-source drivers. But the UBS Core repository has only 24 commits, all from Huawei. Without third-party Lingqu hardware, no one can test or contribute. The code is open-sourced, but no community has been built.
5. Signals Worth Tracking
1. UBS Core community activity. 24 commits, all from Huawei. The inflection point would be a third-party chip vendor submitting a UBPU driver—even just one for a DPU.
2. Lingqu 2.0.1 specification adoption. The May 2026 refined edition (based on 30,000+ enterprise feedback) detailed 112G/224G SerDes parameters. Is any third party building chips based on it?
3. Linux kernel heterogeneous memory upstream progress. CXL Type 3 continues advancing (significant update in 6.13), HMM matures under GPU vendor promotion. If upstream solves cross-node heterogeneous memory—no current direction visible—Lingqu's moat is weakened.
4. 8,192-card production validation data. UMMU latency distribution, memory borrowing stability, UBM topology convergence speed—zero public data. Architecture blueprints without production data have limited persuasiveness.
5. Boundaries of openFuyao's K8s integration. Kernel-layer value is ultimately exposed through K8s. openFuyao's UB memory pooling plugin can only run on Lingqu hardware, limiting its influence to the Huawei ecosystem. Whether it can provide a degraded but usable experience on standard K8s determines the influence boundary of Lingqu's kernel capabilities.
6. Reverse-Engineering from Analysis: What Should the Kernel Layer's Functional Specification Look Like?
The first five chapters analyzed what the Lingqu kernel layer does and where its capability boundaries lie. Now let's reverse the perspective: if we were writing a "super-node kernel software functional specification" from scratch, based on the preceding analysis, what functional requirements should it list? Which of these has Lingqu already implemented, and which are gaps revealed by our analysis?
This is not Lingqu's actual roadmap; it's a functional specification view derived from this article's analysis. Requirements are numbered with FR (Functional Requirement), organized by module.
6.1 Bus and Device Management
| ID | Functional Requirement | Source | Lingqu Status |
|---|---|---|---|
| FR-1 | Independent bus type (ub_bus_type), supporting heterogeneous UBPU peer discovery and hot-plug | Single-protocol convergence design requirement | ✅ Implemented |
| FR-2 | Programmable routing tables: support fat tree / mesh / custom topologies, dynamic optimization by communication pattern | §1.3 topology planning analysis—160-cabinet topology directly affects AllReduce performance | ❓ Not publicly disclosed whether programmable |
| FR-3 | Routing table scale ceiling: 8,192 UBPU aggregated routing table + ECMP multi-path, total entries manageable | §2.1 routing table O(n²) analysis—full interconnection at 67M entries is infeasible, switching topology aggregation is required | ❓ No data |
| FR-4 | ubctl topology visualization: display complete topology map of 8,192 UBPUs, routing paths, bottleneck links | §1.3 operations needs topology awareness: fault localization across 160 cabinets can't rely on guessing | ✅ Basic functionality exists, visualization depth unknown |
| FR-5 | Independent kernel module versioning, decoupled from openEuler kernel version | §2.1 fork maintenance—UB OS Component regression testing shouldn't be tied to every openEuler minor version upgrade | ❌ Currently part of the openEuler fork |
FR-2 is the critical gap. Static topology means the optimal structure must be determined at deployment time and cannot be adjusted later based on workload. An 8,192-card cluster doesn't run a single workload—training and inference have completely different communication patterns. If Lingqu super-nodes are to serve both training and inference (time-sharing), programmable routing is a hard requirement.
6.2 Cross-Node Memory Management
| ID | Functional Requirement | Source | Lingqu Status |
|---|---|---|---|
| FR-6 | UMMU address translation: virtual address → {NodeID, UASID, VA}, hardware-completed | Core capability of super-node = one machine | ✅ Implemented |
| FR-7 | UMMU TLB capacity configurable, miss penalty observable | §2.2 8,192 UBPU × N UASID translation table scale—TLB miss could be a source of latency spikes | ❓ Not publicly disclosed |
| FR-8 | Tunable memory borrowing policy: trigger threshold, borrowing ratio cap, reclamation timeout, locality detection | §2.2 25% is empirical—applications with poor locality need lower ratios | ✅ OBMM framework exists, tunability degree unknown |
| FR-9 | Precise NUMA distance annotation for borrowed memory (reflecting actual latency, not a fixed value) | §2.2 remote HBM vs remote DRAM have different latencies—NUMA distance should differentiate | ❓ Not publicly disclosed |
| FR-10 | Measurable shared memory owner switch latency, with switch count and wait time metrics | §2.2 set_ownership software overhead is the bottleneck in training scenarios; it must be observable for tuning | ❓ Not publicly disclosed |
| FR-11 | Batch owner transfer for shared memory: transfer write ownership for multiple regions in a single call | §2.2 gradient accumulation in training involves many regions with alternating writes; per-region set_ownership overhead accumulates | ❓ Not publicly disclosed |
| FR-12 | Data placement awareness API: applications can query the physical location of a virtual address (local/cross-node/remote storage) and proactively trigger migration | §2.2 "single-machine semantics is programming simplification, not performance equivalence"—performance-sensitive paths need placement awareness | ❓ Not publicly disclosed |
| FR-13 | Cross-node memory cgroup limits: restrict the total remote memory a single container/process can borrow | §2.2 multi-tenant scenario—one tenant cannot indefinitely borrow other nodes' memory | ❓ Not publicly disclosed |
FR-7 and FR-12 are the most important gaps. FR-7 determines UMMU debuggability—if TLB misses cause latency spikes, operations needs to know. FR-12 determines whether applications can implement a "hot data local, warm data remote" tiering strategy—without this API, application performance optimization is pure guesswork.
FR-11 deserves separate mention. Lingqu shared memory's set_ownership is a single-operation call. A single gradient accumulation step in training may involve ownership switching for hundreds of memory regions. If each switch goes through a software protocol (request→confirm→effect), the latency of hundreds of serial switches is non-negligible. A batch transfer interface (transferring N regions in a single call) can compress N protocol round-trips into one. This isn't an optimization; it's a hard requirement for training scenarios.
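The saving from batching can be estimated with a simple cost model. All numbers here are assumptions: the article has no data on the set_ownership round-trip time, so 1μs is a placeholder, and per_region_us stands for any work that batching cannot remove.

```python
def switch_cost_us(regions, round_trip_us, per_region_us=0.0, batch_size=1):
    """Total ownership-switch time: one round-trip per call plus per-region work."""
    calls = -(-regions // batch_size)  # ceiling division
    return calls * round_trip_us + regions * per_region_us

serial = switch_cost_us(300, round_trip_us=1.0)
batched = switch_cost_us(300, round_trip_us=1.0, batch_size=300)
print(serial, batched)  # 300.0 1.0
```

The model makes the limit of batching visible too: once round-trips are collapsed, whatever per-region cost remains becomes the new floor.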
6.3 Communication and Compatibility
| ID | Functional Requirement | Source | Lingqu Status |
|---|---|---|---|
| FR-14 | URMA three modes (synchronous load/store, async DMA, message passing) | Communication foundation | ✅ Implemented |
| FR-15 | RoUB compatibility layer: libibverbs interface compatibility | §2.3 zero-modification migration path | ✅ Implemented |
| FR-16 | Socket over UB compatibility layer | §2.3 transparent TCP application acceleration | ✅ Implemented |
| FR-17 | Communication performance observability: per-URMA-operation latency distribution, bandwidth utilization, UDMA queue depth | §2.3 three-tier adaptation path benefits need empirical validation—without observability, the 10%/30%/50% assumptions cannot be proven | ❓ Not publicly disclosed |
| FR-18 | RDMA application compatibility certification matrix: which mainstream RDMA applications (NCCL/oneCCL/MPI) have been validated on RoUB | §2.3 cold start problem—users need to know if existing applications can run directly | ❓ Not publicly disclosed |
| FR-19 | URMA → UCX adaptation layer: enable UCX applications to run on Lingqu hardware | §4 open-source alternative analysis—UCX is the closest cross-platform communication framework | ❓ Not publicly disclosed |
FR-17 and FR-18 are the lifelines for cold start. Lingqu ecosystem cold start depends on user migration. The prerequisite for user migration is: knowing whether existing applications can run (FR-18) and how much faster they'll be (FR-17). Without these two data points, the 10% zero-modification benefit is just a paper number.
6.4 Virtualization
| ID | Functional Requirement | Source | Lingqu Status |
|---|---|---|---|
| FR-20 | UB device vfio passthrough, supporting NPU/DPU passthrough to VMs | §2.4 compliance scenarios | ✅ Implemented |
| FR-21 | Cross-node VM memory mapping + live migration | §2.4 "super VM" | ✅ 100ms→50ms |
| FR-22 | Independent vfio UMMU mapping type (vfio_iommu_ub), without modifying vfio core code | §2.4 vfio+UMMU coordination: modifying vfio core increases upstream compatibility risk | ❓ Implementation approach not publicly disclosed |
| FR-23 | Super VM memory limits: restrict the total cross-node memory a single VM can use | §6.2 multi-tenancy—one VM cannot consume an entire super-node's memory | ❓ Not publicly disclosed |
6.5 Bus Management (UBM)
| ID | Functional Requirement | Source | Lingqu Status |
|---|---|---|---|
| FR-24 | Topology discovery and routing configuration | Basic capability | ✅ Implemented |
| FR-25 | Configurable fault detection period, measurable convergence time | §2.5 thousand-card seconds → 8,192 cards may take tens of seconds—operations needs to know actual convergence time | ❓ Not publicly disclosed |
| FR-26 | Configurable fault isolation granularity: single-device vs cabinet-level, configurable per SLA requirements | §1.3 isolation granularity affects lost compute—training can tolerate single-device isolation (more precise but slower convergence), inference may need cabinet-level isolation (more loss but faster recovery) | ❓ Not publicly disclosed |
| FR-27 | Independent UBM firmware upgrade/rollback, without affecting HostOS and DeviceOS | §1.2 three-layer version matrix—all three layers must be independently upgradeable | ❓ Not publicly disclosed |
| FR-28 | UBM topology change event notification: notify HostOS when routing changes, triggering application-layer rerouting | §2.5 during topology convergence, applications need to know which paths are unavailable | ❓ Not publicly disclosed |
FR-25 is the key parameter for determining the Lingqu super-node's scale ceiling. If topology convergence at 8,192 cards takes tens of seconds, then initialization takes >10 seconds, single-device fault recovery takes >10 seconds, and large-scale topology recomputation may take >30 seconds. These numbers directly affect the operations SLA and business availability design.
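To see why convergence time feeds into SLA design, the following sketch converts an availability target into a downtime budget and then into a number of tolerable convergence events. The 99.99% target and the 30-day period are illustrative assumptions, not Lingqu figures, and the model pessimistically assumes the service is unavailable for the whole convergence window.

```python
import math

MONTH_S = 30 * 24 * 3600  # assumed 30-day accounting period, in seconds


def downtime_budget_s(availability, period_s):
    """Seconds of downtime allowed in the period at a given availability."""
    return (1.0 - availability) * period_s


def max_tolerable_events(availability, period_s, recovery_s):
    """How many recovery events of length recovery_s fit in the budget."""
    if recovery_s <= 0:
        raise ValueError("recovery_s must be positive")
    return math.floor(downtime_budget_s(availability, period_s) / recovery_s)


if __name__ == "__main__":
    # Illustrative target of 99.99% monthly availability (an assumption).
    print(downtime_budget_s(0.9999, MONTH_S))         # about 259.2 seconds
    print(max_tolerable_events(0.9999, MONTH_S, 10))  # 10 s per recovery
    print(max_tolerable_events(0.9999, MONTH_S, 30))  # 30 s per recovery
```

Under these assumptions, moving from 10-second to 30-second recovery cuts the tolerable number of events per month from 25 to 8, which is why a measured convergence time matters more than an architectural estimate.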
6.6 Versioning and Operations
| ID | Functional Requirement | Source | Lingqu Status |
|---|---|---|---|
| FR-29 | Automatic three-layer version (DeviceOS / UBM / HostOS) compatibility matrix checking | §1.2 O(n³) compatibility—manual checking is infeasible | ❓ Not publicly disclosed |
| FR-30 | Firmware hot upgrade: UBM and DeviceOS firmware upgrades without interrupting services | §1.2 three-layer versioning—cold upgrades mean service interruption | ❓ Not publicly disclosed |
| FR-31 | Version rollback: any layer can roll back to the previous version on upgrade failure, rollback time <SLA | §1.2 production environment requirement | ❓ Not publicly disclosed |
| FR-32 | Kernel module ABI stability: UB OS Component's userspace-kernel ABI remains stable across versions | §2.1 fork maintenance—if ABI changes every version, userspace toolchains must follow | ❓ Not publicly disclosed |
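A minimal sketch of the automated check in FR-29 follows. With n versions per layer there are n³ (DeviceOS, UBM, HostOS) combinations, so the check is modeled as a lookup against a set of validated triples. The version strings, cabinet names, and function names are hypothetical, and the real matrix format is not publicly disclosed.

```python
from itertools import product


def untested_combinations(device_os, ubm, host_os, validated):
    """All (DeviceOS, UBM, HostOS) triples that are not in the validated set."""
    return [c for c in product(device_os, ubm, host_os) if c not in validated]


def check_fleet(nodes, validated):
    """Return the sorted names of nodes running a non-validated triple.

    nodes maps a node name to its (device_os, ubm, host_os) versions.
    """
    return sorted(name for name, combo in nodes.items() if combo not in validated)


if __name__ == "__main__":
    versions = ["1.0", "1.1", "1.2"]  # hypothetical version strings
    validated = {("1.0", "1.0", "1.0"), ("1.1", "1.1", "1.1")}

    # 3 versions per layer give 27 combinations; only 2 are validated.
    print(len(untested_combinations(versions, versions, versions, validated)))

    fleet = {
        "cab-001": ("1.0", "1.0", "1.0"),
        "cab-002": ("1.1", "1.0", "1.1"),  # mixed versions, not validated
    }
    print(check_fleet(fleet, validated))
```

Even at three versions per layer, 25 of the 27 combinations are untested in this example, which illustrates why manual checking does not scale to 128 cabinets.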
6.7 Specification Summary: Implemented, Gaps, Unknowns
Tallying the above FRs:
| Status | Count | Percentage |
|---|---|---|
| ✅ Implemented | 9 (FR 1/6/8/14/15/16/20/21/24) | 28% |
| ❓ Not publicly disclosed / not implemented | 23 (FR 2-5, 7, 9-13, 17-19, 22-23, 25-32) | 72% |
| ❌ Explicitly missing | 0 | 0% |
28% implemented, 72% unknown. This ratio is itself a signal: Lingqu has published the architecture blueprint but not the engineering details. Architecture blueprints tell you what the system looks like; engineering details tell you whether it actually works.
Of the 72% unknown, the five highest-priority items:
- FR-25 (UBM convergence time)—determines the super-node's scale ceiling
- FR-12 (Data placement awareness API)—determines whether applications can do performance optimization
- FR-7 (UMMU TLB observability)—determines whether latency issues can be diagnosed
- FR-18 (RDMA compatibility matrix)—determines whether cold start can happen
- FR-29 (Version compatibility auto-check)—determines whether operating 128 cabinets is a nightmare
These five capabilities won't appear on architecture diagrams, but they will show up in operations tickets.
The next article will analyze the Lingqu service layer—UBS Engine control plane, MemFabric 128 TB + 48 TB hybrid pool, UCM TTFT -90% and throughput 22x, NPU Direct Storage bypassing CPU for direct storage access, and the full picture of application integration from training to inference pipelines.
Sources: Lingqu System Software Architecture & Deployment Public Course (Niu Tao), openEuler UB Service Core White Paper 2.0, APNet'21 "Huawei UB: Towards Compute-Native Networking" (Bojie Li), Huawei Connect 2025 Eric Xu Keynote, Lingqu Interconnect Community Specification Documents, openEuler SIG-Long Heterogeneous Fusion SIG Materials, KADC 2026 openFuyao Sub-forum, CLK 2025 Technical Talks, Linux Kernel CXL/HMM/ZONE_DEVICE Documentation, UCX Project Documentation, comentropy Super-Node Industry Chain Analysis, ubs-core Repository (atomgit.com)
