LineShine Addendum: New Details Confirmed by The Next Platform's Deep Dive

This article supplements and corrects "LineShine Tops TOP500: 2 EFLOPS Without a Single GPU." On June 25, 2026, The Next Platform co-founder Timothy Prickett Morgan published a deep architectural teardown of LineShine, drawing on HACI 2026 presentation slides (delivered by system chief architect Yutong Lu) and technical parameters from an NSC Shenzhen AI paper published in April. What follows confirms and corrects our original analysis in light of these new sources.

1. Architecture Confirmations and Corrections

1.1 LX2 Designer: NSC Shenzhen + Huawei HiSilicon

Our original article did not identify the LX2's designer. TNP confirms: the LX2 was jointly designed by NSC Shenzhen and Huawei (presumably its HiSilicon chip division). The SVE2 units are block-copied from ARM Neoverse IP, while the SME matrix units are a custom Huawei implementation.

1.2 Process Node: SMIC 7nm N+3

We previously inferred SMIC 7nm. TNP further pins it at SMIC 7nm N+3, an enhanced variant. TNP's reasoning: 1.55 GHz is well below the ~3 GHz SMIC can push with this process—the downclock balances core speed against memory speed while keeping the power curve in check. At 690W TDP, any higher clock would be thermally unsustainable.

This complements our original chiplet area estimation: the point is not that they couldn't go faster, but that they deliberately clocked down for an efficiency sweet spot, making up for per-core performance with scale.

1.3 Chiplet Structure and Yield

The slide confirms a 2-chiplet design. More critically, TNP derived raw core counts from the die shot:

Each chiplet has 48 core blocks, each with 4 cores → 192 raw cores per chiplet
384 raw cores per socket; 304 cores exposed → 79.2% yield
Consistent with what TNP calls "expected yield range for SMIC 7nm"

This is more precise than our original pure-area estimation—we correctly inferred a chiplet scheme but lacked the specific raw core count and yield figure.
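
The yield figure follows directly from the counts TNP read off the die shot. A minimal Python sketch of that arithmetic; the variable names are ours, and "yield" here simply means the fraction of raw cores that are exposed:

```python
# Raw vs. exposed core counts per LX2 socket, using the figures quoted above.
CORE_BLOCKS_PER_CHIPLET = 48
CORES_PER_BLOCK = 4
CHIPLETS_PER_SOCKET = 2
EXPOSED_CORES_PER_SOCKET = 304

raw_per_chiplet = CORE_BLOCKS_PER_CHIPLET * CORES_PER_BLOCK
raw_per_socket = raw_per_chiplet * CHIPLETS_PER_SOCKET
core_yield = EXPOSED_CORES_PER_SOCKET / raw_per_socket

print(raw_per_chiplet, raw_per_socket, f"{core_yield:.1%}")  # 192 384 79.2%
```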

1.4 CPU Count Correction

Our original "~47,000 CPUs" needs correction. There are two actual numbers:

Configuration described in the NSC Shenzhen paper: 20,480 nodes × 2 sockets = 40,960 LX2 CPUs
HPL benchmark configuration, which adds 2,200 nodes: 22,680 nodes × 2 sockets = 45,360 LX2 CPUs, 13,789,440 cores

The HPL run used a larger configuration (~10% more nodes than the paper), consistent with TNP's observation that "China can scale LineShine further if it wants or needs to."
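
Both configurations can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name system_size is ours, and the 304 cores per socket is the exposed core count from Section 1.3:

```python
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 304  # exposed cores per LX2

def system_size(nodes):
    """Return (CPU count, core count) for a given node count."""
    cpus = nodes * SOCKETS_PER_NODE
    return cpus, cpus * CORES_PER_SOCKET

paper_cpus, paper_cores = system_size(20_480)
hpl_cpus, hpl_cores = system_size(22_680)
growth = (22_680 - 20_480) / 20_480

print(paper_cpus, paper_cores)
print(hpl_cpus, hpl_cores)
print(f"{growth:.1%} more nodes in the HPL run")
```

The HPL configuration comes out at 45,360 CPUs and 13,789,440 cores, about 10.7% more nodes than the paper describes.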

2. Memory System Corrections

2.1 HBM Capacity: 64 GB per Socket, Not 32 GB

This is the most important correction. Our original statement of "32 GB HBM (4 TB/s) per CPU" came from an ambiguous per-chip description in the NSC Shenzhen paper. TNP clarifies: it's 32 GB + 4 TB/s per chiplet, totaling 64 GB + 8 TB/s per socket.

The slide's "HBM 4 TB/s" refers to a single chiplet. Each chiplet's four 24-core blocks each get one HBM stack.

TNP speculates this is a slightly goosed variant of HBM2E.
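
To put the corrected figures in context, the sketch below doubles the per-chiplet numbers and derives two quantities that neither source states directly: HBM bytes per peak FP64 FLOP, and total HBM across the HPL configuration. Both are our own arithmetic from the quoted figures (60.3 TFLOPS per socket, 45,360 sockets), not numbers from TNP or the slides:

```python
# Per-chiplet HBM figures from TNP; per-socket FP64 peak from the HACI slide.
HBM_GB_PER_CHIPLET = 32
HBM_TBPS_PER_CHIPLET = 4.0
CHIPLETS_PER_SOCKET = 2
PEAK_FP64_TFLOPS_PER_SOCKET = 60.3
HPL_SOCKETS = 45_360

hbm_gb_per_socket = HBM_GB_PER_CHIPLET * CHIPLETS_PER_SOCKET
hbm_tbps_per_socket = HBM_TBPS_PER_CHIPLET * CHIPLETS_PER_SOCKET

# Derived values (our arithmetic, not quoted by any source).
bytes_per_flop = hbm_tbps_per_socket / PEAK_FP64_TFLOPS_PER_SOCKET
system_hbm_pb = hbm_gb_per_socket * HPL_SOCKETS / 1e6  # decimal petabytes

print(hbm_gb_per_socket, hbm_tbps_per_socket)
print(f"{bytes_per_flop:.3f} bytes/FLOP, {system_hbm_pb:.2f} PB of HBM")
```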

2.2 DRAM: 3D-Stacked LPDDR5X, Not Plain DDR5

Our original only mentioned "DDR5." TNP provides far more detail:

256 GB LPDDR5X per socket (not DDR5), presumably sourced from ChangXin Memory Technologies (CXMT), which demonstrated LPDDR5X at 10.7 Gb/s per pin in late 2025
Uses wafer-to-wafer 3D stacking—custom DRAM dies bonded to logic wafers to reduce power and area
The slide explicitly states: "Customized DRAM dies reduce power and area; wafer-to-wafer 3D stacking combines DRAM and logic wafers"
8 NUMA domains across the two chiplets organize this DRAM (we originally inferred a two-level NUMA; it's actually 8 domains)
An SDMA engine automatically manages HBM ↔ DRAM data movement

2.3 Programmable HBM Modes

The slide confirms two programmable HBM modes:

Cache mode: out-of-the-box bandwidth optimization
Flat mode: deep manual tuning for power users

3. System Hierarchy Confirmed

TNP reconstructed the full physical topology from the slides—a layer our original article did not cover:

Tier	Configuration	Notes
Node	2-socket LX2	Basic compute unit
Blade	8 nodes	PCIe 5.0 interconnects nodes within a blade
Frame	16 blades = 128 nodes = 256 CPUs	Switch-interconnected blades, 30.87 PFLOPS FP64
Cabinet	2 frames	One cabinet per two frames
Full system (paper)	160 frames	20,480 nodes / 40,960 CPUs
Full system (HPL)	~177 frames	22,680 nodes / 45,360 CPUs

Within a frame, PCIe 5.0 switches connect blades (what TNP calls "the inexpensive way to do it"). Between frames, the LingQi network takes over.
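
The frame counts in the table follow from the blade and frame sizes. A short sketch of that arithmetic (the function name frames_for is ours):

```python
NODES_PER_BLADE = 8
BLADES_PER_FRAME = 16
SOCKETS_PER_NODE = 2

nodes_per_frame = NODES_PER_BLADE * BLADES_PER_FRAME
cpus_per_frame = nodes_per_frame * SOCKETS_PER_NODE

def frames_for(nodes):
    """Number of frames' worth of nodes (may be fractional)."""
    return nodes / nodes_per_frame

print(nodes_per_frame, cpus_per_frame)  # 128 256
print(frames_for(20_480))               # 160.0
print(frames_for(22_680))               # 177.1875
```

The paper configuration fills exactly 160 frames. The HPL configuration works out to a little over 177 frames' worth of nodes, which is why the table gives that figure as approximate.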

4. Network: LingQi in Full Detail

Our original speculated LingQi might be an InfiniBand variant. TNP and the slides provide the complete topology:

4-layer fat-tree topology (L1–L4); only L4 uses optical links—L1 through L3 are all copper
184 compute frames + 32 network frames
L1 layer: 16 L1 switch/compute blades per compute frame
L2 layer: 8 L2 switch blades per compute frame
L3 layer: 16 L3 switch blades per network frame
L4 layer: 6 L4 switch blades per network frame
Full system: 22,000+ nodes, 200,000 ports
Full-system bisection bandwidth: ≥ 3.5 Pbps (petabits per second)
Single-hop latency: 1.07 μs
Per-node bandwidth: 1.6 Tb/s (2 sockets × 800 Gb/s; each LX2 integrates a 2 × 400 Gb/s NIC on die)
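
A rough sketch of how two of these figures compose. It rests on assumptions the slides do not spell out: that each LX2's on-die NIC provides 2 × 400 Gb/s (the 800 Gbps in the summary table), that a node's bandwidth is the sum over its two sockets, and that "single-hop latency" means one switch traversal. The worst-case figure excludes NIC, cable, and software overheads, so treat it as a lower bound, not a measured number:

```python
PORT_GBPS = 400
PORTS_PER_SOCKET = 2   # assumption: on-die NIC exposes 2 x 400 Gb/s
SOCKETS_PER_NODE = 2
FAT_TREE_LEVELS = 4
HOP_LATENCY_US = 1.07  # assumption: latency of one switch traversal

node_gbps = PORT_GBPS * PORTS_PER_SOCKET * SOCKETS_PER_NODE

# In a k-level fat tree the longest path climbs to the top level and back
# down, crossing 2*k - 1 switches.
max_switch_hops = 2 * FAT_TREE_LEVELS - 1
worst_case_switch_latency_us = max_switch_hops * HOP_LATENCY_US

print(node_gbps, "Gb/s per node")
print(max_switch_hops, "switch hops,", round(worst_case_switch_latency_us, 2), "us")
```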

Reliability design:

Credit-based flow control for lossless communication
Dual-plane networking with multi-rail communication
Link-, chip-, and cabinet-level redundancy
Hardware-supported telemetry with per-second data collection and proactive push

The slide also shows a die photo of a switch ASIC—substantial in size—suggesting the LingQi switch is custom silicon rather than a commercial chip.

TNP notes that 1.07 μs single-hop latency "sounds more like Ethernet than InfiniBand," though it could be an InfiniBand implementation. Combined with credit-based flow control and lossless communication, LingQi's design target is clearly deterministic low-latency networking for HPC and AI training—not general-purpose data center Ethernet.

5. Performance and Power: New Data

5.1 HPL Efficiency

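Our original didn't discuss this. TNP calculates HPL computational efficiency at 80.35%: 2.198 EFLOPS sustained against a peak of about 2.74 EFLOPS (2.735 EFLOPS before rounding, from 45,360 sockets × 60.3 TFLOPS). This is high for a machine of this scale, though short of the best Fujitsu results: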

System	HPL Efficiency
K (Fujitsu)	93% (all-time record)
Fugaku	82.3%
LineShine	80.35%

TNP calls it "pretty damned good," attributing it to "merging big math with a healthy core instead of separating them."
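
The efficiency figure is sensitive to how the peak is rounded. The sketch below recomputes it from the per-socket peak and the HPL socket count quoted elsewhere in this article. It is our own check, not TNP's calculation:

```python
HPL_EFLOPS = 2.198
HPL_SOCKETS = 45_360
PEAK_TFLOPS_PER_SOCKET = 60.3

peak_eflops = HPL_SOCKETS * PEAK_TFLOPS_PER_SOCKET / 1e6
efficiency = HPL_EFLOPS / peak_eflops
efficiency_rounded_peak = HPL_EFLOPS / 2.74

print(f"peak = {peak_eflops:.3f} EFLOPS")
print(f"efficiency = {efficiency:.2%}")
print(f"with peak rounded to 2.74: {efficiency_rounded_peak:.2%}")
```

With the unrounded peak the result lands within a few hundredths of a percentage point of TNP's 80.35%; dividing by the rounded 2.74 EFLOPS instead gives roughly 80.2%.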

5.2 TDP and System Power

Per LX2: 690W
Full system: 42.2 MW (significantly above the < 30 MW of the three major U.S. exascale systems)

TNP's take: the extra power buys lower computational complexity—no offload model, unified HBM + DRAM address space, no GPU software stack overhead.
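
Two derived numbers help frame the power figure. The sketch assumes the 42.2 MW applies to the 45,360-socket HPL configuration and that every LX2 draws its full 690W TDP; neither assumption is confirmed by the sources, so the results are rough estimates:

```python
HPL_FLOPS = 2.198e18
SYSTEM_WATTS = 42.2e6
HPL_SOCKETS = 45_360      # assumption: power figure is for the HPL configuration
LX2_TDP_WATTS = 690       # assumption: all sockets run at TDP

gflops_per_watt = HPL_FLOPS / SYSTEM_WATTS / 1e9
cpu_power_mw = HPL_SOCKETS * LX2_TDP_WATTS / 1e6
cpu_share = cpu_power_mw * 1e6 / SYSTEM_WATTS

print(f"{gflops_per_watt:.1f} GFLOPS/W on HPL")
print(f"CPUs at TDP: {cpu_power_mw:.1f} MW ({cpu_share:.0%} of system power)")
```

Under these assumptions the CPUs alone account for roughly three quarters of the 42.2 MW, with the remainder going to memory, network, storage, and cooling.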

5.3 Per-Core Peak Performance Confirmed

The slide confirms the peak numbers cited in our original:

FP64: 60.3 TFLOPS per LX2 → approximately 198 GFLOPS/core
SVE2 + SME together cover FP64/32/16/INT8
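
The per-core figure can be taken one step further to FLOPs per cycle. The sketch assumes the 60.3 TFLOPS peak is defined at the 1.55 GHz clock from Section 1.2 and is spread evenly over the 304 exposed cores:

```python
PEAK_FP64_FLOPS_PER_SOCKET = 60.3e12
EXPOSED_CORES = 304
CLOCK_HZ = 1.55e9  # assumption: peak is quoted at this clock

gflops_per_core = PEAK_FP64_FLOPS_PER_SOCKET / EXPOSED_CORES / 1e9
flops_per_cycle = gflops_per_core * 1e9 / CLOCK_HZ

print(f"{gflops_per_core:.1f} GFLOPS/core")
print(f"{flops_per_cycle:.1f} FP64 FLOPs per cycle per core")
```

The result is very close to 128 FP64 FLOPs per cycle per core, a round number that is at least consistent with the quoted peak being a combined SVE2 + SME figure.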

5.4 SME + SVE Software Stack: Empirical Results

A separate HACI 2026 slide titled "SME-Enabled, HBM-Aware Matrix Acceleration" provides real benchmark data for the LX2's SME/SVE software optimizations, citing three published papers. Our original analyzed the SME microarchitecture but did not cover these optimization results.

SME matrixization efficiency: By uniformly mapping differently-shaped matrix operations across HPC and AI workloads to SME—including multi-row-update matmul in stencils, square-tile matmul in GEMM, and tall-and-skinny matmul (QKᵀ and SV) in Transformers—efficient matrixization improves SME utilization by over 40%.
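
For readers unfamiliar with SME: the Arm extension builds matrix multiplication from outer-product-accumulate instructions that update a two-dimensional accumulator tile. The pure-Python sketch below shows only that formulation (C as a sum of rank-1 updates). It is an illustration of the idea, not the LX2 kernels or the code from the cited papers, and the function name is ours:

```python
def matmul_outer_product(a, b):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates."""
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]    # stands in for the accumulator tile
    for p in range(k):
        col = [a[i][p] for i in range(m)]  # column p of A
        row = b[p]                         # row p of B
        for i in range(m):
            for j in range(n):
                acc[i][j] += col[i] * row[j]
    return acc

print(matmul_outer_product([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

"Matrixization" in the slide's sense is the work of reshaping stencil, GEMM, and attention computations so that most of the arithmetic takes this tile-update form.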

SVE + SME interleaved scheduling: SVE complements SME in scenarios where SME underperforms—single-row-update in stencils, boundary handling in GEMM, online softmax in Transformers. Interleaving SME and SVE instruction streams boosts IPC by up to 1.59×. This aligns with our original analysis of the D2AR paper's "asymmetric SME-GEMM" scheduling strategy—SME and SVE are not symmetrically mixed; SME dominates, with SVE filling pipeline gaps.
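
Online softmax is a good example of work that does not map onto matrix tiles. The sketch below is the standard streaming formulation, which keeps a running maximum and a rescaled running sum so the scores can be consumed block by block. It is a generic illustration in plain Python, not the SMEAtten implementation, and the block size is arbitrary:

```python
import math

def online_softmax(scores, block=4):
    """Softmax computed with one streaming pass for the normalizer."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(scores), block):
        chunk = scores[start:start + block]
        new_max = max(running_max, max(chunk))
        # Rescale the old sum to the new maximum before adding the chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in scores]

probs = online_softmax([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(round(sum(probs), 6))
```

The per-element exponentials, maxima, and rescaling are vector-style operations, which is why the slide assigns this step to SVE while SME handles the surrounding QKᵀ and SV products.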

Memory-aware data placement:

HBM buffer pool pre-allocation → memory footprint reduced by 3.9 GB
Blocking to keep working tiles cache-resident + packing tiles into SME-friendly layouts + prefetching non-contiguous data → cache hit rate improved by up to 28%

Measured speedups:

Workload	Speedup	Baseline	Source
Stencil	Up to 4.1×	Compiler auto-vectorization	HStencil (SC'25)
GEMM	1.11–1.75×	Vendor math library	KirbyMM (DATE'26 Best Paper)
Attention	Avg. 13.62×	SOTA implementation	SMEAtten (Euro-Par'26)

These results, drawn from the three papers listed in the table, confirm a core thesis of our original article: integrating SME into a CPU is not a cosmetic addition. With deep software stack optimization, it can deliver multi-fold speedups on specific workloads. The 13.62× on Attention is particularly striking, suggesting that a pure-CPU architecture may find a competitive path for Transformer inference outside of GPUs.

5.5 LLM Inference Benchmark: DeepSeek at 578 TPS

The HACI 2026 system overview slide also disclosed a critical inference benchmark:

Single LX2 DeepSeek Decode throughput reaches 578 TPS (tokens per second)
Aggregate throughput is given as a partially obscured figure beginning "double-" (context suggests double-digit or double-level total throughput)
NSC Shenzhen is actively advancing Qwen and other mainstream/domestic large model training and inference deployment at scale

578 TPS is noteworthy in a CPU context. For reference, a single NVIDIA H100 typically delivers ~2,000–4,000 TPS on comparable Decode workloads (highly dependent on batch size and model size), at roughly 700W—almost identical to the LX2's 690W. 578 TPS vs. 2,000–4,000 TPS means GPUs still hold a 3–7× advantage, but considering this is a homogeneous CPU architecture + first-generation SME + no GPU software stack, the number is far from weak.
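
The comparison can be restated per watt. The sketch assumes both chips run at their rated power (690W and 700W) while decoding, and it inherits the caveat above that the H100 range is a rough reference and not a like-for-like measurement:

```python
LX2_TPS, LX2_WATTS = 578, 690
H100_TPS_RANGE, H100_WATTS = (2_000, 4_000), 700  # rough reference range

lx2_tokens_per_joule = LX2_TPS / LX2_WATTS
h100_tokens_per_joule = tuple(t / H100_WATTS for t in H100_TPS_RANGE)
throughput_gap = tuple(t / LX2_TPS for t in H100_TPS_RANGE)

print(f"LX2:  {lx2_tokens_per_joule:.2f} tokens/J")
print("H100: " + " to ".join(f"{v:.2f}" for v in h100_tokens_per_joule) + " tokens/J")
print("Gap:  " + " to ".join(f"{v:.1f}x" for v in throughput_gap))
```

Because the two power ratings are nearly equal, the per-watt gap is essentially the same 3–7× as the raw throughput gap.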

Agentic AI inference, the central thesis of our original article's Chapter 6, is characterized by low latency, small batches, long sequences, and sparse computation. For that workload, the CPU's unified memory + SME + SVE combination may well be competitive on total cost of ownership (TCO).

5.6 Supporting Systems: A Heterogeneous Facility

The same overview slide shows that NSC Shenzhen Phase II, the facility that hosts LineShine, is more than a pure-CPU cluster. It is a comprehensive computing facility:

System	Configuration	Purpose
LineShine main	ARMv9 LX2 pure CPU	HPC + AI training & inference
Industrial computing	1,580 X86 blades (101,120 cores), 10+ PFLOPS, 200 PB storage	Industrial simulation, traditional HPC
Pilot verification	100 Kunpeng servers (12,800 cores)	Ecosystem adaptation and validation
4-way / 8-way servers	16 × 4-way + 4 × 8-way (3,328 cores total)	Large-memory workloads

Additionally, LineShine's software ecosystem is compatible with 400+ mainstream HPC applications, with a toolchain including compilers, debuggers, and performance tuning tools.

6. Another System: CNIS

TNP's article also describes a second exascale-class system mentioned in the same NSC Shenzhen paper—China New-generation Intelligent Supercomputer (CNIS), a CPU+GPU heterogeneous system:

5,632 nodes
2 × 64-core CPUs + 8 GPUs per node
GPU peaks: 32.7 TFLOPS FP64 / 65.5 TFLOPS FP32 / 470 TFLOPS FP16
GPU memory: 64 GB HBM, 1.8 TB/s bandwidth
Interconnect: InfiniBand-like RDMA network, 3-layer Clos dual-plane topology, 4 × 400 Gb/s per node

Our original did not mention CNIS. TNP notes the GPU's origin is "unknown but presumably indigenous."
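
The "exascale-class" label can be sanity-checked from the node counts. The sketch below sums only the GPU FP64 peak and ignores any CPU contribution; the totals are our arithmetic from the listed figures, not numbers quoted by TNP:

```python
NODES = 5_632
GPUS_PER_NODE = 8
CPUS_PER_NODE, CORES_PER_CPU = 2, 64
GPU_FP64_TFLOPS = 32.7
LINK_GBPS, LINKS_PER_NODE = 400, 4

total_gpus = NODES * GPUS_PER_NODE
total_cpu_cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
gpu_fp64_eflops = total_gpus * GPU_FP64_TFLOPS / 1e6
node_network_gbps = LINK_GBPS * LINKS_PER_NODE

print(total_gpus, total_cpu_cores)
print(f"{gpu_fp64_eflops:.2f} EFLOPS FP64 peak from GPUs alone")
print(node_network_gbps, "Gb/s per node")
```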

7. Correction Summary

Item	Original	Corrected	Source
LX2 designer	Unspecified	NSC Shenzhen + Huawei HiSilicon	TNP
Process	SMIC 7nm (inferred)	SMIC 7nm N+3 (confirmed)	TNP
CPU count	~47,000	40,960 (paper) / 45,360 (HPL)	TNP / 芯智讯
HBM capacity	32 GB per socket	64 GB per socket (2 × 32 GB chiplets)	TNP / Slides
HBM bandwidth	4 TB/s	8 TB/s per socket (2 × 4 TB/s chiplets)	TNP
DRAM type	DDR5	LPDDR5X, wafer-to-wafer 3D stacking	Slides
DRAM capacity	Unspecified	256 GB per socket	TNP
Chiplet raw cores	Not mentioned	192 raw cores/chiplet (384/socket), 304 active per socket (79.2% yield)	TNP
LX2 TDP	Not mentioned	690W	Slides
Full-system power	Not mentioned	42.2 MW	TNP
HPL efficiency	Not mentioned	80.35%	TNP
On-die NIC	Not mentioned	800 Gb/s per LX2 (2 × 400 Gb/s)	Slides
LingQi single-hop latency	Not mentioned	1.07 μs	TNP
NUMA domains	Inferred 2-level	8 domains (confirmed)	Slides
CNIS system	Not mentioned	5,632-node CPU+GPU heterogeneous	TNP
DeepSeek inference	Not mentioned	578 TPS per LX2 Decode	Slides
LingQi port scale	Not mentioned	200,000 ports	Slides
LingQi flow control	Not mentioned	Credit-based, lossless	Slides
Telemetry	Not mentioned	Hardware-supported, per-second collection, proactive push	Slides
Software ecosystem	Not mentioned	400+ compatible apps, full toolchain	Slides
Supporting systems	Not mentioned	X86 industrial + Kunpeng verification clusters	Slides

Sources:

Timothy Prickett Morgan, "A Deep Dive On China's 'LineShine' All-CPU, Exaflops-Class Supercomputer", The Next Platform, June 25, 2026
HACI 2026 LineShine presentation slides (publicly shared by Torsten Hoefler / Tadashi Ogawa)
NSC Shenzhen, "Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials", arXiv, April 17, 2026
芯智讯, "2.198 EFLOPS! China Returns to #1 in Supercomputing After 8 Years", June 24, 2026