
LineShine Addendum: New Details Confirmed by The Next Platform's Deep Dive

2026-06-26 | Thinking | 28 min read

This article supplements and corrects "LineShine Tops TOP500: 2 EFLOPS Without a Single GPU." On June 25, 2026, The Next Platform co-founder Timothy Prickett Morgan published a deep architectural teardown of LineShine, drawing on HACI 2026 presentation slides (delivered by system chief architect Yutong Lu) and technical parameters from an NSC Shenzhen AI paper published in April. What follows confirms and corrects our original analysis in light of these new sources.


1. Architecture Confirmations and Corrections

1.1 LX2 Designer: NSC Shenzhen + Huawei HiSilicon

Our original article did not identify the LX2's designer. TNP confirms: the LX2 was jointly designed by NSC Shenzhen and Huawei (presumably its HiSilicon chip division). The SVE2 units are block-copied from ARM Neoverse IP, while the SME matrix units are a custom Huawei implementation.

1.2 Process Node: SMIC 7nm N+3

We previously inferred SMIC 7nm. TNP further pins it at SMIC 7nm N+3, an enhanced variant. TNP's reasoning: 1.55 GHz is well below the ~3 GHz SMIC can push with this process—the downclock balances core speed against memory speed while keeping the power curve in check. At 690W TDP, any higher clock would be thermally unsustainable.

This complements our original chiplet area estimation: the point is not that they couldn't go faster, but that they deliberately clocked down for an efficiency sweet spot, making up for per-core performance with scale.

1.3 Chiplet Structure and Yield

The slide confirms a 2-chiplet design. More critically, TNP derived raw core counts from the die shot:

  • Each chiplet has 48 core blocks, each with 4 cores → 192 raw cores per chiplet
  • 384 raw cores per socket; 304 cores exposed → 79.2% yield
  • Consistent with what TNP calls "expected yield range for SMIC 7nm"
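
The yield figure follows directly from the counts above. A minimal sketch of the arithmetic, using only the numbers reported in this section (the constant names are illustrative, and "yield" here simply means exposed cores divided by raw cores):

```python
# Figures as reported in the text; names are illustrative, not from any vendor tool.
CORE_BLOCKS_PER_CHIPLET = 48
CORES_PER_BLOCK = 4
CHIPLETS_PER_SOCKET = 2
EXPOSED_CORES_PER_SOCKET = 304

raw_per_chiplet = CORE_BLOCKS_PER_CHIPLET * CORES_PER_BLOCK
raw_per_socket = raw_per_chiplet * CHIPLETS_PER_SOCKET
exposed_fraction = EXPOSED_CORES_PER_SOCKET / raw_per_socket

print(f"raw cores per chiplet: {raw_per_chiplet}")
print(f"raw cores per socket:  {raw_per_socket}")
print(f"exposed fraction:      {exposed_fraction:.1%}")
```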

This is more precise than our original pure-area estimation—we correctly inferred a chiplet scheme but lacked the specific raw core count and yield figure.

1.4 CPU Count Correction

Our original "~47,000 CPUs" needs correction. There are two actual numbers:

  • Configuration described in the NSC Shenzhen paper: 20,480 nodes × 2 sockets = 40,960 LX2 CPUs
  • HPL benchmark configuration added ~2,200 more nodes: 22,680 nodes × 2 sockets = 45,360 LX2 CPUs, 13,789,440 cores

The HPL run used a larger configuration (~10% more nodes than the paper), consistent with TNP's observation that "China can scale LineShine further if it wants or needs to."
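
The two configurations can be cross-checked with a few lines of arithmetic. This is only a sketch built from the node counts above and the 304 exposed cores per socket from section 1.3; the helper name `totals` is hypothetical:

```python
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 304  # exposed cores per LX2, from section 1.3

def totals(nodes):
    """Return (CPU count, core count) for a given node count."""
    cpus = nodes * SOCKETS_PER_NODE
    return cpus, cpus * CORES_PER_SOCKET

PAPER_NODES = 20_480
HPL_NODES = 22_680

paper_cpus, paper_cores = totals(PAPER_NODES)
hpl_cpus, hpl_cores = totals(HPL_NODES)
extra_nodes = HPL_NODES - PAPER_NODES
growth = extra_nodes / PAPER_NODES

print(f"paper: {paper_cpus:,} CPUs, {paper_cores:,} cores")
print(f"HPL:   {hpl_cpus:,} CPUs, {hpl_cores:,} cores")
print(f"extra nodes: {extra_nodes:,} ({growth:.1%} more than the paper)")
```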


2. Memory System Corrections

2.1 HBM Capacity: 64 GB per Socket, Not 32 GB

This is the most important correction. Our original statement of "32 GB HBM (4 TB/s) per CPU" came from an ambiguous per-chip description in the NSC Shenzhen paper. TNP clarifies that those figures are per chiplet: 32 GB at 4 TB/s per chiplet, for a total of 64 GB at 8 TB/s per socket.

The slide's "HBM 4 TB/s" refers to a single chiplet. Each chiplet's four 24-core blocks each get one HBM stack.

TNP speculates this is a slightly goosed variant of HBM2E.
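
To put the corrected HBM figures in context, the sketch below derives per-socket totals and two ratios from them. The ratios are our own arithmetic, not figures from TNP or the slides, and they rely on the 304 exposed cores (section 1.3) and the 60.3 TFLOPS FP64 peak (section 5.3):

```python
HBM_GB_PER_CHIPLET = 32
HBM_TBPS_PER_CHIPLET = 4
CHIPLETS_PER_SOCKET = 2
EXPOSED_CORES = 304        # per socket, from section 1.3
PEAK_FP64_TFLOPS = 60.3    # per socket, from section 5.3

hbm_gb = HBM_GB_PER_CHIPLET * CHIPLETS_PER_SOCKET
hbm_tbps = HBM_TBPS_PER_CHIPLET * CHIPLETS_PER_SOCKET

# Derived ratios (illustrative): HBM bytes available per peak FP64 FLOP,
# and HBM bandwidth per exposed core in GB/s.
bytes_per_flop = hbm_tbps / PEAK_FP64_TFLOPS
gbps_per_core = hbm_tbps * 1000 / EXPOSED_CORES

print(f"HBM per socket: {hbm_gb} GB at {hbm_tbps} TB/s")
print(f"bytes per peak FP64 FLOP: {bytes_per_flop:.3f}")
print(f"HBM bandwidth per core:   {gbps_per_core:.1f} GB/s")
```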

2.2 DRAM: 3D-Stacked LPDDR5X, Not Plain DDR5

Our original article mentioned only "DDR5." TNP provides far more detail:

  • 256 GB LPDDR5X per socket (not DDR5), presumably sourced from ChangXin Memory Technologies (CXMT), which demonstrated LPDDR5X at a 10.7 Gb/s per-pin data rate in late 2025
  • Uses wafer-to-wafer 3D stacking—custom DRAM dies bonded to logic wafers to reduce power and area
  • The slide explicitly states: "Customized DRAM dies reduce power and area; wafer-to-wafer 3D stacking combines DRAM and logic wafers"
  • 8 NUMA domains across the two chiplets organize this DRAM (we originally inferred a two-level NUMA; it's actually 8 domains)
  • An SDMA engine automatically manages HBM ↔ DRAM data movement

2.3 Programmable HBM Modes

The slide confirms two programmable HBM modes:

  • Cache mode: out-of-the-box bandwidth optimization
  • Flat mode: deep manual tuning for power users

3. System Hierarchy Confirmed

TNP reconstructed the full physical topology from the slides—a layer our original article did not cover:

| Tier | Configuration | Notes |
|---|---|---|
| Node | 2-socket LX2 | Basic compute unit |
| Blade | 8 nodes | PCIe 5.0 interconnects nodes within a blade |
| Frame | 16 blades = 128 nodes = 256 CPUs | Switch-interconnected blades, 30.87 PFLOPS FP64 |
| Cabinet | 2 frames | One cabinet per two frames |
| Full system (paper) | 160 frames | 20,480 nodes / 40,960 CPUs |
| Full system (HPL) | ~177 frames | 22,680 nodes / 45,360 CPUs |
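
The hierarchy above can be checked with simple multiplication. The sketch below uses only the table's counts plus the 60.3 TFLOPS FP64 per-socket peak from section 5.3; the names are illustrative. One caveat from our own arithmetic: 256 sockets at 60.3 TFLOPS come to about 15.44 PFLOPS, while the 30.87 PFLOPS listed for a frame equals 512 × 60.3 TFLOPS, which is two frames' worth. The sketch reports both values and does not try to resolve which unit the slide intended.

```python
NODES_PER_BLADE = 8
BLADES_PER_FRAME = 16
SOCKETS_PER_NODE = 2
PEAK_FP64_TFLOPS_PER_SOCKET = 60.3   # from section 5.3

nodes_per_frame = NODES_PER_BLADE * BLADES_PER_FRAME
cpus_per_frame = nodes_per_frame * SOCKETS_PER_NODE

def frames_for(nodes):
    """Frames needed for a node count (fractional, to show partial frames)."""
    return nodes / nodes_per_frame

frame_peak_pflops = cpus_per_frame * PEAK_FP64_TFLOPS_PER_SOCKET / 1000
two_frame_peak_pflops = 2 * frame_peak_pflops

print(f"nodes per frame: {nodes_per_frame}, CPUs per frame: {cpus_per_frame}")
print(f"frames (paper): {frames_for(20_480):.2f}")
print(f"frames (HPL):   {frames_for(22_680):.2f}")
print(f"peak for 256 CPUs: {frame_peak_pflops:.2f} PFLOPS")
print(f"peak for 512 CPUs: {two_frame_peak_pflops:.2f} PFLOPS")
```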

Within a frame, PCIe 5.0 switches connect blades (what TNP calls "the inexpensive way to do it"). Between frames, the LingQi network takes over.


4. Network: LingQi in Full Detail

Our original article speculated that LingQi might be an InfiniBand variant. TNP and the slides provide the complete topology:

  • 4-layer fat-tree topology (L1–L4); only L4 uses optical links—L1 through L3 are all copper
  • 184 compute frames + 32 network frames
  • L1 layer: 16 L1 switch/compute blades per compute frame
  • L2 layer: 8 L2 switch blades per compute frame
  • L3 layer: 16 L3 switch blades per network frame
  • L4 layer: 6 L4 switch blades per network frame
  • Full system: 22,000+ nodes, 200,000 ports
  • Full-system bisection bandwidth: ≥ 3.5 Pbps (petabits per second)
  • Single-hop latency: 1.07 μs
  • Per-node bandwidth: 1.6 Tb/s (2 sockets × 800 Gb/s; each LX2 integrates a 2 × 400 Gb/s NIC on die)
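
A rough sanity check on the bandwidth figures, assuming each socket's on-die NIC provides 2 × 400 Gb/s (the 800 Gbps listed in the correction summary) and using the 22,680-node HPL configuration as the node count. The ratios at the end are our own arithmetic; bisection-bandwidth definitions vary, so they are indicative at best:

```python
LINK_GBPS = 400
LINKS_PER_SOCKET = 2      # assumption: 2 x 400 Gb/s per on-die NIC
SOCKETS_PER_NODE = 2
NODES = 22_680            # HPL configuration; the slide says "22,000+"
BISECTION_PBPS = 3.5      # stated lower bound
PORTS = 200_000

socket_gbps = LINK_GBPS * LINKS_PER_SOCKET
node_tbps = socket_gbps * SOCKETS_PER_NODE / 1000
aggregate_injection_pbps = NODES * node_tbps / 1000

# Illustrative ratios only.
bisection_share = BISECTION_PBPS / aggregate_injection_pbps
ports_per_node = PORTS / NODES

print(f"per socket: {socket_gbps} Gb/s, per node: {node_tbps} Tb/s")
print(f"aggregate injection: {aggregate_injection_pbps:.2f} Pb/s")
print(f"bisection / aggregate injection: {bisection_share:.1%}")
print(f"switch ports per node: {ports_per_node:.1f}")
```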

Reliability design:

  • Credit-based flow control for lossless communication
  • Dual-plane networking with multi-rail communication
  • Link-, chip-, and cabinet-level redundancy
  • Hardware-supported telemetry with second-level data collection and proactive push
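
Credit-based flow control is a generic technique, and the toy model below illustrates the idea only. It is not LingQi's implementation, and the class and parameter names are hypothetical. The sender holds one credit per free receiver buffer slot and stalls when credits run out, so the receiver buffer never overflows and no packet is dropped:

```python
from collections import deque

class CreditLink:
    """Toy link: the sender may transmit only while it holds credits."""

    def __init__(self, buffer_slots):
        self.capacity = buffer_slots
        self.credits = buffer_slots   # one credit per free receiver slot
        self.rx_buffer = deque()

    def try_send(self, packet):
        if self.credits == 0:
            return False              # back-pressure: sender stalls, nothing is dropped
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def drain_one(self):
        if not self.rx_buffer:
            return None
        packet = self.rx_buffer.popleft()
        self.credits += 1             # freed slot returns a credit to the sender
        return packet

def simulate(num_packets=20, buffer_slots=4, drain_every=3):
    """Sender offers one packet per tick; receiver drains one every few ticks."""
    link = CreditLink(buffer_slots)
    pending = deque(range(num_packets))
    delivered, stalls, max_occupancy, tick = [], 0, 0, 0
    while pending or link.rx_buffer:
        tick += 1
        if pending:
            if link.try_send(pending[0]):
                pending.popleft()
            else:
                stalls += 1
        max_occupancy = max(max_occupancy, len(link.rx_buffer))
        if tick % drain_every == 0:
            packet = link.drain_one()
            if packet is not None:
                delivered.append(packet)
    return delivered, stalls, max_occupancy

delivered, stalls, max_occupancy = simulate()
print(f"delivered {len(delivered)} packets in order, {stalls} sender stalls, "
      f"peak buffer occupancy {max_occupancy}")
```

The slow receiver forces the sender to wait, but every packet arrives in order and the buffer never exceeds its capacity, which is the lossless property the slide refers to.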

The slide also shows a die photo of a switch ASIC—substantial in size—suggesting the LingQi switch is custom silicon rather than a commercial chip.

TNP notes that 1.07 μs single-hop latency "sounds more like Ethernet than InfiniBand," though it could be an InfiniBand implementation. Combined with credit-based flow control and lossless communication, LingQi's design target is clearly deterministic low-latency networking for HPC and AI training—not general-purpose data center Ethernet.


5. Performance and Power: New Data

5.1 HPL Efficiency

Our original article did not discuss HPL efficiency. TNP calculates an HPL computational efficiency of 80.35% (2.198 EFLOPS sustained against a peak of roughly 2.74 EFLOPS). This is remarkably high:

| System | HPL Efficiency |
|---|---|
| K (Fujitsu) | 93% (all-time record) |
| Fugaku | 82.3% |
| LineShine | 80.35% |

TNP calls it "pretty damned good," attributing it to "merging big math with a healthy core instead of separating them."
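
The efficiency figure can be approximately reproduced from numbers elsewhere in this article. The sketch assumes the peak is the HPL socket count (45,360) times the 60.3 TFLOPS FP64 per-socket peak from section 5.3. Because those inputs are rounded, the result matches TNP's 80.35% only to within rounding, and dividing by the rounded 2.74 EFLOPS gives a slightly lower value:

```python
RMAX_EFLOPS = 2.198                  # HPL result
HPL_SOCKETS = 45_360
PEAK_FP64_TFLOPS_PER_SOCKET = 60.3   # from section 5.3

# Peak rebuilt from per-socket figures (TFLOPS -> EFLOPS).
rpeak_eflops = HPL_SOCKETS * PEAK_FP64_TFLOPS_PER_SOCKET / 1e6
efficiency = RMAX_EFLOPS / rpeak_eflops

# Same ratio using the rounded peak quoted in the text.
efficiency_rounded_peak = RMAX_EFLOPS / 2.74

print(f"reconstructed peak: {rpeak_eflops:.4f} EFLOPS")
print(f"efficiency (reconstructed peak): {efficiency:.2%}")
print(f"efficiency (rounded 2.74 peak):  {efficiency_rounded_peak:.2%}")
```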

5.2 TDP and System Power

  • Per LX2: 690W
  • Full system: 42.2 MW (significantly above the < 30 MW of the three major U.S. exascale systems)

TNP's take: the extra power buys lower computational complexity—no offload model, unified HBM + DRAM address space, no GPU software stack overhead.
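
Two derived figures help frame the power numbers. Both are our own arithmetic under stated assumptions: that the 42.2 MW applies to the HPL configuration, and that TDP is a fair proxy for CPU power draw. Neither ratio comes from TNP or the slides:

```python
RMAX_EFLOPS = 2.198
SYSTEM_POWER_MW = 42.2
LX2_TDP_W = 690
HPL_SOCKETS = 45_360

# Energy efficiency on HPL: GFLOPS per watt.
gflops_per_watt = (RMAX_EFLOPS * 1e9) / (SYSTEM_POWER_MW * 1e6)

# Sum of CPU TDPs and its share of the full-system figure.
cpu_tdp_mw = HPL_SOCKETS * LX2_TDP_W / 1e6
cpu_share = cpu_tdp_mw / SYSTEM_POWER_MW

print(f"HPL efficiency: {gflops_per_watt:.2f} GFLOPS/W")
print(f"sum of CPU TDPs: {cpu_tdp_mw:.1f} MW ({cpu_share:.0%} of system power)")
```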

5.3 Per-Core Peak Performance Confirmed

The slide confirms the peak numbers cited in our original:

  • FP64: 60.3 TFLOPS per LX2 → approximately 198 GFLOPS/core
  • SVE2 + SME together cover FP64/32/16/INT8
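
The per-core figure, and what it implies per clock cycle, can be worked out as follows. The per-cycle number is an inference from the 1.55 GHz clock quoted in section 1.2, not something the slides state:

```python
PEAK_FP64_TFLOPS_PER_SOCKET = 60.3
EXPOSED_CORES = 304
CLOCK_GHZ = 1.55   # from section 1.2

gflops_per_core = PEAK_FP64_TFLOPS_PER_SOCKET * 1000 / EXPOSED_CORES

# Inferred: FP64 FLOPs per core per cycle at the stated clock.
flops_per_cycle = gflops_per_core / CLOCK_GHZ

print(f"FP64 per core: {gflops_per_core:.1f} GFLOPS")
print(f"FP64 FLOPs per core per cycle: {flops_per_cycle:.1f}")
```

The result lands very close to 128 FP64 FLOPs per cycle per core, which is at least consistent with a wide matrix unit doing most of the work.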

5.4 SME + SVE Software Stack: Empirical Results

A separate HACI 2026 slide titled "SME-Enabled, HBM-Aware Matrix Acceleration" provides real benchmark data for the LX2's SME/SVE software optimizations, citing three published papers. Our original analyzed the SME microarchitecture but did not cover these optimization results.

SME matrixization efficiency: By uniformly mapping differently-shaped matrix operations across HPC and AI workloads to SME—including multi-row-update matmul in stencils, square-tile matmul in GEMM, and tall-and-skinny matmul (QKᵀ and SV) in Transformers—efficient matrixization improves SME utilization by over 40%.

SVE + SME interleaved scheduling: SVE complements SME in scenarios where SME underperforms—single-row-update in stencils, boundary handling in GEMM, online softmax in Transformers. Interleaving SME and SVE instruction streams boosts IPC by up to 1.59×. This aligns with our original analysis of the D2AR paper's "asymmetric SME-GEMM" scheduling strategy—SME and SVE are not symmetrically mixed; SME dominates, with SVE filling pipeline gaps.
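
For readers unfamiliar with the "online softmax" mentioned above, here is a minimal scalar sketch of the general technique. It is not the SMEAtten implementation and says nothing about how the work is split between SME and SVE. It shows only the single-pass update of a running maximum and a rescaled running sum, which is the row-wise, non-matmul work that makes this step a poor fit for a matrix unit:

```python
import math

def online_softmax(scores):
    """Softmax whose normalizer is computed in one pass over the scores."""
    running_max = float("-inf")
    running_sum = 0.0
    for s in scores:
        new_max = max(running_max, s)
        # Rescale the sum accumulated so far to the new maximum, then add this term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(s - new_max)
        running_max = new_max
    return [math.exp(s - running_max) / running_sum for s in scores]

def reference_softmax(scores):
    """Conventional version: one pass for the max, another for the sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 3.0, -2.0, 0.5]
online = online_softmax(scores)
reference = reference_softmax(scores)
print(max(abs(a - b) for a, b in zip(online, reference)))
```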

Memory-aware data placement:

  • HBM buffer pool pre-allocation → memory footprint reduced by 3.9 GB
  • Blocking to keep working tiles cache-resident + packing tiles into SME-friendly layouts + prefetching non-contiguous data → cache hit rate improved by up to 28%
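
The blocking and packing ideas can be illustrated with a plain-Python tiled matrix multiply. This is a generic sketch, not the KirbyMM code: the tile size, the `pack_tile` helper, and the packed layout are all hypothetical, and a real implementation would pack into a layout chosen for the SME tile registers:

```python
def pack_tile(matrix, row0, col0, tile):
    """Copy one tile into its own small buffer (edge tiles may be smaller)."""
    rows = min(tile, len(matrix) - row0)
    cols = min(tile, len(matrix[0]) - col0)
    return [[matrix[row0 + i][col0 + j] for j in range(cols)] for i in range(rows)]

def blocked_matmul(a, b, tile=2):
    """C = A x B, computed tile by tile so each working set stays small."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for p0 in range(0, k, tile):
            a_tile = pack_tile(a, i0, p0, tile)      # packed once, reused across j0
            for j0 in range(0, n, tile):
                b_tile = pack_tile(b, p0, j0, tile)
                for i, a_row in enumerate(a_tile):
                    for p, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_tile[p]):
                            c[i0 + i][j0 + j] += a_val * b_val
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(blocked_matmul(a, b))
```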

Measured speedups:

| Workload | Speedup | Baseline | Source |
|---|---|---|---|
| Stencil | Up to 4.1× | Compiler auto-vectorization | HStencil (SC'25) |
| GEMM | 1.11–1.75× | Vendor math library | KirbyMM (DATE'26 Best Paper) |
| Attention | Avg. 13.62× | SOTA implementation | SMEAtten (Euro-Par'26) |

These results confirm a core thesis of our original article: integrating SME into a CPU is not a cosmetic addition. With deep software stack optimization, it can deliver multi-fold speedups on specific workloads. The 13.62× average on Attention (SMEAtten, Euro-Par'26) is particularly striking, suggesting that a pure-CPU architecture may find a competitive path for Transformer inference outside of GPUs.

5.5 LLM Inference Benchmark: DeepSeek at 578 TPS

The HACI 2026 system overview slide also disclosed a critical inference benchmark:

  • Single LX2 DeepSeek Decode throughput reaches 578 TPS (tokens per second)
  • Aggregate throughput reaches a partially obscured "double-" figure (context suggests double-digit or double-level total throughput)
  • NSC Shenzhen is actively advancing Qwen and other mainstream/domestic large model training and inference deployment at scale

578 TPS is noteworthy in a CPU context. For reference, a single NVIDIA H100 typically delivers ~2,000–4,000 TPS on comparable Decode workloads (highly dependent on batch size and model size), at roughly 700W—almost identical to the LX2's 690W. 578 TPS vs. 2,000–4,000 TPS means GPUs still hold a 3–7× advantage, but considering this is a homogeneous CPU architecture + first-generation SME + no GPU software stack, the number is far from weak.
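
The comparison above reduces to a few divisions. The sketch uses the figures quoted in this section and treats TDP as a stand-in for actual power draw, which is an assumption. The H100 range is the rough one given in the text, not a measured result:

```python
LX2_TPS = 578
LX2_TDP_W = 690
H100_TPS_RANGE = (2_000, 4_000)   # rough range quoted above
H100_TDP_W = 700

lx2_tps_per_watt = LX2_TPS / LX2_TDP_W
h100_tps_per_watt = tuple(tps / H100_TDP_W for tps in H100_TPS_RANGE)
throughput_gap = tuple(tps / LX2_TPS for tps in H100_TPS_RANGE)

print(f"LX2:  {lx2_tps_per_watt:.2f} tokens/s per watt")
print(f"H100: {h100_tps_per_watt[0]:.2f} to {h100_tps_per_watt[1]:.2f} tokens/s per watt")
print(f"GPU throughput advantage: {throughput_gap[0]:.2f}x to {throughput_gap[1]:.2f}x")
```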

Agentic AI inference, the central thesis of Chapter 6 of our original article, is characterized by low latency, small batches, long sequences, and sparse computation. For that profile, the CPU's combination of unified memory, SME, and SVE may well be competitive on total cost of ownership (TCO).

5.6 Supporting Systems: A Heterogeneous Facility

The same overview slide reveals that NSC Shenzhen Phase II, the facility that houses LineShine, is more than a pure-CPU cluster. It is a comprehensive computing facility:

| System | Configuration | Purpose |
|---|---|---|
| LineShine main | ARMv9 LX2 pure CPU | HPC + AI training & inference |
| Industrial computing | 1,580 X86 blades (101,120 cores), 10+ PFLOPS, 200 PB storage | Industrial simulation, traditional HPC |
| Pilot verification | 100 Kunpeng servers (12,800 cores) | Ecosystem adaptation and validation |
| 4-way / 8-way servers | 16 × 4-way + 4 × 8-way (3,328 cores total) | Large-memory workloads |

Additionally, LineShine's software ecosystem is compatible with 400+ mainstream HPC applications, with a toolchain including compilers, debuggers, and performance tuning tools.


6. Another System: CNIS

TNP's article also describes a second exascale-class system mentioned in the same NSC Shenzhen paper—China New-generation Intelligent Supercomputer (CNIS), a CPU+GPU heterogeneous system:

  • 5,632 nodes
  • 2 × 64-core CPUs + 8 GPUs per node
  • GPU peaks: 32.7 TFLOPS FP64 / 65.5 TFLOPS FP32 / 470 TFLOPS FP16
  • GPU memory: 64 GB HBM, 1.8 TB/s bandwidth
  • Interconnect: InfiniBand-like RDMA network, 3-layer Clos dual-plane topology, 4 × 400 Gb/s per node
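
Aggregate figures for CNIS follow from the per-node numbers above. The totals below are our own multiplication, not values reported by TNP, and the FP64 peak counts GPUs only (the CPUs' contribution is unknown and left out):

```python
NODES = 5_632
GPUS_PER_NODE = 8
CPUS_PER_NODE = 2
CORES_PER_CPU = 64
GPU_FP64_TFLOPS = 32.7
GPU_HBM_GB = 64
LINKS_PER_NODE = 4
LINK_GBPS = 400

total_gpus = NODES * GPUS_PER_NODE
total_cpu_cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
gpu_fp64_peak_eflops = total_gpus * GPU_FP64_TFLOPS / 1e6
total_hbm_pb = total_gpus * GPU_HBM_GB / 1e6
node_network_tbps = LINKS_PER_NODE * LINK_GBPS / 1000

print(f"GPUs: {total_gpus:,}, CPU cores: {total_cpu_cores:,}")
print(f"GPU-only FP64 peak: {gpu_fp64_peak_eflops:.3f} EFLOPS")
print(f"total GPU HBM: {total_hbm_pb:.2f} PB")
print(f"network per node: {node_network_tbps} Tb/s")
```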

Our original article did not mention CNIS. TNP notes the GPU's origin is "unknown but presumably indigenous."


7. Correction Summary

| Item | Original | Corrected | Source |
|---|---|---|---|
| LX2 designer | Unspecified | NSC Shenzhen + Huawei HiSilicon | TNP |
| Process | SMIC 7nm (inferred) | SMIC 7nm N+3 (confirmed) | TNP |
| CPU count | ~47,000 | 40,960 (paper) / 45,360 (HPL) | TNP / 芯智讯 |
| HBM capacity | 32 GB per socket | 64 GB per socket (2 × 32 GB chiplets) | TNP / Slides |
| HBM bandwidth | 4 TB/s | 8 TB/s per socket (2 × 4 TB/s chiplets) | TNP |
| DRAM type | DDR5 | LPDDR5X, wafer-to-wafer 3D stacking | Slides |
| DRAM capacity | Unspecified | 256 GB per socket | TNP |
| Chiplet raw cores | Not mentioned | 192 cores/chiplet, 304 active (79.2% yield) | TNP |
| LX2 TDP | Not mentioned | 690W | Slides |
| Full-system power | Not mentioned | 42.2 MW | TNP |
| HPL efficiency | Not mentioned | 80.35% | TNP |
| On-die NIC | Not mentioned | 800 Gbps | Slides |
| LingQi single-hop latency | Not mentioned | 1.07 μs | TNP |
| NUMA domains | Inferred 2-level | 8 domains (confirmed) | Slides |
| CNIS system | Not mentioned | 5,632-node CPU+GPU heterogeneous | TNP |
| DeepSeek inference | Not mentioned | 578 TPS per LX2 Decode | Slides |
| LingQi port scale | Not mentioned | 200,000 ports | Slides |
| LingQi flow control | Not mentioned | Credit-based, lossless | Slides |
| Telemetry | Not mentioned | Hardware-supported, second-level, proactive push | Slides |
| Software ecosystem | Not mentioned | 400+ compatible apps, full toolchain | Slides |
| Supporting systems | Not mentioned | X86 industrial + Kunpeng verification clusters | Slides |

Sources:

  • Timothy Prickett Morgan, "A Deep Dive On China's 'LineShine' All-CPU, Exaflops-Class Supercomputer", The Next Platform, June 25, 2026
  • HACI 2026 LineShine presentation slides (publicly shared by Torsten Hoefler / Tadashi Ogawa)
  • NSC Shenzhen, "Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials", arXiv, April 17, 2026
  • 芯智讯, "2.198 EFLOPS! China Returns to #1 in Supercomputing After 8 Years", June 24, 2026