Huawei’s new AI accelerators smash Nvidia but are huge energy hogs

WHY THIS MATTERS IN BRIEF

At the moment Huawei is prioritising raw performance over energy consumption to beat Nvidia’s fastest GPUs; next it will try to beat them on everything else.

 

Love the Exponential Future? Join our XPotential Community, future-proof yourself with courses from XPotential University, read about exponential tech and trends, connect, watch a keynote, or browse my blog.

Unable to use leading-edge process technologies to produce its high-end Artificial Intelligence (AI) processors, Huawei has had to spend billions of dollars developing its own range of GPUs and EUV lithography equipment, and to rely on brute force – in other words, installing more processors than its industry competitors to achieve comparable performance for the AI models it’s been developing.

 

To do this, Huawei took a multifaceted approach that includes its dual-chiplet HiSilicon Ascend 910C processor, optical interconnects, and the Huawei AI CloudMatrix 384 rack-scale system, which relies on proprietary software, reports SemiAnalysis. The whole system delivers 2.3 times lower performance per watt than Nvidia’s gut-busting GB200 NVL72, but it still enables Chinese companies to train advanced AI models.

Huawei’s CloudMatrix 384 is a rack-scale AI system composed of 384 Ascend 910C processors arranged in a fully optical, all-to-all mesh network. The system spans 16 racks: 12 compute racks housing 32 accelerators each, and four networking racks facilitating high-bandwidth interconnects using 6,912 800G LPO optical transceivers.
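
As a quick sanity check on those figures, the arithmetic below works out the accelerator count and the aggregate optical bandwidth they imply – a minimal sketch that simply takes each 800G transceiver at face value as 800 Gb/s:

```python
# Back-of-the-envelope check of the CloudMatrix 384 topology figures
# quoted above; assumes each 800G LPO transceiver carries 800 Gb/s.

compute_racks = 12
accelerators_per_rack = 32
transceivers = 6_912
gbps_per_transceiver = 800

accelerators = compute_racks * accelerators_per_rack
print(f"Accelerators: {accelerators}")  # 384, matching the system's name

aggregate_tbps = transceivers * gbps_per_transceiver / 1_000
print(f"Aggregate optical bandwidth: {aggregate_tbps:,.1f} Tb/s")  # ~5,529.6 Tb/s

per_accelerator = transceivers / accelerators
print(f"Transceivers per accelerator: {per_accelerator:.0f}")  # 18 each
```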

Unlike traditional systems that use copper wires for interconnections, CloudMatrix relies entirely on optics for both intra- and inter-rack connectivity, enabling extremely high aggregate communication bandwidth. The CloudMatrix 384 is an enterprise-grade machine that features fault-tolerant capabilities and is designed for scalability.

 


In terms of performance, the CloudMatrix 384 delivers approximately 300 PFLOPs of dense BF16 compute, nearly twice the throughput of Nvidia’s GB200 NVL72 system (which delivers about 180 BF16 PFLOPs). It also offers 2.1 times more total memory bandwidth, despite using older HBM2E, and over 3.6 times greater HBM capacity. The machine also features 2.1 times higher scale-up bandwidth and 5.3 times higher scale-out bandwidth thanks to its optical interconnects.

However, these performance advantages come with a trade-off: the system is 2.3 times less power-efficient per FLOP, 1.8 times less efficient per TB/s of memory bandwidth, and 1.1 times less efficient per TB of HBM capacity than Nvidia’s system.
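
Taken together, the throughput and efficiency figures above imply how much more power the Huawei system draws in total – a minimal sketch using only the numbers quoted in this article:

```python
# Implied total power draw of the CM384 relative to Nvidia's GB200 NVL72,
# derived only from the figures quoted above.

cm384_pflops = 300   # dense BF16 compute
nvl72_pflops = 180   # dense BF16 compute
flop_efficiency_penalty = 2.3  # CM384 is 2.3x less power-efficient per FLOP

perf_ratio = cm384_pflops / nvl72_pflops            # ~1.67x the compute
power_ratio = perf_ratio * flop_efficiency_penalty  # ~3.8x the power draw
print(f"The CM384 draws roughly {power_ratio:.1f}x the power of a GB200 NVL72")
```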

But this does not really matter, as Chinese companies (including Huawei) cannot access Nvidia’s GB200 NVL72 anyway. So, if they want truly high performance for AI training, they will be more than willing to invest in Huawei’s CloudMatrix 384.

 


At the end of the day, the average electricity price in mainland China has declined from $90.70 per MWh in 2022 to $56 per MWh in some regions in 2025, so users of Huawei’s CM384 aren’t likely to go bankrupt because of power costs. So, for China, where energy is abundant but advanced silicon is constrained, Huawei’s brute force approach to AI development – while not ideal – seems to work just fine.
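
To put that in numbers, here is a rough annual power bill for one CM384 at the quoted $56 per MWh. The system’s total draw isn’t stated above, so the ~560 kW used here is an assumption for illustration only:

```python
# Rough annual electricity cost for one CloudMatrix 384.
# The ~560 kW total system draw is an ASSUMPTION (not a figure from this
# article); the $56/MWh price is the 2025 regional price quoted above.

assumed_power_kw = 560
price_usd_per_mwh = 56
hours_per_year = 24 * 365

annual_mwh = assumed_power_kw / 1_000 * hours_per_year  # ~4,906 MWh
annual_cost_usd = annual_mwh * price_usd_per_mwh
print(f"~${annual_cost_usd:,.0f} per year")  # ~$274,714
```

Even at the 2022 price of $90.70 per MWh, the same assumed draw works out to roughly $445,000 a year – real money, but small next to the cost of the hardware itself.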
