HPE is throwing its weight behind AMD's Helios rack-scale architecture and will offer the design as part of its AI portfolio next year, complete with a purpose-built Juniper Networks scale-up switch.
Announced ahead of HPE's Discover event in Barcelona this week, the system will, HPE claims, make it one of the first vendors to offer a turnkey rack for large-scale AI training and inference based on AMD's reference design.
Helios is intended to operate a rack full of accelerator-equipped nodes as if they were a single large GPU, much like Nvidia's DGX GB200 NVL72 system, to which Helios is pitched as a rival.
As far as AMD is concerned, Helios will be a vehicle for its next-generation Instinct MI455X GPUs and its sixth-generation Epyc CPUs, codenamed Venice, both of which are due next year. That is why HPE can only say it will offer its Helios AI Rack worldwide sometime in 2026.
[A slide from HPE's presentation on the network switch]
The networking for this system will be a scale-up Ethernet implementation running UALink over Ethernet, featuring a purpose-built Juniper Networks switch based on Broadcom's Tomahawk 6 network silicon, which boasts 102.4 Tbps of aggregate bandwidth.
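For a sense of scale, that 102.4 Tbps is the ASIC's total switching capacity, which can be carved into different front-panel port configurations. A quick back-of-envelope sketch follows; the port speeds are Broadcom's published Tomahawk 6 options, not anything HPE has confirmed for the Juniper box:

```python
# Rough port arithmetic for a 102.4 Tbps switch ASIC. Port speeds are
# Broadcom's published Tomahawk 6 options; how HPE's Juniper switch is
# actually configured is an assumption here, not a confirmed spec.
AGGREGATE_GBPS = 102_400  # 102.4 Tbps expressed in Gbps

for port_gbps in (1600, 800, 400):  # 1.6TbE, 800GbE, 400GbE
    print(f"{port_gbps} Gbps ports: {AGGREGATE_GBPS // port_gbps}")

# Prints:
# 1600 Gbps ports: 64
# 800 Gbps ports: 128
# 400 Gbps ports: 256
```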
UALink, or Ultra Accelerator Link, is an open-standard alternative to Nvidia's NVLink technology for interconnecting clusters of GPUs, the specifications for which were published earlier this year.
However, HPE and Broadcom, which is also involved in developing the scale-up switch, evidently believe you don't need dedicated UALink hardware if you can simply run the protocol over Ethernet, so that's what is happening here.
"This is an industry first scale-up solution using Ethernet, standard Ethernet. So that means it's 100 percent open standard and avoids proprietary vendor lock-in, leverages proven HPE Juniper networking technology to deliver scale and optimal performance for AI workloads," said Rami Rahim, president and general manager of HPE's networking business and former CEO of Juniper Networks prior to its acquisition.
HPE claims that this will enable its rack-scale system to support the traffic necessary for trillion-parameter model training, plus high inference throughput.
Helios is based on the double-width Open Rack Wide (ORW) specifications developed by Meta within the Open Compute Project (OCP). It supports modular trays, has liquid cooling capability, and is ideal for power-constrained environments, according to Rahim.
With 72 Instinct MI455X GPUs per rack, HPE says its rack-scale system will be capable of 260 TB/s of aggregated bandwidth and up to 2.9 exaFLOPS of 4-bit floating-point performance for handling large AI models.
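Dividing those rack-level claims back out per GPU is a useful sanity check. A minimal sketch, using only the figures above (HPE quotes the rack totals; the per-GPU numbers are simply derived from them):

```python
# Per-GPU figures derived from HPE's rack-level claims; HPE quotes only
# the rack totals, so the per-GPU numbers here are simple division.
GPUS_PER_RACK = 72
RACK_SCALEUP_TB_S = 260    # aggregated scale-up bandwidth, TB/s
RACK_FP4_EXAFLOPS = 2.9    # 4-bit floating-point performance, exaFLOPS

bw_per_gpu = RACK_SCALEUP_TB_S / GPUS_PER_RACK           # ~3.6 TB/s
fp4_per_gpu = RACK_FP4_EXAFLOPS * 1000 / GPUS_PER_RACK   # ~40 PFLOPS

print(f"Scale-up bandwidth per GPU: {bw_per_gpu:.1f} TB/s")
print(f"FP4 throughput per GPU:     {fp4_per_gpu:.0f} PFLOPS")
```

The roughly 40 petaFLOPS of FP4 per GPU that falls out appears consistent with figures AMD has previously floated for its MI400-series parts.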
This is unlikely to come cheap, of course, with Nvidia's rival GB200 NVL72 setups reportedly selling for nearly $3.5 million each. Perhaps it is no surprise that HPE lists cloud service providers and especially neoclouds as the primary target for this kit, rather than enterprise customers. ®
Source: The Register