Tenstorrent’s Galaxy Blackhole AI servers escape the event horizon

Tenstorrent on Tuesday announced the general availability of its Galaxy Blackhole AI compute platform.

Each of the startup's 6U systems is packed with 32 of the Blackhole accelerators we looked at last fall. The chips are interconnected in a dense Ethernet mesh with 100 Tbps of aggregate bandwidth.

Combined, Tenstorrent says each Galaxy system features 1 TB of GDDR6, 16 TB/s of memory bandwidth, and 23 petaFLOPS of dense FP8 performance, all in a system that'll set you back only $110,000.
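Dividing those system totals by the 32 accelerators gives a rough sense of the per-chip figures. This is our back-of-the-envelope arithmetic, not official per-chip specs from Tenstorrent:

```python
# Per-chip figures implied by Tenstorrent's quoted Galaxy system totals.
# Derived by simple division; not official per-accelerator specifications.
chips = 32

system_memory_tb = 1.0        # 1 TB of GDDR6 per system
system_bandwidth_tbps = 16.0  # 16 TB/s aggregate memory bandwidth
system_fp8_pflops = 23.0      # 23 petaFLOPS dense FP8

print(f"Memory per chip:    {system_memory_tb * 1024 / chips:.0f} GB")        # 32 GB
print(f"Bandwidth per chip: {system_bandwidth_tbps * 1000 / chips:.0f} GB/s") # 500 GB/s
print(f"FP8 per chip:       {system_fp8_pflops * 1000 / chips:.1f} TFLOPS")   # ~718.8 TFLOPS
```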

To put that in perspective, Nvidia's eight-way DGX boxes, while faster and higher capacity, will set you back somewhere between three and five times that.

However, Tenstorrent's mesh network isn't limited to a single node. Much like Google's TPU or Amazon's Trainium2 clusters, it can be extended to support larger models, higher throughput, or more interactive user experiences by adding more systems and adjusting the ratio of tensor and pipeline parallelism.

Tenstorrent's base Galaxy Supercluster will set you back $440,000 and features four Blackhole systems, but the architecture can support up to 32 nodes with more than a thousand chips.

Curious about Tenstorrent's Blackhole chips? Check out our hands-on review here.

Jasmina Vasiljevic, senior fellow at Tenstorrent, tells us the software stack has improved considerably since we first went hands-on with the hardware. At the time, model support was quite limited and what did run hadn't been optimized for the hardware yet. This mismatch resulted in generally poor performance scaling in our testing.

We're told this is no longer the case, and that considerable effort has gone not only into porting new models to the hardware but also into improving performance, despite Tenstorrent having downgraded the chip's rated performance just a few months earlier.

At least for DeepSeek V3, Tenstorrent claims its four-node Blackhole Galaxy Superclusters can process a 100,000 token prompt — the equivalent of 166 pages of text — in less than four seconds.
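Taken at face value, that claim implies a prefill throughput of at least 25,000 tokens a second across the four-node cluster. The arithmetic is ours, from Tenstorrent's stated figures:

```python
# Minimum prefill rate implied by Tenstorrent's claim of processing
# a 100,000-token prompt in under four seconds on a four-node cluster.
prompt_tokens = 100_000
time_limit_s = 4.0

min_prefill_rate = prompt_tokens / time_limit_s
print(f"{min_prefill_rate:,.0f} tokens/s")  # 25,000 tokens/s
```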

Meanwhile, we're told the systems can churn out up to 300 tokens a second per user, and that the company expects to increase that to 350 through software refinements in the near future.

We'll note that Tenstorrent doesn't specify the batch size used in these tests, which is an important metric for evaluating how an AI system will scale in production. Achieving 350 tokens a second for a single user is far less impressive than sustaining that rate across 32 or 64 concurrent users.

Tenstorrent does say that it's able to scale effectively from a batch size of eight all the way up to 64 on the platform, depending on throughput and interactivity demands.
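If the 300-token-a-second per-user rate held across those batch sizes — an assumption Tenstorrent hasn't confirmed, and per-user rates typically fall as batch size grows — the upper bound on aggregate decode throughput would look like this:

```python
# Upper-bound aggregate throughput, assuming (optimistically) that the
# claimed 300 tokens/s per user is sustained at every batch size.
per_user_tps = 300

for batch in (8, 32, 64):
    print(f"batch {batch:>2}: up to {per_user_tps * batch:,} tokens/s aggregate")
```

In practice the interesting question is where on that curve per-user interactivity starts to degrade, which is exactly the data the company hasn't published.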

In addition to large language models, Tenstorrent is positioning Galaxy Blackhole as an ideal platform for video generation. On a four-node supercluster, the startup says it can generate 720p video faster than real-time. 

Vasiljevic tells us additional frontier models like Moonshot AI's Kimi K2 are in the works, and her team has developed a Python-based programming interface for writing optimized kernels in order to keep bringing new models to the platform.

"Ninety percent of models from Hugging Face just run on Tenstorrent," the company wrote in a release. This is a big claim and one we look forward to putting to the test.

If you'd prefer to try before you buy, Tenstorrent's hardware is seeing adoption by several large datacenter, colocation, and neocloud providers, including Cirrascale, Equinix, and Japan's ai&. We expect the chip startup to share more during its TT-Deploy event on May 1. ®

Source: The Register
