Nvidia has a bit of a problem. Popular generative AI workloads like code assistants and agentic systems generate massive quantities of tokens and need to move them at speed. But the GPU giant's chips currently struggle to deliver.
That will start to change next week when Nvidia CEO Jensen Huang uses his company’s GPU Technology Conference (better known as GTC) to explain how he will use the token-spewing accelerator tech the company picked up with its acquisition of upstart Groq late last year.
Market-watching firm SemiAnalysis' latest InferenceX benchmarks show how Groq's tech helps to fill the gap in Nvidia’s current portfolio.
InferenceX's efficiency Pareto curve breaks down into three main categories: bulk tokens on the left, expensive low-latency tokens on the right, and the so-called "Goldilocks zone" in the middle.
While Nvidia's NVL72 rack systems scale well at lower per-user token generation rates, they become progressively less efficient as user interactivity increases.
By contrast, SRAM-heavy architectures, like those championed by Groq and Cerebras, excel in latency-sensitive scenarios, often achieving token generation rates exceeding 500 or even 1,000 tokens a second. That’s far more than GPU-based architectures can deliver.
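The reason comes down to memory bandwidth. At low batch sizes, generating each token means streaming essentially the full set of model weights out of memory, so per-user decode speed is capped by how fast the chip can read them. Here's a rough back-of-envelope sketch of that ceiling; all of the numbers are illustrative assumptions rather than vendor figures.

```python
# Back-of-envelope: batch-1 decode speed is roughly memory bandwidth divided by
# the bytes of model weights that must be streamed per token. Illustrative only --
# it ignores KV-cache reads, interconnect hops, and kernel overheads.

def decode_ceiling_tokens_per_sec(bandwidth_tb_s: float, params_billions: float,
                                  bytes_per_param: float) -> float:
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return (bandwidth_tb_s * 1e12) / weight_bytes

# Assumed model: 70B parameters stored as 8-bit weights (~70 GB streamed per token).
PARAMS_B, BYTES_PER_PARAM = 70, 1.0

# Assumed memory bandwidths: ~8 TB/s of HBM for a current-generation GPU versus
# tens of TB/s of aggregate on-chip SRAM for a Groq- or Cerebras-style design.
for name, bw_tb_s in [("HBM GPU (assumed 8 TB/s)", 8), ("SRAM design (assumed 80 TB/s)", 80)]:
    ceiling = decode_ceiling_tokens_per_sec(bw_tb_s, PARAMS_B, BYTES_PER_PARAM)
    print(f"{name}: ~{ceiling:,.0f} tokens/sec per user, best case")
```

Batching many users together amortizes those weight reads, which is why GPUs still dominate the bulk-token end of the curve even as they fall behind on single-user speed.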
In fact, this capability is how Cerebras won OpenAI's business earlier this year to power its Codex model. Nvidia didn’t own anything to match Cerebras until it acquired Groq's intellectual property and talent for a staggering $20 billion in December.
By combining its GPU tech and CUDA software libraries with Groq's dataflow architecture, Nvidia has the opportunity to raise the Pareto curve dramatically, reducing the cost per token, while at the same time bolstering output speeds.
Extending Nvidia's CUDA software stack to cover Groq's dataflow architecture will not be easy. At GTC, Nvidia might announce it will add limited support for Groq's existing architecture relatively quickly.
This GTC already feels a bit different, as Nvidia spilled the beans on its Rubin GPUs back at CES in January.
To recap, Rubin packs up to 288 GB of HBM4 memory good for 22 TB/s of bandwidth and 35-50 petaFLOPS of dense NVFP4 performance depending on the use case.
The launch represents a major performance uplift over Nvidia’s current Blackwell-generation parts, delivering 5x the dense floating point throughput. So far, Nvidia has announced the chips will be available in either an eight-way HGX platform or its NVL72 rack system, which as the name suggests, crams 72 Rubin SXM modules into a single system.
There's also Rubin GPX, announced back at Computex in June 2025, which will slot into select NVL racks to provide additional compute capacity for large-context and video processing workflows.
We expect to see Huang hammer on the performance optimizations and efficiency gains delivered by Nvidia's growing portfolio of GPUs. But with those GPUs growing ever hotter – estimates put Rubin’s thermal design power at 1.8 kW or perhaps even higher – liquid cooling isn’t optional. Some buyers may balk at that requirement, which would benefit AMD and its air-cooled kit.
However, given the generational gains delivered by the Rubin architecture, there's nothing stopping Nvidia from releasing a single-die, air-cooled version of the chip with five or six HBM stacks rather than eight. Such a chip would still deliver roughly half the dual-die part's gain, a 2.5x uplift over Blackwell – without requiring liquid cooling.
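To spell out that arithmetic, here is the same speculation in numbers. Every figure below is an assumption extrapolated from the dual-die part's disclosed specs, not anything Nvidia has announced.

```python
# Speculative arithmetic for a hypothetical single-die, air-cooled Rubin.
# All inputs are assumptions derived from the dual-die part's public specs.

DUAL_DIE_UPLIFT_VS_BLACKWELL = 5.0   # stated dense NVFP4 gain for dual-die Rubin
DUAL_DIE_TDP_KW = 1.8                # estimated dual-die thermal design power
DUAL_DIE_BANDWIDTH_TB_S = 22.0       # stated HBM4 bandwidth across eight stacks
STACKS_DUAL, STACKS_SINGLE = 8, 6    # assumed stack counts

uplift = DUAL_DIE_UPLIFT_VS_BLACKWELL / 2                        # one die, half the compute
tdp_kw = DUAL_DIE_TDP_KW / 2                                     # assume power roughly halves too
bandwidth = DUAL_DIE_BANDWIDTH_TB_S * STACKS_SINGLE / STACKS_DUAL

print(f"~{uplift:.1f}x Blackwell, ~{tdp_kw:.1f} kW, ~{bandwidth:.1f} TB/s")
# ~2.5x Blackwell, ~0.9 kW, ~16.5 TB/s -- plausibly within reach of air cooling
```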
That's just speculation, but we have a sneaking suspicion we might see something along these lines during next week’s festivities.
Alongside its latest datacenter GPUs, we anticipate more details on Nvidia's standalone Vera CPU.
First teased at last year's GTC, Vera features 88 custom Arm cores that add support for simultaneous multithreading and a slew of confidential computing features previously available only on x86 platforms.
So far, we've only seen the CPU packaged as part of Nvidia's Vera-Rubin superchip. However, we've since learned Nvidia will offer the chip as a standalone processor that will compete with Intel and AMD for some mainstream applications.
Previously, Nvidia had offered Grace CPU superchips, but those were primarily for use in supercomputers and other HPC applications. However, last month the GPU giant revealed Meta would be its first partner to deploy Grace at scale and that the Social Network was already evaluating Vera CPUs for use in its datacenters as well.
Alongside new datacenter silicon, we also anticipate Huang will share more details about Nvidia's next-gen Kyber racks and Feynman GPUs, which should debut in 2027 and 2028, respectively.
We first saw the Kyber rack at last year's GTC. The 600 kW behemoth is set to cram 144 GPU sockets, each packing four Rubin Ultra GPU dies, into a standard rack form factor.
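Those numbers imply a hefty per-socket power budget. A quick sanity check, using our own back-of-envelope arithmetic rather than anything Nvidia has published, with the non-GPU share of the rack being a pure guess:

```python
# Rough per-socket power budget for a Kyber-class rack, using the figures above.
# Back-of-envelope only; the non-GPU overhead fraction is an assumption.

RACK_POWER_KW = 600      # stated rack budget
GPU_SOCKETS = 144        # stated socket count
DIES_PER_SOCKET = 4      # Rubin Ultra dies per socket
NON_GPU_SHARE = 0.20     # assumed share for CPUs, NICs, switches, and conversion losses

gpu_budget_kw = RACK_POWER_KW * (1 - NON_GPU_SHARE)
per_socket_kw = gpu_budget_kw / GPU_SOCKETS
per_die_w = per_socket_kw * 1000 / DIES_PER_SOCKET

print(f"~{per_socket_kw:.1f} kW per socket, ~{per_die_w:.0f} W per die")
# ~3.3 kW per socket, ~833 W per die -- far beyond what air can move at rack scale
```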
Nvidia disclosed the existence of Kyber in part because datacenter operators were already struggling with the 120 kW NVL72 systems announced the year before. By revealing Kyber, Nvidia lit a fire under datacenter physical infrastructure providers so they could provision the power supplies and cooling kit necessary to support such a system by 2027. With a yearly release cadence, Nvidia can't wait for the rest of the industry to catch up – it must telegraph its next move years in advance.
With Feynman just two years out, we suspect Huang may repeat the exercise, setting new power and cooling targets, likely exceeding a megawatt per rack.
Nvidia has long been rumored to be working on an Arm-based system-on-chip for PCs.
A part capable of doing that job arrived last year in the form of the DGX Spark and the GB10 partner systems that put it to work. So far, however, OEMs have only used the chip in workstation-class mini-PCs running Linux. Recent reports indicate Nvidia is working with the likes of Lenovo and Dell to bring a similar product to the Windows PC market.
As we previously reported, Nvidia is also working with Intel to integrate its GPU dies into Chipzilla's next-gen processors.
GTC seems like as good a time as any to throw gamers a bone and give Nvidia a new market to chase beyond its side hustles in the pro-visualization market.
Integrated Nvidia graphics might not be the RTX 50 Super series cards that many had hoped to see at CES, but given the state of the memory market, it seems unlikely those cards will make an appearance at GTC either.
Beyond big iron and the remote possibility of some consumer hardware, you can bet on OpenClaw being a major talking point at GTC.
Jensen Huang is apparently quite fond of the agentic framework in spite of its many security vulnerabilities, reportedly describing it as the "most important software release probably ever."
The company is reportedly working on its own, presumably safer, version of the platform called NemoClaw.
Speaking of claws, we also expect to see a fair few more robots take the stage. Since announcing its Isaac GR00T robotics platform nearly two years ago, Nvidia has launched a steady supply of new toolkits, frameworks, and hardware development platforms aimed at giving generative AI physical form.
And to teach them to function in an unpredictable world, you can count on Nvidia's Omniverse digital twin platform to make another appearance. Introduced in 2019 at a time of rising Metaverse hype, the platform provides a virtual environment in which physical processes can be simulated before they're implemented in real life.
Developers have since integrated Omniverse into a variety of simulation platforms, including those used to design and build AI bit barns.
El Reg will be on the ground in San Jose next week for GTC to bring you the latest news from what has become one of the world’s most-watched tech conferences. ®