NVIDIA’s Blackwell Ultra tops all MLPerf training tests, claims 10 minute Llama 3.1 405B run
NVIDIA says its GB300 NVL72 rack, powered by Blackwell Ultra GPUs, placed first across every MLPerf AI training benchmark, widening the gap with rivals and setting a headline result of training Llama 3.1 with 405 billion parameters in about 10 minutes using 5,120 GPUs.
Record results and speedups
In the latest MLPerf Training round, NVIDIA reports wins in all seven tests, including:
- Llama 3.1 405B, 10 minutes
- Llama 2 70B LoRA, 0.4 minutes
- Llama 3.1 8B, 5.2 minutes
- FLUX.1, 12.5 minutes
- DLRM-DCNv2, 0.71 minutes
- R-GAT, 1.1 minutes
- RetinaNet, 1.4 minutes
NVIDIA cites large generational gains at equal GPU counts. In Llama 3.1 405B pretraining, GB300 delivers more than 4 times the performance of H100 and nearly 2 times that of the Blackwell GB200. For Llama 2 70B fine tuning, eight GB300 GPUs post 5 times the performance of H100.
What is inside GB300 NVL72
The GB300 NVL72 is a rack scale system tied together with Quantum-X800 InfiniBand at 800 gigabits per second (Gb/s). Each GPU includes 279 GB of HBM3e, and NVIDIA quotes a combined GPU plus CPU memory footprint of about 40 TB per rack. The company points to the CUDA software stack as a key differentiator, along with system level optimizations.
A shift to FP4 precision for training is central to the throughput gains. NVIDIA says FP4 roughly doubles calculation speed versus FP8, and that Blackwell Ultra pushes effective training performance to about 3 times FP8 levels in compatible workloads.
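To give a feel for how coarse a 4-bit float is, here is a minimal Python sketch of rounding a block of values to FP4. It assumes an E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) with one shared scale per block, as used by common block-scaled FP4 schemes. The article does not say which exact format or scaling recipe NVIDIA uses, and the function name `quantize_fp4_block` is hypothetical.

```python
# Magnitudes representable by an E2M1 4-bit float (assumed format, see note above).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4_block(values):
    """Round a block of floats to the nearest FP4 value and return the dequantized result.

    Illustrative only: one scale per block maps the largest magnitude onto 6.0,
    the largest E2M1 value. Real hardware formats define their own block sizes
    and scale encodings.
    """
    amax = max(abs(v) for v in values)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    out = []
    for v in values:
        mag = abs(v) / scale
        # Pick the closest representable magnitude on the 8-point grid.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out


print(quantize_fp4_block([6.0, -3.0, 1.0, 0.4, 2.4]))
# prints [6.0, -3.0, 1.0, 0.5, 2.0]
```

With only eight magnitudes per sign, the per-block scale carries most of the dynamic range, which is why adoption depends on frameworks and training recipes that keep accuracy intact at this precision.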
Why it matters
MLPerf is the most cited cross vendor benchmark suite for AI training. Clearing all training tests strengthens NVIDIA’s case that Blackwell will remain the default choice for large scale model work. The 10 minute Llama 3.1 405B figure, if replicated in customer environments, would compress experimentation cycles and reduce cluster time for foundation model pretraining and large fine tunes.
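As a rough illustration of the cluster-time point, the headline figures can be converted to GPU-hours. The sketch below uses only the numbers reported above (5,120 GPUs, about 10 minutes) and assumes the reported time is wall-clock time with every GPU busy throughout. The helper name `gpu_hours` is hypothetical.

```python
def gpu_hours(num_gpus, minutes):
    """Convert a wall-clock run on num_gpus GPUs into total GPU-hours."""
    return num_gpus * minutes / 60.0


# Llama 3.1 405B benchmark run: 5,120 GPUs for about 10 minutes.
llama_405b = gpu_hours(5120, 10)
print(f"{llama_405b:.0f} GPU-hours")
# prints 853 GPU-hours
```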
What comes next
NVIDIA says the new results build on its June submission and reflect software and networking refinements rather than higher GPU counts. Attention now turns to power, availability and total cost for GB300 NVL72 deployments, along with how quickly frameworks and toolchains adopt FP4 across the stack.
