Research Note

All Eyes on NVIDIA

April 2024
 

Abstract

LightCounting covers networking and interconnect highlights from GTC 2024

Aside from CEO Jensen Huang, the DGX GB200 NVL72 was the star of the GTC 2024 keynote. The rack-scale system integrates 72 next-generation Blackwell GPUs connected by NVLink to form “1 Giant GPU.” Jensen’s description of the NVLink passive-copper “backplane” caused a brief panic among investors who believed it somehow replaced InfiniBand, which it does not. The NVL72 represents next-generation AI systems, but Nvidia also revealed new details of its deployed Hopper-generation clusters. Next-generation 800G (XDR) InfiniBand won’t reach customers until 2025, so early Blackwell systems will use 400G (NDR) InfiniBand instead.


Photo by LightCounting

Jensen said the Hopper-generation EOS supercomputer had just come online. This cluster uses 608 NDR switches with 64 ports each for a total of 38,912 switch ports. This system places the leaf switches in racks at the end of the row, so all InfiniBand links employ optical transceivers. We estimate the servers add 5,120 ports for a system total of 44,032 NDR ports. Because Nvidia uses what it calls “twin-port OSFP” 800G transceivers, each transceiver serves two NDR ports. Thus, we estimate the complete EOS system uses about 22,000 800G optical transceivers.
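The port and transceiver estimate above can be reproduced with a short Python sketch. The constants are the figures quoted in this note; the function name `eos_port_budget` and the assumption that every counted port is populated with optics are ours, for illustration only.

```python
# Figures quoted in the note for the EOS cluster.
NDR_SWITCHES = 608            # NDR InfiniBand switches
PORTS_PER_SWITCH = 64         # NDR ports per switch
SERVER_PORTS = 5_120          # estimated server-side NDR ports
PORTS_PER_TRANSCEIVER = 2     # one "twin-port OSFP" 800G module serves two NDR ports


def eos_port_budget():
    """Return switch ports, total NDR ports, and 800G transceivers.

    Assumption (ours): every counted port is populated with optics,
    which matches the note's statement that all links are optical.
    """
    switch_ports = NDR_SWITCHES * PORTS_PER_SWITCH
    total_ports = switch_ports + SERVER_PORTS
    transceivers = total_ports // PORTS_PER_TRANSCEIVER
    return {
        "switch_ports": switch_ports,
        "total_ports": total_ports,
        "transceivers": transceivers,
    }


if __name__ == "__main__":
    for name, value in eos_port_budget().items():
        print(f"{name}: {value:,}")
```

Under these assumptions the sketch gives 38,912 switch ports, 44,032 total NDR ports, and 22,016 transceivers, which rounds to the "about 22,000" quoted above.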

Blackwell-generation GPUs include 5th-generation NVLink, which doubles the interconnect bandwidth compared with Hopper. It does this by doubling the per-lane speed to 200Gbps, which results in 400Gbps of unidirectional bandwidth for each NVLink x2 port. Each Blackwell GPU includes 18 ports, which deliver 1.8TB/s (14.4Tbps) of aggregate bidirectional bandwidth. To connect 72 GPUs in the NVL72 rack, Nvidia developed the NVLink5 switch chip. The NVL72 rack includes nine NVLink switch trays with two ASICs each. Using 5,184 passive-copper (DAC) cables, the switches deliver all-to-all GPU connectivity within the rack.
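The NVLink figures above follow from simple arithmetic, sketched here in Python. The lane speed, port count, GPU count, and tray count come from this note. The cable breakdown (one cable per lane per direction) is our assumption: it happens to reproduce the 5,184 figure, but the note does not state how the cables are counted.

```python
# Figures quoted in the note for 5th-generation NVLink and the NVL72 rack.
LANE_GBPS = 200          # per-lane speed
LANES_PER_PORT = 2       # NVLink x2 port
PORTS_PER_GPU = 18
GPUS_PER_RACK = 72
SWITCH_TRAYS = 9
ASICS_PER_TRAY = 2


def nvlink_budget():
    """Derive per-port, per-GPU, and rack-level NVLink figures."""
    port_unidir_gbps = LANE_GBPS * LANES_PER_PORT             # one direction
    gpu_bidir_gbps = port_unidir_gbps * PORTS_PER_GPU * 2     # both directions
    return {
        "port_unidir_gbps": port_unidir_gbps,
        "gpu_bidir_tbps": gpu_bidir_gbps / 1000,              # terabits per second
        "gpu_bidir_tbytes": gpu_bidir_gbps / 8 / 1000,        # terabytes per second
        "switch_asics": SWITCH_TRAYS * ASICS_PER_TRAY,
        # Assumption (ours): one copper cable per lane per direction.
        "cables": GPUS_PER_RACK * PORTS_PER_GPU * LANES_PER_PORT * 2,
    }


if __name__ == "__main__":
    for name, value in nvlink_budget().items():
        print(f"{name}: {value}")
```

With these inputs, each port carries 400Gbps per direction, each GPU reaches 14.4Tbps (1.8TB/s) bidirectional, and the rack holds 18 switch ASICs. The cable count matches 5,184 only under the per-lane, per-direction assumption stated above.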

Perhaps the biggest GTC 2024 disappointment for the networking ecosystem was the revelation that 800G InfiniBand is delayed until 2025. Despite the delay, the company disclosed the Quantum-X800 switch system and ConnectX-8 adapter (NIC). When available, these 800G InfiniBand products will double per-GPU bandwidth, as ConnectX-8 NICs will replace ConnectX-7 (400G) NICs one-for-one. They should also be the first to handle optics with 200G lanes on the electrical (host) side, driving early demand for second-generation 200G/lambda DSPs.

Full text of the research note is available to LightCounting clients at: https://www.lightcounting.com/login

Price: $500

