LightTrends Newsletter


The Evolving Role of Optics in AI Clusters

January 2024

LightCounting releases new report titled 'Optics for AI'

AI has come to the fore in the equivalent of a blink of an eye. Forecasting AI is for the brave. LightCounting's first Optics for AI report highlights how AI is changing computer architectures and networking, with optics playing a pivotal role. LightCounting’s AI forecasting focuses on optics. But if we add one prediction, it is this: not only will optics play a vital role in the evolution of AI systems, but AI will increasingly contribute to the design of these systems at the transistor, chip and system level.

The rate of innovation varies across the industry. New applications can be developed quickly. Most of them will fail, but some will succeed and change the world seemingly overnight. Innovation in software and AI algorithms is happening faster than we can keep track of it. At least this is how it seems to outside observers, but experts may argue otherwise.

Innovation in hardware is a much more gradual yet relentless process. Optical connectivity is no exception, and we have data to prove it. The adoption of silicon photonics took a decade, and we are still waiting for this technology to deliver truly disruptive solutions such as reliable co-packaged optics. There is little doubt that this will happen by the end of this decade, but the forecast presented in this report focuses on pluggable optical transceivers deployed in AI clusters, the primary solution for optical connectivity today and for the next 5 years.

More than 90% of optical transceivers deployed in AI clusters today are used for InfiniBand and Ethernet connectivity. Google is the only company using optical transceivers for inter-chip interconnects (ICI) between TPUs in its production AI clusters. Nvidia is testing optical NVLink connectivity to GPUs in its research cluster. As illustrated in the figure below, NVLink connectivity to GPUs requires 4x higher bandwidth than Ethernet and InfiniBand. Another bottleneck in AI cluster design is the limited High Bandwidth Memory (HBM) available to GPUs: HBM connectivity requires a further 3x higher bandwidth than NVLink, as also shown in the figure below.

Google is also the only company using optical switches to scale up and reconfigure its AI clusters. This approach has been shown to improve cluster performance while minimizing cost and power consumption. We expect more companies to adopt this technology in the next 3-5 years.

The scale of demand for optics in AI clusters was a pleasant surprise of 2023. The timing of ChatGPT making headlines at the end of 2022 could not have been better. Fears of an upcoming economic recession and the first signs of slower revenue growth forced all the leading Cloud companies to cut spending, including investments in datacenters and purchases of optical transceivers. We do not have the final sales data for 2023, but there is a good chance that AI saved the market from a decline last year. There is little doubt that growth will be very strong in 2024-2025.

Growth of Nvidia’s business is the main factor affecting optical transceiver sales in 2023-2025. New designs of Nvidia’s AI clusters require many more transceivers. All previous systems used only InfiniBand networks for optical connectivity, and these were mostly AOCs. The latest systems based on NDR (400G) InfiniBand use pluggable 400/800G SR4/SR8 and DR4/DR8 transceivers instead of AOCs. In March 2022, the company also announced NVLink chassis switches designed for 800G optical connectivity. Nvidia is currently testing NVLink over fiber internally, and these solutions should be available to end users by the end of 2024. If this takes longer, we will have to reduce our forecast for 2025-2029.

The report presents our first forecast for optical transceivers provided by Nvidia and compares it with the rest of the transceivers used in AI clusters. Nvidia designed optical transceivers with a more stringent BER spec to minimize transmission errors. Nvidia does not prevent customers from using third-party optics, but it does not guarantee system performance with them. This motivates many customers, including Microsoft, to use optics provided by Nvidia. We expect that end users will eventually transition to third-party optics to save on cost, but this will be a gradual transition.

The report is available to subscribers at: https://www.lightcounting.com/login.

Ready to connect with LightCounting?

Enabling effective decision-making based on a unique combination of quantitative and qualitative analysis.
Reach us at info@lightcounting.com
