LightTrends

 

Nvidia Acquires Mellanox to Accelerate Innovation

LightCounting Comments on Implications of the Deal for Optical Interconnect Technologies

 

The bottom line is that the deal will accelerate innovation, and optical interconnects will be one of the areas affected.

The deal emphasizes that the combination of GPUs (Graphics Processing Units, made by Nvidia) with low-latency switching and broadband interconnects (made by Mellanox) is the key to high performance computers (HPCs) and datacenter clusters running AI and machine learning applications. It is entirely possible that the inspiration for the Mellanox acquisition came as Nvidia’s best engineers worked alongside Mellanox’s on both the reigning number 1 supercomputer, Summit, and the current number 2 supercomputer, Sierra. Each uses a proprietary interface called NVLink between the IBM Power9 CPUs and the Nvidia Volta GPUs at each node, and high-speed dual-rail Mellanox 100G EDR InfiniBand connections between the individual CPU/GPU compute nodes. There are 4,608 CPU/GPU compute nodes in Summit and 4,474 such nodes in Sierra. Summit “only” fills up the space of two tennis courts but contains 185 miles of fiber optic connections.

In 2018, 88% of the Top 500 supercomputers were clusters: loosely coupled collections of servers, similar to a data center (www.top500.org). The alternative is Massively Parallel Processing (MPP) systems with a ‘tightly coupled’ processor architecture. InfiniBand connectivity (developed by Mellanox) and Omni-Path connections (developed by Intel) are used in 35% of the Top 500 systems built as clusters. InfiniBand connects 55% of the Top 500 machines that are real supercomputers rather than cloud clusters. Many of these connections use active optical cables, discussed in the AOC/EOM report published by LightCounting in December 2018.

The high-speed HPC cluster segment uses the InfiniBand and Omni-Path Architecture protocols because of their very low latency and low overhead: the protocol stack is much simpler and smaller than Ethernet’s. HPC machines are traditionally thought of as being used for scientific applications such as weather simulation or global warming modeling, but a large percentage of shipments are moving into mainstream engineering applications within corporations, Web 2.0 applications such as Hadoop storage acceleration, and cloud computing applications where HPC services are rented. More recently, artificial intelligence (AI) and machine learning have become key applications within HPC and within data centers. Most of these applications involve huge amounts of compute and massive data movement. In some cases, such as financial markets where “flash” trading systems execute billions of stock trades in milliseconds, low latency is the prime requirement.

Machine learning is the ‘training’ side of artificial intelligence. It can operate on massive data sets, and the use of GPUs (Graphics Processing Units) such as NVIDIA Volta accelerators [called tensor processors, shown in Figure 1] is now typical. Google has built its own ASIC in this category. Facebook built a machine learning system that ranks at #50 on the Top500 list; it uses NVIDIA DGX-1 machines populated with NVIDIA GPUs and connected with InfiniBand.

Figure 1: NVIDIA Tensor Processors

Source: Nvidia, Photo by LightCounting

Future generations of machine learning clusters are likely to take advantage of a disaggregated design, illustrated in Figure 2. Instead of combining a limited number of CPUs, GPUs and memory in standard servers comprising a cluster, all the cluster resources can be accessed and interconnected to create an optimal combination of CPUs, GPUs and memory for a specific task.
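The idea of composing a task-specific mix of resources can be illustrated with a minimal sketch. It is not taken from Nvidia, Mellanox or Prof. Bergman's work: the ResourcePool class, the compose_node function and all pool sizes are hypothetical names and numbers chosen for illustration only, and the sketch ignores the interconnect that would actually link the resources.

```python
from dataclasses import dataclass


@dataclass
class ResourcePool:
    """A cluster-wide pool of one resource type (hypothetical model)."""
    name: str
    free: int  # units not yet assigned to any task


def compose_node(pools: dict[str, ResourcePool], request: dict[str, int]) -> dict[str, int]:
    """Carve a task-specific mix of resources out of the shared pools.

    The allocation is all-or-nothing: every demand is checked before any
    pool is modified, so a rejected request leaves the pools untouched.
    """
    for kind, count in request.items():
        if kind not in pools:
            raise KeyError(f"unknown resource type: {kind}")
        if count > pools[kind].free:
            raise ValueError(
                f"not enough {kind}: requested {count}, {pools[kind].free} free"
            )
    for kind, count in request.items():
        pools[kind].free -= count
    return dict(request)


# Illustrative pool sizes only; they do not describe any real system.
pools = {
    "cpu": ResourcePool("cpu", free=64),
    "gpu": ResourcePool("gpu", free=256),
    "memory_gb": ResourcePool("memory_gb", free=8192),
}

# A GPU-heavy training task takes few CPUs and many GPUs, a ratio that a
# fixed server configuration could not offer.
training_node = compose_node(pools, {"cpu": 2, "gpu": 32, "memory_gb": 1024})
print(training_node, pools["gpu"].free)
```

In a conventional cluster the CPU-to-GPU ratio is fixed by the server design; in this sketch it is set per request, which is the property the disaggregated design aims for.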

The emergence of disaggregated systems is very important for optical connectivity because they require 10x-100x more bandwidth. These systems may also benefit from optical switching or optical bandwidth steering technologies, according to Prof. Keren Bergman at Columbia University. This topic was addressed in a special session at OFC last week.

Figure 2: Disaggregated design of a datacenter cluster

Source: Keren Bergman, Columbia University

With the acquisition of Mellanox, Nvidia has access to all the key technologies for implementing disaggregated clusters. It is a pity that Mellanox was forced by its activist investors to shut down its development of silicon photonics technologies last year. Nvidia may restart these activities or acquire another company.

Cisco’s acquisition of Luxtera, a silicon photonics pioneer, at the end of 2018 may spark a few more deals this year. Commenting on the Luxtera deal at the OIDA Executive Forum at OFC, Bill Garner of Cisco acknowledged that the primary reason for having silicon photonics technology in house is the development of next-generation ASICs with co-packaged optics. Intel and Xilinx plan to have optics co-packaged with FPGAs as early as next year. Optical connectivity co-packaged with GPUs must be on Nvidia’s agenda, and having this technology in house is certainly a plus. Adding Mellanox’s low-latency switching technology to the mix makes it much more likely that optical switching will find an application in datacenter clusters running AI applications. Nvidia may be the first company to find the right way to do this and accelerate innovation along the way.

About LightCounting Market Research 
LightCounting -- The name alone is what sets us apart and defines us as a company. We are a leading optical communications market research company, offering semi-annual market updates, forecasts, and state-of-the-industry reports based on analysis of primary research with dozens of leading optics component, module, and system vendors, as well as service providers and cloud companies. LightCounting is the optical communications market's source for accurate, detailed and relevant information necessary for doing business in today's highly competitive environment. Register to receive our monthly newsletter: LightCounting.com or connect with us on LinkedIn and Twitter.
