The Edge of Exascale: NVIDIA BlueField at Durham University

The cost of data movement — in both runtime and energy — can be a showstopper on the road to exascale. As supercomputers and machine learning farms grow, one way to improve performance and efficiency is to teach the network how to route data flows, meet security constraints, and even perform specific tasks. Smart network devices can take ownership of the data movement, bringing data into the right format before it is delivered, while contributing to security and resiliency.

The Durham Intelligent NIC Environment (DINE), part of the DiRAC Memory Intensive service at Durham University, is a 16-node cluster of Dell EMC PowerEdge C6525 servers fitted with NVIDIA® BlueField® DPUs. These smart network interface cards (smartNICs) enable the intelligent processing and routing of messages to improve the performance of massively parallel codes in preparation for future exascale systems. They also provide researchers with a test-bed to develop new computing and network paradigms.

The DINE cluster is hosted alongside the COSMA supercomputer and is used by computer science researchers, DiRAC researchers and international collaborators. The research computing team deployed the BlueField technology in half-height, half-width smartNIC cards. Each card is configured to operate in a host-separated mode, providing direct access to the Arm cores. Researchers can then launch HPC message passing interface (MPI) codes across the cluster, making use of both the AMD® EPYC™ server processors and the Arm processors. This, in turn, frees the compute nodes from data transfer tasks and communication duties.

“The DINE supercomputer will allow researchers to probe novel technologies in preparation for running advanced codes on exascale machines,” Dr. Alastair Basden says. “It will enable a step change in model resolution in fields such as weather forecasting, climate change, and cosmology, with a huge scientific benefit.”

To test the BlueField technology, the Durham team had to compile two versions of their code: one that executes on the server processors and one that executes on the Arm cores. The team reported that recompiling the code for the Arm cores took seconds. When they run a job across the DINE cluster, they direct MPI jobs to run on the smartNICs instead of the CPUs, which allows the CPUs to carry on with their tasks without MPI interruptions. The smartNICs can also handle unexpected messages (buffering), take over load balancing, or manage message replication to facilitate resilient algorithms.
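The article does not describe how the Durham team launches its jobs, so the following is only a minimal sketch of one plausible approach: a multiple-program (MPMD) launch in which one binary runs on the host CPUs and the other on the smartNIC Arm cores. It assumes an Open MPI-style `mpirun` that accepts `-np`, `-host`, and the colon separator between application contexts. The function name, hostnames, and binary names are all hypothetical, and the script only builds the command line; it does not run it.

```python
import shlex


def build_mpmd_command(host_nodes, dpu_nodes, host_binary, dpu_binary,
                       launcher="mpirun"):
    """Build an MPMD launch command as an argument list.

    One MPI rank is requested per listed hostname. The x86 build is
    assigned to the server hostnames and the Arm build to the smartNIC
    hostnames. All names are illustrative assumptions, not DINE settings.
    """
    if not host_nodes or not dpu_nodes:
        raise ValueError("need at least one host node and one DPU node")

    # Application context 1: ranks on the server (x86) processors.
    host_part = ["-np", str(len(host_nodes)),
                 "-host", ",".join(host_nodes), host_binary]
    # Application context 2: ranks on the smartNIC (Arm) cores.
    dpu_part = ["-np", str(len(dpu_nodes)),
                "-host", ",".join(dpu_nodes), dpu_binary]

    # A lone ":" separates the two application contexts in MPMD syntax.
    return [launcher] + host_part + [":"] + dpu_part


# Hypothetical two-node example: each server has one smartNIC that is
# reachable under its own hostname (as in a host-separated configuration).
cmd = build_mpmd_command(
    host_nodes=["node01", "node02"],
    dpu_nodes=["node01-bf", "node02-bf"],
    host_binary="./solver.x86_64",
    dpu_binary="./solver.aarch64",
)
print(shlex.join(cmd))
```

Keeping the two builds as separate application contexts in a single launch means all ranks share one MPI communicator, so the ranks on the Arm cores can take on communication work for the host ranks. Whether a given MPI installation supports mixing x86 and Arm ranks in one job depends on how it was built.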

Along the way, faculty, staff, students, collaborators and others will benefit from working with cutting-edge technologies. According to Tobias Weinzierl, Project Principal Investigator for DINE, these technologies allow them to design algorithms and investigate ideas that will help redefine the future of HPC for facilities around the world.

“We have been suffering from a lack of MPI progress and, hence, algorithmic latency for quite a while, and we have invested significant compute effort to decide how to place our tasks on the system,” Weinzierl says. “We hope that BlueField will help us to realize these two things way more efficiently. Actually, we started to write software that does this for us on BlueField.”

The results at Durham University suggest that NVIDIA BlueField DPUs are a step toward the infrastructure-as-code data center, where users can submit a job and have it run wherever performance and efficiency are best. The DINE system is also used by the ExCALIBUR program, which aims to redesign high-priority simulation codes and algorithms to fully harness the power of future supercomputers, keeping UK research and development at the forefront of high-performance simulation science. Get more detail in Dr. Basden’s NVIDIA GTC session, available on demand.