IBM Research's Vela is an AI supercomputer in the cloud

AI models are increasingly permeating every aspect of how we live and work. With each passing year, more complex models, new techniques, and new use cases demand more computing power to meet AI's growing needs.

One of the most relevant recent examples is the advent of foundation models: AI models trained on a broad set of unlabeled data that can be used for many different tasks with minimal fine-tuning. But these models are massive, in some cases exceeding billions of parameters. To train models at this scale, you need supercomputers: systems composed of many powerful compute elements working together to solve big problems at high performance.
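
To make "minimal fine-tuning" concrete, here is a rough sketch; the model and task below are generic placeholders, not anything Vela-specific. A pretrained network keeps its general-purpose weights and only gains a small task-specific head, which can then be tuned on a modest labeled set:

```python
# A minimal sketch of adapting a pretrained model to a downstream task
# instead of training from scratch. Model name and task are illustrative.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # new task head; the body stays pretrained
)

# Only a small labeled set is needed for tuning; the pretrained weights
# already carry the general knowledge learned from unlabeled data.
inputs = tokenizer("Vela trains foundation models.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```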

Traditionally, building a supercomputer has meant bare-metal nodes, high-performance networking hardware (such as InfiniBand, Omni-Path, or Slingshot), parallel file systems, and other components typically associated with high-performance computing (HPC). But traditional supercomputers weren't built for AI; they were designed to perform well on modeling or simulation tasks, such as those specified by the US National Laboratories or other customers looking to serve a particular need.

While these systems perform well for AI, and many "AI supercomputers" (such as the one built for OpenAI) continue to follow this model, the traditional design point has long led to technology choices that drive up costs and limit deployment flexibility. We recently asked ourselves: what system would we design if we focused exclusively on large-scale AI?

This led us to create Vela, IBM's first AI-optimized, cloud-native supercomputer. It has been online since May 2022, hosted in IBM Cloud, and currently serves the IBM Research community. The choices we made in this design give us the flexibility to scale at will and to deploy similar infrastructure readily in any IBM Cloud data center around the world. Vela is now the go-to environment for IBM researchers building our most advanced AI capabilities, including our foundation model work, and it is where we collaborate with partners to train models of many kinds.

Why build an AI supercomputer in the cloud?

IBM has deep roots in supercomputing, having designed generations of the most capable systems on the Top500 list, including Summit and Sierra, two of the most powerful supercomputers in the world today. With every system we design, we discover new ways to improve performance, resiliency, and cost for the workloads of interest, increase researcher productivity, and better align with the needs of our customers and partners.

Last year, we set ourselves the goal of minimizing the time it takes to build and deploy world-class AI models. This seemingly simple goal kicked off a healthy internal debate: do we build our system on-premises, following the traditional supercomputing model, or do we build it in the cloud, essentially building a supercomputer that is also a cloud? With the latter, we might give up a little performance, but we would gain considerably in productivity. In the cloud, we can configure all the resources we need through software, use robust, well-established APIs, and tap into a broader ecosystem of services to integrate with. We can take advantage of datasets residing in IBM Cloud Object Storage instead of building our own back-end storage. We can leverage IBM Cloud's VPC capability to collaborate with partners under advanced security practices. The list of potential productivity benefits goes on. As the debate unfolded, it became clear that we needed to build a cloud-native AI supercomputer. Here's how we did it.
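
To make "configure all the resources we need through software" concrete, here is a minimal sketch of creating a GPU virtual server through IBM Cloud's VPC API, assuming the ibm-vpc Python SDK. Every ID, the region URL, and the GPU profile name are placeholders rather than Vela's actual configuration, and the exact prototype fields should be checked against the current API reference:

```python
# A minimal sketch of provisioning a GPU node purely through software,
# assuming the ibm-vpc Python SDK (pip install ibm-vpc). All IDs and the
# GPU profile name below are placeholders, not Vela's actual values.
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_vpc import VpcV1

service = VpcV1(authenticator=IAMAuthenticator("<api-key>"))
service.set_service_url("https://us-south.iaas.cloud.ibm.com/v1")

instance_prototype = {
    "name": "gpu-node-0",
    "zone": {"name": "us-south-1"},
    "profile": {"name": "gx2-80x1280x8a100"},  # placeholder GPU VM profile
    "image": {"id": "<image-id>"},
    "vpc": {"id": "<vpc-id>"},
    "primary_network_interface": {"subnet": {"id": "<subnet-id>"}},
    "keys": [{"id": "<ssh-key-id>"}],
}

# One API call creates the node; scaling out is a loop, and tearing down
# and re-provisioning is the same call with different values.
instance = service.create_instance(instance_prototype).get_result()
print(instance["id"], instance["status"])
```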

Main design choices and opportunities for innovation

When it comes to AI-centric infrastructure, one non-negotiable requirement is nodes with many GPUs or AI accelerators. To configure these nodes, we had two choices: make each node provisionable as a bare-metal system, or enable node configuration as a virtual machine (VM). It is generally accepted that bare metal is the way to maximize AI performance, but VMs offer more flexibility. The VM path would let our service teams provision and re-provision infrastructure with the different software stacks required by different AI users, as sketched below. We knew, for example, that when this system arrived...
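
As one illustration of that flexibility (a hedged sketch, not IBM's actual tooling), the VM path lets the software stack ride along with the provisioning request, for example as cloud-init user data, so re-provisioning a node for a different AI user means swapping a payload rather than reimaging hardware:

```python
# Illustrative only: selecting a per-user software stack at provision
# time via cloud-init user data. The stack definitions are hypothetical.
STACKS = {
    "pytorch": "#cloud-config\nruncmd:\n  - pip install torch\n",
    "tensorflow": "#cloud-config\nruncmd:\n  - pip install tensorflow\n",
}

def prototype_for(stack: str) -> dict:
    """Build an instance prototype whose user_data installs the chosen stack."""
    return {
        "name": f"gpu-node-{stack}",
        "user_data": STACKS[stack],  # executed by cloud-init on first boot
        # ...plus the same zone/profile/image/vpc fields as the sketch above...
    }

# Re-provisioning for a different user is a delete plus a create with a
# different user_data payload; no hands-on reconfiguration is needed.
print(prototype_for("pytorch")["user_data"])
```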
