
The infrastructure powering IBM's Gen AI model development

Authors :
Gershon, Talia
Seelam, Seetharami
Belgodere, Brian
Bonilla, Milton
Hoang, Lan
Barnett, Danny
Chung, I-Hsin
Mohan, Apoorve
Chen, Ming-Hung
Luo, Lixiang
Walkup, Robert
Evangelinos, Constantinos
Salaria, Shweta
Dombrowa, Marc
Park, Yoonho
Kayi, Apo
Schour, Liran
Alim, Alim
Sydney, Ali
Maniotis, Pavlos
Schares, Laurent
Metzler, Bernard
Karacali-Akyamac, Bengi
Wen, Sophia
Chiba, Tatsuhiro
Choochotkaew, Sunyanan
Yoshimura, Takeshi
Misale, Claudia
Elengikal, Tonia
Connor, Kevin O
Liu, Zhuoran
Molina, Richard
Schneidenbach, Lars
Caden, James
Laibinis, Christopher
Fonseca, Carlos
Tarasov, Vasily
Sundararaman, Swaminathan
Schmuck, Frank
Guthridge, Scott
Cohn, Jeremy
Eshel, Marc
Muench, Paul
Liu, Runyu
Pointer, William
Wyskida, Drew
Krull, Bob
Rose, Ray
Wolfe, Brent
Cornejo, William
Walter, John
Malone, Colm
Perucci, Clifford
Franco, Frank
Hinds, Nigel
Calio, Bob
Druyan, Pavel
Kilduff, Robert
Kienle, John
McStay, Connor
Figueroa, Andrew
Connolly, Matthew
Fost, Edie
Roma, Gina
Fonseca, Jake
Levy, Ido
Payne, Michele
Schenkel, Ryan
Malki, Amir
Schneider, Lion
Narkhede, Aniruddha
Moshref, Shekeba
Kisin, Alexandra
Dodin, Olga
Rippon, Bill
Wrieth, Henry
Ganci, John
Colino, Johnny
Habeger-Rose, Donna
Pandey, Rakesh
Gidh, Aditya
Gaur, Aditya
Patterson, Dennis
Salmani, Samsuddin
Varma, Rambilas
Rumana, Rumana
Sharma, Shubham
Mishra, Mayank
Panda, Rameswar
Prasad, Aditya
Stallone, Matt
Zhang, Gaoyuan
Shen, Yikang
Cox, David
Puri, Ruchir
Agrawal, Dakshi
Thorstensen, Drew
Belog, Joel
Tang, Brent
Gupta, Saurabh Kumar
Biswas, Amitabha
Maheshwari, Anup
Gampel, Eran
Van Patten, Jason
Runion, Matthew
Kaki, Sai
Bogin, Yigal
Reitz, Brian
Pritko, Steve
Najam, Shahan
Nambala, Surya
Chirra, Radhika
Welp, Rick
DiMitri, Frank
Telles, Felipe
Arvelo, Amilcar
Chu, King
Seminaro, Ed
Schram, Andrew
Eickhoff, Felix
Hanson, William
Mckeever, Eric
Joseph, Dinakaran
Chaudhary, Piyush
Shivam, Piyush
Chaudhary, Puneet
Jones, Wesley
Guthrie, Robert
Bostic, Chris
Islam, Rezaul
Duersch, Steve
Sawdon, Wayne
Lewars, John
Klos, Matthew
Spriggs, Michael
McMillan, Bill
Gao, George
Kamra, Ashish
Singh, Gaurav
Curry, Marc
Katarki, Tushar
Talerico, Joe
Shi, Zenghui
Malleni, Sai Sindhur
Gallen, Erwan
Publication Year :
2024

Abstract

AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use along with the flexibility to adapt to an evolving commercial landscape. Blue Vela provides us with the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they provide IBM with the ability to rapidly innovate in the development of both AI models and commercial offerings.

Comment: Corresponding Authors: Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2407.05467
Document Type :
Working Paper