Search

Your search for "Rajbhandari, Samyam" returned 41 results.

Search Results

1. SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

2. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

3. DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

4. DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales

5. ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

6. A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

7. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

8. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

9. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

10. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

11. Scalable and Efficient MoE Training for Multitask Multilingual Models

12. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

13. 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed

14. 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

15. ZeRO-Offload: Democratizing Billion-Scale Model Training

16. APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

17. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

18. AntMan: Sparse Low-Rank Compression to Accelerate RNN Inference

19. Learning Intrinsic Sparse Structures within Long Short-Term Memory

20. DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

27. ZeRO-infinity

29. DeepSpeed

40. International Conference on Computational Science, ICCS 2012.

41. Locality Optimizations for Regular and Irregular Applications
