Author: "Bhatele, Abhinav" - Searchworks@Jio Institute Digital Library Search Results

331 results on '"Bhatele, Abhinav"'

1. From Pixels to Prose: A Large Dataset of Dense Image Captions

Author: Singla, Vasu, Yue, Kaiyu, Paul, Sukriti, Shirkavand, Reza, Jayawardhana, Mayuka, Ganjdanesh, Alireza, Huang, Heng, Bhatele, Abhinav, Somepalli, Gowthami, and Goldstein, Tom
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose
Comment: pixelprose 16M dataset
Published: 2024

2. Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

Author: Hans, Abhimanyu, Wen, Yuxin, Jain, Neel, Kirchenbauer, John, Kazemi, Hamid, Singhania, Prajwal, Singh, Siddharth, Somepalli, Gowthami, Geiping, Jonas, Bhatele, Abhinav, and Goldstein, Tom
Subjects: Computer Science - Computation and Language
Abstract: Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, randomly sampled subsets of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
Comment: 10 pages, 8 figures, and 1 table in the main body. Code available at https://github.com/ahans30/goldfish-loss and checkpoints at https://huggingface.co/collections/tomg-group-umd/goldfish-loss-mitigating-memorization-in-llms-66c175becb6aab07744f7272
Published: 2024

3. Loki: Low-rank Keys for Efficient Sparse Attention

Author: Singhania, Prajwal, Singh, Siddharth, He, Shwai, Feizi, Soheil, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning
Abstract: Inference on large language models (LLMs) can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in LLM inference contributes significantly to these costs, which has sparked an interest in approximating the self-attention computation to reduce such costs. In this work, we propose to approximate self-attention by focusing on the dimensionality of key vectors computed in the attention block. Our analysis reveals that key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in low-dimensional space. Our evaluations show that Loki is able to speed up the attention computation due to reduced data movement (load/store) and compute costs while maintaining the efficacy of the models better than other popular approximation methods.
Comment: Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems (Main Conference Track)
Published: 2024

4. Transformers Can Do Arithmetic with the Right Embeddings

Author: McLeish, Sean, Bansal, Arpit, Stein, Alex, Jain, Neel, Kirchenbauer, John, Bartoldson, Brian R., Kailkhura, Bhavya, Bhatele, Abhinav, Geiping, Jonas, Schwarzschild, Avi, and Goldstein, Tom
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside of a large span of digits. We mend this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that this fix enables architectural modifications such as input injection and recurrent layers to improve performance even further. With positions resolved, we can study the logical extrapolation ability of transformers. Can they solve arithmetic problems that are larger and more complex than those in their training data? We find that training on only 20 digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100 digit addition problems. Finally, we show that these gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.
Published: 2024

5. Performance-Aligned LLMs for Generating Fast Code

Author: Nichols, Daniel, Polasam, Pranav, Menon, Harshitha, Marathe, Aniruddha, Gamblin, Todd, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Software Engineering
Abstract: Optimizing scientific software is a difficult task because codebases are often large and complex, and performance can depend upon several factors including the algorithm, its implementation, and hardware among others. Causes of poor performance can originate from disparate sources and be difficult to diagnose. Recent years have seen a multitude of works that use large language models (LLMs) to assist in software development tasks. However, these tools are trained to model the distribution of code as text, and are not specifically designed to understand performance aspects of code. In this work, we introduce a reinforcement learning based methodology to align the outputs of code LLMs with performance. This allows us to build upon the current code modeling capabilities of LLMs and extend them to generate better performing code. We demonstrate that our fine-tuned model improves the expected speedup of generated code over base models for a set of benchmark tasks from 0.9 to 1.6 for serial code and 1.9 to 4.5 for OpenMP code.
Published: 2024

6. Taking GPU Programming Models to Task for Performance Portability

Author: Davis, Joshua H., Sivaraman, Pranav, Kitson, Joy, Parasyris, Konstantinos, Menon, Harshitha, Minn, Isaac, Georgakoudis, Giorgis, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Portability is critical to ensuring high productivity in developing and maintaining scientific software as the diversity in on-node hardware architectures increases. While several programming models provide portability for diverse GPU platforms, they don't make any guarantees about performance portability. In this work, we explore several programming models -- CUDA, HIP, Kokkos, RAJA, OpenMP, OpenACC, and SYCL, to study if the performance of these models is consistently good across NVIDIA and AMD GPUs. We use five proxy applications from different scientific domains, create implementations where missing, and use them to present a comprehensive comparative evaluation of the programming models. We provide a Spack scripting-based methodology to ensure reproducibility of experiments conducted in this work. Finally, we attempt to answer the question -- to what extent does each programming model provide performance portability for heterogeneous systems in real-world usage?
Comment: 12 pages, 4 figures
Published: 2024

7. Automated Programmatic Performance Analysis of Parallel Programs

Author: Cankur, Onur, Tomar, Aditya, Nichols, Daniel, Scully-Allison, Connor, Isaacs, Katherine E., and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Developing efficient parallel applications is critical to advancing scientific development but requires significant performance analysis and optimization. Performance analysis tools help developers manage the increasing complexity and scale of performance data, but often rely on the user to manually explore low-level data and are rigid in how the data can be manipulated. We propose a Python-based API, Chopper, which provides high-level and flexible performance analysis for both single and multiple executions of parallel applications. Chopper facilitates performance analysis and reduces developer effort by providing configurable high-level methods for common performance analysis tasks such as calculating load imbalance, hot paths, scalability bottlenecks, correlation between metrics and CCT nodes, and causes of performance variability within a robust and mature Python environment that provides fluid access to lower-level data manipulations. We demonstrate how Chopper allows developers to quickly and succinctly explore performance and identify issues across applications such as AMG, Laghos, LULESH, Quicksilver and Tortuga.
Published: 2024

8. Can Large Language Models Write Parallel Code?

Author: Nichols, Daniel, Davis, Joshua H., Xie, Zhaojun, Rajaram, Arjun, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence
Abstract: Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. In order to evaluate language models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific and parallel computing. We use ParEval to evaluate the effectiveness of several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for evaluating the performance of generated code, and use them to explore how well each large language model performs for 12 different computational problem types and six different parallel programming models.
Published: 2024

9. ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems

Author: Xu, Yiheng, Sivaraman, Pranav, Devarajan, Hariharan, Mohror, Kathryn, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Parallel applications can spend a significant amount of time performing I/O on large-scale supercomputers. Fast near-compute storage accelerators called burst buffers can reduce the time a processor spends performing I/O and mitigate I/O bottlenecks. However, determining if a given application could be accelerated using burst buffers is not straightforward even for storage experts. The relationship between an application's I/O characteristics (such as I/O volume, processes involved, etc.) and the best storage sub-system for it can be complicated. As a result, adapting parallel applications to use burst buffers efficiently is a trial-and-error process. In this work, we present a Python-based tool called PrismIO that enables programmatic analysis of I/O traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel file systems and explain why certain I/O patterns perform poorly. Further, we use machine learning to model the relationship between I/O characteristics and burst buffer selections. We run IOR (an I/O benchmark) with various I/O characteristics on different storage systems and collect performance data. We use the data as the input for training the model. Our model can predict if a file of an application should be placed on BBs for unseen IOR scenarios with an accuracy of 94.47% and for four real applications with an accuracy of 95.86%.
Published: 2023

10. Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization

Author: Singh, Siddharth, Sating, Zachary, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Despite their better convergence properties compared to first-order optimizers, second-order optimizers for deep learning have been less popular due to their significant computational costs. The primary efficiency bottleneck in such optimizers is matrix inverse calculations in the preconditioning step, which are expensive to compute on GPUs. In this paper, we introduce Jorge, a second-order optimizer that promises the best of both worlds -- rapid convergence benefits of second-order methods, and high computational efficiency typical of first-order methods. We address the primary computational bottleneck of computing matrix inverses by completely eliminating them using an approximation of the preconditioner computation. This makes Jorge extremely efficient on GPUs in terms of wall-clock time. Further, we describe an approach to determine Jorge's hyperparameters directly from a well-tuned SGD baseline, thereby significantly minimizing tuning efforts. Our empirical evaluations demonstrate the distinct advantages of using Jorge, outperforming state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple deep learning models, both in terms of sample efficiency and wall-clock time.
Published: 2023

11. HPC-Coder: Modeling Parallel Programs using Large Language Models

Author: Nichols, Daniel, Marathe, Aniruddha, Menon, Harshitha, Gamblin, Todd, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence
Abstract: Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models make developing, optimizing, and maintaining parallel software even more burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers that increase their productivity and decrease the chance for error. Until recently, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform, especially for parallel programs. However, with recent advancements in language modeling, and the availability of large amounts of open-source code related data, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models. We compare several pre-trained LLMs on HPC-related tasks and introduce a new model, HPC-Coder, fine-tuned on parallel codes. In our experiments, we show that this model can auto-complete HPC functions where generic models cannot, decorate for loops with OpenMP pragmas, and model performance changes in scientific application repositories as well as programming competition solutions.
Published: 2023

12. Pipit: Scripting the analysis of parallel execution traces

Author: Bhatele, Abhinav, Dhakal, Rakrish, Movsesyan, Alexander, Ranjan, Aditya K., and Cankur, Onur
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Performance analysis is a critical step in the oft-repeated, iterative process of performance tuning of parallel programs. Per-process, per-thread traces (detailed logs of events with timestamps) enable in-depth analysis of parallel program execution to identify different kinds of performance issues. Oftentimes, trace collection tools provide a graphical tool to analyze the trace output. However, these GUI-based tools only support specific file formats, are challenging to scale to large trace sizes, limit data exploration to the implemented graphical views, and do not support automated comparisons of two or more datasets. In this paper, we present a programmatic approach to analyzing parallel execution traces by leveraging pandas, a powerful Python-based data analysis library. We have developed a Python library, Pipit, on top of pandas that can read traces in different file formats (OTF2, HPCToolkit, Projections, Nsight Systems, etc.) and provides a uniform data structure in the form of a pandas DataFrame. Pipit provides operations to aggregate, filter, and transform the events in a trace to present the data in different ways. We also provide several functions to quickly and easily identify performance issues in parallel executions. More importantly, the API is easily extensible to support custom analyses by different end users.
Published: 2023

13. A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Author: Singh, Siddharth, Singhania, Prajwal, Ranjan, Aditya K., Sating, Zack, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Heavy communication, in particular, collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. This 4D approach is a hybrid of 3D tensor and data parallelism, and is implemented in the AxoNN framework. In addition, we employ two key strategies to further minimize communication overheads. First, we aggressively overlap expensive collective operations (reduce-scatter, all-gather, and all-reduce) with computation. Second, we develop an analytical model to identify high-performing configurations within the large search space defined by our 4D algorithm. This model empowers practitioners by simplifying the tuning process for their specific training workloads. When training an 80-billion parameter GPT on 1024 GPUs of Perlmutter, AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%. Additionally, it achieves a significantly high 57% of the theoretical peak FLOP/s or 182 PFLOP/s in total.
Published: 2023

14. A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

Author: Singh, Siddharth, Ruwase, Olatunji, Awan, Ammar Ahmad, Rajbhandari, Samyam, He, Yuxiong, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4 to 8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
Published: 2023

15. Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training

Author: Singh, Siddharth and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e. setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning namely -- data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74%, and the total communication time by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D and 46% over Sputnik, a sparse matrix computation baseline.
Published: 2023

16. Design Concerns for Integrated Scripting and Interactive Visualization in Notebook Environments

Author: Scully-Allison, Connor, Lumsden, Ian, Williams, Katy, Bartels, Jesse, Taufer, Michela, Brink, Stephanie, Bhatele, Abhinav, Pearce, Olga, and Isaacs, Katherine E.
Subjects: Computer Science - Human-Computer Interaction
Abstract: Interactive visualization can support fluid exploration but is often limited to predetermined tasks. Scripting can support a vast range of queries but may be more cumbersome for free-form exploration. Embedding interactive visualization in scripting environments, such as computational notebooks, provides an opportunity to leverage the strengths of both direct manipulation and scripting. We investigate interactive visualization design methodology, choices, and strategies under this paradigm through a design study of calling context trees used in performance analysis, a field which exemplifies typical exploratory data analysis workflows with Big Data and hard to define problems. We first produce a formal task analysis assigning tasks to graphical or scripting contexts based on their specificity, frequency, and suitability. We then design a notebook-embedded interactive visualization and validate it with intended users. In a follow-up study, we present participants with multiple graphical and scripting interaction modes to elicit feedback about notebook-embedded visualization design, finding consensus in support of the interaction model. We report and reflect on observations regarding the process and design implications for combining visualization and scripting in notebooks.
Comment: Submitted to IEEE VIS 2022
Published: 2022

17. A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

Author: Nichols, Daniel, Singh, Siddharth, Lin, Shu-Huai, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields. This phenomenon has spurred the development of algorithms for distributed training of neural networks over a larger number of hardware accelerators. In this paper, we discuss and compare current state-of-the-art frameworks for large scale distributed deep learning. First, we survey current practices in distributed learning and identify the different types of parallelism used. Then, we present empirical results comparing their performance on large image and language training tasks. Additionally, we address their statistical efficiency and memory consumption behavior. Based on our results, we discuss algorithmic and implementation portions of each framework which hinder performance.
Published: 2021

18. AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning

Author: Singh, Siddharth and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: In the last few years, the memory requirements to train state-of-the-art neural networks have far exceeded the DRAM capacities of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these neural networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing extremely efficient communication in these parallel training algorithms is critical for extracting the maximum performance. This paper presents AxoNN, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by four times. This allows us to increase the number of parameters per GPU by four times, thus reducing the amount of communication and increasing performance by over 13%. When tested against large transformer models with 12-100 billion parameters on 48-384 NVIDIA Tesla V100 GPUs, AxoNN achieves a per-GPU throughput of 49.4-54.78% of theoretical peak and reduces the training time by 22-37 days (15-25% speedup) as compared to the state-of-the-art.
Comment: Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE Computer Society, May 2022
Published: 2021

19. Analytics of Longitudinal System Monitoring Data for Performance Prediction

Author: Costello, Ian J. and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Computer Science - Performance
Abstract: In recent years, several HPC facilities have started continuous monitoring of their systems and jobs to collect performance-related data for understanding performance and operational efficiency. Such data can be used to optimize the performance of individual jobs and the overall system by creating data-driven models that can predict the performance of jobs waiting in the scheduler queue. In this paper, we model the performance of representative control jobs using longitudinal system-wide monitoring data and machine learning to explore the causes of performance variability. We analyze these prediction models in great detail to identify the features that are dominant predictors of performance. We demonstrate that such models can be application-agnostic and can be used for predicting performance of applications that are not included in training.
Published: 2020

20. Scalable Comparative Visualization of Ensembles of Call Graphs

Author: Kesavan, Suraj P., Bhatia, Harsh, Bhatele, Abhinav, Gamblin, Todd, Bremer, Peer-Timo, and Ma, Kwan-Liu
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Optimizing the performance of large-scale parallel codes is critical for efficient utilization of computing resources. Code developers often explore various execution parameters, such as hardware configurations, system software choices, and application parameters, and are interested in detecting and understanding bottlenecks in different executions. They often collect hierarchical performance profiles represented as call graphs, which combine performance metrics with their execution contexts. The crucial task of exploring multiple call graphs together is tedious and challenging because of the many structural differences in the execution contexts and significant variability in the collected performance metrics (e.g., execution runtime). In this paper, we present an enhanced version of CallFlow to support the exploration of ensembles of call graphs using new types of visualizations, analysis, graph operations, and features. We introduce ensemble-Sankey, a new visual design that combines the strengths of resource-flow (Sankey) and box-plot visualization techniques. Whereas the resource-flow visualization can easily and intuitively describe the graphical nature of the call graph, the box plots overlaid on the nodes of Sankey convey the performance variability within the ensemble. Our interactive visual interface provides linked views to help explore ensembles of call graphs, e.g., by facilitating the analysis of structural differences, and identifying similar or distinct call graphs. We demonstrate the effectiveness and usefulness of our design through case studies on large-scale parallel codes.
Comment: 12 pages, 6 figures, Submitted to IEEE VIS 2020
Published: 2020

21. Comparative Evaluation of Call Graph Generation by Profiling Tools

Author: Cankur, Onur and Bhatele, Abhinav
Editors: Varbanescu, Ana-Lucia, Bhatele, Abhinav, Luszczek, Piotr, and Baboulin, Marc
Founding Editors: Goos, Gerhard and Hartmanis, Juris
Editorial Board: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti
Published: 2022

22. Visual Analytics Challenges in Analyzing Calling Context Trees

Author: Bergel, Alexandre, Bhatele, Abhinav, Boehme, David, Gralka, Patrick, Griffin, Kevin, Hermanns, Marc-André, Okanović, Dušan, Pearce, Olga, and Vierjahn, Tom
Editors: Bhatele, Abhinav, Boehme, David, Levine, Joshua A., Malony, Allen D., and Schulz, Martin
Founding Editors: Goos, Gerhard and Hartmanis, Juris
Editorial Board: Hutchison, David, Kanade, Takeo, Kittler, Josef, Kleinberg, Jon M., Mattern, Friedemann, Mitchell, John C., Naor, Moni, Pandu Rangan, C., Steffen, Bernhard, Terzopoulos, Demetri, and Tygar, Doug
Published: 2019

23. HPC-Coder: Modeling Parallel Programs using Large Language Models

Author: Nichols, Daniel, Marathe, Aniruddha, Menon, Harshitha, Gamblin, Todd, and Bhatele, Abhinav
Published: 2024

24. Predicting GPUDirect Benefits for HPC Workloads

Author: Khetawat, Harsh, primary, Jain, Nikhil, additional, Bhatele, Abhinav, additional, and Mueller, Frank, additional
Published: 2024
Full Text: View/download PDF

25. Design Concerns for Integrated Scripting and Interactive Visualization in Notebook Environments

Author: Scully-Allison, Connor, Lumsden, Ian, Williams, Katy, Bartels, Jesse, Taufer, Michela, Brink, Stephanie, Bhatele, Abhinav, Pearce, Olga, and Isaacs, Katherine E.
Abstract: Interactive visualization can support fluid exploration but is often limited to predetermined tasks. Scripting can support a vast range of queries but may be more cumbersome for free-form exploration. Embedding interactive visualization in scripting environments, such as computational notebooks, provides an opportunity to leverage the strengths of both direct manipulation and scripting. We investigate interactive visualization design methodology, choices, and strategies under this paradigm through a design study of calling context trees used in performance analysis, a field which exemplifies typical exploratory data analysis workflows with Big Data and hard to define problems. We first produce a formal task analysis assigning tasks to graphical or scripting contexts based on their specificity, frequency, and suitability. We then design a notebook-embedded interactive visualization and validate it with intended users. In a follow-up study, we present participants with multiple graphical and scripting interaction modes to elicit feedback about notebook-embedded visualization design, finding consensus in support of the interaction model. We report and reflect on observations regarding the process and design implications for combining visualization and scripting in notebooks.
Published: 2024
Full Text: View/download PDF

26. A Large-Scale Epidemic Simulation Framework for Realistic Social Contact Networks

Author: Kitson, Joy, Costello, Ian, Chen, Jiangzhuo, Jiménez, Diego, Hoops, Stefan, Mortveit, Henning, Meneses, Esteban, Yeom, Jae-Seung, Marathe, Madhav V., and Bhatele, Abhinav
Abstract: Global pandemics can wreak havoc and lead to significant social, economic, and personal losses. Preventing the spread of infectious diseases requires implementing interventions at different levels of government, and evaluating the potential impact and efficacy of those preemptive measures. Agent-based modeling can be used for detailed studies of epidemic diffusion and possible interventions. We present Loimos, a highly parallel simulation of epidemic diffusion written on top of Charm++, an asynchronous task-based parallel runtime. Loimos uses a hybrid of time-stepping and discrete-event simulation to model disease spread. We demonstrate that our implementation of Loimos is able to scale to large core counts on an HPC system. In particular, Loimos is able to simulate a US-scale synthetic interaction network in an average of 1.497 seconds per simulation day when executed on 16 nodes on Rivanna at the University of Virginia, processing around 428 billion interactions (person-person edges) in under five minutes for an average of 1.4 billion traversed edges per second (TEPS)., Comment: 13 pages (including references), 9 figures
Published: 2024

27. A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

Author: Singh, Siddharth, primary, Ruwase, Olatunji, additional, Awan, Ammar Ahmad, additional, Rajbhandari, Samyam, additional, He, Yuxiong, additional, and Bhatele, Abhinav, additional
Published: 2023
Full Text: View/download PDF

28. Massively parallel first-principles simulation of electron dynamics in materials

Author: Draeger, Erik W., Andrade, Xavier, Gunnels, John A., Bhatele, Abhinav, Schleife, André, and Correa, Alfredo A.
Published: 2017
Full Text: View/download PDF

29. Visual Analytics Challenges in Analyzing Calling Context Trees

Author: Bergel, Alexandre, primary, Bhatele, Abhinav, additional, Boehme, David, additional, Gralka, Patrick, additional, Griffin, Kevin, additional, Hermanns, Marc-André, additional, Okanović, Dušan, additional, Pearce, Olga, additional, and Vierjahn, Tom, additional
Published: 2019
Full Text: View/download PDF

30. Pipit: Enabling programmatic analysis of parallel execution traces

Author: Bhatele, Abhinav, Dhakal, Rakrish, Movsesyan, Alexander, Ranjan, Aditya, Marry, Jordan, and Cankur, Onur
Subjects: Performance (cs.PF), FOS: Computer and information sciences, Computer Science - Performance, Computer Science - Distributed, Parallel, and Cluster Computing, Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Performance analysis is an important part of the oft-repeated, iterative process of performance tuning during the development of parallel programs. Per-process per-thread traces (detailed logs of events with timestamps) enable in-depth analysis of parallel program execution to identify various kinds of performance issues. Oftentimes, trace collection tools provide a graphical tool to analyze the trace output. However, these GUI-based tools only support specific file formats, are difficult to scale when the data is large, limit data exploration to the implemented graphical views, and do not support automated comparisons of two or more datasets. In this paper, we present a programmatic approach to analyzing parallel execution traces by leveraging pandas, a powerful Python-based data analysis library. We have developed a Python library, Pipit, on top of pandas that can read traces in different file formats (OTF2, HPCToolkit, Projections, Nsight, etc.) and provide a uniform data structure in the form of a pandas DataFrame. Pipit provides operations to aggregate, filter, and transform the events in a trace to present the data in different ways. We also provide several functions to quickly identify performance issues in parallel executions.
Published: 2023

31. Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations

Author: Acun, Bilge, Jain, Nikhil, Bhatele, Abhinav, Mubarak, Misbah, Carothers, Christopher D., Kale, Laxmikant V., Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Hunold, Sascha, editor, Costan, Alexandru, editor, Giménez, Domingo, editor, Iosup, Alexandru, editor, Ricci, Laura, editor, Gómez Requena, María Engracia, editor, Scarano, Vittorio, editor, Varbanescu, Ana Lucia, editor, Scott, Stephen L., editor, Lankes, Stefan, editor, Weidendorfer, Josef, editor, and Alexander, Michael, editor
Published: 2015
Full Text: View/download PDF

32. A Flexible Data Model to Support Multi-domain Performance Analysis

Author: Schulz, Martin, Bhatele, Abhinav, Böhme, David, Bremer, Peer-Timo, Gamblin, Todd, Gimenez, Alfredo, Isaacs, Kate, Niethammer, Christoph, editor, Gracia, José, editor, Knüpfer, Andreas, editor, Resch, Michael M., editor, and Nagel, Wolfgang E., editor
Published: 2015
Full Text: View/download PDF

33. Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training

Author: Singh, Siddharth, primary and Bhatele, Abhinav, additional
Published: 2023
Full Text: View/download PDF

34. Porting a Computational Fluid Dynamics Code with AMR to Large-scale GPU Platforms

Author: Davis, Joshua H., primary, Shafner, Justin, additional, Nichols, Daniel, additional, Grube, Nathan, additional, Martin, Pino, additional, and Bhatele, Abhinav, additional
Published: 2023
Full Text: View/download PDF

35. Scalable Comparative Visualization of Ensembles of Call Graphs

Author: Kesavan, Suraj P., primary, Bhatia, Harsh, additional, Bhatele, Abhinav, additional, Brink, Stephanie, additional, Pearce, Olga, additional, Gamblin, Todd, additional, Bremer, Peer-Timo, additional, and Ma, Kwan-Liu, additional
Published: 2023
Full Text: View/download PDF

36. A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

Author: Singh, Siddharth, Ruwase, Olatunji, Awan, Ammar Ahmad, Rajbhandari, Samyam, He, Yuxiong, and Bhatele, Abhinav
Abstract: Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4–8× larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e., without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
Published: 2023

37. Modeling Parallel Programs using Large Language Models

Author: Nichols, Daniel, Marathe, Aniruddha, Menon, Harshitha, Gamblin, Todd, and Bhatele, Abhinav
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Parallel software codes in high performance computing (HPC) continue to grow in complexity and scale as we enter the exascale era. A diverse set of emerging hardware and programming paradigms make developing, optimizing, and maintaining parallel software burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers that increase their productivity and decrease the chance for error. So far, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform. However, with recent advancements in language modeling, and the wealth of code related data that is now available online, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We train LLMs using code and performance data that is specific to parallel codes. We compare several recent LLMs on HPC related tasks and introduce a new model, HPC-Coder, trained on parallel code. In our experiments we show that this model can auto-complete HPC functions where general models cannot, decorate for loops with OpenMP pragmas, and model performance changes in two scientific application repositories.
Published: 2023
Full Text: View/download PDF

38. Communication-minimizing Asynchronous Tensor Parallelism

Author: Singh, Siddharth, Sating, Zack, and Bhatele, Abhinav
Subjects: Performance (cs.PF), FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Performance, Artificial Intelligence (cs.AI), Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG)
Abstract: As state-of-the-art neural networks scale to billions of parameters, designing parallel algorithms that can train these networks efficiently on multi-GPU clusters has become critical. This paper presents Tensor3D, a novel three-dimensional (3D) approach to parallelize tensor computations, that strives to minimize the idle time incurred due to communication in parallel training of large multi-billion parameter models. First, we introduce an intelligent distribution of neural network parameters across GPUs that eliminates communication required for satisfying data dependencies of individual layers. Then, we propose a novel overdecomposition of the parallel training process, using which we achieve significant overlap of communication with computation, thereby reducing GPU idle time. Finally, we present a communication model, which helps users identify communication optimal decompositions of available hardware resources for a given neural network. For a 28B parameter CNN on 256 A100 GPUs, Tensor3D improves the training time by nearly 60% as compared to Megatron-LM.
Published: 2023
Full Text: View/download PDF

39. Creating a Tool Set for Optimizing Topology-Aware Node Mappings

Author: Schulz, Martin, Bhatele, Abhinav, Bremer, Peer-Timo, Gamblin, Todd, Isaacs, Katherine, Levine, Joshua A., Pascucci, Valerio, Brunst, Holger, editor, Müller, Matthias S., editor, Nagel, Wolfgang E., editor, and Resch, Michael M., editor
Published: 2012
Full Text: View/download PDF

40. Topology Aware Task Mapping

Author: Bhatele, Abhinav and Padua, David, editor
Published: 2011
Full Text: View/download PDF

41. NAMD (NAnoscale Molecular Dynamics)

Author: Kalé, Laxmikant V., Bhatele, Abhinav, Bohm, Eric J., Phillips, James C., and Padua, David, editor
Published: 2011
Full Text: View/download PDF

42. A Case Study of Communication Optimizations on 3D Mesh Interconnects

Author: Bhatelé, Abhinav, Bohm, Eric, Kalé, Laxmikant V., Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Sips, Henk, editor, Epema, Dick, editor, and Lin, Hai-Xiang, editor
Published: 2009
Full Text: View/download PDF

43. Verifying Simulator Readiness for Evaluating Potential Exascale Interconnect Technologies [PowerPoint]

Author: Hemmert, Karl, primary, Wilke, Jeremiah, additional, Kenny, Joseph, additional, Lewis, Cannada, additional, Bhatele, Abhinav, additional, Georgakoudis, Giorgis, additional, Pakin, Scott, additional, Mubarak, Misbah, additional, and Groves, Taylor, additional
Published: 2019
Full Text: View/download PDF

44. Designing an Interactive, Notebook-Embedded, Tree Visualization to Support Exploratory Performance Analysis

Author: Scully-Allison, Connor, Lumsden, Ian, Williams, Katy, Bartels, Jesse, Taufer, Michela, Brink, Stephanie, Bhatele, Abhinav, Pearce, Olga, and Isaacs, Katherine E.
Subjects: FOS: Computer and information sciences, Computer Science - Human-Computer Interaction, Human-Computer Interaction (cs.HC)
Abstract: Interactive visualization via direct manipulation has inherent design trade-offs in flexibility, discoverability, and ease-of-use. Scripting languages can support a vast range of user queries and tasks, but may be more cumbersome for free-form exploration. Embedding interactive visualization in a scripting environment, such as a computational notebook, provides an opportunity for leveraging the strengths of both direct manipulation and scripting. We conduct a design study investigating this opportunity in the context of calling context trees as used for performance analysis of parallel software. Our collaborators make new performance analysis functionality available to users via Jupyter notebook examples, making the project setting conducive to such an investigation. Through a series of semi-structured interviews and regular meetings with project stakeholders, we produce a formal task analysis grounded in the expectation that tasks may be supported by scripting, interactive visualization, or both paradigms. We then design an interactive bivariate calling context tree visualization for embedding in Jupyter notebooks with features to pass data and state between the scripting and visualization contexts. We evaluated our embedded design with seven high performance computing experts. The experts were able to complete tasks and provided further feedback on the visualization and the notebook-embedded interactive visualization paradigm. We reflect upon the project and discuss factors in both the process and the design of the embedded visualization., Comment: Submitted to IEEE VIS 2022
Published: 2022

45. AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning

Author: Singh, Siddharth and Bhatele, Abhinav
Subjects: Performance (cs.PF), FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Performance, Artificial Intelligence (cs.AI), Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Distributed, Parallel, and Cluster Computing (cs.DC), Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: In the last few years, the memory requirements to train state-of-the-art neural networks have far exceeded the DRAM capacities of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these neural networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing extremely efficient communication in these parallel training algorithms is critical for extracting the maximum performance. This paper presents AxoNN, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by four times. This allows us to increase the number of parameters per GPU by four times, thus reducing the amount of communication and increasing performance by over 13%. When tested against large transformer models with 12-100 billion parameters on 48-384 NVIDIA Tesla V100 GPUs, AxoNN achieves a per-GPU throughput of 49.4-54.78% of theoretical peak and reduces the training time by 22-37 days (15-25% speedup) as compared to the state-of-the-art., Comment: Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE Computer Society, May 2022
Published: 2022

46. Resource Utilization Aware Job Scheduling to Mitigate Performance Variability

Author: Nichols, Daniel, primary, Marathe, Aniruddha, additional, Shoga, Kathleen, additional, Gamblin, Todd, additional, and Bhatele, Abhinav, additional
Published: 2022
Full Text: View/download PDF

47. Preliminary Evaluation of a Parallel Trace Replay Tool for HPC Network Simulations

Author: Acun, Bilge, primary, Jain, Nikhil, additional, Bhatele, Abhinav, additional, Mubarak, Misbah, additional, Carothers, Christopher D., additional, and Kale, Laxmikant V., additional
Published: 2015
Full Text: View/download PDF

48. A Simulation Study of Hardware Parameters for Future GPU-based HPC Platforms

Author: Bhowmik, Saptarshi, primary, Jain, Nikhil, additional, Yuan, Xin, additional, and Bhatele, Abhinav, additional
Published: 2021
Full Text: View/download PDF

49. Visualizing Hierarchical Performance Profiles of Parallel Codes Using CallFlow

Author: Nguyen, Huu Tan, primary, Bhatele, Abhinav, additional, Jain, Nikhil, additional, Kesavan, Suraj P., additional, Bhatia, Harsh, additional, Gamblin, Todd, additional, Ma, Kwan-Liu, additional, and Bremer, Peer-Timo, additional
Published: 2021
Full Text: View/download PDF

50. Understanding and Mitigating Network Interference on High-Performance Computing Systems

Author: Strout, Michelle, Zhang, Beichuan, Bhatele, Abhinav, Jain, Nikhil, and Smith, Staci
Abstract: On most high-performance computing platforms, concurrently executing jobs share network resources. This sharing can lead to inter-job network interference, which can have a significant impact on the performance of communication-intensive applications. In this dissertation we focus on understanding and mitigating inter-job network interference on systems built with the fat-tree topology, a network architecture that is currently deployed in many of the top supercomputers in the world. We first analyze network congestion caused by multi-job workloads on a production fat-tree based system Cab, and establish a regression model to relate network hotspots to application performance degradation. The model shows that the typical routing strategy for fat-tree networks is ineffective at balancing network traffic and mitigating interference. We propose an alternative adaptive routing strategy, which we call adaptive flow-aware routing. We implement our strategy on Cab, and tests show up to a 46% improvement in job run time when compared to default routing. However, any reactive, routing-based approach to mitigating inter-job interference cannot guarantee low worst-case interference. A better approach—in that it completely eliminates the interference—is to implement scheduling policies that proactively enforce network isolation for every job. Existing schedulers that allocate isolated partitions lead to lowered system utilization, which creates a barrier to adoption. Accordingly, we design and implement Jigsaw, a new scheduling approach for three-level fat-trees that overcomes this barrier by explicitly allocating nodes and links to jobs in such a way that system fragmentation is reduced while full bandwidth is guaranteed to each job. This is made possible by constraints on node and link allocation that we develop and prove are necessary for full partition bandwidth. 
Jigsaw typically achieves system utilization of 95-96%, within a few percentage points of a standard scheduler's
Published: 2020
