Author: "Chowdhury, Mosharaf" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Chowdhury, Mosharaf"' showing total 164 results

1. Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services

Author: Liu, Jiachen, Wu, Zhiyu, Chung, Jae-Won, Lai, Fan, Lee, Myungjin, and Chowdhury, Mosharaf
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: The advent of large language models (LLMs) has transformed text-based services, enabling capabilities ranging from real-time translation to AI-driven chatbots. However, existing serving systems primarily focus on optimizing server-side aggregate metrics like token generation throughput, ignoring individual user experience with streamed text. As a result, under high and/or bursty load, a significant number of users can receive unfavorable service quality or poor Quality-of-Experience (QoE). In this paper, we first formally define QoE of text streaming services, where text is delivered incrementally and interactively to users, by considering the end-to-end token delivery process throughout the entire interaction with the user. Thereafter, we propose Andes, a QoE-aware serving system that enhances user experience for LLM-enabled text streaming services. At its core, Andes strategically allocates contended GPU resources among multiple requests over time to optimize their QoE. Our evaluations demonstrate that, compared to the state-of-the-art LLM serving systems like vLLM, Andes improves the average QoE by up to 3.2$\times$ under high request rate, or alternatively, it attains up to 1.6$\times$ higher request rate while preserving high QoE., Comment: 16 pages, 22 figures
Published: 2024

2. FedTrans: Efficient Federated Learning via Multi-Model Transformation

Author: Zhu, Yuxuan, Liu, Jiachen, Chowdhury, Mosharaf, and Lai, Fan
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Federated learning (FL) aims to train machine learning (ML) models across potentially millions of edge client devices. Yet, training and customizing models for FL clients is notoriously challenging due to the heterogeneity of client data, device capabilities, and the massive scale of clients, making individualized model exploration prohibitively expensive. State-of-the-art FL solutions personalize a globally trained model or concurrently train multiple models, but they often incur suboptimal model accuracy and huge training costs. In this paper, we introduce FedTrans, a multi-model FL training framework that automatically produces and trains high-accuracy, hardware-compatible models for individual clients at scale. FedTrans begins with a basic global model, identifies accuracy bottlenecks in model architectures during training, and then employs model transformation to derive new models for heterogeneous clients on the fly. It judiciously assigns models to individual clients while performing soft aggregation on multi-model updates to minimize total training costs. Our evaluations using realistic settings show that FedTrans improves individual client model accuracy by 14% - 72% while slashing training costs by 1.6X - 20X over state-of-the-art solutions.
Published: 2024

3. Toward Cross-Layer Energy Optimizations in AI Systems

Author: Chung, Jae-Won, Talati, Nishil, and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The "AI for Science, Energy, and Security" report from DOE outlines a significant focus on developing and optimizing artificial intelligence workflows for a foundational impact on a broad range of DOE missions. With the pervasive usage of artificial intelligence (AI) and machine learning (ML) tools and techniques, their energy efficiency is likely to become the gating factor toward adoption. This is because generative AI (GenAI) models are massive energy hogs: for instance, training a 200-billion parameter large language model (LLM) at Amazon is estimated to have taken 11.9 GWh, which is enough to power more than a thousand average U.S. households for a year. Inference consumes even more energy, because a model trained once serves millions. Given this scale, high energy efficiency is key to addressing the power delivery problem of constructing and operating new supercomputers and datacenters specialized for AI workloads. In that regard, we outline software- and architecture-level research challenges and opportunities, setting the stage for creating cross-layer energy optimizations in AI systems., Comment: 2024 Energy-Efficient Computing for Science Workshop
Published: 2024
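
The household comparison in the abstract above can be checked with quick arithmetic. The sketch below assumes an average U.S. household uses roughly 10,500 kWh of electricity per year; that figure and the helper name are assumptions for illustration and do not come from the abstract.

```python
# Rough sanity check of the "more than a thousand households" comparison.
TRAINING_ENERGY_GWH = 11.9                # figure quoted in the abstract
KWH_PER_GWH = 1_000_000                   # unit conversion: 1 GWh = 1,000,000 kWh
ASSUMED_HOUSEHOLD_KWH_PER_YEAR = 10_500   # assumption, not from the abstract


def households_powered_for_a_year(energy_gwh: float, household_kwh_per_year: float) -> float:
    """Number of households whose yearly electricity use equals the given energy."""
    return energy_gwh * KWH_PER_GWH / household_kwh_per_year


households = households_powered_for_a_year(
    TRAINING_ENERGY_GWH, ASSUMED_HOUSEHOLD_KWH_PER_YEAR
)
print(f"about {households:.0f} households for one year")
```

Under this assumed consumption figure the result lands a little above one thousand, which is consistent with the abstract's wording.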

4. Venn: Resource Management Across Federated Learning Jobs

Author: Liu, Jiachen, Lai, Fan, Ding, Ding, Zhang, Yiwen, and Chowdhury, Mosharaf
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: In recent years, federated learning (FL) has emerged as a promising approach for machine learning (ML) and data science across distributed edge devices. With the increasing popularity of FL, resource contention between multiple FL jobs training on the same device population is increasing as well. Scheduling edge resources among multiple FL jobs is different from GPU scheduling for cloud ML because of the ephemeral nature and planetary scale of participating devices as well as the overlapping resource requirements of diverse FL jobs. Existing resource managers for FL jobs opt for random assignment of devices to FL jobs for simplicity and scalability, which leads to poor performance. In this paper, we present Venn, an FL resource manager, that efficiently schedules ephemeral, heterogeneous devices among many FL jobs, with the goal of reducing their average job completion time (JCT). Venn formulates the Intersection Resource Scheduling (IRS) problem to identify complex resource contention among multiple FL jobs. Then, Venn proposes a contention-aware scheduling heuristic to minimize the average scheduling delay. Furthermore, it proposes a resource-aware device-to-job matching heuristic that focuses on optimizing response collection time by mitigating stragglers. Our evaluation shows that, compared to the state-of-the-art FL resource managers, Venn improves the average JCT by up to 1.88X., Comment: 15 pages, 15 figures
Published: 2023

5. Reducing Energy Bloat in Large Model Training

Author: Chung, Jae-Won, Gu, Yile, Jang, Insu, Meng, Luoxi, Bansal, Nikhil, and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Training large AI models on numerous GPUs consumes a massive amount of energy, making power delivery one of the largest limiting factors in building and operating datacenters for AI workloads. However, we observe that not all energy consumed during training directly contributes to end-to-end throughput; a significant portion can be removed without slowing down training. We call this portion energy bloat. In this work, we identify two independent sources of energy bloat in large model training and propose Perseus, a training system that mitigates both. To do this, Perseus obtains the time--energy tradeoff frontier of a large model training job using an efficient graph cut-based algorithm, and schedules computation energy consumption across time to reduce both types of energy bloat. Evaluation on large models, including GPT-3 and Bloom, shows that Perseus reduces the energy consumption of large model training by up to 30% without any throughput loss or hardware modification., Comment: SOSP 24 | Open-source part of Zeus at https://ml.energy/zeus/research_overview/perseus/
Published: 2023

6. Efficient Large Language Models: A Survey

Author: Wan, Zhongwei, Wang, Xin, Liu, Che, Alam, Samiul, Zheng, Yu, Liu, Jiachen, Qu, Zhongnan, Yan, Shen, Zhu, Yi, Zhang, Quanlu, Chowdhury, Mosharaf, and Zhang, Mi
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding and language generation, and thus have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency challenges. In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspective, respectively. We have also created a GitHub repository where we organize the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey. We will actively maintain the repository and incorporate new research as it emerges. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of efficient LLMs research and inspire them to contribute to this important and exciting field., Comment: Camera ready version of Transactions on Machine Learning Research (TMLR)
Published: 2023

7. Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

Author: Jang, Insu, Yang, Zhenning, Zhang, Zhen, Jin, Xin, and Chowdhury, Mosharaf
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least $f+1$ logically equivalent pipeline replicas to tolerate any $f$ simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after $f$ or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to $29.6x$., Comment: SOSP'23 | Camera-ready + figures and numbers are corrected
Published: 2023

8. Memory Disaggregation: Advances and Open Challenges

Author: Maruf, Hasan Al and Chowdhury, Mosharaf
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Compute and memory are tightly coupled within each server in traditional datacenters. Large-scale datacenter operators have identified this coupling as a root cause behind fleet-wide resource underutilization and increasing Total Cost of Ownership (TCO). With the advent of ultra-fast networks and cache-coherent interfaces, memory disaggregation has emerged as a potential solution, whereby applications can leverage available memory even outside server boundaries. This paper summarizes the growing research landscape of memory disaggregation from a software perspective and introduces the challenges toward making it practical under current and future hardware trends. We also reflect on our seven-year journey in the SymbioticLab to build a comprehensive disaggregated memory system over ultra-fast networks. We conclude with some open challenges toward building next-generation memory disaggregation systems leveraging emerging cache-coherent interconnects.
Published: 2023

9. Chasing Low-Carbon Electricity for Practical and Sustainable DNN Training

Author: Yang, Zhenning, Meng, Luoxi, Chung, Jae-Won, and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Deep learning has experienced significant growth in recent years, resulting in increased energy consumption and carbon emission from the use of GPUs for training deep neural networks (DNNs). Answering the call for sustainability, conventional solutions have attempted to move training jobs to locations or time frames with lower carbon intensity. However, moving jobs to other locations may not always be feasible due to large dataset sizes or data regulations. Moreover, postponing training can negatively impact application service quality because the DNNs backing the service are not updated in a timely fashion. In this work, we present a practical solution that reduces the carbon footprint of DNN training without migrating or postponing jobs. Specifically, our solution observes real-time carbon intensity shifts during training and controls the energy consumption of GPUs, thereby reducing carbon footprint while maintaining training performance. Furthermore, in order to proactively adapt to shifting carbon intensity, we propose a lightweight machine learning algorithm that predicts the carbon intensity of the upcoming time frame. Our solution, Chase, reduces the total carbon footprint of training ResNet-50 on ImageNet by 13.6% while only increasing training time by 2.5%., Comment: ICLR 23 Workshop | https://ml.energy/zeus
Published: 2023

10. FLINT: A Platform for Federated Learning Integration

Author: Wang, Ewen, Kannan, Ajay, Liang, Yuefeng, Chen, Boyi, and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, F.2.2, I.2.7
Abstract: Cross-device federated learning (FL) has been well-studied from algorithmic, system scalability, and training speed perspectives. Nonetheless, moving from centralized training to cross-device FL for millions or billions of devices presents many risks, including performance loss, developer inertia, poor user experience, and unexpected application failures. In addition, the corresponding infrastructure, development costs, and return on investment are difficult to estimate. In this paper, we present a device-cloud collaborative FL platform that integrates with an existing machine learning platform, providing tools to measure real-world constraints, assess infrastructure capabilities, evaluate model training performance, and estimate system resource requirements to responsibly bring FL into production. We also present a decision workflow that leverages the FL-integrated platform to comprehensively evaluate the trade-offs of cross-device FL and share our empirical evaluations of business-critical machine learning applications that impact hundreds of millions of users., Comment: Preprint for MLSys 2023
Published: 2023

11. DPack: Efficiency-Oriented Privacy Budget Scheduling

Author: Tholoniat, Pierre, Kostopoulou, Kelly, Chowdhury, Mosharaf, Cidon, Asaf, Geambasu, Roxana, Lécuyer, Mathias, and Yang, Junfeng
Subjects: Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Machine learning (ML) models can leak information about users, and differential privacy (DP) provides a rigorous way to bound that leakage under a given budget. This DP budget can be regarded as a new type of compute resource in workloads of multiple ML models training on user data. Once it is used, the DP budget is forever consumed. Therefore, it is crucial to allocate it most efficiently to train as many models as possible. This paper presents the scheduler for privacy that optimizes for efficiency. We formulate privacy scheduling as a new type of multidimensional knapsack problem, called privacy knapsack, which maximizes DP budget efficiency. We show that privacy knapsack is NP-hard, hence practical algorithms are necessarily approximate. We develop an approximation algorithm for privacy knapsack, DPack, and evaluate it on microbenchmarks and on a new, synthetic private-ML workload we developed from the Alibaba ML cluster trace. We show that DPack: (1) often approaches the efficiency-optimal schedule, (2) consistently schedules more tasks compared to a state-of-the-art privacy scheduling algorithm that focused on fairness (1.3-1.7x in Alibaba, 1.0-2.6x in microbenchmarks), but (3) sacrifices some level of fairness for efficiency. Therefore, using DPack, DP ML operators should be able to train more models on the same amount of user data while offering the same privacy guarantee to their users., Comment: Published at EuroSys '25. v2: camera-ready version
Published: 2022

12. Auxo: Efficient Federated Learning via Scalable Client Clustering

Author: Liu, Jiachen, Lai, Fan, Dai, Yinwei, Akella, Aditya, Madhyastha, Harsha, and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Federated learning (FL) is an emerging machine learning (ML) paradigm that enables heterogeneous edge devices to collaboratively train ML models without revealing their raw data to a logically centralized server. However, beyond the heterogeneous device capacity, FL participants often exhibit differences in their data distributions, which are not independent and identically distributed (Non-IID). Many existing works present point solutions to address issues like slow convergence, low final accuracy, and bias in FL, all stemming from client heterogeneity. In this paper, we explore an additional layer of complexity to mitigate such heterogeneity by grouping clients with statistically similar data distributions (cohorts). We propose Auxo to gradually identify such cohorts in large-scale, low-availability, and resource-constrained FL populations. Auxo then adaptively determines how to train cohort-specific models in order to achieve better model performance and ensure resource efficiency. Our extensive evaluations show that, by identifying cohorts with smaller heterogeneity and performing efficient cohort-based training, Auxo boosts various existing FL solutions in terms of final accuracy (2.1% - 8.2%), convergence time (up to 2.2x), and model bias (4.8% - 53.8%)., Comment: 18 pages
Published: 2022

13. Orloj: Predictably Serving Unpredictable DNNs

Author: Yu, Peifeng, Qiu, Yuqing, Jin, Xin, and Chowdhury, Mosharaf
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Existing DNN serving solutions can provide tight latency SLOs while maintaining high throughput via careful scheduling of incoming requests, whose execution times are assumed to be highly predictable and data-independent. However, inference requests to emerging dynamic DNNs -- e.g., popular natural language processing (NLP) models and computer vision (CV) models that skip layers -- are data-dependent. They exhibit poor performance when served using existing solutions because they experience large variance in request execution times depending on the input -- the longest request in a batch inflates the execution times of the smaller ones, causing SLO misses in the absence of careful batching. In this paper, we present Orloj, a dynamic DNN serving system, that captures this variance in dynamic DNNs using empirical distributions of expected request execution times, and then efficiently batches and schedules them without knowing a request's precise execution time. Orloj significantly outperforms state-of-the-art serving solutions for high variance dynamic DNN workloads by 51--80% in finish rate under tight SLO constraints, and over 100% under more relaxed SLO settings. For well-studied static DNN workloads, Orloj keeps comparable performance with the state-of-the-art.
Published: 2022

14. Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Author: You, Jie, Chung, Jae-Won, and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose Zeus, an optimization framework to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3%-75.8% for diverse workloads., Comment: NSDI 2023 | Homepage https://ml.energy/zeus
Published: 2022

15. Swan: A Neural Engine for Efficient DNN Training on Smartphone SoCs

Author: Singapuram, Sanjay Sri Vallabh, Lai, Fan, Hu, Chuheng, and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The need to train DNN models on end-user devices (e.g., smartphones) is increasing with the need to improve data privacy and reduce communication overheads. Unlike datacenter servers with powerful CPUs and GPUs, modern smartphones consist of a diverse collection of specialized cores following a system-on-a-chip (SoC) architecture that together perform a variety of tasks. We observe that training DNNs on a smartphone SoC without carefully considering its resource constraints can not only lead to suboptimal training performance but significantly affect user experience as well. In this paper, we present Swan, a neural engine to optimize DNN training on smartphone SoCs without hurting user experience. Extensive large-scale evaluations show that Swan can improve performance by 1.2 - 23.3x over the state-of-the-art.
Published: 2022

16. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory

Author: Maruf, Hasan Al, Wang, Hao, Dhanotia, Abhishek, Weiner, Johannes, Agarwal, Niket, Bhattacharya, Pallab, Petersen, Chris, Chowdhury, Mosharaf, Kanaujia, Shobhit, and Chauhan, Prakash
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Operating Systems
Abstract: The increasing demand for memory in hyperscale applications has led to memory becoming a large portion of the overall datacenter spend. The emergence of coherent interfaces like CXL enables main memory expansion and offers an efficient solution to this problem. In such systems, the main memory can constitute different memory technologies with varied characteristics. In this paper, we characterize memory usage patterns of a wide range of datacenter applications across the server fleet of Meta. We, therefore, demonstrate the opportunities to offload colder pages to slower memory tiers for these applications. Without efficient memory management, however, such systems can significantly degrade performance. We propose a novel OS-level application-transparent page placement mechanism (TPP) for CXL-enabled memory. TPP employs a lightweight mechanism to identify and place hot/cold pages to appropriate memory tiers. It enables a proactive page demotion from local memory to CXL-Memory. This technique ensures a memory headroom for new page allocations that are often related to request processing and tend to be short-lived and hot. At the same time, TPP can promptly promote performance-critical hot pages trapped in the slow CXL-Memory to the fast local memory, while minimizing both sampling overhead and unnecessary migrations. TPP works transparently without any application-specific knowledge and can be deployed globally as a kernel release. We evaluate TPP in the production server fleet with early samples of new x86 CPUs with CXL 1.1 support. TPP makes a tiered memory system performant as an ideal baseline (<1% gap) that has all the memory in the local tier. It is 18% better than today's Linux, and 5-17% better than existing solutions including NUMA Balancing and AutoTiering. Most of the TPP patches have been merged in the Linux v5.18 release.
Published: 2022

17. Elastic Model Aggregation with Parameter Service

Author: Gu, Juncheng, Chowdhury, Mosharaf, Shin, Kang G., and Akella, Aditya
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Model aggregation, the process that updates model parameters, is an important step for model convergence in distributed deep learning (DDL). However, the parameter server (PS), a popular paradigm of performing model aggregation, causes CPU underutilization in deep learning (DL) clusters, due to the bursty nature of aggregation and static resource allocation. To remedy this problem, we propose Parameter Service, an elastic model aggregation framework for DDL training, which decouples the function of model aggregation from individual training jobs and provides a shared model aggregation service to all jobs in the cluster. In Parameter Service, model aggregations are efficiently packed and dynamically migrated to fit into the available CPUs with negligible time overhead. Furthermore, Parameter Service can elastically manage its CPU resources based on its load to enhance resource efficiency. We have implemented Parameter Service in a prototype system called AutoPS and evaluated it via testbed experimentation and trace-driven simulations. AutoPS reduces up to 75% of CPU consumption with little or no performance impact on the training jobs. The design of Parameter Service is transparent to the users and can be incorporated in popular DL frameworks.
Published: 2022

18. Egeria: Efficient DNN Training with Knowledge-Guided Layer Freezing

Author: Wang, Yiding, Sun, Decang, Chen, Kai, Lai, Fan, and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Training deep neural networks (DNNs) is time-consuming. While most existing solutions try to overlap/schedule computation and communication for efficient training, this paper goes one step further by skipping computing and communication through DNN layer freezing. Our key insight is that the training progress of internal DNN layers differs significantly, and front layers often become well-trained much earlier than deep layers. To explore this, we first introduce the notion of training plasticity to quantify the training progress of internal DNN layers. Then we design Egeria, a knowledge-guided DNN training system that employs semantic knowledge from a reference model to accurately evaluate individual layers' training plasticity and safely freeze the converged ones, saving their corresponding backward computation and communication. Our reference model is generated on the fly using quantization techniques and runs forward operations asynchronously on available CPUs to minimize the overhead. In addition, Egeria caches the intermediate outputs of the frozen layers with prefetching to further skip the forward computation. Our implementation and testbed experiments with popular vision and language models show that Egeria achieves 19%-43% training speedup w.r.t. the state-of-the-art without sacrificing accuracy., Comment: Accepted to EuroSys '23
Published: 2022

19. Treehouse: A Case For Carbon-Aware Datacenter Software

Author: Anderson, Thomas, Belay, Adam, Chowdhury, Mosharaf, Cidon, Asaf, and Zhang, Irene
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Computers and Society, Computer Science - Machine Learning, Computer Science - Networking and Internet Architecture
Abstract: The end of Dennard scaling and the slowing of Moore's Law have put the energy use of datacenters on an unsustainable path. Datacenters are already a significant fraction of worldwide electricity use, with application demand scaling at a rapid rate. We argue that substantial reductions in the carbon intensity of datacenter computing are possible with a software-centric approach: by making energy and carbon visible to application developers on a fine-grained basis, by modifying system APIs to make it possible to make informed trade-offs between performance and carbon emissions, and by raising the level of application programming to allow for flexible use of more energy efficient means of compute and storage. We also lay out a research agenda for systems software to reduce the carbon footprint of datacenter computing.
Published: 2022

20. The Internet of Federated Things (IoFT): A Vision for the Future and In-depth Survey of Data-driven Approaches for Federated Learning

Author: Kontar, Raed, Shi, Naichen, Yue, Xubo, Chung, Seokhyun, Byon, Eunshin, Chowdhury, Mosharaf, Jin, Judy, Kontar, Wissam, Masoud, Neda, Noueihed, Maher, Okwudire, Chinedum E., Raskutti, Garvesh, Saigal, Romesh, Singh, Karandeep, and Ye, Zhisheng
Subjects: Computer Science - Machine Learning
Abstract: The Internet of Things (IoT) is on the verge of a major paradigm shift. In the IoT system of the future, IoFT, the cloud will be substituted by the crowd where model training is brought to the edge, allowing IoT devices to collaboratively extract knowledge and build smart analytics/models while keeping their personal data stored locally. This paradigm shift was set into motion by the tremendous increase in computational power on IoT devices and the recent advances in decentralized and privacy-preserving model training, coined as federated learning (FL). This article provides a vision for IoFT and a systematic overview of current efforts towards realizing this vision. Specifically, we first introduce the defining characteristics of IoFT and discuss FL data-driven approaches, opportunities, and challenges that allow decentralized inference within three dimensions: (i) a global model that maximizes utility across all IoT devices, (ii) a personalized model that borrows strengths across all devices yet retains its own model, (iii) a meta-learning model that quickly adapts to new devices or learning tasks. We end by describing the vision and challenges of IoFT in reshaping different industries through the lens of domain experts. Those industries include manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing., Comment: Accepted at IEEE
Published: 2021

21. Memtrade: A Disaggregated-Memory Marketplace for Public Clouds

Author: Maruf, Hasan Al, Zhong, Yuhong, Wang, Hongyi, Chowdhury, Mosharaf, Cidon, Asaf, and Waldspurger, Carl
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: We present Memtrade, the first memory disaggregation system for public clouds. Public clouds introduce a set of unique challenges for resource disaggregation across different tenants, including security, isolation and pricing. Memtrade allows producer virtual machines (VMs) to lease both their unallocated memory and allocated-but-idle application memory to remote consumer VMs for a limited period of time. Memtrade does not require any modifications to host-level system software or support from the cloud provider. It harvests producer memory using an application-aware control loop to form a distributed transient remote memory pool with minimal performance impact; it employs a broker to match producers with consumers while satisfying performance constraints; and it exposes the matched memory to consumers as a secure KV cache. Our evaluation using real-world cluster traces shows that Memtrade provides significant performance benefit for consumers (improving average read latency up to 2.8x) while preserving confidentiality and integrity, with little impact on producer applications (degrading performance by less than 2.1%).
Published: 2021

22. Fed-ensemble: Improving Generalization through Model Ensembling in Federated Learning

Author: Shi, Naichen, Lai, Fan, Kontar, Raed Al, and Chowdhury, Mosharaf
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: In this paper we propose Fed-ensemble: a simple approach that brings model ensembling to federated learning (FL). Instead of aggregating local models to update a single global model, Fed-ensemble uses random permutations to update a group of K models and then obtains predictions through model averaging. Fed-ensemble can be readily utilized within established FL methods and does not impose a computational overhead as it only requires one of the K models to be sent to a client in each communication round. Theoretically, we show that predictions on new data from all K models belong to the same predictive posterior distribution under a neural tangent kernel regime. This result in turn sheds light on the generalization advantages of model averaging. We also illustrate that Fed-ensemble has an elegant Bayesian interpretation. Empirical results show that our model has superior performance over several FL algorithms, on a wide range of data sets, and excels in heterogeneous settings often encountered in FL applications.
Published: 2021
Full Text: View/download PDF

23. FedScale: Benchmarking Model and System Performance of Federated Learning at Scale

Author: Lai, Fan, Dai, Yinwei, Singapuram, Sanjay S., Liu, Jiachen, Zhu, Xiangfeng, Madhyastha, Harsha V., and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: We present FedScale, a federated learning (FL) benchmarking suite with realistic datasets and a scalable runtime to enable reproducible FL research. FedScale datasets encompass a wide range of critical FL tasks, ranging from image classification and object detection to language modeling and speech recognition. Each dataset comes with a unified evaluation protocol using real-world data splits and evaluation metrics. To reproduce realistic FL behavior, FedScale contains a scalable and extensible runtime. It provides high-level APIs to implement FL algorithms, deploy them at scale across diverse hardware and software backends, and evaluate them at scale, all with minimal developer efforts. We combine the two to perform systematic benchmarking experiments and highlight potential opportunities for heterogeneity-aware co-optimizations in FL. FedScale is open-source and actively maintained by contributors from different institutions at http://fedscale.ai. We welcome feedback and contributions from the community.
Published: 2021

24. Oort: Efficient Federated Learning via Guided Participant Selection

Author: Lai, Fan, Zhu, Xiangfeng, Madhyastha, Harsha V., and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Federated Learning (FL) is an emerging direction in distributed machine learning (ML) that enables in-situ model training and testing on edge data. Despite having the same end goals as traditional ML, FL executions differ significantly in scale, spanning thousands to millions of participating devices. As a result, data characteristics and device capabilities vary widely across clients. Yet, existing efforts randomly select FL participants, which leads to poor model and system efficiency. In this paper, we propose Oort to improve the performance of federated training and testing with guided participant selection. With an aim to improve time-to-accuracy performance in model training, Oort prioritizes the use of those clients who have both data that offers the greatest utility in improving model accuracy and the capability to run training quickly. To enable FL developers to interpret their results in model testing, Oort enforces their requirements on the distribution of participant data while improving the duration of federated testing by cherry-picking clients. Our evaluation shows that, compared to existing participant selection mechanisms, Oort improves time-to-accuracy performance by 1.2x-14.1x and final model accuracy by 1.3%-9.8%, while efficiently enforcing developer-specified model testing criteria at the scale of millions of clients.
Published: 2020

25. BoPF: Mitigating the Burstiness-Fairness Tradeoff in Multi-Resource Clusters

Author: Le, Tan N., Sun, Xiao, Chowdhury, Mosharaf, and Liu, Zhenhua
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Simultaneously supporting latency- and throughput-sensitive workloads in a shared environment is an increasingly common challenge in big data clusters. Despite many advances, existing cluster schedulers force the same performance goal - fairness in most cases - on all jobs. Latency-sensitive jobs suffer, while throughput-sensitive ones thrive. Using prioritization does the opposite: it opens up a path for latency-sensitive jobs to dominate. In this paper, we tackle the challenges in supporting both short-term performance and long-term fairness simultaneously with high resource utilization by proposing Bounded Priority Fairness (BoPF). BoPF provides short-term resource guarantees to latency-sensitive jobs and maintains long-term fairness for throughput-sensitive jobs. BoPF is the first scheduler that can provide long-term fairness, burst guarantee, and Pareto efficiency in a strategyproof manner for multi-resource scheduling. Deployments and large-scale simulations show that BoPF closely approximates the performance of Strict Priority as well as the fairness characteristics of DRF. In deployments, BoPF speeds up latency-sensitive jobs by 5.38 times compared to DRF, while still maintaining long-term fairness. In the meantime, BoPF improves the average completion times of throughput-sensitive jobs by up to 3.05 times compared to Strict Priority.
Published: 2019

26. Effectively Prefetching Remote Memory with Leap

Author: Maruf, Hasan Al and Chowdhury, Mosharaf
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Operating Systems
Abstract: Memory disaggregation over RDMA can improve the performance of memory-constrained applications by replacing disk swapping with remote memory accesses. However, state-of-the-art memory disaggregation solutions still use data path components designed for slow disks. As a result, applications experience remote memory access latency significantly higher than that of the underlying low-latency network, which itself is too high for many applications. In this paper, we propose Leap, a prefetching solution for remote memory accesses due to memory disaggregation. At its core, Leap employs an online, majority-based prefetching algorithm, which increases the page cache hit rate. We complement it with a lightweight and efficient data path in the kernel that isolates each application's data path to the disaggregated memory and mitigates latency bottlenecks arising from legacy throughput-optimizing operations. Integration of Leap in the Linux kernel improves the median and tail remote page access latencies of memory-bound applications by up to 104.04x and 22.62x, respectively, over the default data path. This leads to up to 10.16x performance improvements for applications using disaggregated memory in comparison to the state-of-the-art solutions.
Published: 2019

27. Hydra: Resilient and Highly Available Remote Memory

Author: Lee, Youngmoon, Maruf, Hasan Al, Chowdhury, Mosharaf, Cidon, Asaf, and Shin, Kang G.
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Networking and Internet Architecture
Abstract: We present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for remote memory. Hydra can access erasure-coded remote memory within a single-digit microsecond read/write latency, significantly improving the performance-efficiency trade-off over the state-of-the-art -- it performs similar to in-memory replication with 1.6X lower memory overhead. We also propose CodingSets, a novel coding group placement algorithm for erasure-coded data, that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude. With Hydra, even when only 50% of memory is local, unmodified memory-intensive applications achieve performance close to that of the fully in-memory case in the presence of remote failures and outperform the state-of-the-art solutions by up to 4.35X.
Published: 2019

28. Near Optimal Coflow Scheduling in Networks

Author: Chowdhury, Mosharaf, Khuller, Samir, Purohit, Manish, Yang, Sheng, and You, Jie
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms
Abstract: The coflow scheduling problem has emerged as a popular abstraction in the last few years to study data communication problems within a data center. In this basic framework, each coflow has a set of communication demands and the goal is to schedule many coflows in a manner that minimizes the total weighted completion time. A coflow is said to complete when all its communication needs are met. This problem has been extremely well studied for the case of complete bipartite graphs that model a data center with full bisection bandwidth and several approximation algorithms and effective heuristics have been proposed recently. In this work, we study a slightly different model of coflow scheduling in general graphs (to capture traffic between data centers) and develop practical and efficient approximation algorithms for it. Our main result is a randomized 2 approximation algorithm for the single path and free path model, significantly improving prior work. In addition, we demonstrate via extensive experiments that the algorithm is practical, easy to implement and performs well in practice.
Published: 2019
Full Text: View/download PDF

29. RDMA Performance Isolation With Justitia

Author: Zhang, Yiwen, Tan, Yue, Stephens, Brent, and Chowdhury, Mosharaf
Subjects: Computer Science - Networking and Internet Architecture
Abstract: Despite its increasing popularity, most of RDMA's benefits such as ultra-low latency can be achieved only when running an application in isolation. Using microbenchmarks and real open-source RDMA applications, we identify a series of performance anomalies when multiple applications coexist and show that such anomalies are pervasive across InfiniBand, RoCEv2, and iWARP. They arise due to a fundamental tradeoff between performance isolation and work conservation, which the state-of-the-art RDMA congestion control protocols such as DCQCN cannot resolve. We present Justitia to address these performance anomalies. Justitia is a software-only, host-based, and easy-to-deploy solution that maximizes RNIC utilization while guaranteeing performance isolation via shaping, rate limiting, and pacing at senders. Our evaluation of Justitia on multiple RDMA implementations shows that Justitia effectively isolates different types of traffic and significantly improves latency (by up to 56.9x) and throughput (by up to 9.7x) of real-world RDMA-based applications without compromising low CPU usage or modifying the applications.
Published: 2019

30. Terra: Scalable Cross-Layer GDA Optimizations

Author: You, Jie and Chowdhury, Mosharaf
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Geo-distributed analytics (GDA) frameworks transfer large datasets over the wide-area network (WAN). Yet existing frameworks often ignore the WAN topology. This disconnect between WAN-bound applications and the WAN itself results in missed opportunities for cross-layer optimizations. In this paper, we present Terra to bridge this gap. Instead of decoupled WAN routing and GDA transfer scheduling, Terra applies scalable cross-layer optimizations to minimize WAN transfer times for GDA jobs. We present a two-pronged approach: (i) a scalable algorithm for joint routing and scheduling to make fast decisions; and (ii) a scalable, overlay-based enforcement mechanism that avoids expensive switch rule updates in the WAN. Together, they enable Terra to quickly react to WAN uncertainties such as large bandwidth fluctuations and failures in an application-aware manner as well. Integration with the FloodLight SDN controller and Apache YARN, and evaluation on 4 workloads and 3 WAN topologies show that Terra improves the average completion times of GDA jobs by 1.55x-3.43x. GDA jobs running with Terra meet 2.82x-4.29x more deadlines and can quickly react to WAN-level events in an application-aware manner.
Published: 2019

31. Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications

Author: Yu, Peifeng and Chowdhury, Mosharaf
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: GPU computing is becoming increasingly popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time sharing and preemption is expensive. Worse, when a DL application cannot completely use a GPU's resources, the GPU cannot be efficiently shared between multiple applications, leading to GPU underutilization. We present Salus to enable two GPU sharing primitives: fast job switching and memory sharing, in order to achieve fine-grained GPU sharing among multiple DL applications. Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated memory management issues. We show that these primitives can then be used to implement flexible sharing policies such as fairness, prioritization, and packing for various use cases. Our integration of Salus with TensorFlow and evaluation on popular DL jobs show that Salus can improve the average completion time of DL training jobs by $3.19\times$, GPU utilization for hyper-parameter tuning by $2.38\times$, and GPU utilization of DL inference applications by $42\times$ over not sharing the GPU and $7\times$ over NVIDIA MPS with small overhead.
Published: 2019

32. Fast and Accurate Performance Analysis of LTE Radio Access Networks

Author: Iyer, Anand Padmanabha, Stoica, Ion, Chowdhury, Mosharaf, and Li, Li Erran
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Learning, Computer Science - Networking and Internet Architecture
Abstract: An increasing amount of analytics is performed on data that is procured in a real-time fashion to make real-time decisions. Such tasks range from simple reporting on streams to sophisticated model building. However, the practicality of such analyses is impeded in several domains because they are faced with a fundamental trade-off between data collection latency and analysis accuracy. In this paper, we study this trade-off in the context of a specific domain, Cellular Radio Access Networks (RAN). Our choice of this domain is influenced by its commonalities with several other domains that produce real-time data, our access to a large live dataset, and their real-time nature and dimensionality which makes it a natural fit for a popular analysis technique, machine learning (ML). We find that the latency-accuracy trade-off can be resolved using two broad, general techniques: intelligent data grouping and task formulations that leverage domain characteristics. Based on this, we present CellScope, a system that addresses this challenge by applying a domain specific formulation and application of Multi-task Learning (MTL) to RAN performance analysis. It achieves this goal using three techniques: feature engineering to transform raw data into effective features, a PCA inspired similarity metric to group data from geographically nearby base stations sharing performance commonalities, and a hybrid online-offline model for efficient model updates. Our evaluation of CellScope shows that its accuracy improvements over direct application of ML range from 2.5x to 4.4x while reducing the model update overhead by up to 4.8x. We have also used CellScope to analyze a live LTE consisting of over 2 million subscribers for a period of over 10 months, where it uncovered several problems and insights, some of them previously unknown.
Published: 2016

33. Pyxis: Scheduling Mixed Tasks in Disaggregated Datacenters

Author: Qi, Sheng, Jin, Chao, Chowdhury, Mosharaf, Liu, Zhenming, Liu, Xuanzhe, and Jin, Xin
Abstract: Disaggregating compute from storage is an emerging trend in cloud computing. Effectively utilizing resources in both compute and storage pool is the key to high performance. The state-of-the-art scheduler provides optimal scheduling decisions for workloads with homogeneous tasks. However, cloud applications often generate a mix of tasks with diverse compute and IO characteristics, resulting in sub-optimal performance for existing solutions. We present Pyxis, a system that provides optimal scheduling decisions for mixed workloads in disaggregated datacenters with theoretical guarantees. Pyxis is capable of maximizing overall throughput while meeting latency SLOs. Pyxis decouples the scheduling of different tasks. Our insight is that the optimal solution has an “all-or-nothing” structure that can be captured by a single turning point in the spectrum of tasks. Based on task characteristics, the turning point partitions the tasks either all to storage nodes or all to compute nodes (none to storage nodes). We theoretically prove that the optimal solution has such a structure, and design an online algorithm with sub-second convergence. We implement a prototype of Pyxis. Experiments on CloudLab with various synthetic and application workloads show that Pyxis improves the throughput by 3–21× over the state-of-the-art solution.
Published: 2024
Full Text: View/download PDF

34. Toward Cross-Layer Energy Optimizations in Machine Learning Systems

Author: Chung, Jae-Won and Chowdhury, Mosharaf
Abstract: The enormous energy consumption of machine learning (ML) and generative AI workloads shows no sign of waning, taking a toll on operating costs, power delivery, and environmental sustainability. Despite a long line of research on energy-efficient hardware, we found that software plays a critical role in ML energy optimization through two recent works: Zeus and Perseus. This is especially true for large language models (LLMs) because their model sizes and, therefore, energy demands are growing faster than hardware efficiency improvements. Therefore, we advocate for a cross-layer approach for energy optimizations in ML systems, where hardware provides architectural support that pushes energy-efficient software further, while software leverages and abstracts the hardware to develop techniques that bring hardware-agnostic energy-efficiency gains.
Published: 2024

35. Flamingo: A User-Centric System for Fast and Energy-Efficient DNN Training on Smartphones

Author: Singapuram, Sanjay Sri Vallabh, primary, Hu, Chuheng, additional, Lai, Fan, additional, Zhang, Chengsong, additional, and Chowdhury, Mosharaf, additional
Published: 2023
Full Text: View/download PDF

36. Simplifying Cloud Management with Cloudless Computing

Author: Qiu, Yiming, primary, Kon, Patrick Tser Jern, additional, Xing, Jiarong, additional, Huang, Yibo, additional, Liu, Hongyi, additional, Wang, Xinyu, additional, Huang, Peng, additional, Chowdhury, Mosharaf, additional, and Chen, Ang, additional
Published: 2023
Full Text: View/download PDF

37. Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

Author: Jang, Insu, primary, Yang, Zhenning, additional, Zhang, Zhen, additional, Jin, Xin, additional, and Chowdhury, Mosharaf, additional
Published: 2023
Full Text: View/download PDF

38. Treehouse: A Case For Carbon-Aware Datacenter Software

Author: Anderson, Thomas, primary, Belay, Adam, additional, Chowdhury, Mosharaf, additional, Cidon, Asaf, additional, and Zhang, Irene, additional
Published: 2023
Full Text: View/download PDF

39. Memory Disaggregation: Advances and Open Challenges

Author: Al Maruf, Hasan, primary and Chowdhury, Mosharaf, additional
Published: 2023
Full Text: View/download PDF

40. Memtrade: Marketplace for Disaggregated Memory Clouds

Author: Maruf, Hasan Al, primary, Zhong, Yuhong, additional, Wang, Hongyi, additional, Chowdhury, Mosharaf, additional, Cidon, Asaf, additional, and Waldspurger, Carl, additional
Published: 2023
Full Text: View/download PDF

41. Egeria: Efficient DNN Training with Knowledge-Guided Layer Freezing

Author: Wang, Yiding, primary, Sun, Decang, additional, Chen, Kai, additional, Lai, Fan, additional, and Chowdhury, Mosharaf, additional
Published: 2023
Full Text: View/download PDF

42. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory

Author: Maruf, Hasan Al, primary, Wang, Hao, additional, Dhanotia, Abhishek, additional, Weiner, Johannes, additional, Agarwal, Niket, additional, Bhattacharya, Pallab, additional, Petersen, Chris, additional, Chowdhury, Mosharaf, additional, Kanaujia, Shobhit, additional, and Chauhan, Prakash, additional
Published: 2023
Full Text: View/download PDF

43. Egeria: Efficient DNN Training with Knowledge-Guided Layer Freezing

Author: Wang, Yiding, Sun, Decang, Chen, Kai, Lai, Fan, and Chowdhury, Mosharaf
Abstract: Training deep neural networks (DNNs) is time-consuming. While most existing solutions try to overlap/schedule computation and communication for efficient training, this paper goes one step further by skipping computing and communication through DNN layer freezing. Our key insight is that the training progress of internal DNN layers differs significantly, and front layers often become well-trained much earlier than deep layers. To explore this, we first introduce the notion of training plasticity to quantify the training progress of internal DNN layers. Then we design Egeria, a knowledge-guided DNN training system that employs semantic knowledge from a reference model to accurately evaluate individual layers' training plasticity and safely freeze the converged ones, saving their corresponding backward computation and communication. Our reference model is generated on the fly using quantization techniques and runs forward operations asynchronously on available CPUs to minimize the overhead. In addition, Egeria caches the intermediate outputs of the frozen layers with prefetching to further skip the forward computation. Our implementation and testbed experiments with popular vision and language models show that Egeria achieves 19%-43% training speedup w.r.t. the state-of-the-art without sacrificing accuracy.
Published: 2023

44. Perseus: Removing Energy Bloat from Large Model Training

Author: Chung, Jae-Won, Gu, Yile, Jang, Insu, Meng, Luoxi, Bansal, Nikhil, and Chowdhury, Mosharaf
Abstract: Training large AI models on numerous GPUs consumes a massive amount of energy. We observe that not all energy consumed during training directly contributes to end-to-end training throughput, and a significant portion can be removed without slowing down training, which we call energy bloat. In this work, we identify two independent sources of energy bloat in large model training, intrinsic and extrinsic, and propose Perseus, a unified optimization framework that mitigates both. Perseus obtains the "iteration time-energy" Pareto frontier of any large model training job using an efficient iterative graph cut-based algorithm and schedules energy consumption of its forward and backward computations across time to remove intrinsic and extrinsic energy bloat. Evaluation on large models like GPT-3 and Bloom shows that Perseus reduces energy consumption of large model training by up to 30%, enabling savings otherwise unobtainable before., Comment: Open-source at https://ml.energy/zeus/perseus
Published: 2023

45. Fed-ensemble: Ensemble Models in Federated Learning for Improved Generalization and Uncertainty Quantification

Author: Shi, Naichen, Lai, Fan, Al Kontar, Raed, and Chowdhury, Mosharaf
Abstract: The increase in the computational power of edge devices has opened up the possibility of processing some of the data at the edge and distributing model learning. This paradigm is often called federated learning (FL), where edge devices exploit their local computational resources to train models collaboratively. Though FL has seen recent success, it is unclear how to characterize uncertainties in FL predictions. In this paper, we propose Fed-ensemble: a simple approach that brings model ensembling to FL. Instead of aggregating local models to update a single global model, Fed-ensemble uses random permutations to update a group of $K$ models and then obtains predictions through model averaging. Fed-ensemble can be readily utilized within established FL methods and does not impose a computational overhead compared with single-model methods. Empirical results show that our model has superior performance over several FL algorithms on a wide range of data sets and excels in heterogeneous settings often encountered in FL applications. Also, by carefully choosing client-dependent weights in the inference stage, Fed-ensemble becomes personalized and yields even better performance. Theoretically, we show that predictions on new data from all $K$ models belong to the same predictive posterior distribution under a neural tangent kernel regime. This result, in turn, sheds light on the generalization advantages of model averaging and justifies the uncertainty quantification capability. We also illustrate that Fed-ensemble has an elegant Bayesian interpretation. Note to Practitioners: Fed-ensemble provides an algorithm that extracts a set of $K$ solutions without imposing any additional communication overhead in FL. Given multiple solutions, Fed-ensemble can be exploited to personalize inference as well as quantify uncertainty. Such capabilities may be beneficial within multiple practical systems that require uncertainty-aware decision-making. Further, Fed-ensemble may be useful for model validation and hypothesis testing.
Published: 2024
Full Text: View/download PDF

46. Fed-ensemble: Ensemble Models in Federated Learning for Improved Generalization and Uncertainty Quantification

Author: Shi, Naichen, primary, Lai, Fan, additional, Kontar, Raed Al, additional, and Chowdhury, Mosharaf, additional
Published: 2023
Full Text: View/download PDF

47. Packing Privacy Budget Efficiently

Author: Tholoniat, Pierre, Kostopoulou, Kelly, Chowdhury, Mosharaf, Cidon, Asaf, Geambasu, Roxana, Lécuyer, Mathias, and Yang, Junfeng
Subjects: Computer Science - Machine Learning, Computer Science - Cryptography and Security
Abstract: Machine learning (ML) models can leak information about users, and differential privacy (DP) provides a rigorous way to bound that leakage under a given budget. This DP budget can be regarded as a new type of compute resource in workloads of multiple ML models training on user data. Once it is used, the DP budget is forever consumed. Therefore, it is crucial to allocate it most efficiently to train as many models as possible. This paper presents the scheduler for privacy that optimizes for efficiency. We formulate privacy scheduling as a new type of multidimensional knapsack problem, called privacy knapsack, which maximizes DP budget efficiency. We show that privacy knapsack is NP-hard, hence practical algorithms are necessarily approximate. We develop an approximation algorithm for privacy knapsack, DPK, and evaluate it on microbenchmarks and on a new, synthetic private-ML workload we developed from the Alibaba ML cluster trace. We show that DPK: (1) often approaches the efficiency-optimal schedule, (2) consistently schedules more tasks compared to a state-of-the-art privacy scheduling algorithm that focused on fairness (1.3-1.7x in Alibaba, 1.0-2.6x in microbenchmarks), but (3) sacrifices some level of fairness for efficiency. Therefore, using DPK, DP ML operators should be able to train more models on the same amount of user data while offering the same privacy guarantee to their users.
Published: 2022

48. Auxo: Heterogeneity-Mitigating Federated Learning via Scalable Client Clustering

Author: Liu, Jiachen, Lai, Fan, Dai, Yinwei, Akella, Aditya, Madhyastha, Harsha, and Chowdhury, Mosharaf
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Federated learning (FL) is an emerging machine learning (ML) paradigm that enables heterogeneous edge devices to collaboratively train ML models without revealing their raw data to a logically centralized server. Heterogeneity across participants is a fundamental challenge in FL, both in terms of non-independent and identically distributed (Non-IID) data distributions and variations in device capabilities. Many existing works present point solutions to address issues like slow convergence, low final accuracy, and bias in FL, all stemming from the client heterogeneity. We observe that, in a large population, there exist groups of clients with statistically similar data distributions (cohorts). In this paper, we propose Auxo to gradually identify cohorts among large-scale, low-participation, and resource-constrained FL populations. Auxo then adaptively determines how to train cohort-specific models in order to achieve better model performance and ensure resource efficiency. By identifying cohorts with smaller heterogeneity and performing efficient cohort-based training, our extensive evaluations show that Auxo substantially boosts the state-of-the-art solutions in terms of final accuracy, convergence time, and model bias., Comment: 18 pages
Published: 2022

49. Topology-Awareness and Reoptimization Mechanism for Virtual Network Embedding

Author: Farooq Butt, Nabeel, Chowdhury, Mosharaf, Boutaba, Raouf, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Crovella, Mark, editor, Feeney, Laura Marie, editor, Rubenstein, Dan, editor, and Raghavan, S. V., editor
Published: 2010
Full Text: View/download PDF

50. Aequitas

Author: Zhang, Yiwen, primary, Kumar, Gautam, additional, Dukkipati, Nandita, additional, Wu, Xian, additional, Jha, Priyaranjan, additional, Chowdhury, Mosharaf, additional, and Vahdat, Amin, additional
Published: 2022
Full Text: View/download PDF
