Search Results
1,338 results for "Krishnamurthy, Arvind"
2. Lessons for Policy from Research
- Author
- Krishnamurthy, Arvind
- Published
- 2023
3. Corporate Debt Overhang and Credit Policy
- Author
- Brunnermeier, Markus and Krishnamurthy, Arvind
- Published
- 2021
- Full Text
4. NanoFlow: Towards Optimal Large Language Model Serving Throughput
- Author
- Zhu, Kan, Zhao, Yilong, Zhao, Liangyu, Zuo, Gefei, Gu, Yile, Xie, Dedong, Gao, Yufei, Xu, Qinyu, Tang, Tian, Ye, Zihao, Kamahori, Keisuke, Lin, Chien-Yu, Wang, Stephanie, Krishnamurthy, Arvind, and Kasikci, Baris
- Subjects
- Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems' performance. To boost throughput, various methods of inter-device parallelism (e.g., data, tensor, pipeline) have been explored. However, existing methods do not consider overlapping the utilization of different resources within a single device, leading to underutilization and sub-optimal performance. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of resources including compute, memory, and network within a single device through operation co-scheduling. To exploit intra-device parallelism, NanoFlow introduces two key innovations: first, NanoFlow splits requests into nano-batches at the granularity of operations, which breaks the dependency of sequential operations in LLM inference and enables overlapping; then, to benefit from overlapping, NanoFlow uses an operation-level pipeline with execution unit scheduling, which partitions the device's functional units and simultaneously executes different operations in each unit. NanoFlow automates the pipeline setup using a parameter search algorithm, which enables easy porting of NanoFlow to different models. We implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B. With practical workloads, NanoFlow provides a 1.91x throughput boost compared to state-of-the-art serving systems, achieving 59% to 72% of optimal throughput across ported models.
- Published
- 2024
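The nano-batch overlap described in the abstract can be illustrated with a toy cost model (the per-phase costs and the two-unit compute/memory split are illustrative assumptions, not NanoFlow's actual scheduler):

```python
def sequential_makespan(nano_batches):
    # Without overlap: each operation uses the whole device, so the compute
    # and memory phases of every nano-batch run back to back.
    return sum(a + b for a, b in nano_batches)

def overlapped_makespan(nano_batches):
    # Two functional units (compute, memory) run concurrently; nano-batch i's
    # memory phase can overlap nano-batch i+1's compute phase.
    t_compute = t_memory = 0.0
    for a, b in nano_batches:
        t_compute += a                            # compute unit busy for a
        t_memory = max(t_memory, t_compute) + b   # memory phase waits for its compute
    return t_memory

# Hypothetical (compute, memory) costs per nano-batch
batch = [(1.0, 1.0)] * 8
assert sequential_makespan(batch) == 16.0
assert overlapped_makespan(batch) == 9.0   # pipelining hides most memory time
```

With eight equal nano-batches the pipelined makespan drops from 16 time units to 9, which is the kind of intra-device gain the abstract attributes to operation co-scheduling.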
5. Comments and Discussion
- Author
- Eggertsson, Gauti B. and Krishnamurthy, Arvind
- Published
- 2019
- Full Text
6. An Architecture For Edge Networking Services
- Author
- Brown, Lloyd, Marx, Emily, Bali, Dev, Amaro, Emmanuel, Sur, Debnil, Kissel, Ezra, Monga, Inder, Katz-Bassett, Ethan, Krishnamurthy, Arvind, McCauley, James, Narechania, Tejas, Panda, Aurojit, and Shenker, Scott
- Subjects
- Information and Computing Sciences; Cybersecurity and Privacy
- Published
- 2024
7. Comments and Discussion
- Author
- Krishnamurthy, Arvind
- Published
- 2018
- Full Text
8. Relational Network Verification
- Author
- Xu, Xieyang, Yuan, Yifei, Kincaid, Zachary, Krishnamurthy, Arvind, Mahajan, Ratul, Walker, David, and Zhai, Ennan
- Subjects
- Computer Science - Networking and Internet Architecture
- Abstract
Relational network verification is a new approach to validating network changes. In contrast to traditional network verification, which analyzes specifications for a single network snapshot, relational network verification analyzes specifications concerning two network snapshots (e.g., pre- and post-change snapshots) and captures their similarities and differences. Relational change specifications are compact and precise because they specify the flows or paths that change between snapshots and then simply mandate that other behaviors of the network "stay the same", without enumerating them. To achieve similar guarantees, single-snapshot specifications need to enumerate all flow and path behaviors that are not expected to change, so we can check that nothing has accidentally changed. Thus, precise single-snapshot specifications are proportional to network size, which makes them impractical to generate for many real-world networks. To demonstrate the value of relational reasoning, we develop a high-level relational specification language and a tool called Rela to validate network changes. Rela first compiles input specifications and network snapshot representations to finite state transducers. It then checks compliance using decision procedures for automaton equivalence. Our experiments using data on complex changes to a global backbone (with over 10^3 routers) find that 93% of the Rela specifications need fewer than 10 terms, and that Rela validates 80% of the changes within 20 minutes.
- Published
- 2024
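The "changed flows plus everything else stays the same" idea behind relational change specifications can be sketched in a few lines (the flows, paths, and router names here are hypothetical, and Rela itself works on finite state transducers rather than literal path maps):

```python
def check_change(pre, post, changed):
    """pre/post: flow -> forwarding path; changed: flows allowed to change,
    mapped to their expected new path. Returns a list of violations."""
    violations = []
    for flow in set(pre) | set(post):
        if flow in changed:
            if post.get(flow) != changed[flow]:
                violations.append((flow, "changed flow has unexpected path"))
        elif pre.get(flow) != post.get(flow):
            violations.append((flow, "unchanged flow drifted"))
    return violations

pre  = {"10.0.0.0/24": ("r1", "r2", "r3"), "10.0.1.0/24": ("r1", "r4")}
post = {"10.0.0.0/24": ("r1", "r5", "r3"), "10.0.1.0/24": ("r1", "r4")}
spec = {"10.0.0.0/24": ("r1", "r5", "r3")}   # only this flow may change
assert check_change(pre, post, spec) == []
```

Note the specification names only the one rerouted flow; the "stay the same" mandate covers the rest implicitly, which is why relational specs stay small while single-snapshot specs grow with the network.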
9. Laconic: Streamlined Load Balancers for SmartNICs
- Author
- Cui, Tianyi, Zhao, Chenxingyu, Zhang, Wei, Zhang, Kaiyuan, and Krishnamurthy, Arvind
- Subjects
- Computer Science - Networking and Internet Architecture
- Abstract
Load balancers are pervasively used inside today's clouds to scalably distribute network requests across data center servers. Given the extensive use of load balancers and their associated operating costs, several efforts have focused on improving their efficiency by implementing Layer-4 load-balancing logic within the kernel or using hardware acceleration. This work explores whether the more complex and connection-oriented Layer-7 load-balancing capability can also benefit from hardware acceleration. In particular, we target the offloading of load-balancing capability onto programmable SmartNICs. We fully leverage the cost and energy efficiency of SmartNICs using three key ideas. First, we argue that a full and complex TCP/IP stack is not required for Layer-7 load balancers and instead propose a lightweight forwarding agent on the SmartNIC. Second, we develop connection management data structures with a high degree of concurrency with minimal synchronization when executed on multi-core SmartNICs. Finally, we describe how the load-balancing logic could be accelerated using custom packet-processing accelerators on SmartNICs. We prototype Laconic on two types of SmartNIC hardware, achieving over 150 Gbps throughput using all cores on BlueField-2, while a single SmartNIC core achieves 8.7x higher throughput and comparable latency to Nginx on a single x86 core.
- Published
- 2024
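One way to get connection management "with minimal synchronization" on a multi-core NIC, as the abstract describes, is to shard the connection table so each core owns a disjoint slice. This is only an illustrative sketch of that pattern (the core count and tuple format are assumptions, not Laconic's actual data structures):

```python
import hashlib

NUM_CORES = 4
tables = [dict() for _ in range(NUM_CORES)]  # one private table per core

def owner_core(src, dst, sport, dport):
    # Hash the connection 4-tuple so every packet of a connection lands on
    # the same core's private table; no cross-core locks are needed.
    key = f"{src}:{sport}->{dst}:{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_CORES

def track(src, dst, sport, dport, backend):
    core = owner_core(src, dst, sport, dport)
    tables[core][(src, dst, sport, dport)] = backend
    return core
```

Because the hash is deterministic, lookup and insertion for a given connection always happen on the same core, which is the property that lets per-core state avoid locking.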
10. ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics
- Author
- Zhao, Liangyu, Maleki, Saeed, Yang, Ziyue, Pourreza, Hossein, Shah, Aashaka, Hwang, Changho, and Krishnamurthy, Arvind
- Subjects
- Computer Science - Networking and Internet Architecture; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Machine Learning
- Abstract
As modern DNN models grow ever larger, collective communications between the accelerators (allreduce, etc.) emerge as a significant performance bottleneck. Designing efficient communication schedules is challenging given today's highly diverse and heterogeneous network fabrics. In this paper, we present ForestColl, a tool that generates efficient schedules for any network topology. ForestColl constructs broadcast/aggregation spanning trees as the communication schedule, achieving theoretically minimum network congestion. Its schedule generation runs in strongly polynomial time and is highly scalable. ForestColl supports any network fabrics, including both switching fabrics and direct connections, as well as any network graph structure. We evaluated ForestColl on multi-cluster AMD MI250 and NVIDIA A100 platforms. ForestColl's schedules achieved up to 52% higher performance compared to the vendors' own optimized communication libraries, RCCL and NCCL. ForestColl also outperforms other state-of-the-art schedule generation techniques, with both up to 61% more efficient generated schedules and orders of magnitude faster schedule generation speed. Comment: arXiv admin note: text overlap with arXiv:2305.18461
- Published
- 2024
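A broadcast spanning tree is the basic building block the abstract refers to. The sketch below builds the simplest possible one (BFS) over a hypothetical 2x2 accelerator mesh; ForestColl's contribution is constructing trees that provably minimize congestion, which this toy construction does not attempt:

```python
from collections import deque

def broadcast_tree(adj, root):
    # BFS spanning tree: parent map giving one valid broadcast schedule
    # (each node receives the data from its parent exactly once).
    parent = {root: None}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    return parent

# Hypothetical 2x2 mesh of accelerators
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
tree = broadcast_tree(adj, 0)
assert set(tree) == {0, 1, 2, 3}                      # spans every node
assert sum(p is not None for p in tree.values()) == 3  # n-1 tree edges
```

Aggregation (reduce) uses the same tree with data flowing toward the root instead of away from it.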
11. Suprahilar and Retrocrural Domains in RPLND for NSGCT Testis—Going Beyond Where the Light Touches!
- Author
- Venkatesh, Shrinivas, Phillips, Malar Raj, Krishnamurthy, Shalini Shree, Suresh, Krishna, Malik, Kanuj, Ramakrishnan, Ayaloor Seshadri, Krishnamurthy, Arvind, Ellusamy, Hemanth Raj, and Raja, Anand
- Published
- 2024
- Full Text
12. Bringing Reconfigurability to the Network Stack
- Author
- Narayan, Akshay, Panda, Aurojit, Alizadeh, Mohammad, Balakrishnan, Hari, Krishnamurthy, Arvind, and Shenker, Scott
- Subjects
- Computer Science - Networking and Internet Architecture
- Abstract
Reconfiguring the network stack allows applications to specialize the implementations of communication libraries depending on where they run, the requests they serve, and the performance they need to provide. Specializing applications in this way is challenging because developers need to choose the libraries they use when writing a program and cannot easily change them at runtime. This paper introduces Bertha, which allows these choices to be changed at runtime without limiting developer flexibility in the choice of network and communication functions. Bertha allows applications to safely use optimized communication primitives (including ones with deployment limitations) without limiting deployability. Our evaluation shows cases where this results in 16x higher throughput and 63% lower latency than current portable approaches while imposing minimal overheads when compared to hand-optimized versions that use deployment-specific communication primitives. Comment: 12 pages, 10 figures
- Published
- 2023
13. Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
- Author
- Zhao, Yilong, Lin, Chien-Yu, Zhu, Kan, Ye, Zihao, Chen, Lequn, Zheng, Size, Ceze, Luis, Krishnamurthy, Arvind, Chen, Tianqi, and Kasikci, Baris
- Subjects
- Computer Science - Machine Learning
- Abstract
The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization in the serving context. Atom improves end-to-end throughput (token/s) by up to 7.7x compared to FP16 and by 2.5x compared to INT8 quantization, while maintaining the same latency target.
- Published
- 2023
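The fine-grained quantization the abstract mentions can be illustrated with per-group symmetric int4 quantization, where each small group of values shares one floating-point scale. This is a generic sketch of the technique, not Atom's mixed-precision scheme:

```python
def quantize_group(vals, bits=4):
    # Symmetric per-group quantization: one fp scale per group of values.
    qmax = 2 ** (bits - 1) - 1                       # 7 for int4
    scale = max(abs(v) for v in vals) / qmax or 1.0  # fall back if all zero
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in vals]
    return q, scale

def dequantize_group(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.09, 0.28]
q, s = quantize_group(weights)
restored = dequantize_group(q, s)
err = max(abs(a - b) for a, b in zip(weights, restored))
assert err <= s / 2 + 1e-9   # rounding error bounded by half a quantization step
```

Smaller groups give tighter scales (lower error) at the cost of more scale metadata, which is the accuracy/footprint trade-off fine-grained quantization navigates.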
14. Punica: Multi-Tenant LoRA Serving
- Author
- Chen, Lequn, Ye, Zihao, Wu, Yongji, Zhuo, Danyang, Ceze, Luis, and Krishnamurthy, Arvind
- Subjects
- Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Machine Learning
- Abstract
Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token. Punica is open source at https://github.com/punica-ai/punica .
- Published
- 2023
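The key observation behind serving many LoRA models over one base model is that each request's output is the shared base computation plus a small per-adapter low-rank correction: y = Wx + B_i A_i x. A minimal sketch with toy 2x2 matrices (tenant names and values are hypothetical; Punica implements this as a batched CUDA kernel):

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def vadd(a, b):
    return [u + v for u, v in zip(a, b)]

# One shared base weight W, plus a rank-1 delta (B_i @ A_i) per tenant.
W = [[1.0, 0.0], [0.0, 1.0]]                      # shared 2x2 base, held once
adapters = {
    "tenant_a": ([[0.5], [0.0]], [[1.0, 1.0]]),   # B (2x1), A (1x2)
    "tenant_b": ([[0.0], [0.5]], [[1.0, -1.0]]),
}

def lora_forward(x, tenant):
    B, A = adapters[tenant]
    return vadd(matvec(W, x), matvec(B, matvec(A, x)))

# Requests for different adapters share the single copy of W.
batch = [([1.0, 2.0], "tenant_a"), ([1.0, 2.0], "tenant_b")]
outs = [lora_forward(x, t) for x, t in batch]
assert outs == [[2.5, 2.0], [1.0, 1.5]]
```

Because only the tiny B_i and A_i differ per tenant, the GPU keeps a single copy of W in memory regardless of how many adapters are active.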
15. The Donor Went Down to Georgia: Out-of-District Donations and Rivalrous Representation
- Author
- Nathan, Charles, Krishnamurthy, Arvind, Bram, Curtis, and Todd, Jason Douglas
- Published
- 2024
- Full Text
16. Efficient Credit Policies in a Housing Debt Crisis
- Author
- Eberly, Janice and Krishnamurthy, Arvind
- Published
- 2015
- Full Text
17. Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies
- Author
- Basu, Prithwish, Zhao, Liangyu, Fantl, Jason, Pal, Siddharth, Krishnamurthy, Arvind, and Khoury, Joud
- Subjects
- Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Networking and Internet Architecture
- Abstract
The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimize the performance of all-to-all collective communications on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient and bandwidth-optimal all-to-all schedules for any topology and lowering the schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance. Comment: HPDC '24
- Published
- 2023
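The simplest contention-free all-to-all schedule, the classic linear shift, conveys what a "schedule" means here: in step k, every node i sends its chunk destined for node (i+k) mod n, so each node sends and receives exactly one chunk per step. (This is a textbook baseline for illustration; the paper derives bandwidth-optimal schedules for arbitrary direct-connect topologies.)

```python
def linear_shift_schedule(n):
    # Step k (1..n-1): node i sends its chunk for node (i + k) % n.
    return [[(i, (i + k) % n) for i in range(n)] for k in range(1, n)]

n = 6
received = {j: {j} for j in range(n)}   # each node starts with its own chunk
for step in linear_shift_schedule(n):
    # Contention-free: every node is a sender once and a receiver once.
    assert len({s for s, _ in step}) == n and len({r for _, r in step}) == n
    for s, r in step:
        received[r].add(s)
assert all(received[j] == set(range(n)) for j in range(n))
```

The check at the end confirms all n(n-1) pairwise transfers complete in n-1 steps with no link oversubscribed in this fully-connected model.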
18. Quark: A High-Performance Secure Container Runtime for Serverless Computing
- Author
- Zhao, Chenxingyu, Sun, Yulin, Xiong, Ying, and Krishnamurthy, Arvind
- Subjects
- Computer Science - Networking and Internet Architecture
- Abstract
Secure container runtimes serve as the foundational layer for creating and running containers, which is the bedrock of emerging computing paradigms like microservices and serverless computing. Although existing secure container runtimes indeed enhance security via running containers over a guest kernel and a Virtual Machine Monitor (VMM or Hypervisor), they incur performance penalties in critical areas such as networking, container startup, and I/O system calls. In our practice of operating microservices and serverless computing, we build a high-performance secure container runtime named Quark. Unlike existing solutions that rely on traditional VM technologies by importing Linux for the guest kernel and QEMU for the VMM, we take a different approach to building Quark from the ground up, paving the way for extreme customization to unlock high performance. Our development centers on co-designing a custom guest kernel and a VMM for secure containers. To this end, we build a lightweight guest OS kernel named QKernel and a specialized VMM named QVisor. The QKernel-QVisor codesign allows us to deliver three key advancements: high-performance RDMA-based container networking, fast container startup mode, and efficient mechanisms for executing I/O syscalls. In our practice with real-world apps like Redis, Quark cuts down P95 latency by 79.3% and increases throughput by 2.43x compared to Kata. Moreover, Quark container startup achieves 96.5% lower latency than the cold-start mode while saving 81.3% of the memory cost relative to the keep-warm mode. Quark is open-source with an industry-standard codebase in Rust. Comment: arXiv admin note: text overlap with arXiv:2305.10621. The paper on arXiv:2305.10621 presents a detailed version of the TSoR module in Quark
- Published
- 2023
19. Symphony: Optimized DNN Model Serving using Deferred Batch Scheduling
- Author
- Chen, Lequn, Deng, Weixin, Canumalla, Anirudh, Xin, Yu, Zhuo, Danyang, Philipose, Matthai, and Krishnamurthy, Arvind
- Subjects
- Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Machine Learning
- Abstract
Having large batch sizes is one of the most critical aspects of increasing accelerator efficiency and the performance of DNN model inference. However, existing model serving systems cannot achieve adequate batch sizes while meeting latency objectives, as these systems eagerly dispatch requests to accelerators to minimize accelerator idle time. We propose Symphony, a DNN serving system that explores deferred batch scheduling to optimize system efficiency and throughput. Further, unlike other prior systems, Symphony's GPU usage is load-proportional: it consolidates workloads on the appropriate number of GPUs and works smoothly with cluster auto-scaling tools. Symphony consists of two core design points. First, Symphony defines a schedulable window in which a batch of inference requests can be dispatched. This window is computed in order to improve accelerator efficiency while meeting the request's SLO. Second, Symphony implements a scalable, low-latency, fine-grained coordination scheme across accelerators to dispatch and execute requests in the schedulable window. Through extensive scheduler-only benchmarks, we demonstrate that Symphony can schedule millions of requests per second and coordinate thousands of GPUs while also enabling robust autoscaling that adapts to workload changes. Symphony outperforms prior systems by achieving 5x higher goodput when given the same number of GPUs and a 60% reduction in GPUs when given the same workload.
- Published
- 2023
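The "schedulable window" idea can be sketched with a toy latency model: a request can be held until its deadline minus the batch's execution time, and holding it lets the batch grow and amortize fixed launch cost. The cost model below is an assumption for illustration, not Symphony's profiled model:

```python
def exec_time(batch_size, per_item=1.0, fixed=4.0):
    # Hypothetical accelerator model: fixed launch cost amortized over the batch.
    return fixed + per_item * batch_size

def latest_dispatch(arrival, slo, batch_size):
    # A request arriving at `arrival` with latency SLO `slo` can wait until
    # deadline - exec_time(batch) and still finish on time.
    return arrival + slo - exec_time(batch_size)

# Eager dispatch serves the first request alone; deferred dispatch waits
# inside the schedulable window and serves four requests together.
eager_cost_per_req = exec_time(1) / 1          # 5.0 time units per request
deferred_cost_per_req = exec_time(4) / 4       # 2.0 time units per request
assert deferred_cost_per_req < eager_cost_per_req
assert latest_dispatch(0.0, slo=20.0, batch_size=4) == 12.0
```

Deferring is profitable exactly when the window stays open long enough for more requests to arrive, which is why the window must be computed against each request's SLO.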
20. Bandwidth Optimal Pipeline Schedule for Collective Communication
- Author
- Zhao, Liangyu and Krishnamurthy, Arvind
- Subjects
- Computer Science - Networking and Internet Architecture; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Discrete Mathematics; Computer Science - Machine Learning
- Abstract
We present a strongly polynomial-time algorithm to generate bandwidth optimal allgather/reduce-scatter on any network topology, with or without switches. Our algorithm constructs pipeline schedules achieving provably the best possible bandwidth performance on a given topology. To provide a universal solution, we model the network topology as a directed graph with heterogeneous link capacities and switches directly as vertices in the graph representation. The algorithm is strongly polynomial-time with respect to the topology size. This work heavily relies on previous graph theory work on edge-disjoint spanning trees and edge splitting. While we focus on allgather, the methods in this paper can be easily extended to generate schedules for reduce, broadcast, reduce-scatter, and allreduce.
- Published
- 2023
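For intuition about pipelined allgather schedules, the classic ring allgather is the simplest bandwidth-optimal case on a ring topology: each node forwards the chunk it received in the previous step, so all n chunks reach all n nodes in n-1 steps. (The paper generalizes this to arbitrary topologies with switches; this sketch covers only the uniform ring.)

```python
def ring_allgather(n):
    # Each node starts with chunk i; in each of n-1 steps it forwards the
    # chunk it received last step to its ring successor.
    have = [[i] for i in range(n)]
    carry = list(range(n))              # chunk each node forwards next
    for _ in range(n - 1):
        carry = [carry[(i - 1) % n] for i in range(n)]  # receive from predecessor
        for i in range(n):
            have[i].append(carry[i])
    return have

have = ring_allgather(5)
assert all(sorted(h) == list(range(5)) for h in have)
```

Each node transmits (n-1)/n of the total data, the minimum possible, which is why the ring is the bandwidth-optimality baseline that general topologies are measured against.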
21. TSoR: TCP Socket over RDMA Container Network for Cloud Native Computing
- Author
- Sun, Yulin, Qu, Qingming, Zhao, Chenxingyu, Krishnamurthy, Arvind, Chang, Hong, and Xiong, Ying
- Subjects
- Computer Science - Networking and Internet Architecture; Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
Cloud-native containerized applications constantly seek high-performance and easy-to-operate container network solutions. RDMA network is a potential enabler with higher throughput and lower latency than the standard TCP/IP network stack. However, several challenges remain in equipping containerized applications with RDMA network: 1) How to deliver transparent improvements without modifying application code; 2) How to integrate RDMA-based network solutions with container orchestration systems; 3) How to efficiently utilize RDMA for container networks. In this paper, we present an RDMA-based container network solution, TCP Socket over RDMA (TSoR), which addresses all the above challenges. To transparently accelerate applications using POSIX socket interfaces without modifications, we integrate TSoR with a container runtime that can intercept system calls for socket interfaces. To be compatible with orchestration systems like Kubernetes, TSoR implements a container network following the Kubernetes network model and satisfies all requirements of the model. To leverage RDMA benefits, TSoR designs a high-performance network stack that efficiently transfers TCP traffic using RDMA network. Thus, TSoR provides a turn-key solution for existing Kubernetes clusters to adopt the high-performance RDMA network with minimal effort. Our evaluation results show that TSoR provides up to 2.3x higher throughput and 64% lower latency for existing containerized applications, such as Redis key-value store and Node.js web server, with no code changes. TSoR code will be open-sourced.
- Published
- 2023
22. The Effects of Quantitative Easing on Interest Rates: Channels and Implications for Policy
- Author
- Krishnamurthy, Arvind and Vissing-Jorgensen, Annette
- Published
- 2012
- Full Text
23. Can We Save The Public Internet?
- Author
- Blumenthal, Marjory, Govindan, Ramesh, Katz-Bassett, Ethan, Krishnamurthy, Arvind, McCauley, James, Merrill, Nick, Narechania, Tejas, Panda, Aurojit, and Shenker, Scott
- Subjects
- Internet architecture; Public Internet; Internet enhancements; Computer Software; Distributed Computing; Communications Technologies; Networking & Telecommunications; Communications engineering; Distributed computing and systems software
- Abstract
The goal of this short document is to explain why recent developments in the Internet's infrastructure are problematic. As context, we note that the Internet was originally designed to provide a simple universal service - global end-to-end packet delivery - on which a wide variety of end-user applications could be built. The early Internet supported this packet-delivery service via an interconnected collection of commercial Internet Service Providers (ISPs) that we will refer to collectively as the “public Internet.” The Internet has fulfilled its packet-delivery mission far beyond all expectations and is now the dominant global communications infrastructure. By providing a level playing field on which new applications could be deployed, the Internet has enabled a degree of innovation that no one could have foreseen. To improve performance for some common applications, “enhancements” such as caching (as in content-delivery networks) have been gradually added to the Internet. The resulting performance improvements are so significant that such enhancements are now effectively necessary to meet current content delivery demands. Despite these tangible benefits, this document argues that the way these enhancements are currently deployed seriously undermines the sustainability of the public Internet and could lead to an Internet infrastructure that reaches fewer people and is largely concentrated among only a few large-scale providers. We wrote this document because we fear that these developments are now decidedly tipping the Internet's playing field towards those who can deploy these enhancements at massive scale, which in turn will limit the degree to which the future Internet can support unfettered innovation. This document begins by explaining our concerns but goes on to articulate how this unfortunate fate can be avoided. To provide more depth for those who seek it, we provide a separate addendum with further detail.
- Published
- 2023
24. Dissecting Service Mesh Overheads
- Author
- Zhu, Xiangfeng, She, Guozhen, Xue, Bowen, Zhang, Yu, Zhang, Yongsu, Zou, Xuan Kelvin, Duan, Xiongchun, He, Peng, Krishnamurthy, Arvind, Lentz, Matthew, Zhuo, Danyang, and Mahajan, Ratul
- Subjects
- Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Networking and Internet Architecture
- Abstract
Service meshes play a central role in the modern application ecosystem by providing an easy and flexible way to connect different services that form a distributed application. However, because of the way they interpose on application traffic, they can substantially increase application latency and resource consumption. We develop a decompositional approach and a tool, called MeshInsight, to systematically characterize the overhead of service meshes and to help developers quantify overhead in deployment scenarios of interest. Using MeshInsight, we confirm that service meshes can have high overhead -- up to 185% higher latency and up to 92% more virtual CPU cores for our benchmark applications -- but the severity is intimately tied to how they are configured and the application workload. The primary contributors to overhead vary based on the configuration too. IPC (inter-process communication) and socket writes dominate when the service mesh operates as a TCP proxy, but protocol parsing dominates when it operates as an HTTP proxy. MeshInsight also enables us to study the end-to-end impact of optimizations to service meshes. We show that not all seemingly-promising optimizations lead to a notable overhead reduction in realistic settings.
- Published
- 2022
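The decompositional model the abstract describes can be illustrated by summing per-message overhead components; the component names follow the abstract (IPC, socket writes, protocol parsing), but the microsecond values below are purely hypothetical, not MeshInsight's measurements:

```python
# Hypothetical per-message overhead components, in microseconds.
tcp_proxy = {"ipc": 8.0, "socket_write": 6.0, "tcp_forwarding": 3.0}
# An HTTP proxy does everything the TCP proxy does, plus protocol parsing.
http_proxy = {**tcp_proxy, "protocol_parsing": 14.0}

def total_overhead(components):
    # Decompositional model: total sidecar overhead is the sum of its parts,
    # so the effect of optimizing one component can be predicted in isolation.
    return sum(components.values())

assert total_overhead(tcp_proxy) == 17.0
assert total_overhead(http_proxy) == 31.0
```

The value of such a decomposition is that removing one component (say, an optimization that eliminates a socket write) changes the predicted total by exactly that component's share, which is how end-to-end impact of optimizations can be studied.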
25. The Rest of the World’s Dollar-Weighted Return on U.S. Treasurys
- Author
- Jiang, Zhengyang, Krishnamurthy, Arvind, and Lustig, Hanno
- Published
- 2023
- Full Text
26. Efficient Direct-Connect Topologies for Collective Communications
- Author
- Zhao, Liangyu, Pal, Siddharth, Chugh, Tapan, Wang, Weiyang, Fantl, Jason, Basu, Prithwish, Khoury, Joud, and Krishnamurthy, Arvind
- Subjects
- Computer Science - Networking and Internet Architecture; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Machine Learning
- Abstract
We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload. Our algorithms start from small, optimal base topologies and associated communication schedules and use techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient collective schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.
- Published
- 2022
27. SuperNIC: A Hardware-Based, Programmable, and Multi-Tenant SmartNIC
- Author
- Shan, Yizhou, Lin, Will, Kosta, Ryan, Krishnamurthy, Arvind, and Zhang, Yiying
- Subjects
- Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
With CPU scaling slowing down in today's data centers, more functionalities are being offloaded from the CPU to auxiliary devices. One such device is the SmartNIC, which is being increasingly adopted in data centers. In today's cloud environment, VMs on the same server can each have their own network computation (or network tasks) or workflows of network tasks to offload to a SmartNIC. These network tasks can be dynamically added/removed as VMs come and go and can be shared across VMs. Such dynamism demands that a SmartNIC not only schedules and processes packets but also manages and executes offloaded network tasks for different users. Although software solutions like an OS exist for managing software-based network tasks, such software-based SmartNICs cannot keep up with the quickly increasing data-center network speed. This paper proposes a new SmartNIC platform called SuperNIC that allows multiple tenants to efficiently and safely offload FPGA-based network computation DAGs. For efficiency and scalability, our core idea is to group network tasks into chains that are connected and scheduled as one unit. We further propose techniques to automatically scale network task chains with different types of parallelism. Moreover, we propose a fair share mechanism that considers both fair space sharing and fair time sharing of different types of hardware resources. Our FPGA prototype of SuperNIC achieves high-bandwidth, low-latency performance while efficiently utilizing and fairly sharing resources. Comment: 17 pages
- Published
- 2021
28. Effect of Neoadjuvant Concurrent Chemoradiation on Operability and Survival in Locally Advanced Inoperable Breast Cancer
- Author
- Iyer, Priya, Krishnamurthy, Arvind, Velusamy, Sridevi, Sundersingh, Shirley, Rajaram, Swaminathan, Balasubramanian, Ananthi, and Radhakrishnan, Venkatraman
- Published
- 2024
- Full Text
29. Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with Rank Reordering
- Author
- Luo, Liang, Nelson, Jacob, Krishnamurthy, Arvind, and Ceze, Luis
- Subjects
- Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Artificial Intelligence; Computer Science - Networking and Internet Architecture
- Abstract
ML workloads are becoming increasingly popular in the cloud. Good cloud training performance is contingent on efficient parameter exchange among VMs. We find that Collectives, the widely used distributed communication algorithms, cannot perform optimally out of the box due to the hierarchical topology of datacenter networks and the multi-tenancy nature of the cloud environment. In this paper, we present Cloud Collectives, a prototype that accelerates collectives by reordering the ranks of participating VMs such that the communication pattern dictated by the selected collectives operation best exploits the locality in the network. Cloud Collectives is non-intrusive, requires no code changes nor rebuild of an existing application, and runs without support from cloud providers. Our preliminary application of Cloud Collectives on allreduce operations in public clouds results in a speedup of up to 3.7x in multiple microbenchmarks and 1.3x in real-world workloads of distributed training of deep neural networks and gradient boosted decision trees using state-of-the-art frameworks.
- Published
- 2021
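The effect of rank reordering can be shown with a toy ring: traffic flows between consecutive ranks, so ordering VMs to keep rack-mates adjacent minimizes cross-rack hops. The VM names and rack placement below are hypothetical, and brute-force search stands in for whatever heuristic a real system would use:

```python
from itertools import permutations

def cross_rack_hops(order, rack_of):
    # Ring collectives send between consecutive ranks; count ring edges
    # that cross rack boundaries (the slow, oversubscribed links).
    n = len(order)
    return sum(rack_of[order[i]] != rack_of[order[(i + 1) % n]]
               for i in range(n))

rack_of = {"vm0": 0, "vm1": 1, "vm2": 0, "vm3": 1}    # hypothetical placement
default = ["vm0", "vm1", "vm2", "vm3"]                 # ignores locality
best = min(permutations(default), key=lambda o: cross_rack_hops(o, rack_of))

assert cross_rack_hops(default, rack_of) == 4          # every hop crosses racks
assert cross_rack_hops(best, rack_of) == 2             # a ring must cross at least twice
```

Since the reordering only permutes rank assignments, it needs no application changes, matching the non-intrusive claim in the abstract.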
30. AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly
- Author
- Jin, Yuchen, Zhou, Tianyi, Zhao, Liangyu, Zhu, Yibo, Guo, Chuanxiong, Canini, Marco, and Krishnamurthy, Arvind
- Subjects
- Computer Science - Machine Learning; Computer Science - Artificial Intelligence
- Abstract
The learning rate (LR) schedule is one of the most important hyper-parameters needing careful tuning in training DNNs. However, it is also one of the least automated parts of machine learning systems and usually costs significant manual effort and computing. Though there are pre-defined LR schedules and optimizers with adaptive LR, they introduce new hyperparameters that need to be tuned separately for different tasks/datasets. In this paper, we consider the question: can we automatically tune the LR over the course of training without human involvement? We propose an efficient method, AutoLRS, which automatically optimizes the LR for each training stage by modeling training dynamics. AutoLRS aims to find an LR applied to every τ steps that minimizes the resulting validation loss. We solve this black-box optimization on the fly by Bayesian optimization (BO). However, collecting training instances for BO requires a system to evaluate each LR queried by BO's acquisition function for τ steps, which is prohibitively expensive in practice. Instead, we apply each candidate LR for only τ' ≪ τ steps and train an exponential model to predict the validation loss after τ steps. This mutual-training process between BO and the loss-prediction model allows us to limit the training steps invested in the BO search. We demonstrate the advantages and the generality of AutoLRS through extensive experiments of training DNNs for tasks from diverse domains using different optimizers. The LR schedules auto-generated by AutoLRS lead to a speedup of 1.22x, 1.43x, and 1.5x when training ResNet-50, Transformer, and BERT, respectively, compared to the LR schedules in their original papers, and an average speedup of 1.31x over state-of-the-art heavily-tuned LR schedules. Comment: Published as a conference paper at ICLR 2021
- Published
- 2021
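The loss-extrapolation step described in the abstract above can be sketched in a few lines: a candidate LR is observed for only $\tau'$ steps, an exponential curve is fit to the observed losses, and the fitted curve predicts the validation loss at $\tau$ steps for the BO search. This is a minimal sketch that assumes a pure-decay model loss(t) ≈ a·exp(−b·t) fit by log-linear regression; the paper's forecasting model and BO loop are more elaborate.

```python
import math

def fit_exponential(steps, losses):
    """Least-squares fit of loss(t) ~= a * exp(-b * t) via log-linear regression.
    (Assumes loss decays toward zero; the paper's model is more general.)"""
    ys = [math.log(l) for l in losses]
    n = len(steps)
    mx = sum(steps) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(steps, ys))
    var = sum((x - mx) ** 2 for x in steps)
    b = -cov / var              # decay rate
    a = math.exp(my + b * mx)   # scale
    return a, b

def predict_loss(a, b, t):
    return a * math.exp(-b * t)

# Observe a candidate LR for tau_prime steps, then extrapolate to tau steps.
tau_prime, tau = 100, 1000
observed = [(t, 2.0 * math.exp(-0.003 * t)) for t in range(10, tau_prime + 1, 10)]
a, b = fit_exponential([t for t, _ in observed], [l for _, l in observed])
print(round(predict_loss(a, b, tau), 4))  # close to 2.0 * exp(-3)
```

In the real system this predicted loss, not a full $\tau$-step measurement, is what gets reported back to the Bayesian optimizer, which is where the training-time savings come from.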
31. Srifty: Swift and Thrifty Distributed Training on the Cloud
- Author
-
Luo, Liang, West, Peter, Krishnamurthy, Arvind, and Ceze, Luis
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing - Abstract
Finding the best VM configuration is key to achieving lower cost and higher throughput, two primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM selection that meets user constraints requires efficiently navigating a large search space while controlling for the performance variance associated with sharing cloud instances and networks. In this work, we characterize this variance in the context of distributed NN training and present results of a comprehensive throughput and cost-efficiency study we conducted across a wide array of instances to prune the VM search space. Using insights from these studies, we built Srifty, a system that combines runtime profiling with learned performance models to accurately predict training performance and find the best VM choice that satisfies user constraints, potentially leveraging both heterogeneous setups and spot instances. We integrated Srifty with PyTorch and evaluated it on Amazon EC2. We conducted a large-scale generalization study of Srifty across more than 2K training setups on EC2. Our results show that Srifty achieves an iteration latency prediction error of 8%, and its VM instance recommendations offer significant throughput gain and cost reduction while satisfying user constraints compared to existing solutions in complex, real-world scenarios.
- Published
- 2020
32. Making Distributed Mobile Applications SAFE: Enforcing User Privacy Policies on Untrusted Applications with Secure Application Flow Enforcement
- Author
-
Szekeres, Adriana, Zhang, Irene, Bailey, Katelin, Ackerman, Isaac, Shen, Haichen, Roesner, Franziska, Ports, Dan R. K., Krishnamurthy, Arvind, and Levy, Henry M.
- Subjects
Computer Science - Cryptography and Security ,Computer Science - Operating Systems - Abstract
Today's mobile devices sense, collect, and store huge amounts of personal information, which users share with family and friends through a wide range of applications. Once users give applications access to their data, they must implicitly trust that the apps correctly maintain data privacy. As we know from both experience and all-too-frequent press articles, that trust is often misplaced. While users do not trust applications, they do trust their mobile devices and operating systems. Unfortunately, sharing applications are not limited to mobile clients but must also run on cloud services to share data between users. In this paper, we leverage the trust that users have in their mobile OSes to vet cloud services. To do so, we define a new Secure Application Flow Enforcement (SAFE) framework, which requires cloud services to attest to a system stack that will enforce policies provided by the mobile OS for user data. We implement a mobile OS that enforces SAFE policies on unmodified mobile apps and two systems for enforcing policies on untrusted cloud services. Using these prototypes, we demonstrate that it is possible to enforce existing user privacy policies on unmodified applications.
- Published
- 2020
33. Implications of Asset Market Data for Equilibrium Models of Exchange Rates
- Author
-
Jiang, Zhengyang, primary, Krishnamurthy, Arvind, additional, and Lustig, Hanno, additional
- Published
- 2023
- Full Text
- View/download PDF
34. Talek: Private Group Messaging with Hidden Access Patterns
- Author
-
Cheng, Raymond, Scott, William, Masserova, Elisaweta, Zhang, Irene, Goyal, Vipul, Anderson, Thomas, Krishnamurthy, Arvind, and Parno, Bryan
- Subjects
Computer Science - Cryptography and Security - Abstract
Talek is a private group messaging system that sends messages through potentially untrustworthy servers, while hiding both data content and the communication patterns among its users. Talek explores a new point in the design space of private messaging; it guarantees access sequence indistinguishability, which is among the strongest guarantees in the space, while assuming an anytrust threat model, which is only slightly weaker than the strongest threat model currently found in related work. Our results suggest that this is a pragmatic point in the design space, since it supports strong privacy and good performance: we demonstrate a 3-server Talek cluster that achieves throughput of 9,433 messages/second for 32,000 active users with 1.7-second end-to-end latency. To achieve its security goals without coordination between clients, Talek relies on information-theoretic private information retrieval. To achieve good performance and minimize server-side storage, Talek introduces new techniques and optimizations that may be of independent interest, e.g., a novel use of blocked cuckoo hashing and support for private notifications. The latter provide a private, efficient mechanism for users to learn, without polling, which logs have new messages.
- Published
- 2020
- Full Text
- View/download PDF
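The information-theoretic private information retrieval (PIR) primitive that the Talek abstract relies on can be illustrated with the classic two-server XOR construction. Talek itself runs a three-server anytrust deployment with further optimizations; this is only the textbook building block, not Talek's protocol, and the database contents are invented.

```python
import secrets

def server_respond(db, query_bits):
    """Each server XORs together the records selected by its query bit-vector."""
    resp = bytes(len(db[0]))
    for rec, bit in zip(db, query_bits):
        if bit:
            resp = bytes(a ^ b for a, b in zip(resp, rec))
    return resp

def pir_read(db_a, db_b, index):
    n = len(db_a)
    q_a = [secrets.randbelow(2) for _ in range(n)]   # uniformly random: reveals nothing
    q_b = list(q_a)
    q_b[index] ^= 1                                  # differs only at the wanted position
    r_a = server_respond(db_a, q_a)
    r_b = server_respond(db_b, q_b)
    return bytes(a ^ b for a, b in zip(r_a, r_b))    # all records except index cancel

db = [b"msg-one!", b"msg-two!", b"msg-3333", b"msg-four"]
print(pir_read(db, db, 2))  # b'msg-3333'
```

Neither server alone learns anything about `index`, since each sees a uniformly random bit-vector; this is the "hidden access pattern" property, achieved without any coordination between clients.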
35. DeepSense: Enabling Carrier Sense in Low-Power Wide Area Networks Using Deep Learning
- Author
-
Chan, Justin, Wang, Anran, Krishnamurthy, Arvind, and Gollakota, Shyamnath
- Subjects
Computer Science - Networking and Internet Architecture - Abstract
The last few years have seen the proliferation of low-power wide area networks like LoRa, Sigfox and 802.11ah, each of which use a different and sometimes proprietary coding and modulation scheme, work below the noise floor and operate on the same frequency band. We introduce DeepSense, which is the first carrier sense mechanism that enables random access and coexistence for low-power wide area networks even when signals are below the noise floor. Our key insight is that any communication protocol that operates below the noise floor has to use coding at the physical layer. We show that neural networks can be used as a general algorithmic framework that can learn the coding mechanisms being employed by such protocols to identify signals that are hidden within noise. Our evaluation shows that DeepSense performs carrier sense across 26 different LPWAN protocols and configurations below the noise floor and can operate in the presence of frequency shifts as well as concurrent transmissions. Beyond carrier sense, we also show that DeepSense can support multi bit-rate LoRa networks by classifying between 21 different LoRa configurations and flexibly adapting bitrates based on signal strength. In a deployment of a multi-rate LoRa network, DeepSense improves bit rate by 4x for nearby devices and provides a 1.7x increase in the number of locations that can connect to the campus-wide network.
- Published
- 2019
36. Scaling Distributed Machine Learning with In-Network Aggregation
- Author
-
Sapio, Amedeo, Canini, Marco, Ho, Chen-Yu, Nelson, Jacob, Kalnis, Panos, Kim, Changhoon, Krishnamurthy, Arvind, Moshref, Masoud, Ports, Dan R. K., and Richtárik, Peter
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing ,Computer Science - Machine Learning ,Computer Science - Networking and Internet Architecture ,Statistics - Machine Learning - Abstract
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5$\times$ for a number of real-world benchmark models.
- Published
- 2019
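The aggregation idea in the SwitchML abstract can be modeled in a few lines: the "switch" holds one integer accumulator per slot, adds each worker's fixed-point update, and releases the sum once every worker has contributed. This is a toy host-side model of the concept (the class name and quantization scale are made up), not SwitchML's P4 dataplane; fixed-point arithmetic is used because switch dataplanes cannot do floating point.

```python
class SwitchAggregator:
    """Toy model of in-network gradient aggregation over one slot."""
    def __init__(self, num_workers, scale=1 << 16):
        self.num_workers = num_workers
        self.scale = scale      # fixed-point quantization factor
        self.slot = None        # running integer sums
        self.seen = 0

    def ingest(self, grad_floats):
        quantized = [int(g * self.scale) for g in grad_floats]
        if self.slot is None:
            self.slot = quantized
        else:
            self.slot = [a + b for a, b in zip(self.slot, quantized)]
        self.seen += 1
        if self.seen == self.num_workers:  # all workers contributed: release sum
            result = [v / self.scale for v in self.slot]
            self.slot, self.seen = None, 0
            return result
        return None

sw = SwitchAggregator(num_workers=3)
sw.ingest([0.5, -1.0])
sw.ingest([0.25, 0.5])
print(sw.ingest([0.25, 0.5]))  # [1.0, 0.0]
```

The bandwidth saving comes from the fan-in: each worker sends one update and receives one aggregate, instead of exchanging updates all-to-all or through a parameter server.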
37. ADARES: Adaptive Resource Management for Virtual Machines
- Author
-
Cano, Ignacio, Chen, Lequn, Fonseca, Pedro, Chen, Tianqi, Cheah, Chern, Gupta, Karan, Chandra, Ramesh, and Krishnamurthy, Arvind
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing ,Computer Science - Machine Learning - Abstract
Virtual execution environments allow for consolidation of multiple applications onto the same physical server, thereby enabling more efficient use of server resources. However, users often statically configure the resources of virtual machines through guesswork, resulting in either insufficient resource allocations that hinder VM performance, or excessive allocations that waste precious data center resources. In this paper, we first characterize real-world resource allocation and utilization of VMs through the analysis of an extensive dataset, consisting of more than 250k VMs from over 3.6k private enterprise clusters. Our large-scale analysis confirms that VMs are often misconfigured, either overprovisioned or underprovisioned, and that this problem is pervasive across a wide range of private clusters. We then propose ADARES, an adaptive system that dynamically adjusts VM resources using machine learning techniques. In particular, ADARES leverages the contextual bandits framework to effectively manage the adaptations. Our system exploits easily collectible data, at the cluster, node, and VM levels, to make more sensible allocation decisions, and uses transfer learning to safely explore the configuration space and speed up training. Our empirical evaluation shows that ADARES can significantly improve system utilization without sacrificing performance. For instance, when compared to threshold and prediction-based baselines, it achieves more predictable VM-level performance and also reduces the number of virtual CPUs and the amount of memory provisioned by up to 35% and 60%, respectively, for synthetic workloads on real clusters.
- Published
- 2018
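The contextual-bandit framing in the ADARES abstract can be sketched with a minimal ε-greedy learner: the context is a coarse utilization bucket and the arms are resource adjustments. ADARES's actual learner, features, safety mechanisms, and transfer learning are more sophisticated; every name and number below is illustrative.

```python
import random

class EpsilonGreedyBandit:
    """Minimal contextual bandit: context -> pick an arm, observe reward, update."""
    ARMS = ("scale_down", "keep", "scale_up")

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.value = {}   # (context, arm) -> running mean reward
        self.count = {}

    def choose(self, context):
        if self.rng.random() < self.epsilon:              # explore occasionally
            return self.rng.choice(self.ARMS)
        return max(self.ARMS,
                   key=lambda a: self.value.get((context, a), 0.0))

    def update(self, context, arm, reward):
        k = (context, arm)
        n = self.count.get(k, 0) + 1
        self.count[k] = n
        old = self.value.get(k, 0.0)
        self.value[k] = old + (reward - old) / n          # incremental mean

bandit = EpsilonGreedyBandit()
for _ in range(500):
    arm = bandit.choose("cpu_low")
    reward = 1.0 if arm == "scale_down" else 0.0          # shrinking an idle VM pays off
    bandit.update("cpu_low", arm, reward)
greedy = max(bandit.ARMS, key=lambda a: bandit.value.get(("cpu_low", a), 0.0))
print(greedy)  # the learner converges on 'scale_down' for idle VMs
```

In a real deployment the reward would combine utilization gains with performance penalties, which is exactly where the "without sacrificing performance" trade-off in the abstract lives.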
38. Breast Cancer
- Author
-
Khanikar, Duncan, Kamalasanan, Kiran, Krishnamurthy, Arvind, Hazarika, Munlima, Kataki, Amal Chandra, Kataki, Amal Chandra, editor, and Barmon, Debabrata, editor
- Published
- 2022
- Full Text
- View/download PDF
39. Comparison of nutrition screening tools and calf circumference in estimating the preoperative prevalence of malnutrition among patients with aerodigestive tract cancers—a prospective observational cohort study
- Author
-
Srinivasaraghavan, Nivedhyaa, Venketeswaran, Meenakshi. V., Balakrishnan, Kalpana, Ramasamy, Thendral, Ramakrishnan, Aishwarya, Agarwal, Ajit, and Krishnamurthy, Arvind
- Published
- 2022
- Full Text
- View/download PDF
40. COVID-19, Race, and Mass Incarceration
- Author
-
Krishnamurthy, Arvind, primary
- Published
- 2022
- Full Text
- View/download PDF
41. 4. COVID-19, Race, and Mass Incarceration
- Author
-
Krishnamurthy, Arvind, primary
- Published
- 2022
- Full Text
- View/download PDF
42. Risk predictors as companion diagnostics for ER positive, HER2 negative early staged breast cancers from India.
- Author
-
Ramshankar, Vijayalakshmi, primary, Gopalakrishnan, Aravinda Lochan, additional, Krishnamurthy, Arvind, additional, Radhakrishnan, Venkatraman, additional, and Iyer, Priya, additional
- Published
- 2024
- Full Text
- View/download PDF
43. CC-NIC: a Cache-Coherent Interface to the NIC
- Author
-
Schuh, Henry N., primary, Krishnamurthy, Arvind, additional, Culler, David, additional, Levy, Henry M., additional, Rizzo, Luigi, additional, Khan, Samira, additional, and Stephens, Brent E., additional
- Published
- 2024
- Full Text
- View/download PDF
44. eZNS: Elastic Zoned Namespace for Enhanced Performance Isolation and Device Utilization
- Author
-
Min, Jaehong, primary, Zhao, Chenxingyu, additional, Liu, Ming, additional, and Krishnamurthy, Arvind, additional
- Published
- 2024
- Full Text
- View/download PDF
45. SuperNIC: An FPGA-Based, Cloud-Oriented SmartNIC
- Author
-
Lin, Will, primary, Shan, Yizhou, additional, Kosta, Ryan, additional, Krishnamurthy, Arvind, additional, and Zhang, Yiying, additional
- Published
- 2024
- Full Text
- View/download PDF
46. Do at-large elections reduce black representation? A new baseline for county legislatures
- Author
-
Todd, Jason Douglas, primary, Bram, Curtis, additional, and Krishnamurthy, Arvind, additional
- Published
- 2024
- Full Text
- View/download PDF
47. A Hardware-Software Blueprint for Flexible Deep Learning Specialization
- Author
-
Moreau, Thierry, Chen, Tianqi, Vega, Luis, Roesch, Jared, Yan, Eddie, Zheng, Lianmin, Fromm, Josh, Jiang, Ziheng, Ceze, Luis, Guestrin, Carlos, and Krishnamurthy, Arvind
- Subjects
Computer Science - Machine Learning ,Computer Science - Distributed, Parallel, and Cluster Computing ,Statistics - Machine Learning - Abstract
Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operators, or numerical systems threaten the viability of specialized hardware accelerators. We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads. VTA achieves this flexibility via a parametrizable architecture, two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA that explicitly orchestrates concurrent compute and memory tasks and (2) a microcode-ISA that implements a wide variety of operators with single-cycle tensor-tensor operations. Next, we propose a runtime system equipped with a JIT compiler for flexible code-generation and heterogeneous execution that enables effective use of the VTA architecture. VTA is open-sourced and integrated into Apache TVM, a state-of-the-art deep learning compilation stack that provides flexibility for diverse models and divergent hardware backends. We propose a flow that performs design space exploration to generate a customized hardware architecture and software operator library that can be leveraged by mainstream learning frameworks. We demonstrate our approach by deploying optimized deep learning models used for object classification and style transfer on edge-class FPGAs., Comment: 6 pages plus references, 8 figures
- Published
- 2018
48. Revisiting Network Support for RDMA
- Author
-
Mittal, Radhika, Shpiner, Alexander, Panda, Aurojit, Zahavi, Eitan, Krishnamurthy, Arvind, Ratnasamy, Sylvia, and Shenker, Scott
- Subjects
Computer Science - Networking and Internet Architecture - Abstract
The advent of RoCE (RDMA over Converged Ethernet) has led to a significant increase in the use of RDMA in datacenter networks. To achieve good performance, RoCE requires a lossless network which is in turn achieved by enabling Priority Flow Control (PFC) within the network. However, PFC brings with it a host of problems such as head-of-the-line blocking, congestion spreading, and occasional deadlocks. Rather than seek to fix these issues, we instead ask: is PFC fundamentally required to support RDMA over Ethernet? We show that the need for PFC is an artifact of current RoCE NIC designs rather than a fundamental requirement. We propose an improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses. We show that IRN (without PFC) outperforms RoCE (with PFC) by 6-83% for typical network scenarios. Thus not only does IRN eliminate the need for PFC, it improves performance in the process! We further show that the changes that IRN introduces can be implemented with modest overheads of about 3-10% to NIC resources. Based on our results, we argue that research and industry should rethink the current trajectory of network support for RDMA., Comment: Extended version of the paper appearing in ACM SIGCOMM 2018
- Published
- 2018
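The two NIC changes named in the IRN abstract, bounding in-flight data to roughly one bandwidth-delay product (BDP) and recovering losses with selective retransmission instead of a lossless fabric, can be sketched as sender-side state machine logic. This is a reading of the ideas with invented names and a simplified ACK model, not IRN's hardware design.

```python
class IrnSender:
    """Sketch of IRN-style loss recovery: BDP-capped window + selective retransmit."""
    def __init__(self, num_packets, bdp_cap):
        self.unacked = set()        # sequence numbers in flight
        self.next_seq = 0
        self.num_packets = num_packets
        self.bdp_cap = bdp_cap      # at most ~one BDP of packets in flight

    def sendable(self):
        """New sequence numbers allowed on the wire this round."""
        out = []
        while (self.next_seq < self.num_packets
               and len(self.unacked) + len(out) < self.bdp_cap):
            out.append(self.next_seq)
            self.next_seq += 1
        self.unacked.update(out)
        return out

    def on_sack(self, acked):
        """Selective ACK: clear acked packets; whatever remains is retransmitted,
        rather than relying on PFC to prevent the drop in the first place."""
        self.unacked -= set(acked)
        return sorted(self.unacked)

s = IrnSender(num_packets=8, bdp_cap=4)
first = s.sendable()                      # packets 0-3 fill one BDP
retransmit = s.on_sack(set(first) - {2})  # the lossy fabric dropped packet 2
print(retransmit, s.sendable())           # [2] [4, 5, 6]
```

The window stays at the BDP cap while the lost packet is outstanding, which is the mechanism that keeps queues short without needing PFC's hop-by-hop backpressure.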
49. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
- Author
-
Luo, Liang, Nelson, Jacob, Ceze, Luis, Phanishayee, Amar, and Krishnamurthy, Arvind
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing ,Computer Science - Machine Learning ,Computer Science - Neural and Evolutionary Computing - Abstract
Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to communication. This paper characterizes DDNN training to precisely pinpoint these bottlenecks. We found that timely training requires high-performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation and communication resources. We therefore propose PHub, a high-performance, multi-tenant, rack-scale PS design. PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks. PHub provides a performance improvement of up to 2.7x compared to state-of-the-art distributed training techniques for cloud-based ImageNet workloads, with 25% better throughput per dollar.
- Published
- 2018
- Full Text
- View/download PDF
50. Learning to Optimize Tensor Programs
- Author
-
Chen, Tianqi, Zheng, Lianmin, Yan, Eddie, Jiang, Ziheng, Moreau, Thierry, Ceze, Luis, Guestrin, Carlos, and Krishnamurthy, Arvind
- Subjects
Computer Science - Machine Learning ,Statistics - Machine Learning - Abstract
We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high-dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN, where only a narrow range of server-class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU., Comment: NeurIPS 2018
- Published
- 2018
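The search loop described in the abstract above, where a statistical cost model trained on measured samples ranks candidate programs so that only the most promising are actually timed, can be caricatured in a few lines. The knob space, cost surface, and nearest-neighbor "model" below are all invented stand-ins; the paper's system learns over rich program features with gradient-boosted trees and a neural model.

```python
def candidate_schedules():
    """Enumerate a toy knob space: tile sizes for a matmul loop nest."""
    return [(tx, ty) for tx in (4, 8, 16, 32) for ty in (4, 8, 16, 32)]

def measure(schedule):
    """Stand-in for real on-device timing (hypothetical cost surface)."""
    tx, ty = schedule
    return abs(tx - 16) + abs(ty - 8) + 1.0   # pretend (16, 8) is optimal

def train_cost_model(samples):
    """1-nearest-neighbor 'statistical cost model' over measured samples."""
    def predict(schedule):
        tx, ty = schedule
        nearest = min(samples,
                      key=lambda s: abs(s[0][0] - tx) + abs(s[0][1] - ty))
        return nearest[1]
    return predict

# Seed the model with a few real measurements, then let it rank the rest.
seed = [(s, measure(s)) for s in [(4, 4), (32, 32), (16, 16)]]
model = train_cost_model(seed)
ranked = sorted(candidate_schedules(), key=model)
best = min(ranked[:4], key=measure)   # only time the model's top-4 picks
print(best)                           # (16, 8)
```

The point of the design is the measurement budget: the model is cheap to query across the full space, so expensive hardware measurements are spent only on the candidates it ranks highest, with new measurements fed back to improve the model.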