Author: "Zhuo, Danyang" - Searchworks@Jio Institute Digital Library Search Results

Your search for "Zhuo, Danyang" returned 49 results.

1. Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement

Author: Wu, Yongji, Qu, Wenjie, Tao, Tianyang, Wang, Zhuang, Bai, Wei, Li, Zhuohao, Tian, Yuan, Zhang, Jiaheng, Lentz, Matthew, and Zhuo, Danyang
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: The sparsely activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to further scale large language models (LLMs) due to its sub-linear scaling of computation costs. However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant, as all GPUs need to wait idle until the failure is resolved, potentially losing considerable training progress as training has to restart from checkpoints. Existing solutions for efficient fault-tolerant training either lack elasticity or rely on building resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy adopted by the MoE architecture. We present Lazarus, a system for resilient and elastic training of MoE models. Lazarus adaptively allocates expert replicas to address the inherent imbalance in expert workload and speed up training, while a provably optimal expert placement algorithm maximizes the probability of recovery upon failures. Through adaptive expert placement and a flexible token dispatcher, Lazarus can also fully utilize all available nodes after failures, leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE training systems by up to 5.7x under frequent node failures and 3.4x on a real spot instance trace.
Published: 2024
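
Lazarus is described as allocating expert replicas adaptively according to expert workload. As a minimal sketch of that idea only, assuming a fixed number of replica slots and measured per-expert token loads as input, the code below gives every expert one replica and splits the spare slots in proportion to load. The function name `allocate_replicas` and the largest-remainder rounding rule are hypothetical, not the paper's algorithm.

```python
def allocate_replicas(loads, total_slots):
    """Hypothetical load-proportional replica allocation for MoE experts.

    loads: token load per expert; total_slots: replica slots available.
    Every expert keeps at least one replica so it stays servable.
    """
    n = len(loads)
    if total_slots < n:
        raise ValueError("need at least one slot per expert")
    spare = total_slots - n
    total_load = sum(loads)
    if total_load == 0:
        shares = [spare / n] * n  # no load information: split evenly
    else:
        shares = [spare * load / total_load for load in loads]
    replicas = [1 + int(share) for share in shares]
    # Hand out slots lost to rounding down, largest fractional part first.
    leftover = total_slots - sum(replicas)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        replicas[i] += 1
    return replicas
```

For loads [600, 300, 100] and 8 slots this returns [4, 3, 1]: the hottest expert receives the most replicas while every expert remains available after a single replica is lost.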

2. VcLLM: Video Codecs are Secretly Tensor Codecs

Author: Xu, Ceyu, Wu, Yongji, Yang, Xinyu, Chen, Beidi, Lentz, Matthew, Zhuo, Danyang, and Wills, Lisa Wu
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: As the parameter size of large language models (LLMs) continues to expand, the large memory footprint and high communication bandwidth they require have become significant bottlenecks for the training and inference of LLMs. To mitigate these bottlenecks, various tensor compression techniques have been proposed to reduce the data size, thereby alleviating memory requirements and communication pressure. Our research found that video codecs, despite being originally designed for compressing videos, show excellent efficiency when compressing various types of tensors. We demonstrate that video codecs can be versatile and general-purpose tensor codecs while achieving state-of-the-art compression efficiency in various tasks. We further make use of the hardware video encoding and decoding modules available on GPUs to create a framework capable of both inference and training with video codecs repurposed as tensor codecs. This greatly reduces the requirement for memory capacity and communication bandwidth, enabling training and inference of large models on consumer-grade GPUs.
Published: 2024

3. Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution

Author: Xu, Yechen, Kong, Xinhao, Chen, Tingjun, and Zhuo, Danyang
Subjects: Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: The complexity of large language model (LLM) serving workloads has substantially increased due to integration with external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface for tool developers to expose partial execution opportunities to the LLM serving system and a request scheduler that facilitates partial tool execution. Our results demonstrate that tool partial execution can improve request completion latency by up to 38.8%. Comment: 11 pages, 8 figures
Published: 2024

4. Adaptive Skeleton Graph Decoding

Author: Jin, Shuowei, Wu, Yongji, Zheng, Haizhong, Zhang, Qingzhao, Lentz, Matthew, Mao, Z. Morley, Prakash, Atul, Qian, Feng, and Zhuo, Danyang
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) have seen significant adoption for natural language tasks, owing their success to massive numbers of model parameters (e.g., 70B+); however, LLM inference incurs significant computation and memory costs. Recent approaches propose parallel decoding strategies, such as Skeleton-of-Thought (SoT), to improve performance by breaking prompts down into sub-problems that can be decoded in parallel; however, they often suffer from reduced response quality. Our key insight is that we can request additional information, specifically dependencies and difficulty, when generating the sub-problems to improve both response quality and performance. In this paper, we propose Skeleton Graph Decoding (SGD), which uses dependencies exposed between sub-problems to support information forwarding between dependent sub-problems for improved quality while exposing parallelization opportunities for decoding independent sub-problems. Additionally, we leverage difficulty estimates for each sub-problem to select an appropriately sized model, improving performance without significantly reducing quality. Compared to standard autoregressive generation and SoT, SGD achieves a 1.69x speedup while improving quality by up to 51%.
Published: 2024
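
One way to picture how dependencies expose parallelism: sub-problems whose dependencies are all answered can be decoded together, and the rest wait. The sketch below groups a hypothetical sub-problem dependency graph into such waves using Kahn's topological sort. The function name and the dict-of-dependencies format are assumptions, and it leaves out SGD's information forwarding and difficulty-based model selection.

```python
def decode_waves(deps):
    """Group sub-problems into waves that could be decoded in parallel.

    deps maps each sub-problem id to the ids whose answers it depends on.
    Every sub-problem lands in a later wave than all of its dependencies.
    """
    indegree = {node: len(parents) for node, parents in deps.items()}
    children = {node: [] for node in deps}
    for node, parents in deps.items():
        for parent in parents:
            children[parent].append(node)
    ready = sorted(node for node, d in indegree.items() if d == 0)
    waves = []
    while ready:
        waves.append(ready)
        next_ready = []
        for node in ready:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all dependencies now decoded
                    next_ready.append(child)
        ready = sorted(next_ready)
    if sum(len(wave) for wave in waves) != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return waves
```

For example, a "summary" sub-problem that depends on "pros" and "cons" is placed in a second wave, while "intro", "pros", and "cons" share the first.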

5. Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native

Author: Lu, Yao, Bian, Song, Chen, Lequn, He, Yongjun, Hui, Yulong, Lentz, Matthew, Li, Beibin, Liu, Fei, Li, Jialin, Liu, Qi, Liu, Rui, Liu, Xiaoxuan, Ma, Lin, Rong, Kexin, Wang, Jianguo, Wu, Yingjun, Wu, Yongji, Zhang, Huanchen, Zhang, Minjia, Zhang, Qizhen, Zhou, Tianyi, and Zhuo, Danyang
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: In this paper, we investigate the intersection of large generative AI models and cloud-native computing architectures. Recent large models such as ChatGPT, while revolutionary in their capabilities, face challenges like escalating costs and demand for high-end GPUs. Drawing analogies between large-model-as-a-service (LMaaS) and cloud database-as-a-service (DBaaS), we describe an AI-native computing paradigm that harnesses the power of both cloud-native technologies (e.g., multi-tenancy and serverless computing) and advanced machine learning runtimes (e.g., batched LoRA inference). These joint efforts aim to optimize cost of goods sold (COGS) and improve resource accessibility. The journey of merging these two domains is just beginning, and we hope to stimulate future research and development in this area.
Published: 2024

6. Fairness in Serving Large Language Models

Author: Sheng, Ying, Cao, Shiyi, Li, Dacheng, Zhu, Banghua, Li, Zhuohan, Zhuo, Danyang, Gonzalez, Joseph E., and Stoica, Ion
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Performance
Abstract: High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests, from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services impose request rate limits so that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of resources and poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to unpredictable request lengths and the unique batching characteristics of LLMs on parallel accelerators. This paper introduces a definition of LLM serving fairness based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism. We prove a tight 2x upper bound on the service difference between two backlogged clients, while adhering to the work-conserving requirement. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, especially in contrast to other baseline methods, which exhibit shortcomings under various conditions. The reproducible code is available at https://github.com/Ying1123/VTC-artifact
Published: 2023
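
VTC is described as a fair scheduler that tracks the service each client has received in weighted tokens. The sketch below is a simplified, hypothetical rendition of that idea: the class and method names, the 1:2 input/output token weights, and charging once per request rather than per decoding step are all assumptions. It always serves the backlogged client with the smallest counter, and it lifts a client that rejoins the backlog to the current minimum so that idle periods do not bank credit.

```python
from collections import deque


class VirtualTokenCounter:
    """Simplified VTC-style fair scheduler (illustrative sketch only)."""

    def __init__(self, input_weight=1.0, output_weight=2.0):
        # Hypothetical cost weights: output tokens cost more than input tokens.
        self.input_weight = input_weight
        self.output_weight = output_weight
        self.counters = {}  # client -> weighted tokens served so far
        self.queues = {}    # client -> pending requests

    def submit(self, client, request):
        queue = self.queues.setdefault(client, deque())
        if not queue:
            # A client (re)joining the backlog is lifted to the smallest
            # counter among backlogged clients, so idle time earns no credit.
            backlogged = [self.counters[c] for c, q in self.queues.items() if q]
            current = self.counters.get(client, 0.0)
            if backlogged:
                current = max(current, min(backlogged))
            self.counters[client] = current
        queue.append(request)

    def next_request(self):
        """Pop a request from the backlogged client with the smallest counter."""
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        return client, self.queues[client].popleft()

    def charge(self, client, input_tokens, output_tokens):
        # Cost function: weighted sum of input and output tokens processed.
        self.counters[client] += (self.input_weight * input_tokens
                                  + self.output_weight * output_tokens)
```

A serving loop would call `next_request` whenever the batch has room and `charge` as tokens are processed; a heavy client then falls behind light clients in the queue until their counters catch up.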

7. Punica: Multi-Tenant LoRA Serving

Author: Chen, Lequn, Ye, Zihao, Wu, Yongji, Zhuo, Danyang, Ceze, Luis, and Krishnamurthy, Arvind
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-size GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while adding only 2 ms of latency per token. Punica is open source: https://github.com/punica-ai/punica
Published: 2023
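
The arithmetic behind multi-tenant LoRA serving is that every request uses the same base weight W plus its own low-rank update A B, so y = x W + (x A) B. The pure-Python sketch below (helper names and data layout are hypothetical) only illustrates that math for a batch mixing two adapters; Punica's contribution is a CUDA kernel that batches these per-adapter products on the GPU, which this loop does not attempt to model.

```python
def matvec(x, m):
    """Multiply row vector x (length r) by matrix m (r rows, c columns)."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def batched_lora_forward(xs, adapter_ids, base_weight, adapters):
    """Compute y = x W + (x A) B per request while sharing one copy of W.

    xs: input rows, one per request; adapter_ids: the LoRA model of each
    request; adapters: maps an id to its low-rank factors (A: d x r, B: r x k).
    """
    outputs = []
    for x, adapter_id in zip(xs, adapter_ids):
        base = matvec(x, base_weight)    # shared pre-trained weights
        a, b = adapters[adapter_id]      # small per-tenant factors
        delta = matvec(matvec(x, a), b)  # rank-r update, cheap when r is small
        outputs.append([u + v for u, v in zip(base, delta)])
    return outputs
```

Only the small A and B factors differ between tenants, which is why one copy of W can back many LoRA models at once.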

8. Symphony: Optimized DNN Model Serving using Deferred Batch Scheduling

Author: Chen, Lequn, Deng, Weixin, Canumalla, Anirudh, Xin, Yu, Zhuo, Danyang, Philipose, Matthai, and Krishnamurthy, Arvind
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Large batch sizes are one of the most critical factors in increasing accelerator efficiency and the performance of DNN model inference. However, existing model serving systems cannot achieve adequate batch sizes while meeting latency objectives, as these systems eagerly dispatch requests to accelerators to minimize accelerator idle time. We propose Symphony, a DNN serving system that explores deferred batch scheduling to optimize system efficiency and throughput. Further, unlike prior systems, Symphony's GPU usage is load-proportional: it consolidates workloads on the appropriate number of GPUs and works smoothly with cluster auto-scaling tools. Symphony consists of two core design points. First, Symphony defines a schedulable window in which a batch of inference requests can be dispatched. This window is computed to improve accelerator efficiency while meeting each request's SLO. Second, Symphony implements a scalable, low-latency, fine-grained coordination scheme across accelerators to dispatch and execute requests in the schedulable window. Through extensive scheduler-only benchmarks, we demonstrate that Symphony can schedule millions of requests per second and coordinate thousands of GPUs while also enabling robust autoscaling that adapts to workload changes. Symphony outperforms prior systems by achieving 5x higher goodput when given the same number of GPUs and a 60% reduction in GPUs when given the same workload.
Published: 2023
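
The schedulable window can be pictured with a simple deadline calculation. In this hypothetical sketch (the linear latency model, the function names, and millisecond units are assumptions, not Symphony's implementation), a queued batch must start no later than its earliest deadline minus its execution time; until that moment the scheduler defers, so more requests can join the batch.

```python
def latest_dispatch_time(deadlines, alpha, beta):
    """Latest start time at which a batch of all queued requests still meets
    every deadline, assuming latency(b) = alpha + beta * b for batch size b."""
    return min(deadlines) - (alpha + beta * len(deadlines))


def should_dispatch(now, deadlines, alpha, beta, max_batch):
    """Defer dispatch while waiting could still grow the batch safely."""
    if not deadlines:
        return False  # nothing queued
    if len(deadlines) >= max_batch:
        return True   # batch is already as large as allowed
    return now >= latest_dispatch_time(deadlines, alpha, beta)
```

With deadlines at 100 ms and 120 ms, alpha = 5 ms and beta = 1 ms per request, the batch can wait until t = 93 ms. A real scheduler would re-evaluate the window on every arrival, since a larger batch runs longer and moves the latest start time earlier.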

9. Agile Development of Linux Schedulers with Ekiben

Author: Miller, Samantha, Kumar, Anirudh, Vakharia, Tanay, Anderson, Tom, Chen, Ang, and Zhuo, Danyang
Subjects: Computer Science - Operating Systems
Abstract: Kernel task scheduling is important for application performance, adaptability to new hardware, and complex user requirements. However, developing, testing, and debugging new scheduling algorithms in Linux, the most widely used cloud operating system, is slow and difficult. We developed Ekiben, a framework for high-velocity development of Linux kernel schedulers. Ekiben schedulers are written in safe Rust, and the system supports live upgrade of new scheduling policies into the kernel, userspace debugging, and bidirectional communication with applications. A scheduler implemented with Ekiben achieved near-identical performance (within 1% on average) to the default Linux scheduler, CFS, on a wide range of benchmarks. Ekiben is also able to support a range of research schedulers, specifically the Shinjuku scheduler, a locality-aware scheduler, and the Arachne core arbiter, with good performance. Comment: 13 pages, 5 figures, submitted to EuroSys 2024
Published: 2023

10. Query Complexity of Active Learning for Function Family With Nearly Orthogonal Basis

Author: Chen, Xiang, Song, Zhao, Sun, Baocheng, Yin, Junze, and Zhuo, Danyang
Subjects: Computer Science - Machine Learning
Abstract: Many machine learning algorithms require large amounts of labeled data to deliver state-of-the-art results. In applications such as medical diagnosis and fraud detection, although there is an abundance of unlabeled data, it is costly to obtain labels from experts, experiments, or simulations. Active learning algorithms aim to reduce the number of required labeled data points while preserving performance. For many convex optimization problems, such as linear regression and $p$-norm regression, there are theoretical bounds on the number of labels required to achieve a certain accuracy. We call this the query complexity of active learning. However, today's active learning algorithms require the underlying learned function to have an orthogonal basis. For example, when applying active learning to linear regression, the requirement is that the target function be a linear combination of a set of orthogonal linear functions, and active learning can find the coefficients of these linear functions. We present a theoretical result showing that active learning does not need an orthogonal basis but rather only requires a nearly orthogonal basis. We provide the corresponding theoretical proofs for function families with a nearly orthogonal basis, and their applications to an algorithmically efficient active learning framework.
Published: 2023

11. Collie: Finding Performance Anomalies in RDMA Subsystems

Author: Kong, Xinhao, Zhu, Yibo, Zhou, Huaping, Jiang, Zhuo, Ye, Jianxi, Guo, Chuanxiong, and Zhuo, Danyang
Subjects: Computer Science - Networking and Internet Architecture
Abstract: High-speed RDMA networks are being rapidly adopted in industry for their low latency and reduced CPU overheads. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can potentially trigger abnormal performance behaviors (e.g., unexpectedly low throughput, PFC pause frame storms). We design and implement Collie, a tool for users to systematically uncover performance anomalies in RDMA subsystems without the need to access hardware internal designs. Instead of individually testing each hardware device (e.g., NIC, memory, PCIe), Collie is holistic, constructing a comprehensive search space for application workloads. Collie then uses simulated annealing to drive RDMA-related performance and diagnostic counters to extreme value regions to find workloads that can trigger performance anomalies. We evaluate Collie on combinations of various RDMA NICs, CPUs, and other hardware components. Collie found 15 new performance anomalies, all of which have been acknowledged by the hardware vendors; 7 of them have already been fixed since we reported them. We also present our experience in using Collie to avoid performance anomalies for an RDMA RPC library and an RDMA distributed machine learning framework. Comment: NSDI 2022
Published: 2023
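
Collie's search is described as simulated annealing over a workload space. Below is a generic, hypothetical sketch of that optimizer: `start` would stand for an application workload configuration, `neighbor` for a small change to it, and `score` for an RDMA performance or diagnostic counter to be pushed toward extreme values. None of these names come from Collie itself, and the geometric cooling schedule is an illustrative choice.

```python
import math
import random


def simulated_annealing(start, neighbor, score, steps=3000, t0=10.0,
                        cooling=0.995, seed=0):
    """Search for a state that maximizes `score` (e.g. an anomaly signal)."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept regressions with probability
        # exp(delta / T) so the search can escape local optima early on.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature = max(temperature * cooling, 1e-9)  # geometric cooling
    return best, best_score
```

As a toy check, maximizing -(x - 7)^2 over the integers 0 to 20, with `neighbor` stepping x by plus or minus one, returns x = 7 with score 0. In a real anomaly search each `score` call would run a workload and read hardware counters, so the step budget would be far smaller.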

12. Remote Procedure Call as a Managed System Service

Author: Chen, Jingrong, Wu, Yongji, Lin, Shihan, Xu, Yechen, Kong, Xinhao, Anderson, Thomas, Lentz, Matthew, Yang, Xiaowei, and Zhuo, Danyang
Subjects: Computer Science - Networking and Internet Architecture, Computer Science - Operating Systems
Abstract: Remote Procedure Call (RPC) is a widely used abstraction for cloud computing. The programmer specifies type information for each remote procedure, and a compiler generates stub code linked into each application to marshal and unmarshal arguments into message buffers. Increasingly, however, application and service operations teams need a high degree of visibility and control over the flow of RPCs between services, leading many installations to use sidecars or service mesh proxies for manageability and policy flexibility. These sidecars typically involve inspection and modification of RPC data that the stub compiler had just carefully assembled, adding needless overhead. Further, upgrading diverse application RPC stubs to use advanced hardware capabilities such as RDMA or DPDK is a long and involved process, and often incompatible with sidecar policy control. In this paper, we propose, implement, and evaluate a novel approach in which RPC marshalling and policy enforcement are done as a system service rather than as a library linked into each application. Applications specify type information to the RPC system as before, while the RPC service executes policy engines and arbitrates resource use, and then marshals data customized to the underlying network hardware capabilities. Our system, mRPC, also supports live upgrades so that both policy and marshalling code can be updated transparently to application code. Compared with using a sidecar, mRPC speeds up a standard microservice benchmark, DeathStarBench, by up to 2.5$\times$ while having a higher level of policy flexibility and availability. Comment: NSDI 2023
Published: 2023

13. Adaptive and Dynamic Multi-Resolution Hashing for Pairwise Summations

Author: Qin, Lianke, Reddy, Aravind, Song, Zhao, Xu, Zhaozhuo, and Zhuo, Danyang
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning
Abstract: In this paper, we propose Adam-Hash: an adaptive and dynamic multi-resolution hashing data structure for fast pairwise summation estimation. Given a dataset $X \subset \mathbb{R}^d$, a binary function $f:\mathbb{R}^d\times \mathbb{R}^d\to \mathbb{R}$, and a point $y \in \mathbb{R}^d$, the Pairwise Summation Estimate is $\mathrm{PSE}_X(y) := \frac{1}{|X|} \sum_{x \in X} f(x,y)$. For any given dataset $X$, we need to design a data structure such that, given any query point $y \in \mathbb{R}^d$, the data structure approximately estimates $\mathrm{PSE}_X(y)$ in time that is sub-linear in $|X|$. Prior works on this problem have focused exclusively on the case where the dataset is static and the queries are independent. In this paper, we design a hashing-based PSE data structure that works for the more practical dynamic setting, in which insertions, deletions, and replacements of points are allowed. Moreover, our proposed Adam-Hash is also robust to adaptive PSE queries, where an adversary can choose query $q_j \in \mathbb{R}^d$ depending on the outputs of previous queries $q_1, q_2, \dots, q_{j-1}$. Comment: BigData 2022
Published: 2022
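
For reference, the quantity Adam-Hash estimates can be computed exactly by brute force. The sketch below evaluates the definition $\mathrm{PSE}_X(y)$ directly; the function names and the Gaussian choice of f are illustrative assumptions. Every query touches all |X| points, which is the linear cost that sublinear, hashing-based estimators aim to avoid. With f set to a kernel, this is also the KDE quantity in entry 18.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Example binary function f(x, y) = exp(-||x - y||^2 / bandwidth^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / bandwidth ** 2)


def pairwise_summation(points, y, f):
    """Exact PSE_X(y) = (1/|X|) * sum over x in X of f(x, y).

    Each query evaluates f once per point, i.e. time linear in |X|.
    """
    if not points:
        raise ValueError("X must be non-empty")
    return sum(f(x, y) for x in points) / len(points)
```

For X = {(0, 0), (1, 0)} and y = (0, 0), the Gaussian kernel gives (1 + e^{-1}) / 2.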

14. Adore: Differentially Oblivious Relational Database Operators

Author: Qin, Lianke, Jayaram, Rajesh, Shi, Elaine, Song, Zhao, Zhuo, Danyang, and Chu, Shumo
Subjects: Computer Science - Databases, Computer Science - Cryptography and Security
Abstract: There has been a recent effort to apply differential privacy to memory access patterns to enhance data privacy. This is called differential obliviousness. Differential obliviousness is a promising direction because it provides a principled trade-off between performance and the desired level of privacy. To date, it is still an open question whether differential obliviousness can speed up database processing with respect to full obliviousness. In this paper, we present the design and implementation of three major new database operators: selection with projection, grouping with aggregation, and foreign key join. We prove that they satisfy the notion of differential obliviousness. Our differentially oblivious operators have reduced cache complexity, runtime complexity, and output size compared to their state-of-the-art fully oblivious counterparts. We also demonstrate that our implementation of these differentially oblivious operators can outperform their state-of-the-art fully oblivious counterparts by up to $7.4\times$. Comment: VLDB 2023
Published: 2022

15. A Faster $k$-means++ Algorithm

Author: Liang, Jiehao, Sarkhel, Somdeb, Song, Zhao, Yin, Chenbo, Yin, Junze, and Zhuo, Danyang
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning
Abstract: $k$-means++ is an important algorithm for choosing initial cluster centers for the $k$-means clustering algorithm. In this work, we present a new algorithm that can solve the $k$-means++ problem with nearly optimal running time. Given $n$ data points in $\mathbb{R}^d$, the current state-of-the-art algorithm runs in $\widetilde{O}(k)$ iterations, and each iteration takes $\widetilde{O}(nd k)$ time. The overall running time is thus $\widetilde{O}(n d k^2)$. We propose a new algorithm, FastKmeans++, that takes only $\widetilde{O}(nd + nk^2)$ time in total.
Published: 2022
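
For context, classic $k$-means++ seeding samples each new center with probability proportional to the squared distance from a point to its nearest already-chosen center (D² sampling). The sketch below is that textbook procedure, not FastKmeans++; the function names and the seeded `random.Random` are illustrative choices. Each round updates the nearest-center distance of all n points in O(nd) time.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seed(points, k, seed=0):
    """Classic k-means++ (D^2) seeding; returns k initial centers."""
    if not 0 < k <= len(points):
        raise ValueError("need 0 < k <= len(points)")
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    # Squared distance from each point to its nearest chosen center.
    dist2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(dist2) == 0:
            # Every point coincides with a chosen center; any choice is fine.
            new_center = rng.choice(points)
        else:
            # D^2 sampling: probability proportional to squared distance.
            new_center = rng.choices(points, weights=dist2, k=1)[0]
        centers.append(new_center)
        dist2 = [min(d, sq_dist(p, new_center)) for d, p in zip(dist2, points)]
    return centers
```

Because an already-chosen point has weight zero, distinct input points never yield a duplicate center.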

16. Bypass Exponential Time Preprocessing: Fast Neural Network Training via Weight-Data Correlation Preprocessing

Author: Alman, Josh, Liang, Jiehao, Song, Zhao, Zhang, Ruizhe, and Zhuo, Danyang
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Statistics - Machine Learning
Abstract: Over the last decade, deep neural networks have transformed our society, and they are already widely applied in various machine learning applications. State-of-the-art deep neural networks are becoming larger every year to deliver increasing model accuracy, and as a result, model training consumes substantial computing resources and will only consume more in the future. Using current training methods, in each iteration, to process a data point $x \in \mathbb{R}^d$ in a layer, we need to spend $\Theta(md)$ time to evaluate all the $m$ neurons in the layer. This means processing the entire layer takes $\Theta(nmd)$ time for $n$ data points. Recent work [Song, Yang and Zhang, NeurIPS 2021] reduces this time per iteration to $o(nmd)$, but requires exponential time to preprocess either the data or the neural network weights, making it unlikely to be of practical use. In this work, we present a new preprocessing method that simply stores the weight-data correlation in a tree data structure in order to quickly and dynamically detect which neurons fire at each iteration. Our method requires only $O(nmd)$ time in preprocessing and still achieves $o(nmd)$ time per iteration. We complement our new algorithm with a lower bound, proving that, assuming a popular conjecture from complexity theory, one cannot substantially speed up our algorithm for dynamic detection of firing neurons.
Published: 2022
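
To make the $\Theta(md)$ per-point cost concrete, here is the naive way to find which neurons fire for one input, assuming neuron r fires when $\langle w_r, x \rangle > b_r$ (a threshold convention chosen for illustration; the function name is hypothetical). It computes all m inner products of length d; the paper's tree-based preprocessing is meant to avoid this full scan.

```python
def firing_neurons(weights, biases, x):
    """Indices r with <w_r, x> > b_r, found by scanning all m neurons.

    weights: m weight vectors of length d; biases: m thresholds.
    Cost: m inner products of length d, i.e. Theta(m d) per data point.
    """
    return [
        r for r, (w, b) in enumerate(zip(weights, biases))
        if sum(wi * xi for wi, xi in zip(w, x)) > b
    ]
```

Repeating this for all n data points gives the $\Theta(nmd)$ per-iteration cost quoted above.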

17. Training Overparametrized Neural Networks in Sublinear Time

Author: Deng, Yichuan, Hu, Hang, Song, Zhao, Weinstein, Omri, and Zhuo, Danyang
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Statistics - Machine Learning
Abstract: The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI). Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient descent, stochastic gradient descent (SGD) has a prohibitive convergence rate in non-convex settings, both in theory and practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with much faster convergence rates, albeit with higher cost-per-iteration. For a typical neural network with $m=\mathrm{poly}(n)$ parameters and an input batch of $n$ datapoints in $\mathbb{R}^d$, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires $\sim mnd + n^3$ time per iteration. In this paper, we present a novel training method that requires only $m^{1-\alpha} n d + n^3$ amortized time in the same overparametrized regime, where $\alpha \in (0.01,1)$ is some fixed constant. This method relies on a new and alternative view of neural networks as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view will have further applications in the design and analysis of deep neural networks (DNNs).
Published: 2022

18. Dynamic Maintenance of Kernel Density Estimation Data Structure: From Practice to Theory

Author: Liang, Jiehao, Song, Zhao, Xu, Zhaozhuo, Yin, Junze, and Zhuo, Danyang
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Kernel density estimation (KDE) stands out as a challenging task in machine learning. The problem is defined as follows: given a kernel function $f(x,y)$ and a set of points $\{x_1, x_2, \cdots, x_n \} \subset \mathbb{R}^d$, we would like to compute $\frac{1}{n}\sum_{i=1}^{n} f(x_i,y)$ for any query point $y \in \mathbb{R}^d$. Recently, there has been a growing trend of using data structures for efficient KDE. However, the proposed KDE data structures focus on static settings; their robustness to dynamically changing data distributions has not been addressed. In this work, we focus on the dynamic maintenance of KDE data structures with robustness to adversarial queries. In particular, we provide a theoretical framework for KDE data structures. In our framework, the KDE data structures require only subquadratic space. Moreover, our data structure supports dynamic updates of the dataset in sublinear time. Furthermore, it can answer adaptive queries, potentially chosen by an adversary, in sublinear time.
Published: 2022

19. Sublinear Time Algorithm for Online Weighted Bipartite Matching

Author: Hu, Hang, Song, Zhao, Tao, Runzhou, Xu, Zhaozhuo, Yin, Junze, and Zhuo, Danyang
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning
Abstract: Online bipartite matching is a fundamental problem in online algorithms. The goal is to match two sets of vertices to maximize the sum of the edge weights, where, for one set of vertices, each vertex and its corresponding edge weights appear in a sequence. In practical recommendation systems and search engines, the weights are determined by the inner product between the deep representation of a user and the deep representation of an item. Standard online matching must spend $nd$ time linearly scanning all $n$ items to compute the weights (assuming each representation vector has length $d$) and then decide the matching based on the weights. However, in reality $n$ can be very large, e.g., on online e-commerce platforms. Thus, reducing the time to compute weights is a problem of practical significance. In this work, we provide the theoretical foundation for computing the weights approximately. We show that, with our proposed randomized data structures, the weights can be computed in sublinear time while still preserving the competitive ratio of the matching algorithm.
Published: 2022
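
The linear-scan baseline described above can be sketched as a greedy online matcher: each arriving user vector is compared with every still-free item vector and matched to the one with the largest positive inner product. Greedy matching and the positive-weight rule are assumptions made for illustration (the abstract does not fix a particular matching rule), and the function name is hypothetical. The inner loop is the $nd$-per-arrival scan that the paper's data structures approximate in sublinear time.

```python
def greedy_online_matching(users, items):
    """Match users (arriving in order) to free items by largest inner product.

    Returns {user_index: item_index}. Each arrival scans all n items of
    dimension d, costing O(n d) time.
    """
    taken = [False] * len(items)
    matching = {}
    for u, user in enumerate(users):
        best_item, best_weight = None, 0.0
        for i, item in enumerate(items):
            if taken[i]:
                continue
            weight = sum(a * b for a, b in zip(user, item))  # edge weight
            if weight > best_weight:
                best_item, best_weight = i, weight
        if best_item is not None:  # leave the user unmatched if no positive edge
            taken[best_item] = True
            matching[u] = best_item
    return matching
```

The decision for each user is irrevocable, as in the online setting; a later user with a stronger affinity for an already-taken item simply goes unmatched.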

20. Dissecting Service Mesh Overheads

Author: Zhu, Xiangfeng, She, Guozhen, Xue, Bowen, Zhang, Yu, Zhang, Yongsu, Zou, Xuan Kelvin, Duan, Xiongchun, He, Peng, Krishnamurthy, Arvind, Lentz, Matthew, Zhuo, Danyang, and Mahajan, Ratul
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Networking and Internet Architecture
Abstract: Service meshes play a central role in the modern application ecosystem by providing an easy and flexible way to connect the different services that form a distributed application. However, because of the way they interpose on application traffic, they can substantially increase application latency and resource consumption. We develop a decompositional approach and a tool, called MeshInsight, to systematically characterize the overhead of service meshes and to help developers quantify overhead in deployment scenarios of interest. Using MeshInsight, we confirm that service meshes can have high overhead (up to 185% higher latency and up to 92% more virtual CPU cores for our benchmark applications), but the severity is intimately tied to how they are configured and to the application workload. The primary contributors to overhead also vary with the configuration. IPC (inter-process communication) and socket writes dominate when the service mesh operates as a TCP proxy, but protocol parsing dominates when it operates as an HTTP proxy. MeshInsight also enables us to study the end-to-end impact of optimizations to service meshes. We show that not all seemingly promising optimizations lead to a notable overhead reduction in realistic settings.
Published: 2022

21. Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures

Author: Wu, Yongji, Lentz, Matthew, Zhuo, Danyang, and Lu, Yao
Subjects: Computer Science - Machine Learning, Computer Science - Databases, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: With the advent of ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not take into account the more complex and tiered computing infrastructure that includes edge devices, local hubs, edge datacenters, and cloud datacenters. On the other hand, recent AutoML efforts have provided viable solutions for model compression, pruning, and quantization for heterogeneous environments; for a given machine learning model, we can now easily find or even generate a series of models with different tradeoffs between accuracy and efficiency. We design and implement JellyBean, a system for serving and optimizing machine learning inference workflows on heterogeneous infrastructures. Given service-level objectives (e.g., throughput, accuracy), JellyBean picks the most cost-efficient models that meet the accuracy target and decides how to deploy them across different tiers of infrastructure. Evaluations show that JellyBean reduces the total serving cost of visual question answering by up to 58%, and of vehicle tracking from the NVIDIA AI City Challenge by up to 36%, compared with state-of-the-art model selection and worker assignment solutions. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) by up to 5x in serving costs.
Published: 2022

22. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Author: Zheng, Lianmin, Li, Zhuohan, Zhang, Hao, Zhuang, Yonghao, Chen, Zhifeng, Huang, Yanping, Wang, Yida, Xu, Yuanzhong, Zhuo, Danyang, Xing, Eric P., Gonzalez, Joseph E., and Stoica, Ion
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Programming Languages
Abstract: Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on this view, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows that Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa. Comment: OSDI 2022
Published: 2022

23. Fast Graph Neural Tangent Kernel via Kronecker Sketching

Author: Jiang, Shunhua, Man, Yunze, Song, Zhao, Yu, Zheng, and Zhuo, Danyang
Subjects: Computer Science - Machine Learning
Abstract: Many deep learning tasks have to deal with graphs (e.g., protein structures, social networks, source code abstract syntax trees). Due to the importance of these tasks, people have turned to Graph Neural Networks (GNNs) as the de facto method for learning on graphs. GNNs have become widely applied due to their convincing performance. Unfortunately, one major barrier to using GNNs is that they require substantial time and resources to train. A recent method for learning on graph data is the Graph Neural Tangent Kernel (GNTK) [Du, Hou, Salakhutdinov, Poczos, Wang and Xu 19]. GNTK is an application of the Neural Tangent Kernel (NTK) [Jacot, Gabriel and Hongler 18] (a kernel method) to graph data, and solving NTK regression is equivalent to using gradient descent to train an infinitely wide neural network. The key benefit of using GNTK is that, as with any kernel method, GNTK's parameters can be solved directly in a single step, which avoids time-consuming gradient descent. Meanwhile, sketching has been increasingly used to speed up various optimization problems, including kernel regression. Given a kernel matrix of $n$ graphs, using sketching to solve kernel regression can reduce the running time to $o(n^3)$. Unfortunately, such methods usually require extensive knowledge about the kernel matrix beforehand, while in the case of GNTK we find that constructing the kernel matrix already takes $O(n^2N^4)$ time, assuming each graph has $N$ nodes. The kernel matrix construction time can become a major performance bottleneck as the graph size $N$ increases. A natural question is thus whether we can speed up kernel matrix construction to improve GNTK regression's end-to-end running time. This paper provides the first algorithm to construct the kernel matrix in $o(n^2N^3)$ running time. Comment: AAAI 2022
Published: 2021

24. TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Author: Li, Zhuohan, Zhuang, Siyuan, Guo, Shiyuan, Zhuo, Danyang, Zhang, Hao, Song, Dawn, and Stoica, Ion
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension orthogonal to existing model-parallel approaches: thanks to their autoregressive property, Transformer-based language models allow pipeline parallelism within a single training sequence. This enables a more fine-grained pipeline than previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up training by 5.0x for the largest GPT-3 model, with 175 billion parameters, on an AWS cluster of 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe. Comment: ICML 2021
Published: 2021

25. InstaHide's Sample Complexity When Mixing Two Private Images

Author: Huang, Baihe, Song, Zhao, Tao, Runzhou, Yin, Junze, Zhang, Ruizhe, and Zhuo, Danyang
Subjects: Computer Science - Machine Learning, Computer Science - Computational Complexity, Computer Science - Cryptography and Security, Computer Science - Data Structures and Algorithms, Statistics - Machine Learning
Abstract: Training neural networks usually requires large amounts of sensitive training data, so protecting the privacy of that data has become a critical topic in deep learning research. InstaHide is a state-of-the-art scheme to protect training data privacy with only minor effects on test accuracy, and its security has become a salient question. In this paper, we systematically study recent attacks on InstaHide and present a unified framework to understand and analyze these attacks. We find that existing attacks either lack a provable guarantee or can only recover a single private image. On the current InstaHide challenge setup, where each InstaHide image is a mixture of two private images, we present a new algorithm to recover all the private images with a provable guarantee and optimal sample complexity. In addition, we provide a computational hardness result on retrieving all InstaHide images. Our results demonstrate that InstaHide is not information-theoretically secure but is computationally secure in the worst case, even when mixing two private images.
Published: 2020

26. On InstaHide, Phase Retrieval, and Sparse Matrix Factorization

Author: Chen, Sitan, Li, Xiaoxiao, Song, Zhao, and Zhuo, Danyang
Subjects: Computer Science - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Data Structures and Algorithms, Statistics - Machine Learning
Abstract: In this work, we examine the security of InstaHide, a scheme recently proposed by [Huang, Song, Li and Arora, ICML'20] for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the distributed learners, InstaHide takes a convex combination of private feature vectors and randomly flips the sign of each entry of the resulting vector with probability 1/2. A salient question is whether this scheme is secure in any provable sense, perhaps under a plausible hardness assumption and assuming the distributions generating the public and private data satisfy certain properties. We show that the answer appears to be quite subtle and closely related to the average-case complexity of a new multi-task, missing-data version of the classic problem of phase retrieval. Motivated by this connection, we design a provable algorithm that can recover private vectors using only the public vectors and the synthetic vectors generated by InstaHide, under the assumption that the private and public vectors are isotropic Gaussian. Comment: 30 pages, to appear in ICLR 2021, v2: updated discussion of follow-up work
Published: 2020

27. Ansor: Generating High-Performance Tensor Programs for Deep Learning

Author: Zheng, Lianmin, Jia, Chengfan, Sun, Minmin, Wu, Zhao, Yu, Cody Hao, Haj-Ali, Ameer, Wang, Yida, Yang, Jun, Zhuo, Danyang, Sen, Koushik, Gonzalez, Joseph E., and Stoica, Ion
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Performance, Computer Science - Programming Languages, Statistics - Machine Learning
Abstract: High-performance tensor programs are crucial for the efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering effort to develop platform-specific optimization code or fall short of finding high-performance programs due to a restricted search space and ineffective exploration strategy. We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a task scheduler to simultaneously optimize multiple subgraphs in deep neural networks. We show that Ansor improves the execution performance of deep neural networks relative to the state of the art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to $3.8\times$, $2.6\times$, and $1.7\times$, respectively. Comment: OSDI 2020
Published: 2020

28. High Velocity Kernel File Systems with Bento

Author: Miller, Samantha, Zhang, Kaiyuan, Chen, Mengqi, Jennings, Ryan, Chen, Ang, Zhuo, Danyang, and Anderson, Tom
Subjects: Computer Science - Operating Systems
Abstract: High development velocity is critical for modern systems. This is especially true for Linux file systems, which are seeing increased pressure from new storage devices and new demands on storage systems. However, high-velocity Linux kernel development is challenging due to the ease of introducing bugs, the difficulty of testing and debugging, and the lack of support for redeployment without service disruption. Existing approaches to high-velocity development of Linux file systems have major downsides, such as the high performance penalty of FUSE file systems, that slow the deployment cycle for new file system functionality. We propose Bento, a framework for high-velocity development of Linux kernel file systems. It enables file systems written in safe Rust to be installed in the Linux kernel, with errors largely sandboxed to the file system. Bento file systems can be replaced with no disruption to running applications, allowing daily or weekly upgrades in a cloud server setting. Bento also supports userspace debugging. We implement a simple file system using Bento and show that it performs similarly to VFS-native ext4 on a variety of benchmarks and outperforms a FUSE version by 7x on 'git clone'. We also show that we can dynamically add file provenance tracking to a running kernel file system with only 15ms of service interruption. Comment: 14 pages, 6 figures, to be published in FAST 2021
Published: 2020

29. Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems

Author: Zhuang, Siyuan, Li, Zhuohan, Zhuo, Danyang, Wang, Stephanie, Liang, Eric, Nishihara, Robert, Moritz, Philipp, and Stoica, Ion
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Computer Science - Networking and Internet Architecture
Abstract: Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and do not provide fault tolerance. We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute them efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models, workloads that are difficult to execute efficiently with traditional collective communication, by up to 7.8x, 3.9x, and 3.3x, respectively. Comment: SIGCOMM 2021
Published: 2020

30. Volur: Concurrent Edge/Core Route Control in Data Center Networks

Author: Zhang, Qiao, Zhuo, Danyang, Liu, Vincent, Lapukhov, Petr, Peter, Simon, Krishnamurthy, Arvind, and Anderson, Thomas
Subjects: Computer Science - Networking and Internet Architecture
Abstract: A perennial question in computer networks is where to place functionality among the components of a distributed computer system. In data centers, one option is to move all intelligence to the edge, essentially relegating switches and middleboxes, regardless of their programmability, to simple static routing policies. Another is to add more intelligence to the middle of the network in the hope that it can handle any issue that arises. This paper presents an architecture, called Volur, that provides a third option by facilitating the coexistence of an intelligent network with an intelligent edge. The key architectural principle of Volur is predictability of the network. We describe the key design requirements and show through case studies how our approach facilitates more democratic innovation in all parts of the network. We also demonstrate the practicality of our architecture by describing how to implement it on top of existing hardware and by deploying a prototype on top of a large production data center.
Published: 2018

31. Curator: Efficient Indexing for Multi-Tenant Vector Databases

Author: Jin, Yicheng, Wu, Yongji, Hu, Wenjun, Maggs, Bruce M., Zhang, Xiao, and Zhuo, Danyang
Abstract: Vector databases have emerged as key enablers for bridging intelligent applications with unstructured data, providing generic search and management support for embedding vectors extracted from raw unstructured data. As multiple data users can share the same database infrastructure, multi-tenancy support for vector databases is increasingly desirable. This hinges on an efficient filtered search operation, i.e., querying only the vectors accessible to a particular tenant. Multi-tenancy in vector databases is currently achieved by building either a single index shared among all tenants or a per-tenant index. The former optimizes for memory efficiency at the expense of search performance, while the latter does the opposite. Instead, this paper presents Curator, an in-memory vector index design tailored for multi-tenant queries that simultaneously achieves two conflicting goals: low memory overhead and high performance for queries, vector insertion, and deletion. Curator indexes each tenant's vectors with a tenant-specific clustering tree and encodes these trees compactly as sub-trees of a shared clustering tree. Each tenant's clustering tree adapts dynamically to its unique vector distribution while maintaining a low per-tenant memory footprint. Our evaluation, based on two widely used data sets, confirms that Curator delivers search performance on par with per-tenant indexing while keeping memory consumption at the same level as metadata filtering on a single, shared index.
Published: 2024

32. Application Defined Networks

Author: Zhu, Xiangfeng, Deng, Weixin, Liu, Banruo, Chen, Jingrong, Wu, Yongji, Anderson, Thomas, Krishnamurthy, Arvind, Mahajan, Ratul, and Zhuo, Danyang
Published: 2023

33. Dissecting Overheads of Service Mesh Sidecars

Author: Zhu, Xiangfeng, She, Guozhen, Xue, Bowen, Zhang, Yu, Zhang, Yongsu, Zou, Xuan Kelvin, Duan, XiongChun, He, Peng, Krishnamurthy, Arvind, Lentz, Matthew, Zhuo, Danyang, and Mahajan, Ratul
Published: 2023

34. Towards a Manageable Intra-Host Network

Author: Kong, Xinhao, Lou, Jiaqi, Bai, Wei, Kim, Nam Sung, and Zhuo, Danyang
Published: 2023

35. Fast Graph Neural Tangent Kernel via Kronecker Sketching

Author: Jiang, Shunhua, Man, Yunze, Song, Zhao, Yu, Zheng, and Zhuo, Danyang
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, General Medicine, Machine Learning (cs.LG)
Abstract: Many deep learning tasks have to deal with graphs (e.g., protein structures, social networks, source code abstract syntax trees). Due to the importance of these tasks, people have turned to Graph Neural Networks (GNNs) as the de facto method for learning on graphs. GNNs have become widely applied due to their convincing performance. Unfortunately, one major barrier to using GNNs is that they require substantial time and resources to train. A recent method for learning on graph data is the Graph Neural Tangent Kernel (GNTK) [Du, Hou, Salakhutdinov, Poczos, Wang and Xu 19]. GNTK is an application of the Neural Tangent Kernel (NTK) [Jacot, Gabriel and Hongler 18] (a kernel method) to graph data, and solving NTK regression is equivalent to using gradient descent to train an infinitely wide neural network. The key benefit of using GNTK is that, as with any kernel method, GNTK's parameters can be solved directly in a single step, which avoids time-consuming gradient descent. Meanwhile, sketching has been increasingly used to speed up various optimization problems, including kernel regression. Given a kernel matrix of $n$ graphs, using sketching to solve kernel regression can reduce the running time to $o(n^3)$. Unfortunately, such methods usually require extensive knowledge about the kernel matrix beforehand, while in the case of GNTK we find that constructing the kernel matrix already takes $O(n^2N^4)$ time, assuming each graph has $N$ nodes. The kernel matrix construction time can become a major performance bottleneck as the graph size $N$ increases. A natural question is thus whether we can speed up kernel matrix construction to improve GNTK regression's end-to-end running time. This paper provides the first algorithm to construct the kernel matrix in $o(n^2N^3)$ running time. Comment: AAAI 2022
Published: 2022

36. RDMA Congestion Control: It Is Only for the Compliant

Author: Snyder, John, Lebeck, Alvin R., and Zhuo, Danyang
Published: 2023

37. Adaptive and Dynamic Multi-Resolution Hashing for Pairwise Summations

Author: Qin, Lianke, Reddy, Aravind, Song, Zhao, Xu, Zhaozhuo, and Zhuo, Danyang
Published: 2022

38. Adore

Author: Qin, Lianke, Jayaram, Rajesh, Shi, Elaine, Song, Zhao, Zhuo, Danyang, and Chu, Shumo
Published: 2022

39. Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures

Author: Wu, Yongji, Lentz, Matthew, Zhuo, Danyang, and Lu, Yao
Published: 2022

40. Rearchitecting in-memory object stores for low latency

Author: Zhuo, Danyang, Zhang, Kaiyuan, Li, Zhuohan, Zhuang, Siyuan, Wang, Stephanie, Chen, Ang, and Stoica, Ion
Published: 2021

41. Hoplite

Author: Zhuang, Siyuan, Li, Zhuohan, Zhuo, Danyang, Wang, Stephanie, Liang, Eric, Nishihara, Robert, Moritz, Philipp, and Stoica, Ion
Published: 2021

42. Differentially Oblivious Database Joins: Overcoming the Worst-Case Curse of Fully Oblivious Algorithms

Author: Chu, Shumo, Zhuo, Danyang, Shi, Elaine, and Chan, T-H. Hubert
Abstract: Numerous high-profile works have shown that access patterns to even encrypted databases can leak secret information and sometimes even lead to reconstruction of the entire database. To thwart access pattern leakage, the literature has focused on oblivious algorithms, where obliviousness requires that the access patterns leak nothing about the input data. In this paper, we consider the Join operator, an important database primitive that has been extensively studied and optimized. Unfortunately, any fully oblivious Join algorithm would require always padding the result to the worst-case length, which is quadratic in the data size N. In comparison, an insecure baseline incurs only O(R + N) cost, where R is the true result length, and in the common case in practice, R is relatively short. As a typical example, when R = O(N), any fully oblivious algorithm must inherently incur a prohibitive, N-fold slowdown relative to the insecure baseline. Indeed, the (non-private) database and algorithms literature invariably focuses on studying the instance-specific rather than worst-case performance of database algorithms. Unfortunately, the stringent notion of full obliviousness precludes the design of efficient algorithms with non-trivial instance-specific performance. To overcome this worst-case performance barrier of full obliviousness and enable algorithms with good instance-specific performance, we consider a relaxed notion of access pattern privacy called $(\epsilon, \delta)$-differential obliviousness (DO), originally proposed in the seminal work of Chan et al. (SODA'19). Rather than insisting that the access patterns leak no information whatsoever, the relaxed DO notion requires that the access patterns satisfy $(\epsilon, \delta)$-differential privacy.
We show that by adopting the relaxed DO notion, we can obtain efficient database Join mechanisms whose instance-specific performance approximately matches the insecure baseline, while still offering a meaningful notion of privacy to individual users.
Published: 2021

43. Gallium

Author: Zhang, Kaiyuan, Zhuo, Danyang, and Krishnamurthy, Arvind
Published: 2020

44. Practical Safe Linux Kernel Extensibility

Author: Miller, Samantha, Zhang, Kaiyuan, Zhuo, Danyang, Xu, Shibin, Krishnamurthy, Arvind, and Anderson, Thomas
Published: 2019

45. Understanding and Mitigating Packet Corruption in Data Center Networks

Author: Zhuo, Danyang, Ghobadi, Monia, Mahajan, Ratul, Förster, Klaus-Tycho, Krishnamurthy, Arvind, and Anderson, Thomas
Published: 2017

46. Rack-level Congestion Control

Author: Zhuo, Danyang, Zhang, Qiao, Liu, Vincent, Krishnamurthy, Arvind, and Anderson, Thomas
Published: 2016

47. Canaries in the Network

Author: Zhuo, Danyang, Zhang, Qiao, Yang, Xin, and Liu, Vincent
Published: 2016

48. Subways

Author: Liu, Vincent, Zhuo, Danyang, Peter, Simon, Krishnamurthy, Arvind, and Anderson, Thomas
Published: 2015

49. Machine fault tolerance for reliable datacenter systems

Author: Zhuo, Danyang, Zhang, Qiao, Ports, Dan R. K., Krishnamurthy, Arvind, and Anderson, Thomas
Published: 2014
