Author: "Zhu, Yibo" - Searchworks@Jio Institute Digital Library Search Results

465 results on '"Zhu, Yibo"'

1. InfinitePOD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

Author: Shou, Chenchen, Liu, Guyue, Nie, Hao, Meng, Huaiyu, Zhou, Yu, Jiang, Yimin, Lv, Wenqing, Xu, Yelong, Lu, Yuanwei, Chen, Zhang, Yu, Yanbo, Shen, Yichen, Zhu, Yibo, and Jiang, Daxin
Subjects: Computer Science - Networking and Internet Architecture, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 take a middle-ground approach by leveraging Optical Circuit Switches, but the fault explosion radius remains large at the cube level (e.g., 64 TPUs). We propose InfinitePOD, a novel transceiver-centric HBD architecture that unifies connectivity and dynamic switching at the transceiver level using Optical Circuit Switching (OCS). By embedding OCS within each transceiver, InfinitePOD achieves reconfigurable point-to-multipoint connectivity, allowing the topology to adapt into variable-size rings. This design provides: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node; and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a Silicon Photonic (SiPh) based low-cost OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm maximizing GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation demonstrates that InfinitePOD costs 31% as much as NVL-72, achieves a near-zero GPU waste ratio (over one order of magnitude lower than NVL-72 and TPUv4) and near-zero cross-ToR traffic when node fault ratios are under 7%, and improves Model FLOPs Utilization by 3.37x compared to NVIDIA DGX (8 GPUs per node).
Published: 2025

2. RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion

Author: Zhong, Yinmin, Zhang, Zili, Wu, Bingyang, Liu, Shengyu, Chen, Yukun, Wan, Changyi, Hu, Hanpeng, Xia, Lei, Ming, Ranchen, Zhu, Yibo, and Jin, Xin
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Reinforcement Learning from Human Feedback (RLHF) enhances the alignment between LLMs and human preference. The workflow of RLHF typically involves several models and tasks in a series of distinct stages. Existing RLHF training systems view each task as the smallest execution unit thus overlooking the opportunities for subtask-level optimizations. Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage, and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks, and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to mitigate the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches. By leveraging the intuition that pipeline execution can be essentially complemented by another pipeline, RLHFuse performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored for each stage of RLHF, making it efficient and scalable for our internal product usage. We evaluate RLHFuse on various popular LLMs and the results show that RLHFuse increases the training throughput by up to 3.7x, compared to existing state-of-the-art systems.
Published: 2024

3. DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models

Author: Zhang, Zili, Zhong, Yinmin, Ming, Ranchen, Hu, Hanpeng, Sun, Jianjian, Ge, Zheng, Zhu, Yibo, and Jin, Xin
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Multimodal large language models (LLMs) have demonstrated significant potential in a wide range of AI applications. Yet, training multimodal LLMs suffers from low efficiency and scalability, due to the inherent model heterogeneity and data heterogeneity across different modalities. We present DistTrain, an efficient and adaptive framework to reform the training of multimodal large language models on large-scale clusters. The core of DistTrain is the disaggregated training technique that exploits the characteristics of multimodal LLM training to achieve high efficiency and scalability. Specifically, it leverages disaggregated model orchestration and disaggregated data reordering to address model and data heterogeneity respectively. We also tailor system optimization for multimodal LLM training to overlap GPU communication and computation. We evaluate DistTrain across different sizes of multimodal LLMs on a large-scale production cluster with thousands of GPUs. The experimental results show that DistTrain achieves 54.7% Model FLOPs Utilization (MFU) when training a 72B multimodal LLM on 1172 GPUs and outperforms Megatron-LM by up to 2.2x on throughput. The ablation study shows the main techniques of DistTrain are both effective and lightweight.
Published: 2024

4. QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices

Author: Zhao, Juntao, Wan, Borui, Peng, Yanghua, Lin, Haibin, Zhu, Yibo, and Wu, Chuan
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: A number of production deep learning clusters have attempted to explore inference hardware for DNN training during off-peak serving hours, when many inference GPUs sit idle. Conducting DNN training with a combination of heterogeneous training and inference GPUs, known as hybrid device training, presents considerable challenges due to disparities in compute capability and significant differences in memory capacity. We propose QSync, a training system that enables efficient synchronous data-parallel DNN training over hybrid devices by strategically exploiting quantized operators. According to each device's available resource capacity, QSync selects a quantization-minimized setting for operators in the distributed DNN training graph, minimizing model accuracy degradation but keeping the training efficiency brought by quantization. We carefully design a predictor with a bi-directional mixed-precision indicator to reflect the sensitivity of DNN layers on fixed-point and floating-point low-precision operators, a replayer with a neighborhood-aware cost mapper to accurately estimate the latency of distributed hybrid mixed-precision training, and then an allocator that efficiently synchronizes workers with minimized model accuracy degradation. QSync bridges the computational graph on PyTorch to an optimized backend for quantization kernel performance and flexible support for various GPU architectures. Extensive experiments show that QSync's predictor can accurately simulate distributed mixed-precision training with <5% error, with a consistent 0.27-1.03% accuracy improvement over the from-scratch training tasks compared to uniform precision. Comment: IPDPS 24
Published: 2024

5. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Author: Zhong, Yinmin, Liu, Shengyu, Chen, Junda, Hu, Jianbo, Zhu, Yibo, Liu, Xuanzhe, Jin, Xin, and Zhang, Hao
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: DistServe improves the performance of large language model (LLM) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or 12.6x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests. Comment: OSDI 2024
Published: 2024

6. MuxFlow: efficient GPU sharing in production-level clusters with more than 10000 GPUs

Author: Liu, Xuanzhe, Zhao, Yihao, Liu, Shufan, Li, Xiang, Zhu, Yibo, Liu, Xin, and Jin, Xin
Published: 2024
Full Text: View/download PDF

7. CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs

Author: Hu, Hanpeng, Su, Junwei, Zhao, Juntao, Peng, Yanghua, Zhu, Yibo, Lin, Haibin, and Wu, Chuan
Subjects: Computer Science - Machine Learning, Computer Science - Performance
Abstract: Deep Neural Networks (DNNs) have shown excellent performance in a wide range of machine learning applications. Knowing the latency of running a DNN model or tensor program on a specific device is useful in various tasks, such as DNN graph- or tensor-level optimization and device selection. Considering the large space of DNN models and devices that impedes direct profiling of all combinations, recent efforts focus on building a predictor to model the performance of DNN models on different devices. However, none of the existing attempts have achieved a cost model that can accurately predict the performance of various tensor programs while supporting both training and inference accelerators. We propose CDMPP, an efficient tensor program latency prediction framework for both cross-model and cross-device prediction. We design an informative but efficient representation of tensor programs, called compact ASTs, and a pre-order-based positional encoding method, to capture the internal structure of tensor programs. We develop a domain-adaption-inspired method to learn domain-invariant representations and devise a KMeans-based sampling algorithm, for the predictor to learn from different domains (i.e., different DNN operators and devices). Our extensive experiments on a diverse range of DNN models and devices demonstrate that CDMPP significantly outperforms state-of-the-art baselines with 14.03% and 10.85% prediction error for cross-model and cross-device prediction, respectively, and one order of magnitude higher training efficiency. The implementation and the expanded dataset are available at https://github.com/joapolarbear/cdmpp. Comment: Accepted by EuroSys 2024
Published: 2023
Full Text: View/download PDF

8. Collie: Finding Performance Anomalies in RDMA Subsystems

Author: Kong, Xinhao, Zhu, Yibo, Zhou, Huaping, Jiang, Zhuo, Ye, Jianxi, Guo, Chuanxiong, and Zhuo, Danyang
Subjects: Computer Science - Networking and Internet Architecture
Abstract: High-speed RDMA networks are getting rapidly adopted in the industry for their low latency and reduced CPU overheads. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can potentially trigger abnormal performance behaviors (e.g., unexpectedly low throughput, PFC pause frame storm). We design and implement Collie, a tool for users to systematically uncover performance anomalies in RDMA subsystems without the need to access hardware internal designs. Instead of individually testing each hardware device (e.g., NIC, memory, PCIe), Collie is holistic, constructing a comprehensive search space for application workloads. Collie then uses simulated annealing to drive RDMA-related performance and diagnostic counters to extreme value regions to find workloads that can trigger performance anomalies. We evaluate Collie on combinations of various RDMA NIC, CPU, and other hardware components. Collie found 15 new performance anomalies. All of them are acknowledged by the hardware vendors. 7 of them have been fixed since we reported them. We also present our experience in using Collie to avoid performance anomalies for an RDMA RPC library and an RDMA distributed machine learning framework. Comment: NSDI 2022
Published: 2023
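
Illustration (not taken from the paper): the abstract for record 8 says simulated annealing drives counters toward extreme values. The sketch below shows generic simulated annealing over a toy two-parameter workload space. The `toy_counter` function, the parameter names, and `BOUNDS` are hypothetical stand-ins, not Collie's actual search space or counters.

```python
import math
import random

# Hypothetical workload space: (log2 of message size, number of queue pairs).
BOUNDS = ((6, 20), (1, 16))

def toy_counter(workload):
    """Stand-in for a diagnostic counter; peaks at an arbitrary workload."""
    size_exp, qps = workload
    return -((size_exp - 12) ** 2 + (qps - 5) ** 2)

def neighbor(workload, rng):
    """Move one randomly chosen parameter by one step, clamped to its bounds."""
    dim = rng.randrange(len(workload))
    lo, hi = BOUNDS[dim]
    moved = list(workload)
    moved[dim] = min(hi, max(lo, moved[dim] + rng.choice((-1, 1))))
    return tuple(moved)

def anneal(score, propose, start, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Maximize `score`, accepting worse candidates with a temperature-dependent probability."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    temp = t0
    for _ in range(steps):
        cand = propose(current, rng)
        cand_score = score(cand)
        delta = cand_score - current_score
        # Always accept improvements; accept regressions with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = cand, cand_score
        if current_score > best_score:
            best, best_score = current, current_score
        temp = max(temp * cooling, 1e-9)
    return best, best_score

best, best_score = anneal(toy_counter, neighbor, start=(6, 1))
print(best, best_score)
```

In a real setting the score would come from running the workload and reading hardware counters, which is far more expensive than this toy function.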

9. MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters

Author: Zhao, Yihao, Liu, Xin, Liu, Shufan, Li, Xiang, Zhu, Yibo, Huang, Gang, Liu, Xuanzhe, and Jin, Xin
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Large-scale GPU clusters are widely-used to speed up both latency-critical (online) and best-effort (offline) deep learning (DL) workloads. However, most DL clusters either dedicate each GPU to one workload or share workloads in time, leading to very low GPU resource utilization. We present MuxFlow, the first production cluster system that supports efficient and safe space-sharing for DL workloads. NVIDIA MPS provides an opportunity to share multiple workloads in space on widely-deployed NVIDIA GPUs, but it cannot guarantee the performance and safety of online workloads. MuxFlow introduces a two-level protection mechanism for memory and computation to guarantee the performance of online workloads. Based on our practical error analysis, we design a mixed error-handling mechanism to guarantee the safety of online workloads. MuxFlow further proposes dynamic streaming multiprocessor (SM) allocation and matching-based scheduling to improve the efficiency of offline workloads. MuxFlow has been deployed at CompanyX's clusters with more than 20,000 GPUs. The deployment results indicate that MuxFlow substantially improves the GPU utilization from 26% to 76%, SM activity from 16% to 33%, and GPU memory from 42% to 48%.
Published: 2023

10. Pseudomonas aeruginosa regulator PvrA binds simultaneously to multiple pseudo-palindromic sites for efficient transcription activation

Author: Zhu, Yibo, Luo, Bingnan, Mou, Xingyu, Song, Yingjie, Zhou, Yonghong, Luo, Yongbo, Sun, Bo, Luo, Youfu, Tang, Hong, Su, Zhaoming, and Bao, Rui
Published: 2024
Full Text: View/download PDF

11. Accelerating Distributed MoE Training and Inference with Lina

Author: Li, Jiamin, Jiang, Yimin, Zhu, Yibo, Wang, Cong, and Xu, Hong
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Scaling model parameters improves model quality at the price of high computation overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) architecture, have sub-linear scaling of computation cost with model size, thus providing opportunities to train and serve a larger model at lower cost than their dense counterparts. However, distributed MoE training and inference is inefficient, mainly due to the interleaved all-to-all communication during model computation. This paper makes two main contributions. First, we systematically analyze all-to-all overhead in distributed MoE and present the main causes for it to be the bottleneck in training and inference, respectively. Second, we design and build Lina to address the all-to-all bottleneck head-on. Lina opportunistically prioritizes all-to-all over the concurrent allreduce whenever feasible using tensor partitioning, so all-to-all and training step time is improved. Lina further exploits the inherent pattern of expert selection to dynamically schedule resources during inference, so that the transfer size and bandwidth of all-to-all across devices are balanced amid the highly skewed expert popularity in practice. Experiments on an A100 GPU testbed show that Lina reduces the training step time by up to 1.73x and reduces the 95%ile inference time by an average of 1.63x over the state-of-the-art systems.
Published: 2022

12. ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs

Author: Zhai, Yujia, Jiang, Chengquan, Wang, Leyuan, Jia, Xiaoying, Zhang, Shang, Chen, Zizhong, Liu, Xin, and Zhu, Yibo
Subjects: Computer Science - Machine Learning
Abstract: Transformers have become keystone models in natural language processing over the past decade. They have achieved great popularity in deep learning applications, but the increasing sizes of the parameter spaces required by transformer models generate a commensurate need to accelerate performance. Natural language processing problems are also routinely faced with variable-length sequences, as word counts commonly vary among sentences. Existing deep learning frameworks pad variable-length sequences to a maximal length, which adds significant memory and computational overhead. In this paper, we present ByteTransformer, a high-performance transformer boosted for variable-length inputs. We propose a padding-free algorithm that liberates the entire transformer from redundant computations on zero padded tokens. In addition to algorithmic-level optimization, we provide architecture-aware optimizations for transformer functional modules, especially the performance-critical algorithm Multi-Head Attention (MHA). Experimental results on an NVIDIA A100 GPU with variable-length sequence inputs validate that our fused MHA outperforms PyTorch by 6.13x. The end-to-end performance of ByteTransformer for a forward BERT transformer surpasses state-of-the-art transformer frameworks, such as PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer, Microsoft DeepSpeed-Inference and NVIDIA FasterTransformer, by 87%, 131%, 138%, 74% and 55%, respectively. We also demonstrate the general applicability of our optimization methods to other BERT-like models, including ALBERT, DistilBERT, and DeBERTa. Comment: Accepted at IPDPS 2023
Published: 2022
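
Illustration (not taken from the paper): the abstract for record 12 contrasts padding variable-length sequences to a maximal length with a padding-free layout. One common way to avoid padding, assumed here for illustration, is to concatenate the sequences and keep prefix-sum offsets so each one can be located again. The helper names `pack`, `unpack`, and `padded_size` are hypothetical.

```python
from itertools import accumulate

def pack(sequences):
    """Concatenate variable-length sequences and record prefix-sum offsets."""
    lengths = [len(seq) for seq in sequences]
    # offsets[i] is where sequence i starts; offsets[-1] is the total token count.
    offsets = [0] + list(accumulate(lengths))
    packed = [token for seq in sequences for token in seq]
    return packed, offsets

def unpack(packed, offsets):
    """Recover the original sequences from the packed buffer."""
    return [packed[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

def padded_size(sequences):
    """Number of slots a pad-to-max layout would allocate for the same batch."""
    return len(sequences) * max((len(seq) for seq in sequences), default=0)

batch = [[1, 2, 3], [4], [5, 6]]
packed, offsets = pack(batch)
print(len(packed), "tokens packed versus", padded_size(batch), "padded slots")
```

The packed layout stores only real tokens, so any per-token computation skips the padded positions entirely; attention still needs the offsets to keep sequences from attending to each other.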

13. A comprehensive review on treatment and recovery of rare earth elements from wastewater: Current knowledge and future perspectives

Author: Li, Zhonghong, Zhu, Yibo, and Yao, Jiaqi
Published: 2024
Full Text: View/download PDF

14. Insights into the catalytic mechanism of archaeal peptidoglycan endoisopeptidases from methanogenic phages

Author: Guo, Leizhou, Zhu, Yibo, Zhao, Ninglin, Leng, Huan, Wang, Shuxin, Yang, Qing, Zhao, Pengyan, Chen, Yi, Cha, Guihong, Bai, Liping, and Bao, Rui
Published: 2025
Full Text: View/download PDF

15. AQP3-liposome@GelMA promotes overloaded-induced degenerated disc regeneration via IBSP/ITG αVβ3/AKT pathway

Author: Hu, Junxian, Zhu, Yibo, Li, Xiaoxiao, Pang, Zeyu, Li, Xiangwei, Zhang, Huilin, Wang, Yiyang, Li, Pei, and Zhou, Qiang
Published: 2025
Full Text: View/download PDF

16. ByteComp: Revisiting Gradient Compression in Distributed Training

Author: Wang, Zhuang, Lin, Haibin, Zhu, Yibo, and Ng, T. S. Eugene
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL). However, it is challenging to find the optimal compression strategy for applying GC to DDL because of the intricate interactions among tensors. To fully unleash the benefits of GC, two questions must be addressed: 1) How to express all compression strategies and the corresponding interactions among tensors of any DDL training job? 2) How to quickly select a near-optimal compression strategy? In this paper, we propose ByteComp to answer these questions. It first designs a decision tree abstraction to express all the compression strategies and develops empirical models to timeline tensor computation, communication, and compression to enable ByteComp to derive the intricate interactions among tensors. It then designs a compression decision algorithm that analyzes tensor interactions to eliminate and prioritize strategies and optimally offloads compression to CPUs. Experimental evaluations show that ByteComp can improve the training throughput over the state-of-the-art compression-enabled system by up to 77% for representative DDL training jobs. Moreover, the computational time needed to select the compression strategy is measured in milliseconds, and the selected strategy is only a few percent from optimal.
Published: 2022

17. dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training

Author: Hu, Hanpeng, Jiang, Chenyu, Zhong, Yuchen, Peng, Yanghua, Wu, Chuan, Zhu, Yibo, Lin, Haibin, and Guo, Chuanxiong
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Computer Science - Performance
Abstract: Distributed training using multiple devices (e.g., GPUs) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice. Given the complexity of distributed systems, it is challenging to identify the root cause(s) of inefficiency and exercise effective performance optimizations when training is unexpectedly slow. To date, no software tool diagnoses performance issues and helps expedite distributed DNN training across different deep learning frameworks. This paper proposes dPRO, a toolkit that includes: (1) an efficient profiler that collects runtime traces of distributed DNN training across multiple frameworks, especially fine-grained communication traces, and constructs global data flow graphs including detailed communication operations for accurate replay; (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies (from computation, communication, and memory aspects) for training acceleration. We implement dPRO on multiple deep learning frameworks (TensorFlow, MXNet) and representative communication schemes (AllReduce and Parameter Server). Extensive experiments show that dPRO predicts the performance of distributed training in various settings with < 5% errors in most cases and finds optimization strategies with up to 3.48x speed-up over the baselines. Comment: Accepted by MLSys 2022
Published: 2022
Full Text: View/download PDF

18. Aryl: An Elastic Cluster Scheduler for Deep Learning

Author: Li, Jiamin, Xu, Hong, Zhu, Yibo, Liu, Zherui, Guo, Chuanxiong, and Wang, Cong
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Companies build separate training and inference GPU clusters for deep learning, and use separate schedulers to manage them. This leads to problems for both training and inference: inference clusters have low GPU utilization when the traffic load is low; training jobs often experience long queueing time due to lack of resources. We introduce Aryl, a new cluster scheduler to address these problems. Aryl introduces capacity loaning to loan idle inference GPU servers for training jobs. It further exploits elastic scaling that scales a training job's GPU allocation to better utilize loaned resources. Capacity loaning and elastic scaling create new challenges to cluster management. When the loaned servers need to be returned, we need to minimize the number of job preemptions; when more GPUs become available, we need to allocate them to elastic jobs and minimize the job completion time (JCT). Aryl addresses these combinatorial problems using principled heuristics. It introduces the notion of server preemption cost which it greedily reduces during server reclaiming. It further relies on the JCT reduction value defined for each additional worker for an elastic job to solve the scheduling problem as a multiple-choice knapsack problem. Prototype implementation on a 64-GPU testbed and large-scale simulation with 15-day traces of over 50,000 production jobs show that Aryl brings 1.53x and 1.50x reductions in average queuing time and JCT, and improves cluster usage by up to 26.9% over the cluster scheduler without capacity loaning or elastic scaling.
Published: 2022
Full Text: View/download PDF
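
Illustration (not taken from the paper): the abstract for record 18 frames allocating newly available GPUs to elastic jobs as a multiple-choice knapsack problem. The sketch below is a textbook dynamic program for that problem class, not Aryl's scheduler. Each job offers a list of hypothetical `(extra_gpus, jct_reduction)` options, exactly one option (possibly "no extra GPUs") is chosen per job, and the total must fit in `capacity`.

```python
def allocate(jobs, capacity):
    """Multiple-choice knapsack by dynamic programming.

    jobs: list of option lists; each option is (extra_gpus, jct_reduction).
    Returns (best_total_reduction, extra_gpus_chosen_per_job).
    """
    best = [0.0] * (capacity + 1)   # best[c]: max value using at most c GPUs
    picks = []                      # picks[j][c]: GPUs given to job j at budget c
    for options in jobs:
        # Every job may also receive nothing, which keeps the problem feasible.
        options = [(0, 0.0)] + list(options)
        new_best = [0.0] * (capacity + 1)
        pick = [0] * (capacity + 1)
        for c in range(capacity + 1):
            value, gpus = max(
                (best[c - w] + v, w) for w, v in options if w <= c
            )
            new_best[c], pick[c] = value, gpus
        best = new_best
        picks.append(pick)
    # Walk the pick tables backwards to recover each job's allocation.
    chosen, remaining = [], capacity
    for pick in reversed(picks):
        chosen.append(pick[remaining])
        remaining -= pick[remaining]
    chosen.reverse()
    return best[capacity], chosen

# Two elastic jobs and three free GPUs (all values are made up).
jobs = [[(1, 3.0), (2, 5.0)], [(1, 4.0), (3, 6.0)]]
print(allocate(jobs, capacity=3))
```

The table has one row per job and one column per GPU budget, so the running time grows with jobs times capacity times options per job, which is why a production scheduler would likely prefer heuristics at cluster scale.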

19. BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

Author: Liu, Tianfeng, Chen, Yangrui, Li, Dan, Wu, Chuan, Zhu, Yibo, He, Jun, Peng, Yanghua, Chen, Hongzheng, Chen, Hongzhi, and Guo, Chuanxiong
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on various tasks such as node classification and graph property prediction. Nonetheless, existing systems are inefficient at training on large graphs with billions of nodes and edges using GPUs. The main bottlenecks lie in preparing data for GPUs: subgraph sampling and feature retrieval. This paper proposes BGL, a distributed GNN training system designed to address the bottlenecks with a few key ideas. First, we propose a dynamic cache engine to minimize feature retrieval traffic. By a co-design of caching policy and the order of sampling, we find a sweet spot of low overhead and high cache hit ratio. Second, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems by 20.68x on average. Comment: Under Review
Published: 2021

20. Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance

Author: Xing, Jiarong, Wang, Leyuan, Zhang, Shang, Chen, Jack, Chen, Ang, and Zhu, Yibo
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence
Abstract: Today's auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating a large search space to identify effective implementations, but they do so with opaque hardware details. Thus, their performance could fall behind that of hardware-native libraries (e.g., cuBLAS, cuDNN), which are hand-optimized by device vendors to extract high performance. On the other hand, these vendor libraries have a fixed set of supported functions and lack the customization and automation support afforded by auto-tuners. Bolt is based on the recent trend that vendor libraries are increasingly modularized and reconfigurable via declarative control (e.g., CUTLASS). It enables a novel approach that bridges this gap and achieves the best of both worlds, via hardware-native templated search. Bolt provides new opportunities to rethink end-to-end tensor optimizations at the graph, operator, and model levels. Bolt demonstrates this concept by prototyping on a popular auto-tuner in TVM and a class of widely-used platforms (i.e., NVIDIA GPUs) -- both in large deployment in our production environment. Bolt improves the inference speed of common convolutional neural networks by 2.5x on average over the state of the art, and it auto-tunes these models within 20 minutes.
Published: 2021

21. Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem

Author: Tan, Cheng, Li, Zhichao, Zhang, Jian, Cao, Yu, Qi, Sikai, Liu, Zherui, Zhu, Yibo, and Guo, Chuanxiong
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA A100 GPUs that partitions one physical GPU into multiple GPU instances. With MIG, A100 can be the most cost-efficient GPU ever for serving Deep Neural Networks (DNNs). However, discovering the most efficient GPU partitions is challenging. The underlying problem is NP-hard; moreover, it is a new abstract problem, which we define as the Reconfigurable Machine Scheduling Problem (RMS). This paper studies serving DNNs with MIG, a new case of RMS. We further propose a solution, MIG-serving. MIG-serving is an algorithm pipeline that blends a variety of newly designed algorithms and customized classic algorithms, including a heuristic greedy algorithm, Genetic Algorithm (GA), and Monte Carlo Tree Search algorithm (MCTS). We implement MIG-serving on Kubernetes. Our experiments show that compared to using A100 as-is, MIG-serving can save up to 40% of GPUs while providing the same throughput.
Published: 2021

22. DeepCC: Bridging the Gap Between Congestion Control and Applications via Multi-Objective Optimization

Author: Zhang, Lei, Cui, Yong, Wang, Mowei, Zhu, Kewei, Zhu, Yibo, and Jiang, Yong
Subjects: Computer Science - Networking and Internet Architecture
Abstract: The increasingly complicated and diverse applications have distinct network performance demands, e.g., some desire high throughput while others require low latency. Traditional congestion controls (CC) have no perception of these demands. Consequently, the literature has explored objective-specific algorithms, which are based on either offline training or online learning, to adapt to certain application demands. However, once generated, such algorithms are tailored to a specific performance objective function. Newly emerged performance demands in a changeable network environment require either expensive retraining (in the case of offline training), or manually redesigning a new objective function (in the case of online learning). To address this problem, we propose a novel architecture, DeepCC. It generates a CC agent that is generically applicable to a wide range of application requirements and network conditions. The key idea of DeepCC is to leverage both offline deep reinforcement learning and online fine-tuning. In the offline phase, instead of training towards a specific objective function, DeepCC trains its deep neural network model using multi-objective optimization. With the trained model, DeepCC offers near Pareto optimal policies w.r.t. different user-specified trade-offs between throughput, delay, and loss rate without any redesigning or retraining. In addition, a quick online fine-tuning phase further helps DeepCC achieve the application-specific demands under dynamic network conditions. The simulation and real-world experiments show that DeepCC outperforms state-of-the-art schemes in a wide range of settings. DeepCC achieves a target completion ratio of application requirements up to 67.4% higher than that of other schemes, even in an untrained environment.
Published: 2021

23. AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly

Author: Jin, Yuchen, Zhou, Tianyi, Zhao, Liangyu, Zhu, Yibo, Guo, Chuanxiong, Canini, Marco, and Krishnamurthy, Arvind
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The learning rate (LR) schedule is one of the most important hyper-parameters needing careful tuning in training DNNs. However, it is also one of the least automated parts of machine learning systems and usually costs significant manual effort and computing. Though there are pre-defined LR schedules and optimizers with adaptive LR, they introduce new hyperparameters that need to be tuned separately for different tasks/datasets. In this paper, we consider the question: Can we automatically tune the LR over the course of training without human involvement? We propose an efficient method, AutoLRS, which automatically optimizes the LR for each training stage by modeling training dynamics. AutoLRS aims to find an LR applied to every $\tau$ steps that minimizes the resulted validation loss. We solve this black-box optimization on the fly by Bayesian optimization (BO). However, collecting training instances for BO requires a system to evaluate each LR queried by BO's acquisition function for $\tau$ steps, which is prohibitively expensive in practice. Instead, we apply each candidate LR for only $\tau'\ll\tau$ steps and train an exponential model to predict the validation loss after $\tau$ steps. This mutual-training process between BO and the loss-prediction model allows us to limit the training steps invested in the BO search. We demonstrate the advantages and the generality of AutoLRS through extensive experiments of training DNNs for tasks from diverse domains using different optimizers. The LR schedules auto-generated by AutoLRS lead to a speedup of $1.22\times$, $1.43\times$, and $1.5\times$ when training ResNet-50, Transformer, and BERT, respectively, compared to the LR schedules in their original papers, and an average speedup of $1.31\times$ over state-of-the-art heavily-tuned LR schedules.
Comment: Published as a conference paper at ICLR 2021
Published: 2021

24. Advances in the adenylation domain: discovery of diverse non-ribosomal peptides

Author: Xu, Delei, Zhang, Zihan, Yao, Luye, Wu, LingTian, Zhu, Yibo, Zhao, Meilin, and Xu, Hong
Published: 2023
Full Text: View/download PDF

25. Scalable graphene platform for Tbits/s data transmission

Author: Lee, Brian S., Freitas, Alexandre P., Gil-Molina, Andres, Shim, Euijae, Zhu, Yibo, Hone, James, and Lipson, Michal
Subjects: Physics - Applied Physics, Physics - Optics
Abstract: To date, no electro-optic platform enables devices with high bandwidth, small footprint, and low power consumption, while also enabling mass production. Here we demonstrate high-yield fabrication of high-speed graphene electro-absorption modulators using CVD-grown graphene. We minimize variation in device performance from graphene inhomogeneity over large area by engineering graphene-mode overlap and device capacitance to ensure high extinction ratio. We fabricate an 8 mm x 1 mm chip with 32 graphene electro-absorption modulators and measure 94% yield with bit error rate below the hard-decision forward error correction limit at 7 Gbits/s, amounting to a total aggregated data rate of 210 Gbits/s. Monte Carlo simulations show that data rates > 0.6 Tbits/s are within reach by further optimizing device cross-section, paving the way for graphene-based ultra-high data rate applications.
Published: 2020

26. High performance integrated graphene electro-optic modulator at cryogenic temperature

Author: Lee, Brian S., Kim, Bumho, Freitas, Alexandre P., Mohanty, Aseema, Zhu, Yibo, Bhatt, Gaurang R., Hone, James, and Lipson, Michal
Subjects: Physics - Applied Physics, Physics - Optics
Abstract: High performance integrated electro-optic modulators operating at low temperature are critical for optical interconnects in cryogenic applications. Existing integrated modulators, however, suffer from reduced modulation efficiency or bandwidth at low temperatures because they rely on tuning mechanisms that degrade with decreasing temperature. Graphene modulators are a promising alternative, since graphene's intrinsic carrier mobility increases at low temperature. Here we demonstrate an integrated graphene-based electro-optic modulator whose 14.7 GHz bandwidth at 4.9 K exceeds the room-temperature bandwidth of 12.6 GHz. The bandwidth of the modulator is limited only by high contact resistance, and its intrinsic RC-limited bandwidth is 200 GHz at 4.9 K.
Comment: 10 pages, 4 figures, Supplementary Information
Published: 2020
Full Text: View/download PDF

27. High yield production of ultrathin fibroid semiconducting nanowire of Ta$_2$Pd$_3$Se$_8$

Author: Liu, Xue, Liu, Sheng, Antipina, Liubov Yu., Zhu, Yibo, Ning, Jinliang, Liu, Jinyu, Yue, Chunlei, Joshy, Abin, Zhu, Yu, Sun, Jianwei, Sanchez, Ana M, Sorokin, Pavel B., Mao, Zhiqiang, Xiong, Qihua, and Wei, Jiang
Subjects: Physics - Applied Physics, Condensed Matter - Materials Science
Abstract: Immediately after the demonstration of high-quality electronic properties in various two-dimensional (2D) van der Waals (vdW) crystals fabricated by mechanical exfoliation, many methods were reported to explore and control large-scale fabrication. Compared with recent advancements in fabricating 2D atomic layered crystals, large-scale production of one-dimensional (1D) nanowires with thickness approaching the molecular or atomic level remains stagnant. Here, we demonstrate the high-yield production of a 1D vdW material, semiconducting Ta2Pd3Se8 nanowires, by means of liquid-phase exfoliation. The thinnest nanowire we have readily achieved is around 1 nm, corresponding to a bundle of one or two molecular ribbons. Transmission electron microscopy and transport measurements reveal that the as-fabricated Ta2Pd3Se8 nanowires exhibit unexpectedly high crystallinity and chemical stability. Our low-frequency Raman spectroscopy reveals clear evidence of the existence of weak inter-ribbon binding. The fabricated nanowire transistors exhibit high switching performance and are promising for photodetector applications.
Published: 2019

28. Experimental study on flexural behavior of modular prefabricated channel–concrete composite beams with dry connections

Author: Fang, Jiaopeng, Zhou, Lingyu, Zhu, Yibo, Li, Fengui, Dai, Chaohu, Zhou, Quan, and Liao, Fei
Published: 2023
Full Text: View/download PDF

29. SP-GNN: Learning structure and position information from graphs

Author: Chen, Yangrui, You, Jiaxuan, He, Jun, Lin, Yuan, Peng, Yanghua, Wu, Chuan, and Zhu, Yibo
Published: 2023
Full Text: View/download PDF

30. 007: Democratically Finding The Cause of Packet Drops

Author: Arzani, Behnaz, Ciraci, Selim, Chamon, Luiz, Zhu, Yibo, Liu, Hingqiang, Padhye, Jitu, Loo, Boon Thau, and Outhred, Geoff
Subjects: Computer Science - Networking and Internet Architecture
Abstract: Network failures continue to plague datacenter operators as their symptoms may not have direct correlation with where or why they occur. We introduce 007, a lightweight, always-on diagnosis application that can find problematic links and also pinpoint problems for each TCP connection. 007 is completely contained within the end host. During its two month deployment in a tier-1 datacenter, it detected every problem found by previously deployed monitoring tools while also finding the sources of other problems previously undetected.
Comment: 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, 2018
Published: 2018

31. Molecular cloning, characterization, and expression analysis of TIPE1 in chicken (Gallus gallus): Its applications in fatty liver hemorrhagic syndrome

Author: Cheng, Xinyi, Liu, Jiuyue, Zhu, Yibo, Guo, Xiaoquan, Liu, Ping, Zhang, Caiying, Cao, Huabin, Xing, Chenghong, Zhuang, Yu, and Hu, Guoliang
Published: 2022
Full Text: View/download PDF

32. Evaluating the efficiency of three methods to clean and disinfect screw- and cement-retained prostheses

Author: Yin, Lina, Zhu, Yibo, Yu, Huajie, and Qiu, Lixin
Published: 2022
Full Text: View/download PDF

33. High-fidelity biosensing of dNTPs and nucleic acids by controllable subnanometer channel PaMscS

Author: Zhao, Changjian, Li, Kaiju, Mou, Xingyu, Zhu, Yibo, Chen, Chuan, Zhang, Ming, Wang, Yu, Zhou, Ke, Sheng, Yingying, Liu, Hao, Bai, Yunjin, Li, Xinqiong, Zhou, Cuisong, Deng, Dong, Wu, Jianping, Wu, Hai-Chen, Bao, Rui, and Geng, Jia
Published: 2022
Full Text: View/download PDF

34. Neural and biomechanical tradeoffs associated with human-exoskeleton interactions

Author: Zhu, Yibo, Weston, Eric B., Mehta, Ranjana K., and Marras, William S.
Published: 2021
Full Text: View/download PDF

35. Utilizing MOFs Melt-Foaming to Design Functionalized Carbon Foams for 100% Deep-Discharge and Ultrahigh Capacity Sodium Metal Anodes

Author: Liu, Peng, Zhao, Simin, Gao, Shengyong, Yang, Liting, Fang, Kuan, Zhu, Yibo, Niu, Heping, Jia, Xiaolong, and Zhou, Jisheng
Abstract: Meltable metal–organic frameworks (MOFs) offer significant accessibility to chemistry and moldability for developing carbon-based materials. However, the scarcity of low melting point MOFs poses challenges for related design. Here, we propose a MOFs melt-foaming strategy toward Ni single atoms/quantum dots-functionalized carbon foams (NiSA/QD@CFs). Melt-foaming highly depends on two factors: flexible metal–phosphorus bonds with cage-like ligands bridged in a zipper configuration via hydrogen bonds, facilitating MOFs conformational melting below 200 °C, and high annealing rates that lead to MOFs foaming by reducing the energy barrier and enhancing the pyrolysis enthalpy. When used as hosts for sodium metal anodes, foam structure regulates metallic Na to preferentially deposit inside the pores, while Ni SA and QD synergistically enhance Na absorption. Consequently, NiSA/QD@CF electrodes exhibit stable cyclic performance for 1000 h in symmetrical cells, with a low hysteresis voltage of 98 mV at 100 mA/cm2, 100 mAh/cm2, and 100% depth of discharge. Moreover, both full cells and anode-free ones exhibit excellent rate and cyclic performances. This strategy enriches the liquid MOFs family and their applications in mild-processing CFs for electrochemical energy storage.
Published: 2025
Full Text: View/download PDF

36. Structural characterization of PaFkbA: A periplasmic chaperone from Pseudomonas aeruginosa

Author: Huang, Qin, Yang, Jing, Li, Changcheng, Song, Yingjie, Zhu, Yibo, Zhao, Ninglin, Mou, Xingyu, Tang, Xinyue, Luo, Guihua, Tong, Aiping, Sun, Bo, Tang, Hong, Li, Hong, Bai, Lang, and Bao, Rui
Published: 2021
Full Text: View/download PDF

37. An Internally Attached Aptameric Graphene Nanosensor for Sensitive Vasopressin Measurement in Critical Patient Monitoring

Author: Yu, Shifeng, Dai, Wenting, Su, Chao, Milosavic, Nenad, Wang, Ziran, Wang, Xuejun, Zhu, Yibo, He, Maogang, Landry, Donald W., Stojanovic, Milan N., and Lin, Qiao
Published: 2024
Full Text: View/download PDF

38. Potential molecular mechanisms of Huangqin Tang for liver cancer treatment by network pharmacology and molecular dynamics simulations

Author: Wei, Liliang, primary, Lv, Qiuqiong, additional, Wang, Qiong, additional, Zhu, Yibo, additional, and Ding, Feng, additional
Published: 2024
Full Text: View/download PDF

39. The association between circulating CD34+CD133+ endothelial progenitor cells and reduced risk of Alzheimer’s disease in the Framingham Heart Study

Author: Wang, Yixuan, primary, Huang, Jinghan, additional, Ang, Ting Fang Alvin, additional, Zhu, Yibo, additional, Tao, Qiushan, additional, Mez, Jesse, additional, Alosco, Michael, additional, Denis, Gerald V., additional, Belkina, Anna, additional, Gurnani, Ashita, additional, Ross, Mark, additional, Gong, Bin, additional, Han, Jingyan, additional, Lunetta, Kathryn L., additional, Stein, Thor D., additional, Au, Rhoda, additional, Farrer, Lindsay A., additional, Zhang, Xiaoling, additional, and Qiu, Wei Qiao, additional
Published: 2024
Full Text: View/download PDF

40. Molecular mechanism of the one-component regulator RccR on bacterial metabolism and virulence

Author: Zhu, Yibo, primary, Mou, Xingyu, additional, Song, Yingjie, additional, Zhang, Qianqian, additional, Sun, Bo, additional, Liu, Huanxiang, additional, Tang, Hong, additional, and Bao, Rui, additional
Published: 2024
Full Text: View/download PDF

41. Direct and Continuous Monitoring of Multicomponent Antibiotic Gentamicin in Blood at Single-Molecule Resolution

Author: Zhao, Changjian, primary, Wang, Yu, additional, Chen, Chen, additional, Zhu, Yibo, additional, Miao, Zhuang, additional, Mou, Xingyu, additional, Yuan, Weidan, additional, Zhang, Zhihao, additional, Li, Kaiju, additional, Chen, Mutian, additional, Liang, Weibo, additional, Zhang, Ming, additional, Miao, Wenqian, additional, Dong, Yuhan, additional, Deng, Dong, additional, Wu, Jianping, additional, Ke, Bowen, additional, Bao, Rui, additional, and Geng, Jia, additional
Published: 2024
Full Text: View/download PDF

42. Efficient expression of chondroitinase ABC I for specific disaccharides detection of chondroitin sulfate

Author: Lu, Xingyu, Zhong, Qian, Liu, Jian, Yang, Fulin, Lu, Chenghui, Xiong, Huan, Li, Sha, Zhu, Yibo, and Wu, Lingtian
Published: 2020
Full Text: View/download PDF

43. Optimal Selection of Sampling Points within Sewer Networks for Wastewater-Based Epidemiology Applications

Author: Yao, Yao, Zhu, Yibo, Nogueira, Regina, Klawonn, Frank, and Wallner, Markus
Abstract: Wastewater-based epidemiology (WBE) has great potential to monitor community public health, especially during pandemics. However, it faces substantial hurdles in pathogen surveillance through WBE, encompassing data representativeness, spatiotemporal variability, population estimates, pathogen decay, and environmental factors. This paper aims to enhance the reliability of WBE data, especially for early outbreak detection and improved sampling strategies within sewer networks. The tool implemented in this paper combines a monitoring model and an optimization model to facilitate the optimal selection of sampling points within sewer networks. The monitoring model utilizes parameters such as feces density and average water consumption to define the detectability of the virus that needs to be monitored. This allows for standardization and simplicity in the process of moving from the analysis of wastewater samples to the identification of infection in the source area. The entropy-based model can select optimal sampling points in a sewer network to obtain the most specific information at a minimum cost. The practicality of our tool is validated using data from Hildesheim, Germany, employing SARS-CoV-2 as a pilot pathogen. It is important to note that the tool’s versatility empowers its extension to monitor other pathogens in the future.
Published: 2024

44. High yield production of ultrathin fibroid semiconducting nanowire of Ta2Pd3Se8

Author: Liu, Xue, Liu, Sheng, Antipina, Liubov Yu., Zhu, Yibo, Ning, Jinliang, Liu, Jinyu, Yue, Chunlei, Joshy, Abin, Zhu, Yu, Sun, Jianwei, Sanchez, Ana M., Sorokin, Pavel B., Mao, Zhiqiang, Xiong, Qihua, and Wei, Jiang
Published: 2020
Full Text: View/download PDF

45. Bone augmentation with autologous tooth shell in the esthetic zone for dental implant restoration: a pilot study

Author: Li, Shuyi, Gao, Ming, Zhou, Miao, and Zhu, Yibo
Published: 2021
Full Text: View/download PDF

46. High-performance integrated graphene electro-optic modulator at cryogenic temperature

Author: Lee, Brian S., primary, Kim, Bumho, additional, Freitas, Alexandre P., additional, Mohanty, Aseema, additional, Zhu, Yibo, additional, Bhatt, Gaurang R., additional, Hone, James, additional, and Lipson, Michal, additional
Published: 2021
Full Text: View/download PDF

47. Development of sugarcane resource for efficient fermentation of exopolysaccharide by using a novel strain of Kosakonia cowanii LT-1

Author: Wang, Liying, Wu, Lingtian, Chen, Qiaoyu, Li, Sha, Zhu, Yibo, Wu, Jinnan, Chu, Jianlin, and Wu, Shanshan
Published: 2019
Full Text: View/download PDF

48. Exploiting electrostatic shielding-effect of metal nanoparticles to recognize uncharged small molecule affinity with label-free graphene electronic biosensor

Author: Wang, Cheng, Ye, Weixiang, Li, Yijun, Zhu, Yibo, Lin, Qiao, and He, Miao
Published: 2019
Full Text: View/download PDF

49. Serf and Turf: Crowdturfing for Fun and Profit

Author: Wang, Gang, Wilson, Christo, Zhao, Xiaohan, Zhu, Yibo, Mohanlal, Manish, Zheng, Haitao, and Zhao, Ben Y.
Subjects: Computer Science - Social and Information Networks, Computer Science - Cryptography and Security, H.3.5, J.4
Abstract: Popular Internet services in recent years have shown that remarkable things can be achieved by harnessing the power of the masses using crowd-sourcing systems. However, crowd-sourcing systems can also pose a real challenge to existing security mechanisms deployed to protect Internet services. Many of these techniques make the assumption that malicious activity is generated automatically by machines, and perform poorly or fail if users can be organized to perform malicious tasks using crowd-sourcing systems. Through measurements, we have found surprising evidence showing that not only do malicious crowd-sourcing systems exist, but they are rapidly growing in both user base and total revenue. In this paper, we describe a significant effort to study and understand these "crowdturfing" systems in today's Internet. We use detailed crawls to extract data about the size and operational structure of these crowdturfing systems. We analyze details of campaigns offered and performed in these sites, and evaluate their end-to-end effectiveness by running active, non-malicious campaigns of our own. Finally, we study and compare the source of workers on crowdturfing sites in different countries. Our results suggest that campaigns on these systems are highly effective at reaching users, and their continuing growth poses a concrete threat to online communities such as social networks, both in the US and elsewhere.
Comment: Proceedings of WWW 2012 Conference, 10 pages, 23 figures, 4 tables
Published: 2011

50. Optimal Selection of Sampling Points within Sewer Networks for Wastewater-Based Epidemiology Applications

Author: Yao, Yao, primary, Zhu, Yibo, additional, Nogueira, Regina, additional, Klawonn, Frank, additional, and Wallner, Markus, additional
Published: 2024
Full Text: View/download PDF
