Author: "Gong, Weibao" - Searchworks@Jio Institute Digital Library Search Results

Your search for author "Gong, Weibao" returned 8 results

1. MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services

Author: Yu, Dianhai, Shen, Liang, Hao, Hongxiang, Gong, Weibao, Wu, Huachao, Bian, Jiang, Dai, Lirong, and Xiong, Haoyi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence
Abstract: While modern internet services, such as chatbots, search engines, and online advertising, demand the use of large-scale deep neural networks (DNNs), distributed training and inference over heterogeneous computing systems are desired to facilitate these DNN models. Mixture-of-Experts (MoE) is one of the most common strategies to lower the cost of training subject to the overall size of models/data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts in carrying out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference could be further improved from several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present a novel MoESys that boosts efficiency in both large-scale training and inference. Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelisms. For scalable inference in a single node, especially when the model size is larger than GPU memory, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate MoESys, where MoESys successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state-of-the-art shows that MoESys outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. Particularly, under unbalanced MoE Tasks, e.g., UFO, MoESys achieved 64% higher throughput with 18% lower memory footprints.
Published: 2022

2. Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud Clusters

Author: Xiang, Yang, Wu, Zhihua, Gong, Weibao, Ding, Siyu, Mo, Xianjie, Liu, Yuang, Wang, Shuohuan, Liu, Peng, Hou, Yongshuai, Li, Long, Wang, Bin, Shi, Shaohuai, Han, Yaqian, Yu, Yue, Li, Ge, Sun, Yu, Ma, Yanjun, and Yu, Dianhai
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The ever-growing model size and scale of compute have attracted increasing interest in training deep learning models over multiple nodes. However, when it comes to training on cloud clusters, especially across remote clusters, huge challenges are faced. In this work, we introduce a general framework, Nebula-I, for collaboratively training deep learning models over remote heterogeneous clusters, the connections between which are low-bandwidth wide area networks (WANs). We took natural language processing (NLP) as an example to show how Nebula-I works in different training phases that include: a) pre-training a multilingual language model using two remote clusters; and b) fine-tuning a machine translation model using knowledge distilled from pre-trained models, which run through the most popular paradigm of recent deep learning. To balance the accuracy and communication efficiency, in Nebula-I, parameter-efficient training strategies, hybrid parallel computing methods and adaptive communication acceleration techniques are jointly applied. Meanwhile, security strategies are employed to guarantee the safety, reliability and privacy in intra-cluster computation and inter-cluster communication. Nebula-I is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware, e.g. GPU and NPU. Experiments demonstrate that the proposed framework could substantially maximize the training efficiency while preserving satisfactory NLP performance. By using Nebula-I, users can run large-scale training tasks over cloud clusters with minimum developments, and the utility of existing large pre-trained models could be further promoted. We also introduced new state-of-the-art results on cross-lingual natural language inference tasks, which are generated based upon a novel learning framework and Nebula-I.
Comment: 20 pages, 10 figures, technical report
Published: 2022

3. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Author: Wang, Shuohuan, Sun, Yu, Xiang, Yang, Wu, Zhihua, Ding, Siyu, Gong, Weibao, Feng, Shikun, Shang, Junyuan, Zhao, Yanbin, Pang, Chao, Liu, Jiaxiang, Chen, Xuyi, Lu, Yuxiang, Liu, Weixin, Wang, Xi, Bai, Yangfan, Chen, Qiuliang, Zhao, Li, Li, Shiyong, Sun, Peng, Yu, Dianhai, Ma, Yanjun, Tian, Hao, Wu, Hua, Wu, Tian, Zeng, Wei, Li, Ge, Gao, Wen, and Wang, Haifeng
Subjects: Computer Science - Computation and Language
Abstract: Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models and trained a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan with up to 260 billion parameters on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable texts. To reduce the computation overhead and carbon emission, we propose an online distillation framework for ERNIE 3.0 Titan, where the teacher model will teach students and train itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense pre-trained model so far. Empirical results show that the ERNIE 3.0 Titan outperforms the state-of-the-art models on 68 NLP datasets.
Comment: arXiv admin note: text overlap with arXiv:2107.02137
Published: 2021

4. End-to-end Adaptive Distributed Training on PaddlePaddle

Author: Ao, Yulong, Wu, Zhihua, Yu, Dianhai, Gong, Weibao, Kui, Zhiqing, Zhang, Minxu, Ye, Zilingfeng, Shen, Liang, Ma, Yanjun, Wu, Tian, Wang, Haifeng, Zeng, Wei, and Yang, Chao
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Distributed training has become a pervasive and effective approach for training a large neural network (NN) model with processing massive data. However, it is very challenging to satisfy requirements from various NN models, diverse computing resources, and their dynamic changes during a training job. In this study, we design our distributed training framework in a systematic end-to-end view to provide the built-in adaptive ability for different scenarios, especially for industrial applications and production environments, by fully considering resource allocation, model partition, task placement, and distributed execution. Based on the unified distributed graph and the unified cluster object, our adaptive framework is equipped with a global cost model and a global planner, which can enable arbitrary parallelism, resource-aware placement, multi-mode execution, fault-tolerant, and elastic distributed training. The experiments demonstrate that our framework can satisfy various requirements from the diversity of applications and the heterogeneity of resources with highly competitive performance. The ERNIE language model with 260 billion parameters is efficiently trained on thousands of AI processors with 91.7% weak scalability. The throughput of the model from the recommender system by employing the heterogeneous pipeline asynchronous execution can be increased up to 2.1 times and 3.3 times that of the GPU-only and CPU-only training respectively. Moreover, the fault-tolerant and elastic distributed training have been successfully applied to the online industrial applications, which give a reduction of 34.49% in the number of failed long-term training jobs and an increase of 33.91% for the global scheduling efficiency in the production environment.
Comment: 16 pages, 10 figures, 4 tables
Published: 2021

5. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Author: Sun, Yu, Wang, Shuohuan, Feng, Shikun, Ding, Siyu, Pang, Chao, Shang, Junyuan, Liu, Jiaxiang, Chen, Xuyi, Zhao, Yanbin, Lu, Yuxiang, Liu, Weixin, Wu, Zhihua, Gong, Weibao, Liang, Jianzhong, Shang, Zhizhou, Sun, Peng, Liu, Wei, Ouyang, Xuan, Yu, Dianhai, Tian, Hao, Wu, Hua, and Wang, Haifeng
Subjects: Computer Science - Computation and Language
Abstract: Pre-trained models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. Recent works such as T5 and GPT-3 have shown that scaling up pre-trained language models can improve their generalization abilities. Particularly, the GPT-3 model with 175 billion parameters shows its strong task-agnostic zero-shot/few-shot learning capabilities. Despite their success, these large-scale models are trained on plain texts without introducing knowledge such as linguistic knowledge and world knowledge. In addition, most large-scale models are trained in an auto-regressive way. As a result, this kind of traditional fine-tuning approach demonstrates relatively weak performance when solving downstream language understanding tasks. In order to solve the above problems, we propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models. It fuses auto-regressive network and auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks with zero-shot learning, few-shot learning or fine-tuning. We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph. Empirical results show that the model outperforms the state-of-the-art models on 54 Chinese NLP tasks, and its English version achieves the first place on the SuperGLUE benchmark (July 3, 2021), surpassing the human performance by +0.8% (90.6% vs. 89.8%).
Published: 2021

6. Elastic Deep Learning Using Knowledge Distillation with Heterogeneous Computing Resources

Author: Dong, Daxiang, Liu, Ji, Wang, Xi, Gong, Weibao, Qin, An, Li, Xingjian, Yu, Dianhai, Valduriez, Patrick, and Dou, Dejing
Editor: Chaves, Ricardo, Heras, Dora B., Ilic, Aleksandar, Unat, Didem, Badia, Rosa M., Bracciali, Andrea, Diehl, Patrick, Dubey, Anshu, Oh, Sangyoon, Scott, Stephen L., and Ricci, Laura
Series Editor: Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), and Yung, Moti (Editorial Board Member)
Published: 2022

7. MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services

Author: Yu, Dianhai, Shen, Liang, Hao, Hongxiang, Gong, Weibao, Wu, Huachao, Bian, Jiang, Dai, Lirong, and Xiong, Haoyi
Published: 2024

8. SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System

Author: Shen, Liang, Wu, Zhihua, Gong, WeiBao, Hao, Hongxiang, Bai, Yangfan, Wu, HuaChao, Wu, Xinxuan, Bian, Jiang, Xiong, Haoyi, Yu, Dianhai, and Ma, Yanjun
Abstract: With the increasing diversity of ML infrastructures nowadays, distributed training over heterogeneous computing systems is desired to facilitate the production of big models. Mixture-of-Experts (MoE) models have been proposed to lower the cost of training subject to the overall size of models/data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts in carrying out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference could be further improved from several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present SE-MoE that proposes Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelisms in various types. For scalable inference in a single node, especially when the model size is larger than GPU memory, SE-MoE forms the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate SE-MoE, where SE-MoE successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state-of-the-art shows that SE-MoE outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. Particularly, under unbalanced MoE Tasks, e.g., UFO, SE-MoE achieved 64% higher throughput with 18% lower memory footprints. The code of the framework will be released on: https://github.com/PaddlePaddle/Paddle.
Published: 2022
