CoExe: An Efficient Co-execution Architecture for Real-Time Neural Network Services
- Authors
- Chubo Liu, Zihao Zeng, Tao Li, Kenli Li, Keqin Li, Mingcong Song, and Jiechen Zhao
- Subjects
- Applied physics, Queueing theory, Artificial neural network, Dataflow, Computer science, Distributed computing, Workload, Execution time, Computer hardware & architecture, Electrical engineering, Architecture, Latency (engineering)
- Abstract
End-to-end latency is critical for user-interactive neural network (NN) services in the cloud. During periods of high request load, co-locating multiple NN requests can reduce end-to-end latency. However, current batch-based accelerators lack support for request-level parallelism, leaving queuing time unoptimized. Meanwhile, naively partitioning resources among simultaneous requests leads to longer execution times and lower resource efficiency, because different applications occupy separate resources without sharing. To effectively reduce end-to-end latency for real-time NN requests, we propose the CoExe architecture, equipped with a pipelined implementation of a sparsity-driven real-time co-execution model. By leveraging the non-trivial amount of sparse operations that arise during concurrent NN execution, end-to-end latency is reduced by up to 12.3× and 2.4× over an Eyeriss-like accelerator and SCNN, respectively, under peak workload. In addition, we propose the row cross (RC) dataflow to reduce data movement cost and avoid memory duplication.
- Published
- 2020