31 results for "Mingcong Song"
Search Results
2. Eager pruning: algorithm and architecture support for fast training of deep neural networks.
- Author
-
Jiaqi Zhang 0002, Xiangru Chen, Mingcong Song, and Tao Li 0006
- Published
- 2019
- Full Text
- View/download PDF
3. Exploiting Dynamic Thermal Energy Harvesting for Reusing in Smartphone with Mobile Applications.
- Author
-
Yuting Dai, Tao Li 0006, Benyong Liu, Mingcong Song, and Huixiang Chen 0001
- Published
- 2018
- Full Text
- View/download PDF
4. LerGAN: A Zero-Free, Low Data Movement and PIM-Based GAN Architecture.
- Author
-
Haiyu Mao, Mingcong Song, Tao Li 0006, Yuting Dai, and Jiwu Shu
- Published
- 2018
- Full Text
- View/download PDF
5. Prediction Based Execution on Deep Neural Networks.
- Author
-
Mingcong Song, Jiechen Zhao 0003, Yang Hu 0001, Jiaqi Zhang 0002, and Tao Li 0006
- Published
- 2018
- Full Text
- View/download PDF
6. In-Situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems.
- Author
-
Mingcong Song, Kan Zhong, Jiaqi Zhang 0002, Yang Hu 0001, Duo Liu, Weigong Zhang, Jing Wang 0055, and Tao Li 0006
- Published
- 2018
- Full Text
- View/download PDF
7. Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-Based Deep Learning.
- Author
-
Mingcong Song, Jiaqi Zhang 0002, Huixiang Chen 0001, and Tao Li 0006
- Published
- 2018
- Full Text
- View/download PDF
8. Towards 'Full Containerization' in Containerized Network Function Virtualization.
- Author
-
Yang Hu 0001, Mingcong Song, and Tao Li 0006
- Published
- 2017
- Full Text
- View/download PDF
9. Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures.
- Author
-
Mingcong Song, Yang Hu 0001, Huixiang Chen 0001, and Tao Li 0006
- Published
- 2017
- Full Text
- View/download PDF
10. GaaS workload characterization under NUMA architecture for virtualized GPU.
- Author
-
Huixiang Chen 0001, Meng Wang, Yang Hu 0001, Mingcong Song, and Tao Li 0006
- Published
- 2017
- Full Text
- View/download PDF
11. Bridging the Semantic Gaps of GPU Acceleration for Scale-out CNN-based Big Data Processing: Think Big, See Small.
- Author
-
Mingcong Song, Yang Hu 0001, Yunlong Xu, Chao Li 0009, Huixiang Chen 0001, Jingling Yuan, and Tao Li 0006
- Published
- 2016
- Full Text
- View/download PDF
12. Scheduling Tasks with Mixed Timing Constraints in GPU-Powered Real-Time Systems.
- Author
-
Yunlong Xu, Rui Wang 0014, Tao Li 0006, Mingcong Song, Lan Gao, Zhongzhi Luan, and Depei Qian
- Published
- 2016
- Full Text
- View/download PDF
13. Towards sustainable in-situ server systems in the big data era.
- Author
-
Chao Li 0009, Yang Hu 0001, Longjun Liu, Juncheng Gu, Mingcong Song, Xiaoyao Liang, Jingling Yuan, and Tao Li 0006
- Published
- 2015
- Full Text
- View/download PDF
14. LrGAN: A Compact and Energy Efficient PIM-Based Architecture for GAN Training
- Author
-
Mingcong Song, Haiyu Mao, Jiwu Shu, and Tao Li
- Subjects
Discriminator, Speedup, Artificial neural network, Computer science, Approximation algorithm, Theoretical Computer Science, Computational Theory and Mathematics, Computer architecture, Hardware and Architecture, Unsupervised learning, Field-programmable gate array, Software, Energy (signal processing), Efficient energy use
- Abstract
As a powerful unsupervised learning method, the Generative Adversarial Network (GAN) plays an essential role in many domains. However, training a GAN imposes four additional challenges: (1) intensive communication caused by the complex training phases of GAN; (2) many more ineffectual computations caused by its peculiar convolutions; (3) more frequent off-chip memory accesses for exchanging intermediate data between the generator and the discriminator; and (4) high energy consumption from unnecessary fine-grained MLC programming. In this article, we propose LrGAN, a PIM-based GAN accelerator, to address the challenges of training GANs. We first propose a zero-free data reshaping scheme for ReRAM-based PIM, which removes the zero-related computations. We then propose a 3D-connected PIM, which can reconfigure connections inside the PIM dynamically according to the dataflows of propagation and updating. After that, we propose an approximate weight update algorithm to avoid unnecessary fine-grained MLC programming. Finally, we propose LrGAN based on these three techniques, providing programmers with different levels of GAN acceleration. Experiments show that LrGAN achieves 47.2×, 21.42×, and 7.46× speedup over an FPGA-based GAN accelerator, a GPU platform, and a ReRAM-based neural network accelerator, respectively. In addition, LrGAN achieves 13.65×, 10.75×, and 1.34× energy saving on average over the GPU platform, PRIME, and the FPGA-based GAN accelerator, respectively.
- Published
- 2021
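The zero-free data reshaping idea summarized in the LrGAN abstract above can be illustrated with a small NumPy sketch. This is only a 1-D toy: a stride-2 transposed convolution is commonly computed by inserting zeros between input samples and then convolving, so every multiply that touches an inserted zero is wasted work. The kernel, input, and helper names below are invented for illustration and do not reproduce the paper's ReRAM mapping.

```python
import numpy as np

# Toy 1-D version of the "zero-free" idea: indexing the original input
# directly gives the same result as the zero-inserted convolution while
# skipping the ineffectual multiplies.

def conv_valid(x, w):
    """Plain 'valid' correlation used as the zero-inserted baseline."""
    return np.array([np.dot(x[i:i + len(w)], w)
                     for i in range(len(x) - len(w) + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])   # toy generator input
w = np.array([0.5, -1.0, 0.25])      # toy kernel

# Baseline: upsample by inserting zeros, then convolve (zeros included).
up = np.zeros(2 * len(x) - 1)
up[::2] = x
baseline = conv_valid(up, w)

# Zero-free: never materialize the zeros; only kernel taps that land on a
# real input sample contribute, so roughly half the multiplies disappear.
zero_free = np.array([
    sum(w[k] * x[(i + k) // 2] for k in range(len(w)) if (i + k) % 2 == 0)
    for i in range(len(up) - len(w) + 1)
])

assert np.allclose(baseline, zero_free)
```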
15. Democratic learning: hardware/software co-design for lightweight blockchain-secured on-device machine learning
- Author
-
Yuting Dai, Xiaoguang Liu, Mingcong Song, Tao Li, Zhibin Yu, Rui Zhang, and Gang Wang
- Subjects
Blockchain, Edge device, Exploit, Computer science, Reuse, Machine learning, Software, Hardware and Architecture, Overhead (computing), Artificial intelligence, Architecture, 5G
- Abstract
Recently, the trending 5G technology encourages extensive applications of on-device machine learning, which collects user data for model training. This requires cost-effective techniques to preserve the privacy and security of model training within a resource-constrained environment. Traditional learning methods rely on trust among the system's participants for privacy and security. However, as the learning scale grows, maintaining every edge device's trustworthiness can become expensive. To cost-effectively establish trust in a trustless environment, this paper proposes democratic learning (DemL), which takes the first step toward exploring hardware/software co-design for blockchain-secured decentralized on-device learning. By utilizing blockchain's decentralization and tamper-proofing, our design secures AI learning in a trustless environment. To tackle the extra overhead introduced by the blockchain, we propose PoMC, an algorithm and architecture co-design, as a novel blockchain consensus mechanism, which is the first to exploit cross-domain reuse (AI learning and blockchain consensus) in an AI learning architecture. Evaluation results show that DemL can protect AI learning from privacy leakage and model pollution, and demonstrate that privacy and security come with trivial hardware overhead and power consumption (2%). We believe that our work will open the door to synergizing blockchain and on-device learning for security and privacy.
- Published
- 2021
16. Towards 'Full Containerization' in Containerized Network Function Virtualization
- Author
-
Mingcong Song, Tao Li, and Yang Hu
- Subjects
Memory buffer register, Cache coloring, Computer science, Linux kernel, Cache pollution, Resource management, Cache algorithms, Locality, Provisioning, Memory management, Operating system, Page cache, Cache, Software
- Abstract
With exploding traffic stuffing the existing network infrastructure, today's telecommunication and cloud service providers resort to Network Function Virtualization (NFV) for greater agility and economics. Pioneering service providers such as AT&T propose to adopt containers in NFV to achieve shorter Virtualized Network Function (VNF) provisioning time and better runtime performance. However, we characterize typical NFV workloads on containers and find that the performance is unsatisfactory. We observe that the shared host OS network stack is the main bottleneck, where traffic flow processing involves a large number of intermediate memory buffers and results in significant last-level cache pollution. Existing OS memory allocation policies fail to exploit the locality and data-sharing information among buffers. In this paper, we propose NetContainer, a software framework that achieves fine-grained hardware resource management for containerized NFV platforms. NetContainer employs a cache access overheads guided page coloring scheme to coordinately address the inter-flow and intra-flow cache access overheads. It maps the memory buffer pages that manifest low cache access overheads (within a flow or among flows) to the same last-level cache partition. NetContainer exploits a footprint-theory-based method to estimate the cache access overheads and a Min-Cost Max-Flow model to guide the memory buffer mappings. We implement NetContainer in the Linux kernel and extensively evaluate it with real NFV workloads. Experimental results show that NetContainer outperforms a conventional page coloring-based memory allocator by 48% in terms of successful call rate.
- Published
- 2017
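The page-coloring mechanism that NetContainer builds on can be sketched as follows. The cache geometry, frame numbers, and helper functions are hypothetical, and the paper's footprint-based cost estimation and Min-Cost Max-Flow mapping are not reproduced; the sketch only shows how a page's "color" (the set-index bits the OS can steer) determines which last-level cache partition its buffers land in.

```python
# Hypothetical cache geometry: 8 MiB, 16-way LLC with 64 B lines -> 8192 sets.
PAGE_SIZE = 4096
LINE_SIZE = 64
LLC_SETS  = 8192

line_offset_bits = (LINE_SIZE - 1).bit_length()   # 6
set_index_bits   = (LLC_SETS - 1).bit_length()    # 13
page_offset_bits = (PAGE_SIZE - 1).bit_length()   # 12

# The set-index bits that sit above the page offset are the ones the OS can
# control through frame selection -- those bits form the page "color".
num_colors = 1 << (line_offset_bits + set_index_bits - page_offset_bits)  # 128

def page_color(phys_addr: int) -> int:
    return (phys_addr >> page_offset_bits) % num_colors

def take_frame(free_frames, color):
    """Grab a free physical frame whose page maps to the given LLC partition."""
    f = next(f for f in free_frames if page_color(f * PAGE_SIZE) == color)
    free_frames.remove(f)
    return f

# Buffers that should share an LLC partition get frames of the same color.
free = list(range(512))                  # hypothetical free frame numbers
flow_a_frame = take_frame(free, color=3)
flow_b_frame = take_frame(free, color=3)
assert page_color(flow_a_frame * PAGE_SIZE) == page_color(flow_b_frame * PAGE_SIZE)
```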
17. Retracted on January 26, 2021
- Author
-
Huixiang Chen, Jiechen Zhao, Tao Li, Mingcong Song, and Yuting Dai
- Subjects
Speedup, Artificial neural network, Computer science, Dataflow, Locality, Locality of reference, Pattern recognition, Artificial intelligence, Convolutional neural network, Convolution
- Abstract
Recent years have seen an explosion of domain-specific accelerators for Convolutional Neural Networks (CNNs). Most prior CNN accelerators target neural networks for image recognition, such as AlexNet, VGG, GoogleNet, and ResNet. In this paper, we take a different route and study the acceleration of 3D CNNs, which are more computationally intensive than 2D CNNs and exhibit more opportunities. After characterizing representative 3D CNNs, we leverage differential convolution across the temporal dimension, which operates on the temporal delta of the input maps for each layer and processes the computation bit-serially using only the effectual bits of the temporal delta. To further exploit spatial and temporal locality, and to make the architecture general to all CNNs, we propose a control mechanism that dynamically switches between a spatial-delta dataflow and a temporal-delta dataflow. We call our design the temporal-spatial value aware accelerator (TSVA). Evaluation on a set of representative neural networks shows that TSVA achieves an average of 4.24× speedup and 1.42× energy efficiency. While we target 3D CNNs for video recognition, TSVA could also benefit other general CNNs for continuous batch processing.
- Published
- 2019
18. Eager pruning
- Author
-
Mingcong Song, Tao Li, Jiaqi Zhang, and Xiangru Chen
- Subjects
Speedup, Artificial neural network, Computer science, Computation, Big data, Machine learning, Redundancy (engineering), Hardware acceleration, Artificial intelligence, Performance improvement, Architecture
- Abstract
Today's big and fast data and changing circumstances require fast training of Deep Neural Networks (DNNs) in various applications. However, training a DNN with tons of parameters involves intensive computation. Enlightened by the fact that redundancy exists in DNNs and the observation that the ranking of the significance of the weights changes only slightly during training, we propose Eager Pruning, which speeds up DNN training by moving pruning to an early stage. Eager Pruning is supported by an algorithm and architecture co-design. The proposed algorithm dictates the architecture to identify and prune insignificant weights during training without accuracy loss. A novel architecture is designed to transform the reduced training computation into performance improvement. Our proposed Eager Pruning system gains an average of 1.91x speedup over a state-of-the-art hardware accelerator and 6.31x energy efficiency over NVIDIA GPUs.
- Published
- 2019
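A schematic of the eager-pruning idea in the abstract above, assuming simple magnitude-based pruning folded into the training loop; the gradient, the pruning schedule, and the threshold rule below are placeholders, and the paper's co-designed accelerator is not modeled.

```python
import numpy as np

# Because the ranking of weight magnitudes stabilizes early in training,
# insignificant weights can be pruned *during* training and their
# multiply-accumulates skipped from then on.

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))          # one toy layer
mask = np.ones_like(W, dtype=bool)       # True = weight still active

def prune_to(W, sparsity):
    """Mask keeping only the largest-magnitude (1 - sparsity) weights."""
    k = int(sparsity * W.size)
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    return np.abs(W) >= thresh

for step in range(1, 1001):
    grad = rng.normal(size=W.shape) * 0.01   # stand-in for a real gradient
    W -= 0.1 * grad
    W *= mask                                # pruned weights stay at zero
    if step % 200 == 0:                      # eager: prune while still training
        mask &= prune_to(W, sparsity=min(0.8, step / 1000))
        # A real system would now skip the masked multiply-accumulates in the
        # forward and backward passes -- that is where the training speedup is.

print("final density:", mask.mean())
```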
19. LerGAN: A Zero-Free, Low Data Movement and PIM-Based GAN Architecture
- Author
-
Jiwu Shu, Haiyu Mao, Tao Li, Yuting Dai, and Mingcong Song
- Subjects
Speedup, Computer architecture, Artificial neural network, Computer science, Computation, Unsupervised learning, Bottleneck
- Abstract
As a powerful unsupervised learning method, the Generative Adversarial Network (GAN) plays an important role in many domains such as video prediction and autonomous driving. It is one of the ten breakthrough technologies of 2018 reported in MIT Technology Review. However, training a GAN imposes three additional challenges: (1) intensive communication caused by the complex training phases of GAN, (2) many more ineffectual computations caused by its special convolutions, and (3) more frequent off-chip memory accesses for exchanging intermediate data between the generator and the discriminator. In this paper, we propose LerGAN, a PIM-based GAN accelerator, to address the challenges of training GAN. We first propose a zero-free data reshaping scheme for ReRAM-based PIM, which removes the zero-related computations. We then propose a 3D-connected PIM, which can reconfigure connections inside the PIM dynamically according to the dataflows of propagation and updating. Our proposed techniques reduce data movement to a great extent, preventing I/O from becoming a bottleneck of training GANs. Finally, we propose LerGAN based on these two techniques, providing programmers with different levels of GAN acceleration. Experiments show that LerGAN achieves 47.2x, 21.42x, and 7.46x speedup over an FPGA-based GAN accelerator, a GPU platform, and a ReRAM-based neural network accelerator, respectively. Moreover, LerGAN achieves 9.75x and 7.68x energy saving on average over the GPU platform and the ReRAM-based neural network accelerator, respectively, and consumes 1.04x the energy of the FPGA-based GAN accelerator.
- Published
- 2018
20. Prediction Based Execution on Deep Neural Networks
- Author
-
Tao Li, Jiechen Zhao, Jiaqi Zhang, Yang Hu, and Mingcong Song
- Subjects
Speedup, Computer science, Deep learning, Operand, Computer engineering, Parallel processing, Overhead (computing), Artificial intelligence, Throughput, Execution model, Block (data storage)
- Abstract
Recently, deep neural network based approaches have emerged as indispensable tools in many fields, ranging from image and video recognition to natural language processing. However, the large size of such newly developed networks poses both throughput and energy challenges to the underlying processing hardware. This could be the major stumbling block to many promising applications such as self-driving cars and smart cities. Existing work proposes to weed out zeros from input neurons to avoid unnecessary DNN computation (zero-valued operand multiplications). However, we observe that many output neurons are still ineffectual even after the zero-removal technique has been applied. These ineffectual output neurons cannot pass their values to the subsequent layer, which means all the computations (both zero-valued and non-zero-valued operand multiplications) related to these output neurons are futile and wasteful. Therefore, there is an opportunity to significantly improve the performance and efficiency of DNN execution by predicting the ineffectual output neurons and completely avoiding the futile computations by skipping over them. To do so, we propose a two-stage, prediction-based DNN execution model without accuracy loss. We also propose a uniform serial processing element (USPE) for both the prediction and execution stages to improve flexibility and minimize area overhead. To improve processing throughput, we further present a scale-out design for the USPE. Evaluation results over a set of state-of-the-art DNNs show that our proposed design achieves 2.5X speedup and 1.9X energy efficiency on average over the traditional accelerator. Moreover, by stacking with our design, we can improve Cnvlutin and Stripes by 1.9X and 2.0X on average, respectively.
- Published
- 2018
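The two-stage execution model in the abstract above can be sketched for a single fully connected layer followed by ReLU. The low-precision predictor below is a stand-in (it may mispredict, whereas the paper reports no accuracy loss), the USPE datapath is not modeled, and all sizes are arbitrary.

```python
import numpy as np

# Two-stage execution: a cheap low-precision pass guesses which
# pre-activations will be negative (and therefore zeroed by ReLU), and the
# precise pass computes only the rest.

rng = np.random.default_rng(1)
x = rng.normal(size=512).astype(np.float32)           # input activations
W = rng.normal(size=(512, 1024)).astype(np.float32)   # layer weights

def quantize(a, bits=4):
    """Crude symmetric per-tensor quantization used only for prediction."""
    scale = np.max(np.abs(a)) / (2 ** (bits - 1) - 1)
    return np.round(a / scale) * scale

# Stage 1: predict the sign of each output neuron with low-precision operands.
predicted = quantize(x) @ quantize(W)
keep = predicted > 0                       # neurons expected to survive ReLU

# Stage 2: full-precision computation only for the predicted-effectual neurons.
out = np.zeros(W.shape[1], dtype=np.float32)
out[keep] = x @ W[:, keep]
out = np.maximum(out, 0.0)

dense = np.maximum(x @ W, 0.0)
print("neurons skipped:", float(np.mean(~keep)))
print("sign mismatches:", float(np.mean((dense > 0) != (out > 0))))
```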
21. Exploiting Dynamic Thermal Energy Harvesting for Reusing in Smartphone with Mobile Applications
- Author
-
Huixiang Chen, Yuting Dai, Tao Li, Mingcong Song, and Benyong Liu
- Subjects
Battery (electricity), Thermoelectric cooling, Computer science, Reuse, Clean energy, Automotive engineering, Thermoelectric generator, Thermal, Cooling down
- Abstract
Recently, mobile applications have gradually become performance- and resource-intensive, which results in massive battery power drain and high surface temperature, and further degrades the user experience. Thus, high power consumption and surface overheating have become a severe challenge for smartphone design. In this paper, we propose DTEHR, a mobile Dynamic Thermal Energy Harvesting Reusing framework, to tackle this challenge. The approach is sustainable in that it generates energy using dynamic Thermoelectric Generators (TEGs). The generated energy not only powers Thermoelectric Coolers (TECs) for cooling down hot-spots, but also recharges micro-supercapacitors (MSCs) for extended smartphone usage. To analyze thermal characteristics and evaluate DTEHR across real-world applications, we build MPPTAT (Multi-comPonent Power and Thermal Analysis Tool), a power and thermal analysis tool for Android. The results show that DTEHR reduces the temperature differences between hot areas and cold areas by up to 15.4°C (internal) and 7°C (surface). With TEC-based hot-spot cooling, DTEHR reduces the temperature of the surface and internal hot-spots by an average of 8° and 12.8mW respectively. With dynamic TEGs, DTEHR generates 2.7-15mW of power, hundreds of times more than the TECs need to cool down hot-spots. Thus, the extra generated power can be stored in MSCs to prolong battery life.
- Published
- 2018
22. Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-Based Deep Learning
- Author
-
Huixiang Chen, Mingcong Song, Jiaqi Zhang, and Tao Li
- Subjects
Speedup, Computer science, Deep learning, Big data, Machine learning, Backpropagation, Microarchitecture, Memory management, Synchronization (computer science), Unsupervised learning, Artificial intelligence
- Abstract
Recently, deep learning based approaches have emerged as indispensable tools for big data analytics. Normally, deep learning models are first trained with a supervised method and then deployed to execute various tasks. The supervised method involves extensive human effort to collect and label large-scale datasets, which becomes impractical in the big data era, where raw data is largely unlabeled and uncategorized. Fortunately, adversarial learning, represented by the Generative Adversarial Network (GAN), enjoys great success in unsupervised learning. However, the distinct features of GAN, such as its massive computing phases and non-traditional convolutions, challenge existing deep learning accelerator designs. In this work, we propose the first holistic solution for accelerating unsupervised GAN-based deep learning. We overcome the above challenges with an algorithm and architecture co-design approach. First, we optimize the training procedure to reduce on-chip memory consumption. We then propose a novel time-multiplexed design to efficiently map the abundant computing phases to our microarchitecture. Moreover, we design high-efficiency dataflows to achieve high data reuse and skip the zero-operand multiplications in the non-traditional convolutions. Compared with traditional deep learning accelerators, our proposed design achieves the best performance (4.3X on average) with the same computing resources. Our design also has an average of 8.3X speedup over a CPU and 6.2X energy efficiency over an NVIDIA GPU.
- Published
- 2018
23. In-Situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems
- Author
-
Tao Li, Duo Liu, Jing Wang, Kan Zhong, Mingcong Song, Weigong Zhang, Yang Hu, and Jiaqi Zhang
- Subjects
Speedup, Computer science, Deep learning, Distributed computing, Dynamic data, Cloud computing, Data modeling, Task analysis, Artificial intelligence, Raw data
- Abstract
Recent years have seen an explosion of data volumes from a myriad of IoT devices, such as various sensors and ubiquitous cameras. The deluge of IoT data creates enormous opportunities for us to explore the physical world, especially with the help of deep learning techniques. Traditionally, the Cloud is the option for deploying deep learning based applications. However, the challenges of Cloud-centric IoT systems are increasing due to significant data movement overhead, escalating energy needs, and privacy issues. Rather than constantly moving a tremendous amount of raw data to the Cloud, it would be beneficial to leverage the emerging powerful IoT devices to perform the inference task. Nevertheless, a statically trained model cannot efficiently handle the dynamic data in real in-situ environments, which leads to low accuracy. Moreover, the big raw IoT data challenges the traditional supervised training method in the Cloud. To tackle the above challenges, we propose In-situ AI, the first autonomous and incremental computing framework and architecture for deep learning based IoT applications. We equip deep learning based IoT systems with autonomous IoT data diagnosis (to minimize data movement) and an incremental and unsupervised training method (to tackle the big raw IoT data generated in ever-changing in-situ environments). To provide efficient architectural support for this new computing paradigm, we first characterize the two In-situ AI tasks (i.e., inference and diagnosis) on two popular IoT devices (i.e., mobile GPU and FPGA) and explore the design space and tradeoffs. Based on the characterization results, we propose two working modes for the In-situ AI tasks, Single-running and Co-running. Moreover, we craft analytical models for these two modes to guide the best configuration selection. We also develop a novel two-level weight-shared In-situ AI architecture to efficiently deploy In-situ tasks to IoT nodes. Compared with traditional IoT systems, our In-situ AI can reduce data movement by 28-71%, which further yields 1.4X-3.3X speedup on model updates and contributes to a 30-70% energy saving.
- Published
- 2018
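A minimal sketch of the control flow the In-situ AI abstract above describes: diagnose each incoming sample, serve it with local inference when the model still looks competent, and queue it for incremental (unsupervised) retraining otherwise. The entropy-based diagnosis, the threshold, and the buffer size are invented placeholders, not the paper's method.

```python
import numpy as np

# Control-loop stand-in: diagnose each sample, answer locally when the model
# looks competent, and queue drifted samples for an incremental update.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def looks_in_distribution(probs, entropy_threshold=1.5):
    """Low predictive entropy -> the model still matches the in-situ data."""
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return entropy < entropy_threshold

retrain_buffer = []

def handle_sample(model_logits_fn, sample, buffer_limit=256):
    probs = softmax(model_logits_fn(sample))
    if not looks_in_distribution(probs):
        retrain_buffer.append(sample)        # flagged by the diagnosis step
        if len(retrain_buffer) >= buffer_limit:
            # Here the incremental/unsupervised update would run (the paper's
            # co-running mode); this sketch just drains the buffer.
            retrain_buffer.clear()
    return int(np.argmax(probs))             # local inference either way
```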
24. GaaS workload characterization under NUMA architecture for virtualized GPU
- Author
-
Tao Li, Meng Wang, Mingcong Song, Yang Hu, and Huixiang Chen
- Subjects
Computer science, Clock rate, Cloud computing, Parallel computing, Virtualization, Uncore, Server, Overhead (computing), General-purpose computing on graphics processing units, Frequency scaling
- Abstract
Graphics-as-a-Service (GaaS) is gaining popularity in the cloud computing community. There is an emerging trend of running GaaS workloads using virtualized GPUs in current data center deployments. This paper provides a detailed characterization of GaaS workloads under a virtualized GPU NUMA environment and finds that: (1) GaaS workloads exhibit different behavior from GPGPU workloads by having more frequent real-time data exchange between CPU and GPU; (2) GaaS workloads have no NUMA overhead, whether considering the influence of remote memory access or the resource contention of the CPU uncore. We also test the performance and power tradeoff among the frequency scaling of the CPU clock, the GPU core clock, and the GPU memory clock. Characterization results show that (1) ondemand CPU frequency scaling achieves the best balance between performance and power consumption; (2) GaaS workloads are GPU-computation intensive, and the GPU memory frequency can be set lower to save energy with little performance sacrifice.
- Published
- 2017
25. Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures
- Author
-
Huixiang Chen, Mingcong Song, Tao Li, and Yang Hu
- Subjects
Computer science, User satisfaction, Real-time computing, Feature extraction, Inference, Convolutional neural network, Support vector machine, Computer engineering, Server, Entropy (information theory)
- Abstract
Accelerating Convolutional Neural Networks (CNNs) on GPUs usually involves two stages: training and inference. Traditionally, this two-stage process is deployed on high-end GPU-equipped servers. Driven by the increase in compute power of desktop and mobile GPUs, there is growing interest in performing inference on various kinds of platforms. In contrast to the requirements of high throughput and accuracy during the training stage, end-users face diverse requirements for inference tasks. To address this emerging trend and these new requirements, we propose Pervasive CNN (P-CNN), a user satisfaction-aware CNN inference framework. P-CNN is composed of two phases: cross-platform offline compilation and run-time management. Based on users' requirements, offline compilation generates the optimal kernel using architecture-independent techniques, such as adaptive batch size selection and coordinated fine-tuning. The run-time management phase consists of accuracy tuning, execution, and calibration. First, accuracy tuning dynamically identifies the fastest kernels with acceptable accuracy. Next, the run-time kernel scheduler partitions the optimal computing resources for each layer and schedules the GPU thread blocks. If the accuracy is not acceptable to the end-user, the calibration stage selects a slower but more precise kernel to improve accuracy. Finally, we design a user satisfaction metric for CNNs to evaluate our Pervasive design. Our evaluation results show that P-CNN provides the best user satisfaction for different inference tasks.
- Published
- 2017
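The accuracy-tuning and calibration steps in the P-CNN abstract above can be sketched as a kernel-selection policy: pick the fastest pre-compiled kernel variant whose accuracy meets the user's requirement, and step up to a more precise variant if runtime calibration detects a shortfall. The kernel table and numbers below are entirely hypothetical.

```python
from dataclasses import dataclass

@dataclass
class KernelVariant:
    name: str
    latency_ms: float
    accuracy: float          # measured offline on a validation set

# Entirely hypothetical kernel table.
variants = [
    KernelVariant("fp16_batched",    4.1, 0.912),
    KernelVariant("fp32_tiled",      7.9, 0.921),
    KernelVariant("fp32_reference", 15.3, 0.923),
]

def select_kernel(min_accuracy):
    """Accuracy tuning: fastest variant that satisfies the user's requirement."""
    ok = [v for v in variants if v.accuracy >= min_accuracy]
    if not ok:
        return max(variants, key=lambda v: v.accuracy)   # best available
    return min(ok, key=lambda v: v.latency_ms)

def calibrate(current, observed_accuracy, min_accuracy):
    """Calibration: step up to a slower, more precise kernel on a shortfall."""
    if observed_accuracy >= min_accuracy:
        return current
    slower = [v for v in variants if v.latency_ms > current.latency_ms]
    return min(slower, key=lambda v: v.latency_ms) if slower else current

print(select_kernel(min_accuracy=0.92).name)   # -> fp32_tiled
```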
26. High-Quality 3-D InISAR Imaging of Maneuvering Target Based on a Combined Processing Approach
- Author
-
Yabo Liu, Yunkai Deng, Mingcong Song, Kun Wu, and Robert Wang
- Subjects
Synthetic aperture radar, Signal processing, Computer science, Inverse synthetic aperture radar, Interferometry, Radar imaging, Coherence (signal processing), Computer vision, Artificial intelligence, Electrical and Electronic Engineering
- Abstract
In order to enhance the target recognition probability in the inverse synthetic aperture radar (ISAR) imaging domain, the interferometric ISAR (InISAR) mode is presented to obtain the 3-D information of a target. However, real-data results for maneuvering targets are scarce because the signal processing is difficult. In this letter, a combined processing approach is proposed to realize high-quality 3-D imagery of a maneuvering target. In the approach, range alignment and phase adjustment are implemented together on the echoes to avoid destroying their coherence. Then, high-quality 3-D InISAR imagery of the maneuvering target can be obtained. Real-data results are provided to confirm the effectiveness of the proposed approach.
- Published
- 2013
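As a rough illustration of why range alignment must preserve echo phase for interferometric processing, the sketch below performs a standard cross-correlation range alignment and applies the estimated shift as a linear phase ramp in the range-frequency domain. This is a generic textbook step, not the combined range-alignment/phase-adjustment procedure proposed in the letter.

```python
import numpy as np

def align_echo(echo, reference):
    """Estimate the range shift of `echo` against `reference` from their
    envelopes, then remove it with a phase ramp so the complex phase of the
    echo (needed for interferometry) is preserved."""
    env_e, env_r = np.abs(echo), np.abs(reference)
    corr = np.fft.ifft(np.fft.fft(env_e) * np.conj(np.fft.fft(env_r)))
    lag = int(np.argmax(np.abs(corr)))       # circular lag estimate
    n = len(echo)
    if lag > n // 2:
        lag -= n                             # map to a signed shift
    ramp = np.exp(2j * np.pi * lag * np.fft.fftfreq(n))
    return np.fft.ifft(np.fft.fft(echo) * ramp), lag

# Self-check with a synthetic range profile shifted by 7 cells.
rng = np.random.default_rng(2)
ref = rng.normal(size=256) + 1j * rng.normal(size=256)
aligned, est = align_echo(np.roll(ref, 7), ref)
assert est == 7 and np.allclose(aligned, ref)
```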
27. Bridging the Semantic Gaps of GPU Acceleration for Scale-out CNN-based Big Data Processing
- Author
-
Tao Li, Yunlong Xu, Chao Li, Huixiang Chen, Yang Hu, Mingcong Song, and Jingling Yuan
- Subjects
Data processing, Artificial neural network, Computer science, Deep learning, Parallel computing, Video processing, Convolutional neural network, CUDA, Computer engineering, Scalability, Artificial intelligence, Performance improvement
- Abstract
Convolutional Neural Networks (CNNs) have substantially advanced the state-of-the-art accuracy of object recognition, which is the core function of a myriad of modern multimedia processing techniques such as image/video processing, speech recognition, and natural language processing. GPU-based accelerators have gained increasing attention because the large number of highly parallel neurons in a CNN naturally matches the GPU computation pattern. In this work, we perform comprehensive experiments to investigate the performance bottlenecks and overheads of current GPU acceleration platforms for scale-out CNN-based big data processing. In our characterization, we observe two significant semantic gaps: the framework gap, which lies between the CNN-based data processing workflow and the data processing manner of the distributed framework, and the standalone gap, which lies between the uneven computation loads at different CNN layers and the fixed computing capacity provisioning of the current GPU acceleration library. To bridge these gaps, we propose D3NN, a Distributed, Decoupled, and Dynamically tuned GPU acceleration framework for modern CNN architectures. In particular, D3NN features a novel analytical model that enables accurate time estimation of GPU-accelerated CNN processing with only 5-10% error. Our evaluation results show that the throughput of a standalone processing node using D3NN gains up to 3.7× performance improvement over the current standalone GPU acceleration platform. Our CNN-oriented GPU acceleration library with a built-in dynamic batching scheme achieves up to 1.5× performance improvement over the non-batching scheme and outperforms the state-of-the-art deep learning library by up to 28% (performance mode) to 67% (memory-efficient mode).
- Published
- 2016
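The built-in dynamic batching mentioned in the D3NN abstract above can be sketched as a simple accumulate-then-dispatch loop; the batch-size and timeout bounds below are placeholders for values the paper would derive from its analytical model.

```python
import queue
import time

MAX_BATCH  = 32       # placeholder; the paper derives batch sizes analytically
MAX_WAIT_S = 0.005    # placeholder dispatch timeout

def batching_loop(requests: "queue.Queue", run_batch_on_gpu):
    """Group incoming images until the batch is full or the timeout expires,
    then issue a single batched CNN invocation on the GPU."""
    while True:
        batch = [requests.get()]                     # block for the first item
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_batch_on_gpu(batch)
```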
28. Scheduling Tasks with Mixed Timing Constraints in GPU-Powered Real-Time Systems
- Author
-
Zhongzhi Luan, Lan Gao, Yunlong Xu, Tao Li, Depei Qian, Mingcong Song, and Rui Wang
- Subjects
Computer science, Real-time computing, Automotive industry, Workload, Turnaround time, Driving safety, Scheduling (computing), Automotive systems, Embedded system, Graphics
- Abstract
Due to the cost-effective, massive computational power of graphics processing units (GPUs), there is growing interest in utilizing GPUs in real-time systems. For example, GPUs have been applied to automotive systems to enable new advanced and intelligent driver assistance technologies, accelerating the path to self-driving cars. In such systems, GPUs are shared among tasks with mixed timing constraints: real-time (RT) tasks that have to be accomplished before specified deadlines, and non-real-time, best-effort (BE) tasks. In this paper, (1) we propose resource-aware non-uniform slack distribution to enhance the schedulability of RT tasks (the total amount of work of RT tasks whose deadlines can be satisfied on a given amount of resources) in GPU-enabled systems; and (2) we propose deadline-aware dynamic GPU partitioning to allow RT and BE tasks to run on a GPU simultaneously, so that BE tasks are not blocked for a long time. We evaluate the effectiveness of the proposed approaches using both synthetic benchmarks and a real-world workload that consists of a set of emerging automotive tasks. Experimental results show that the proposed approaches yield a significant schedulability improvement for RT tasks and turnaround time reduction for BE tasks. Moreover, the analysis of two driving scenarios shows that such schedulability improvement and turnaround time reduction can significantly enhance driving safety and experience. For example, when the resource-aware non-uniform slack distribution approach is used, the distance that a car travels between the moment a traffic sign (pedestrian) is seen and the moment it is recognized decreases from 44.4m to 22.2m (from 4.4m to 2.2m); when the deadline-aware dynamic GPU partitioning approach is used, the distance that the car travels before a drowsy driver is woken up is reduced from 56.2m to 29.2m.
- Published
- 2016
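The deadline-aware dynamic GPU partitioning idea in the abstract above can be sketched as choosing the smallest SM share that still meets an RT task's deadline and leaving the rest to BE tasks. The linear latency model and the numbers below are deliberately crude placeholders, not the paper's scheduler.

```python
TOTAL_SMS = 16   # hypothetical GPU size

def predicted_latency_ms(work_ms_on_one_sm: float, sms: int) -> float:
    """Deliberately crude model: perfect linear scaling from a 1-SM profile."""
    return work_ms_on_one_sm / sms

def partition_for(work_ms_on_one_sm: float, deadline_ms: float):
    """Smallest SM count that meets the deadline, or None if infeasible."""
    for sms in range(1, TOTAL_SMS + 1):
        if predicted_latency_ms(work_ms_on_one_sm, sms) <= deadline_ms:
            return sms
    return None

rt_sms = partition_for(work_ms_on_one_sm=80.0, deadline_ms=10.0)
be_sms = TOTAL_SMS - rt_sms       # what is left runs best-effort tasks
print(rt_sms, be_sms)             # -> 8 8
```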
29. Towards sustainable in-situ server systems in the big data era
- Author
-
Jingling Yuan, Longjun Liu, Tao Li, Yang Hu, Juncheng Gu, Mingcong Song, Chao Li, and Xiaoyao Liang
- Subjects
Power management, Server farm, Computer science, Distributed computing, Server, Real-time computing, Big data, Energy source, Raw data
- Abstract
Recent years have seen an explosion of data volumes from a myriad of distributed sources such as ubiquitous cameras and various sensors. The challenges of analyzing these geographically dispersed datasets are increasing due to the significant data movement overhead, time-consuming data aggregation, and escalating energy needs. Rather than constantly moving a tremendous amount of raw data to remote warehouse-scale computing systems for processing, it would be beneficial to leverage in-situ server systems (InS) to pre-process data, i.e., to bring computation to where the data is located. This paper takes the first step towards designing server clusters for data processing in the field. We investigate two representative in-situ computing applications, where data is normally generated in environmentally sensitive areas or remote places that lack established utility infrastructure. These very special operating environments of in-situ servers urge us to explore standalone (i.e., off-grid) systems that offer the opportunity to benefit from local, self-generated energy sources. In this work we implement a heavily instrumented proof-of-concept prototype called InSURE: in-situ server systems using renewable energy. We develop a novel energy buffering mechanism and a unique joint spatio-temporal power management strategy to coordinate standalone power supplies and in-situ servers. We present detailed deployment experiences to quantify how our design fits in-situ processing in the real world. Overall, InSURE yields 20%-60% improvements over a state-of-the-art baseline. It maintains impressive control effectiveness in under-provisioned environments and can economically scale along with the data processing needs. The proposed design is complementary to today's grid-connected cloud data centers and provides competitive cost-effectiveness.
- Published
- 2015
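A toy version of the supply/load coordination the InSURE abstract above describes: surplus renewable power charges the energy buffer, deficits are served from it, and the servers are throttled when the buffer runs low. All capacities, thresholds, and the supply/demand trace are made up; the paper's joint spatio-temporal strategy is not reproduced.

```python
BUFFER_CAPACITY_WH = 500.0   # made-up energy buffer size
LOW_WATERMARK_WH   = 100.0   # made-up throttling threshold

def step(buffer_wh, supply_w, demand_w, dt_h=1.0):
    """One control interval: return (new buffer level, power cap for servers)."""
    power_cap = demand_w
    if buffer_wh < LOW_WATERMARK_WH and supply_w < demand_w:
        power_cap = supply_w                 # throttle servers to the supply
    net_wh = (supply_w - power_cap) * dt_h   # surplus charges, deficit drains
    buffer_wh = min(max(buffer_wh + net_wh, 0.0), BUFFER_CAPACITY_WH)
    return buffer_wh, power_cap

buffer = 250.0
for supply, demand in [(400, 300), (150, 300), (120, 300), (20, 300)]:
    buffer, cap = step(buffer, supply, demand)
    print(f"supply={supply:3d}W demand={demand}W -> cap={cap:.0f}W buffer={buffer:.0f}Wh")
```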
30. Processing of SAR Data based on the Heterogeneous Architecture of GPU and CPU
- Author
-
Mingcong Song, Yabo Liu, Fengjun Zhao, Hongyu Li, and Robert Wang
- Subjects
Chirp scaling algorithm, Computer science, Computer graphics (images), Image processing and computer vision, Process (computing), Graphics, Architecture, General-purpose computing on graphics processing units, Image processing
- Abstract
SAR imaging is usually computationally burdensome, and it is difficult to produce SAR images in real time. Recently, the rapid increase in the performance of Graphics Processing Units (GPUs), coupled with powerful CPUs, has made the heterogeneous framework of GPU and CPU a compelling platform for computationally demanding tasks. This paper presents a parallel implementation of the Chirp Scaling algorithm on this heterogeneous architecture. Experimental results suggest that real-time imaging processing is possible for large-scale SAR data.
- Published
- 2013
31. Towards Sustainable In-Situ Server Systems in the Big Data Era.
- Author
-
Chao Li, Yang Hu, Longjun Liu, Juncheng Gu, Mingcong Song, Xiaoyao Liang, Jingling Yuan, and Tao Li
- Published
- 2015
- Full Text
- View/download PDF