192 results for "Caiwen Ding"
Search Results
102. FTDL: An FPGA-tailored Architecture for Deep Learning Systems.
- Author
-
Runbin Shi, Yuhao Ding, Xuechao Wei, Hang Liu 0001, Hayden Kwok-Hay So, and Caiwen Ding
- Published
- 2020
103. Efficient Recurrent Neural Networks using Structured Matrices in FPGAs.
- Author
-
Zhe Li 0001, Shuo Wang 0009, Caiwen Ding, Qinru Qiu, Yanzhi Wang, and Yun Liang 0001
- Published
- 2018
104. Poster
- Author
-
Bingyu Liu, Rujia Wang, Zhongjie Ba, Shanglin Zhou, Caiwen Ding, and Yuan Hong
- Published
- 2022
105. Deep Learning Tackles Temporal Predictions on Charging Loads of Electric Vehicles
- Author
-
Eugenia Cadete, Raul Alva, Albert Zhang, Caiwen Ding, Mimi Xie, Sara Ahmed, and Yufang Jin
- Published
- 2022
106. Design, Sensing, and Control of a Novel UAV Platform for Aerial Drilling and Screwing
- Author
-
Caiwen Ding, Caiwu Ding, Cong Wang, and Lu Lu
- Subjects
0209 industrial biotechnology ,Control and Optimization ,010504 meteorology & atmospheric sciences ,Computer science ,Biomedical Engineering ,02 engineering and technology ,Servomotor ,01 natural sciences ,law.invention ,Contact force ,020901 industrial engineering & automation ,Artificial Intelligence ,law ,Control theory ,Simulation ,0105 earth and related environmental sciences ,Drill ,Mechanical Engineering ,Robot end effector ,Computer Science Applications ,Human-Computer Interaction ,Impedance control ,Control and Systems Engineering ,Computer Vision and Pattern Recognition ,Robust control ,Servo - Abstract
Hole drilling and bolt screwing are frequently performed tasks in construction, decoration, and maintenance. Traditionally, sending human workers to perform these tasks in hard-to-reach locations is both dangerous and costly. In this letter, we present an aerial manipulation platform that allows a human user to remotely conduct omnidirectional drilling and screwing. The design of the platform features a quadrotor UAV with each pair of rotors independently tilted by a servo, forming an "H" configuration, on which a 1-DOF manipulator carrying a motorized drill or screwdriver is mounted. With such a design, the end-effector can face any direction on the longitudinal plane and exert a sufficiently large contact force for drilling and screwing without changing the vehicle body's orientation. Compared to previous UAVs that can only drill holes vertically into the ground, the proposed design also allows horizontal drilling/screwing into a wall or a cliff, making it suitable for a wide range of real-world applications. Based on the dynamic equations of the system, a dual-level control law is proposed. The low-level attitude controller uses adaptive robust control (ARC) to accurately regulate the attitude angles in the presence of force/torque uncertainties that may occur during the drilling and screwing process, while a selective impedance controller is implemented at the high level to indirectly control the contact force commanded by the user. In addition, a vision-based real-time target identification and tracking method, integrating a YOLO v3 real-time object detector with feature tracking and morphological operations, is developed to identify and track the target point for drilling and screwing specified by the user. Various in-lab experiments on a self-made prototype demonstrate the feasibility and effectiveness of the proposed approach for aerial drilling and screwing. (A minimal impedance-control sketch follows this entry.)
- Published
- 2021
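The selective-impedance idea above (the user commands a contact force that the controller realizes indirectly through target mass-spring-damper dynamics) can be illustrated with a minimal 1-DOF simulation. The sketch below is not the paper's controller: the impedance gains, wall model, and reference are hypothetical, and the ARC attitude loop and vision pipeline are omitted.

```python
import numpy as np

# Minimal 1-DOF impedance-control sketch (illustrative only, not the paper's
# ARC + selective impedance design). Target dynamics along the drill axis:
#   M*a + B*v + K*(x - x_ref) = f_ext
# so the end-effector behaves like a mass-spring-damper against the contact force.
M, B, K = 1.0, 8.0, 40.0            # hypothetical impedance parameters
dt, T = 0.001, 2.0                  # integration step and horizon [s]
x, v = 0.0, 0.0                     # tool position [m] and velocity [m/s]
x_ref = 0.05                        # commanded penetration reference [m] (made up)

def contact_force(x):
    """Very simple wall model: spring-like reaction once the tool touches the surface."""
    wall_pos, wall_stiffness = 0.02, 2000.0
    return -wall_stiffness * max(x - wall_pos, 0.0)

for _ in range(int(T / dt)):
    f_ext = contact_force(x)
    a = (f_ext - B * v - K * (x - x_ref)) / M   # impedance law solved for acceleration
    v += a * dt                                  # semi-implicit Euler integration
    x += v * dt

print(f"steady-state position: {x:.4f} m, contact force: {-contact_force(x):.2f} N")
```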
107. Pulse Truncation Enabled High Performance and Low Energy Memristor-based Accelerator
- Author
-
Zhiheng Liao, Jingyan Fu, Caiwen Ding, and Jinhui Wang
- Published
- 2022
108. Multi-source energy harvesting management and optimization for non-volatile processors.
- Author
-
Soroush Heidari, Caiwen Ding, Yongpan Liu, Yanzhi Wang, and Jingtong Hu
- Published
- 2015
109. Uncertainty Quantification of Collaborative Detection for Self-Driving
- Author
-
Sanbao Su, Yiming Li, Sihong He, Songyang Han, Chen Feng, Caiwen Ding, and Fei Miao
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Sharing information between connected and autonomous vehicles (CAVs) fundamentally improves the performance of collaborative object detection for self-driving. However, CAVs still face uncertainties in object detection due to practical challenges, which affect downstream modules in self-driving such as planning and control. Hence, uncertainty quantification is crucial for safety-critical systems such as CAVs. Our work is the first to estimate the uncertainty of collaborative object detection. We propose a novel uncertainty quantification method, called Double-M Quantification, which tailors a moving block bootstrap (MBB) algorithm with direct modeling of the multivariate Gaussian distribution of each corner of the bounding box. Our method captures both the epistemic and aleatoric uncertainty with one inference pass based on the offline Double-M training process, and it can be used with different collaborative object detectors. Through experiments on the comprehensive collaborative perception dataset, we show that our Double-M method achieves more than a 4X improvement in uncertainty score and more than a 3% accuracy improvement compared with state-of-the-art uncertainty quantification methods. Our code is available at https://coperception.github.io/double-m-quantification. (A generic bootstrap sketch follows this entry.) Comment: This paper has been accepted by the 2023 IEEE International Conference on Robotics and Automation (ICRA 2023)
- Published
- 2022
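As a rough illustration of the moving block bootstrap (MBB) idea that Double-M builds on, the sketch below resamples overlapping blocks of a correlated error sequence to estimate the spread of a statistic. It is a generic, textbook MBB, not the paper's Double-M training or its corner-wise Gaussian modeling; the series, block length, and statistic are arbitrary.

```python
import numpy as np

def moving_block_bootstrap(series, block_len, n_resamples, rng=None):
    """Generic moving block bootstrap (MBB): resample overlapping blocks of a
    sequence to preserve short-range temporal correlation. Textbook sketch only,
    not the paper's Double-M procedure."""
    rng = np.random.default_rng(rng)
    series = np.asarray(series)
    n = len(series)
    # All overlapping blocks of length block_len.
    blocks = np.array([series[i:i + block_len] for i in range(n - block_len + 1)])
    n_blocks_needed = int(np.ceil(n / block_len))
    resamples = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(blocks), size=n_blocks_needed)
        resamples.append(np.concatenate(blocks[idx])[:n])
    return np.array(resamples)

# Example: bootstrap the mean of a toy autocorrelated sequence of detection errors.
rng = np.random.default_rng(0)
errors = np.cumsum(rng.normal(size=200)) * 0.01
boot = moving_block_bootstrap(errors, block_len=20, n_resamples=500, rng=1)
print("bootstrap std of the mean error:", boot.mean(axis=1).std())
```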
110. A Unified DNN Weight Pruning Framework Using Reweighted Optimization Methods
- Author
-
Zheng Zhan, Shanglin Zhou, Tianyun Zhang, Xiaolong Ma, Caiwen Ding, Makan Fardad, and Yanzhi Wang
- Subjects
Constraint (information theory) ,Computer science ,business.industry ,Computation ,Bounded function ,Deep learning ,Data compression ratio ,Electronic design automation ,Pruning (decision trees) ,Artificial intelligence ,business ,Algorithm ,Regularization (mathematics) - Abstract
To address the large model size and intensive computation requirements of deep neural networks (DNNs), weight pruning techniques have been proposed; they generally fall into two categories, i.e., static regularization-based pruning and dynamic regularization-based pruning. However, the former currently suffers from either complex workloads or accuracy degradation, while the latter takes a long time to tune its parameters to achieve the desired pruning rate without accuracy loss. In this paper, we propose a unified DNN weight pruning framework with dynamically updated regularization terms bounded by the designated constraint. Our proposed method increases the compression rate, reduces the training time, and reduces the number of hyper-parameters compared with the state-of-the-art ADMM-based hard-constraint method. (A minimal reweighted-regularization sketch follows this entry.)
- Published
- 2021
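The core mechanism of the entry above, regularization penalties that are dynamically re-derived from the current weight magnitudes, can be sketched on a toy least-squares layer with reweighted-L1 proximal steps. Everything below (the loss, hyper-parameters, and the reweighting rule lam/(|w|+eps)) is an illustrative assumption, not the paper's formulation.

```python
import numpy as np

# Toy reweighted-L1 pruning sketch: a proximal-gradient (ISTA-style) solver where
# the per-weight penalty is re-derived from the current magnitudes each outer
# round, pushing already-small weights harder toward zero. Illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50); w_true[:5] = rng.normal(size=5)      # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(50)
lam, lr, eps = 0.05, 0.1, 1e-3
for outer in range(5):                                       # reweighting rounds
    alpha = np.full_like(w, lam) if outer == 0 else lam / (np.abs(w) + eps)
    for _ in range(500):                                     # inner optimization
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0.0)   # soft-threshold
print("nonzero weights kept:", int(np.count_nonzero(w)), "of", w.size)
```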
111. Real-time Multi-Object Tracking of Ion-irradiation Induced Defects in in situ TEM Videos
- Author
-
Rajat Sainju, Wei-Ying Chen, Samuel Schaefer, Qian Yang, Caiwen Ding, Meimei Li, and Yuanyuan Zhu
- Subjects
Instrumentation - Published
- 2022
112. Prediction of Electric Vehicles Charging Load Using Long Short-Term Memory Model
- Author
-
Eugenia Cadete, Mimi Xie, Sara Ahmed, Caiwen Ding, and Yu-Fang Jin
- Subjects
Long short term memory ,Electric power transmission ,Computer science ,Automotive engineering - Published
- 2021
113. E.T.
- Author
-
Hang Liu, Shiyang Chen, Guang R. Gao, Shaoyi Huang, Caiwen Ding, Santosh Pandey, Bingbing Li, and Long Zheng
- Subjects
Sequence ,Computer engineering ,Computer science ,business.industry ,Computation ,Deep learning ,Pruning (decision trees) ,Artificial intelligence ,Architecture ,business ,Turnaround time ,Transformer (machine learning model) ,Variety (cybernetics) - Abstract
Transformer-based deep learning models have become a ubiquitous vehicle to drive a variety of Natural Language Processing (NLP) related tasks beyond their accuracy ceiling. However, these models also suffer from two pronounced challenges, that is, gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which rE-thinks self-attention computation for Transformer models on GPUs with the following contributions: First, we introduce a novel self-attention architecture, which encompasses two tailored self-attention operators with corresponding sequence-length-aware and operation-reordering optimizations. Second, we present an attention-aware pruning design which judiciously uses various pruning algorithms to reduce computation and hence achieve a significantly shorter turnaround time. For the pruning algorithms, we not only revamp the existing pruning algorithms, but also tailor new ones for Transformer models. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERTBASE and DistilBERT, where E.T. presents superior performance over mainstream projects, including the popular Nvidia Enterprise solutions, i.e., TensorRT and FasterTransformer. (A baseline self-attention sketch follows this entry.)
- Published
- 2021
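For reference, the computation that E.T.-style kernels restructure and prune is ordinary scaled dot-product self-attention. The NumPy sketch below shows the baseline operator only; the sequence-length-aware reordering and attention-aware pruning from the paper are not reproduced, and the tensor sizes are arbitrary.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Standard scaled dot-product self-attention for one head. This is the
    baseline computation that optimized kernels target, not the paper's
    tailored operator."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)          # row-wise softmax
    return attn @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 128, 64, 32                # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                      # (128, 32)
```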
114. Exploration of Quantum Neural Architecture by Mixing Quantum Neuron Designs: (Invited Paper)
- Author
-
Zhepeng Wang, Zhiding Liang, Shanglin Zhou, Caiwen Ding, Yiyu Shi, and Weiwen Jiang
- Published
- 2021
115. Against Membership Inference Attack: Pruning is All You Need
- Author
-
Jinbo Bi, Zigeng Wang, Sanguthevar Rajasekaran, Caiwen Ding, Shanglin Zhou, Chenghong Wang, Hang Liu, and Yijue Wang
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer science ,business.industry ,Deep learning ,Machine Learning (stat.ML) ,Inference attack ,Machine learning ,computer.software_genre ,Machine Learning (cs.LG) ,Statistics - Machine Learning ,Deep neural networks ,Pruning algorithm ,Artificial intelligence ,Pruning (decision trees) ,business ,Mobile device ,Subnetwork ,computer ,Vulnerability (computing) - Abstract
The large model size, high computational cost, and vulnerability to membership inference attacks (MIA) have impeded the popularity of deep learning and deep neural networks (DNNs), especially on mobile devices. To address the challenge, we envision that the weight pruning technique will help DNNs against MIA while reducing model storage and computation. In this work, we propose a pruning algorithm, and we show that the proposed algorithm can find a subnetwork that prevents privacy leakage from MIA and achieves competitive accuracy with the original DNNs. We also verify our theoretical insights with experiments. Our experimental results illustrate that the attack accuracy using model compression is up to 13.6% and 10% lower than that of the baseline and the Min-Max game, respectively.
- Published
- 2021
116. Enabling Retrain-free Deep Neural Network Pruning Using Surrogate Lagrangian Relaxation
- Author
-
Lynn Pepin, Bingbing Li, Caiwen Ding, Mikhail A. Bragin, Shanglin Zhou, Fei Miao, and Deniz Gurevin
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,symbols.namesake ,Artificial neural network ,Lagrangian relaxation ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,symbols ,Pruning (decision trees) ,Algorithm ,Machine Learning (cs.LG) - Abstract
Network pruning is a widely used technique to reduce computation cost and model size for deep neural networks. However, the typical three-stage pipeline, i.e., training, pruning, and retraining (fine-tuning), significantly increases the overall training time. In this paper, we develop a systematic weight-pruning optimization approach based on Surrogate Lagrangian Relaxation (SLR), which is tailored to overcome difficulties caused by the discrete nature of the weight-pruning problem while ensuring fast convergence. We further accelerate the convergence of the SLR by using quadratic penalties. Model parameters obtained by SLR during the training phase are much closer to their optimal values than those obtained by other state-of-the-art methods. We evaluate the proposed method on image classification tasks, i.e., ResNet-18 and ResNet-50 using ImageNet, and ResNet-18, ResNet-50 and VGG-16 using CIFAR-10, as well as object detection tasks, i.e., YOLOv3 and YOLOv3-tiny using COCO 2014 and Ultra-Fast-Lane-Detection using the TuSimple lane detection dataset. Experimental results demonstrate that our SLR-based weight-pruning optimization approach achieves a higher compression rate than state-of-the-art methods under the same accuracy requirement. It also achieves high model accuracy even at the hard-pruning stage without retraining (reducing the traditional three-stage pruning to two stages). Given a limited budget of retraining epochs, our approach quickly recovers the model accuracy.
- Published
- 2021
117. Binary Complex Neural Network Acceleration on FPGA : (Invited Paper)
- Author
-
Scott Weitze, Minghu Song, Shanglin Zhou, Sahidul Islam, Tong Geng, Hang Liu, Jiaxin Li, Ang Li, Hongwu Peng, Mimi Xie, Caiwen Ding, and Wei Zhang
- Subjects
Complex data type ,Signal processing ,Memory management ,Computer engineering ,Artificial neural network ,Edge device ,Computer science ,Pruning (decision trees) ,Complex network ,Throughput (business) - Abstract
Being able to learn from complex data with phase information is imperative for many signal processing applications. Today's real-valued deep neural networks (DNNs) have shown efficiency in latent information analysis but fall short when applied to the complex domain. Deep complex networks (DCN), in contrast, can learn from complex data, but have high computational costs; therefore, they cannot satisfy the instant decision-making requirements of many deployable systems dealing with short observations or short signal bursts. Recently, the Binarized Complex Neural Network (BCNN), which integrates DCNs with binarized neural networks (BNN), has shown great potential in classifying complex data in real time. In this paper, we propose a structural-pruning-based accelerator for BCNN, which is able to provide more than 5000 frames/s inference throughput on edge devices. The high performance comes from both the algorithm and hardware sides. On the algorithm side, we apply structural pruning to the original BCNN models and obtain 20× pruning rates with negligible accuracy loss; on the hardware side, we propose a novel 2D convolution operation accelerator for the binary complex neural network. Experimental results show that the proposed design works at over 90% utilization and achieves inference throughputs of 5882 frames/s and 4938 frames/s for complex NIN-Net and ResNet-18 on the CIFAR-10 dataset and an Alveo U280 board. (A binarized complex arithmetic sketch follows this entry.)
- Published
- 2021
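The arithmetic a BCNN accelerator exploits can be sketched as a complex dot product whose weight real/imaginary parts are binarized to ±1, so every multiply degenerates into a sign flip and an addition. The sketch below shows only that core operation under assumed sizes; the scaling factors, layer structure, and the paper's 20× structural pruning are omitted.

```python
import numpy as np

# Illustrative sketch of a binarized complex dot product, the core operation a
# BCNN accelerator maps to sign flips and additions. Not the paper's design.
def binarize(x):
    return np.where(x >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
a = rng.normal(size=32) + 1j * rng.normal(size=32)        # complex activations
w = rng.normal(size=32) + 1j * rng.normal(size=32)        # full-precision weights

wb = binarize(w.real) + 1j * binarize(w.imag)             # binarized weights
# (ar + j*ai) * (wr + j*wi) with wr, wi in {-1, +1}: only adds/subtracts remain.
real = a.real * wb.real - a.imag * wb.imag
imag = a.real * wb.imag + a.imag * wb.real
y_bin = real.sum() + 1j * imag.sum()
y_full = np.sum(a * w)
print("binarized:", np.round(y_bin, 3), " full-precision:", np.round(y_full, 3))
```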
118. HMC-TRAN
- Author
-
Geng Yuan, Shusen Wang, Hongwu Peng, Shaoyi Huang, Caiwen Ding, Daniel Manu, Lei Yang, Shiyang Chen, Zhenglun Kong, and Hang Liu
- Subjects
Computer science ,business.industry ,Deep learning ,Computation ,05 social sciences ,Process (computing) ,010501 environmental sciences ,01 natural sciences ,Hierarchical database model ,0502 economics and business ,Pruning (decision trees) ,Artificial intelligence ,050207 economics ,business ,Algorithm ,0105 earth and related environmental sciences ,Sparse matrix ,Block (data storage) ,Transformer (machine learning model) - Abstract
Although Transformer-based deep learning models have been widely used in many natural language processing (NLP) tasks as well as computer vision, they suffer from gigantic model size and long latency. Network pruning can reduce the computational cost and model size. However, existing works mainly focus on irregular (sparse) pruning, which often causes irregular computation and extra indices per retained weight. In this work, we propose a Tensor-core-inspired hierarchical model compression method to push the performance limit on modern GPUs. We present two modes of the two-step process. In the first mode, we use a Tensor-core-aware block-based weight pruning method to exploit model sparsity in a coarse-grained manner and then use low-rank decomposition [33] to further reduce the weight storage in a fine-grained manner. In the second mode, we first use irregular pruning to achieve a highly sparse model and then apply the Tensor-core-aware weight constraint on the sparse model to decompose the sparse matrix into several smaller but Tensor-core-friendly sub-matrices. Experiments on Transformer and BERTBASE models show that the proposed method outperforms the state-of-the-art. (A block-pruning plus low-rank sketch follows this entry.)
- Published
- 2021
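The first mode described above (coarse-grained block pruning followed by a fine-grained low-rank step) can be mimicked on a random weight matrix as below. The block size, keep ratio, and rank are assumptions for illustration, and the Tensor-core tiling constraints of the paper are not modeled.

```python
import numpy as np

# Sketch of block pruning followed by a truncated-SVD low-rank factorization of
# what remains. Block size, keep ratio, and rank are illustrative assumptions.
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))
BLK, KEEP_RATIO, RANK = 16, 0.5, 32

# 1) Block pruning: score each BLKxBLK tile by its L2 norm, zero the weakest tiles.
tiles = W.reshape(128 // BLK, BLK, 128 // BLK, BLK)
scores = np.linalg.norm(tiles, axis=(1, 3))                      # (8, 8) tile scores
thresh = np.quantile(scores, 1.0 - KEEP_RATIO)
mask = (scores >= thresh)[:, None, :, None]                      # broadcast over tiles
W_pruned = (tiles * mask).reshape(128, 128)

# 2) Low-rank step: keep the top-RANK singular triplets of the pruned matrix.
U, S, Vt = np.linalg.svd(W_pruned, full_matrices=False)
W_lowrank = U[:, :RANK] @ np.diag(S[:RANK]) @ Vt[:RANK]
print("kept blocks:", int(mask.sum()), "of", scores.size,
      "; relative error:",
      np.linalg.norm(W_pruned - W_lowrank) / np.linalg.norm(W_pruned))
```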
119. Session details: Session 9B: Emerging Security Topics in Neural Networks
- Author
-
Caiwen Ding
- Subjects
Multimedia ,Artificial neural network ,Computer science ,Session (computer science) ,computer.software_genre ,computer - Published
- 2021
120. Co-Exploration of Graph Neural Network and Network-on-Chip Design Using AutoML
- Author
-
Caiwen Ding, Daniel Manu, Shaoyi Huang, and Lei Yang
- Subjects
Artificial neural network ,Computer science ,Distributed computing ,02 engineering and technology ,Recommender system ,Convolutional neural network ,020202 computer hardware & architecture ,Network on a chip ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Hardware acceleration ,Reinforcement learning ,Graph (abstract data type) - Abstract
Recently, Graph Neural Networks (GNNs) have exhibited high efficiency in several graph-based machine learning tasks. Compared with neural networks for computer vision or speech tasks (e.g., Convolutional Neural Networks), GNNs have much higher communication requirements due to the complicated graph structures; however, real-world applications of GNNs, such as recommender systems (e.g., Uber Eats), commonly impose real-time requirements. To deal with the tradeoff between the complicated architecture and the demanding timing performance, both the GNN architecture and the hardware accelerator need to be optimized. Network-on-Chip (NoC), designed to efficiently manage high volumes of communication, naturally becomes one of the top candidates to accelerate GNNs. However, there is a missing link between the optimization of the GNN architecture and the NoC design. In this work, we present an AutoML-based framework, GN-NAS, which searches for the optimal GNN architecture suited to the NoC accelerator. We devise a robust reinforcement-learning-based controller to validate the retained best GNN architectures, coupled with a parameter-sharing approach, namely ParamShare, to improve search efficiency. Experimental results on four graph-based benchmark datasets, Cora, Citeseer, Pubmed, and Protein-Protein Interaction, show that the GNN architectures obtained by our framework outperform state-of-the-art and baseline models while reducing model size, which makes them easy to deploy onto the NoC platform.
- Published
- 2021
121. Session details: Session 1A: VLSI for Machine Learning and Artificial Intelligence I
- Author
-
Caiwen Ding
- Subjects
Very-large-scale integration ,business.industry ,Computer science ,Artificial intelligence ,Session (computer science) ,business - Published
- 2021
122. HEIF: Highly Efficient Stochastic Computing-Based Inference Framework for Deep Neural Networks
- Author
-
Ruizhe Cai, Zhe Li, Xuehai Qian, Qinru Qiu, Yanzhi Wang, Ji Li, Jeffrey Draper, Jian Tang, Ao Ren, Caiwen Ding, and Bo Yuan
- Subjects
Adder ,Stochastic computing ,Computational complexity theory ,Computer science ,business.industry ,Deep learning ,Pipeline (computing) ,Activation function ,Rectifier (neural networks) ,Computer Graphics and Computer-Aided Design ,Convolutional neural network ,Reduction (complexity) ,Soft error ,Application-specific integrated circuit ,Computer engineering ,Artificial intelligence ,Electrical and Electronic Engineering ,business ,Throughput (business) ,Software ,Efficient energy use - Abstract
Deep convolutional neural networks (DCNNs) are one of the most promising deep learning techniques and have been recognized as the dominant approach for almost all recognition and detection tasks. The computation of DCNNs is memory-intensive due to large feature maps and neuron connections, and the performance highly depends on the capability of the hardware resources. With the recent trend of wearable devices and the Internet of Things, it becomes desirable to integrate DCNNs onto embedded and portable devices that require low power and energy consumption and small hardware footprints. Recently, SC-DCNN demonstrated that stochastic computing (SC), as a low-cost substitute for binary-based computing, radically simplifies the hardware implementation of arithmetic units and has the potential to satisfy the stringent power requirements of embedded devices. In SC, many arithmetic operations that are resource-consuming in binary designs can be implemented with very simple hardware logic, alleviating the extensive computational complexity. It offers a colossal design space for integration and optimization due to its reduced area and soft-error resiliency. In this paper, we present HEIF, a highly efficient SC-based inference framework for large-scale DCNNs, with broad applications including (but not limited to) LeNet-5 and AlexNet, that achieves high energy efficiency and low area/hardware cost. Compared to SC-DCNN, HEIF features: 1) the first (to the best of our knowledge) SC-based rectified linear unit activation function to catch up with recent advances in software models and mitigate degradation in application-level accuracy; 2) a redesigned approximate parallel counter and optimized stochastic multiplication using transmission gates and inverse mirror adders; and 3) a new optimization of weight storage using clustering. Most importantly, to achieve maximum energy efficiency while maintaining acceptable accuracy, HEIF considers holistic optimizations on the cascade connection of function blocks in the DCNN, the pipelining technique, and bit-stream length reduction. Experimental results show that in large-scale applications HEIF outperforms the previous SC-DCNN by 4.1× in throughput and up to 6.5× in area efficiency, and achieves up to 5.6× energy improvement. (A minimal stochastic-computing sketch follows this entry.)
- Published
- 2019
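A minimal stochastic-computing example helps make the "simple hardware logic" point concrete: with bipolar encoding, multiplying two values in [-1, 1] reduces to a single XNOR gate per bit of the streams. The sketch below shows only that primitive under an assumed stream length; HEIF's approximate parallel counters, activation design, and weight clustering are not reproduced.

```python
import numpy as np

# Bipolar stochastic-computing multiplication: encode values in [-1, 1] as random
# bit-streams with P(bit = 1) = (x + 1) / 2, then multiply with a per-bit XNOR.
rng = np.random.default_rng(0)
N = 4096                                  # bit-stream length (illustrative)

def to_stream(x):
    return (rng.random(N) < (x + 1) / 2).astype(np.uint8)

def from_stream(bits):
    return 2.0 * bits.mean() - 1.0

a, b = 0.6, -0.4
sa, sb = to_stream(a), to_stream(b)
product_stream = np.logical_not(np.logical_xor(sa, sb)).astype(np.uint8)  # XNOR
print("SC estimate:", round(from_stream(product_stream), 3), " exact:", a * b)
```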
123. Normalization and dropout for stochastic computing-based deep convolutional neural networks
- Author
-
Ao Ren, Bo Yuan, Jeffrey Draper, Zhe Li, Caiwen Ding, Zihao Yuan, Qinru Qiu, Ji Li, Yanzhi Wang, and Shahin Nazarian
- Subjects
Normalization (statistics) ,Stochastic computing ,Computer science ,business.industry ,Deep learning ,Feature extraction ,Pooling ,02 engineering and technology ,010501 environmental sciences ,Machine learning ,computer.software_genre ,Network topology ,01 natural sciences ,Convolutional neural network ,020202 computer hardware & architecture ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,Electrical and Electronic Engineering ,business ,computer ,Software ,Wearable technology ,0105 earth and related environmental sciences - Abstract
Recently, the Deep Convolutional Neural Network (DCNN) has been recognized as the most effective model for pattern recognition and classification tasks. With the fast-growing Internet of Things (IoT) and wearable devices, it becomes attractive to implement DCNNs in embedded and portable systems. However, novel computing paradigms are urgently required to deploy DCNNs, which have huge power consumption and complex topologies, in systems with limited area and power supply. Recent works have demonstrated that Stochastic Computing (SC) can radically simplify the hardware implementation of arithmetic units and has the potential to bring the success of DCNNs to embedded systems. This paper introduces normalization and dropout, which are essential techniques for state-of-the-art DCNNs, to existing SC-based DCNN frameworks. In this work, the feature extraction block of DCNNs is implemented using an approximate parallel counter, a near-max pooling block, and an SC-based rectified linear activation unit. A novel SC-based normalization design is proposed, which includes a square and summation unit, an activation unit, and a division unit. The dropout technique is integrated into the training phase and the learned weights are adjusted during the hardware implementation. Experimental results on AlexNet with the ImageNet dataset show that the SC-based DCNN with the proposed normalization and dropout techniques achieves a 3.26% top-1 accuracy improvement and a 3.05% top-5 accuracy improvement compared with the SC-based DCNN without these two essential techniques, confirming the effectiveness of our normalization and dropout designs.
- Published
- 2019
124. Reconfigurable Photovoltaic Systems for Electric Vehicles
- Author
-
Xue Lin, Yanzhi Wang, Hongjia Li, Weiwei Zheng, and Caiwen Ding
- Subjects
Computer science ,020209 energy ,Photovoltaic system ,Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing) ,02 engineering and technology ,021001 nanoscience & nanotechnology ,Automotive engineering ,Computer Science::Robotics ,Electricity generation ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,0210 nano-technology ,Software - Abstract
This article discusses the insights for the design and implementation of onboard PV systems for electric vehicles. It proposes novel and efficient approaches to maximize the performance of onboard PV systems. The proposed method has been extensively validated for industrial applications. — Xin Li, Carnegie Mellon University
- Published
- 2018
125. A Compression-Compilation Framework for On-mobile Real-time BERT Applications
- Author
-
Weiwen Jiang, Bin Ren, Yanzhi Wang, Sijia Liu, Caiwen Ding, Jiexiong Guan, Pu Zhao, Zhenglun Kong, Wei Niu, and Geng Yuan
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Speedup ,Computer science ,business.industry ,Computer Science - Artificial Intelligence ,Deep learning ,Latency (audio) ,Machine Learning (cs.LG) ,Resource (project management) ,Artificial Intelligence (cs.AI) ,Computer engineering ,Compression (functional analysis) ,Question answering ,Artificial intelligence ,business ,Mobile device ,Transformer (machine learning model) - Abstract
Transformer-based deep learning models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. In this paper, we propose a compression-compilation co-design framework that guarantees the identified model meets both the resource and real-time specifications of mobile devices. Our framework applies a compiler-aware neural architecture optimization method (CANAO), which can generate the optimal compressed model that balances both accuracy and latency. We are able to achieve up to 7.8x speedup compared with TensorFlow-Lite with only minor accuracy loss. We present two types of BERT applications on mobile devices: Question Answering (QA) and Text Generation. Both can be executed in real time with latency as low as 45 ms. Videos demonstrating the framework can be found at https://www.youtube.com/watch?v=_WIRvK_2PZI (arXiv admin note: substantial text overlap with arXiv:2009.06823)
- Published
- 2021
126. Structured representation in deep neural network systems
- Author
-
Caiwen Ding
- Published
- 2021
127. Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning
- Author
-
Shaoyi Huang, Ang Li, Shusen Wang, Hang Liu, Tong Geng, Hongwu Peng, Caiwen Ding, and Weiwen Jiang
- Subjects
Speedup ,Computer science ,02 engineering and technology ,Parallel computing ,010501 environmental sciences ,01 natural sciences ,Matrix multiplication ,020202 computer hardware & architecture ,Gate array ,0202 electrical engineering, electronic engineering, information engineering ,Pruning (decision trees) ,Field-programmable gate array ,0105 earth and related environmental sciences ,Transformer (machine learning model) ,Sparse matrix ,Block (data storage) - Abstract
Although Transformer-based language representations achieve state-of-the-art accuracy on various natural language processing (NLP) tasks, the large model size is challenging for resource-constrained computing platforms. Weight pruning, as a popular and effective technique for reducing the number of weight parameters and accelerating the Transformer, has been investigated on GPUs. However, Transformer acceleration using weight pruning on field-programmable gate arrays (FPGAs) remains unexplored. This paper investigates column-balanced block-wise pruning for the Transformer and designs an FPGA acceleration engine customized for the balanced block-wise matrix multiplication. We implement the Transformer model with proper hardware scheduling, and the experiments show that Transformer inference on the FPGA achieves 10.35 ms latency at a batch size of 32, a 10.96× speedup compared to the CPU platform and a 2.08× speedup compared to the GPU platform. (A minimal pruning sketch follows this entry.)
- Published
- 2021
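A rough sketch of column-balanced block-wise pruning: the weight matrix is split into row blocks and, inside each block, every column keeps the same number of largest-magnitude entries, so per-column workloads stay balanced for the hardware. The block height and keep count below are illustrative assumptions, and the FPGA scheduling from the paper is not modeled.

```python
import numpy as np

# Column-balanced block-wise pruning sketch: within each row block, keep the
# same number of largest-magnitude entries in every column. Illustrative sizes.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
BLOCK_ROWS, KEEP_PER_COL = 16, 4          # keep 4 of 16 entries per column block

mask = np.zeros_like(W, dtype=bool)
for r0 in range(0, W.shape[0], BLOCK_ROWS):
    block = np.abs(W[r0:r0 + BLOCK_ROWS])
    top = np.argsort(block, axis=0)[-KEEP_PER_COL:]   # largest rows per column
    cols = np.arange(W.shape[1])
    mask[r0 + top, cols] = True
W_pruned = W * mask
print("sparsity:", 1.0 - mask.mean(),
      "; nonzeros per column per block:",
      np.unique(mask.reshape(-1, BLOCK_ROWS, W.shape[1]).sum(axis=1)).tolist())
```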
128. An End-to-end Multi-task Object Detection using Embedded GPU in Autonomous Driving
- Author
-
Mimi Xie, Fei Miao, Shanglin Zhou, Caiwen Ding, and Yu-Fang Jin
- Subjects
Computer science ,business.industry ,Reliability (computer networking) ,Deep learning ,Real-time computing ,Point cloud ,02 engineering and technology ,Solid modeling ,010501 environmental sciences ,01 natural sciences ,Object detection ,Task (computing) ,End-to-end principle ,0202 electrical engineering, electronic engineering, information engineering ,Fuse (electrical) ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,0105 earth and related environmental sciences - Abstract
Autonomous driving has gained popularity due to its high reliability compared to human drivers. Autonomous vehicles combine a variety of sensors to perceive their surroundings and use deep learning (DL) to extract complicated information from the sensing data. However, there are several challenges: many DL models have explosive model sizes and are therefore both time- and power-consuming when implemented on on-vehicle embedded systems, further degrading battery life. Moreover, current on-board AI treats lane detection and car location separately. In this paper, we propose an end-to-end multi-task environment detection framework. We fuse the 3D point-cloud object detection model and the lane detection model, with a model compression technique applied. As on-board sensors forward information to the multi-task network, it not only parallelizes the two detection tasks to extract combined information but also reduces the overall running time of the DL model. Experiments show that, with the model compression technique, the running speed of the multi-task model improves by more than 2×. The running time of the lane detection model on an Nvidia Jetson TX2 is also almost 6× lower than on a CPU, which demonstrates the feasibility of using embedded AI computing devices on autonomous vehicles.
- Published
- 2021
129. Dancing along Battery: Enabling Transformer with Run-time Reconfigurability on Mobile Devices
- Author
-
Yiyu Shi, Bingbing Li, Caiwen Ding, Edwin H.-M. Sha, Panjie Qi, Yuhong Song, Weiwen Jiang, Sakyasingha Dasgupta, and Qingfeng Zhuge
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,business.industry ,Computer science ,Reconfigurability ,Machine Learning (cs.LG) ,Software ,Embedded system ,Reinforcement learning ,Pruning (decision trees) ,Frequency scaling ,business ,Mobile device ,Energy (signal processing) ,Transformer (machine learning model) - Abstract
A pruning-based AutoML framework for run-time reconfigurability, namely RT3, is proposed in this work. It enables Transformer-based large Natural Language Processing (NLP) models to be efficiently executed on resource-constrained mobile devices and reconfigured (i.e., switching models for dynamic hardware conditions) at run-time. Such reconfigurability is key to saving energy for battery-powered mobile devices, which widely use the dynamic voltage and frequency scaling (DVFS) technique for hardware reconfiguration to prolong battery life. In this work, we explore a hybrid of block-structured pruning (BP) and pattern pruning (PP) for Transformer-based models and make a first attempt to combine hardware and software reconfiguration to maximally save energy for battery-powered mobile devices. Specifically, RT3 integrates two-level optimizations: first, it utilizes an efficient BP as the first-step compression for resource-constrained mobile devices; then, RT3 heuristically generates a shrunken search space based on the first-level optimization and searches multiple pattern sets with diverse sparsity for PP via reinforcement learning to support lightweight software reconfiguration, which corresponds to the available frequency levels of DVFS (i.e., hardware reconfiguration). At run-time, RT3 can switch the lightweight pattern sets within 45 ms to guarantee the required real-time constraint at different frequency levels. Results further show that RT3 can prolong battery life by over 4× with less than 1% accuracy loss for Transformer and a 1.5% score decrease for DistilBERT.
- Published
- 2021
130. TinyADC: Peripheral Circuit-aware Weight Pruning Framework for Mixed-signal DNN Accelerators
- Author
-
Yanzhi Wang, Zhengang Li, Jieren Deng, Jinhui Wang, Caiwen Ding, Ali Shafiee, Zhiheng Liao, Yuxuan Cai, Geng Yuan, Payman Behnam, Mahdi Nazm Bojnordi, Xiaolong Ma, and Jingyan Fu
- Subjects
Artificial neural network ,business.industry ,Computer science ,Overhead (computing) ,Mixed-signal integrated circuit ,Pruning (decision trees) ,Crossbar switch ,business ,Throughput (business) ,Computer hardware ,Electronic circuit ,Power (physics) - Abstract
As the number of weight parameters in deep neural networks (DNNs) continues to grow, the demand for ultra-efficient DNN accelerators has motivated research on non-traditional architectures with emerging technologies. The Resistive Random-Access Memory (ReRAM) crossbar has been utilized to perform in-situ matrix-vector multiplication for DNNs. DNN weight pruning techniques have also been applied to ReRAM-based mixed-signal DNN accelerators, focusing on reducing weight storage and accelerating computation. However, existing works capture very few peripheral-circuit features, such as analog-to-digital converters (ADCs), during neural network design. Unfortunately, ADCs have become the main contributors to the power consumption and area cost of current mixed-signal accelerators, and the large overhead of these peripheral circuits has not been addressed efficiently. To address this problem, we propose a novel weight pruning framework for ReRAM-based mixed-signal DNN accelerators, named TINYADC, which effectively reduces the number of bits required for ADC resolution and hence the overall area and power consumption of the accelerator, without introducing any computational inaccuracy. Compared to state-of-the-art pruning work on the ImageNet dataset, TINYADC achieves 3.5× and 2.9× power and area reduction, respectively. The TINYADC framework improves the throughput of a state-of-the-art architecture design by 29% and 40% in terms of throughput per square millimeter and per watt (GOPs/(s·mm²) and GOPs/W), respectively.
- Published
- 2021
131. Tracking and Understanding Nanocatalyst Sintering and Regeneration using Deep Learning-assisted In Situ Environmental TEM
- Author
-
Yuanyuan Zhu, Caiwen Ding, Steven L. Suib, and Rajat Sainju
- Subjects
In situ ,Materials science ,business.industry ,Regeneration (biology) ,Deep learning ,Sintering ,Nanotechnology ,Artificial intelligence ,Tracking (particle physics) ,business ,Instrumentation - Published
- 2021
132. A DNN Compression Framework for SOT-MRAM-based Processing-In-Memory Engine
- Author
-
Jieren Deng, Xiaolong Ma, Sheng Lin, Geng Yuan, Zhengang Li, and Caiwen Ding
- Subjects
Magnetoresistive random-access memory ,Hardware_MEMORYSTRUCTURES ,Computer engineering ,CMOS ,Computer science ,Data compression ratio ,Frame rate ,Standby power ,Quantization (image processing) ,Power (physics) ,Efficient energy use - Abstract
The computing wall and data movement challenges of deep neural networks (DNNs) have exposed the limitations of conventional CMOS-based DNN accelerators. Furthermore, the deep structure and large model size make DNNs prohibitive for embedded systems and IoT devices, where low power consumption is required. To address these challenges, spin-orbit torque magnetic random-access memory (SOT-MRAM) and SOT-MRAM-based Processing-In-Memory (PIM) engines have been used to reduce the power consumption of DNNs, since SOT-MRAM offers near-zero standby power, high density, and non-volatility. However, drawbacks of SOT-MRAM-based PIM engines, such as high write latency and the need for low bit-width data, limit their appeal as favorable energy-efficient DNN accelerators. To mitigate these drawbacks, we propose an ultra-energy-efficient framework that uses model compression techniques, including weight pruning and quantization, at the software level while considering the architecture of the SOT-MRAM PIM. We incorporate the alternating direction method of multipliers (ADMM) into the training phase to further guarantee solution feasibility and satisfy SOT-MRAM hardware constraints. Thus, the footprint and power consumption of the SOT-MRAM PIM can be reduced while the overall system performance rate (frames per second) increases, making our proposed ADMM-based SOT-MRAM PIM more energy-efficient and suitable for embedded systems or IoT devices. Our experimental results show that the accuracy and compression rate of the proposed framework consistently outperform the reference works, while the efficiency (area and power) and performance rate of the SOT-MRAM PIM engine are significantly improved. (A generic ADMM pruning sketch follows this entry.)
- Published
- 2020
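The ADMM-based pruning mentioned above alternates a loss-driven weight update, a projection onto the sparsity constraint, and a dual update. The sketch below runs that loop on a toy least-squares problem; the loss, sparsity level, and hyper-parameters are assumptions, and the SOT-MRAM quantization constraints are omitted.

```python
import numpy as np

# Generic ADMM weight-pruning sketch (illustrative only): W-update by gradient
# steps on loss + (rho/2)*||W - Z + U||^2, Z-update by projecting W + U onto a
# top-k sparsity constraint, then a dual update.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
w_star = rng.normal(size=100) * (rng.random(100) < 0.1)      # sparse target
y = X @ w_star

def project_topk(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

k, rho, lr = 10, 1e-2, 0.05
W, Z, U = np.zeros(100), np.zeros(100), np.zeros(100)
for _ in range(30):                                          # ADMM outer iterations
    for _ in range(50):                                      # W-update
        grad = X.T @ (X @ W - y) / len(y) + rho * (W - Z + U)
        W -= lr * grad
    Z = project_topk(W + U, k)                               # Z-update (projection)
    U = U + W - Z                                            # dual update
W_final = project_topk(W, k)                                 # hard pruning at the end
print("nonzeros kept:", int(np.count_nonzero(W_final)),
      "; relative fit error:", np.linalg.norm(X @ W_final - y) / np.linalg.norm(y))
```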
133. FTRANS
- Author
-
Santosh Pandey, Lipeng Wan, Caiwen Ding, Hang Liu, Bingbing Li, Ji Li, Yanjun Lyv, Jieyang Chen, Mimi Xie, and Haowen Fang
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer science ,business.industry ,Deep learning ,Computation ,020208 electrical & electronic engineering ,02 engineering and technology ,Machine Learning (cs.LG) ,020202 computer hardware & architecture ,law.invention ,C.1.4 ,Recurrent neural network ,Fpga design ,Computer Science - Distributed, Parallel, and Cluster Computing ,Computer engineering ,Gate array ,law ,0202 electrical engineering, electronic engineering, information engineering ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Artificial intelligence ,Field-programmable gate array ,Transformer ,business ,Efficient energy use - Abstract
In natural language processing (NLP), the "Transformer" architecture was proposed as the first transduction model relying entirely on self-attention mechanisms without using sequence-aligned recurrent neural networks (RNNs) or convolution, and it achieved significant improvements for sequence-to-sequence tasks. The intensive computation and storage of these pre-trained language representations have impeded their adoption on computation- and memory-constrained devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms for its high parallelism and low latency. However, the trained models are still too large to fit on an FPGA fabric. In this paper, we propose an efficient acceleration framework, Ftrans, for Transformer-based large-scale language representations. Our framework includes an enhanced block-circulant matrix (BCM)-based weight representation to enable model compression on large-scale language representations at the algorithm level with little accuracy degradation, and an acceleration design at the architecture level. Experimental results show that our proposed framework significantly reduces the model size of NLP models by up to 16 times. Our FPGA design achieves 27.07x and 81x improvement in performance and energy efficiency compared to CPU, and up to 8.80x improvement in energy efficiency compared to GPU. (A block-circulant sketch follows this entry.)
- Published
- 2020
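The block-circulant matrix (BCM) representation at the heart of the entry above stores one index vector per block and evaluates each block's matrix-vector product in the FFT domain. The sketch below checks that equivalence on random data; the block size and shapes are arbitrary, and the paper's FPGA datapath is not modeled.

```python
import numpy as np

# Block-circulant matrix-vector product sketch (illustrative sizes): each b-by-b
# block is a circulant matrix defined by one length-b index vector, so the block
# matvec can be done as an FFT-domain element-wise multiply.
rng = np.random.default_rng(0)
b, p, q = 8, 4, 4                                   # block size, block-rows, block-cols
index_vecs = rng.normal(size=(p, q, b))             # one defining vector per block
x = rng.normal(size=q * b)

def circulant(c):
    # Column j is c rolled down by j, so C[i, j] = c[(i - j) mod b].
    return np.stack([np.roll(c, j) for j in range(len(c))], axis=1)

# Reference path: build the dense block-circulant matrix explicitly.
W = np.block([[circulant(index_vecs[i, j]) for j in range(q)] for i in range(p)])
y_ref = W @ x

# FFT path: y_i = sum_j IFFT( FFT(w_ij) * FFT(x_j) ), one small FFT per block.
xb = x.reshape(q, b)
y = np.zeros((p, b))
for i in range(p):
    acc = np.zeros(b, dtype=complex)
    for j in range(q):
        acc += np.fft.fft(index_vecs[i, j]) * np.fft.fft(xb[j])
    y[i] = np.fft.ifft(acc).real
print("max abs difference vs. dense:", float(np.abs(y.reshape(-1) - y_ref).max()))
```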
134. FTDL: A Tailored FPGA-Overlay for Deep Learning with High Scalability
- Author
-
Hang Liu, Runbin Shi, Caiwen Ding, Xuechao Wei, Yuhao Ding, He Li, and Hayden K.-H. So
- Subjects
010302 applied physics ,Artificial neural network ,business.industry ,Computer science ,Deep learning ,02 engineering and technology ,Overlay ,01 natural sciences ,020202 computer hardware & architecture ,Convolution ,Computer architecture ,0103 physical sciences ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,business ,Field-programmable gate array ,Electrical efficiency - Abstract
Fast inference is of paramount value to a wide range of deep learning applications. This work presents FTDL, a highly scalable FPGA overlay framework for deep learning applications, to address the architecture and hardware mismatch faced by traditional efforts. The FTDL overlay is specifically optimized for the tiled structure of FPGAs, thereby achieving post-place-and-route operating frequencies exceeding 88% of the theoretical maximum across different devices and design scales. A flexible compilation framework efficiently schedules the matrix-multiply and convolution operations of large neural network inference on the overlay and achieves over 80% hardware efficiency on average. Taking advantage of both high operating frequency and hardware efficiency, FTDL achieves 402.6 and 151.2 FPS with GoogLeNet and ResNet50 on ImageNet, respectively, while operating at a power efficiency of 27.6 GOPS/W, making it up to 7.7× faster and 1.9× more power-efficient than the state-of-the-art.
- Published
- 2020
135. Towards an Efficient and General Framework of Robust Training for Graph Neural Networks
- Author
-
Bhavya Kailkhura, Sijia Liu, Kaidi Xu, Caiwen Ding, Mengshu Sun, Xue Lin, and Pin-Yu Chen
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Graph neural networks ,Computer science ,Inference ,Machine Learning (stat.ML) ,02 engineering and technology ,010501 environmental sciences ,Machine learning ,computer.software_genre ,01 natural sciences ,Training (civil) ,Machine Learning (cs.LG) ,Statistics - Machine Learning ,Robustness (computer science) ,0202 electrical engineering, electronic engineering, information engineering ,Greedy algorithm ,0105 earth and related environmental sciences ,business.industry ,020206 networking & telecommunications ,Graph ,Scalability ,Graph (abstract data type) ,Artificial intelligence ,business ,computer - Abstract
Graph Neural Networks (GNNs) have made significant advances on several fundamental inference tasks. As a result, there is a surge of interest in using these models for making potentially important decisions in high-regret applications. However, despite GNNs' impressive performance, it has been observed that carefully crafted perturbations on graph structures (or node attributes) lead them to make wrong predictions. The presence of these adversarial examples raises serious security concerns. Most of the existing robust GNN design/training methods are only applicable to white-box settings where model parameters are known and gradient-based methods can be used by performing convex relaxation of the discrete graph domain. More importantly, these methods are neither efficient nor scalable, which makes them infeasible for time-sensitive tasks and massive graph datasets. To overcome these limitations, we propose a general framework which leverages greedy search algorithms and zeroth-order methods to obtain robust GNNs in a generic and efficient manner. On several applications, we show that the proposed techniques are significantly less computationally expensive and, in some cases, more robust than state-of-the-art methods, making them suitable for large-scale problems that were out of reach for traditional robust training methods. (A zeroth-order gradient sketch follows this entry.) Comment: Accepted by ICASSP 2020
- Published
- 2020
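The "zeroth-order methods" referenced above estimate gradients from function evaluations only, which is what makes them usable on discrete or black-box graph objectives. The sketch below shows the random-direction finite-difference estimator on a toy quadratic; the objective, step sizes, and smoothing parameter are assumptions, and no GNN is involved.

```python
import numpy as np

# Zeroth-order gradient estimation via random-direction finite differences,
# applied to a toy quadratic objective (illustrative only).
rng = np.random.default_rng(0)

def loss(theta):
    return 0.5 * np.sum((theta - 3.0) ** 2)          # toy black-box objective

def zo_gradient(loss_fn, theta, n_dirs=20, mu=1e-3):
    g = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.normal(size=theta.size)
        g += (loss_fn(theta + mu * u) - loss_fn(theta)) / mu * u
    return g / n_dirs

theta = np.zeros(5)
for _ in range(200):                                  # zeroth-order gradient descent
    theta -= 0.05 * zo_gradient(loss, theta)
print("estimate (should be near 3):", np.round(theta, 2))
```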
136. A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration Framework
- Author
-
Zhengang Li, Wei Niu, Bin Ren, Yanzhi Wang, Zheng Zhan, Wenhao Wang, Yifan Gong, Xiaolong Ma, Xiaolin Xu, Caiwen Ding, and Xue Lin
- Subjects
FOS: Computer and information sciences ,Information privacy ,Computer Science - Machine Learning ,Speedup ,Edge device ,Computer science ,Computer Science - Artificial Intelligence ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Inference ,Machine Learning (stat.ML) ,010501 environmental sciences ,computer.software_genre ,01 natural sciences ,Synthetic data ,Machine Learning (cs.LG) ,03 medical and health sciences ,0302 clinical medicine ,Statistics - Machine Learning ,Pruning (decision trees) ,Neural and Evolutionary Computing (cs.NE) ,0105 earth and related environmental sciences ,Process (computing) ,Computer Science - Neural and Evolutionary Computing ,Artificial Intelligence (cs.AI) ,030220 oncology & carcinogenesis ,Compiler ,Data mining ,computer - Abstract
Weight pruning of deep neural networks (DNNs) has been proposed to satisfy the limited storage and computing capability of mobile edge devices. However, previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data. To mitigate this concern, we propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset. At the algorithm level of the proposed framework, a systematic weight pruning technique based on the alternating direction method of multipliers (ADMM) is designed to iteratively solve the pattern-based pruning problem for each layer with randomly generated synthetic data. In addition, corresponding optimizations at the compiler level are leveraged for inference acceleration on devices. With the proposed framework, non-expert users can avoid the time-consuming pruning process and directly benefit from compressed models. Experimental results show that the proposed framework outperforms three state-of-the-art end-to-end DNN frameworks, i.e., TensorFlow-Lite, TVM, and MNN, with speedups of up to 4.2×, 2.5×, and 2.0×, respectively, with almost no accuracy loss, while preserving data privacy.
- Published
- 2020
137. An Efficient Deep Reinforcement Learning Framework for UAVs
- Author
-
Lu Lu, Bingbing Li, Caiwu Ding, Caiwen Ding, and Shanglin Zhou
- Subjects
Computer science ,05 social sciences ,Real-time computing ,Control (management) ,010501 environmental sciences ,01 natural sciences ,Attitude control ,Sight ,Dimension (vector space) ,Control theory ,0502 economics and business ,Robot ,Reinforcement learning ,Point (geometry) ,050207 economics ,0105 earth and related environmental sciences - Abstract
3D dynamic simulators such as Gazebo have become a popular substitute for real-world unmanned aerial vehicle (UAV) flights because of their user-friendliness. Well-functioning algorithms on the UAV controller are then needed for guidance, navigation, and control during autonomous navigation. Deep reinforcement learning (DRL) comes into sight thanks to its well-known self-learning capability. This goal-oriented algorithm can learn how to attain a complex objective or maximize a reward along a particular dimension over many steps. In this paper, we propose a general framework to incorporate DRL into a UAV simulation environment. The whole system consists of the DRL algorithm for attitude control, a packaging algorithm on the Robot Operating System (ROS) to connect DRL with the PX4 controller, and a Gazebo simulator that emulates the real-world environment. Experimental results demonstrate the effectiveness of the proposed framework.
- Published
- 2020
138. Session details: Session: High-Level Abstractions and Tools I
- Author
-
Caiwen Ding
- Subjects
Computer science ,Programming language ,Session (computer science) ,Field-programmable gate array ,computer.software_genre ,computer - Published
- 2020
139. FTDL: An FPGA-tailored Architecture for Deep Learning Systems
- Author
-
Hang Liu, Caiwen Ding, Hayden K.-H. So, Yuhao Ding, Xuechao Wei, and Runbin Shi
- Subjects
Flexibility (engineering) ,Computer science ,business.industry ,computer.software_genre ,Computer architecture ,Scalability ,Benchmark (computing) ,Hardware acceleration ,Compiler ,Field-programmable gate array ,business ,computer ,Digital signal processing ,Efficient energy use - Abstract
Hardware acceleration of deep learning (DL) systems has been increasingly studied to achieve desirable performance and energy efficiency. The FPGA strikes a balance between high energy efficiency and a fast development cycle, and therefore is widely used as a DNN accelerator. However, there exists an architecture-layout mismatch in current designs, which introduces scalability and flexibility issues, leading to irregular routing and resource imbalance problems. To address these limitations, in this work we propose FTDL, an FPGA-tailored architecture with parameterized and hierarchical hardware that adapts to different FPGA devices. FTDL has the following novelties: (i) At the architecture level, FTDL consists of Tiled Processing Elements (TPEs) and super blocks, achieving a near-theoretical digital signal processing (DSP) operating frequency of 650 MHz. More importantly, FTDL is configurable and delivers good scalability, i.e., the timing is stabilized even when the design is scaled up to 100% resource utilization for different deep learning systems. (ii) In workload compilation, FTDL provides a compiler that maps DL workloads onto the architecture in an optimal manner. Experimental results show that for most benchmark layers in MLPerf, FTDL achieves over 80% hardware efficiency.
- Published
- 2020
140. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning
- Author
-
Hang Liu, Ji Li, Tianyun Zhang, Caiwen Ding, Bingbing Li, Zhengang Li, and Zhenglun Kong
- Subjects
FOS: Computer and information sciences ,010302 applied physics ,Computer Science - Machine Learning ,Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,business.industry ,Computer science ,Data compression ratio ,010501 environmental sciences ,01 natural sciences ,Machine Learning (cs.LG) ,Artificial Intelligence (cs.AI) ,0103 physical sciences ,Language model ,business ,Computation and Language (cs.CL) ,Computer hardware ,0105 earth and related environmental sciences ,Transformer (machine learning model) - Abstract
Pre-trained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. However, the limited weight storage and computational speed of hardware platforms have impeded the popularity of pre-trained models, especially in the era of edge computing. In this work, we propose an efficient Transformer-based large-scale language representation using hardware-friendly block-structured pruning. We incorporate the reweighted group Lasso into block-structured pruning for optimization. Besides the significantly reduced weight storage and computation, the proposed approach achieves high compression rates. Experimental results on different models (BERT, RoBERTa, and DistilBERT) on the General Language Understanding Evaluation (GLUE) benchmark tasks show that we achieve up to a 5.0x compression rate with zero or minor accuracy degradation on certain tasks. Our proposed method is also orthogonal to existing compact pre-trained language models such as DistilBERT that use knowledge distillation, since a further 1.79x average compression rate can be achieved on top of DistilBERT with zero or minor accuracy degradation. The final compressed model is suitable for deployment on resource-constrained edge devices. Comment: Accepted to Findings of EMNLP 2020
- Published
- 2020
141. A Multi-Agent Reinforcement Learning Approach For Safe and Efficient Behavior Planning Of Connected Autonomous Vehicles
- Author
-
Songyang Han, Shanglin Zhou, Jiangwei Wang, Lynn Pepin, Caiwen Ding, Jie Fu, and Fei Miao
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,FOS: Electrical engineering, electronic engineering, information engineering ,Systems and Control (eess.SY) ,Electrical Engineering and Systems Science - Systems and Control ,Machine Learning (cs.LG) - Abstract
The recent advancements in wireless technology enable connected autonomous vehicles (CAVs) to gather information about their environment by vehicle-to-vehicle (V2V) communication. In this work, we design an information-sharing-based multi-agent reinforcement learning (MARL) framework for CAVs, to take advantage of the extra information when making decisions to improve traffic efficiency and safety. The safe actor-critic algorithm we propose has two new techniques: the truncated Q-function and safe action mapping. The truncated Q-function utilizes the shared information from neighboring CAVs such that the joint state and action spaces of the Q-function do not grow in our algorithm for a large-scale CAV system. We prove the bound of the approximation error between the truncated-Q and global Q-functions. The safe action mapping provides a provable safety guarantee for both the training and execution based on control barrier functions. Using the CARLA simulator for experiments, we show that our approach can improve the CAV system's efficiency in terms of average velocity and comfort under different CAV ratios and different traffic densities. We also show that our approach avoids the execution of unsafe actions and always maintains a safe distance from other vehicles. We construct an obstacle-at-corner scenario to show that the shared vision can help CAVs to observe obstacles earlier and take action to avoid traffic jams., Comment: This paper is submitted to IEEE Transactions on Intelligent Transportation Systems
- Published
- 2020
142. Graph-Based Shape Analysis for Heterogeneous Geometric Datasets: Similarity, Retrieval and Substructure Matching
- Author
-
Jiangce Chen, Horea T. Ilieş, and Caiwen Ding
- Subjects
Power graph analysis ,Similarity (geometry) ,Theoretical computer science ,Matching (graph theory) ,Computer science ,Graph (abstract data type) ,Translation (geometry) ,Computer Graphics and Computer-Aided Design ,Convolutional neural network ,Industrial and Manufacturing Engineering ,Computer Science Applications ,Geometric data analysis ,Shape analysis (digital geometry) - Abstract
Practically all existing shape analysis and processing algorithms have been developed for specific geometric representations of 3D models. However, the product development process always involves a large number of often incompatible geometric representations tailored to specific computational tasks that take place during this process. Consequently, a substantial effort has been expended to develop robust geometric data translation and conversion algorithms, but the existing methods have well known limitations. The Maximal Disjoint Ball Decomposition (MDBD) was recently defined as a unique and stable geometric construction and used to define universal shape descriptors based on the contact graph associated with MDBD. In this paper, we demonstrate that by applying graph analysis tools to MDBD in conjunction with graph convolutional neural networks and graph kernels, one can effectively develop methods to perform similarity, retrieval and substructure matching from geometric models regardless of their native geometric representation. We show that our representation-agnostic approach achieves comparable performance with state-of-the-art geometric processing methods on standard yet heterogeneous benchmark datasets while supporting all valid geometric representations.
- Published
- 2022
143. Dynamic Reconfiguration of Thermoelectric Generators for Vehicle Radiators Energy Harvesting Under Location-Dependent Temperature Variations
- Author
-
Donkyu Baek, Sheng Lin, Jaemin Kim, Xue Lin, Naehyuck Chang, Sang Hyun Park, Donghwa Shin, Yanzhi Wang, Young Hoo Cho, and Caiwen Ding
- Subjects
Maximum power principle ,Computer science ,020209 energy ,Heat energy ,Control reconfiguration ,02 engineering and technology ,Thermal management of electronic devices and systems ,Automotive engineering ,Coolant ,Electric energy ,Thermoelectric generator ,Internal combustion engine ,Hardware and Architecture ,Waste heat ,Heat exchanger ,0202 electrical engineering, electronic engineering, information engineering ,Fuel efficiency ,Radiator (engine cooling) ,Electrical and Electronic Engineering ,Energy harvesting ,Software ,Power density ,Voltage ,Heat engine - Abstract
Conventional internal combustion engine vehicles generally achieve less than 30% fuel efficiency, and most of the wasted energy is dissipated as heat. This excessive heat dissipation is a primary cause of poor fuel efficiency, yet reclamation of the heat energy has not been a main focus of vehicle design. Thanks to thermoelectric generators (TEGs), wasted heat energy can be directly converted to electric energy. All heat exchangers, including vehicle radiators, gradually cool the coolant or gas from the inlet to the outlet. TEG modules are commonly mounted throughout the heat exchanger to meet the required power density and voltage. Each TEG module therefore has a different hot-side temperature depending on its mounting location (distance from the inlet), and thus a different maximum power point (MPP) voltage and current. Nevertheless, TEG modules are commonly connected in series and parallel, with both ends connected to a single power converter. As a result, the whole TEG module array exhibits significant efficiency degradation even if the power converter has MPP tracking capability. Although material and device researchers have put a lot of effort into enhancing TEG efficiency, this system-level issue has not been deeply investigated. This paper proposes a cross-layer, system-level solution that enhances TEG array efficiency by introducing online reconfiguration of the TEG modules. The proposed method is useful for any sort of TEG array used to reclaim wasted heat energy, because heat exchangers generally have different inlet and outlet temperatures. This paper also presents a complete design and implementation showcase of a reconfigurable TEG module building block. Experimental results show up to a 34% enhancement using the proposed method compared with a fixed array structure, which is the common practice.
- Published
- 2018
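The efficiency-degradation argument in the abstract above can be illustrated with a back-of-the-envelope calculation. The sketch below uses a hypothetical four-module radiator string with a simple linear TEG model (open-circuit voltage proportional to the temperature difference, fixed internal resistance); the Seebeck coefficient, resistances, and temperatures are made-up numbers, not measurements from the paper.

import numpy as np

seebeck, r_int = 0.05, 2.0                        # V/K and ohms per module (assumed)
delta_t = np.array([60.0, 45.0, 30.0, 15.0])      # hot-cold temperature gaps, inlet -> outlet
v_oc = seebeck * delta_t                          # open-circuit voltage of each module

# Ideal: every module tracked at its own MPP current I_i = V_oc,i / (2 R).
p_per_module_mpp = np.sum(v_oc ** 2 / (4 * r_int))

# Fixed series string: one shared current I, P(I) = I * sum(V_oc) - I^2 * N * R,
# maximized at I = sum(V_oc) / (2 N R).
n = len(v_oc)
i_string = np.sum(v_oc) / (2 * n * r_int)
p_series_mpp = i_string * np.sum(v_oc) - i_string ** 2 * n * r_int

print(f"per-module MPPT: {p_per_module_mpp:.2f} W, fixed series string: {p_series_mpp:.2f} W")

With these illustrative numbers, the fixed series string recovers only about 83% of the power available when every module is tracked at its own maximum power point, which is the gap that online reconfiguration aims to close.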
144. Multisource Indoor Energy Harvesting for Nonvolatile Processors
- Author
-
Soroush Heidari, Ji Li, Yongpan Liu, Yanzhi Wang, Caiwen Ding, Ning Liu, and Jingtong Hu
- Subjects
Engineering ,Hardware_MEMORYSTRUCTURES ,business.industry ,020208 electrical & electronic engineering ,Photovoltaic system ,02 engineering and technology ,020202 computer hardware & architecture ,Power (physics) ,Non-volatile memory ,Memory management ,Electricity generation ,Hardware and Architecture ,Embedded system ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,business ,Energy harvesting ,Software - Abstract
Editor’s note: One promising application of emerging memories is to implement a nonvolatile memory hierarchy that can retain data when power is removed. In this work, the authors present design techniques for nonvolatile processors with a multisource energy-harvesting system that combines thermal, kinetic, and indoor photovoltaic sources to provide a stable power supply. —Yiran Chen, Duke University
- Published
- 2017
145. REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs
- Author
-
Ning Liu, Yanzhi Wang, Caiwen Ding, Shuo Wang, Yun Liang, and Kaidi Xu
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Optimization problem ,Exploit ,business.industry ,Computer science ,Quantization (signal processing) ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Memory bandwidth ,Object detection ,Machine Learning (cs.LG) ,Computer engineering ,Hardware Architecture (cs.AR) ,business ,Field-programmable gate array ,Computer Science - Hardware Architecture ,Implementation ,Digital signal processing - Abstract
Deep neural networks (DNNs), as the basis of object detection, will play a key role in the development of future autonomous systems with full autonomy. Such autonomous systems have special requirements for real-time, energy-efficient implementations of DNNs on power-budgeted systems. Two research thrusts are dedicated to performance and energy efficiency enhancement of the inference phase of DNNs: the first is model compression techniques, and the second is efficient hardware implementations. Recent research on extremely-low-bit CNNs such as binary neural networks (BNN) and XNOR-Net replaces traditional floating-point operations with binary bit operations, significantly reducing memory bandwidth and storage requirements, but suffering non-negligible accuracy loss and wasting digital signal processing (DSP) blocks on FPGAs. To overcome these limitations, this paper proposes REQ-YOLO, a resource-aware, systematic weight quantization framework for object detection that considers both algorithm and hardware resource aspects. We adopt the block-circulant matrix method and propose a heterogeneous weight quantization using the Alternating Direction Method of Multipliers (ADMM), an effective optimization technique for general, non-convex optimization problems. To achieve real-time, highly efficient implementations on FPGAs, we present a detailed hardware implementation of block-circulant matrices on CONV layers and develop an efficient processing element (PE) structure supporting the heterogeneous weight quantization, CONV dataflow and pipelining techniques, design optimization, and a template-based automatic synthesis framework to optimally exploit hardware resources. Experimental results show that our proposed REQ-YOLO framework can significantly compress the YOLO model while introducing very small accuracy degradation. The related codes are here: https://github.com/Anonymous788/heterogeneous_ADMM_YOLO.
- Published
- 2019
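A key ingredient in the abstract above is the block-circulant matrix method, which stores only one vector per b x b weight block and turns the block's matrix-vector product into an FFT-domain elementwise multiply (O(b log b) instead of O(b^2)). Below is a minimal NumPy sketch of that identity with made-up block contents; it is not the paper's FPGA implementation.

import numpy as np

def circulant_matvec(first_col, x):
    # Multiplying by a circulant block equals circular convolution with its
    # defining vector, which the FFT computes in O(b log b).
    return np.real(np.fft.ifft(np.fft.fft(first_col) * np.fft.fft(x)))

b = 4
first_col = np.array([1.0, 2.0, 0.5, -1.0])    # the only stored parameters of the block
x = np.array([0.3, -0.2, 1.0, 0.7])

# Reference: materialize the dense b x b circulant block (column j is first_col
# rolled down by j) and check both paths agree.
dense = np.stack([np.roll(first_col, j) for j in range(b)], axis=1)
print(np.allclose(dense @ x, circulant_matvec(first_col, x)))    # True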
146. Tiny but Accurate: A Pruned, Quantized and Optimized Memristor Crossbar Framework for Ultra Efficient DNN Implementation
- Author
-
Xiaolong Ma, Fuxun Yu, Xiang Chen, Yanzhi Wang, Tao Liu, Geng Yuan, Sheng Lin, Caiwen Ding, and Wujie Wen
- Subjects
Signal Processing (eess.SP) ,FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer science ,Data path ,Computer Science - Emerging Technologies ,02 engineering and technology ,Memristor ,010501 environmental sciences ,Crossbar array ,computer.software_genre ,01 natural sciences ,law.invention ,Machine Learning (cs.LG) ,law ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,FOS: Electrical engineering, electronic engineering, information engineering ,Neural and Evolutionary Computing (cs.NE) ,Electrical Engineering and Systems Science - Signal Processing ,Computer Science - Hardware Architecture ,0105 earth and related environmental sciences ,Numerical linear algebra ,Hardware_MEMORYSTRUCTURES ,Quantization (signal processing) ,Computer Science - Neural and Evolutionary Computing ,Data compression ratio ,020202 computer hardware & architecture ,Emerging Technologies (cs.ET) ,Crossbar switch ,Algorithm ,computer - Abstract
State-of-the-art DNN structures involve intensive computation and high memory storage. To mitigate these challenges, the memristor crossbar array has emerged as an intrinsically suitable matrix-computation and low-power acceleration framework for DNN applications. However, a high-accuracy solution for extreme model compression on the memristor crossbar array architecture remains an open problem. In this paper, we propose a memristor-based DNN framework which combines structured weight pruning and quantization by incorporating the alternating direction method of multipliers (ADMM) algorithm for better pruning and quantization performance. We also discover the non-optimality of the ADMM solution in weight pruning and the unused data paths in a structured pruned model. Motivated by these discoveries, we design a software-hardware co-optimization framework containing the first proposed Network Purification and Unused Path Removal algorithms, which post-process a structured pruned model after the ADMM steps. By taking memristor hardware constraints into our whole framework, we achieve an extremely high compression ratio on state-of-the-art neural network structures with minimal accuracy loss. For quantizing the structured pruned model, our framework achieves nearly no accuracy loss after quantizing weights to an 8-bit memristor weight representation. We share our models at anonymous link https://bit.ly/2VnMUy0.
- Published
- 2019
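One concrete piece of the pruning-plus-quantization flow described above is the quantization projection used inside ADMM: the auxiliary copy of the weights is snapped to the nearest level a memristor cell can represent. The sketch below shows only that projection step; the level count, weight range, and the 3-bit demo are assumptions, and the surrounding ADMM loop, Network Purification, and Unused Path Removal steps are not reproduced.

import numpy as np

def project_to_levels(w, num_levels=2**8, w_min=-1.0, w_max=1.0):
    """Map every weight to the nearest of num_levels evenly spaced values,
    i.e. the Euclidean projection used for the ADMM auxiliary variable."""
    levels = np.linspace(w_min, w_max, num_levels)
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=(4, 4))
# Inside ADMM one would project (W + U) rather than W itself; shown on W for brevity.
print(np.unique(project_to_levels(w, num_levels=2**3)))   # at most 8 distinct values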
147. An Ultra-Efficient Memristor-Based DNN Framework with Structured Weight Pruning and Quantization Using ADMM
- Author
-
Tianyun Zhang, Yanzhi Wang, Zeinab S. Jalali, Caiwen Ding, Yilong Zhao, Li Jiang, Sucheta Soundarajan, Sheng Lin, Xiaolong Ma, and Geng Yuan
- Subjects
FOS: Computer and information sciences ,Weight value ,Computer Science - Machine Learning ,Memory hierarchy ,Computer science ,Quantization (signal processing) ,Computation ,020208 electrical & electronic engineering ,Computer Science - Emerging Technologies ,Computer Science - Neural and Evolutionary Computing ,02 engineering and technology ,Memristor ,Machine Learning (cs.LG) ,020202 computer hardware & architecture ,law.invention ,Emerging Technologies (cs.ET) ,Neuromorphic engineering ,Computer engineering ,law ,Hardware Architecture (cs.AR) ,Compression ratio ,0202 electrical engineering, electronic engineering, information engineering ,Neural and Evolutionary Computing (cs.NE) ,Crossbar switch ,Computer Science - Hardware Architecture - Abstract
The high computation and memory storage requirements of large deep neural network (DNN) models pose intensive challenges to the conventional Von-Neumann architecture, incurring substantial data movement in the memory hierarchy. The memristor crossbar array has emerged as a promising solution to mitigate these challenges and enable low-power acceleration of DNNs. Memristor-based weight pruning and weight quantization have been separately investigated and proven effective in reducing area and power consumption compared to the original DNN model. However, there has been no systematic investigation of memristor-based neuromorphic computing (NC) systems considering both weight pruning and weight quantization. In this paper, we propose a unified and systematic memristor-based framework considering both structured weight pruning and weight quantization by incorporating the alternating direction method of multipliers (ADMM) into DNN training. We consider hardware constraints such as crossbar block pruning, conductance range, and mismatch between weight values and real devices, to achieve high accuracy with low power and a small area footprint. Our framework consists of three main steps, i.e., memristor-based ADMM-regularized optimization, masked mapping, and retraining. Experimental results show that our proposed framework achieves a 29.81× (20.88×) weight compression ratio, with 98.38% (96.96%) power reduction and 98.29% (97.47%) area reduction on the VGG-16 (ResNet-18) network, with only 0.5% (0.76%) accuracy loss compared to the original DNN models. We share our models at anonymous link http://bit.ly/2Jp5LHJ.
- Published
- 2019
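The crossbar-block pruning constraint mentioned in the abstract above also has a simple projection form: partition the weight matrix into crossbar-sized tiles, keep the tiles with the largest Frobenius norm, and zero the rest. The sketch below shows that projection in isolation, with an assumed 4x4 tile size and 25% keep ratio; the ADMM training loop, masked mapping, and retraining steps around it are not shown.

import numpy as np

def prune_crossbar_blocks(w, block=4, keep_ratio=0.25):
    """Keep the keep_ratio fraction of block x block tiles with the largest
    Frobenius norm and zero out everything else (a structured-sparsity projection)."""
    rows, cols = w.shape
    out = np.zeros_like(w)
    tiles = [(i, j) for i in range(0, rows, block) for j in range(0, cols, block)]
    norms = [np.linalg.norm(w[i:i + block, j:j + block]) for i, j in tiles]
    keep = max(1, int(round(keep_ratio * len(tiles))))
    for idx in np.argsort(norms)[-keep:]:
        i, j = tiles[idx]
        out[i:i + block, j:j + block] = w[i:i + block, j:j + block]
    return out

rng = np.random.default_rng(1)
w = rng.standard_normal((16, 16))
w_pruned = prune_crossbar_blocks(w)
print(f"kept {np.count_nonzero(w_pruned)} of {w.size} weights")   # 64 of 256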
148. E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs
- Author
-
Qinru Qiu, Yanzhi Wang, Zhe Li, Caiwen Ding, Wenyao Xu, Wujie Wen, Youwei Zhuo, Xuehai Qian, Siyue Wang, Chang Liu, and Xue Lin
- Subjects
Signal Processing (eess.SP) ,FOS: Computer and information sciences ,010302 applied physics ,Computer Science - Machine Learning ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Activation function ,Computer Science - Computer Vision and Pattern Recognition ,02 engineering and technology ,01 natural sciences ,Machine Learning (cs.LG) ,020202 computer hardware & architecture ,Recurrent neural network ,Computer engineering ,0103 physical sciences ,Compression ratio ,FOS: Electrical engineering, electronic engineering, information engineering ,0202 electrical engineering, electronic engineering, information engineering ,Sensitivity (control systems) ,Electrical Engineering and Systems Science - Signal Processing ,Field-programmable gate array ,Quantization (image processing) ,Block size ,Efficient energy use - Abstract
Recurrent Neural Networks (RNNs) are becoming increasingly important for time-series applications that require efficient, real-time implementations. The two major types are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Achieving real-time, efficient, and accurate hardware RNN implementations is challenging because of the high sensitivity to imprecision accumulation and the need for special activation function implementations. A key limitation of prior works is the lack of a systematic design optimization framework covering both the RNN model and the hardware implementation, especially when the block size (or compression ratio) should be jointly optimized with RNN type, layer size, etc. In this paper, we adopt the block-circulant matrix-based framework and present the Efficient RNN (E-RNN) framework for FPGA implementations of the Automatic Speech Recognition (ASR) application. The overall goal is to improve performance/energy efficiency under an accuracy requirement. We use the alternating direction method of multipliers (ADMM) technique for more accurate block-circulant training, and present two design explorations providing guidance on block size and on reducing RNN training trials. Based on these two explorations, we decompose E-RNN into two phases: Phase I determines the RNN model to reduce computation and storage subject to the accuracy requirement, and Phase II covers the hardware implementation given the RNN model, including processing element design/optimization, quantization, activation implementation, etc. Experimental results on actual FPGA deployments show that E-RNN achieves a maximum energy efficiency improvement of 37.4$\times$ compared with ESE, and more than 2$\times$ compared with C-LSTM, under the same accuracy., Comment: In The 25th International Symposium on High-Performance Computer Architecture (HPCA 2019)
- Published
- 2019
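Besides block-circulant weights, the E-RNN abstract above points to activation implementation as a hardware bottleneck. A common hardware-friendly workaround, shown here only as a generic illustration rather than the paper's design, is to replace the sigmoid with a clamped piecewise-linear ("hard sigmoid") segment that needs no exponentials.

import numpy as np

def hard_sigmoid(x):
    # Single linear segment 0.25*x + 0.5, clamped to [0, 1]: shifts and adds only.
    return np.clip(0.25 * np.asarray(x, dtype=float) + 0.5, 0.0, 1.0)

xs = np.linspace(-6.0, 6.0, 7)
print(np.round(hard_sigmoid(xs), 3))            # piecewise-linear approximation
print(np.round(1.0 / (1.0 + np.exp(-xs)), 3))   # reference sigmoid for comparison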
149. Deep Compressed Pneumonia Detection for Low-Power Embedded Devices
- Author
-
Ning Liu, Sheng Lin, Caiwen Ding, Hongjia Li, and Yanzhi Wang
- Subjects
Resource (project management) ,Computer science ,business.industry ,Filter (video) ,Computation ,Embedded system ,Compression ratio ,Pruning (decision trees) ,business ,Column (database) ,Power usage ,Power (physics) - Abstract
Deep neural networks (DNNs) have expanded into medical fields and triggered a revolution in several medical applications by extracting complex features and achieving high accuracy and performance. However, the large-scale networks involved impose high requirements on both memory storage and computation resources, especially for portable medical devices and other embedded systems. In this work, we first train a DNN for pneumonia detection using the dataset provided by the RSNA Pneumonia Detection Challenge [4]. To overcome the hardware limitations of implementing large-scale networks, we develop a systematic structured weight pruning method with filter sparsity, column sparsity, and combined sparsity. Experiments show that we can achieve up to a 36x compression ratio compared to the original 106-layer model, while incurring no accuracy degradation. We evaluate the proposed methods on an embedded low-power device, Jetson TX2, and achieve low power usage and high energy efficiency.
- Published
- 2019
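Filter sparsity, one of the structured sparsity patterns listed in the abstract above, simply removes whole output filters of a convolution layer. A minimal sketch of magnitude-based filter pruning follows; the tensor shape, 50% ratio, and L2 criterion are illustrative assumptions, not the paper's exact configuration.

import numpy as np

def prune_filters(conv_w, prune_ratio=0.5):
    """Zero out the prune_ratio fraction of output filters with the smallest
    L2 norm; conv_w has shape (out_channels, in_channels, kH, kW)."""
    norms = np.linalg.norm(conv_w.reshape(conv_w.shape[0], -1), axis=1)
    n_prune = int(prune_ratio * conv_w.shape[0])
    pruned = conv_w.copy()
    pruned[np.argsort(norms)[:n_prune]] = 0.0
    return pruned

rng = np.random.default_rng(2)
w = rng.standard_normal((8, 3, 3, 3))
w_pruned = prune_filters(w)
print(int(np.count_nonzero(np.abs(w_pruned).sum(axis=(1, 2, 3)))), "filters remain")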
150. Towards Budget-Driven Hardware Optimization for Deep Convolutional Neural Networks Using Stochastic Computing
- Author
-
Yanzhi Wang, Qinru Qiu, Bo Yuan, Jeffrey Draper, Zhe Li, Ao Ren, Ji Li, and Caiwen Ding
- Subjects
FOS: Computer and information sciences ,010302 applied physics ,Speedup ,Stochastic computing ,business.industry ,Computer science ,Deep learning ,Computer Science - Neural and Evolutionary Computing ,Computer Science - Emerging Technologies ,02 engineering and technology ,External Data Representation ,01 natural sciences ,Convolutional neural network ,020202 computer hardware & architecture ,Emerging Technologies (cs.ET) ,Computer engineering ,0103 physical sciences ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Neural and Evolutionary Computing (cs.NE) ,Artificial intelligence ,business ,Field-programmable gate array ,Mobile device - Abstract
Recently, Deep Convolutional Neural Networks (DCNNs) have achieved tremendous success in many machine learning applications. Nevertheless, the deep structure has brought significant increases in computation complexity. Large-scale deep learning systems mainly operate in high-performance server clusters, thus restricting their extension to personal or mobile devices. Previous works on GPU and/or FPGA acceleration for DCNNs show increasing speedups, but ignore other constraints, such as area, power, and energy. Stochastic Computing (SC), as a unique data representation and processing technique, has the potential to enable the design of fully parallel and scalable hardware implementations of large-scale deep learning systems. This paper proposes an automatic design allocation algorithm driven by budget requirements while considering overall accuracy performance. This systematic method enables the automatic design of a DCNN in which all design parameters are jointly optimized. Experimental results demonstrate that the proposed algorithm can achieve a joint optimization of all design parameters given the comprehensive budget of a DCNN., Comment: Accepted by IEEE Computer Society Annual Symposium on VLSI 2018
- Published
- 2018
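The appeal of stochastic computing mentioned above comes from how cheap its arithmetic is: a value in [0, 1] is encoded as a random bitstream whose mean equals the value, and a single AND gate then multiplies two independent streams. The NumPy sketch below demonstrates that unipolar multiplication; the stream length and seed are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(0)

def to_stream(p, length=4096):
    """Unipolar stochastic encoding: a bitstream whose mean equals p."""
    return (rng.random(length) < p).astype(np.uint8)

a, b = 0.75, 0.40
product_stream = to_stream(a) & to_stream(b)   # one AND gate per bit acts as a multiplier
print(round(float(product_stream.mean()), 3), "vs exact", a * b)

Longer streams reduce the estimation error at the cost of latency, one of the accuracy-versus-budget trade-offs the allocation algorithm has to balance.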