57 results for "Huazhong Yang"
Search Results
2. Hidden-ROM
- Author
- Yiming Chen, Guodong Yin, Mingyen Lee, Wenjun Tang, Zekun Yang, Yongpan Liu, Huazhong Yang, and Xueqing Li
- Published
- 2022
- Full Text
- View/download PDF
3. YOLoC
- Author
- Yiming Chen, Guodong Yin, Zhanhong Tan, Mingyen Lee, Zekun Yang, Yongpan Liu, Huazhong Yang, Kaisheng Ma, and Xueqing Li
- Subjects
FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
- Abstract
Computing-in-memory (CiM) is a promising technique to achieve high energy efficiency in data-intensive matrix-vector multiplication (MVM) by relieving the memory bottleneck. Unfortunately, due to limited SRAM capacity, existing SRAM-based CiM needs to reload the weights from DRAM in large-scale networks, which significantly weakens the energy efficiency. This work, for the first time, proposes the concept, design, and optimization of computing-in-ROM to achieve much higher on-chip memory capacity, and thus less DRAM access and lower energy consumption. Furthermore, to support different computing scenarios with varying weights, a weight fine-tuning technique, namely Residual Branch (ReBranch), is also proposed. ReBranch combines ROM-CiM and assisting SRAM-CiM to achieve high versatility. YOLoC, a ReBranch-assisted ROM-CiM framework for object detection, is presented and evaluated. With the same area in 28nm CMOS, YOLoC shows significant energy efficiency improvement across several datasets: 14.8x for YOLO (Darknet-19) and 4.8x for ResNet-18. (6 pages, 14 figures; to be published in DAC 2022.)
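The ReBranch idea above, a large frozen weight array corrected by a small trainable branch, can be pictured functionally in a few lines. The sketch below is a minimal NumPy illustration of the general concept, not the paper's design; the layer sizes and the low-rank shape of the assisting SRAM branch are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 256, 128, 16   # hypothetical layer sizes

# "ROM" weights: fixed at fabrication time, never updated afterwards.
W_rom = rng.standard_normal((d_in, d_out))

# Small residual branch held in assisting SRAM-CiM; only this part is
# fine-tuned when the deployed weights must change.
A = 0.01 * rng.standard_normal((d_in, rank))
B = 0.01 * rng.standard_normal((rank, d_out))

def rebranch_mvm(x):
    """MVM through the frozen ROM array plus the trainable residual branch."""
    return x @ W_rom + (x @ A) @ B

y = rebranch_mvm(rng.standard_normal(d_in))
```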
- Published
- 2022
- Full Text
- View/download PDF
4. DIMMining
- Author
- Guohao Dai, Zhenhua Zhu, Tianyu Fu, Chiyue Wei, Bangyan Wang, Xiangyu Li, Yuan Xie, Huazhong Yang, and Yu Wang
- Published
- 2022
- Full Text
- View/download PDF
5. Capacitive Content-Addressable Memory
- Author
- Yiming Chen, Huazhong Yang, Guodong Yin, Sumitha George, Nuo Xiu, Xiaoyang Ma, and Xueqing Li
- Subjects
Matching (statistics), Computer science, Reliability (computer networking), Capacitive sensing, Content-addressable memory, Scalability, Pattern matching, Computer hardware, Efficient energy use
- Abstract
Content-addressable memory (CAM) has been a critical component in pattern matching and machine-learning applications. Recently emerged CAMs capable of delivering multi-level distance calculation are promising for applications that need matching results beyond the Boolean outcomes of "matched" and "not matched". However, existing multi-level CAM designs are constrained by bit-cell discharging-current mismatch and the strict timing of sensing operations for distance calculation, which makes it challenging to further improve accuracy and scalability towards higher-resolution and higher-dimension matching. This work presents a multi-level CAM design that delivers high-accuracy and high-scalability search, is immune to discharging-device mismatch, and needs no strict timing for result sensing. The inherent enabler is the charge-domain computing mechanism. This work presents the operating mechanisms, circuit simulation, and content-matching evaluation results, showing the promise of high reliability, high energy efficiency, and high scalability.
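Functionally, a multi-level CAM computes a distance between the query and every stored row at once and reports the best match. The sketch below models only that behavior in NumPy, with L1 distance as an assumed metric; the paper's actual contribution, the charge-domain circuit that performs this search in parallel without strict sensing timing, is not modeled here.

```python
import numpy as np

# Stored entries: each cell holds a small integer level, not just 0/1
# as in a Boolean CAM.
memory = np.array([
    [3, 1, 0, 2],
    [0, 0, 1, 1],
    [2, 2, 3, 0],
])

def cam_search(query):
    """Per-row distances to the query and the index of the nearest row."""
    distances = np.abs(memory - query).sum(axis=1)  # L1 distance per row
    return distances, int(distances.argmin())

distances, best_row = cam_search(np.array([2, 1, 0, 2]))
```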
- Published
- 2021
- Full Text
- View/download PDF
6. 3M-AI: A Multi-task and Multi-core Virtualization Framework for Multi-FPGA AI Systems in the Cloud
- Author
- Hongren Zheng, Jun Liu, Huazhong Yang, Hanbo Sun, Yi Cai, Guohao Dai, Yusong Wu, Shulin Zeng, Fan Zhang, Xinhao Yang, and Yu Wang
- Subjects
Multi-core processor, Heuristic (computer science), Hardware virtualization, Computer science, Cloud computing, Virtualization, Task (computing), Embedded system, Overhead (computing), Data synchronization
- Abstract
With the ever-growing demands for online Artificial Intelligence (AI), hardware virtualization support for deep learning accelerators is vital for providing AI capability in the cloud. Three basic features, multi-task support, dynamic workload, and remote access, are fundamental for hardware virtualization. However, most deep learning accelerators do not support concurrent execution of multiple tasks. Besides, state-of-the-art multi-DNN scheduling algorithms for NN accelerators consider neither multi-task concurrent execution nor resource allocation for multi-core DNN accelerators. Moreover, existing GPU virtualization solutions can introduce a huge remote-access latency overhead, resulting in a severe system performance drop. To tackle these challenges, we propose 3M-AI, a Multi-task and Multi-core virtualization framework for Multi-FPGA AI systems in the cloud. 3M-AI enables model parallelism on multi-FPGA systems by optimizing data synchronization and movement between FPGAs. 3M-AI exploits a heuristic hardware resource allocation algorithm and an accurate multi-core latency prediction model. 3M-AI reduces the remote API access overhead to nearly 1%, and achieves better NN inference latency at batch size 1 than GPU virtualization solutions.
- Published
- 2021
- Full Text
- View/download PDF
7. Block-Circulant Neural Network Accelerator Featuring Fine-Grained Frequency-Domain Quantization and Reconfigurable FFT Modules
- Author
- Yifan He, Huazhong Yang, Yongpan Liu, and Jinshan Yue
- Subjects
Artificial neural network, Computer science, Fast Fourier transform, Parallel computing, Frequency domain, Compression (functional analysis), Quantization (image processing), Circulant matrix, Efficient energy use, Block (data storage)
- Abstract
Block-circulant based compression is a popular technique to accelerate neural network inference. Though storage and computing costs can be reduced by transforming weights into block-circulant matrices, this method incurs uneven data distribution in the frequency domain and imbalanced workload. In this paper, we propose RAB: a Reconfigurable Architecture Block-Circulant Neural Network Accelerator to solve the problems via two techniques. First, a fine-grained frequency-domain quantization is proposed to accelerate MAC operations. Second, a reconfigurable architecture is designed to transform FFT/IFFT modules into MAC modules, which alleviates the imbalanced workload and further improves efficiency. Experimental results show that RAB can achieve 1.9x/1.8x area/energy efficiency improvement compared with the state-of-the-art block-circulant compression based accelerator.
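The storage and compute savings of block-circulant compression come from a textbook identity: a k x k circulant matrix is fully described by one length-k vector, and its matrix-vector product reduces to an elementwise multiply in the FFT domain. The NumPy sketch below verifies that identity for a single block; it illustrates the math the accelerator builds on, not the RAB architecture itself.

```python
import numpy as np

def circulant(c):
    """Dense circulant matrix whose first column is c (reference only)."""
    k = len(c)
    return np.array([[c[(i - j) % k] for j in range(k)] for i in range(k)])

rng = np.random.default_rng(1)
k = 4                                # circulant block size
c = rng.standard_normal(k)           # one k x k block stored as one vector
x = rng.standard_normal(k)

# O(k log k) FFT path instead of the O(k^2) dense product.
y_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real
assert np.allclose(y_fft, circulant(c) @ x)
```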
- Published
- 2021
- Full Text
- View/download PDF
8. A Non-Volatile Computing-In-Memory Framework With Margin Enhancement Based CSA and Offset Reduction Based ADC
- Author
- Yifan He, Huazhong Yang, Yongpan Liu, Yuxuan Huang, and Jinshan Yue
- Subjects
Offset (computer science), Artificial neural network, Sense amplifier, Computer science, Reduction (complexity), Margin (machine learning), Electronic engineering, Electronic design automation, Network model, Efficient energy use
- Abstract
Deep neural networks (DNNs) play an important role in machine learning, and non-volatile computing-in-memory (nvCIM) has become a new architecture to optimize DNN hardware performance and energy efficiency. However, existing nvCIM accelerators focus on system-level performance but ignore analog factors. In this paper, sense margin and offset are considered in the proposed nvCIM framework. A margin enhancement based current-mode sense amplifier (MECSA) and an offset reduction based analog-to-digital converter (ORADC) are proposed to improve the accuracy of the ADC. Based on these methods, the nvCIM framework is presented; experimental results show that it improves area, power, and latency while maintaining high network-model accuracy, with 2.3-20.4x higher energy efficiency than existing RRAM-based nvCIM accelerators.
- Published
- 2021
- Full Text
- View/download PDF
9. Reliability-Aware Training and Performance Modeling for Processing-In-Memory Systems
- Author
- Hanbo Sun, Kaizhong Qiu, Yi Cai, Shulin Zeng, Zhenhua Zhu, Yu Wang, and Huazhong Yang
- Subjects
Computer science, Interface (computing), Reliability (computer networking), Memristor, Converters, Convolutional neural network, Data flow diagram, Computer engineering, Quantization (image processing), Efficient energy use
- Abstract
Memristor based Processing-In-Memory (PIM) systems give alternative solutions to boost the computing energy efficiency of Convolutional Neural Network (CNN) based algorithms. However, the high interface cost of Analog-to-Digital Converters (ADCs) and the limited size of memristor crossbars make it challenging to map CNN models onto PIM systems with both high accuracy and high energy efficiency. Besides, it takes a long time to simulate the performance of large-scale PIM systems, resulting in unacceptable development time. To address these problems, we propose a reliability-aware training framework and a behavior-level modeling tool (MNSIM 2.0) for PIM accelerators. The proposed reliability-aware training framework, containing network splitting/merging analysis and a PIM-based non-uniform activation quantization scheme, can improve energy efficiency by reducing the ADC resolution requirements in memristor crossbars. Moreover, MNSIM 2.0 provides a general modeling method for PIM architecture design and computation data flow; it can evaluate both accuracy and hardware performance within a short time. Experiments based on MNSIM 2.0 show that the reliability-aware training framework improves the energy efficiency of PIM accelerators by 3.4x with little accuracy loss. The equivalent energy efficiency is 9.02 TOPS/W, roughly 2.6-4.2x higher than existing work. We also evaluate further case studies of MNSIM 2.0, which help balance the trade-off between accuracy and hardware performance.
- Published
- 2021
- Full Text
- View/download PDF
10. Puncturing the memory wall
- Author
- Qin Li, Zijie Yu, Huazhong Yang, Fei Qiao, Changlu Liu, Yanzhi Wang, and Peiyan Dong
- Subjects
Computer science, Word error rate, Data compression ratio, Puncturing, Memory management, Recurrent neural network, Computer engineering, Pruning (decision trees), Quantization (image processing), Block (data storage)
- Abstract
The automatic speech recognition (ASR) system is becoming increasingly irreplaceable in smart speech interaction applications. Nonetheless, these applications confront the memory wall when embedded in energy- and memory-constrained Internet of Things devices. It is therefore extremely challenging but imperative to design a memory-saving and energy-saving ASR system. This paper proposes a jointly optimized scheme of network compression with approximate memory for an economical ASR system. At the algorithm level, this work presents block-based pruning and quantization with error model (BPQE), an optimized compression framework including a novel pruning technique coordinated with low-precision quantization and the approximate memory scheme. The BPQE-compressed recurrent neural network (RNN) model comes with an ultra-high compression rate and a fine-grained structured pattern that reduce the amount of memory access immensely. At the hardware level, this work presents an ASR-adapted incremental retraining method to obtain further power savings. This retraining method stimulates the utility of the approximate memory scheme while maintaining considerable accuracy. Experimental results show that the proposed jointly optimized scheme achieves 58.6% power saving and 40x memory saving with a phone error rate of 20%.
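As a rough illustration of the algorithm-level idea, the sketch below combines block-structured pruning with uniform low-precision quantization in NumPy. It is a generic stand-in, not the BPQE algorithm from the paper: the block size, keep ratio, and quantizer are assumptions, and the error model coordinating compression with approximate memory is omitted.

```python
import numpy as np

def block_prune(W, block=(4, 4), keep_ratio=0.25):
    """Zero entire blocks with the smallest L2 norms (structured sparsity)."""
    rows, cols = W.shape
    br, bc = block
    blocks = W.reshape(rows // br, br, cols // bc, bc)
    norms = np.sqrt((blocks ** 2).sum(axis=(1, 3)))   # one norm per block
    cut = np.quantile(norms, 1.0 - keep_ratio)
    mask = (norms >= cut)[:, None, :, None]           # broadcast back to blocks
    return (blocks * mask).reshape(rows, cols)

def quantize(W, bits=4):
    """Uniform symmetric quantization to the given bit-width."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

rng = np.random.default_rng(2)
W = rng.standard_normal((16, 16))
W_compressed = quantize(block_prune(W), bits=4)
```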
- Published
- 2021
- Full Text
- View/download PDF
11. MNSIM 2.0: A Behavior-Level Modeling Tool for Memristor-based Neuromorphic Computing Systems
- Author
- Yuan Xie, X. Sharon Hu, Kaizhong Qiu, Xiaoming Chen, Gokul Krishnan, Niu Dimin, Zhenhua Zhu, Lixue Xia, Yu Wang, Yu Cao, Guohao Dai, Huazhong Yang, and Hanbo Sun
- Subjects
Structure (mathematical logic), Artificial neural network, Computer science, Design space exploration, Inference, Memristor, Neuromorphic engineering, Computer architecture, Architecture, Efficient energy use
- Abstract
Memristor based neuromorphic computing systems give alternative solutions to boost the computing energy efficiency of Neural Network (NN) algorithms. Because of the large-scale applications and the large architecture design space, many factors affect the computing accuracy and system performance. In this work, we propose MNSIM 2.0, a behavior-level modeling tool for memristor-based neuromorphic computing systems, to model performance and help researchers realize early-stage design space exploration. Compared with the former version and other benchmarks, MNSIM 2.0 has the following new features: (1) at the algorithm level, it supports inference accuracy simulation for mixed-precision NNs considering non-ideal factors; (2) at the architecture level, it proposes a hierarchical modeling structure for PIM systems, letting users customize designs from the aspects of devices, interfaces, processing units, buffer designs, and interconnections; (3) two hardware-aware algorithm optimization methods are integrated to realize software-hardware co-optimization.
- Published
- 2020
- Full Text
- View/download PDF
12. Multi-channel precision-sparsity-adapted inter-frame differential data codec for video neural network processor
- Author
- Zhuqing Yuan, Fanyang Cheng, Zhe Yuan, Yongpan Liu, Fang Su, Yixiong Yang, and Huazhong Yang
- Subjects
Artificial neural network, Computer science, Inter frame, Data compression ratio, ENCODE, Encoding (memory), Codec, Computer hardware, Communication channel, Coding (social sciences)
- Abstract
Activation I/O traffic is a critical bottleneck of video neural network processors. Recent works adopted an inter-frame difference method to reduce activation size. However, current methods cannot fully adapt to the varying precision and sparsity of differential data. In this paper, we propose a multi-channel precision-sparsity-adapted codec, which separates the differential activations and encodes them in multiple channels. We analyze the best-adapted encoding for each channel and select the optimal channel number with the best performance. A two-channel codec has been implemented in an ASIC accelerator, which can encode/decode activations in parallel. Experimental results show that our coding achieves a 2.2x-18.2x compression rate in three scenarios with no accuracy loss, and the hardware has 42x/174x improvements in speed and energy efficiency compared with a software codec.
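The starting point of the codec, differencing consecutive activation frames and encoding only the sparse residual, can be sketched as below. This toy NumPy version uses a single coordinate-list channel; the paper's contribution of splitting the differential data into multiple channels with per-channel encodings is not reproduced here.

```python
import numpy as np

def frame_delta(prev, curr, threshold=0):
    """Differential activations between consecutive frames; small deltas
    are clamped to zero, leaving a sparse tensor that is cheap to encode."""
    delta = curr - prev
    delta[np.abs(delta) <= threshold] = 0
    return delta

def encode_sparse(delta):
    """Toy coordinate encoding of the nonzero deltas (one 'channel')."""
    idx = np.flatnonzero(delta)
    return idx, delta.flat[idx]

rng = np.random.default_rng(3)
prev = rng.integers(0, 8, size=64).astype(np.int16)
curr = prev.copy()
curr[::9] += 1                      # only a few activations change per frame
idx, vals = encode_sparse(frame_delta(prev, curr))
ratio = curr.nbytes / (idx.nbytes + vals.nbytes)   # crude compression estimate
```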
- Published
- 2020
- Full Text
- View/download PDF
13. NS-KWS
- Author
- Qin Li, Yanzhi Wang, Sheng Lin, Fei Qiao, Yidong Liu, Huazhong Yang, and Changlu Liu
- Subjects
Speedup, Microphone, Computer science, Process (computing), Energy consumption, Bottleneck, Data conversion, CMOS, Keyword spotting, Computer hardware
- Abstract
Keyword spotting (KWS) is a crucial front-end module in the whole speech interaction system. The always-on KWS module detects input words, then activates the energy-consuming complex backend system when keywords are detected. The performance of the KWS determines the standby performance of the whole system, and the conventional KWS module encounters a power-consumption bottleneck in the data conversion near the microphone sensor. In this paper, we propose an energy-efficient near-sensor processing architecture for always-on KWS, which enhances continuous perception of the whole speech interaction system. By implementing keyword detection in the analog domain after the microphone sensor, this architecture avoids energy-consuming data converters and achieves faster speed than conventional realizations. In addition, we propose a lightweight gated recurrent unit (GRU) with negligible accuracy loss to ensure recognition performance. We also implement and fabricate the proposed KWS system in a 0.18μm CMOS process. In the system-level evaluation, the hardware-software co-design architecture achieves 65.6% energy saving and a 71x speedup over the state of the art.
- Published
- 2020
- Full Text
- View/download PDF
14. FeFET-based low-power bitwise logic-in-memory with direct write-back and data-adaptive dynamic sensing interface
- Author
- Xueqing Li, Vijaykrishnan Narayanan, Yu Wang, Mingyuan Ma, Mingyen Lee, Deliang Fan, Juejian Wu, Yongpan Liu, Tang Wenjun, Bowen Xue, and Huazhong Yang
- Subjects
Adder, Computer science, Interface (computing), NAND gate, Encryption, Power (physics), Non-volatile memory, Bitwise operation, Computer hardware, Voltage
- Abstract
Compute-in-memory (CiM) is a promising method for mitigating the memory wall problem in data-intensive applications. The proposed bitwise logic-in-memory (BLiM) is targeted at data-intensive applications such as databases and data encryption. This work proposes a low-power BLiM approach using emerging nonvolatile ferroelectric FETs with direct write-back and a data-adaptive dynamic sensing interface. Apart from general-purpose random-access memory, it also supports BLiM operations such as copy, NOT, NAND, XOR, and full adder (FA). The novel features of the proposed architecture include: (i) direct result write-back based on the remnant bitline charge, which avoids bitline sensing and charging operations; (ii) a fully dynamic sensing interface that needs no static reference current but adopts data-adaptive voltage references for certain multi-operand operations; and (iii) selective bitline charging from the wordline (instead of pre-charging all bitlines) to save power and also enable direct write-back. Detailed BLiM operations and benchmarking against conventional approaches show the promise of low-power computing with the FeFET-based circuit techniques.
- Published
- 2020
- Full Text
- View/download PDF
15. Enable Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud
- Author
- Guangjun Ge, Kaiyuan Guo, Hanbo Sun, Yu Wang, Shulin Zeng, Guohao Dai, Kai Zhong, and Huazhong Yang
- Subjects
Computer science, Deep learning, Frame (networking), Control reconfiguration, Cloud computing, Virtualization, Instruction set, Computer architecture, Artificial intelligence, Field-programmable gate array, Throughput (business)
- Abstract
FPGAs have shown great potential in providing low-latency and energy-efficient solutions for deep learning applications, especially deep neural networks (DNNs). Currently, the majority of FPGA-based DNN accelerators are designed for single-task and static-workload applications, making it difficult to adapt to multi-task and dynamic-workload applications in the cloud. To meet these requirements, DNN accelerators need to support multi-task concurrent execution and low-overhead runtime resource reconfiguration. However, neither instruction set architecture (ISA) based nor template-based FPGA accelerators can support both functions at the same time. In this paper, we introduce a novel FPGA virtualization framework for ISA-based DNN accelerators in the cloud. To support multi-task execution and runtime reconfiguration, we propose a two-level instruction dispatch module and a deep learning hardware resource pooling technique at the hardware level. At the software level, we propose a tiling-based instruction frame package design and two-stage static-dynamic compilation. Furthermore, we propose a history-information-aware scheduling algorithm for the proposed ISA-based deep learning accelerators in the cloud scenario. In our evaluation on a Xilinx VU9P FPGA, the proposed virtualization method achieves 1.88x to 2.20x higher throughput and 1.36x to 1.77x lower latency than the static baseline design.
- Published
- 2020
- Full Text
- View/download PDF
16. INCAME: INterruptible CNN Accelerator for Multi-robot Exploration
- Author
- Jiantao Qiu, Guohao Dai, Zhilin Xu, Yu Wang, Yuanfan Xu, Shulin Zeng, Chaoyang Shen, Jincheng Yu, Chao Yu, and Huazhong Yang
- Subjects
Schedule, Computer science, Embedded system, Deep learning, Robot, Artificial intelligence, Interrupt, Field-programmable gate array, Convolutional neural network, Bottleneck, Scheduling (computing)
- Abstract
Multi-Robot Exploration (MR-Exploration), which provides the location and map, is a basic task for many multi-robot applications. Recent research introduces Convolutional Neural Networks (CNNs) into critical components of MR-Exploration, like Feature-point Extraction (FE) and Place Recognition (PR), to improve system performance. Such CNN-based MR-Exploration requires running multiple CNN models simultaneously, together with complex post-processing algorithms, which greatly challenges the hardware platforms, usually embedded systems. Previous research has shown that FPGAs are good candidates for CNN processing on embedded platforms, but such accelerators usually process different models sequentially, lacking the ability to schedule multiple tasks at runtime. Furthermore, post-processing of CNNs in FE is also computationally intensive and becomes the system bottleneck after the CNN models are accelerated. To handle these problems, we propose INCAME, an INterruptible CNN Accelerator for Multi-robot Exploration, for rapid deployment of robot applications on FPGAs. In INCAME, we propose a virtual-instruction-based interrupt method to support multi-tasking on CNN accelerators. INCAME also includes hardware modules to accelerate the post-processing of the CNN-based components. Experimental results show that INCAME enables multi-task scheduling on the CNN accelerator with negligible performance degradation (0.3%). With multi-task support and post-processing acceleration, INCAME enables embedded FPGAs to execute MR-Exploration in real time (20 fps).
- Published
- 2020
- Full Text
- View/download PDF
17. A 3T/Cell Practical Embedded Nonvolatile Memory Supporting Symmetric Read and Write Access Based on Ferroelectric FETs
- Author
- Xueqing Li, Hongtao Zhong, Yongpan Liu, Kai Ni, Juejian Wu, and Huazhong Yang
- Subjects
Computer science, Transistor, Latency (audio), Electrical engineering, Embedded memory, Ferroelectricity, Non-volatile memory, Energy (signal processing)
- Abstract
Making embedded memory symmetric provides the capability of memory access along both rows and columns, which brings new opportunities for significant energy and time savings when only a portion of the data in a word needs to be accessed. This work investigates the use of ferroelectric field-effect transistors (FeFETs), an emerging nonvolatile, low-power, deeply scalable, CMOS-compatible transistor technology, and proposes a new 3-transistor/cell symmetric nonvolatile memory (SymNVM). With ~1.67x higher density than the prior FeFET design, significant benefits in energy and latency have been achieved, as evaluated and discussed in depth in this paper.
- Published
- 2019
- Full Text
- View/download PDF
18. A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM
- Author
- Lixue Xia, Yujun Lin, Yu Wang, Song Han, Guohao Dai, Hanbo Sun, Zhenhua Zhu, and Huazhong Yang
- Subjects
Boosting (machine learning), Computer science, Quantization (signal processing), Binary number, Chip, Convolutional neural network, Resistive random-access memory, Computer engineering, Efficient energy use
- Abstract
Convolutional Neural Networks (CNNs) play a vital role in machine learning. Emerging resistive random-access memories (RRAMs) and RRAM-based Processing-In-Memory architectures have demonstrated great potential in boosting both the performance and energy efficiency of CNNs. However, restricted by the immature process technology, it is hard to implement and fabricate a CNN accelerator chip based on multi-bit RRAM devices. In addition, existing single-bit RRAM based CNN accelerators only focus on binary or ternary CNNs, which have more than 10% accuracy loss compared with full-precision CNNs. This paper proposes a configurable multi-precision CNN computing framework based on single-bit RRAM, which consists of an RRAM computing overhead aware network quantization algorithm and a configurable multi-precision CNN computing architecture. The proposed method achieves accuracy equivalent to full-precision CNNs with lower storage consumption and latency via multi-precision quantization. The designed architecture supports accelerating multi-precision CNNs even with varying precision among different layers. Experimental results show that the proposed framework can reduce computing area by 70% and computing energy by 75% on average, with nearly no accuracy loss, and the equivalent energy efficiency is 1.6-8.6x that of existing RRAM-based architectures with only 1.07% area overhead.
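A common way to realize multi-bit weights on single-bit devices, which frameworks like this build on, is bit slicing: each binary weight plane is mapped to its own crossbar and the per-plane results are combined by shift-and-add. The NumPy sketch below shows this for unsigned weights; signed weights and the paper's overhead-aware quantization are omitted.

```python
import numpy as np

def bit_slice_mvm(W_int, x, bits):
    """MVM with unsigned integer weights decomposed into binary planes,
    as if each plane were mapped onto a single-bit RRAM crossbar."""
    acc = np.zeros(W_int.shape[0])
    for b in range(bits):
        plane = (W_int >> b) & 1        # one binary crossbar per weight bit
        acc += (plane @ x) * (1 << b)   # shift-and-add of per-plane results
    return acc

rng = np.random.default_rng(4)
bits = 4
W = rng.integers(0, 2 ** bits, size=(8, 16))   # 4-bit unsigned weights
x = rng.standard_normal(16)
assert np.allclose(bit_slice_mvm(W, x, bits), W @ x)
```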
- Published
- 2019
- Full Text
- View/download PDF
19. A 16b Clockless Digital-to-Analog Converter with Ultra-Low-Cost Poly Resistors Supporting Wide-Temperature Range from -40°C to 85°C
- Author
- Xuedi Wang, Xueqing Li, Longqiang Lai, and Huazhong Yang
- Subjects
Data acquisition, CMOS, Computer science, Redundancy (engineering), Electronic engineering, Digital-to-analog converter, Atmospheric temperature range, Thin film, Resistor, Voltage
- Abstract
A high-precision digital-to-analog converter (DAC) is a critical component in process control, data acquisition, and testing instruments. To achieve high resolution over a wide temperature range, conventional designs adopt high-cost thin-film resistors with laser trimming to improve the matching property and thus the DAC resolution. In this work, targeting lower cost, we propose an analog resistor redundancy, full-code, piecewise-linear calibration scheme to enable the use of low-cost poly resistors in a standard CMOS process. To overcome drift over a wide temperature range, a feedback circuit is proposed to guarantee that the resistance of the switch tracks the resistor as temperature changes. Therefore, the DAC can be calibrated at one temperature and tested at an arbitrary temperature from -40°C to 85°C using the same calibration codes. The 16b DAC was implemented in a 0.25μm 5V CMOS process with 5V CMOS devices and poly resistors rather than thin-film resistors. Test results show INL≤0.5LSB at 25°C and INL≤4LSB at both -40°C and 85°C using the same calibration code. It settles in 1μs with glitch energy below 5nV·s, and the current consumption is 1.7mA from a 5V supply.
- Published
- 2019
- Full Text
- View/download PDF
20. Compressed CNN Training with FPGA-based Accelerator
- Author
- Kaiyuan Guo, Xuefei Ning, Jincheng Yu, Wenshuo Li, Yu Wang, Shuang Liang, and Huazhong Yang
- Subjects
Floating point, Computer engineering, Computer science, Training system, Pruning (decision trees), Gradient descent, Field-programmable gate array, Quantization (image processing), Convolutional neural network, Efficient energy use
- Abstract
Training a convolutional neural network (CNN) usually requires a large amount of computation resource, time, and power. Researchers and cloud service providers in this area need fast and efficient training systems. GPUs are currently the best candidates for CNN training, but FPGAs have already shown good performance and energy efficiency as CNN inference accelerators. In this work, we design a compressed training process together with an FPGA-based accelerator for energy-efficient CNN training. We adopt two widely used model compression methods, quantization and pruning, to accelerate the CNN training process. The differences between inference and training bring challenges to applying the two methods in training. First, training requires higher data precision, so we use a gradient accumulation buffer to achieve low operation complexity while keeping gradient-descent precision. Second, a sparse network results in different types of functions in the forward and back-propagation phases, so we design a novel architecture to utilize both inference and back-propagation sparsity. Experimental results show that the proposed training process achieves accuracy similar to traditional training with floating-point data, and the proposed accelerator achieves 641 GOP/s equivalent performance and 2.86x better energy efficiency than a GPU.
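The gradient accumulation buffer mentioned above follows a standard pattern in low-precision training: compute with quantized weights, but accumulate updates in a high-precision copy so small gradient steps are not rounded away. Below is a minimal sketch of that pattern, with a toy gradient standing in for real back-propagation; it illustrates the general technique, not the paper's exact datapath.

```python
import numpy as np

def quantize(w, bits=8):
    """Uniform symmetric quantization to the given bit-width."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(w / scale) * scale

rng = np.random.default_rng(5)
w_master = rng.standard_normal(100) * 0.1   # high-precision accumulation buffer
lr = 1e-3

for step in range(100):
    w_q = quantize(w_master, bits=8)        # low-precision weights for fwd/bwd
    grad = 2.0 * w_q                        # stand-in gradient (L2 loss on w_q)
    # Updates accumulate in the high-precision buffer, so steps smaller
    # than one quantization level are not rounded away.
    w_master -= lr * grad
```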
- Published
- 2019
- Full Text
- View/download PDF
21. A Fine-Grained Sparse Accelerator for Multi-Precision DNN
- Author
- Dongliang Xie, Song Han, Yujun Lin, Yi Shan, Huazhong Yang, Yu Wang, Shulin Zeng, Shuang Liang, and Junlong Kang
- Subjects
Hardware architecture, Recurrent neural network, Software, Artificial neural network, Computer engineering, Computer science, Quantization (signal processing), Datapath, Field-programmable gate array, Convolutional neural network
- Abstract
Neural Networks (NNs) have made significant breakthroughs in many fields, while they also pose a great challenge to hardware platforms, since state-of-the-art neural networks are both communication- and computation-intensive. Researchers have proposed model compression algorithms using sparsification and quantization, along with specific hardware architecture designs, to accelerate various applications. However, the irregular memory access caused by sparsity severely damages the regularity of intensive computation loops. The architecture design for sparse neural networks is therefore crucial for better software and hardware co-design of neural network applications. To face these challenges, this paper first analyzes the computation patterns of different NN structures and unifies them into sparse matrix-vector multiplication, sparse matrix-matrix multiplication, and element-wise multiplication. On the basis of EIE, which supports only fully-connected networks and recurrent neural networks (RNNs), we expand the design to support convolutional neural networks (CNNs) using an input vector transform unit. This paper also designs a multi-precision multiplier with a supporting datapath, which gives the proposed architecture a better acceleration effect for low-bit quantization on the same hardware. The proposed accelerator architecture achieves equivalent performance and energy efficiency up to 574.2 GOPS and 42.8 GOPS/W for CNNs, and 110.4 GOPS and 8.24 GOPS/W for RNNs, under 4-bit quantization on a Xilinx XCKU115 FPGA running at 200MHz. It is the state-of-the-art accelerator supporting CNN-RNN-based models like the long-term recurrent convolutional network, with 571.1 GOPS performance and 42.6 GOPS/W energy efficiency under a 4-bit data format.
- Published
- 2019
- Full Text
- View/download PDF
22. GraphSAR
- Author
- Yu Wang, John Wawrzynek, Tianhao Huang, Huazhong Yang, and Guohao Dai
- Subjects
High energy, Speedup, Computer science, Computation, Energy reduction, Parallel computing, Graph, Long latency, Resistive random-access memory, Memory architecture
- Abstract
Large-scale graph processing has drawn great attention in recent years. The emerging metal-oxide resistive random access memory (ReRAM) and ReRAM crossbars have shown huge potential in accelerating graph processing. However, the sparsity of natural graphs hinders the performance of graph processing on ReRAMs. Previous work on graph processing on ReRAMs stored and computed edges separately, leading to high energy consumption and long data-transfer latency. In this paper, we present GraphSAR, a sparsity-aware processing-in-memory large-scale graph processing accelerator on ReRAMs. Computations over edges are performed in the memory, eliminating the overhead of transferring edges. Moreover, graphs are divided with sparsity in mind: subgraphs with low densities are further divided into smaller ones to minimize wasted memory space. Extensive experimental results show that GraphSAR achieves 4.43x energy reduction and 1.85x speedup (8.19x lower energy-delay product, EDP) against the previous graph processing architecture on ReRAMs (GraphR [1]).
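The density-aware division can be pictured as follows: tile the adjacency matrix, keep dense tiles whole, and split sparse-but-nonempty tiles further so crossbar capacity is not wasted on empty cells. The sketch below is a simplified one-level version of this idea in NumPy; GraphSAR's actual partitioning parameters and recursion differ.

```python
import numpy as np

def partition_by_density(adj, block=4, dense_threshold=0.5):
    """Split an adjacency matrix into blocks; sparse blocks are divided
    again so little crossbar capacity is wasted on empty cells."""
    n = adj.shape[0]
    dense, sparse = [], []
    for i in range(0, n, block):
        for j in range(0, n, block):
            sub = adj[i:i + block, j:j + block]
            if sub.mean() >= dense_threshold:
                dense.append((i, j, block))
            elif sub.any():
                half = block // 2
                for di in (0, half):        # one further split of sparse blocks
                    for dj in (0, half):
                        s = adj[i + di:i + di + half, j + dj:j + dj + half]
                        if s.any():
                            sparse.append((i + di, j + dj, half))
    return dense, sparse

rng = np.random.default_rng(9)
adj = (rng.random((16, 16)) < 0.1).astype(np.int8)   # a sparse toy graph
dense, sparse = partition_by_density(adj)
```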
- Published
- 2019
- Full Text
- View/download PDF
23. AERIS
- Author
- Shuangchen Li, Fang Su, Zhibo Wang, Zhe Yuan, Huazhong Yang, Wenyu Sun, Yongpan Liu, Xueqing Li, and Jinshan Yue
- Subjects
Power gating, XNOR gate, Computer science, Electronic engineering, Neural network system, Chip, Scheduling (computing), Efficient energy use, Electronic circuit, Resistive random-access memory
- Abstract
ReRAM-based processing-in-memory (PIM) architecture is a promising solution for deep neural networks (NNs), due to its high energy efficiency and small footprint. However, traditional PIM architectures have to use a separate crossbar array to store either positive or negative (P/N) weights, which limits both energy efficiency and area efficiency. Even worse, imbalanced running time across layers and idle ADCs/DACs further lower whole-system efficiency. This paper proposes AERIS, an Area/Energy-efficient 1T2R ReRAM based processing-In-memory NN System-on-a-chip, to enhance both energy and area efficiency. We propose an area-efficient 1T2R ReRAM structure to represent both P/N weights in a single array, and a reference current cancelling scheme (RCS) for better accuracy. Moreover, a layer-balance scheduling strategy, as well as power gating for interface circuits such as ADCs/DACs, is adopted for higher energy efficiency. Experimental results show that, compared with state-of-the-art ReRAM-based architectures, AERIS achieves 8.5x/1.3x peak energy/area efficiency improvements in total, owing to layer-balance scheduling, power gating of interface circuits, and the 1T2R ReRAM circuits. Furthermore, we demonstrate that the proposed RCS compensates for the non-ideal factors of ReRAM and improves NN accuracy by 5.2% for an XNOR net on the CIFAR-10 dataset.
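Functionally, representing signed weights in a single array means each weight is the difference of two non-negative conductances whose column currents are subtracted. The NumPy sketch below models only this arithmetic view; the 1T2R cell itself and the reference current cancelling scheme are circuit-level details not captured here.

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((4, 8))          # signed weights

# Differential mapping: a signed weight becomes the difference of two
# non-negative "conductances" sharing one cell pair (1T2R).
G_pos = np.maximum(W, 0.0)
G_neg = np.maximum(-W, 0.0)

x = rng.standard_normal(8)
y = G_pos @ x - G_neg @ x                # current subtraction at the column
assert np.allclose(y, W @ x)
```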
- Published
- 2019
- Full Text
- View/download PDF
24. An N-way group association architecture and sparse data group association load balancing algorithm for sparse CNN accelerators
- Author
- Huazhong Yang, Ruoyang Liu, Zhe Yuan, Jingyu Wang, and Yongpan Liu
- Subjects
Computer science, Load balancing (computing), Collision, Performance limit, Application-specific integrated circuit, Architecture, Algorithm, Sparse matrix, Collision rate, Efficient energy use
- Abstract
In recent years, ASIC CNN accelerators have attracted great attention among researchers for their high performance and energy efficiency. Some prior works utilize the sparsity of CNNs to improve performance and energy efficiency; however, these methods bring tremendous overhead to the output memory, and performance suffers from hash collisions. This paper presents: 1) an N-Way Group Association Architecture that reduces the memory overhead of sparse CNN accelerators; and 2) a Sparse Data Group Association Load Balancing Algorithm, implemented by the Scheduler module in the architecture, that reduces the collision rate and improves performance. Compared with the state-of-the-art accelerator, this work achieves either 1) 1.74x performance with a 50% memory overhead reduction in the 4-way associated design, or 2) 1.91x performance without memory overhead reduction in the 2-way associated design, which is close to the theoretical (collision-free) performance limit.
- Published
- 2019
- Full Text
- View/download PDF
25. Mixed size crossbar based RRAM CNN accelerator with overlapped mapping method
- Author
- Jilan Lin, Ming Cheng, Zhenhua Zhu, Hanbo Sun, Lixue Xia, Xiaoming Chen, Huazhong Yang, and Yu Wang
- Subjects
Boosting (machine learning), Speedup, Computer science, Computation, Parallel computing, Energy consumption, Convolutional neural network, Resistive random-access memory, Idle, Crossbar switch, Efficient energy use
- Abstract
Convolutional Neural Networks (CNNs) play a vital role in machine learning. CNNs are typically both computing- and memory-intensive. Emerging resistive random-access memories (RRAMs) and RRAM crossbars have demonstrated great potential in boosting the performance and energy efficiency of CNNs. Compared with small crossbars, large crossbars show better energy efficiency with less interface overhead. However, conventional workload mapping methods for small crossbars cannot make full use of the computation ability of large crossbars. In this paper, we propose an Overlapped Mapping Method (OMM) and MISCA, a MIxed Size Crossbar based RRAM CNN Accelerator, to solve this problem. MISCA with OMM can reduce the energy consumption caused by interface circuits and improve the parallelism of computation by leveraging the idle RRAM cells in crossbars. Simulation results show that MISCA with OMM achieves a 2.7x speedup, 30% higher utilization rate, and 1.2x better energy efficiency on average compared with a fixed-size-crossbar accelerator using the conventional mapping method. Compared with a GPU platform, MISCA with OMM achieves on average 490.4x higher energy efficiency and a 20x speedup. Compared with PRIME, an existing RRAM-based accelerator, MISCA achieves a 26.4x speedup and 1.65x better energy efficiency.
- Published
- 2018
- Full Text
- View/download PDF
26. An extensible system simulator for intermittently-powered multiple-peripheral IoT devices
- Author
- Huazhong Yang, Yongpan Liu, Tongda Wu, and Lefan Zhang
- Subjects
Computer science, Quality of service, Interface (computing), Extensibility, Power (physics), Software deployment, Systems design, Virtual device, Energy harvesting, Simulation
- Abstract
Energy harvesting is an alternative way to achieve maintenance-free IoT devices. However, the intermittent, low-intensity ambient power supply poses a great challenge to guaranteeing the quality of service (QoS) of these applications, and adequate QoS simulation is required to evaluate a system design before deployment. Unfortunately, existing simulators support neither system-level behavior under power-failure circumstances nor the modeling of peripheral functionality and energy-related parameters. This paper proposes a system-level simulator named AES to evaluate and assist intermittently-powered system (IPS) design. Adopting a flexible energy message handling framework and an easily configured virtual device interface, AES supports both functional and energy-related behavior simulation of all hardware modules under intermittent power scenarios. A hardware prototype validates that the deviation of AES is less than 6.4%, which is adequate for IoT applications. With AES, this paper also explores the impact and design space of the system parameters in an IPS and provides a group of design guidelines that improve performance by 37.2% on average, revealing the potential of AES for IPS design.
- Published
- 2018
- Full Text
- View/download PDF
27. Energy-efficient MFCC extraction architecture in mixed-signal domain for automatic speech recognition
- Author
- Huifeng Zhu, Qi Wei, Qin Li, Xinjun Liu, Fei Qiao, and Huazhong Yang
- Subjects
Speedup, Computer science, Speech recognition, Feature extraction, Frame (networking), Energy consumption, Frequency domain, Mel-frequency cepstrum, Energy (signal processing), Efficient energy use
- Abstract
This paper proposes a novel processing architecture to extract Mel-Frequency Cepstrum Coefficients (MFCC) for automatic speech recognition. Inspired by the human ear, energy-efficient analog-domain information processing is adopted to replace the energy-intensive Fourier Transform of the conventional digital domain. Moreover, the proposed architecture extracts the acoustic features in the mixed-signal domain, which significantly reduces the cost of the Analog-to-Digital Converter (ADC) and the computational complexity. Circuit-level simulation based on 180nm CMOS technology shows an energy consumption of 2.4 nJ/frame and a processing speed of 45.79 μs/frame. The proposed architecture achieves 97.2% energy saving and about a 6.4x speedup over the state of the art. Speech recognition simulation reaches a classification accuracy of 99% using the proposed MFCC features.
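For orientation, the sketch below is a compact reference implementation of the conventional digital MFCC pipeline (FFT, mel filterbank, log, DCT-II) that the paper replaces with analog-domain processing. The parameter choices are typical defaults, not taken from the paper, and the analog signal chain itself cannot be captured in Python.

```python
import numpy as np

def mfcc_frame(frame, sr=16000, n_filters=26, n_coeffs=13):
    """One frame of conventional digital MFCC: FFT -> mel filterbank -> log -> DCT-II."""
    n = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2                 # power spectrum
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor(edges * n / sr).astype(int)             # FFT bin of each filter edge
    fbank = np.zeros((n_filters, len(power)))
    for m in range(1, n_filters + 1):                       # triangular mel filters
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_e = np.log(fbank @ power + 1e-10)                   # log filterbank energies
    k = np.arange(n_coeffs)[:, None]
    dct = np.cos(np.pi * k * (np.arange(n_filters) + 0.5) / n_filters)
    return dct @ log_e                                      # cepstral coefficients

rng = np.random.default_rng(10)
coeffs = mfcc_frame(np.hamming(400) * rng.standard_normal(400))
```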
- Published
- 2018
- Full Text
- View/download PDF
28. TIME
- Author
- Yu Wang, Lixue Xia, Yuan Xie, Ming Cheng, Yi Cai, Zhenhua Zhu, and Huazhong Yang
- Subjects
Random access memory, Artificial neural network, Computer science, Supervised learning, Memristor, Backpropagation, Resistive random-access memory, Application-specific integrated circuit, Memory architecture, Electronic engineering
- Abstract
The training of neural networks (NNs) is usually time-consuming and resource-intensive. Memristors have shown their potential in NN computation. In particular, the crossbar structure and multi-bit capability of metal-oxide resistive random access memory (RRAM) can perform the matrix-vector product, the most common NN operation, with high precision. However, two challenges remain in realizing NN training. First, the current architecture can only support the inference phase of training and cannot perform backpropagation (BP) or the weight updates of the NN. Second, NN training requires enormous iterations and constantly updates the weights to reach convergence, which leads to large energy consumption because of the many write and read operations. In this work, we propose a novel architecture, TIME, and peripheral circuit designs to enable NN training in RRAM. TIME supports BP and weight updates while maximizing the reuse of the peripheral circuits for inference on RRAM. Meanwhile, a variability-free tuning scheme and gradually-write circuits are designed to reduce the cost of tuning RRAM. We explore the performance of both SL (supervised learning) and DRL (deep reinforcement learning) in TIME, and a specific mapping method for DRL is introduced to further improve energy efficiency. Experimental results show that, in SL, TIME achieves 5.3x higher energy efficiency on average compared with the most powerful application-specific integrated circuits (ASICs) in the literature. In DRL, TIME performs on average 126x better than GPU in energy efficiency. If the cost of tuning RRAM can be further reduced, TIME has the potential to boost energy efficiency by two orders of magnitude compared with ASICs.
- Published
- 2017
- Full Text
- View/download PDF
29. Design Methodology for Thin-Film Transistor Based Pseudo-CMOS Logic Array with Multi-Layer Interconnect Architecture
- Author
- Hailong Yao, Yongpan Liu, Wenyu Sun, Xiaojun Guo, Qinghang Zhao, Jiaqing Zhao, and Huazhong Yang
- Subjects
Engineering, Diode–transistor logic, Pass transistor logic, Logic family, Electrical engineering, Resistor–transistor logic, Programmable logic device, Integrated injection logic, Logic gate, Electronic engineering, Logic optimization
- Abstract
Thin-film transistor (TFT) circuits are important for flexible electronics, which are promising in the area of wearable devices. However, most TFT technologies offer only unipolar devices, and the process variation and defect rate are relatively high, which imposes challenges on TFT circuit design. In this paper, we propose a novel logic array based on pseudo-CMOS logic to address the problem of unipolar TFT circuit design. A multi-layer interconnect architecture and wire routing methodology are presented to improve the routability and, at the same time, the area efficiency. The experimental results show that the proposed logic array reduces area by more than 80% compared with a transistor-level scheme.
- Published
- 2017
- Full Text
- View/download PDF
30. Evaluating Data Resilience in CNNs from an Approximate Memory Perspective
- Author
- Yizhe Zhu, Fei Qiao, Yuanchang Chen, Jie Han, Huazhong Yang, and Yuansheng Liu
- Subjects
Degree (graph theory), Computer science, Perspective (graphical), Volume (computing), Word error rate, Parallel computing, Convolutional neural network, Resilience (network), Scaling, Algorithm, Data transmission
- Abstract
Due to the large volumes of data that need to be processed, efficient memory access and data transmission are crucial for high-performance implementations of convolutional neural networks (CNNs). Approximate memory is a promising technique to achieve efficient memory access and data transmission in CNN hardware implementations. To assess the feasibility of applying approximate memory techniques, we propose a framework for the data resilience evaluation (DRE) of CNNs and verify its effectiveness on a suite of prevalent CNNs. Simulation results show that a high degree of data resilience exists in these networks. By scaling the bit-width of the first five dominant data subsets, the data volume can be reduced by 80.38% on average with a 2.69% loss in relative prediction accuracy. For approximate memory with random errors, all the synaptic weights can be stored in the approximate part when the error rate is less than 10^-4, while 3 MSBs must be protected if the error rate is fixed at 10^-3. These results indicate a great potential for exploiting approximate memory techniques in CNN hardware design.
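An experiment of the kind the DRE framework performs can be approximated in software by flipping bits of stored weights at a given error rate while protecting a number of MSBs, then measuring the impact. The sketch below is an assumed, simplified version for 8-bit data, not the paper's tool.

```python
import numpy as np

def inject_bit_errors(data, error_rate, protected_msbs=0, rng=None):
    """Flip each unprotected bit of a uint8 array with probability error_rate."""
    if rng is None:
        rng = np.random.default_rng()
    out = data.copy()
    for bit in range(8 - protected_msbs):      # LSBs live in approximate memory
        flips = rng.random(data.shape) < error_rate
        out ^= (flips.astype(np.uint8) << bit)
    return out

rng = np.random.default_rng(7)
weights = rng.integers(0, 256, size=1000, dtype=np.uint8)
noisy = inject_bit_errors(weights, error_rate=1e-3, protected_msbs=3, rng=rng)
mean_abs_err = np.abs(noisy.astype(int) - weights.astype(int)).mean()
```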
- Published
- 2017
- Full Text
- View/download PDF
31. ForeGraph
- Author
- Ningyi Xu, Guohao Dai, Yuze Chi, Huazhong Yang, Yu Wang, and Tianhao Huang
- Subjects
Virtex, Computer science, Parallel computing, Chip, Scheduling (computing), Data access, Scalability, Graph (abstract data type), Field-programmable gate array, Random access
- Abstract
The performance of large-scale graph processing suffers from challenges including poor locality, lack of scalability, random access patterns, and heavy data conflicts. Some characteristics of FPGAs make them a promising solution to accelerate various applications; for example, on-chip block RAMs can provide high throughput for random data access. However, large-scale processing on a single FPGA chip is constrained by limited on-chip memory resources and off-chip bandwidth. Using a multi-FPGA architecture may alleviate these problems to some extent, but data partitioning and communication schemes must be considered to ensure locality and reduce data conflicts. In this paper, we propose ForeGraph, a large-scale graph processing framework based on a multi-FPGA architecture. In ForeGraph, each FPGA board stores only a partition of the entire graph in off-chip memory, and communication over partitions is reduced. Vertices and edges are sequentially loaded onto the FPGA chip and processed. Under our scheduling scheme, each FPGA chip performs graph processing in parallel without conflicts. We also analyze the impact of system parameters on the performance of ForeGraph. Our experimental results on the Xilinx Virtex UltraScale XCVU190 chip show that ForeGraph outperforms state-of-the-art FPGA-based large-scale graph processing systems by 4.54x when executing PageRank on the Twitter graph (1.4 billion edges). The average throughput is over 900 MTEPS in our design, 2.03x higher than previous work.
- Published
- 2017
- Full Text
- View/download PDF
32. RRAM based learning acceleration
- Author
- Lixue Xia, Yu Wang, Boxun Li, Huazhong Yang, Tianqi Tang, and Ming Cheng
- Subjects
Artificial neural network, Computer science, Deep learning, Ranging, Computer engineering, CMOS, Embedded system, Electronic design automation, Artificial intelligence, Field-programmable gate array, Electrical efficiency, Von Neumann architecture
- Abstract
Deep Learning (DL) is becoming popular in a wide range of domains. Many emerging applications, ranging from image and speech recognition to natural language processing and information retrieval, rely heavily on deep learning techniques, especially Neural Networks (NNs). NNs have led to great advances in recognition accuracy compared with traditional methods in recent years. However, NN-based methods demand much more computation and memory resource, and therefore a number of NN accelerators have been proposed on CMOS-based platforms, such as FPGA and GPU [1]. It is becoming more and more difficult to obtain substantial power-efficiency gains directly through the scaling down of the traditional CMOS technique. Meanwhile, the large data volume in DL applications also faces an ever-increasing "memory wall" challenge because of the inefficiency of the von Neumann architecture. Consequently, there is growing research interest in exploring emerging nano-devices and new computing architectures to further improve power efficiency [2].
- Published
- 2016
- Full Text
- View/download PDF
33. SATS
- Author
- Chun Jason Xue, Yongpan Liu, Hehe Li, Hyung Gyu Lee, Huazhong Yang, and Tongda Wu
- Subjects
Engineering, Real-time computing, Probabilistic logic, Energy consumption, Clock synchronization, Exponential function, Synchronizer, Redundancy (engineering), Electronic engineering, Energy harvesting, Wireless sensor network
- Abstract
Reliable and ultra-low power time synchronization becomes more and more important with the popularity of energy harvesting sensor nodes. This paper proposes an untethered, probabilistic, ultra-low power time synchronization method for energy-intermittent sensor networks. It avoids frequent RF communications with the assistance of a solar clock. The SATS system consists of two main parts: the synchronizer, a low-power solar clock module for time synchronization, and S3-Mapping, an offline sequence matching algorithm. Furthermore, we develop an improved version of S3-Mapping, which reduces the computation complexity from exponential to linear using redundancy models and an onion-peeling method. The SATS system is validated by both simulations and a prototype, which shows that second-level synchronization precision can be achieved with reasonable probability. Moreover, the energy consumption of time synchronization is reduced by one to two orders of magnitude compared with the most recent low-power time synchronization protocol.
- Published
- 2016
- Full Text
- View/download PDF
34. Switched by input
- Author
- Tianqi Tang, Yu Wang, Xiling Yin, Wenqin Huangfu, Huazhong Yang, Boxun Li, Lixue Xia, and Ming Cheng
- Subjects
Artificial neural network, Computer science, Quantization (signal processing), Convolutional neural network, Resistive random-access memory, Neuromorphic engineering, Electronic engineering, Crossbar switch, MNIST database, Efficient energy use
- Abstract
Convolutional Neural Networks (CNNs) are a powerful technique widely used in the computer vision area, and they demand far more computation and memory resources than traditional solutions. The emerging metal-oxide resistive random-access memory (RRAM) and RRAM crossbars have shown great potential for neuromorphic applications with high energy efficiency. However, the interfaces between analog RRAM crossbars and digital peripheral functions, namely Analog-to-Digital Converters (ADCs) and Digital-to-Analog Converters (DACs), consume most of the area and energy of RRAM-based CNN designs due to the large amount of intermediate data in CNNs. In this paper, we propose an energy-efficient structure for RRAM-based CNNs. Based on an analysis of the data distribution, a quantization method is proposed to transfer the intermediate data into 1 bit and eliminate the DACs. An energy-efficient structure using input data as selection signals is proposed to reduce the ADC cost of merging results from multiple crossbars. Experimental results show that the proposed method and structure can save 80% of the area and more than 95% of the energy while maintaining the same or comparable classification accuracy of the CNN on MNIST.
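The key interface saving comes from reducing intermediate activations to a single bit, so they can drive the next crossbar directly without DACs. A minimal sketch of such a quantizer is below; the threshold here is a simple median stand-in, whereas the paper derives its quantization from the analyzed data distribution.

```python
import numpy as np

def binarize_activations(a, threshold=None):
    """Quantize intermediate activations to 1 bit so they can drive the
    next crossbar's wordlines directly, with no DAC in between."""
    t = np.median(a) if threshold is None else threshold
    return (a > t).astype(np.int8)

rng = np.random.default_rng(8)
acts = np.maximum(rng.standard_normal(256), 0)  # post-ReLU activations
bits = binarize_activations(acts)
```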
- Published
- 2016
- Full Text
- View/download PDF
35. HW/SW co-design of nonvolatile IO system in energy harvesting sensor nodes for optimal data acquisition
- Author
- Zhangyuan Wang, Zewei Li, Chun Jason Xue, Wenyu Sun, Daming Zhang, Huazhong Yang, Xin Shi, Yongpan Liu, and Jiwu Shu
- Subjects
010302 applied physics ,Engineering ,business.industry ,Interface (computing) ,Initialization ,02 engineering and technology ,01 natural sciences ,020202 computer hardware & architecture ,Power (physics) ,Non-volatile memory ,Data acquisition ,Embedded system ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Transient (computer programming) ,business ,Energy harvesting - Abstract
Energy harvesting has been widely investigated as a promising alternative for future wearable sensors and the Internet of Things. However, power and performance overhead is incurred when IO operations are interrupted by power failures, because the non-preemptive nature of IO operations causes expensive re-executions. Furthermore, state-of-the-art IO devices need a long, power-hungry initialization process, which makes IO operations inefficient in transiently powered systems. This paper proposes a HW/SW co-design approach for a nonvolatile IO system to maximize data acquisition. A ferroelectric flip-flop-based nonvolatile IO architecture is adopted to reduce IO initialization overhead by 3-4 orders of magnitude. Based on the nonvolatile IO interface, we further formulate optimal data acquisition as an INLP problem, and a risk-aware online scheduler is presented to solve the problem efficiently. Experimental results show that the proposed HW/SW co-design architecture improves data acquisition by 2-5 times compared with a conventional HW/SW architecture.
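The online scheduling intuition: because IO operations are non-preemptive, an operation should only start when stored energy covers its worst case, so a power failure can never waste a partial execution. A greedy sketch under that assumption, with hypothetical task tuples `(name, worst_case_energy, value)` (the paper formulates the full problem as an INLP):

```python
def risk_aware_schedule(stored_energy, backup_energy, io_tasks):
    """Greedy admission: run an IO task only if the currently stored
    energy covers its worst case plus a state backup, so a power
    failure can never waste a partially executed (non-preemptive) IO."""
    done = []
    # favor high data value per unit of energy, a stand-in for the INLP objective
    for name, cost, value in sorted(io_tasks, key=lambda t: t[2] / t[1], reverse=True):
        if stored_energy >= cost + backup_energy:
            stored_energy -= cost
            done.append(name)
    return done, stored_energy
```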
- Published
- 2016
- Full Text
- View/download PDF
36. Performance-aware task scheduling for energy harvesting nonvolatile processors considering power switching overhead
- Author
- Hehe Li, Donglai Xiang, Daming Zhang, Huazhong Yang, Yongpan Liu, Jinshan Yue, Chun Jason Xue, Jingtong Hu, Jinyang Li, and Chenchen Fu
- Subjects
Rate-monotonic scheduling ,Job shop scheduling ,business.industry ,Computer science ,Processor scheduling ,020206 networking & telecommunications ,02 engineering and technology ,Dynamic priority scheduling ,Fair-share scheduling ,020202 computer hardware & architecture ,Scheduling (computing) ,Fixed-priority pre-emptive scheduling ,Embedded system ,Two-level scheduling ,0202 electrical engineering, electronic engineering, information engineering ,business ,Standby power ,Integer programming - Abstract
Nonvolatile processors have shown strong vitality in battery-less energy harvesting sensor nodes due to their zero standby power, resilience to power failures, and fast read/write operations. However, I/O and sensing operations cannot store their system states after power-off; they are therefore sensitive to power failures, and high power switching overhead is induced during power oscillation, which significantly degrades system performance. In this paper, we propose a novel performance-aware task scheduling technique that accounts for power switching overhead in energy harvesting nonvolatile processors. We first analyze the power switching overhead on energy harvesting sensor nodes. Then, the scheduling problem is formulated as a mixed integer linear program (MILP). Furthermore, a task splitting strategy is adopted to improve performance, and a heuristic scheduling algorithm is proposed to reduce the problem complexity. Experimental results show that the proposed scheduling approach improves performance by 14% on average compared to the state-of-the-art scheduling strategy. With task splitting, the execution time is further reduced by 10.6%.
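Task splitting helps because an atomic task that cannot finish within one power-on interval is re-executed from scratch after every failure, while checkpointable chunks bound the wasted work. A minimal sketch of the chunk-count calculation, with hypothetical energy parameters:

```python
import math

def split_task(task_energy, interval_energy, switch_overhead):
    """Split a task into the fewest chunks that each fit in one
    power-on interval, accounting for the power switching overhead
    paid at the start of every interval."""
    usable = interval_energy - switch_overhead
    if usable <= 0:
        raise ValueError("interval too short to make progress")
    return math.ceil(task_energy / usable)
```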
- Published
- 2016
- Full Text
- View/download PDF
37. FPGP
- Author
- Yuze Chi, Yu Wang, Huazhong Yang, and Guohao Dai
- Subjects
010302 applied physics ,Computer science ,Breadth-first search ,Data path ,02 engineering and technology ,Parallel computing ,01 natural sciences ,Graph ,020202 computer hardware & architecture ,0103 physical sciences ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Graph algorithms ,Field-programmable gate array - Abstract
Large-scale graph processing is gaining increasing attention in many domains. Meanwhile, FPGAs provide a power-efficient and highly parallel platform and have been applied to custom computing in many domains. In this paper, we describe FPGP (FPGA Graph Processing), a streamlined vertex-centric graph processing framework on FPGA based on the interval-shard structure. FPGP is adaptable to different graph algorithms, so users do not need to change the whole FPGA implementation. In our implementation, an on-chip parallel graph processor is proposed to both maximize the off-chip bandwidth for graph data and fully exploit the parallelism of graph processing. We also analyze the performance of FPGP and show its scalability as the data path bandwidth increases. FPGP is more power-efficient than single-machine systems and scales to larger graphs than other FPGA-based graph systems.
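The interval-shard structure partitions vertices into contiguous intervals and buckets edges by the intervals of their endpoints, so each processing pass touches a bounded vertex working set. A minimal construction sketch (illustrative layout; FPGP's on-chip organization differs in detail):

```python
def build_interval_shards(edges, num_vertices, num_intervals):
    """Partition vertices into equal-sized intervals and bucket each
    edge (src, dst) into the shard indexed by (src interval, dst interval)."""
    size = (num_vertices + num_intervals - 1) // num_intervals
    shards = {}
    for src, dst in edges:
        key = (src // size, dst // size)
        shards.setdefault(key, []).append((src, dst))
    return shards
```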
- Published
- 2016
- Full Text
- View/download PDF
38. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
- Author
- Kaiyuan Guo, Jincheng Yu, Jiantao Qiu, Jie Wang, Sen Song, Boxun Li, Ningyi Xu, Erjin Zhou, Song Yao, Yu Wang, Huazhong Yang, and Tianqi Tang
- Subjects
business.industry ,Computer science ,Deep learning ,02 engineering and technology ,Frame rate ,Convolutional neural network ,020202 computer hardware & architecture ,CUDA ,0202 electrical engineering, electronic engineering, information engineering ,Bandwidth (computing) ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Field-programmable gate array ,Quantization (image processing) ,Auxiliary memory ,Computer hardware - Abstract
In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and are among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computation-intensive and resource-consuming, and thus are hard to integrate into embedded systems such as smartphones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNNs, but limited bandwidth and on-chip memory size constrain the performance of FPGA accelerators for CNNs. In this paper, we go deeper with the embedded FPGA platform for accelerating CNNs and propose a CNN accelerator design on an embedded FPGA for ImageNet large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that convolutional layers are computation-centric while fully-connected layers are memory-centric. Then a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNNs are proposed to improve bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network implemented on an FPGA end-to-end so far. The system on a Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of the convolutional layers and of the full CNN is 187.8 GOP/s and 137.0 GOP/s, respectively, at a 150 MHz working frequency, which significantly outperforms previous approaches.
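Dynamic-precision quantization keeps a fixed total bit width but lets each layer choose its own radix point position to minimize quantization error. A per-layer search sketch in that spirit (illustrative, not the paper's exact flow):

```python
import numpy as np

def best_fixed_point(values, bits=8):
    """Pick the fractional length in [0, bits) that minimizes the
    total quantization error of one layer's weights or activations."""
    best_fl, best_err = 0, np.inf
    for fl in range(bits):
        step = 2.0 ** -fl
        lo = -(2 ** (bits - 1)) * step           # most negative representable value
        hi = (2 ** (bits - 1) - 1) * step        # most positive representable value
        q = np.clip(np.round(values / step) * step, lo, hi)
        err = np.abs(values - q).sum()
        if err < best_err:
            best_fl, best_err = fl, err
    return best_fl
```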
- Published
- 2016
- Full Text
- View/download PDF
39. A STT-RAM-based low-power hybrid register file for GPGPUs
- Author
- Gushu Li, Xiaoming Chen, Yongpan Liu, Henry Hoffmann, Huazhong Yang, Yu Wang, and Guangyu Sun
- Subjects
Memory buffer register ,Memory address register ,Hardware_MEMORYSTRUCTURES ,Computer science ,Processor register ,Control register ,Register file ,Static random-access memory ,Thread (computing) ,Parallel computing ,Memory data register ,Stack register - Abstract
Recently, general-purpose graphics processing units (GPGPUs) have been widely used to accelerate computing in various applications. To store the contexts of thousands of concurrent threads on a GPU, a large static random-access memory (SRAM)-based register file is employed. Due to the high leakage power of SRAM, the register file accounts for 20% to 40% of total GPU power consumption. Hybrid memory systems, which combine SRAM and emerging non-volatile memory (NVM), have therefore been employed for GPU register file design. Although they show strong potential to alleviate the power issue, existing hybrid memory solutions do not exploit the intrinsic features of the GPU register file. By leveraging the GPU's warp schedule, this paper proposes a hybrid register architecture that consists of an NVM-based register file and SRAM-based write buffers with a warp-aware write-back strategy. Simulation results show that our design eliminates 64% of write accesses to NVM and reduces register file power by 66% on average, with only 4.2% performance degradation. After applying power gating, the register power is further reduced to 25% of its SRAM counterpart on average.
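The write buffer exploits the fact that a warp's registers are overwritten many times while the warp is active, so only the last value needs to reach NVM. A toy coalescing model, assuming hypothetical flush-on-deschedule semantics:

```python
class WarpWriteBuffer:
    """Coalesce a warp's register writes in SRAM; only the final value
    of each register is flushed to the NVM register file when the warp
    is descheduled, eliminating redundant NVM writes."""
    def __init__(self):
        self.buf = {}           # register id -> latest value
        self.nvm_writes = 0

    def write(self, reg, value):
        self.buf[reg] = value   # overwrite in SRAM, no NVM access

    def flush(self, nvm):
        for reg, value in self.buf.items():
            nvm[reg] = value
            self.nvm_writes += 1
        self.buf.clear()
```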
- Published
- 2015
- Full Text
- View/download PDF
40. Deadline-aware task scheduling for solar-powered nonvolatile sensor nodes with global energy migration
- Author
- Jinyang Li, Chun Jason Xue, Xiao Sheng, Huazhong Yang, Tongda Wu, Daming Zhang, and Yongpan Liu
- Subjects
High energy ,Engineering ,Global energy ,business.industry ,Real-time computing ,Energy migration ,Scheduling (computing) ,law.invention ,Capacitor ,law ,Solar powered ,business ,Queue ,Solar power - Abstract
Solar-powered sensor nodes with energy storage are widely used today and promising for the coming trillion-sensor era, as they do not require manual battery charging or replacement. The variable and limited solar power supply seriously affects the deadline miss rates (DMRs) of tasks on these nodes, so energy-driven task scheduling is necessary. However, current algorithms focus on a single period (or the current task queue) for high energy utilization and suffer from poor long-term DMR. To improve long-term DMR, we propose a long-term deadline-aware scheduling algorithm with energy migration strategies for distributed supercapacitors. Experimental results show that the proposed algorithm reduces the DMR by 27.8% with an overhead of less than 3% of the total energy consumption.
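Energy migration shifts charge from over-provisioned supercapacitors to nodes that would otherwise miss deadlines, at the price of transfer losses. A sketch of one migration decision, with a hypothetical transfer efficiency `eta` (illustrative; the paper's strategies are more elaborate):

```python
def migrate(donor_energy, donor_need, receiver_energy, receiver_need, eta=0.85):
    """Move surplus energy from a donor capacitor to a receiver that is
    short on energy; eta models conversion and transfer losses."""
    surplus = max(0.0, donor_energy - donor_need)
    deficit = max(0.0, receiver_need - receiver_energy)
    moved = min(surplus, deficit / eta)    # amount drawn from the donor
    return donor_energy - moved, receiver_energy + moved * eta
```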
- Published
- 2015
- Full Text
- View/download PDF
41. Ambient energy harvesting nonvolatile processors
- Author
- Yiqun Wang, Kaisheng Ma, Huazhong Yang, Jiwu Shu, Xueqing Li, Yongpan Liu, Meng-Fan Chang, Zewei Li, Jack Sampson, Yuan Xie, Hehe Li, and Shuangchen Li
- Subjects
Engineering ,Hardware_MEMORYSTRUCTURES ,business.industry ,Processor design ,Energy consumption ,Ambient energy ,Backup ,Embedded system ,System level ,business ,Performance metric ,Energy harvesting ,Computer hardware ,Leakage (electronics) - Abstract
Energy harvesting is gaining increasing attention due to its promise of ultra-long operation without maintenance. However, the frequent, unpredictable power failures of energy harvesters pose performance and reliability challenges to traditional processors. Nonvolatile processors are a promising solution thanks to their zero leakage and efficient backup and restore operations. To optimize nonvolatile processor design, this paper proposes, for the first time, metrics for nonvolatile processors that take energy harvesting factors into account. Furthermore, we explore nonvolatile processor design from the circuit to the system level. A prototype energy harvesting nonvolatile processor is set up, and experimental results show that the proposed performance metric matches measured results with less than 6.27% average error. Finally, the energy consumption of the nonvolatile processor is analyzed under different benchmarks.
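One way to see why harvesting-aware metrics are needed: under intermittent power, every power-on interval pays a restore at power-up and a backup before power-down, so only part of it does useful work. A hedged sketch of one plausible forward-progress measure (illustrative only, not the paper's definition):

```python
def forward_progress(t_on, t_restore, t_backup):
    """Fraction of each power-on interval spent on useful computation,
    after paying the restore at power-up and the backup before power-down."""
    useful = t_on - t_restore - t_backup
    return max(0.0, useful) / t_on

# Example: 10 ms power-on bursts with 50 us restore and 30 us backup
print(forward_progress(10e-3, 50e-6, 30e-6))   # ~0.992
```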
- Published
- 2015
- Full Text
- View/download PDF
42. Energy Efficient RRAM Spiking Neural Network for Real Time Classification
- Author
- Lixue Xia, Hai Li, Boxun Li, Yu Wang, Yuan Xie, Tianqi Tang, Peng Gu, and Huazhong Yang
- Subjects
Spiking neural network ,Boosting (machine learning) ,business.industry ,Computer science ,Ranging ,Machine learning ,computer.software_genre ,Resistive random-access memory ,Neuromorphic engineering ,Artificial intelligence ,business ,computer ,Electrical efficiency ,MNIST database ,Efficient energy use - Abstract
Inspired by the human brain's function and efficiency, neuromorphic computing offers a promising solution for a wide set of tasks, ranging from brain-machine interfaces to real-time classification. The spiking neural network (SNN), which encodes and processes information with bionic spikes, is an emerging neuromorphic model with great potential to drastically improve the performance and efficiency of computing systems. However, the lack of an energy-efficient hardware implementation and the difficulty of training significantly limit the application of SNNs. In this work, we address these issues by building an SNN-based energy-efficient system for real-time classification with metal-oxide resistive switching random-access memory (RRAM) devices. We implement different SNN training algorithms, including Spike-Timing-Dependent Plasticity (STDP) and the Neural Sampling method. Our RRAM SNN systems for these two training algorithms show good power efficiency and recognition performance on real-time classification tasks such as MNIST digit recognition. Finally, we propose a possible direction to further improve classification accuracy by boosting multiple SNNs.
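STDP adjusts a synapse according to the relative timing of pre- and post-synaptic spikes: pre-before-post potentiates, post-before-pre depresses, with exponentially decaying magnitude. The textbook pairwise rule looks like this (a generic form, not the paper's RRAM-specific implementation):

```python
import math

def stdp_dw(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pairwise STDP weight update (times in ms): potentiate when the
    presynaptic spike precedes the postsynaptic one, depress otherwise,
    with exponentially decaying magnitude over the time window tau."""
    dt = t_post - t_pre
    if dt > 0:
        return a_plus * math.exp(-dt / tau)
    return -a_minus * math.exp(dt / tau)
```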
- Published
- 2015
- Full Text
- View/download PDF
43. Design Methodologies for 3D Mixed Signal Integrated Circuits
- Author
- Guoqing Chen, Wulong Liu, Huazhong Yang, Yu Wang, Yuan Xie, and Xue Han
- Subjects
Engineering ,12-bit ,business.industry ,Three-dimensional integrated circuit ,Successive approximation ADC ,Mixed-signal integrated circuit ,Integrated circuit ,Power (physics) ,law.invention ,Reduction (complexity) ,law ,Hardware_INTEGRATEDCIRCUITS ,Electronic engineering ,Performance improvement ,business - Abstract
Three-dimensional (3D) integration has been proposed as a promising technology that provides a small footprint, reduced wirelength, and the capability of heterogeneous integration. In particular, 3D ICs are a good candidate for addressing the design issues in conventional analog/digital mixed-signal IC design. In this work, we focus on modeling and analyzing the impact of through-silicon vias (TSVs) on mixed-signal ICs. Based on this analysis, a set of design methodologies for 3D mixed-signal ICs is proposed. The methodologies are verified with a case study in which a 12-bit successive approximation register analog-to-digital converter (SAR ADC) is redesigned by partitioning it into three stacked layers for 3D integration. The experimental results show that, compared to its traditional 2D counterpart, our 3D SAR ADC with optimized TSV placement achieves significant area and power reductions as well as performance improvement. Specifically, due to the isolation of substrate noise in our 3D design, the signal-to-noise-plus-distortion ratio (SNDR) is improved from 68.74 dB to 74.12 dB.
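To put the SNDR gain in perspective, the effective number of bits follows from the standard relation ENOB = (SNDR − 1.76 dB) / 6.02 dB. A quick check on the reported figures:

```python
def enob(sndr_db):
    """Effective number of bits from SNDR (in dB)."""
    return (sndr_db - 1.76) / 6.02

print(enob(68.74))   # ~11.1 bits for the 2D design
print(enob(74.12))   # ~12.0 bits for the 3D design, near the ideal 12-bit resolution
```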
- Published
- 2014
- Full Text
- View/download PDF
44. Run-Time Technique for Simultaneous Aging and Power Optimization in GPGPUs
- Author
- Yuan Xie, Xiaoming Chen, Yu Wang, Yun Liang, and Huazhong Yang
- Subjects
Reduction (complexity) ,Negative-bias temperature instability ,Computer science ,Electronic engineering ,Bandwidth (computing) ,Parallel computing ,General-purpose computing on graphics processing units ,Degradation (telecommunications) ,Threshold voltage ,Power (physics) ,Power optimization - Abstract
High-performance general-purpose graphics processing units (GPGPUs) may suffer from serious power and negative bias temperature instability (NBTI) problems. In this paper, we propose a framework for run-time aging and power optimization. Our technique is based on the observation that many GPGPU applications achieve optimal performance with only a portion of the cores, due to either bandwidth saturation or shared-resource contention. At run time, given the dynamically tracked NBTI-induced threshold voltage shift and the problem size of the GPGPU application, our algorithm returns the optimal number of cores using detailed performance modeling. The unused cores are power-gated for power saving and NBTI recovery. Experiments show that the proposed technique achieves an average 34% reduction in NBTI-induced threshold voltage shift and a 19% power reduction, while the average performance degradation is less than 1%.
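The run-time decision amounts to finding the core count at which a performance model saturates and power-gating the rest, which simultaneously saves power and lets idle cores recover from NBTI stress. A sketch with a hypothetical performance model `perf(n)`:

```python
def pick_active_cores(perf, n_total, tolerance=0.01):
    """Return the smallest core count whose modeled performance is
    within `tolerance` of the best achievable; the remaining cores
    are power-gated for power saving and NBTI recovery."""
    best = max(perf(n) for n in range(1, n_total + 1))
    for n in range(1, n_total + 1):
        if perf(n) >= (1.0 - tolerance) * best:
            return n
    return n_total
```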
- Published
- 2014
- Full Text
- View/download PDF
45. Accelerating frequent item counting with FPGA
- Author
- Huazhong Yang, Yu Wang, Yuliang Sun, Rong Luo, Sitao Huang, Zilong Wang, and Lanjun Wang
- Subjects
Filter (video) ,Computer science ,Time series data mining ,Scalability ,Real-time computing ,Lookup table ,Data input ,Field-programmable gate array ,Throughput (business) - Abstract
Frequent item counting is one of the most important operations in time series data mining algorithms, and the Space-Saving algorithm is a widely used approach to this problem. With rapidly rising data input speeds, the most challenging problem in frequent item counting is meeting the requirement of wire-speed processing. In this paper, we propose a streaming-oriented PE-ring framework on FPGA for counting frequent items. Compared with the best existing FPGA implementation, our basic PE-ring framework saves 50% of the lookup table resources and achieves the same throughput in a more scalable way. Furthermore, we adopt a SIMD-like cascaded filter for further performance improvement, which outperforms the previous work by up to 3.24x on some data distributions.
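For reference, the Space-Saving algorithm keeps only k counters: a new item evicts the current minimum counter and inherits its count, which bounds the overestimation by that evicted count. A minimal software version of the classic algorithm (this is the baseline the PE-ring parallelizes, not the FPGA design itself):

```python
def space_saving(stream, k):
    """Approximate frequent item counting with at most k counters."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            victim = min(counts, key=counts.get)    # evict the minimum counter
            counts[item] = counts.pop(victim) + 1   # inherit (overestimates by <= old min)
    return counts

print(space_saving("abracadabra", k=3))   # {'a': 5, 'b': 3, 'r': 3} (approximate counts)
```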
- Published
- 2014
- Full Text
- View/download PDF
46. Nonzero pattern analysis and memory access optimization in GPU-based sparse LU factorization for circuit simulation
- Author
- Du Su, Huazhong Yang, Yu Wang, and Xiaoming Chen
- Subjects
Computer science ,MathematicsofComputing_NUMERICALANALYSIS ,Pattern analysis ,Parallel computing ,Sparse approximation ,Incomplete LU factorization ,Solver ,LU decomposition ,Memory access pattern ,Computational science ,law.invention ,law ,Component (UML) ,Computer Science::Mathematical Software ,Sparse matrix - Abstract
The sparse matrix solver is a critical component in circuit simulators. Several studies have developed GPU-based LU factorization approaches to accelerate the sparse solver, but the performance of these solvers is constrained by the irregularities of sparse matrices. This work investigates the nonzero patterns and memory access patterns in sparse LU factorization and explores their common features to give guidelines for improving GPU solvers. We further propose a crisscross blocked implementation on GPUs. The proposed method attains average speedups of 1.68x over the unblocked method and 2.2x over 4-threaded PARDISO on circuit matrices.
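For context, circuit solvers typically use the left-looking column algorithm: each column is updated by the previously factorized columns it depends on, then scaled. A dense-vector reference of that kernel (real solvers restrict the loops to the symbolic nonzero pattern, which is exactly where the irregularity studied here arises):

```python
import numpy as np

def left_looking_lu(A):
    """Left-looking LU without pivoting (acceptable for the diagonally
    dominant matrices typical of circuit simulation)."""
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for j in range(n):
        x = A[:, j].astype(float).copy()
        for k in range(j):                       # update with earlier columns
            x[k + 1:] -= L[k + 1:, k] * x[k]     # x[k] already equals U[k, j] here
        U[:j + 1, j] = x[:j + 1]
        L[j + 1:, j] = x[j + 1:] / x[j]
    return L, U
```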
- Published
- 2013
- Full Text
- View/download PDF
47. Accelerating subsequence similarity search based on dynamic time warping distance with FPGA
- Author
- Huazhong Yang, Zilong Wang, Yu Wang, Sitao Huang, Lanjun Wang, and Hao Li
- Subjects
Reduction (complexity) ,Dynamic time warping ,Data dependency ,Speedup ,Computer science ,Nearest neighbor search ,Search engine indexing ,Subsequence ,Synchronization (computer science) ,Parallel computing - Abstract
Subsequence search, especially subsequence similarity search, is one of the most important subroutines in time series data mining algorithms, and there is increasing evidence that Dynamic Time Warping (DTW) is the best distance metric. However, despite great effort on software speedup techniques, including early abandoning strategies, lower bounds, indexing, and computation reuse, DTW still costs too much time for many applications, e.g., 80% of the total time. Since DTW is a two-dimensional sequential dynamic program with high data dependency, it is hard to accelerate with parallel hardware. In this work, we propose a novel framework for FPGA-based subsequence similarity search and a novel PE-ring structure for DTW calculation. The framework exploits the data reusability of consecutive DTW calculations to reduce bandwidth and exploit coarse-grained parallelism, while guaranteeing accuracy with a two-phase precision reduction. The PE-ring supports online updating of patterns of arbitrary length and utilizes the hard-wired synchronization of FPGAs to realize fine-grained parallelism. It also offers a flexible degree of parallelism for performance-cost trade-offs. The experimental results show several orders of magnitude speedup in subsequence similarity search compared with the best software and current GPU/FPGA implementations on different datasets.
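The data dependency is visible directly in the DTW recurrence: each cell needs its left, upper, and diagonal neighbors, so only cells along one anti-diagonal can be computed in parallel, which is what a PE ring exploits. A reference implementation of the recurrence:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(n*m) DTW: D[i][j] depends on D[i-1][j], D[i][j-1],
    and D[i-1][j-1], so cells on one anti-diagonal are independent."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```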
- Published
- 2013
- Full Text
- View/download PDF
48. Pub/Sub on stream
- Author
- Huazhong Yang, Xiaotao Chang, Zhaoran Wang, Yu Zhang, Yu Wang, Xiang Mi, and Kun Wang
- Subjects
Schedule ,Smart grid ,Computer science ,business.industry ,Stream ,Distributed computing ,Quality of service ,Scalability ,Throughput ,Message broker ,business ,Blossom algorithm ,Computer network - Abstract
Publish/Subscribe (Pub/Sub) is becoming an increasingly popular message delivery technique in the Internet of Things (IoT) era. However, classical Pub/Sub is not suitable for some emerging IoT applications, such as smart grid, transportation, and sensor/actuator applications, due to its lack of QoS capabilities. To meet the QoS requirements of IoT message delivery, in this paper we propose the first Pub/Sub message broker with the ability to actively schedule computation resources to guarantee QoS. We abstract the message matching algorithm into a task graph that expresses the data flow, forming a task-based stream matching framework. Based on this framework, we develop a message dispatching algorithm called Smart Dispatch and a task scheduling algorithm called DFGS to guarantee different QoS requirements. Experiments show that the QoS-aware system supports more than 10x the throughput of QoS-ignorant systems in representative smart grid cases. Our system also shows near-linear scalability on a commodity multi-core machine.
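The scheduling idea can be illustrated with a toy QoS-aware dispatcher in which messages carry deadlines and the broker always matches the most urgent one first rather than serving FIFO order. A hedged sketch (hypothetical structure; Smart Dispatch and DFGS are considerably more elaborate):

```python
import heapq, itertools

class QosBroker:
    """Toy QoS-aware dispatch: pop the message with the earliest
    deadline first instead of processing messages in arrival order."""
    def __init__(self):
        self.queue = []
        self._seq = itertools.count()   # tie-breaker so payloads are never compared

    def publish(self, deadline, topic, payload):
        heapq.heappush(self.queue, (deadline, next(self._seq), topic, payload))

    def dispatch(self, subscribers):
        while self.queue:
            _, _, topic, payload = heapq.heappop(self.queue)
            for callback in subscribers.get(topic, []):
                callback(payload)
```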
- Published
- 2012
- Full Text
- View/download PDF
49. Sparse LU factorization for parallel circuit simulation on GPU
- Author
- Xiaoming Chen, Chenxi Zhang, Yu Wang, Ling Ren, and Huazhong Yang
- Subjects
Multi-core processor ,Speedup ,Computer science ,Thread (computing) ,Parallel computing ,Solver ,LU decomposition ,Memory access pattern ,law.invention ,Matrix (mathematics) ,CUDA ,Factorization ,law ,Scalability - Abstract
The sparse solver has become the bottleneck of SPICE simulators. There has been little work on GPU-based sparse solvers because of the high data dependency. The strong data dependency means that parallel sparse LU factorization runs efficiently only on shared-memory computing devices, but the number of CPU cores sharing the same memory is often limited. State-of-the-art graphics processing units (GPUs) naturally have numerous cores sharing the device memory, and provide a possible solution to the problem. In this paper, we propose a GPU-based sparse LU solver for circuit simulation. We optimize the work partitioning, the number of active thread groups, and the memory access pattern based on the GPU architecture. On matrices whose factorization involves many floating-point operations, our GPU-based sparse LU factorization achieves a 7.90x speedup over a 1-core CPU and a 1.49x speedup over an 8-core CPU. We also analyze the scalability of parallel sparse LU factorization and investigate which CPU and GPU specifications most influence performance.
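The exploitable parallelism comes from the column dependency DAG: column j must wait for every earlier column k with U(k, j) ≠ 0. Levelizing the DAG yields groups of mutually independent columns that thread groups can factorize concurrently. A sketch of that scheduling step, assuming dependency sets obtained from a symbolic analysis:

```python
def levelize(deps):
    """deps[j] = set of earlier columns that column j depends on.
    Returns lists of columns per level; all columns in one level
    are mutually independent and can be factorized in parallel."""
    n = len(deps)
    level = [0] * n
    for j in range(n):                           # columns in increasing order
        if deps[j]:
            level[j] = 1 + max(level[k] for k in deps[j])
    levels = {}
    for j, lv in enumerate(level):
        levels.setdefault(lv, []).append(j)
    return [levels[lv] for lv in sorted(levels)]

# Example: column 2 depends on columns 0 and 1, which are independent
print(levelize([set(), set(), {0, 1}]))   # [[0, 1], [2]]
```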
- Published
- 2012
- Full Text
- View/download PDF
50. A low-power all-digital GFSK demodulator with robust clock data recovery
- Author
- Huazhong Yang, Rong Luo, Pengpeng Chen, and Bo Zhao
- Subjects
Intermediate frequency ,CMOS ,Low IF receiver ,Computer science ,Electronic engineering ,Bit error rate ,Modulation index ,Demodulation ,Frequency deviation ,Center frequency - Abstract
This paper presents an all-digital Gaussian frequency shift keying (GFSK) demodulator with robust clock data recovery (CDR) for low-intermediate-frequency (low-IF) receivers in wireless sensor networks (WSNs). The proposed demodulator automatically detects and adapts to the intermediate frequency of the received signal. In addition, the CDR can tolerate frequency deviation of the input clock. An implementation of the demodulator with CDR is realized in HJTC 0.18 µm CMOS technology. The chip is designed for GFSK signals with a center frequency of 200 kHz, a modulation index of 1, and a data rate of 100 kbps. Experimental results show that the chip consumes 0.53 mA from a 1.8 V power supply, and only an 11 dB input signal-to-noise ratio (SNR) is required for a 10^-3 bit error rate (BER). The tolerance range for IF offset is ±12.5% at 11 dB input SNR, and the CDR can tolerate an input clock frequency deviation of ±0.1%.
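A common all-digital way to demodulate GFSK at low IF is a phase-difference frequency discriminator: multiply each complex sample by the conjugate of its predecessor, take the angle to obtain instantaneous frequency, and slice it around the IF. A generic sketch of that idea (illustrative; not this chip's architecture):

```python
import numpy as np

def gfsk_demod(iq, fs, f_if):
    """Frequency-discriminator GFSK demodulation of a low-IF signal:
    instantaneous frequency above the IF maps to '1', below to '0'.
    iq: complex baseband samples, fs: sample rate (Hz), f_if: IF (Hz)."""
    inst_freq = np.angle(iq[1:] * np.conj(iq[:-1])) * fs / (2 * np.pi)
    return (inst_freq > f_if).astype(np.uint8)   # raw decisions, before clock recovery
```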
- Published
- 2012
- Full Text
- View/download PDF