186 results for "Meng Fan"
Search Results
2. MARS: Multimacro Architecture SRAM CIM-Based Accelerator With Co-Designed Compressed Neural Networks
- Author
-
Jye-Luen Lee, Chih-Cheng Hsieh, Zuo-Wei Yeh, Kea-Tiong Tang, Syuan-Hao Sie, Meng-Fan Chang, Yi-Ren Chen, Zhaofang Li, and Chih-Cheng Lu
- Subjects
Computer science, Machine Learning (cs.LG), Emerging Technologies (cs.ET), Hardware Architecture (cs.AR), Software, Convolutional neural network, Static random-access memory, Electrical and Electronic Engineering, Macro, Artificial neural network, Deep learning, Throughput, Embedded system, Artificial intelligence, Energy efficiency - Abstract
Convolutional neural networks (CNNs) play a key role in deep learning applications. However, the large storage overhead and the substantial computation cost of CNNs are problematic in hardware accelerators. Computing-in-memory (CIM) architecture has demonstrated great potential for effectively computing large-scale matrix-vector multiplication. However, the intensive multiply-and-accumulate (MAC) operations executed at the crossbar array and the limited capacity of CIM macros remain bottlenecks for further improvement of energy efficiency and throughput. To reduce computation costs, network pruning and quantization are two widely studied compression methods for shrinking the model size. However, most model compression algorithms can only be implemented in digital CNN accelerators. For implementation in a static random-access memory (SRAM) CIM-based accelerator, the model compression algorithm must consider the hardware limitations of CIM macros, such as the number of word lines and bit lines that can be turned on at the same time, as well as how to map the weights to the SRAM CIM macro. In this study, a software-hardware co-design approach is proposed to design an SRAM CIM-based CNN accelerator and an SRAM CIM-aware model compression algorithm. To lessen the high-precision MAC operations required by batch normalization (BN), a quantization algorithm that fuses BN into the weights is proposed. Furthermore, to reduce the number of network parameters, a sparsity algorithm that considers the CIM architecture is proposed. Lastly, MARS, a CIM-based CNN accelerator that can utilize multiple SRAM CIM macros as processing units and supports sparse neural networks, is proposed. (IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2021)
- Published
- 2022
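The BN-fusion step described in the abstract can be illustrated in software: folding batch-normalization parameters into the preceding layer's weights and bias is a standard algebraic identity, sketched below for a linear layer (function name, shapes, and data are illustrative, not taken from the paper):

```python
import numpy as np

def fuse_bn_into_weights(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding layer's weights
    and bias, so inference needs no separate high-precision BN step."""
    scale = gamma / np.sqrt(var + eps)    # per-output-channel scale
    w_fused = w * scale[:, None]          # scale each output channel's weights
    b_fused = (b - mean) * scale + beta   # fold mean/shift into the bias
    return w_fused, b_fused

# Check: the fused layer matches linear + BN applied separately.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)); b = rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)
x = rng.normal(size=8)

y_ref = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
wf, bf = fuse_bn_into_weights(w, b, gamma, beta, mean, var)
assert np.allclose(wf @ x + bf, y_ref)
```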
3. Two-Way Transpose Multibit 6T SRAM Computing-in-Memory Macro for Inference-Training AI Edge Chips
- Author
-
Yen-Lin Chung, Ting-Wei Chang, Ren-Shuo Liu, Fu-Chun Chang, Jian-Wei Su, Sih-Han Li, Hongwu Jiang, Shimeng Yu, Ta-Wei Liu, Yung-Ning Tu, Kea-Tiong Tang, Chung-Chuan Lo, Meng-Fan Chang, Shanshi Huang, Yuan Wu, Wei-Hsing Huang, Yen-Chi Chou, Chih-Cheng Hsieh, Pei-Jung Lu, Jing-Hong Wang, Ruhui Liu, Jin-Sheng Ren, Chih-I Wu, Xin Si, and Shyh-Shyuan Sheu
- Subjects
Process variation, Edge device, Computer science, Computation, Transpose, Static random-access memory, Electrical and Electronic Engineering, Macro, Computational science, Energy efficiency - Abstract
Computing-in-memory (CIM) based on SRAM is a promising approach to achieving energy-efficient multiply-and-accumulate (MAC) operations in artificial intelligence (AI) edge devices; however, existing SRAM-CIM chips support only DNN inference. The flow of training data requires that CIM arrays perform convolutional computation using transposed weight matrices. This article presents a two-way transpose (TWT) multiply cell with high resistance to process variation and a novel read scheme that uses input-aware zone prediction of maximum partial MAC values to enhance the signal margin for robust readout. A 28-nm 64-kb TWT CIM macro fabricated using foundry-provided compact 6T-SRAM cells achieved access times ($T_{AC}$) of 3.8–21 ns and energy efficiency of 7–61.1 TOPS/W in performing MAC operations using 2–8-b inputs, 4–8-b weights, and 10–20-b outputs.
- Published
- 2022
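The motivation for transposed weights can be seen in a plain NumPy sketch: inference applies the stored weight matrix W, while backpropagating errors through the same layer applies its transpose, which is what a two-way-transpose macro supports from a single stored copy (variable names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32))   # weight matrix stored in the CIM array
x = rng.normal(size=32)         # forward input (inference direction)
dy = rng.normal(size=16)        # error arriving at the layer's output

y = W @ x        # inference: read the array in one direction
dx = W.T @ dy    # training: the backward pass needs the transposed weights

# A two-way-transpose macro computes both from one stored copy of W;
# in software the equivalence is just (W.T)[j, i] == W[i, j].
assert dx.shape == (32,)
assert np.allclose(W.T @ dy, dy @ W)
```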
4. A 40-nm, 64-Kb, 56.67 TOPS/W Voltage-Sensing Computing-In-Memory/Digital RRAM Macro Supporting Iterative Write With Verification and Online Read-Disturb Detection
- Author
-
Meng-Fan Chang, Win-San Khwa, Muya Chang, Arijit Raychowdhury, Yu-Der Chih, and Jong-Hyeok Yoon
- Subjects
Reliability (semiconductor), CMOS, Computer science, Quantization (signal processing), State (computer science), Electrical and Electronic Engineering, Macro, Chip, Computer hardware, Energy efficiency, Resistive random-access memory - Abstract
Computing-in-memory (CIM) architectures have gained importance in achieving high-throughput energy-efficient artificial intelligence (AI) systems. Resistive RAM (RRAM) is a promising candidate for CIM architectures due to a multiply-and-accumulate (MAC)-friendly structure, high bit density, compatibility with a CMOS process, and nonvolatility. Notwithstanding the advancement of RRAM technology, the reliability of an RRAM array hinders the spread of RRAM applications such that a circuit-technology joint approach is necessary to attain reliable RRAM-based CIM architectures. This article presents a 64-kb hybrid CIM/digital RRAM macro supporting: 1) active-feedback-based voltage-sensing read (RD) to enable 1-8-b programmable vector-matrix multiplication under a low-resistance ratio of the high-resistance state to the low-resistance state in an RRAM array; 2) iterative write with verification to secure a tight resistance distribution; and 3) online RD-disturb detection in the background during CIM. The test chip fabricated in a 40-nm CMOS and RRAM process achieves a peak energy efficiency of 56.67 TOPS/W while demonstrating the eight-bitline hybrid CIM/digital MAC operation with 1-8-b inputs and weights and 20-b outputs without quantization.
- Published
- 2022
5. SAPIENS: A 64-kb RRAM-Based Non-Volatile Associative Memory for One-Shot Learning and Inference at the Edge
- Author
-
Yue-Der Chih, Weier Wan, Ching-Hua Wang, Harry Chuang, Hongjie Wang, Po-Han Chen, Wei-Chen Chen, Haitong Li, Priyanka Raina, Akash Levy, Win-San Khwa, H.-S. Philip Wong, and Meng-Fan Chang
- Subjects
Memory structures, Artificial neural network, Computer science, Pattern recognition, Content-addressable memory, Chip, One-shot learning, Electronic, Optical and Magnetic Materials, Resistive random-access memory, Feature (machine learning), Artificial intelligence, Electrical and Electronic Engineering, Quantization (image processing) - Abstract
Learning from a few examples (one/few-shot learning) on the fly is a key challenge for on-device machine intelligence. We present the first chip-level demonstration of one-shot learning with Stanford Associative memory for Programmable, Integrated Edge iNtelligence via life-long learning and Search (SAPIENS), a resistive random access memory (RRAM)-based non-volatile associative memory (AM) chip that serves as the backend for memory-augmented neural networks (MANNs). The 64-kb fully integrated RRAM-CMOS AM chip performs long-term feature embedding and retrieval, demonstrated on a 32-way one-shot learning task on the Omniglot dataset. Using only one example per class for 32 unseen classes during on-chip learning, SAPIENS achieves 79% measured inference accuracy on Omniglot, comparable to edge software model accuracy using five-level quantization (82%). It achieves an energy efficiency of 118 GOPS/W at 200 MHz for in-memory L1 distance computation and prediction. Multi-bank measurements on the same chip show that increasing the capacity from three banks (24 kb) to eight banks (64 kb) improves the chip accuracy from 73.5% to 79%, while minimizing the accuracy excursion due to bank-to-bank variability.
- Published
- 2021
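A software analogue of the associative-memory inference described above, assuming (as the abstract states) one stored embedding per class and nearest-neighbor retrieval by in-memory L1 distance; the data and helper names are hypothetical:

```python
import numpy as np

def am_learn(features):
    """One-shot 'learning': store one feature embedding per class
    in the associative memory (here, rows of a matrix)."""
    return np.asarray(features)

def am_infer(memory, query):
    """Inference: an L1 (Manhattan) distance search returns the
    class whose stored embedding is closest to the query."""
    dists = np.sum(np.abs(memory - query), axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(5)
protos = rng.integers(0, 8, size=(32, 64))   # 32 classes, one example each
memory = am_learn(protos)

# A noisy view of class 7 should still retrieve class 7.
query = protos[7] + rng.integers(-1, 2, size=64)
assert am_infer(memory, query) == 7
```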
6. A Local Computing Cell and 6T SRAM-Based Computing-in-Memory Macro With 8-b MAC Operation for Edge AI Chips
- Author
-
Yung-Ning Tu, William Shih, Jing-Hong Wang, Wei-Chiang Shih, Xin Si, Yajuan He, Yen-Chi Chou, Nan-Chun Lien, Yen-Lin Chung, Meng-Fan Chang, Qiang Li, Jian-Wei Su, Ta-Wei Liu, Ssu-Yen Wu, Pei-Jung Lu, Ren-Shuo Liu, Chih-Cheng Hsieh, Ruhui Liu, Chung-Chuan Lo, Kea-Tiong Tang, and Wei-Hsing Huang
- Subjects
Process variation, Computer science, Sense amplifier, Header, Static random-access memory, Electrical and Electronic Engineering, Macro, Computer hardware, Energy efficiency - Abstract
This article presents a computing-in-memory (CIM) structure aimed at improving the energy efficiency of edge devices running multi-bit multiply-and-accumulate (MAC) operations. The proposed scheme includes a 6T SRAM-based CIM (SRAM-CIM) macro featuring: 1) weight-bitwise MAC (WbwMAC) operations to expand the sensing margin and improve the readout accuracy of high-precision MAC operations; 2) a compact 6T local computing cell that performs multiplication with suppressed sensitivity to process variation; 3) an algorithm-adaptive low MAC-aware readout scheme to improve energy efficiency; 4) a bitline header selection scheme to enlarge the signal margin; and 5) a small-offset margin-enhanced sense amplifier for read operations robust against process variation. A fabricated 28-nm 64-kb SRAM-CIM macro achieved access times of 4.1–8.4 ns with energy efficiency of 11.5–68.4 TOPS/W while performing MAC operations with 4- or 8-b input and weight precision.
- Published
- 2021
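The weight-bitwise idea, computing a multibit MAC as place-value-weighted binary partial sums, can be sketched as follows (the unsigned-weight assumption and function name are illustrative, not the macro's exact scheme):

```python
import numpy as np

def bitwise_mac(x, w, w_bits=4):
    """Weight-bitwise MAC: slice unsigned multi-bit weights into
    binary bit-planes, accumulate one binary MAC per plane, then
    combine partial sums by their place values (powers of two)."""
    total = 0
    for b in range(w_bits):
        plane = (w >> b) & 1          # binary weight plane for bit b
        total += (x @ plane) << b     # weight the partial MAC by 2**b
    return total

rng = np.random.default_rng(2)
x = rng.integers(0, 16, size=64)     # 4-bit inputs
w = rng.integers(0, 16, size=64)     # 4-bit unsigned weights
assert bitwise_mac(x, w) == x @ w    # matches the direct dot product
```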
7. A quasi-monolithic phase-field description for mixed-mode fracture using predictor–corrector mesh adaptivity
- Author
-
Yan Jin, Thomas Wick, and Meng Fan
- Subjects
Soundness, Predictor–corrector method, Finite elements, General Engineering, Predictor–corrector mesh refinement, Mixed-mode fracture, Computer Science Applications, Robustness (computer science), Modeling and Simulation, Phase-field fracture, Uniaxial compression test, Fracture (geology), Applied mathematics, Software - Abstract
In this work, we develop a mixed-mode phase-field fracture model employing a parallel-adaptive quasi-monolithic framework. In nature, the failure of rocks and rock-like materials is usually accompanied by the propagation of mixed-mode fractures. To address this aspect, some recent studies have incorporated mixed-mode fracture propagation criteria into classical phase-field fracture models, and new energy splitting methods were proposed to split the total crack driving energy into mode-I and mode-II parts. As an extension, in this work a splitting method for masonry-like materials is modified and incorporated into the mixed-mode phase-field fracture model. A robust, accurate, and efficient parallel-adaptive quasi-monolithic framework serves as the basis for the implementation of our new model. Three numerical tests are carried out, and the results of the new model are compared to those of existing models, demonstrating the numerical robustness and physical soundness of the new model. In total, six models are computationally analyzed and compared.
- Published
- 2021
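For context, classical phase-field fracture models of the kind extended here minimize an energy functional of roughly the following form, where $\varphi = 1$ denotes intact material, $\kappa$ is a small regularization, $\psi^{\pm}$ is a tensile/compressive split of the strain energy, $G_c$ is the critical energy release rate, and $\ell$ the phase-field length scale; mixed-mode variants further split the crack driving energy into mode-I and mode-II parts. This generic form is background, not the paper's exact model:

```latex
E_\ell(u,\varphi) =
  \int_\Omega \left[(1-\kappa)\varphi^2 + \kappa\right]
    \psi^{+}\!\big(\varepsilon(u)\big) + \psi^{-}\!\big(\varepsilon(u)\big)\,\mathrm{d}x
  + G_c \int_\Omega \frac{1}{2\ell}(1-\varphi)^2
    + \frac{\ell}{2}\,\lvert\nabla\varphi\rvert^2\,\mathrm{d}x
```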
8. STICKER-T: An Energy-Efficient Neural Network Processor Using Block-Circulant Algorithm and Unified Frequency-Domain Acceleration
- Author
-
Jinshan Yue, Huazhong Yang, Zhe Yuan, Yung-Ning Tu, Xueqing Li, Ao Ren, Wenyu Sun, Meng-Fan Chang, Yi-Ju Chen, Ruoyang Liu, Yanzhi Wang, and Yongpan Liu
- Subjects
Artificial neural network, CMOS, Computer science, Frequency domain, Fast Fourier transform, Overhead (computing), Electrical and Electronic Engineering, Algorithm, Circulant matrix, Energy efficiency, Block (data storage) - Abstract
The emerging edge intelligence requires low-cost, energy-efficient neural network (NN) processors. Supporting various types of edge NN models leads to extra circuit overhead, and designing a unified NN processor with high energy/area efficiency is challenging. This work presents a frequency-domain-accelerated unified NN processor, named STICKER-T, which combines algorithm-, architecture-, and circuit-level optimization to achieve high energy/area efficiency. By utilizing the block-circulant NN (CirCNN) algorithm, this work supports frequency-domain acceleration and a unified workflow for convolutional, fully connected, and recurrent NNs (CNN/FC/RNN). Three key innovations are proposed. First, a block-circulant-accelerated chip architecture is implemented to support the unified CNN/FC/RNN workflow. Second, a multi-bit 8-128-point global-parallel local-bit-serial fast Fourier transform (FFT) module is designed for efficient high-throughput FFT/inverse FFT (IFFT) operation. Third, by utilizing a 6T hierarchical-bitline-switching transpose-SRAM (HBST-TRAM), 2-D data reuse is enabled in the proposed multi-bit frequency-domain multiply-accumulate (MAC) array. STICKER-T was fabricated in a 65-nm CMOS technology. It operates at 0.54–1.15 V and 25–200 MHz with 13.3–339-mW power consumption. The peak energy efficiency reaches 140.3 TOPS/W, with 8.1× higher area efficiency and 4.2× higher energy efficiency at 4-bit precision compared with the state-of-the-art reconfigurable NN processor.
- Published
- 2021
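The block-circulant (CirCNN) acceleration rests on a standard fact: a circulant matrix-vector product reduces to elementwise multiplication in the frequency domain, cutting the cost from O(n²) to O(n log n). A minimal NumPy sketch of that identity (independent of the chip's fixed-point details):

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply a circulant matrix (defined by its first column c)
    with vector x via FFT: C @ x = IFFT(FFT(c) * FFT(x))."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

rng = np.random.default_rng(3)
n = 8
c = rng.normal(size=n)
x = rng.normal(size=n)

# Build the dense circulant matrix explicitly for reference:
# column k of C is c rolled down by k positions.
C = np.stack([np.roll(c, k) for k in range(n)], axis=1)
assert np.allclose(C @ x, circulant_matvec_fft(c, x))
```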
9. CiM3D: Comparator-in-Memory Designs Using Monolithic 3-D Technology for Accelerating Data-Intensive Applications
- Author
-
Wen-Kuan Yeh, Je-Min Hung, Cheng-Xin Xue, Hariram Thirucherai Govindarajan, Akshay Krishna Ramanathan, Srivatsa Srinivasa Rangachar, Sheng-Po Huang, Chun-Ying Lee, Meng-Fan Chang, Chang-Hong Shen, Fu-Kuo Hsueh, Jia-Min Shieh, Vijaykrishnan Narayanan, John Sampson, and Mon-Shu Ho
- Subjects
Computer engineering, Comparator, Computer science, 3-D SRAM, Sorting, Monolithic (sequential) 3-D integrated circuit (M3D-IC), Parallel computing, Electronic, Optical and Magnetic Materials, Hardware and Architecture, Sparse matrix multiplication, Multiplication, Computing-in-memory, Static random-access memory, Electrical and Electronic Engineering, Macro, Massively parallel, Sparse matrix - Abstract
The compare operation is widely used in many applications, from fundamental sorting to primitive operations in database and AI systems. We present SRAM-based 3-D-CAM circuit designs using a monolithic 3-D (M3D) integration process for realizing beyond-Boolean in-memory compare operations without any area overhead. We also fabricated a processing-in-memory (PiM) macro with the same 3-D-CAM circuit using M3D for performing massively parallel compare operations used in database, machine learning, and scientific applications. We show various system designs with the 3-D-CAM supporting operations such as data filtering, sorting, and sparse matrix-matrix multiplication (SpGEMM). Our systems exhibit up to 272×, 200×, and 226× speedups and 151×, 37×, and 156× energy savings compared to systems using near-memory compute for the data filtering, sorting, and SpGEMM applications, respectively.
- Published
- 2021
10. Challenges and Trends of SRAM-Based Computing-In-Memory for AI Edge Devices
- Author
-
Meng-Fan Chang, Je-Min Hung, Chuan-Jia Jhang, Cheng-Xin Xue, and Fu-Chun Chang
- Subjects
Memory structures, Edge device, Computer science, Bottleneck, Computing architecture, Computer architecture, Memory wall, Static random-access memory, Electrical and Electronic Engineering, Macro, Von Neumann architecture, Energy efficiency - Abstract
When applied to artificial intelligence edge devices, the conventional von Neumann computing architecture imposes numerous challenges (e.g., in improving energy efficiency) due to the memory-wall bottleneck: the frequent movement of data between memory and the processing elements (PEs). Computing-in-memory (CIM) is a promising approach to breaking through this memory-wall bottleneck. SRAM cells provide unlimited endurance and compatibility with state-of-the-art logic processes. This paper outlines the background, trends, and challenges involved in the further development of SRAM-CIM macros. It also reviews recent silicon-verified SRAM-CIM macros designed for logic and multiplication-accumulation (MAC) operations.
- Published
- 2021
11. A Highly Reliable RRAM Physically Unclonable Function Utilizing Post-Process Randomness Source
- Author
-
He Qian, Bin Gao, Wei-En Lin, Shimeng Yu, Yachuan Pang, Jianshi Tang, Meng-Fan Chang, Huaqiang Wu, Bohan Lin, Dong Wu, Ting-Wei Chang, and Xiaoyu Sun
- Subjects
Hardware security, Computer science, Sense amplifier, Reliability, Physical unclonable function, Reconfigurability, Chip, Resistive random-access memory, Bit error rate, Electronic engineering, Electrical and Electronic Engineering - Abstract
Physically unclonable functions (PUFs) have been increasingly used as a promising primitive for hardware security, with a wide range of applications in the Internet of Things (IoT). In recent years, novel PUF techniques based on the resistive switching mechanism in various emerging nonvolatile memories have demonstrated superior performance in reliability and integration density. In this work, a resistive random access memory (RRAM)-based PUF chip with 8-kb capacity is developed. Two operation modes, namely a differential mode and a median mode, are embedded on chip. To implement these modes, a current-sampling-based sense amplifier is designed to distinguish the current values of the PUF cells and the reference cell. In addition, a split-resistance scheme is proposed to significantly enhance the PUF's reliability. The experimental results show that the differential PUF exhibits excellent performance, with a native bit error rate (N-BER) below 6×10⁻⁶ and an inter-Hamming distance (inter-HD) of 49.99%. Meanwhile, the reconfigurability of PUF challenge-response pairs (CRPs) is demonstrated with 49.77% and 47.29% reconfigure-Hamming distance (reconfigure-HD) in the median mode and the differential mode, respectively.
- Published
- 2021
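The inter-Hamming-distance uniqueness metric quoted above (ideally 50%) is conventionally the mean pairwise fractional Hamming distance across different devices' responses; a sketch computing it over hypothetical random responses:

```python
import numpy as np

def inter_hd(responses):
    """Mean pairwise fractional Hamming distance (in %) between PUF
    responses from different devices; ideal uniqueness is 50%."""
    n = responses.shape[0]
    dists = [np.mean(responses[i] != responses[j])
             for i in range(n) for j in range(i + 1, n)]
    return 100.0 * np.mean(dists)

rng = np.random.default_rng(4)
# Ten hypothetical 256-bit device responses, modeled as i.i.d. uniform bits.
resp = rng.integers(0, 2, size=(10, 256))
hd = inter_hd(resp)
assert 40.0 < hd < 60.0   # random responses cluster near the ideal 50%
```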
12. Efficient and Robust Nonvolatile Computing-In-Memory Based on Voltage Division in 2T2R RRAM With Input-Dependent Sensing Control
- Author
-
Wang Ye, Linfang Wang, Xin Si, Feng Zhang, Dashan Shang, Jing Liu, Chunmeng Dou, Xiaoxin Xu, Yongpan Liu, Meng-Fan Chang, Qi Liu, and Jianfeng Gao
- Subjects
Memory management, Artificial neural network, Computer science, Voltage divider, Electronic engineering, Electrical and Electronic Engineering, Signal, Power (physics), Resistive random-access memory - Abstract
Resistive memory (RRAM) provides an ideal platform for developing embedded non-volatile computing-in-memory (nvCIM). However, it faces several critical challenges, including device non-idealities, large DC currents, and small signal margins. To address these issues, we propose a voltage-division (VD)-based computing approach and its circuit implementation in two-transistor-two-resistor (2T2R) RRAM cell arrays, which can realize energy-efficient, sign-aware, and robust deep neural network (DNN) processing. A readout technique, namely the input-dependent sensing control (IDSC) scheme, is also introduced for power saving. On this basis, a 400-kb VD-based RRAM nvCIM is silicon-verified. It achieves a 2.54× power reduction compared with designs relying on the conventional weighted-current summation (WCS) mechanism, a peak energy efficiency of 42.6 TOPS/W, and a minimum latency of 15.98 ns.
- Published
- 2021
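The voltage-division principle is ordinary resistive-divider arithmetic: the two resistors of a 2T2R cell set a tap voltage rather than summing currents into a bitline. A sketch with hypothetical resistance and voltage values (not taken from the paper):

```python
def vd_readout(v_read, r_top, r_bot):
    """Voltage-division sensing in a 2T2R cell: the two resistive
    states form a divider, and the tap voltage encodes the stored
    value without the large DC currents of current-summation readout."""
    return v_read * r_bot / (r_top + r_bot)

# Hypothetical resistance states (ohms): low-resistance vs high-resistance.
LRS, HRS = 10e3, 100e3
v = 0.9
v_one = vd_readout(v, HRS, LRS)   # one polarity: tap pulled low
v_zero = vd_readout(v, LRS, HRS)  # opposite polarity: tap pulled high
assert v_one < v / 2 < v_zero     # the two states straddle the midpoint
```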
13. A 0.5-V Real-Time Computational CMOS Image Sensor With Programmable Kernel for Feature Extraction
- Author
-
Chih-Cheng Hsieh, Chung-Chuan Lo, Meng-Fan Chang, Yi-Ren Chen, Tzu-Hsiang Hsu, Ren-Shuo Liu, and Kea-Tiong Tang
- Subjects
Pixel, Computer science, Feature extraction, Frame rate, Convolutional neural network, Kernel (image processing), Gesture recognition, Electrical and Electronic Engineering, Image sensor, Computer hardware, Pulse-width modulation - Abstract
With the growing demand for artificial intelligence (AI) Internet-of-Things (IoT) devices, smart vision sensors with energy-efficient computing capability are required. This article presents a low-power, low-voltage, dual-mode 0.5-V computational CMOS image sensor (C2IS) with array-parallel computing capability for feature extraction using convolution. In the feature extraction mode, by applying the pulsewidth modulation (PWM) pixel and a switch-current integration (SCI) circuit, in-sensor eight-directional matrix-parallel multiply-accumulate (MAC) operation is realized. Furthermore, analog-domain convolution-on-readout (COR) operation, a programmable 3×3 kernel with ±3-bit weights, and a tunable-resolution (1–8-bit) column-parallel analog-to-digital converter (ADC) are implemented to achieve real-time feature extraction without using additional memory or sacrificing frame rate. In the image capturing mode, the sensor provides linear-response 8-bit raw image data. The C2IS prototype has been fabricated in the TSMC 0.18-µm standard process technology and verified to deliver raw and feature images at 480 frames/s with a power consumption of 77/117 µW and a resultant FoM of 9.8/14.8 pJ/pixel/frame, respectively. The prototype sensor is used as a real-time edge-feature-detection front-end camera and, accompanied by a simplified convolutional neural network (CNN) architecture, demonstrates hand gesture recognition. The prototype system achieves more than 95% validation accuracy.
- Published
- 2021
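The in-sensor feature extraction amounts to a 3×3 convolution with small signed integer weights; a reference sketch (the Sobel-like kernel and ramp image are generic examples within the ±3-bit weight range, not the paper's):

```python
import numpy as np

def conv3x3(img, kernel):
    """Valid-region 3x3 convolution (correlation form) with small
    integer weights, as an in-sensor feature kernel would apply."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=int)
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = int(np.sum(img[i:i+3, j:j+3] * kernel))
    return out

# Horizontal Sobel-like kernel; weights fit a signed 3-bit range.
k = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]])
img = np.tile(np.arange(6), (6, 1))   # simple horizontal intensity ramp
edges = conv3x3(img, k)
assert np.all(edges == 8)             # constant gradient -> constant response
```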
14. Understanding the scale-up of fermentation processes from the viewpoint of the flow field in bioreactors and the physiological response of strains
- Author
-
Guan Wang, Yingping Zhuang, Jianye Xia, Chen Min, Meng Fan, and Zeyu Wang
- Subjects
Environmental Engineering, Computer science, General Chemical Engineering, Food and beverages, Industrial fermentation, General Chemistry, Biochemistry, Flow field, Scale-up, Bioreactor, Fermentation, Biochemical engineering - Abstract
The production capability of a fermentation process is predominantly determined by the individual strains, which are in turn affected by interactions between the scale-dependent flow field developed within bioreactors and the physiological response of these strains. Interpreting these complicated interactions is key to better understanding the scale-up of fermentation processes. We review these two aspects and address progress in strategies for scaling up fermentation processes. A perspective on how to incorporate multi-omics big data into the scale-up strategy is presented to improve the design and operation of industrial fermentation processes.
- Published
- 2021
15. In-memory Learning with Analog Resistive Switching Memory: A Review and Perspective
- Author
-
An Chen, Xiaobo Sharon Hu, Jan Van der Spiegel, Huaqiang Wu, Yue Xi, Meng-Fan Chang, Bin Gao, He Qian, and Jianshi Tang
- Subjects
Artificial neural network, Computer science, Circuit design, Analog computer, Electronic engineering, Figure of merit, Linearity, Semiconductor memory, Energy consumption, Electrical and Electronic Engineering, Electronic circuit - Abstract
In this article, we review the existing analog resistive switching memory (RSM) devices and their hardware technologies for in-memory learning, as well as their challenges and prospects. Since the characteristics of the devices are different for in-memory learning and digital memory applications, it is important to have an in-depth understanding across different layers from devices and circuits to architectures and algorithms. First, based on a top-down view from architecture to devices for analog computing, we define the main figures of merit (FoMs) and perform a comprehensive analysis of analog RSM hardware including the basic device characteristics, hardware algorithms, and the corresponding mapping methods for device arrays, as well as the architecture and circuit design considerations for neural networks. Second, we classify the FoMs of analog RSM devices into two levels. Level 1 FoMs are essential for achieving the functionality of a system (e.g., linearity, symmetry, dynamic range, level numbers, fluctuation, variability, and yield). Level 2 FoMs are those that make a functional system more efficient and reliable (e.g., area, operational voltage, energy consumption, speed, endurance, retention, and compatibility with back-end-of-line processing). By constructing a device-to-application simulation framework, we perform an in-depth analysis of how these FoMs influence in-memory learning and give a target list of the device requirements. Lastly, we evaluate the main FoMs of most existing devices with analog characteristics and review optimization methods from programming schemes to materials and device structures. The key challenges and prospects from the device to system level for analog RSM devices are discussed.
- Published
- 2021
16. Challenges and Trends of Nonvolatile In-Memory-Computation Circuits for AI Edge Devices
- Author
-
Je-Min Hung, Chuan-Jia Jhang, Yen-Cheng Chiu, Ping-Chun Wu, and Meng-Fan Chang
- Subjects
Non-volatile memory, Memory management, Computer architecture, Edge device, Computer science, Latency, Energy consumption, Macro, Quantization (image processing), Von Neumann architecture - Abstract
Nonvolatile memory (NVM)-based computing-in-memory (nvCIM) is a promising candidate for artificial intelligence (AI) edge devices to overcome the latency and energy consumption imposed by the movement of data between memory and processors under the von Neumann architecture. This paper explores the background and basic approaches to nvCIM implementation, including input methodologies, weight formation and placement, and readout and quantization methods. This paper outlines the major challenges in the further development of nvCIM macros and reviews trends in recent silicon-verified devices.
- Published
- 2021
17. A CMOS-integrated compute-in-memory macro based on resistive random-access memory for AI edge devices
- Author
-
Ta-Wei Liu, Meng-Fan Chang, Yu-Der Chih, Je-Syu Liu, Chin-Yi Su, Ting-Wei Chang, Shih-Ying Wei, Tsung-Yuan Huang, Cheng-Xin Xue, Wei-Chen Wei, Je-Min Hung, Chun-Ying Lee, Tai-Hsing Wen, Mon-Shu Ho, Yi-Ren Chen, Yen-Kai Chen, Kea-Tiong Tang, Yun-Chen Lo, Jing-Hong Wang, Sheng-Po Huang, Chou Chung-Cheng, Shih-Hsih Teng, Chung-Chuan Lo, Chih-Cheng Hsieh, Ren-Shuo Liu, Hui-Yao Kao, Yen-Cheng Chiu, and Tzu-Hsiang Hsu
- Subjects
Edge device, Computer science, Process (computing), Energy consumption, Electronic, Optical and Magnetic Materials, Resistive random-access memory, CMOS, Electrical and Electronic Engineering, Macro, Instrumentation, Computer hardware, Electronic circuit - Abstract
The development of small, energy-efficient artificial intelligence edge devices is limited in conventional computing architectures by the need to transfer data between the processor and memory. Non-volatile compute-in-memory (nvCIM) architectures have the potential to overcome such issues, but the development of high-bit-precision configurations required for dot-product operations remains challenging. In particular, input–output parallelism and cell-area limitations, as well as signal margin degradation, computing latency in multibit analogue readout operations and manufacturing challenges, still need to be addressed. Here we report a 2 Mb nvCIM macro (which combines memory cells and related peripheral circuitry) that is based on single-level cell resistive random-access memory devices and is fabricated in a 22 nm complementary metal–oxide–semiconductor foundry process. Compared with previous nvCIM schemes, our macro can perform multibit dot-product operations with increased input–output parallelism, reduced cell-array area, improved accuracy, and reduced computing latency and energy consumption. The macro can, in particular, achieve latencies between 9.2 and 18.3 ns, and energy efficiencies between 146.21 and 36.61 tera-operations per second per watt, for binary and multibit input–weight–output configurations, respectively. Commercial complementary metal–oxide–semiconductor and resistive random-access memory technologies can be used to create multibit compute-in-memory circuits capable of fast and energy-efficient inference for use in small artificial intelligence edge devices.
- Published
- 2020
18. A 4-Kb 1-to-8-bit Configurable 6T SRAM-Based Computation-in-Memory Unit-Macro for CNN-Based AI Edge Processors
- Author
-
Yen-Cheng Chiu, Ren-Shuo Liu, Yung-Ning Tu, Jing-Hong Wang, Jia-Jing Chen, Meng-Fan Chang, Kea-Tiong Tang, Shyh-Shyuan Sheu, Chih-Cheng Hsieh, Ruhui Liu, Sih-Han Li, Wei-Hsing Huang, Jian-Wei Su, Chih-I Wu, Wei-Chen Wei, Je-Min Hung, Zhixiao Zhang, and Xin Si
- Subjects
Random access memory, Computer science, Computation, 8-bit, Binary number, Static random-access memory, Electrical and Electronic Engineering, Macro, Computer hardware - Abstract
Previous SRAM-based computing-in-memory (SRAM-CIM) macros suffer from small read margins for high-precision operations, large cell-array area overhead, and limited compatibility with many input and weight configurations. This work presents a 1-to-8-bit configurable SRAM CIM unit-macro using: 1) a hybrid structure combining 6T-SRAM-based in-memory binary product-sum (PS) operations with digital near-memory-computing multibit PS accumulation to increase read accuracy and reduce area overhead; 2) column-based place-value-grouped weight mapping and a serial-bit input (SBIN) mapping scheme to facilitate reconfiguration and increase array efficiency under various input and weight configurations; 3) a self-reference multilevel reader (SRMLR) to reduce readout energy and achieve a sensing margin twice that of the mid-point reference scheme; and 4) an input-aware bitline voltage compensation scheme to ensure successful read operations across various input-weight patterns. A 4-Kb configurable 6T-SRAM CIM unit-macro was fabricated using a 55-nm CMOS process with foundry 6T-SRAM cells. The resulting macro achieved access times of 3.5 ns per cycle (pipelined) and energy efficiency of 0.6–40.2 TOPS/W under binary to 8-b input/8-b weight precision.
- Published
- 2020
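The serial-bit input and place-value-grouped weight mapping in entry 18 can be illustrated numerically. The sketch below is a minimal functional model under assumed unsigned inputs and weights (function and variable names are ours, not the paper's): each input bit-plane is applied serially, each weight bit column yields a binary product-sum, and near-memory digital logic shifts and accumulates the partial sums.

```python
import numpy as np

def bit_serial_mac(inputs, weights, in_bits=8, w_bits=8):
    """Compute sum(inputs * weights) the way a bit-serial CIM macro would:
    each (input-bit, weight-bit) pair gives a binary product-sum that is
    shifted by its combined place value and accumulated digitally."""
    acc = 0
    for i in range(in_bits):                 # serial input bit-planes (LSB first)
        in_bit = (inputs >> i) & 1
        for j in range(w_bits):              # place-value-grouped weight columns
            w_bit = (weights >> j) & 1
            ps = int(np.sum(in_bit * w_bit)) # binary product-sum (analog in-macro)
            acc += ps << (i + j)             # near-memory shift-and-add
    return acc

x = np.array([3, 5, 7], dtype=np.int64)
w = np.array([2, 4, 6], dtype=np.int64)
assert bit_serial_mac(x, w, 4, 4) == int(np.dot(x, w))  # 3*2 + 5*4 + 7*6 = 68
```

The model makes plain why the scheme reconfigures easily: changing input or weight precision only changes the number of bit-plane passes, not the stored array.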
19. A Relaxed Quantization Training Method for Hardware Limitations of Resistive Random Access Memory (ReRAM)-Based Computing-in-Memory
- Author
-
Chih-Cheng Lu, Meng-Fan Chang, Cheng-Xin Xue, Wei-Chen Wei, Jye-Luen Lee, Hao-Wen Kuo, Syuan-Hao Sie, Kea-Tiong Tang, Chuan-Jia Jhang, and Yi-Ren Chen
- Subjects
lcsh:Computer engineering. Computer hardware ,Computer science ,Computation ,lcsh:TK7885-7895 ,02 engineering and technology ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Macro ,computing-in-memory (CIM) ,business.industry ,Quantization (signal processing) ,020208 electrical & electronic engineering ,Compression ,Cognitive neuroscience of visual object recognition ,deep learning ,resistive random access memory (ReRAM) ,Training methods ,020202 computer hardware & architecture ,Electronic, Optical and Magnetic Materials ,Resistive random-access memory ,Neuromorphic engineering ,Hardware and Architecture ,quantization ,business ,MNIST database ,Computer hardware - Abstract
Nonvolatile computing-in-memory (nvCIM) exhibits high potential for neuromorphic computing involving massive parallel computations and for achieving high energy efficiency. nvCIM is especially suitable for deep neural networks, which must perform large numbers of matrix-vector multiplications. However, a comprehensive quantization algorithm has yet to be developed that overcomes the hardware limitations of resistive random access memory (ReRAM)-based nvCIM, such as the number of I/Os, word lines (WLs), and ADC outputs. In this article, we propose a quantization training method for compressing deep models. The method comprises three steps: input and weight quantization, ReRAM convolution (ReConv), and ADC quantization. ADC quantization addresses the error-sampling problem by using the Gumbel-softmax trick. Under a 4-bit ADC of nvCIM, the accuracy decreases by only 0.05% and 1.31% for MNIST and CIFAR-10, respectively, compared with the corresponding accuracies obtained under an ideal ADC. The experimental results indicate that the proposed method is effective in compensating for the hardware limitations of nvCIM macros.
- Published
- 2020
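The Gumbel-softmax trick used for ADC quantization in entry 19 can be sketched generically. This is an assumption-laden illustration, not the paper's training setup: the logit scale `beta`, temperature `tau`, and 4-bit level grid are our choices. Distances to ADC levels act as logits, Gumbel noise models stochastic sampling, and a low-temperature softmax gives a differentiable relaxation of the hard quantizer that gradients can flow through.

```python
import numpy as np

def gumbel_softmax_quantize(x, levels, beta=50.0, tau=0.1, seed=0):
    """Differentiable ADC quantization sketch: nearer levels get larger
    logits, Gumbel noise models sampling, and a low-temperature softmax
    relaxes the hard nearest-level choice."""
    rng = np.random.default_rng(seed)
    logits = -beta * np.abs(x[:, None] - levels[None, :])
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) samples
    y = np.exp((logits + g) / tau)
    y /= y.sum(axis=1, keepdims=True)                     # soft one-hot over ADC codes
    return y @ levels                                     # expected quantized output

levels = np.linspace(0.0, 1.0, 16)    # 4-bit ADC output levels
x = np.array([0.03, 0.48, 0.97])
xq = gumbel_softmax_quantize(x, levels)
assert np.all(np.abs(xq - x) < 0.2)   # soft samples land near the true values
```

As `tau` shrinks, the soft one-hot hardens toward picking a single ADC code, which is what makes the relaxation usable during quantization-aware training.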
20. Challenges and Trends in Developing Nonvolatile Memory-Enabled Computing Chips for Intelligent Edge Devices
- Author
-
Meng-Fan Chang, Xueqing Li, Je-Min Hung, and Juejian Wu
- Subjects
010302 applied physics ,Edge device ,Computer science ,Computation ,01 natural sciences ,Electronic, Optical and Magnetic Materials ,Non-volatile memory ,symbols.namesake ,Computer architecture ,Memory wall ,0103 physical sciences ,symbols ,Electrical and Electronic Engineering ,Latency (engineering) ,Macro ,Efficient energy use ,Von Neumann architecture - Abstract
Under the von Neumann computing architecture, the edge devices used for artificial intelligence (AI) and the Internet of Things (IoT) are limited in terms of latency and energy efficiency due to the movement of data between the memory and the processor. Nonvolatile memory-based computation-in-memory (nvCIM) is a promising candidate to overcome this issue, referred to as the memory wall. This article outlines the background and major challenges in the development of nvCIM macros as well as some of the recent progress in nvCIM for logic computation, pattern-matching computation, and multiply-and-accumulate (MAC) computation. We also summarize recent trends in nvCIM development at the end of each section.
- Published
- 2020
21. Embedded 1-Mb ReRAM-Based Computing-in- Memory Macro With Multibit Input and Weight for CNN-Based AI Edge Processors
- Author
-
Tung-Cheng Chang, Jing-Hong Wang, Je-Syu Liu, Chrong Jung Lin, Wei-En Lin, Ya-Chin King, Cheng-Xin Xue, Tsung-Yuan Huang, Chun-Ying Lee, Ren-Shuo Liu, Meng-Fan Chang, Wei-Hao Chen, Kea-Tiong Tang, Hui-Yao Kao, Wei-Yu Lin, Yen-Cheng Chiu, Jiafang Li, Ting-Wei Chang, Chih-Cheng Hsieh, and Wei-Chen Wei
- Subjects
Non-volatile memory ,business.industry ,Computer science ,Clamper ,Sense amplifier ,Circuit design ,Electrical and Electronic Engineering ,Macro ,business ,Computer hardware ,Resistive random-access memory - Abstract
Computing-in-memory (CIM) based on embedded nonvolatile memory is a promising candidate for energy-efficient multiply-and-accumulate (MAC) operations in artificial intelligence (AI) edge devices. However, circuit design for NVM-based CIM (nvCIM) imposes a number of challenges, including an area-latency-energy tradeoff for multibit MAC operations, pattern-dependent degradation in signal margin, and small read margin. To overcome these challenges, this article proposes the following: 1) a serial-input non-weighted product (SINWP) structure; 2) a down-scaling weighted current translator (DSWCT) and positive–negative current-subtractor (PN-ISUB); 3) a current-aware bitline clamper (CABLC) scheme; and 4) a triple-margin small-offset current-mode sense amplifier (TMCSA). A 55-nm 1-Mb ReRAM-CIM macro was fabricated to demonstrate the MAC operation of 2-b-input, 3-b-weight with 4-b-out. This nvCIM macro achieved $T_{\text {MAC}}= 14.6$ ns at 4-b-out with peak energy efficiency of 53.17 TOPS/W.
- Published
- 2020
22. A Twin-8T SRAM Computation-in-Memory Unit-Macro for Multibit CNN-Based AI Edge Processors
- Author
-
Wei-Hsing Huang, Ren-Shuo Liu, Yung-Ning Tu, Shimeng Yu, Yen-Cheng Chiu, Meng-Fan Chang, Qiang Li, Xiaoyu Sun, Jia-Jing Chen, Jing-Hong Wang, Rui Liu, Chih-Cheng Hsieh, Kea-Tiong Tang, Wei-Chen Wei, Xin Si, and Ssu-Yen Wu
- Subjects
Computer science ,business.industry ,Overhead (computing) ,Enhanced Data Rates for GSM Evolution ,Static random-access memory ,Electrical and Electronic Engineering ,business ,Chip ,Voltage reference ,Computer hardware - Abstract
Computation-in-memory (CIM) is a promising candidate to improve the energy efficiency of multiply-and-accumulate (MAC) operations of artificial intelligence (AI) chips. This work presents a static random access memory (SRAM) CIM unit-macro using: 1) compact-rule compatible twin-8T (T8T) cells for weighted CIM MAC operations to reduce area overhead and vulnerability to process variation; 2) an even–odd dual-channel (EODC) input mapping scheme to extend input bandwidth; 3) a two’s complement weight mapping (C2WM) scheme to enable MAC operations using positive and negative weights within a cell array in order to reduce area overhead and computational latency; and 4) a configurable global–local reference voltage generation (CGLRVG) scheme for kernels of various sizes and bit precision. A 64 $\times $ 60 b T8T unit-macro with 1-, 2-, 4-b inputs, 1-, 2-, 5-b weights, and up to 7-b MAC-value (MACV) outputs was fabricated as a test chip using a foundry 55-nm process. The proposed SRAM-CIM unit-macro achieved access times of 5 ns and energy efficiency of 37.5–45.36 TOPS/W under 5-b MACV output.
- Published
- 2020
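Entry 22's two's-complement weight mapping (C2WM) lets a cell array hold only non-negative bit columns while still computing signed MACs: the MSB column is simply weighted by $-2^{k-1}$ during digital accumulation. A minimal arithmetic sketch with illustrative names (not the macro's circuit implementation):

```python
import numpy as np

def c2wm_mac(inputs, weights, w_bits=5):
    """Two's-complement weight-mapping MAC: store each signed weight as
    w_bits unsigned bit columns; every column contributes a positive binary
    product-sum, and the sign is recovered by giving the MSB column the
    place value -2^(w_bits-1) in the digital accumulator."""
    w_u = np.asarray(weights) & ((1 << w_bits) - 1)  # two's-complement codes
    acc = 0
    for j in range(w_bits):
        col = (w_u >> j) & 1                         # one stored bit column
        ps = int(np.sum(inputs * col))               # unsigned product-sum
        place = -(1 << j) if j == w_bits - 1 else (1 << j)
        acc += ps * place
    return acc

x = np.array([1, 2, 3])
w = np.array([-4, 5, -1])
assert c2wm_mac(x, w) == int(np.dot(x, w))  # 1*-4 + 2*5 + 3*-1 = 3
```

Because only one column needs a negative place value, signed arithmetic costs no extra cell array, matching the area motivation given in the abstract.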
23. FCEP: A Fast Concolic Execution for Reaching Software Patches
- Author
-
Meng Fan
- Subjects
Software ,Programming language ,business.industry ,Computer science ,computer.software_genre ,business ,computer ,Concolic execution - Published
- 2021
24. Identification of the Ferroptosis-Related Long Non-Coding RNAs Signature to Improve the Prognosis Prediction in patients with NSCLC
- Author
-
Mingwei Chen, Hui Ren, Puyu Shi, Meng Fan, Meng Li, and Yanpeng Zhang
- Subjects
Prognosis prediction ,Text mining ,Computer science ,business.industry ,Ferroptosis ,In patient ,Identification (biology) ,Computational biology ,business ,Signature (logic) ,Coding (social sciences) - Abstract
Background: Non-small cell lung cancer (NSCLC) is the most prevalent type of lung carcinoma with an unfavorable prognosis. Ferroptosis, a novel iron-dependent programmed cell death, is involved in the development of multiple cancers. Of note, the prognostic value of ferroptosis-related lncRNAs in NSCLC remains uncertain. Methods: Gene expression profiles and clinical information of NSCLC were retrieved from the TCGA database. Ferroptosis-related genes (FRGs) were explored in the FerrDb database, and ferroptosis-related lncRNAs (FRGs-lncRNAs) were identified by correlation analysis and the LncTarD database. Next, the differentially expressed FRGs-lncRNAs were screened, and FRGs-lncRNAs associated with prognosis were explored by univariate Cox regression analysis and Kaplan-Meier survival analysis. Then, an FRGs-lncRNAs signature was constructed by the Lasso-penalized Cox model in the training cohort and verified by internal and external validation. Finally, the potential correlation between risk score, immune response, and chemotherapeutic sensitivity was further investigated. Results: 129 lncRNAs with a potential regulatory relationship with 59 differentially expressed FRGs were found in NSCLC, and 10 FRGs-lncRNAs associated with the prognosis of NSCLC were identified. Conclusion: A novel FRGs-lncRNAs signature was successfully constructed, which may contribute to improving the management strategies of NSCLC.
- Published
- 2021
25. CHIMERA: A 0.92 TOPS, 2.2 TOPS/W Edge AI Accelerator with 2 MByte On-Chip Foundry Resistive RAM for Efficient Training and Inference
- Author
-
Albert Gural, Kalhan Koul, Gregorio B. Lopes, Boris Murmann, Yu-Der Chih, Zainab F. Khan, Subhasish Mitra, Win-San Khwa, Guenole Lallement, Robert M. Radway, Massimo Giordano, Priyanka Raina, Meng-Fan Chang, Victor Turbiner, Timothy Liu, Rohan Doshi, John W. Kustin, and Kartik Prabhu
- Subjects
Very-large-scale integration ,Artificial neural network ,Computer science ,business.industry ,Enhanced Data Rates for GSM Evolution ,TOPS ,Macro ,business ,Chip ,Energy (signal processing) ,Computer hardware ,Resistive random-access memory - Abstract
CHIMERA is the first non-volatile deep neural network (DNN) chip for edge AI training and inference using foundry on-chip resistive RAM (RRAM) macros and no off-chip memory. CHIMERA achieves 0.92 TOPS peak performance and 2.2 TOPS/W. We scale inference to 6x larger DNNs by connecting 6 CHIMERAs with just 4% execution time and 5% energy costs, enabled by communication-sparse DNN mappings that exploit RRAM non-volatility through quick chip wakeup/shutdown (33 µs). We demonstrate the first incremental edge AI training which overcomes RRAM write energy, speed, and endurance challenges. Our training achieves the same accuracy as traditional algorithms with up to 283x fewer RRAM weight update steps and 340x better energy-delay product. We thus demonstrate 10 years of 20 samples/minute incremental edge AI training on CHIMERA.
- Published
- 2021
26. A 6.54-to-26.03 TOPS/W Computing-In-Memory RNN Processor using Input Similarity Optimization and Attention-based Context-breaking with Output Speculation
- Author
-
Zhixiao Zhang, Hao Sun, Ruiqi Guo, Ruhui Liu, Shouyi Yin, Leibo Liu, Hao Li, Shaojun Wei, Meng-Fan Chang, and Limei Tang
- Subjects
Very-large-scale integration ,Scheme (programming language) ,Speedup ,Computer science ,Pipeline (computing) ,Context (language use) ,Parallel computing ,Macro ,Chip ,Throughput (business) ,computer ,computer.programming_language - Abstract
This work presents a 65-nm RNN processor with computing-in-memory (CIM) macros. The main contributions include: 1) a similarity analyzer (SimAyz) that fully leverages the temporal stability of input sequences for a 1.52× performance speedup; 2) an attention-based context-breaking (AttenBrk) method with output speculation that reduces off-chip data accesses by up to 30.3%; 3) a double-buffering scheme for CIM macros to hide write latency and a pipelined processing element (PE) array to increase system throughput. Measured results show that the chip achieves 6.54-to-26.03 TOPS/W energy efficiency across various LSTM benchmarks.
- Published
- 2021
27. Introduction to the Special Issue on the 2019 IEEE International Solid-State Circuits Conference (ISSCC)
- Author
-
Tony Chan Carusone, Meng-Fan Chang, Mingoo Seok, and Hsie-Chia Chang
- Subjects
Digital electronics ,business.industry ,Computer science ,Solid-state ,Electrical and Electronic Engineering ,Telecommunications ,business ,Electronic circuit - Abstract
This Special Issue of the IEEE Journal of Solid-State Circuits is dedicated to a collection of the best articles selected from the 2019 IEEE International Solid-State Circuits Conference (ISSCC) that took place on February 17–21, 2019, in San Francisco, CA, USA. This Special Issue covers articles from the Wireline, Digital Circuits, Digital Architectures and Systems (DASs), and Memory Committees.
- Published
- 2020
28. A Dual-Split 6T SRAM-Based Computing-in-Memory Unit-Macro With Fully Parallel Product-Sum Operation for Binarized DNN Edge Processors
- Author
-
Meng-Fan Chang, Qiang Li, Xiaoyu Sun, Xin Si, Jiafang Li, Jia-Jing Chen, Hiroyuki Yamauchi, Rui Liu, Shimeng Yu, and Win-San Khwa
- Subjects
Offset (computer science) ,Artificial neural network ,Computer science ,business.industry ,020208 electrical & electronic engineering ,Transistor ,Binary number ,02 engineering and technology ,law.invention ,XNOR gate ,Hardware and Architecture ,law ,0202 electrical engineering, electronic engineering, information engineering ,Static random-access memory ,Electrical and Electronic Engineering ,business ,Computer hardware ,Access time ,Efficient energy use - Abstract
Computing-in-memory (CIM) is a promising approach to reduce the latency and improve the energy efficiency of deep neural network (DNN) artificial intelligence (AI) edge processors. However, SRAM-based CIM (SRAM-CIM) faces practical challenges in terms of area overhead, performance, energy efficiency, and yield against variations in data patterns and transistor performance. This paper employed a circuit-system co-design methodology to develop a SRAM-CIM unit-macro for a binary-based fully connected neural network (FCNN) layer of DNN AI edge processors. The proposed SRAM-CIM unit-macro supports two binarized neural network models: an XNOR neural network (XNORNN) and a modified binary neural network (MBNN). To achieve compact area, fast access time, robust operations, and high energy efficiency, our proposed SRAM-CIM uses a split-wordline compact-rule 6T SRAM and circuit techniques, including a dynamic input-aware reference generation (DIARG) scheme, an algorithm-dependent asymmetric control (ADAC) scheme, a write disturb-free (WDF) scheme, and a common-mode-insensitive small offset voltage-mode sensing amplifier (CMI-VSA). A fabricated 65-nm 4-Kb SRAM-CIM unit-macro achieved 2.4- and 2.3-ns product-sum access times for an FCNN layer using XNORNN and MBNN, respectively. The measured maximum energy efficiency reached 30.49 TOPS/W for XNORNN and 55.8 TOPS/W for the MBNN modes.
- Published
- 2019
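The XNOR product-sum underlying the XNORNN mode in entry 28 reduces a {-1,+1} dot product to an XNOR plus a popcount, which is what the in-memory bitlines evaluate. A small sketch of that identity (illustrative only, not the macro's sensing scheme):

```python
import numpy as np

def xnor_product_sum(x_bits, w_bits):
    """Binary product-sum for an XNOR network: with activations and weights
    in {-1,+1} encoded as {0,1}, each product is XNOR(x, w), and the signed
    dot product is 2*popcount(XNOR) - N."""
    xnor = ~(x_bits ^ w_bits) & 1
    return 2 * int(np.sum(xnor)) - len(x_bits)

x = np.array([1, 0, 1, 1])   # encodes [+1, -1, +1, +1]
w = np.array([1, 1, 0, 1])   # encodes [+1, +1, -1, +1]
signed = lambda b: 2 * b - 1
assert xnor_product_sum(x, w) == int(np.dot(signed(x), signed(w)))  # = 0
```

The identity explains why a single sense operation per bitline suffices: the array only needs to count matching bits.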
29. Recent Advances in Compute-in-Memory Support for SRAM Using Monolithic 3-D Integration
- Author
-
Zhixiao Zhang, Meng-Fan Chang, Srivatsa Srinivasa, Akshay Krishna Ramanathan, and Xin Si
- Subjects
Random access memory ,Moore's law ,Computer science ,media_common.quotation_subject ,ComputerApplications_COMPUTERSINOTHERSYSTEMS ,02 engineering and technology ,020202 computer hardware & architecture ,symbols.namesake ,Memory management ,Computer architecture ,Hardware and Architecture ,Logic gate ,MOSFET ,Memory architecture ,0202 electrical engineering, electronic engineering, information engineering ,symbols ,Static random-access memory ,Electrical and Electronic Engineering ,Software ,Von Neumann architecture ,media_common - Abstract
Computing-in-memory (CiM) is a popular design alternative to overcome the von Neumann bottleneck and improve the performance of artificial intelligence computing applications. Monolithic three-dimensional (M3D) technology is a promising solution to extend Moore's law through the development of CiM for data-intensive applications. In this article, we first discuss the motivation and challenges associated with two-dimensional CiM designs, and then examine the possibilities presented by emerging M3D technologies. Finally, we review recent advances and trends in the implementation of CiM using M3D technology.
- Published
- 2019
30. A 28-nm 320-Kb TCAM Macro Using Split-Controlled Single-Load 14T Cell and Triple-Margin Voltage Sense Amplifier
- Author
-
Meng-Fan Chang, Hiroyuki Yamauchi, Cheng-Xin Xue, Yi-Ju Chen, Wei-Cheng Zhao, and Tzu-Hsien Yang
- Subjects
Computer science ,Amplifier ,020208 electrical & electronic engineering ,Transistor ,02 engineering and technology ,Sense (electronics) ,law.invention ,Power (physics) ,Non-volatile memory ,CMOS ,Margin (machine learning) ,law ,Logic gate ,0202 electrical engineering, electronic engineering, information engineering ,Electronic engineering ,Electrical and Electronic Engineering ,Macro ,Leakage (electronics) - Abstract
Ternary content-addressable memory (TCAM) is limited by large cell area, high search power, significant active-mode leakage current, and a tradeoff between search speed and signal margin on the match-line (ML). In this paper, we developed a split-controlled single-load 14T (SCSL-14T) TCAM cell and a triple-margin voltage sense amplifier (TM-VSA) to achieve the following: 1) compact cell area; 2) lower search delay and search energy; 3) reduced current leakage in standby and active modes; and 4) tolerance for a small sensing margin. A test chip with a 320-Kb 14T-TCAM macro was fabricated using a 28-nm CMOS logic process and a modified compact foundry six-transistor (6T) cell. The proposed macro achieved a search delay of only 710 ps and 0.422 fJ/bit/search.
- Published
- 2019
31. CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors
- Author
-
Kea-Tiong Tang, Chunmeng Dou, Wei-Hao Chen, K. C. Li, Meng-Fan Chang, Ren-Shuo Liu, Cheng-Xin Xue, Pin-Yi Li, Chorng-Jung Lin, Yen-Cheng Chiu, Jing-Hong Wang, Mon-Shu Ho, Ya-Chin King, Jianhua Yang, Wei-Chen Wei, Jian-Hao Huang, Chih-Cheng Hsieh, and Wei-Yu Lin
- Subjects
Computer science ,business.industry ,Memristor ,Electronic, Optical and Magnetic Materials ,Resistive random-access memory ,law.invention ,CMOS ,law ,visual_art ,Electronic component ,visual_art.visual_art_medium ,Electrical and Electronic Engineering ,Crossbar switch ,business ,Instrumentation ,Access time ,Computer hardware ,Efficient energy use ,Electronic circuit - Abstract
Non-volatile computing-in-memory (nvCIM) could improve the energy efficiency of edge devices for artificial intelligence applications. The basic functionality of nvCIM has recently been demonstrated using small-capacity memristor crossbar arrays combined with peripheral readout circuits made from discrete components. However, the advantages of the approach in terms of energy efficiency and operating speeds, as well as its robustness against device variability and sneak currents, have yet to be demonstrated experimentally. Here, we report a fully integrated memristive nvCIM structure that offers high energy efficiency and low latency for Boolean logic and multiply-and-accumulation (MAC) operations. We fabricate a 1 Mb resistive random-access memory (ReRAM) nvCIM macro that integrates a one-transistor–one-resistor ReRAM array with control and readout circuits on the same chip using an established 65 nm foundry complementary metal–oxide–semiconductor (CMOS) process. The approach offers an access time of 4.9 ns for three-input Boolean logic operations, a MAC computing time of 14.8 ns and an energy efficiency of 16.95 tera operations per second per watt. Applied to a deep neural network using a split binary-input ternary-weighted model, the system can achieve an inference accuracy of 98.8% on the MNIST dataset. A 1 Mb non-volatile computing-in-memory system, which integrates a resistive memory array with control and readout circuits using an established 65 nm foundry CMOS process, can offer high energy efficiency and low latency for Boolean logic and multiply-and-accumulation operations.
- Published
- 2019
32. ROBIN: Monolithic-3D SRAM for Enhanced Robustness with In-Memory Computation Support
- Author
-
Xueqing Li, Akshay Krishna Ramanathan, Wei-Hao Chen, Jack Sampson, Vijaykrishnan Narayanan, Swaroop Ghosh, Sumeet Kumar Gupta, Meng-Fan Chang, and Srivatsa Srinivasa
- Subjects
business.industry ,Computer science ,020208 electrical & electronic engineering ,Transistor ,NAND gate ,02 engineering and technology ,law.invention ,Addressability ,Data access ,XNOR gate ,Hardware and Architecture ,Robustness (computer science) ,law ,0202 electrical engineering, electronic engineering, information engineering ,Static random-access memory ,Electrical and Electronic Engineering ,business ,Computer hardware ,Efficient energy use - Abstract
We present novel 3D-SRAM cells that use a monolithic 3D integration technology to realize both cell robustness and in-memory Boolean logic computing capability. The proposed two-layer cell designs make use of additional transistors over the SRAM layer to enable assist techniques as well as provide logic functions (such as AND/NAND, OR/NOR, and XNOR/XOR) or enable content addressability without degrading cell density. Through analysis, we provide insights into the benefits provided by three memory assist and two logic modes, and we evaluate the energy efficiency of our proposed design. We show that the assist techniques improve SRAM read stability by $2.2\times $ and increase the write margin by 17.6% while staying within the SRAM footprint. By virtue of increased robustness, the cell enables seamless operation at lower supply voltages and thereby ensures energy efficiency. The energy-delay product reduces by $1.6\times $ over standard 6T SRAM with faster data access. When computing bulk in-memory operations, $6.5\times $ energy saving is achieved compared with computing outside the memory system.
- Published
- 2019
33. Stability analysis for stochastic complex-valued delayed networks with multiple nonlinear links and impulsive effects
- Author
-
Meng Fan, Huan Su, Pengfei Wang, and Zhenyao Sun
- Subjects
Work (thermodynamics) ,Computer science ,Applied Mathematics ,Mechanical Engineering ,Aerospace Engineering ,Complex valued ,Ocean Engineering ,01 natural sciences ,Stability (probability) ,Nonlinear system ,Dwell time ,Control and Systems Engineering ,Control theory ,0103 physical sciences ,Complex variables ,Electrical and Electronic Engineering ,010301 acoustics ,Differential inequalities - Abstract
This paper focuses on the stability of stochastic complex-valued delayed networks with multiple nonlinear links and impulsive effects. Different from previous work, the links among nodes are multiple and can be nonlinear. Besides, the features of complex variables, time-varying delays, and stochastic perturbations are taken into account. By utilizing the complex-valued version of Itô's formula, impulsive differential inequalities with multiple delays, and graph-theoretic techniques, several stability criteria are given without splitting the real and imaginary parts. These stability criteria show that if the impulsive dynamics is stable while the continuous dynamics is not, the dwell time of the impulsive sequences must be small. Conversely, if the continuous dynamics is stable while the impulsive dynamics is not, the dwell time of the impulsive sequences must be large. The theoretical results are then applied to a class of stochastic complex-valued coupled oscillators, and numerical examples are carried out for demonstration purposes.
- Published
- 2019
34. Predicting protein-ligand interactions based on bow-pharmacological space and Bayesian additive regression trees
- Author
-
Haishuai Wang, Hao Dai, Hien-haw Liow, Dong-Qing Wei, Huai-Meng Fan, Ching Chiek Koh, Nicholas Keone Lee, J.B. Brown, Luonan Chen, Li Li, and Daniel Reker
- Subjects
Models, Molecular ,0301 basic medicine ,Computer science ,Bayesian probability ,lcsh:Medicine ,Computational biology ,Ligands ,Molecular Docking Simulation ,Article ,Statistics, Nonparametric ,Machine Learning ,03 medical and health sciences ,chemistry.chemical_compound ,Bayes' theorem ,0302 clinical medicine ,Drug Development ,Chemogenomics ,Computational models ,Humans ,lcsh:Science ,chemistry.chemical_classification ,Binding Sites ,Multidisciplinary ,Drug discovery ,Ligand ,lcsh:R ,High-throughput screening ,Proteins ,Bayes Theorem ,Small molecule ,High-Throughput Screening Assays ,3. Good health ,030104 developmental biology ,Enzyme ,Drug screening ,Models, Chemical ,Drug development ,chemistry ,lcsh:Q ,Hydrophobic and Hydrophilic Interactions ,Algorithms ,030217 neurology & neurosurgery ,Protein Binding ,Protein ligand - Abstract
Identifying potential protein-ligand interactions is central to the field of drug discovery as it facilitates the identification of potential novel drug leads, contributes to advancement from hits to leads, predicts potential off-target explanations for side effects of approved drugs or candidates, as well as de-orphans phenotypic hits. For the rapid identification of protein-ligand interactions, we here present a novel chemogenomics algorithm for the prediction of protein-ligand interactions using a new machine learning approach and a novel class of descriptors. The algorithm applies Bayesian Additive Regression Trees (BART) on a newly proposed proteochemical space, termed the bow-pharmacological space. The space spans three distinctive sub-spaces that cover the protein space, the ligand space, and the interaction space. Thereby, the model extends the scope of classical target prediction or chemogenomic modelling that relies on one or two of these subspaces. Our model demonstrated excellent prediction power, reaching accuracies of up to 94.5–98.4% when evaluated on four human target datasets constituting enzymes, nuclear receptors, ion channels, and G-protein-coupled receptors. BART provided a reliable probabilistic description of the likelihood of interaction between proteins and ligands, which can be used in the prioritization of assays to be performed in both discovery and vigilance phases of small molecule development.
- Published
- 2019
35. Predicting seismic-based risk of lost circulation using machine learning
- Author
-
Zhen Nie, Yunhu Lu, Zhi Geng, Yunhong Ding, Mian Chen, Meng Fan, and Hanqing Wang
- Subjects
Lost circulation ,business.industry ,Computer science ,Supervised learning ,Seismic attribute ,Variance (accounting) ,Geotechnical Engineering and Engineering Geology ,Machine learning ,computer.software_genre ,Well drilling ,Field (computer science) ,Fuel Technology ,Software deployment ,Trajectory ,Artificial intelligence ,business ,computer - Abstract
Lost circulation during well drilling and completion wastes productive time and, in severe cases, even kills the well. Timely identification of lost circulation events and taking countermeasures has been the focus of related studies. However, a true prediction of lost circulation risk before drilling would be an active response to the challenge. In this paper, a technical solution is proposed to evaluate geological lost-circulation risk in the field using 3D seismic data attributes and machine learning techniques. First, the four seismic attributes most correlated with lost circulation incidents (variance, attenuation, sweetness, and RMS amplitude) are identified. Then a prediction model is built by supervised learning involving a majority-voting algorithm. The performance of the model is illustrated on six unseen drilled wells and shows the ability and potential to forecast lost circulation probability both along the well trajectory and in the region far away from the drilled wells. The prediction resolution in the lateral and vertical directions is about 25 m and 6 m (2 ms), respectively, which are distinct advantages over the traditional description of geological structures using seismic data. It shows that lost circulation risk can hardly be recognized by interpreting one specific seismic attribute, which is a common practice. Finally, the challenges in predicting lost circulation risk using seismic data are summarized. Overall, the study suggests that machine learning would be a practical solution to predict various construction risks that are related to seismic-based geological issues. Knowing the risks in advance, operators could avoid or at least minimize losses by optimizing well deployment in the field and taking preventive measures.
- Published
- 2019
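Entry 35's prediction model rests on supervised learning with a majority-voting algorithm over seismic attributes. The sketch below shows only the hard-voting step, on toy, hypothetical attribute thresholds; it is not the paper's trained model:

```python
import numpy as np

def majority_vote(predictions):
    """Hard majority vote: predictions is (n_classifiers, n_samples) of
    binary labels; a sample is flagged high-risk when more than half of
    the classifiers flag it."""
    votes = np.sum(predictions, axis=0)
    return (votes * 2 > predictions.shape[0]).astype(int)

# Toy stand-in: three per-attribute threshold "classifiers" over hypothetical
# normalized seismic attributes (variance, attenuation, sweetness).
attrs = np.array([[0.9, 0.2, 0.7],    # variance per sample
                  [0.8, 0.1, 0.6],    # attenuation per sample
                  [0.3, 0.2, 0.9]])   # sweetness per sample
preds = (attrs > 0.5).astype(int)     # each row = one weak classifier's labels
risk = majority_vote(preds)
assert list(risk) == [1, 0, 1]
```

In practice each voter would be a trained model rather than a fixed threshold; the voting step itself is unchanged.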
36. A Few-Step and Low-Cost Memristor Logic Based on MIG Logic for Frequent-Off Instant-On Circuits in IoT Applications
- Author
-
Dongyu Fan, Meng-Fan Chang, Wei Wang, Yiming Wang, Ming Liu, Ling Li, Yun Li, Qi Liu, Xinghua Wang, Feng Zhang, and Haihua Shen
- Subjects
Adder ,business.industry ,Computer science ,Binary multiplier ,02 engineering and technology ,Memristor ,021001 nanoscience & nanotechnology ,Chip ,020202 computer hardware & architecture ,law.invention ,Stateful firewall ,law ,0202 electrical engineering, electronic engineering, information engineering ,Central processing unit ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Electrical and Electronic Engineering ,0210 nano-technology ,Field-programmable gate array ,business ,Computer hardware ,Electronic circuit - Abstract
Nonvolatile logic implemented with memristor devices is a potential candidate for inherent logic-in-memory architectures. Majority-inverter graph (MIG) logic is a novel logic structure compared with conventional AND/OR/INV graph logic. In this brief, MIG logic constructed from memristors is proposed to implement stateful logic arithmetic. A design of full adders based on MIG logic is proposed, followed by a 4-bit Wallace-tree multiplier based on MIG logic to implement in-situ storage. A test chip controlled by an FPGA was designed to verify its feasibility. Compared with multipliers implemented in conventional logic, this method requires fewer steps and smaller area, making it suitable for frequent-off and instant-on circuits in IoT applications. The experimental results demonstrate its significance: the 4-bit multiplier needs only 18 processing steps with 51 memristor cells to complete a multiplication, whereas a conventional 4-bit binary multiplier requires 648 MOSFETs and 124 steps, and IMP-based 4-bit multiplier logic requires 112 CRS units and 221 steps.
- Published
- 2019
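Entry 36 builds adders and a Wallace-tree multiplier from majority-inverter graph (MIG) primitives. The functional sketch below verifies one standard MIG full-adder decomposition (carry as one majority gate, sum as two more plus inverters); it models logic values only, not memristor states or the paper's exact step sequence:

```python
def maj(a, b, c):
    """3-input majority gate, the primitive of MIG logic."""
    return (a & b) | (b & c) | (a & c)

def mig_full_adder(a, b, cin):
    """Full adder in MIG form: carry is one majority gate; the sum is
    expressed with two more majority gates plus inverters (1 - x)."""
    cout = maj(a, b, cin)
    s = maj(1 - cout, maj(a, b, 1 - cin), cin)
    return s, cout

def mig_add(x, y, bits=4):
    """Ripple-carry adder built from MIG full adders, the building block
    of the Wallace-tree multiplier described in the abstract."""
    c, out = 0, 0
    for i in range(bits + 1):
        s, c = mig_full_adder((x >> i) & 1, (y >> i) & 1, c)
        out |= s << i
    return out

assert all(mig_add(x, y) == x + y for x in range(16) for y in range(16))
```

Expressing the adder with three majority gates instead of the usual AND/OR/XOR network is what lets stateful memristor logic, whose native operation is majority-like, complete it in fewer sequential steps.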
37. Stability of impulsive coupled systems on networks with both multicoupling structure and time‐varying delays
- Author
-
Meng Fan, Pengfei Wang, Huan Su, and Zhenyao Sun
- Subjects
Control and Systems Engineering ,Control theory ,Computer science ,Mechanical Engineering ,General Chemical Engineering ,Biomedical Engineering ,Structure (category theory) ,Aerospace Engineering ,Electrical and Electronic Engineering ,Stability (probability) ,Industrial and Manufacturing Engineering - Published
- 2019
38. Sparsity-Aware Clamping Readout Scheme for High Parallelism and Low Power Nonvolatile Computing-in-Memory Based on Resistive Memory
- Author
-
Wang Ye, Meng-Fan Chang, Ming Liu, Linfang Wang, Chunmeng Dou, Junjie An, and Qi Liu
- Subjects
Non-volatile memory ,Parallel processing (DSP implementation) ,Artificial neural network ,Computer science ,Electronic engineering ,Perceptron ,Throughput (business) ,Signal ,Clamping ,Resistive random-access memory - Abstract
The input parallelism of resistive memory (RRAM)-based nonvolatile computing-in-memory (nvCIM) structures is limited by the signal margin as well as the readout precision. In this work, we propose a sparsity-aware clamping (SAC) scheme and its circuit implementation for nvCIM through circuit-algorithm co-design. It adaptively tunes the quantization range and resolution of the readout circuit according to the degree of sparsity in neural network models. As a result, the SAC scheme can effectively increase the input parallelism of nvCIMs without degrading the signal margin or increasing the hardware cost of analog readout. A case study on processing a multi-layer perceptron (MLP) model with the proposed nvCIM structure shows that the SAC scheme improves throughput by 2× and increases energy efficiency by 25.35% with negligible inference-accuracy loss.
- Published
- 2021
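The core idea of the SAC scheme above can be illustrated with an idealized behavioral model. This is a sketch under assumptions, not the published circuit: the function name and parameters are hypothetical, inputs are binary, weights are assumed to lie in [0, 1], and the ADC full scale is clamped to the number of active input rows in the current cycle instead of the worst case.

```python
import numpy as np

def sac_readout(inputs, weights, adc_bits=3):
    """Sparsity-aware clamping (illustrative): scale the readout's
    quantized range to the active-input count, not the worst case."""
    active = int(np.count_nonzero(inputs))    # degree of sparsity this cycle
    analog_sum = float(inputs @ weights)      # idealized bitline current sum
    full_scale = max(active, 1)               # clamp range to active rows
    levels = 2 ** adc_bits - 1
    code = round(min(analog_sum, full_scale) / full_scale * levels)
    return code * full_scale / levels         # dequantized MAC estimate
```

With sparse inputs the quantization step shrinks proportionally, which is how the scheme preserves signal margin while more wordlines are activated in parallel.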
39. A 40nm 1Mb 35.6 TOPS/W MLC NOR-Flash Based Computation-in-Memory Structure for Machine Learning
- Author
-
Jingjing Li, Zhaolong Qin, Yuxin Zhang, Yajuan He, Meng-Fan Chang, Qiang Li, Xin Si, Sitao Zeng, Sanfeng Zhang, Chen Wang, Chunmeng Dou, and Zhiguo Zhu
- Subjects
Computer science ,business.industry ,Amplifier ,ComputerApplications_COMPUTERSINOTHERSYSTEMS ,Multiplication ,Node (circuits) ,TOPS ,business ,Quantization (image processing) ,Throughput (business) ,Computer hardware ,Bottleneck ,Efficient energy use - Abstract
Computation-in-memory (CIM) is a feasible method of overcoming the "von Neumann bottleneck" with high throughput and energy efficiency. In this paper, we propose a 1Mb multi-level-cell (MLC) NOR-Flash-based CIM (MLFlash-CIM) structure in a 40nm technology node. A multi-bit readout circuit is proposed to realize adaptive quantization, comprising a current interface circuit, a multi-level analog shift amplifier (AS-Amp), and an 8-bit SAR-ADC. When applied to a modified 16-layer VGG-16 network, the proposed MLFlash-CIM achieves 92.73% inference accuracy on the CIFAR-10 dataset. This CIM structure also achieves a peak throughput of 3.277 TOPS and an energy efficiency of 35.6 TOPS/W for 4-bit multiply-and-accumulate (MAC) operations.
- Published
- 2021
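The readout chain above ends in an 8-bit SAR-ADC, whose successive-approximation search is a well-defined algorithm that can be sketched behaviorally. This idealized model (hypothetical names; no comparator noise, offset, or the AS-Amp stage) performs the standard MSB-first binary search on the input voltage.

```python
def sar_adc(vin, vref=1.0, bits=8):
    """Behavioral SAR ADC: MSB-first successive-approximation search."""
    code = 0
    for i in range(bits - 1, -1, -1):
        trial = code | (1 << i)                  # tentatively set bit i
        if vin >= trial * vref / (1 << bits):    # idealized comparator
            code = trial                         # keep the bit
    return code
```

Each of the 8 comparisons halves the remaining search interval, so conversion takes `bits` cycles regardless of the input value.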
40. Challenges of Computation-in-Memory Circuits for AI Edge Applications
- Author
-
Meng-Fan Chang, Chuan-Jia Jhang, and Ping-Cheng Chen
- Subjects
Memory circuits ,Development (topology) ,Edge device ,Computer architecture ,Computer science ,Computation ,Enhanced Data Rates for GSM Evolution ,Macro ,Bottleneck ,Computing architecture - Abstract
In the conventional von Neumann computing architecture, the separation of memory and computing units (i.e., the memory-wall bottleneck) poses notable challenges to the development of energy-efficient AI edge devices. Computing-in-memory (CIM) is an emerging approach to breaking through this bottleneck. This paper outlines the background of computing-in-memory macros as well as ongoing challenges and trends in the development of this technology.
- Published
- 2021
41. A 40nm 100Kb 118.44TOPS/W Ternary-weight Compute-in-Memory RRAM Macro with Voltage-sensing Read and Write Verification for reliable multi-bit RRAM operation
- Author
-
Yu-Der Chih, Meng-Fan Chang, Win-San Khwa, Muya Chang, Jong-Hyeok Yoon, and Arijit Raychowdhury
- Subjects
CMOS ,Computer science ,Encoding (memory) ,Electronic engineering ,State (computer science) ,Macro ,Convolutional neural network ,Throughput (business) ,Voltage ,Resistive random-access memory - Abstract
RRAM is a promising candidate for compute-in-memory (CIM) applications owing to its natural multiply-and-accumulate (MAC)-supporting structure, high bit-density, non-volatility, and a monolithic CMOS-plus-RRAM process. In particular, multi-bit encoding in RRAM cells helps support advanced applications such as AI with higher MAC throughput and bit-density. Notwithstanding prior efforts to commercialize RRAM technology, underlying challenges hinder its wide usage [1]. As a circuit-domain approach to these challenges, this paper presents a 101.4Kb ternary-weight RRAM macro with 256x256 cells supporting: (1) CIM for ternary-weight networks, employing a voltage-based read (RD) with active feedback to surmount the low resistance ratio (R-ratio) between the high-resistance state (HRS) and the low-resistance state (LRS) in high-endurance RRAM; and (2) iterative write with verification (IWR) to facilitate reliable multi-bit encoding under a narrow margin. Compared to [2], which supports CIM with binary RRAM cells, this work provides 38.44x (=3^(3x3)/2^(3x3)) flexibility on 3x3 filters in convolutional neural networks (CNNs) and a 1.585x bit-density improvement, thereby enabling advanced CIM applications with ternary-weight networks.
- Published
- 2021
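The ternary-weight arithmetic the macro above supports can be sketched numerically. This is an illustrative software model, not the voltage-sensing circuit: the threshold-based quantizer and function names are assumptions, and the point is that with weights in {-1, 0, +1} a MAC reduces to additions and subtractions (zeros contribute nothing, which is why ternary cells raise flexibility at modest bit-density cost).

```python
import numpy as np

def ternary_quantize(w, threshold=0.05):
    """Map full-precision weights to {-1, 0, +1} (illustrative rule)."""
    return np.where(np.abs(w) <= threshold, 0, np.sign(w)).astype(int)

def ternary_mac(x, w_t):
    """MAC with ternary weights: only adds and subtracts remain."""
    return int(x[w_t == 1].sum() - x[w_t == -1].sum())
```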
42. 16.3 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b Precision for AI Edge Chips
- Author
-
Shyh-Shyuan Sheu, Shih-Chieh Chang, Li-Yang Hung, Kea-Tiong Tang, Meng-Fan Chang, Sih-Han Li, Yen-Lin Chung, Tianlong Pan, Ping-Chun Wu, Yen-Chi Chou, Jin-Sheng Ren, Jian-Wei Su, Ren-Shuo Liu, Pei-Jung Lu, Wei-Chung Lo, Chih-Cheng Hsieh, Chung-Chuan Lo, Ta-Wei Liu, Xin Si, Chih-I Wu, and Ruhui Liu
- Subjects
Computer science ,business.industry ,Transistor ,law.invention ,Process variation ,Memory management ,law ,System on a chip ,Static random-access memory ,Macro ,business ,Throughput (business) ,Dram ,Computer hardware - Abstract
Recent SRAM-based computation-in-memory (CIM) macros enable mid-to-high-precision multiply-and-accumulate (MAC) operations with improved energy efficiency using ultra-small/small-capacity (0.4-8KB) memory devices. However, advanced CIM-based edge-AI chips favor multiple mid/large-capacity SRAM-CIM macros with high input (IN) and weight (W) precision, to reduce the frequency of data reloads from external DRAM and to avoid the need for additional SRAM buffers or ultra-large on-chip weight buffers. Enlarging memory capacity and throughput, however, increases the delay parasitics on WLs and BLs and the number of parallel computing elements, resulting in longer compute latency (t_AC), lower energy efficiency (EF), degraded signal margin, and larger fluctuations in power consumption across data patterns (see Fig. 16.3.1). Recent SRAM-CIM macros tend not to use in-lab SRAM cells with a logic-based layout, in favor of foundry-provided compact-layout 8T [2], [3], [5] or 6T cells with local-computing cells (LCCs) [4], [6] to reduce the cell-array area and facilitate manufacturing. This paper presents a SRAM-CIM structure using: (1) a segmented-BL charge-sharing (SBCS) scheme for MAC operations, with low energy consumption and a consistently high signal margin across MAC values (MACV); (2) a new LCC, called a source-injection local-multiplication cell (SILMC), to support the SBCS scheme with a signal margin that is consistent against transistor process variation; and (3) a prioritized-hybrid ADC (Ph-ADC) to achieve a small area and power overhead for analog readout. A 28nm 384kb SRAM-CIM macro was fabricated using a foundry compact 6T cell, supporting MAC operations with 16 accumulations of 8b inputs and 8b weights with near-full-precision output (20b). This macro achieves a 7.2ns t_AC and a 22.75TOPS/W EF for 8b-MAC operations, with an FoM (IN-precision x W-precision x output-ratio x output-channel x EF/t_AC) 6x higher than prior work.
- Published
- 2021
43. Session 16 Overview: Computation in Memory
- Author
-
Seung-Jun Bae, Ru Huang, and Meng-Fan Chang
- Subjects
Cover (telecommunications) ,Computer architecture ,Computer science ,Computation ,SIGNAL (programming language) ,Static random-access memory ,Session (computer science) ,Macro ,eDRAM ,Resistive random-access memory - Abstract
Computation in memory (CIM) continues to diversify to cover various memory technologies, with computations performed in different signal domains. This session covers CIM designs using ReRAM, eDRAM, and SRAM, with computations in both the analog and digital domains. Paper 16.1 describes a high-performance 22nm ReRAM design using a hybrid-precision technique that supports up to 8b-input and 8b-weight MAC operations, achieving 11.91TOPS/W for 8b-input, 8b-weight, 14b-output and 195.7TOPS/W for 1b-input, 2b-weight, 4b-output operations. Paper 16.2 describes the first 1T1C eDRAM design supporting analog 8b-input, 8b-weight, 8b-output computations at 4.76TOPS/W in a 65nm technology. Paper 16.3 presents a 28nm SRAM CIM macro achieving up to 94.31TOPS/W for 4b-input, 4b-weight, 12b-output and 22.75TOPS/W for 8b-input, 8b-weight, 20b-output operations. Paper 16.4 takes a different approach, focusing on an all-digital, area-efficient SRAM CIM macro design achieving up to 89TOPS/W with 4b inputs, 4b weights, and 16b outputs.
- Published
- 2021
44. 15.2 A 2.75-to-75.9TOPS/W Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating
- Author
-
Yen-Lin Chung, Meng-Fan Chang, Jiaxin Liu, Zhe Yuan, Jinshan Yue, Yongpan Liu, Jian-Wei Su, Yifan He, Nan Sun, Mingtao Zhan, Yipeng Wang, Ping-Chun Wu, Yuxuan Huang, Huazhong Yang, Li-Yang Hung, Xiaoyu Feng, and Xueqing Li
- Subjects
Electronic system-level design and verification ,Power gating ,Artificial neural network ,Computer engineering ,Edge device ,Computer science ,System on a chip ,Static random-access memory ,Macro ,Block (data storage) - Abstract
Computing-in-memory (CIM) is an attractive approach for energy-efficient neural network (NN) processors, especially for low-power edge devices. Previous CIM chips [1]–[5] have demonstrated macro- and system-level designs enabling multi-bit operations and sparsity support. However, several challenges remain, as shown in Fig. 15.2.1. First, although a previously proposed block-wise sparsity strategy [5] can power off ADCs, zeros still contribute to storage requirements, and power gating was not applied to computing resources. Second, on-chip SRAM CIM macros are not large enough to hold all weights, and updating weights between computing operations leads to significant performance loss. Finally, the limited sensing margin incurs poor accuracy for large NN models on practical datasets, such as ImageNet; the precision and power of the ADCs should be optimized and adjusted.
- Published
- 2021
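The block-wise zero-skipping idea in the abstract above can be modeled behaviorally. This sketch is an assumption-laden illustration (hypothetical names; it models only the arithmetic, not the ADC power-off or ping-pong weight update): weight blocks that are entirely zero are skipped, so they cost no compute, mirroring how power gating removes their energy contribution.

```python
import numpy as np

def blockwise_zero_skip_mac(x, w, block=16):
    """MAC that skips all-zero weight blocks (illustrative model of
    block-wise sparsity: skipped blocks cost no ADC/compute energy)."""
    total, blocks_computed = 0.0, 0
    for i in range(0, len(x), block):
        wb = w[i:i + block]
        if not np.any(wb):                # all-zero block: power-gate, skip
            continue
        total += float(x[i:i + block] @ wb)
        blocks_computed += 1
    return total, blocks_computed
```

The result is exact (skipped blocks contribute zero anyway); the benefit is the reduced number of computed blocks, which stands in for energy savings.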
45. 16.1 A 22nm 4Mb 8b-Precision ReRAM Computing-in-Memory Macro with 11.91 to 195.7TOPS/W for Tiny AI Edge Devices
- Author
-
Hui-Yao Kao, Tsung-Yung Jonathan Chang, Sheng-Po Huang, Kea-Tiong Tang, Ren-Shuo Liu, Fu-Chun Chang, Ta-Wei Liu, Chih-Cheng Hsieh, Win-San Khwa, Yu-Der Chih, Chin-I Su, Cheng-Xin Xue, Je-Min Hung, Meng-Fan Chang, Chuan-Jia Jhang, Yen-Hsiang Huang, Peng Chen, and Chung-Chuan Lo
- Subjects
Scheme (programming language) ,Hardware_MEMORYSTRUCTURES ,Edge device ,business.industry ,Computer science ,Electrical engineering ,Latency (audio) ,Binary number ,Resistive random-access memory ,Non-volatile memory ,Macro ,business ,computer ,computer.programming_language ,Voltage - Abstract
Battery-powered tiny-AI edge devices require large-capacity nonvolatile compute-in-memory (nvCIM) with multibit input (IN), weight (W), and output (OUT) precision to support complex applications, high energy efficiency (EF_MAC), and short computing latency (t_AC) for multiply-and-accumulate (MAC) operations. Due to the low read-disturb-free voltage of nonvolatile memory (NVM) devices and the large parasitic load on the bitline, most existing Mb-level nvCIM macros use a current-mode read scheme [1-5] and achieve only a low IN-W precision (binary to 4b).
- Published
- 2021
46. 16.4 An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In-Memory Macro in 22nm for Machine-Learning Edge Applications
- Author
-
Cheng-Han Lu, Yu-Lin Chen, Yen-Huei Chen, Mori Haruki, Rawan Naous, Kerem Akarvardar, Yi-Chun Shih, Zhao Wei-Chang, Dar Sun, Yu-Der Chih, Lee Chia-Fu, Tsung-Yung Jonathan Chang, Mahmut E. Sinangil, Po-Hao Lee, Meng-Fan Chang, Hung-jen Liao, Tan-Li Chou, Yih Wang, Fujiwara Hidehiro, and Chieh-Pu Lo
- Subjects
Edge device ,business.industry ,Computer science ,Cloud computing ,Machine learning ,computer.software_genre ,Bandwidth (computing) ,Static random-access memory ,Artificial intelligence ,Enhanced Data Rates for GSM Evolution ,Latency (engineering) ,business ,Field-programmable gate array ,computer ,Efficient energy use - Abstract
From the cloud to edge devices, artificial intelligence (AI) and machine learning (ML) are widely used in many cognitive tasks, such as image classification and speech recognition. In recent years, research on hardware accelerators for AI edge devices has received more attention, mainly due to the advantages of AI at the edge, including privacy, low latency, and more reliable and effective use of network bandwidth. However, traditional computing architectures (such as CPUs, GPUs, FPGAs, and even existing AI-accelerator ASICs) cannot meet the future needs of energy-constrained AI edge applications: because ML computing is data-centric, most of the energy in these architectures is consumed by memory accesses. To improve energy efficiency, both academia and industry are exploring a new computing architecture, namely compute-in-memory (CIM). CIM research has focused on analog approaches with high energy efficiency; however, their main disadvantage is a lack of accuracy due to low SNR, so an analog approach may not be suitable for applications that require high accuracy.
- Published
- 2021
47. Monolithic 3D+-IC Based Massively Parallel Compute-in-Memory Macro for Accelerating Database and Machine Learning Primitives
- Author
-
Srivatsa Srinivasa Rangachar, Meng-Fan Chang, Sheng-Po Huang, Chang-Hong Shen, Jia-Min Shieh, Vijaykrishnan Narayanan, John Sampson, Mon-Shu Ho, Wen-Kuan Yeh, Cheng-Xin Xue, Hariram Thirucherai Govindarajan, Chun-Ying Lee, Akshay Krishna Ramanathan, Je-Min Hung, and Fu-Kuo Hsueh
- Subjects
Speedup ,Database ,Computer science ,business.industry ,020208 electrical & electronic engineering ,Sorting ,Three-dimensional integrated circuit ,02 engineering and technology ,Machine learning ,computer.software_genre ,Application-specific integrated circuit ,0202 electrical engineering, electronic engineering, information engineering ,Multiplication ,Artificial intelligence ,Macro ,business ,Massively parallel ,computer ,Sparse matrix - Abstract
This paper demonstrates the first Monolithic 3D+-IC based Compute-in-Memory (CiM) macro performing massively parallel beyond-Boolean operations targeting database and machine learning (ML) applications. The proposed CiM technique supports data filtering, sorting, and sparse matrix-matrix multiplication (SpGEMM) operations. Our system exhibits up to a 272x speedup and 151x energy savings compared to the ASIC baseline.
- Published
- 2020
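Of the primitives listed in the abstract above, SpGEMM is the one with a compact algorithmic core. The sketch below is a plain-Python reference (not the in-memory implementation): a Gustavson-style row-wise product over dict-of-dicts sparse matrices `{row: {col: val}}`, which only ever touches nonzero entries, the same property the massively parallel hardware exploits.

```python
def spgemm(A, B):
    """SpGEMM on dict-of-dicts sparse matrices {row: {col: val}},
    using Gustavson's row-wise accumulation over nonzeros only."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a_ik in row.items():           # nonzeros of row i of A
            for j, b_kj in B.get(k, {}).items():  # nonzeros of row k of B
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C
```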
48. A Two-way SRAM Array based Accelerator for Deep Neural Network On-chip Training
- Author
-
Shimeng Yu, Ta-Wei Liu, Ruhui Liu, Hongwu Jiang, Meng-Fan Chang, Yen-Chi Chou, Wei-Hsing Huang, Jian-Wei Su, Xiaochen Peng, and Shanshi Huang
- Subjects
Hardware_MEMORYSTRUCTURES ,Speedup ,Artificial neural network ,business.industry ,Computer science ,020208 electrical & electronic engineering ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Bottleneck ,Backpropagation ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Multiplication ,Static random-access memory ,Signed number representations ,business ,Computer hardware ,0105 earth and related environmental sciences - Abstract
On-chip training of large-scale deep neural networks (DNNs) is challenging due to computational complexity and resource limitations. Compute-in-memory (CIM) architectures exploit analog computation inside the memory array to speed up vector-matrix multiplication (VMM) and alleviate the memory bottleneck. However, existing CIM prototype chips, and SRAM-based accelerators in particular, target only low-precision inference engines. In this work, we propose a two-way SRAM array design that performs bi-directional in-memory VMM with minimal hardware overhead. A novel solution for signed-number multiplication is also proposed to handle the negative inputs in backpropagation. We taped out and validated the proposed two-way SRAM array design in a TSMC 28nm process. Based on silicon measurement data from the CIM macro, we explore the hardware performance of the entire architecture for DNN on-chip training. The experimental data show that the proposed accelerator can achieve an energy efficiency of ~3.2 TOPS/W, and >1000 FPS and >300 FPS for ResNet and DenseNet training on ImageNet, respectively.
- Published
- 2020
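The two operations a two-way array must serve can be written down in a few lines. This is an idealized numerical sketch, not the SRAM circuit: the forward pass reads the array one way (y = W x), the backward pass reads the same array the other way (dx = Wᵀ dy) without storing a transposed copy, and the signed-input handling shown is one common two-pass trick, which may differ from the paper's exact circuit technique.

```python
import numpy as np

def forward_vmm(W, x):
    """Inference pass: read the array along one direction, y = W @ x."""
    return W @ x

def backward_vmm(W, dy):
    """Backprop pass: read the same array the other way, dx = W.T @ dy,
    so no explicit transposed weight copy is needed."""
    return W.T @ dy

def signed_vmm(W, x):
    """Signed inputs via two unsigned passes (illustrative trick):
    split x into positive and negative parts and subtract the results."""
    x_pos = np.maximum(x, 0)
    x_neg = np.maximum(-x, 0)
    return W @ x_pos - W @ x_neg
```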
49. A 28nm 1.5Mb Embedded 1T2R RRAM with 14.8 Mb/mm2 using Sneaking Current Suppression and Compensation Techniques
- Author
-
Meng-Fan Chang, Hangbing Lv, Feng Zhang, Jianhua Yang, Xiaoyong Xue, Ming Liu, Xiaoyang Zeng, and Xiaoxin Xu
- Subjects
Computer science ,business.industry ,Transistor ,Electrical engineering ,Chip ,law.invention ,Resistive random-access memory ,PMOS logic ,Current limiting ,law ,Resistor ,business ,Reset (computing) ,Voltage - Abstract
For the first time, a 1T2R RRAM cell using a PMOS selector is proposed and demonstrated, with an improved reset voltage and high density for embedded NVM. A hierarchical bit line and 3-state cell storage confine the associated sneaking current locally within a small zone and further suppress it by >90%. A self-adaptive write driver with a current limiter and a sneaking-current compensator realizes power savings and accurate current compliance for set operations. To suit the PMOS selector, reverse read using current sensing with a dummy reference enables fast and reliable reads. A 1.5Mb RRAM test chip in 28nm improves the record storage density by 40%, to 14.8 Mb/mm2. Reliable read is performed at a low VDD of 0.6V at (−40, 125)°C.
- Published
- 2020
50. 15.2 A 28nm 64Kb Inference-Training Two-Way Transpose Multibit 6T SRAM Compute-in-Memory Macro for AI Edge Chips
- Author
-
Heng-Yuan Lee, Jian-Wei Su, Ting-Wei Chang, Chung-Chuan Lo, Shih-Chieh Chang, Shimeng Yu, Hongwu Jiang, Kea-Tiong Tang, Wei-Hsing Huang, Sih-Han Li, Shyh-Shyuan Sheu, Zhixiao Zhang, Ta-Wei Liu, Yung-Ning Tu, Pei-Jung Lu, Meng-Fan Chang, Xin Si, Shanshi Huang, Jing-Hong Wang, Ruhui Liu, Chih-Cheng Hsieh, Yen-Chi Chou, and Ren-Shuo Liu
- Subjects
Edge device ,Computer science ,business.industry ,020208 electrical & electronic engineering ,020207 software engineering ,Cloud computing ,02 engineering and technology ,Computational science ,Margin (machine learning) ,Transpose ,0202 electrical engineering, electronic engineering, information engineering ,Static random-access memory ,Enhanced Data Rates for GSM Evolution ,Macro ,business - Abstract
Many AI edge devices require local intelligence to achieve fast computing time (t_AC), high energy efficiency (EF), and privacy. The transfer-learning approach is a popular solution for AI edge chips, wherein data used to re-train the AI in the cloud is used to fine-tune (re-train) a few of the neural layers in edge devices. This enables the dynamic incorporation of data from in-situ environments or private information. Computing-in-memory (CIM) is a promising approach to improving EF for AI edge chips. Existing CIM schemes support inference [1]–[5] with forward (FWD) propagation; however, they do not support training, which requires both FWD and backward (BWD) propagation, due to differences in the weight-access flow for FWD and BWD propagation. As Fig. 15.2.1 shows, efforts to increase the precision of the input (IN), weight (W), and/or output (OUT) tend to degrade t_AC and EF for training operations, irrespective of the scheme: digital FWD and BWD (DF-DB) or CIM-FWD-digital-BWD (CiMF-DB). This work develops a two-way-transpose (TWT) SRAM-CIM macro supporting multibit MAC operations for FWD and BWD propagation with fast t_AC and high EF within a compact area. The proposed scheme features: (1) a TWT multiply cell (TWT-MC) with high resistance to process variation; and (2) a small-offset gain-enhancement sense amplifier (SOGE-SA) to tolerate a small read margin. A 28nm 64Kb TWT SRAM-CIM macro was fabricated using a foundry-provided compact 6T-SRAM cell, supporting both inference and training operations in an SRAM-CIM device for the first time. This macro also demonstrates the fastest t_AC (3.8-21ns) and highest EF (7-61.1TOPS/W) for MAC operations using 2-8b inputs, 4-8b weights, and 12-20b outputs.
- Published
- 2020