1,655 results for "PARALLEL processing"
Search Results
2. Evaluating RISC-V Vector Instruction Set Architecture Extension with Computer Vision Workloads.
- Author
-
Li, Ruo-Shi, Peng, Ping, Shao, Zhi-Yuan, Jin, Hai, and Zheng, Ran
- Subjects
COMPUTER architecture ,GRAYSCALE model ,COMPUTER vision ,COMPUTER performance ,PARALLEL processing ,ALGORITHMS - Abstract
Computer vision (CV) algorithms are extensively used in a myriad of applications nowadays. As multimedia data are generally well-formatted and regular, it is beneficial to leverage the massive parallel processing power of the underlying platform to improve the performance of CV algorithms. Single Instruction Multiple Data (SIMD) instructions, capable of conducting the same operation on multiple data items in a single instruction, are extensively employed to improve the efficiency of CV algorithms. In this paper, we evaluate the power and effectiveness of the RISC-V vector extension (RV-V) on typical CV algorithms, such as Gray Scale, Mean Filter, and Edge Detection. Our experiments show that, compared with the baseline OpenCV implementation using scalar instructions, equivalent implementations using RV-V (version 0.8) can reduce the instruction count of the same CV algorithm by up to 24x when processing the same input images. However, the actual performance improvement, measured in cycle counts, depends heavily on the specific implementation of the underlying RV-V co-processor. In our evaluation, using the vector co-processor (with eight execution lanes) of the Xuantie C906, vector-version CV algorithms exhibit average speedups of up to 2.98x over their scalar counterparts. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
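The per-pixel pattern that vector extensions like RV-V accelerate can be sketched in plain Python. This is an illustrative example, not code from the paper: it uses fixed-point luma weights (77/150/29, an integer approximation of the BT.601 coefficients) and a blocked loop that mimics how a vector instruction applies one multiply-accumulate to a whole group of pixels.

```python
# Illustrative fixed-point grayscale conversion, the kind of per-pixel loop
# that SIMD/vector instructions accelerate: one vector instruction applies
# the multiply-accumulate to many pixels at once.

def gray_scalar(pixels):
    """Scalar baseline: one pixel per iteration."""
    out = []
    for r, g, b in pixels:
        out.append((77 * r + 150 * g + 29 * b) >> 8)
    return out

def gray_blocked(pixels, vl=4):
    """Mimics vector execution: process `vl` pixels per 'instruction'."""
    out = []
    for i in range(0, len(pixels), vl):
        block = pixels[i:i + vl]          # vector load
        rs = [p[0] for p in block]
        gs = [p[1] for p in block]
        bs = [p[2] for p in block]
        # one 'vector' multiply-accumulate over the whole block
        out.extend((77 * r + 150 * g + 29 * b) >> 8
                   for r, g, b in zip(rs, gs, bs))
    return out
```

Both functions produce identical results; the blocked version issues roughly len(pixels)/vl "instructions", which is where the instruction-count reduction the abstract reports comes from.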
3. High-Performance RNS Modular Exponentiation by Sum-Residue Reduction.
- Author
-
Wu, Tao
- Subjects
EXPONENTIATION ,COMPUTER arithmetic ,BLOCKCHAINS ,NUMBER systems ,INFORMATION technology security - Abstract
With the rapid development and application of artificial intelligence and blockchain, the demand for information and data security has also increased, in which public-key cryptography, such as Rivest-Shamir-Adleman (RSA) cryptography, plays a significant role. Modular exponentiation is fundamental in computer arithmetic and is widely applied in cryptography, such as ElGamal cryptography, the Diffie–Hellman key exchange protocol, and RSA cryptography. The implementation of modular exponentiation in a residue number system leads to high parallelism in computation and has been applied in many hardware architectures. While most residue number system (RNS)-based architectures utilize the RNS Montgomery algorithm with two residue number systems, the recent modular multiplication algorithm with sum residues performs modular reduction in only one residue number system with about the same parallelism. In this work, it is shown that high-performance modular exponentiation and RSA cryptography can be implemented in RNS. Both the algorithm and architecture are improved to achieve high performance with extra area overheads, where a 1024-bit modular exponentiation can be completed in 0.567 ms on a Xilinx XC6VLX195T-3 platform, costing 26489 slices, 87357 LUTs, 363 dedicated multipliers of 18×18 bits, and 65 block RAMs. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
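The parallelism the abstract above attributes to residue number systems can be shown in a few lines. This is a sketch with a small, made-up base of pairwise-coprime moduli (not the paper's parameters): each multiplication proceeds independently per modulus with no carries crossing channels, and modular exponentiation simply chains such multiplications (square-and-multiply); reconstruction uses the Chinese Remainder Theorem.

```python
from math import prod

# Sketch of residue number system (RNS) arithmetic with an illustrative base.
MODULI = [13, 17, 19, 23]          # pairwise coprime
M = prod(MODULI)                   # dynamic range: 96577

def to_rns(x):
    return [x % m for m in MODULI]

def rns_mul(a, b):
    # Channel-wise multiply: no carries cross moduli, so every channel
    # can execute in parallel -- the source of RNS's hardware parallelism.
    return [(x * y) % m for x, y, m in zip(a, b, MODULI)]

def from_rns(r):
    # Chinese Remainder Theorem reconstruction back to an integer mod M.
    total = 0
    for ri, mi in zip(r, MODULI):
        Mi = M // mi
        total += ri * Mi * pow(Mi, -1, mi)   # pow(..., -1, m): modular inverse
    return total % M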
4. Event‐based high throughput computing: A series of case studies on a massively parallel softcore machine.
- Author
-
Vousden, Mark, Morris, Jordan, McLachlan Bragg, Graeme, Beaumont, Jonathan, Rafiev, Ashur, Luk, Wayne, Thomas, David, and Brown, Andrew
- Subjects
- *
CONDENSED matter physics , *ELECTRICITY pricing , *COMPUTATIONAL chemistry , *COMPUTER architecture , *MESSAGE passing (Computer science) - Abstract
This paper introduces an event‐based computing paradigm, where workers only perform computation in response to external stimuli (events). This approach is best employed on hardware with many thousands of smaller compute cores with a fast, low‐latency interconnect, as opposed to traditional computers with fewer and faster cores. Event‐based computing is timely because it provides an alternative to traditional big computing, which suffers from immense infrastructural and power costs. This paper presents four case study applications, where an event‐based computing approach finds solutions orders of magnitude more quickly than the equivalent traditional big-compute approach, including problems in computational chemistry and condensed matter physics. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
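The paradigm described above can be sketched as workers that stay idle until a message arrives, then update local state and emit new events until the machine is quiescent. This toy example (our own, not one of the paper's case studies) computes shortest hop counts in a graph that way, with one "worker" per vertex.

```python
from collections import deque

def event_based_hops(adjacency, source):
    """Shortest hop counts via event-driven relaxation."""
    dist = {v: float("inf") for v in adjacency}
    events = deque([(source, 0)])      # (target vertex, proposed distance)
    while events:                      # run until quiescence: no events left
        v, d = events.popleft()
        if d < dist[v]:                # the handler fires only on improvement
            dist[v] = d
            for w in adjacency[v]:     # emit events to neighbours
                events.append((w, d + 1))
    return dist
```

On massively parallel softcore hardware, the event queue is replaced by the interconnect: each message triggers the handler on whichever core owns the target vertex.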
5. Accelerator-Level Parallelism: Charging computer scientists to develop the science needed to best achieve the performance and cost goals of accelerator-level parallelism hardware and software.
- Author
-
Hill, Mark D. and Reddi, Vijay Janapa
- Subjects
- *
PARALLEL processing , *COMPUTER architecture , *SYSTEMS on a chip , *COMPUTER software development , *COMPUTER science - Abstract
The authors discuss the need for a new branch of computer science to develop accelerator-level parallelism in both computer hardware and software. They mention various forms of parallel processing that already exist, present a case study of mobile systems on a chip that are starting to rely on accelerator-level parallelism, and examine how the science might proceed in the future.
- Published
- 2021
- Full Text
- View/download PDF
6. End-to-End Synthesis of Dynamically Controlled Machine Learning Accelerators.
- Author
-
Curzel, Serena, Agostini, Nicolas Bohm, Castellana, Vito Giovanni, Minutoli, Marco, Limaye, Ankur, Manzano, Joseph, Zhang, Jeff, Brooks, David, Wei, Gu-Yeon, Ferrandi, Fabrizio, and Tumeo, Antonino
- Subjects
- *
ARTIFICIAL neural networks , *FINITE state machines , *MACHINE learning , *FIELD programmable gate arrays , *COMPILERS (Computer programs) - Abstract
Edge systems are required to autonomously make real-time decisions based on large quantities of input data under strict power, performance, area, and other constraints. Meeting these constraints is only possible by specializing systems through hardware accelerators purposefully built for machine learning and data analysis algorithms. However, data science evolves at a quick pace, and manual design of custom accelerators has high non-recurring engineering costs: general solutions are needed to automatically and rapidly transition from the formulation of a new algorithm to the deployment of a dedicated hardware implementation. Our solution is the SOftware Defined Architectures (SODA) Synthesizer, an end-to-end, multi-level, modular, extensible compiler toolchain providing a direct path from machine learning tools to hardware. The SODA Synthesizer frontend is based on the multilevel intermediate representation (MLIR) framework; it ingests pre-trained machine learning models, identifies kernels suited for acceleration, performs high-level optimizations, and prepares them for hardware synthesis. In the backend, SODA leverages state-of-the-art high-level synthesis techniques to generate highly efficient accelerators, targeting both field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). In this paper, we describe how the SODA Synthesizer can also assemble the generated accelerators (based on the finite state machine with datapath model) in a custom system driven by a distributed controller, building a coarse-grained dataflow architecture that does not require a host processor to orchestrate parallel execution of multiple accelerators. We show the effectiveness of our approach by automatically generating ASIC accelerators for layers of popular deep neural networks (DNNs).
Our high-level optimizations result in up to 74x speedup on isolated accelerators for individual DNN layers, and our dynamically scheduled architecture yields an additional 3x performance improvement when combining accelerators to handle streaming inputs. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
7. Exploiting Hardware-Based Data-Parallel and Multithreading Models for Smart Edge Computing in Reconfigurable FPGAs.
- Author
-
Rodriguez, Alfonso, Otero, Andres, Platzner, Marco, and de la Torre, Eduardo
- Subjects
- *
EDGE computing , *ADAPTIVE computing systems , *FIELD programmable gate arrays , *COMPUTER systems , *COMPUTING platforms - Abstract
Current edge computing systems are deployed in highly complex application scenarios with dynamically changing requirements. In order to provide the expected performance and energy efficiency values in these situations, the use of heterogeneous hardware/software platforms at the edge has become widespread. However, these computing platforms still suffer from the lack of unified software-driven programming models to efficiently deploy multi-purpose hardware-accelerated solutions. In parallel, edge computing systems also face another huge challenge: operating under multiple conditions that were not taken into account during any of the design stages. Moreover, these conditions may change over time, forcing self-adaptation mechanisms to become a must. This paper presents an integrated architecture to exploit hardware-accelerated data-parallel models and transparent hardware/software multithreading. In particular, the proposed architecture leverages the ARTICo3 framework and ReconOS to allow developers to select the most suitable programming model to deploy their edge computing applications onto run-time reconfigurable hardware devices. An evolvable hardware system is used as an additional architectural component during validation, providing support for continuous lifelong learning in smart edge computing scenarios. In particular, the proposed setup exhibits online learning capabilities that include learning by imitation from software-based reference algorithms. Experimental results show the benefits of the proposed approach, exposing different run-time tradeoffs (e.g., computing performance versus functional correctness of the evolved solutions), and highlighting the benefits of using scalable data-parallel models to perform circuit evolution under dynamically changing application scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
8. Memory-Computing Decoupling: A DNN Multitasking Accelerator With Adaptive Data Arrangement.
- Author
-
Li, Chuxi, Fan, Xiaoya, Wu, Xiaoti, Yang, Zhao, Wang, Miao, Zhang, Meng, and Zhang, Shengbing
- Subjects
- *
ARTIFICIAL neural networks , *DATA conversion , *PARALLEL processing , *AUGMENTED reality , *COMPUTER multitasking , *DATA distribution , *ENERGY consumption - Abstract
Multiple deep neural networks (DNNs) are increasingly used in real-world intelligent applications, such as intelligent robotics and autonomous vehicles to collectively complete complicated tasks running on edge devices. Because each layer of the subtasks prefers a distinct dataflow due to the heterogeneity in shape and scale of the network layers, a variable dataflow approach on the DNN accelerators is urgently required. On DNN accelerators that enable multiple dataflows, however, we detect a dimension mismatch between parallel processing under the dataflow approach and linear data memory arrangement. When multiple DNN tasks share partial features or weights, the issue is further exacerbated. During processing, this mismatch causes a sluggish data supply from both off-chip and on-chip memory. Consequently, the overall throughput, performance, and energy efficiency suffer since DNN models are sensitive to data density. In this work, we reveal the mechanism behind this data dimension mismatch and present a series of metrics that quantify the influence on system performance. On this foundation, we offer a framework that tracks the data tensor dimension conversion and employs a flexible data arrangement over multi-DNN computation to adapt to dataflow variability. An accelerator architecture named data arrangement multi-DNN accelerator (DARMA) that features a data arrangement and distribution circuit and hierarchical memory for data dimension conversion is also presented. Since the mismatch is mitigated, the suggested accelerator outperforms current accelerators in terms of bandwidth and processing unit utilization. Through tests on VR/AR, MLperf, and other multitask applications, the evaluation results show that the proposed architecture provides both energy-efficiency and throughput improvements. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
9. Fused Architecture for Dense and Sparse Matrix Processing in TensorFlow Lite.
- Author
-
Nunez-Yanez, Jose
- Subjects
- *
SPARSE matrices , *CENTRAL processing units , *DIGITAL signal processing , *FIELD programmable gate arrays , *GATE array circuits - Abstract
In this article, we present a hardware architecture optimized for sparse and dense matrix processing in TensorFlow Lite and compatible with embedded-heterogeneous devices that integrate central processing unit and field-programmable gate array (FPGA) resources. The fused architecture for dense and sparse matrices design offers multiple configuration options that trade off parallelism and complexity, and uses a dataflow model to create four stages that read, compute, scale, and write results. All stages are designed to support TensorFlow Lite operations including asymmetric quantized activations, column-major matrix write, per-filter/per-axis bias values, and current scaling specifications. The configurable accelerator is integrated with the TensorFlow Lite inference engine running on the ARMv8 processor. We compare performance/power/energy with the state-of-the-art RUY software multiplication library, showing up to 18× and 48× acceleration in dense and sparse modes, respectively. The sparse mode benefits from structural pruning to fully utilize the digital signal processing blocks present in the FPGA device. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
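A minimal sketch of why sparse mode pays off: in compressed-sparse-row (CSR) storage, only nonzero weights are stored and multiplied. The storage layout below is the generic CSR scheme, not necessarily the paper's internal format, and the matrix is made up for illustration.

```python
# Compressed-sparse-row (CSR) matrix-vector multiply: zeros are never
# stored, loaded, or multiplied, which is what a sparse-mode accelerator
# exploits to raise effective throughput.

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]   # only nonzeros touched
        y.append(acc)
    return y

# A = [[2, 0, 0],
#      [0, 0, 3],
#      [1, 4, 0]]
values  = [2, 3, 1, 4]
col_idx = [0, 2, 0, 1]
row_ptr = [0, 1, 2, 4]   # row i's nonzeros live in values[row_ptr[i]:row_ptr[i+1]]
```

Structured (block-wise) pruning, as used in the paper, additionally keeps the surviving nonzeros in patterns that map cleanly onto fixed-width DSP blocks.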
10. TC-Stream: Large-Scale Graph Triangle Counting on a Single Machine Using GPUs.
- Author
-
Huang, Jianqiang, Wang, Haojie, Fei, Xiang, Wang, Xiaoying, and Chen, Wenguang
- Subjects
- *
GRAPH algorithms , *SOLID state drives , *GRAPHICS processing units , *PARALLEL algorithms , *TRIANGLES , *ON-demand computing , *COUNTING , *SOCIAL network analysis - Abstract
In this paper, we build TC-Stream, a high-performance graph processing system specialized for triangle counting on graph data with up to tens of billions of edges, which significantly exceeds the device memory capacity of Graphics Processing Units (GPUs). The triangle counting problem is a broad research topic in data mining and social network analysis in the graph processing field. As the scale of the graph data grows, a portion of the graph data must be loaded iteratively. In the existing literature, graphs with billions of edges must be processed distributively, which is cost-intensive. Also, many disk-based triangle counting systems have been proposed for CPU architectures, but their performance is inefficient. To solve the above problems, we propose TC-Stream, which focuses on three issues: 1) For power-law graphs, because the amount of work per vertex or edge is inconsistent, different task types are bound to place different demands on computing and memory resources. We propose a parallel vertex approach and a reordering of vertices for graph data that can be placed in the GPU device memory to ensure maximum workload balancing; 2) A binary-search-based set intersection method is designed to achieve maximum parallelism on the GPU; 3) For graph data that exceeds the GPU device memory capacity, we develop a novel vertical partition algorithm to guarantee independent computing on each partition, so that the three computation processes, i.e., computation on the GPU, data transmission between CPU main memory and SSD, and communication between the CPU and the GPU, can be perfectly overlapped. Moreover, TC-Stream optimizes edge-iterator models and benefits from multi-thread parallelism. Extensive experiments conducted on large-scale datasets showed that TC-Stream running on a single Tesla V100 GPU performs 2.4-6× and 1.8-4.4× faster than the state-of-the-art single-machine in-memory and GPU-based triangle counting systems, respectively, and is 2.4× faster than the state-of-the-art out-of-core distributed system PDTL running on an 8-node cluster when processing graph data with 42.5 billion edges, demonstrating the high performance and cost-effectiveness of TC-Stream. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
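The edge-iterator kernel with binary-search set intersection that the abstract above mentions can be sketched in a few lines. This is a generic sequential rendering of the idea, not TC-Stream's GPU code: edges are oriented toward higher vertex ids so each triangle is counted exactly once, and the smaller neighbour list is binary-searched against the larger one.

```python
from bisect import bisect_left

def count_triangles(adjacency):
    """Edge-iterator triangle counting with binary-search intersection."""
    # N+(v): neighbours with id greater than v, kept sorted for searching.
    out = {v: sorted(w for w in ns if w > v) for v, ns in adjacency.items()}
    total = 0
    for u, ns in out.items():
        for v in ns:                       # each oriented edge (u, v)
            small, large = sorted((out[u], out[v]), key=len)
            for w in small:                # binary-search each candidate
                k = bisect_left(large, w)
                if k < len(large) and large[k] == w:
                    total += 1             # u < v < w found exactly once
    return total
```

On a GPU, the inner binary searches are what map well to thousands of threads: each thread probes one candidate independently, with no shared mutable state beyond an atomic counter.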
11. Auto-GNAS: A Parallel Graph Neural Architecture Search Framework.
- Author
-
Chen, Jiamin, Gao, Jianliang, Chen, Yibo, Oloulade, Babatounde Moctard, Lyu, Tengfei, and Li, Zhao
- Subjects
- *
GRAPH algorithms , *SEARCH algorithms , *GENETIC algorithms , *LINEAR acceleration , *PARALLEL programming , *COMPUTER architecture - Abstract
Graph neural networks (GNNs) have received much attention as GNNs have recently been successfully applied on non-euclidean data. However, artificially designed graph neural networks often fail to achieve satisfactory model performance for a given graph dataset. With the rise of automatic machine learning, graph neural architecture search effectively constructs GNNs that achieve the expected model performance. The challenge is efficiently and automatically finding the optimal GNN architecture in a vast search space. Existing search methods evaluate GNN architectures serially, severely limiting system efficiency. To solve these problems, we develop an Automatic Graph Neural Architecture Search framework (Auto-GNAS) with parallel estimation to implement an automatic graph neural search process that requires almost no manual intervention. In Auto-GNAS, we design the search algorithm with multiple genetic searchers. Each searcher can simultaneously use evaluation feedback information, information entropy, and search results from other searchers based on a sharing mechanism to improve search efficiency. As far as we know, this is the first work to use parallel computing to improve the system efficiency of graph neural architecture search. According to experiments on real datasets, Auto-GNAS obtains competitive model performance and better search efficiency than other search algorithms. Since the parallel estimation ability of Auto-GNAS is independent of the search algorithm, we extend Auto-GNAS with different search algorithms for scalability experiments. The results show that Auto-GNAS with varying search algorithms can achieve nearly linear acceleration as computing resources increase. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
12. ReHy: A ReRAM-Based Digital/Analog Hybrid PIM Architecture for Accelerating CNN Training.
- Author
-
Jin, Hai, Liu, Cong, Liu, Haikun, Luo, Ruikun, Xu, Jiahong, Mao, Fubing, and Liao, Xiaofei
- Subjects
- *
NONVOLATILE random-access memory , *CONVOLUTIONAL neural networks , *MATRIX multiplications , *LOGIC design - Abstract
Processing-in-Memory (PIM) has emerged as a high-performance and energy-efficient computing paradigm for accelerating convolutional neural network (CNN) applications. Resistive random access memory (ReRAM) has been widely used in PIM architectures due to its extremely high efficiency for accelerating matrix-vector multiplications through analog computing. However, because CNN training usually requires high-precision computation in the backward propagation (BP) stage, the limited precision of analog PIM accelerators impedes their adoption in CNN training. In this article, we propose ReHy, a hybrid PIM accelerator to support CNN training in ReRAM arrays. It is composed of Analog PIM (APIM) and Digital PIM (DPIM) modules. ReHy uses APIM to accelerate the feed-forward propagation (FP) stage for high performance, and DPIM to process the BP stage for high accuracy. We exploit the capability of ReRAM for Boolean logic operations to design the DPIM architecture. Particularly, we design floating-point multiplication and addition operators to support matrix multiplications in ReRAM arrays. We also propose a performance model to offload high-precision matrix multiplications to DPIM according to the data parallelism. Experimental results show that ReHy can speed up CNN training by 48.8× and 2.4×, and reduce energy consumption by 35.1× and 2.33×, compared with CPU/GPU architectures (baseline) and the state-of-the-art FloatPIM, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
13. Design for Real-Time Nonlinear Model Predictive Control With Application to Collision Imminent Steering.
- Author
-
Wurts, John, Stein, Jeffrey L., and Ersal, Tulga
- Subjects
TRAJECTORY optimization ,PREDICTION models ,AUTOMOTIVE engineering ,GRAPHICS processing units ,PARALLEL processing ,PARALLEL algorithms ,BEAM steering - Abstract
Model predictive control (MPC) is a competitive option in modern control systems due to its ability to account for future response and incorporation of complex control objectives. As applications become more intricate, nonlinearities limit the utility of linear control strategies, thus requiring more sophisticated architectures, often at a significant computational cost. This article investigates the computational cost of solving nonlinear MPC problems and provides a framework for designing nonlinear MPC architectures compatible with real-time performance. To motivate the computational complexity associated with nonlinear MPC, the design of an automotive collision imminent steering system and the controller is considered. Various trajectory optimization strategies are examined and compared for this application, identifying multiple-shooting-based Runge–Kutta explicit integration as the most suitable. The control algorithm is then mapped into a graphics processor unit-based hardware system, where special considerations of the parallel hardware architecture are discussed. Compared to the single-shooting solution as the benchmark, multiple shootings on parallel hardware achieve three orders of magnitude improvement in wall time, supporting real-time implementation. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
14. A Survey on Memory-centric Computer Architectures.
- Author
-
GEBREGIORGIS, ANTENEH, HOANG ANH DU NGUYEN, JINTAO YU, BISHNOI, RAJENDRA, TAOUIL, MOTTAQIALLAH, CATTHOOR, FRANCKY, and HAMDIOUI, SAID
- Subjects
COMPUTER architecture ,INTERNET surveys ,PARALLEL processing - Abstract
Faster and cheaper computers have been constantly demanding technological and architectural improvements. However, current technology is suffering from three technology walls: leakage wall, reliability wall, and cost wall. Meanwhile, existing architecture performance is also saturating due to three well-known architecture walls: memory wall, power wall, and instruction-level parallelism (ILP) wall. Hence, a lot of novel technologies and architectures have been introduced and developed intensively. Our previous work has presented a comprehensive classification and broad overview of memory-centric computer architectures. In this article, we aim to discuss the most important classes of memory-centric architectures thoroughly and evaluate their advantages and disadvantages. Moreover, for each class, the article provides a comprehensive survey on memory-centric architectures available in the literature. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
15. Improving performance of simultaneous multithreading CPUs using autonomous control of speculative traces.
- Author
-
Ortiz, Ryan F. and Lin, Wei-Ming
- Subjects
- *
PARALLEL processing , *COMPUTER architecture , *INFORMATION sharing - Abstract
Simultaneous Multithreading (SMT) allows a processor to concurrently execute multiple independent threads while sharing certain datapath components to minimize resource waste. Speculative execution allows these processors to take advantage of instruction-level parallelism, but the penalty for a misspeculation is wasted work in the shared resources, costing many clock cycles at a time. In this paper we show that an average of 13% of instructions are flushed as a result of incorrect predictions. These flushed instructions could have occupied shared resources that other, non-speculative threads could have used. This paper proposes a technique that dynamically adjusts how many speculative instructions a thread can rename and decode, aiming to diminish the waste of shared resources. Our simulation results show that, with the proposed technique, the average flushed-instruction rate is reduced by 23% and average throughput is improved by 13%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. SimpleSSD: Modeling Solid State Drives for Holistic System Simulation
- Author
-
Jung, Myoungsoo, Zhang, Jie, Abulila, Ahmed, Kwon, Miryeong, Shahidi, Narges, Shalf, John, Kim, Nam Sung, and Kandemir, Mahmut
- Subjects
Built Environment and Design ,Architecture ,Affordable and Clean Energy ,Hardware ,computer architecture ,parallel processing ,computational modeling ,systems simulation ,microprocessors ,software ,Computer Hardware & Architecture - Abstract
Existing solid state drive (SSD) simulators unfortunately lack hardware and/or software architecture models. Consequently, they are far from capturing the critical features of contemporary SSD devices. More importantly, while the performance of modern systems that adopt SSDs can vary based on their numerous internal design parameters and storage-level configurations, a full system simulation with traditional SSD models often requires unreasonably long runtimes and excessive computational resources. In this work, we propose SimpleSSD, a high-fidelity simulator that models all detailed characteristics of hardware and software, while simplifying the nondescript features of storage internals. In contrast to existing SSD simulators, SimpleSSD can easily be integrated into publicly-available full system simulators. In addition, it can accommodate a complete storage stack and evaluate the performance of SSDs along with diverse memory technologies and microarchitectures. Thus, it facilitates simulations that explore the full design space at different levels of system abstraction.
- Published
- 2018
17. STICKER-IM: A 65 nm Computing-in-Memory NN Processor Using Block-Wise Sparsity Optimization and Inter/Intra-Macro Data Reuse.
- Author
-
Yue, Jinshan, Liu, Yongpan, Yuan, Zhe, Feng, Xiaoyu, He, Yifan, Sun, Wenyu, Zhang, Zhixiao, Si, Xin, Liu, Ruhui, Wang, Zi, Chang, Meng-Fan, Dou, Chunmeng, Li, Xueqing, Liu, Ming, and Yang, Huazhong
- Subjects
MACRO processors ,ENERGY consumption ,SYSTEM integration ,VIDEO coding ,ARCHITECTURAL design ,COMPUTER architecture - Abstract
Computing-in-memory (CIM) is a promising architecture for energy-efficient neural network (NN) processors. Several CIM macros have demonstrated high energy efficiency, while CIM-based system-on-a-chip is not well explored. This work presents a CIM NN processor, named STICKER-IM, which is implemented with sophisticated system integration. Three key innovations are proposed. First, a CIM-friendly block-wise sparsity (BWS) architecture is designed, enabling both activation-sparsity-aware acceleration and weight-sparsity-aware power-saving. Second, an adaptive kernel-/channel-order (KCO) mapping and intra-/inter-macro scheduling strategy is proposed to improve macro utilization and data reuse. Third, an efficient BWS-optimized CIM (BWS-CIM) macro with adaptive power-OFF ADCs is implemented. The STICKER-IM chip was fabricated in 65-nm CMOS technology. Experimental results show 5.8–158 TOPS/W average system energy efficiency on the sparse NN models. The macro/system-level energy efficiency is 4.23×/3.06× higher compared with the state-of-the-art CIM macros and processors. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
18. CAP: Communication-Aware Automated Parallelization for Deep Learning Inference on CMP Architectures.
- Author
-
Zou, Kaiwei, Wang, Ying, Cheng, Long, Qu, Songyun, Li, Huawei, and Li, Xiaowei
- Abstract
Real-time inference of deep learning models on embedded and energy-efficient devices becomes increasingly desirable with the rapid growth of artificial intelligence on edge. Specifically, to achieve superb energy-efficiency and scalability, efficient parallelization of single-pass deep neural network (DNN) inference on chip multiprocessor (CMP) architectures is urgently required by many time-sensitive applications. However, as the number of processing cores scales up and the performance of cores has grown much faster, the on-chip inter-core data movement is prone to be a performance bottleneck for computation. To remedy this problem and further improve the performance of network inference, in this work, we introduce a communication-aware DNN parallelization technique called CAP, by exploiting the elasticity and noise-tolerance of deep learning algorithms on CMP. Moreover, in the hope that the conducted studies can provide new design values for real-time neural network inference on embedded chips, we also have evaluated the proposed approach on both multi-core Neural Network Accelerators (NNA) chips and general-purpose chip-multiprocessors. Our experimental results show that the proposed CAP can achieve 1.12×-1.65× system speedups and 1.14×-2.70× energy efficiency for different neural networks while maintaining the inference accuracy, compared to baseline approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
19. Exploring Data Analytics Without Decompression on Embedded GPU Systems.
- Author
-
Pan, Zaifeng, Zhang, Feng, Zhou, Yanliang, Zhai, Jidong, Shen, Xipeng, Mutlu, Onur, and Du, Xiaoyong
- Subjects
- *
GRAPHICS processing units , *COMPUTER architecture , *ENERGY consumption , *RANDOM access memory - Abstract
With the development of computer architecture, even for embedded systems, GPU devices can be integrated, providing outstanding performance and energy efficiency to meet the requirements of different industries, applications, and deployment environments. Data analytics is an important application scenario for embedded systems. Unfortunately, due to capacity constraints of embedded devices, the scale of problems an embedded system can handle is limited. In this paper, we propose a novel data analytics method, called G-TADOC, for efficient text analytics directly on compression on embedded GPU systems. A large amount of data can be compressed and stored in embedded systems, and can be processed directly in the compressed state, which greatly enhances the processing capabilities of the systems. Particularly, G-TADOC has three innovations. First, a novel fine-grained thread-level workload scheduling strategy for GPU threads has been developed, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, a GPU thread-safe memory pool has been developed to handle inconsistency with low synchronization overheads. Third, a sequence-support strategy is provided to maintain high GPU parallelism while ensuring sequence information for lossless compression. Moreover, G-TADOC involves special optimizations for embedded GPUs, such as utilizing the CPU-GPU shared unified memory. Experiments show that G-TADOC provides 13.2× average speedup compared to the state-of-the-art TADOC. G-TADOC also improves performance-per-cost by 2.6× and energy efficiency by 32.5× over TADOC. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
20. VSDCA: A Voltage Sensing Differential Column Architecture Based on 1T2R RRAM Array for Computing-in-Memory Accelerators.
- Author
- Jing, Zhaokun, Yan, Bonan, Yang, Yuchao, and Huang, Ru
- Subjects
- COLUMNS, VOLTAGE, ENERGY consumption, PARALLEL programming, STATIC random access memory
- Abstract
Non-volatile memory (NVM) such as RRAM and PCM has become a key component of high-energy-efficiency computing-in-memory (CIM) architectures. However, the computing accuracy and energy efficiency of the conventional current-sensing CIM scheme based on 1T1R RRAM arrays are hindered by device variation and large output currents. In this work, we propose a voltage sensing differential column architecture (VSDCA) based on a 1T2R RRAM array for binary memory and CIM applications. The memory mode of the VSDCA macro improves the relative read margin by $1.12\times $ to $5.29\times $ compared to conventional 1T1R current-sensing memory. The computing mode supports high-precision (8-bit input, 9-bit weight, 18-bit output), fully row-parallel computing. The VSDCA macro design is evaluated under the SMIC 40 nm technology node, where the energy efficiency of the high-precision CIM reaches 39.52 TOPS/W. The CIFAR10 inference accuracy of the simulated VGG16 and ResNet18 models is 85.91% and 89.32%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
21. A Survey of Deep Learning on CPUs: Opportunities and Co-Optimizations.
- Author
- Mittal, Sparsh, Rajput, Poonam, and Subramoney, Sreenivas
- Subjects
- DEEP learning, COMPUTER architecture, ARTIFICIAL intelligence, CENTRAL processing units, GRAPHICS processing units, PARALLEL programming
- Abstract
The CPU is a powerful, pervasive, and indispensable platform for running deep learning (DL) workloads in systems ranging from mobile devices to extreme-end servers. In this article, we present a survey of techniques for optimizing DL applications on CPUs. We cover the methods proposed for both inference and training, and those offered in the context of mobile, desktop/server, and distributed systems. We identify the areas of strength and weakness of CPUs in the field of DL. This article will interest practitioners and researchers in the areas of artificial intelligence, computer architecture, mobile systems, and parallel computing. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
22. ViP: A Hierarchical Parallel Vision Processor for Hybrid Vision Chip.
- Author
- Zheng, Xuemin, Cheng, Li, Zhao, Mingxin, Luo, Qian, Li, Honglong, Dou, Runjiang, Yu, Shuangming, Wu, Nanjian, and Liu, Liyuan
- Abstract
Nowadays, vision chips that bridge sensing and processing are extensively employed in high-speed image processing, owing to their excellent performance, low power consumption, and economical cost. However, designing a processor that supports both conventional computer vision algorithms and neural networks poses a dilemma, since the two classes of algorithms impose a non-trivial trade-off on any unified architecture. By analyzing their computation properties, we propose a novel hierarchical parallel vision processor (ViP) for hybrid vision chips that accelerates both traditional computer vision (CV) and neural networks (NN). The ViP architecture includes three parallelism levels: PEs for pixel-centric, computing cores (CC) for block-level, and vision cores (VC) for global processing. PEs contain dedicated computing units and data paths for convolution operations without degrading their flexibility. Each CC is driven by customized SIMD instructions and can be dynamically connected to meet block-parallelism requirements. ViP is fabricated in 65 nm CMOS technology and achieves a peak performance of 614.4 GOPS and an energy efficiency of 640 GOPS/W at a 200 MHz clock frequency. Notably, several experiments on CV and NN workloads illustrate ultra-low latency in executing hybrid algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
23. Compiler-Assisted Compaction/Restoration of SIMD Instructions.
- Author
- Cebrian, Juan M., Balem, Thibaud, Barredo, Adrian, Casas, Marc, Moreto, Miquel, Ros, Alberto, and Jimborean, Alexandra
- Subjects
- HIGH performance computing, COMPACTING, SUPERCOMPUTERS, COMPUTER systems, ENERGY consumption, COMPUTER architecture
- Abstract
Vector processors (e.g., SIMD or GPUs) are ubiquitous in high performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution have significant challenges. Among these challenges, control flow divergence is one of the main performance limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption of predicated codes are usually insensitive to the number of active elements in a predicate mask. As vector register sizes continue to increase, the energy efficiency of exascale computing systems will become sub-optimal. This article proposes a novel approach to improve execution efficiency in predicated vector codes, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline CR delays predicated SIMD instructions with inactive elements, compacting active elements from instances of the same instruction across consecutive loop iterations. Compacted elements form an equivalent dense vector instruction. After executing the dense instructions, their results are restored to the original instructions. However, CR incurs a significant performance and energy penalty when it fails to find active elements, either due to lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for the key information required to configure CR. It then passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements in scenarios where it would fail to form dense instructions. Simulated results (gem5) show that CACR improves performance by up to 29 percent and reduces dynamic energy by up to 24.2 percent on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6 percent performance and 14 percent energy improvements for the same configuration and applications. [ABSTRACT FROM AUTHOR]
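The compaction/restoration idea can be modeled in scalar Python (a behavioral sketch only; the name `compact_restore` and the list-based representation are assumptions, and the compiler-assistance part of CACR, which decides when compaction is worthwhile, is not modeled):

```python
def compact_restore(values, masks, op, vlen=4):
    """Sketch of compaction/restoration for predicated vector ops.

    `values[i]` / `masks[i]` are the operands and predicate of iteration i.
    Active elements from consecutive iterations are compacted into dense
    vectors of width `vlen`, `op` runs once per dense vector, and results
    are scattered back ("restored") to their original (iteration, lane)
    positions. Inactive lanes keep a placeholder of None.
    """
    # Gather (iteration, lane) coordinates of active elements.
    coords = [(i, j) for i, m in enumerate(masks)
                     for j, active in enumerate(m) if active]
    results = [[None] * len(m) for m in masks]
    # Process active elements in dense groups of `vlen`.
    for start in range(0, len(coords), vlen):
        group = coords[start:start + vlen]
        dense_in = [values[i][j] for i, j in group]
        dense_out = op(dense_in)           # one dense vector instruction
        for (i, j), r in zip(group, dense_out):
            results[i][j] = r              # restoration step
    return results

# Two iterations, each only half active -> one dense 4-wide op instead of two.
out = compact_restore(
    values=[[1, 2, 3, 4], [5, 6, 7, 8]],
    masks=[[1, 0, 1, 0], [0, 1, 0, 1]],
    op=lambda xs: [x * 10 for x in xs],
)
print(out)  # [[10, None, 30, None], [None, 60, None, 80]]
```

In the example, two half-active 4-wide iterations become a single dense 4-wide operation, which is exactly the utilization gain CR targets.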
- Published
- 2022
- Full Text
- View/download PDF
24. An Application Specific Vector Processor for Efficient Massive MIMO Processing.
- Author
- Attari, Mohammad, Ferreira, Lucas, Liu, Liang, and Malkowsky, Steffen
- Subjects
- PARALLEL processing, PROCESS capability, MIMO systems, TELECOMMUNICATION equipment, BASEBAND, MIMO radar
- Abstract
This paper presents an implementation of a baseband massive multiple-input multiple-output (MIMO) application-specific instruction set processor (ASIP). The ASIP is equipped with vector processing capabilities in the form of single instruction multiple data (SIMD), and furthermore exploits instruction-level parallelism by employing a very long instruction word (VLIW) architecture. Additionally, a systolic array tuned to speed up matrix calculations is built into the pipeline. A parallel memory subsystem and stand-alone accelerators are integrated into the ASIP architecture in order to meet the processing requirements. The processor is synthesized in 22 nm FD-SOI technology running at a clock frequency of 800 MHz. The system achieves a maximum detection throughput of 0.75 Gb/s/mm² for a $128\times 8$ massive MIMO system. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
25. A High-Performance Domain-Specific Processor With Matrix Extension of RISC-V for Module-LWE Applications.
- Author
- Zhao, Yifan, Xie, Ruiqi, Xin, Guozhu, and Han, Jun
- Subjects
- PARALLEL processing, COMMUNICATION infrastructure, EDGE computing, CRYPTOGRAPHY, MATRICES (Mathematics)
- Abstract
The 5G edge computing infrastructure should be empowered with quantum attack resistance by implementing post-quantum cryptography (PQC). Among various PQC schemes, lattice-based cryptography (LBC) based on learning with errors (LWE) has attracted much attention because of its performance efficiency and security guarantees. Among LWE-based LBCs, the Module-LWE-based schemes have an advantage over the others, benefiting from their unique polynomial matrix and vector structure. To provide a high-performance implementation of Module-LWE applications for the edge computing paradigm, we propose a domain-specific processor based on a matrix extension of the RISC-V architecture. This custom extension encapsulates the matrix-based ring operations with a high-level functional abstraction. A 2-D systolic array with configurable functionality is proposed to perform matrix-based number theoretic transform (NTT) and other arithmetic operations, achieving high data-level parallelism with support for the variable-sized polynomial matrix and vector structure. As this structure of Module-LWE involves no data dependency between different inner elements, an out-of-order mechanism is further developed to exploit instruction-level parallelism. We implement the proposed architecture in TSMC 28 nm technology. The evaluation results show that our implementation achieves up to $3.5\times $ and $3.3\times $ improvement in cycle count in Kyber and Dilithium, respectively, compared to state-of-the-art crypto-processor counterparts. [ABSTRACT FROM AUTHOR]
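The number theoretic transform the abstract mentions reduces polynomial multiplication to pointwise products. A minimal, naive roundtrip over toy parameters (q = 17, n = 8, omega = 2; these are illustrative, not Kyber's or Dilithium's real parameters) looks like:

```python
# Toy number theoretic transform (NTT): the core ring operation that
# Module-LWE schemes accelerate. 2 is a primitive 8th root of unity mod 17.
q, n, omega = 17, 8, 2

def ntt(a, w):
    """Naive O(n^2) forward transform: evaluate a at powers of w mod q."""
    return [sum(a[j] * pow(w, i * j, q) for j in range(n)) % q
            for i in range(n)]

def intt(A):
    """Inverse transform: use w^-1 and scale by n^-1 mod q (Python 3.8+)."""
    n_inv = pow(n, -1, q)
    w_inv = pow(omega, -1, q)
    return [x * n_inv % q for x in ntt(A, w_inv)]

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert intt(ntt(a, omega)) == a   # roundtrip recovers the input
```

Real implementations use O(n log n) butterfly structures and the schemes' actual moduli; the paper maps the transform onto a configurable 2-D systolic array instead.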
- Published
- 2022
- Full Text
- View/download PDF
26. Svelto: High-Level Synthesis of Multi-Threaded Accelerators for Graph Analytics.
- Author
- Minutoli, Marco, Castellana, Vito Giovanni, Saporetti, Nicola, Devecchi, Stefano, Lattuada, Marco, Fezzardi, Pietro, Tumeo, Antonino, and Ferrandi, Fabrizio
- Subjects
- FIELD programmable gate arrays, PARALLEL processing, GRAPH algorithms, GRAPHICS processing units
- Abstract
Graph analytics is an emerging class of irregular applications. Operating on very large datasets, these applications present unique behaviors, such as fine-grained, unpredictable memory accesses and highly unbalanced task-level parallelism, that make existing high-performance general-purpose processors or accelerators (e.g., GPUs) suboptimal. To address these issues, research and industry are developing a variety of custom accelerator designs for this application area, including solutions based on reconfigurable devices (Field Programmable Gate Arrays). These new approaches often employ High-Level Synthesis (HLS) to speed up the development of the accelerators. In this paper, we propose a novel architecture template for the automatic generation of accelerators for graph analytics and irregular applications. The architecture template includes a dynamic task scheduling mechanism, a parallel array of accelerators that supports task-level parallelism with context switching, and a related multi-channel memory interface that decouples communication from computation and provides support for fine-grained atomic memory operations. We discuss the integration of the architectural template in an HLS flow, presenting the modifications necessary to enable automatic generation of the custom architectures starting from OpenMP-annotated code. We evaluate our approach first by synthesizing and exploring triangle counting, a common graph algorithm, and then by synthesizing custom designs for a set of graph database benchmark queries, representing series of graph pattern matching routines. We compare the synthesized accelerators with previous state-of-the-art methodologies for the synthesis of parallel architectures, showing that the proposed approach reduces resource usage by optimizing the number of accelerator replicas without any performance penalty. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
27. Parallel Pipelined Architecture and Algorithm for Matrix Transposition Using Registers.
- Author
- Zhang, Bo, Ma, Zhenguo, and Luo, Wei
- Abstract
In this brief, we present a new algorithm and architecture for continuous-flow matrix transposition using registers. The algorithm supports $P$-parallel matrix transposition. The hardware architecture reaches the theoretical minimums in terms of latency and memory. It is composed of a group of identical cascaded basic swap circuits, whose stages are determined by the corresponding algorithm, and can be controlled via a set of counters. Compared with the state-of-the-art architecture, the proposed architecture supports matrices whose rows and columns are integer multiples of $P$. Here $P$ can be arbitrary, including but not limited to power-of-two integers. Moreover, our results provide additional insight into continuous-flow non-square matrix transposition. [ABSTRACT FROM AUTHOR]
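The continuous-flow dataflow can be illustrated with a behavioral Python model (names and the buffering strategy here are assumptions; the paper's register-based swap network achieves the same reordering with the theoretical minimum of storage and latency):

```python
def transpose_stream(stream, rows, cols, P=2):
    """Behavioral model of continuous-flow matrix transposition: the
    matrix arrives row-major, P elements per cycle, and leaves
    column-major, P elements per cycle. Rows and cols must be multiples
    of P, mirroring the architecture's constraint. This models only the
    dataflow, not the register-minimal swap network of the paper."""
    assert rows % P == 0 and cols % P == 0
    flat = list(stream)
    assert len(flat) == rows * cols
    # Reassemble, transpose, and re-emit P elements per "cycle".
    matrix = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    transposed = [matrix[r][c] for c in range(cols) for r in range(rows)]
    return [transposed[i:i + P] for i in range(0, len(transposed), P)]

# 2x4 matrix streamed row-major; output beats deliver it column-major.
beats = transpose_stream([1, 2, 3, 4, 5, 6, 7, 8], rows=2, cols=4, P=2)
print(beats)  # [[1, 5], [2, 6], [3, 7], [4, 8]]
```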
- Published
- 2022
- Full Text
- View/download PDF
28. An Efficient Stochastic Convolution Architecture Based on Fast FIR Algorithm.
- Author
- Wang, Huizheng, Xu, Weihong, Zhang, Zaichen, You, Xiaohu, and Zhang, Chuan
- Abstract
By utilizing stochastic computing (SC), the hardware consumption of convolutional neural networks (CNNs) can be decreased significantly. However, long stream lengths are required to produce acceptable results, which leads to extended computation time. As a result, the inherent random fluctuation error and long latency of processing random bitstreams have made previous SC-CNN implementations inefficient compared with conventional binary designs. To address these issues, this brief proposes an efficient stochastic convolution architecture based on the fast FIR algorithm (FFA), which reduces the computational complexity. Further, the combination of two-line SC and Sobol sequences is applied to decrease the number of processing cycles. Functional simulation targeting LeNet-5 on the MNIST dataset and RTL synthesis results show that the proposed design yields higher area efficiency than previous SC-based designs and achieves 64% and 11% higher area and energy efficiency, respectively, compared to the 5-bit fixed-point design while maintaining comparable accuracy. [ABSTRACT FROM AUTHOR]
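In unipolar stochastic computing, a multiply is just a bitwise AND of two independent bitstreams, which is where the hardware savings come from. A minimal sketch (using a plain pseudo-random generator; the paper's design uses Sobol sequences precisely because they reach comparable accuracy with far shorter streams):

```python
import random

def to_stream(p, length, rng):
    """Unipolar SC encoding: a value p in [0, 1] becomes a bitstream
    whose fraction of 1s is approximately p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(p, q, length=4096, seed=7):
    """In unipolar stochastic computing, multiplication is a bitwise AND
    of two independent bitstreams. Illustrative sketch only; longer
    streams give better accuracy at the cost of latency, the trade-off
    the paper attacks."""
    rng = random.Random(seed)
    a = to_stream(p, length, rng)
    b = to_stream(q, length, rng)
    ones = sum(x & y for x, y in zip(a, b))
    return ones / length            # estimates p * q

est = sc_multiply(0.5, 0.25)
print(est)  # close to the exact product 0.125
```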
- Published
- 2022
- Full Text
- View/download PDF
29. Repurposing GPU Microarchitectures with Light-Weight Out-Of-Order Execution.
- Author
- Iliakis, Konstantinos, Xydis, Sotirios, and Soudris, Dimitrios
- Subjects
- GRAPHICS processing units, PARALLEL processing, VERNACULAR architecture, COMPUTER architecture
- Abstract
GPU is the dominant platform for accelerating general-purpose workloads due to its computing capacity and cost-efficiency. GPU applications cover an ever-growing range of domains. To achieve high throughput, GPUs rely on massive multi-threading and fast context switching to overlap computations with memory operations. We observe that among the diverse GPU workloads, there exists a significant class of kernels that fail to maintain a sufficient number of active warps to hide the latency of memory operations, and thus suffer from frequent stalling. We argue that the dominant Thread-Level Parallelism model is not enough to efficiently accommodate the variability of modern GPU applications. To address this inherent inefficiency, we propose a novel micro-architecture with lightweight Out-Of-Order execution capability enabling Instruction-Level Parallelism to complement the conventional Thread-Level Parallelism model. To minimize the hardware overhead, we carefully design our extension to highly re-use the existing micro-architectural structures and study various design trade-offs to contain the overall area and power overhead, while providing improved performance. We show that the proposed architecture outperforms traditional platforms by 23 percent on average for low-occupancy kernels, with an area and power overhead of 1.29 and 10.05 percent, respectively. Finally, we establish the potential of our proposal as a micro-architecture alternative by providing 16 percent speedup over a wide collection of 60 general-purpose kernels. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
30. Scalable and Programmable Neural Network Inference Accelerator Based on In-Memory Computing.
- Author
- Jia, Hongyang, Ozatay, Murat, Tang, Yinqi, Valavi, Hossein, Pathak, Rakshit, Lee, Jinseok, and Verma, Naveen
- Subjects
- LIBRARY software, ENERGY consumption, ARTIFICIAL neural networks, COMPUTER architecture, BIT error rate
- Abstract
This work demonstrates a programmable in-memory-computing (IMC) inference accelerator for scalable execution of neural network (NN) models, leveraging a high-signal-to-noise ratio (SNR) capacitor-based analog technology. IMC accelerates computations and reduces memory accesses for matrix-vector multiplies (MVMs), which dominate in NNs. The accelerator architecture focuses on scalable execution, addressing the overheads of state swapping and the challenges of maintaining high utilization across highly dense and parallel hardware. The architecture is based on a configurable on-chip network (OCN) and a scalable array of cores, which integrate mixed-signal IMC with programmable near-memory single-instruction multiple-data (SIMD) digital computing, configurable buffering, and programmable control. The cores enable flexible NN execution mappings that exploit data- and pipeline-parallelism to address utilization and efficiency across models. A prototype is presented, incorporating a $4 \times 4$ array of cores demonstrated in 16 nm CMOS, achieving peak multiply-accumulate (MAC)-level throughput of 3 TOPS and peak MAC-level energy efficiency of 30 TOPS/W, both for 8-b operations. The measured results show high accuracy of the analog computations, matching bit-true simulations. This enables the abstractions required for robust and scalable architectural and software integration. Developed software libraries and NN-mapping tools are used to demonstrate CIFAR-10 and ImageNet classification, with an 11-layer CNN and ResNet-50, respectively, achieving accuracy, throughput, and energy efficiency of 91.51% and 73.33%, 7815 and 581 image/s, and 51.5 k and 3.0 k image/s/W, with 4-b weights and activations. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
31. Navigating the Seismic Shift of Post-Moore Computer Systems Design.
- Author
- Banerjee, Anindya, Basu, Sankar, Brunvand, Erik, Mazumder, Pinaki, Cleaveland II, Walter R., Singh, Gurdip, Martonosi, Margaret, and Pembleton, Fernanda
- Subjects
- COMPUTER engineering, SYSTEMS design, COMPUTER systems, MOORE'S law, VERY large scale circuit integration
- Abstract
Reports on the history and development of computer aided design systems. In quick succession between 1964 and 1971, our field saw the proposal of Moore's law, the coining of the term "computer architecture," and the introduction of the first microprocessor. For much of the five decades since then, we have benefitted extraordinarily from both the dynamism of Moore's law transistor scaling and the stable durability of the hardware–software abstractions of computer architecture. The dynamic duo of Moore's law and computer architecture have allowed massive scaling to occur, and also to be navigated smoothly with relatively little software impact. For example, in the late 1980s and early 1990s, surges in power density occurred as we reached challenging limits in very large scale integration (VLSI) designs based on bipolar transistors; a technology transition from bipolar to complementary metal–oxide–semiconductor (CMOS) occurred with relatively little impact or awareness from the software portion of the computing community. Over the past 10–15 years however, more fundamental shifts have occurred. For example, Dennard scaling, a companion phenomenon to Moore's law stating that power density could remain stable while transistor sizes shrank, is reaching physical limits. This means that further Moore's law increases in transistor counts are becoming more complex and are only achieved with great effort and at higher power-density costs. Furthermore, as we reach fundamental physical limits in the functioning of small semiconductor transistors, Moore's law itself is being challenged by the increased physical effort and financial expense required to maintain transistor scaling trends. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
32. Low-Complexity and Low-Latency SVC Decoding Architecture Using Modified MAP-SP Algorithm.
- Author
- Hong, Seungwoo, Kam, Dongyun, Yun, Sangbu, Choe, Jeongwon, Lee, Namyoon, and Lee, Youngjoo
- Subjects
- DECODING algorithms, ALGORITHMS, STATIC VAR compensators, COMPUTER architecture, PARALLEL processing
- Abstract
The compressive sensing (CS) based sparse vector coding (SVC) method is one of the promising approaches for next-generation ultra-reliable and low-latency communications. In this paper, we present advanced algorithm-hardware co-optimization schemes for realizing a cost-effective SVC decoding architecture. The previous maximum a posteriori subspace pursuit (MAP-SP) algorithm is newly modified to reduce the computational overheads by applying novel residual forwarding and LLR approximation schemes. A fully-pipelined parallel hardware design is also developed to support the modified decoding algorithm, reducing the overall processing latency, especially at the support identification step. In addition, an advanced least-squares problem solver is presented by utilizing a parallel Cholesky decomposer design, further reducing the decoding latency with parallel updates of support values. Implementation results in a 22nm FinFET technology show that the fully-optimized design is 9.6 times faster while improving the area efficiency by 12 times compared to the baseline realization. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
33. Investigating memory prefetcher performance over parallel applications: From real to simulated.
- Author
- Girelli, Valéria S., Moreira, Francis B., Serpa, Matheus S., Carastan‐Santos, Danilo, and Navaux, Philippe O. A.
- Subjects
- MEMORY, PARALLEL programming, CACHE memory, PARALLEL processing, COMPUTER architecture
- Abstract
Memory prefetcher algorithms are widely used in processors to mitigate the performance gap between the processors and the memory subsystem. The complexities behind the architectures and prefetcher algorithms, however, not only hinder the development of accurate architecture simulators, but also hinder understanding the prefetcher's contribution to performance, both on real hardware and in a simulated environment. In this paper, we contribute to shedding light on the memory prefetcher's role in the performance of parallel High-Performance Computing applications, considering the prefetcher algorithms offered by both the real hardware and the simulators. We performed a careful experimental investigation, executing the NAS Parallel Benchmarks (NPB) on a real Skylake machine as well as in a simulated environment with the ZSim and Sniper simulators, taking into account the prefetcher algorithms offered by both Skylake and the simulators. Our experimental results show that: (i) prefetching from the L3 to the L2 cache yields better performance gains, (ii) memory contention in parallel execution constrains the prefetcher's effect, (iii) Skylake's parallel memory contention is poorly simulated by ZSim and Sniper, and (iv) Skylake's noninclusive L3 cache hinders the accurate simulation of NPB with Sniper's prefetchers. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
34. MPV—Parallel Readout Architecture for the VME Data Acquisition System.
- Author
- Baba, H., Ichihara, T., Isobe, T., Ohnishi, T., Yoshida, K., Watanabe, Y., Ota, S., Shimizu, H., Shimoura, S., Takeuchi, S., Nishimura, D., Zenihiro, J., Tokiyasu, A. O., and Yokoyama, R.
- Subjects
- DATA acquisition systems, PARALLEL processing, GATE array circuits, FIELD programmable gate arrays, ACQUISITION of data
- Abstract
The mountable controller with parallelized VERSA Module Eurocard (VME) (MPV) is a VME-compatible system with a parallel readout architecture. This article presents the system architecture and its data acquisition performance. In this system, the readout sequence is implemented in a field-programmable gate array (FPGA) to achieve the ideal VME bus speed. Data from multiple VME slave modules are read out in parallel, merged, and sent to a server. A maximum data throughput of 400 Mbps was achieved. Thus, the MPV system can dramatically improve the performance of VME data acquisition systems. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
35. nZESPA: A Near-3D-Memory Zero Skipping Parallel Accelerator for CNNs.
- Author
- Das, Palash and Kapoor, Hemangee K.
- Subjects
- CONVOLUTIONAL neural networks, MACHINE learning, COMPUTER architecture, PARALLEL processing, ENERGY consumption, COMPUTER vision
- Abstract
Convolutional neural networks (CNNs) are one of the most popular machine learning tools for computer vision. Their ubiquitous use across applications, together with their high computation cost, makes them attractive candidates for optimization through accelerated architectures. State-of-the-art designs have either exploited the parallelism of CNNs, eliminated computations through sparsity, or used near-memory processing (NMP) to accelerate CNNs. We introduce an NMP, fully sparse architecture that combines all three capabilities. The proposed architecture is parallel and hence processes independent CNN tasks concurrently. To exploit sparsity, the proposed system employs a dataflow, namely the Near-3D-Memory Zero Skipping Parallel dataflow, or nZESPA dataflow. This dataflow maintains a compressed-sparse encoding of data and skips all ineffectual zero-valued computations of CNNs. We design a custom accelerator that employs the nZESPA dataflow. Grids of nZESPA modules are integrated into the logic layer of the hybrid memory cube. This integration saves a significant amount of off-chip communication while implementing the concept of NMP. We compare the proposed architecture with three other architectures that either do not exploit sparsity (NMP-dense), do not employ NMP (traditional-fully sparse), or do neither (traditional-dense). The proposed system outperforms these baselines in terms of performance and energy consumption when executing CNN inference. [ABSTRACT FROM AUTHOR]
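The zero-skipping idea behind the nZESPA dataflow can be illustrated with a compressed-sparse dot product (a toy sketch; the function names are hypothetical, and the real design skips zeros in hardware inside the 3D-memory logic layer):

```python
def to_compressed(dense):
    """Compressed-sparse encoding: keep only nonzero values with their
    positions, so zero-valued work is never even represented."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]

def sparse_dot(comp_a, dense_b):
    """One inner product of a convolution, computed only over the nonzero
    entries of the compressed operand; zero multiplications are skipped
    entirely rather than computed and discarded. Illustrative sketch of
    the zero-skipping idea, not the nZESPA dataflow itself."""
    return sum(v * dense_b[i] for i, v in comp_a)

weights = [0, 3, 0, 0, 2, 0, 0, 1]       # sparse filter taps
activations = [5, 1, 4, 1, 5, 9, 2, 6]
comp = to_compressed(weights)
print(len(comp))                          # 3 multiplies instead of 8
print(sparse_dot(comp, activations))      # 3*1 + 2*5 + 1*6 = 19
```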
- Published
- 2021
- Full Text
- View/download PDF
36. Hardware-Efficient and High-Throughput LLRC Segregation Based Binary QC-LDPC Decoding Algorithm and Architecture.
- Author
- Verma, Anuj and Shrestha, Rahul
- Abstract
This brief proposes a hardware-friendly QC-LDPC decoding algorithm with layered scheduling, based on a new logarithmic-likelihood-ratio compound (LLRC) segregation technique. Subsequently, we present a hardware-efficient QC-LDPC decoder architecture based on the proposed algorithm and additional architectural optimizations. This decoder has been designed to the 5G-NR specifications, supporting code-lengths and code-rates in the ranges of 26112–10368 bits and 1/3–8/9, respectively. Performance analysis has shown that the suggested LLRC-segregation based decoding algorithm delivers an adequate FER of $10^{-5}$ over an SNR range of 1 to 6.5 dB. Furthermore, the proposed QC-LDPC decoder is post-route simulated and implemented on an FPGA platform. It operates at a maximum clock frequency of 135 MHz and delivers a peak throughput of 11.02 Gbps. Finally, comparison with relevant works shows that our decoder delivers $2.2\times $ higher throughput and $8.3\times $ better hardware-efficiency than state-of-the-art implementations. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
37. High-Speed Modular Multiplier for Lattice-Based Cryptosystems.
- Author
- Tan, Weihang, Case, Benjamin M., Wang, Antian, Gao, Shuhong, and Lao, Yingjie
- Abstract
Thanks to the inherent post-quantum resistant properties, lattice-based cryptography has gained increasing attention in various cryptographic applications recently. To facilitate the practical deployment, efficient hardware architectures are demanded to accelerate the operations and reduce the computational resources, especially for the polynomial multiplication, which is the bottleneck of lattice-based cryptosystems. In this brief, we present a novel high-speed modular multiplier architecture for polynomial multiplication. The proposed architecture employs a divide and conquer strategy and exploits a special modulus to increase the parallelism and speed up the calculation, while enabling wider applications across various cryptosystems. The experimental results show that our design achieves around 27% and 39% reduction on the area consumption and delay, respectively, compared to prior works. [ABSTRACT FROM AUTHOR]
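The benefit of a "special modulus" can be shown with a Fermat-style prime, where reduction collapses to a split and a subtraction (the modulus q = 2^16 + 1 below is illustrative only; the paper exploits its own modulus family inside a divide-and-conquer multiplier):

```python
# For q = 2^K + 1 we have 2^K ≡ -1 (mod q), so reducing a 2K-bit product
# needs only a split and a subtraction instead of a general division.
K = 16
Q = (1 << K) + 1

def reduce_special(x):
    """Reduce 0 <= x < Q*Q modulo Q using x = hi*2^K + lo ≡ lo - hi."""
    hi, lo = x >> K, x & ((1 << K) - 1)
    r = lo - hi
    while r < 0:          # at most a couple of corrective additions
        r += Q
    return r

a, b = 40000, 50000
assert reduce_special(a * b) == (a * b) % Q
```

In hardware, the shift and mask are free wiring, so the whole reduction costs one subtractor plus a conditional add, which is what makes such moduli attractive for fast polynomial arithmetic.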
- Published
- 2021
- Full Text
- View/download PDF
38. Making Frequent-Pattern Mining Scalable, Efficient, and Compact on Nonvolatile Memories.
- Author
- Yang, Chaoshu, Huang, Po-Chun, Lin, Yi, Dong, Jiaqi, Liu, Duo, Tan, Yujuan, and Liang, Liang
- Subjects
- NONVOLATILE memory, PARALLEL processing, SEQUENTIAL pattern mining, RANDOM access memory, IMAGE compression
- Abstract
Frequent-pattern mining is a common means to reveal the hidden trends behind data. However, most frequent-pattern mining algorithms are designed for dynamic random-access memory (DRAM) rather than the nonvolatile memories (NVMs) preferred by energy-limited systems. Due to the huge differences between the characteristics of NVMs and those of DRAM, existing frequent-pattern mining algorithms encounter write amplification and energy waste when run on NVMs. Moreover, the design complexity is exacerbated when a parallel computing architecture is introduced to speed up the mining process. A scalable, time-efficient, and energy-economic solution to the frequent-pattern mining problem is thus urgently needed. Based on the well-known frequent-pattern tree (FP-tree) approach to frequent-pattern mining, this article proposes the parallel EvFP-tree (PevFP-tree), a parallel frequent-pattern mining solution for NVMs. By considering the NVM characteristics, PevFP-tree accelerates the mining process and enhances energy efficiency compared to a straightforward design of FP-trees on the parallel architecture. Moreover, PevFP-tree offers superior scalability in terms of the degree of parallelism of the mining algorithm and the branching factor of its tree structure. Observing that keys are often sparsely distributed in FP-trees, we also propose a compression technique for PevFP-tree, namely the compressed PevFP-tree (CpevFP-tree), which further enhances the time and energy efficiency of PevFP-tree. The proposed PevFP-tree and CpevFP-tree are evaluated in a series of experiments based on realistic datasets from diversified application scenarios, where CpevFP-tree achieves performance improvements of 88.73% over a straightforward design of FP-trees in the parallel architecture and 79.47% over PevFP-tree, on average. [ABSTRACT FROM AUTHOR]
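The FP-tree structure at the heart of this line of work can be sketched minimally in Python (classic construction only; the header-link table, the EvFP/PevFP parallelism, and the NVM-aware compression of the paper are not modeled):

```python
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    """Minimal FP-tree construction: a frequency-ordered prefix tree with
    per-node counts, the base structure the article builds on."""
    freq = Counter(item for t in transactions for item in t)
    root = Node(None)
    for t in transactions:
        # Keep frequent items, in descending global frequency, so common
        # prefixes share nodes; this sharing is the tree's compression.
        path = sorted((i for i in t if freq[i] >= min_support),
                      key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

tree = build_fp_tree([["a", "b"], ["b", "c"], ["a", "b", "c"]], min_support=2)
print(tree.children["b"].count)  # "b" heads every reordered path -> 3
```

The NVM-oriented redesign matters because every `count += 1` is a write; reducing and batching such updates is exactly where write amplification on NVMs is fought.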
- Published
- 2021
- Full Text
- View/download PDF
39. Using HEP experiment workflows for the benchmarking and accounting of WLCG computing resources.
- Author
- Doglioni, C., Kim, D., Stewart, G.A., Silvestris, L., Jackson, P., Kamleh, W., Valassi, Andrea, Alef, Manfred, Barbet, Jean-Michel, Datskova, Olga, De Maria, Riccardo, Fontes Medeiros, Miguel, Giordano, Domenico, Grigoras, Costin, Hollowell, Christopher, Javurkova, Martina, Khristenko, Viktor, Lange, David, Michelotto, Michele, and Rinaldi, Lorenzo
- Subjects
- GRID computing, LARGE Hadron Collider, PROGRAM transformation, COMPUTER architecture, PARALLEL processing
- Abstract
Benchmarking of CPU resources in WLCG has been based on the HEP-SPEC06 (HS06) suite for over a decade. It has recently become clear that HS06, which is based on real applications from non-HEP domains, no longer describes typical HEP workloads. The aim of the HEP-Benchmarks project is to develop a new benchmark suite for WLCG compute resources, based on real applications from the LHC experiments. By construction, these new benchmarks are thus guaranteed to have a score highly correlated to the throughputs of HEP applications, and a CPU usage pattern similar to theirs. Linux containers and the CernVM-FS filesystem are the two main technologies enabling this approach, which had been considered impossible in the past. In this paper, we review the motivation, implementation and outlook of the new benchmark suite. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
40. A Novel In-Memory Wallace Tree Multiplier Architecture Using Majority Logic.
- Author
-
Lakshmi, Vijaya, Reuben, John, and Pudi, Vikramkumar
- Subjects
- *
THRESHOLD logic , *ADDITION (Mathematics) , *GATE array circuits , *BUILDING additions , *TREES , *MULTIPLIERS (Mathematical analysis) - Abstract
In-memory computing using emerging technologies such as resistive random-access memory (ReRAM) addresses the ‘von Neumann bottleneck’ and strengthens the present research impetus to overcome the memory wall. While many methods have recently been proposed to implement Boolean logic in memory, the latency of arithmetic circuits (adders and, consequently, multipliers) implemented as a sequence of such Boolean operations increases greatly with bit-width. Existing in-memory multipliers require $O(n^{2})$ cycles, which is inefficient in terms of both latency and energy. In this work, we tackle this exorbitant latency by adopting the Wallace tree multiplier architecture and optimizing the addition operation in each phase of the Wallace tree. The majority logic primitive was used for addition since it outperforms the NAND/NOR/IMPLY primitives. Furthermore, a high degree of gate-level parallelism is employed at the array level by executing multiple majority gates in the columns of the array. In this manner, an in-memory multiplier of $O(n\log n)$ latency is achieved, which outperforms all reported in-memory multipliers. Furthermore, the proposed multiplier can be implemented in a regular transistor-accessed memory array without any major modifications to its peripheral circuitry and is also energy-efficient. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
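The majority-logic addition the abstract relies on can be sketched behaviorally: the carry of a full adder is a single majority gate, and one Wallace-tree phase compresses every bit column with such adders in parallel (a Python sketch of the logic, not the in-memory ReRAM realization):

```python
def maj(a, b, c):
    """Majority gate: the in-memory logic primitive used for addition."""
    return (a & b) | (b & c) | (a & c)

def full_adder(a, b, cin):
    """Carry out is a single majority gate; sum here uses XOR for
    brevity (in-memory realizations compose it from majority/NOT)."""
    return a ^ b ^ cin, maj(a, b, cin)

def wallace_reduce(columns):
    """One Wallace-tree reduction phase: in every bit column, each
    group of three partial-product bits is compressed by a full adder,
    with all columns processed in parallel. Repeating such phases
    yields the logarithmic depth behind the O(n log n) latency."""
    out = [[] for _ in range(len(columns) + 1)]
    for i, col in enumerate(columns):
        col = list(col)
        while len(col) >= 3:
            a, b, c = col.pop(), col.pop(), col.pop()
            s, cy = full_adder(a, b, c)
            out[i].append(s)          # sum stays in this column
            out[i + 1].append(cy)     # carry moves one column left
        out[i].extend(col)            # leftover bits pass through
    return out
```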
41. Low-Complexity Resource-Shareable Parallel Generalized Integrated Interleaved Encoder.
- Author
-
Tang, Yok Jye and Zhang, Xinmiao
- Subjects
- *
ERROR-correcting codes , *FLASH memory , *VIDEO coding , *SHIFT registers , *DIGITAL communications , *PARALLEL processing , *COMPUTER architecture - Abstract
Generalized integrated interleaved (GII) codes nest a set of linear block codewords to generate codewords belonging to stronger codes. They are among the best error-correcting codes for next-generation hyper-speed digital communications and storage. Serial encoders for GII codes based on BCH codes have previously been investigated. They consist of BCH encoders whose inputs and outputs are multiplied by vectors decided by the nesting scheme. However, parallel GII encoders for high-speed systems cannot be designed by directly extending serial encoders, due to the unique feature that BCH codes of different error-correcting capabilities are involved. Moreover, GII decoder complexity and latency can be greatly reduced by sharing the encoder to compute short remainders for syndrome computation. Although previous resource-shareable BCH encoders can be utilized to implement resource-shareable GII encoders, they are all serial. This paper first proposes a low-complexity scheme to handle the different error-correcting capabilities of the involved codes and align the input and parity symbols for parallel processing. Then, two efficient parallel resource-shareable BCH encoder architectures to be used as GII encoder components are developed. The first design is achieved by deriving parallel register state update formulas for concatenated linear-feedback shift registers (LFSRs). By reformulating the remainder polynomial divisions, the second design allows the inputs to be added to different LFSR taps, and accordingly reduces the complexity by a significant portion. For an example 160-parallel GII-BCH encoder considered for Flash memory applications, the second proposed design requires 14% less area than the first. Besides, both designs lead to around 50% latency reduction in the nested syndrome computation with small area overheads compared to the best possible alternative design. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
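The remainder computation at the heart of a serial BCH encoder, which the paper's parallel architectures unroll by several steps per clock cycle, is polynomial division over GF(2); a behavioral Python sketch with an illustrative small generator:

```python
def bch_parity(message_bits, generator_bits):
    """Systematic-encoding parity: the remainder of message(x) * x^deg
    divided by g(x) over GF(2). A serial LFSR encoder computes exactly
    this, one bit per cycle; parallel encoders unroll p such register
    updates into a single clock cycle."""
    deg = len(generator_bits) - 1
    g = int("".join(map(str, generator_bits)), 2)
    r = 0
    # Appending deg zeros multiplies the message by x^deg.
    for bit in message_bits + [0] * deg:
        r = (r << 1) | bit          # shift the next bit into the remainder
        if r >> deg:                # leading term reached degree deg
            r ^= g                  # subtract (XOR) the generator
    return [(r >> i) & 1 for i in range(deg - 1, -1, -1)]

# Illustrative generator g(x) = x^3 + x + 1, bits [1, 0, 1, 1]:
print(bch_parity([1, 0, 1, 1, 0, 0, 1], [1, 0, 1, 1]))  # → [0, 1, 1]
```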
42. Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
- Published
- 2010
- Full Text
- View/download PDF
43. PIT: Processing-In-Transmission With Fine-Grained Data Manipulation Networks.
- Author
-
Zong, Pengchen, Xia, Tian, Zhao, Haoran, Tong, Jianming, Li, Zehua, Zhao, Wenzhe, Zheng, Nanning, and Ren, Pengju
- Subjects
- *
CONVOLUTIONAL neural networks , *MATRIX inversion , *MULTICASTING (Computer networks) , *VERNACULAR architecture , *PARALLEL processing , *BENEFIT performances , *DATA transmission systems - Abstract
In the domain of data-parallel computation, most works focus on data flow optimization inside the PE array and a favorable memory hierarchy to pursue maximum parallelism and efficiency, while the importance of the data contents has long been overlooked. As we observe, for structured data, insights into the contents (i.e., their values and locations within a structured form) can greatly benefit computation performance, as fine-grained data manipulation can be performed. In this paper, we claim that by providing a flexible and adaptive data path, an efficient architecture with the capability of fine-grained data manipulation can be built. Specifically, we design SOM, a portable and highly adaptive data transmission network with the capabilities of operand sorting, non-blocking self-route ordering, and multicasting. Based on SOM, we propose the processing-in-transmission architecture (PITA), which extends the traditional SIMD architecture to perform some fundamental data processing during transmission by embedding multiple levels of SOM networks on the data path. We evaluate the performance of PITA on two irregular computation problems. We first map the matrix inversion task onto PITA and show that considerable performance gains can be achieved, resulting in 3×-20× speedup against Intel MKL and 20×-40× against cuBLAS. We then evaluate PITA on sparse CNNs. The results indicate that PITA can greatly improve computation efficiency and reduce memory bandwidth pressure. We achieved 2×-9× speedup against several state-of-the-art sparse CNN accelerators, with nearly 100 percent PE efficiency maintained under high sparsity. We believe the concept of PIT is a promising computing paradigm that can enlarge the capability of traditional parallel architectures. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
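The operand-sorting capability attributed to the SOM network can be illustrated with the simplest comparator-stage network, odd-even transposition sort, where each stage is a set of disjoint compare-exchange pairs that hardware can execute in parallel (a behavioral sketch; the abstract does not describe SOM at this level of detail):

```python
def comparator_stage(values, start):
    """One stage of an odd-even transposition network: disjoint
    compare-exchange pairs, all executable concurrently in hardware."""
    v = list(values)
    for i in range(start, len(v) - 1, 2):
        if v[i] > v[i + 1]:
            v[i], v[i + 1] = v[i + 1], v[i]
    return v

def sorting_network(values):
    """n alternating odd/even stages sort any n-element input. Depth
    grows with n, but each stage is constant-time in parallel, which
    is what makes sorting feasible on a transmission data path."""
    for stage in range(len(values)):
        values = comparator_stage(values, stage % 2)
    return values

print(sorting_network([3, 1, 4, 1, 5, 9, 2, 6]))
```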
44. ParIS+: Data Series Indexing on Multi-Core Architectures.
- Author
-
Peng, Botao, Fatourou, Panagiota, and Palpanas, Themis
- Subjects
- *
MAGNITUDE (Mathematics) , *DATA analysis , *ACQUISITION of data , *TASK analysis , *COMPUTER architecture - Abstract
Data series similarity search is a core operation for several data series analysis applications across many different domains. Nevertheless, even state-of-the-art techniques cannot provide the time performance required for large data series collections. We propose ParIS and ParIS+, the first disk-based data series indices carefully designed to inherently take advantage of multi-core architectures in order to accelerate similarity search processing times. Our experiments demonstrate that ParIS+ completely removes the CPU latency during index construction for disk-resident data, and for exact query answering is up to one order of magnitude faster than the current state-of-the-art index-scan method, and up to three orders of magnitude faster than the optimized serial scan method. ParIS+ (an evolution of the ADS+ index) owes its efficiency to the effective use of multi-core and multi-socket architectures, in order to distribute and execute in parallel both index construction and query answering, and to the exploitation of the Single Instruction Multiple Data (SIMD) capabilities of modern CPUs, in order to further parallelize the execution of instructions inside each core. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
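The summarize-prune-verify pattern used by iSAX-family indices such as ParIS+ can be sketched in a few lines: a Piecewise Aggregate Approximation (PAA) summary yields a cheap lower bound on Euclidean distance, so most candidates are discarded before their raw data is touched (illustrative Python; the real index adds the disk layout, SIMD, and multi-core parallelism described in the abstract):

```python
import math

def paa(series, segments):
    """Piecewise Aggregate Approximation: segment means form the
    coarse summary that iSAX-family indices build on."""
    step = len(series) // segments
    return [sum(series[i:i + step]) / step
            for i in range(0, len(series), step)]

def paa_lower_bound(q_paa, c_paa, step):
    """Distance between PAA summaries lower-bounds the true Euclidean
    distance, so it can safely prune candidates."""
    return math.sqrt(step * sum((a - b) ** 2 for a, b in zip(q_paa, c_paa)))

def exact_search(query, dataset, segments=4):
    """Exact 1-NN: prune by lower bound, verify survivors exactly."""
    step = len(query) // segments
    q_paa = paa(query, segments)
    best, best_d = None, float("inf")
    for idx, series in enumerate(dataset):
        if paa_lower_bound(q_paa, paa(series, segments), step) >= best_d:
            continue                  # summary already rules this one out
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, series)))
        if d < best_d:
            best, best_d = idx, d
    return best, best_d
```

Because the lower bound never exceeds the true distance, pruning preserves exactness, which is why an index scan can beat a serial scan by orders of magnitude while returning identical answers.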
45. A Novel DRAM Architecture for Improved Bandwidth Utilization and Latency Reduction Using Dual-Page Operation.
- Author
-
Sudarshan, Chirag, Steiner, Lukas, Jung, Matthias, Lappas, Jan, Weis, Christian, and Wehn, Norbert
- Abstract
Emerging memory-intensive applications require a paradigm shift from processor-centric to memory-centric computing. The performance of state-of-the-art computing systems and accelerators designed for such applications is limited not by processing speed but by limited DRAM bandwidth and long DRAM latencies. Although the interface frequency of new commodity DRAM memories is constantly increasing to achieve a higher peak bandwidth, this bandwidth cannot be fully utilized due to long internal latencies. Page misses (i.e., changes of the active row) are one of the key contributors to the long access latencies that result in low bandwidth utilization. In this brief, we propose a novel DRAM sub-array-level architecture, referred to as Dual-Sense-Amplifier (DSA), that masks the page-miss latency by allowing individual sub-arrays or banks to flexibly open two rows concurrently. Additionally, the DSA architecture is designed to be fully compatible with any type of commodity DRAM architecture (e.g., HBM, GDDR, etc.) and to retain most of its circuit design. The area overhead of DSA for an 8 Gb device is 9.6% compared to a commodity DRAM device. On average, a DSA 8 Gb device improves bandwidth utilization by 17.53% and reduces the average response latency by 31.59 ns compared to commodity DRAM devices operated at 800 MHz. Moreover, we demonstrate that the energy consumption overhead of a DSA DRAM in comparison to a commodity DRAM is negligible. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
46. Efficient Incorporation of the RNS Datapath in Reverse Converter.
- Author
-
Taheri, MohammadReza, Molahosseini, Amir Sabbagh, and Navi, Keivan
- Abstract
The class of moduli sets of the form {2^k, 2^n − 1, 2^n + 1, m4}, with m4 ∈ {2^r + 1, 2^r − 1}, has earned significant popularity in the implementation of Residue Number System (RNS)-based computational systems, mainly thanks to its efficient arithmetic units and high degree of parallelism. However, its complicated inter-modulo computation leads to high overhead in the complex reverse converter. This overhead is the main barrier to energy-efficient implementation of RNS-based devices, particularly for edge computing applications. This brief presents a new approach that embeds the reverse converter into the arithmetic unit of the RNS processor for the aforementioned well-known class of moduli sets. The effective hardware reuse in the proposed approach leads to an area- and energy-efficient RNS realization for this class of moduli sets. The experimental results based on 65 nm CMOS technology indicate the superiority of the RNS realization employing the proposed design methodology. The proposed architecture for a given RNS provides a substantial 17.4% area saving and 13.32% lower power consumption on average compared to the traditional design approach, with a negligible penalty in delay. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
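The reverse conversion the brief optimizes is, in textbook form, a Chinese Remainder Theorem reconstruction; a Python sketch for a three-modulus instance of the class discussed (real converters for these moduli replace the generic modular inverses with shift-and-add logic, which is exactly the costly unit the brief folds into the RNS datapath):

```python
from math import prod

def to_rns(x, moduli):
    """Forward conversion: independent residues enable carry-free
    parallel arithmetic within each modulus channel."""
    return [x % m for m in moduli]

def reverse_convert(residues, moduli):
    """Textbook CRT reverse conversion for pairwise-coprime moduli:
    x = sum(r_i * M_i * (M_i^-1 mod m_i)) mod M, with M_i = M / m_i."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m): modular inverse
    return x % M

n, k = 5, 6
moduli = [2**k, 2**n - 1, 2**n + 1]    # {64, 31, 33}, pairwise coprime
x = 12345
assert reverse_convert(to_rns(x, moduli), moduli) == x
```

Every addition and multiplication between forward and reverse conversion can run independently per channel; the reverse conversion is the only step that mixes channels, which is why its cost dominates and why embedding it in the datapath pays off.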
47. RRAM-DNN: An RRAM and Model-Compression Empowered All-Weights-On-Chip DNN Accelerator.
- Author
-
Li, Ziyun, Wang, Zhehong, Xu, Li, Dong, Qing, Liu, Bowen, Su, Chin-I, Chu, Wen-Ting, Tsou, George, Chih, Yu-Der, Chang, Tsung-Yung Jonathan, Sylvester, Dennis, Kim, Hun-Seok, and Blaauw, David
- Subjects
NONVOLATILE random-access memory ,ARTIFICIAL neural networks ,MACHINE learning ,RANDOM access memory ,PARALLEL processing - Abstract
This article presents an energy-efficient deep neural network (DNN) accelerator with non-volatile embedded resistive random access memory (RRAM) for mobile machine learning (ML) applications. This DNN accelerator implements weight pruning, non-linear quantization, and Huffman encoding to store all weights on RRAM, enabling single-chip processing for large neural network models without external memory. A four-core parallel and programmable architecture adapts to various neural network configurations with high utilization. We introduce a customized RRAM macro with a dynamic clamping offset-canceling sense amplifier (DCOCSA) that achieves sub-microampere input offset. The on-chip decompression and memory error-resilient scheme enables 16 million (M) 8-bit (decompressed) weights on a single chip using 24 Mb RRAM. The proposed RRAM-DNN is the first digital DNN accelerator featuring 24 Mb RRAM as all-on-chip weight storage to eliminate energy-consuming off-chip memory accesses. The fabricated design performs the complete inference process of the ResNet-18 model while consuming 127.9 mW in TSMC 22 nm ULL CMOS. The RRAM-DNN accelerator achieves peak performance of 123 GOPs with 8-bit precision, exhibiting a measured energy efficiency of 0.96 TOPs/W. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
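The compression pipeline mentioned in the abstract — quantize weights to a small codebook, then Huffman-code the indices — can be sketched as follows (the codebook and data are illustrative; the paper's specific non-linear quantization and encoding details are not modeled):

```python
import heapq
from collections import Counter

def quantize(weights, codebook):
    """Map each weight to its nearest codebook entry. A stand-in for
    non-linear quantization; the codebook here is illustrative."""
    return [min(range(len(codebook)), key=lambda i: abs(w - codebook[i]))
            for w in weights]

def huffman_lengths(symbols):
    """Huffman code lengths for the quantized indices: frequent levels
    get short codes, shrinking the on-chip weight storage."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    heap = [(c, [s]) for s, c in counts.items()]
    heapq.heapify(heap)
    lengths = Counter()
    while len(heap) > 1:
        c1, s1 = heapq.heappop(heap)
        c2, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1           # merged symbols sink one level deeper
        heapq.heappush(heap, (c1 + c2, s1 + s2))
    return dict(lengths)
```

On-chip, the matching decompressor walks the Huffman tree and looks up the codebook before each MAC, trading a little decode logic for a large reduction in stored bits.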
48. PUMA-V - Polyhedral User Mapping Assistant and Visualizer Final Technical Report
- Author
-
Lethin, Richard [Reservoir Labs, Inc., New York, NY (United States)]
- Published
- 2016
49. A 3-D Crossbar Architecture for Both Pipeline and Parallel Computations.
- Author
-
Aljafar, Muayad J. and Acken, John M.
- Subjects
- *
MEMRISTORS , *PARALLEL processing , *BEHAVIORAL assessment , *LOGIC circuits , *COMPUTER architecture - Abstract
A 3D architecture comprising a CMOS layer combined with a 3D stack of bipolar memristor crossbar arrays provides an innovative approach to hardware that combines the strengths of CMOS with those of memristors. Memristors have been evaluated for implementing a broad spectrum of applications, such as memory, computation, hardware-based security primitives, and cryptography, and numerous studies have shown that memristors are desirable candidates for such applications. This paper proposes a novel 3D memristive crossbar architecture (i.e., a stack of memristive crossbar arrays built on top of a CMOS substrate) with a specific focus on the way the crossbar arrays are connected to the CMOS layer. The proposed architecture is configurable and allows restructuring crossbar arrays and creating 1D arrays with adjustable sizes. The proposed architecture enables parallel and pipeline computations, where data can move or be processed in planes perpendicular to the stacked crossbar arrays. In addition, the proposed architecture is scalable, meaning that stacks of crossbar arrays can be connected without additional overhead. This paper shows examples of implementing a full adder, a 4-bit look-ahead carry generator, and an 8-bit multiplexer. Simulations and area, delay, and power analysis demonstrate the behavior of the proposed 3D circuit. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
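One of the example circuits, the 4-bit look-ahead carry generator, computes all carries in parallel from generate/propagate signals instead of rippling them bit by bit; a behavioral Python sketch (bit vectors are LSB-first; this models the Boolean logic, not the memristor implementation):

```python
def carry_lookahead_4(a, b, cin):
    """4-bit carry-lookahead addition: g_i = a_i AND b_i (generate),
    p_i = a_i XOR b_i (propagate); then c_{i+1} = g_i OR (p_i AND c_i),
    which hardware expands so all carries are available in parallel."""
    g = [x & y for x, y in zip(a, b)]
    p = [x ^ y for x, y in zip(a, b)]
    c = [cin]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))
    s = [p[i] ^ c[i] for i in range(4)]
    return s, c[4]                     # 4 sum bits and the carry out

# 11 + 5 = 16: sum bits all zero, carry out 1 (LSB-first vectors).
print(carry_lookahead_4([1, 1, 0, 1], [1, 0, 1, 0], 0))
```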
50. Parallel Simulation in Subsurface Hydrology: Evaluating the Performance of Modeling Computers.
- Author
-
Sarris, Theo S., Scott, David M., and Close, Murray E.
- Subjects
- *
COMPUTER performance , *COMPUTER simulation , *COMPUTER architecture , *MODERN architecture , *PARALLEL processing , *SUBSURFACE drainage , *BOTTLENECKS (Manufacturing) - Abstract
Monte Carlo uncertainty analysis, model calibration and optimization applications in hydrology, usually involve a very large number of forward transient model solutions, often resulting in computational bottlenecks. Parallel processing can significantly reduce overall simulation time, benefiting from the architecture of modern computers. This work investigates system performance using two realistic flow and transport modeling scenarios, applied to various modeling hardware, to provide information on the expected performance of parallel simulations and inform investment decisions. We investigate how performance, measured in terms of speedup and efficiency, changes with increasing number of parallel processes. We conclude that the maximum performance achieved by parallelization can range from 40% to 100% of the theoretical limit, with the lower increases associated with multi‐CPU servers. The number of parallel processes required to maximize performance is application dependent, and in contrast to common practice, often needs to be significantly larger than the total number of system CPU cores. Further testing is required to better understand how the physical problem being simulated affects the optimal number of parallel processes needed. Finally, when laptops are considered for modeling applications, careful consideration should be given not only to the specifications but also to the intended use designated by the manufacturer. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
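The speedup and efficiency metrics used in the study follow the standard definitions, and Amdahl's law gives the theoretical limit they are compared against; a short Python sketch (the 5% serial fraction is illustrative):

```python
def amdahl_speedup(p, serial_fraction):
    """Amdahl's law: the upper bound on speedup with p parallel
    processes when a fraction of the work is inherently serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def efficiency(speedup, p):
    """Speedup per process. The study's observed 40%-100% of the
    theoretical limit is this ratio against the ideal S(p) = p."""
    return speedup / p

# With 5% serial work, 16 processes cannot exceed ~9.1x speedup,
# i.e., under ~60% efficiency even in the best case.
s = amdahl_speedup(16, 0.05)
print(round(s, 2), round(efficiency(s, 16), 2))
```

This also illustrates the paper's point about process counts: measured efficiency can keep improving past the physical core count when processes overlap I/O waits, so the optimal degree of parallelism is application-dependent rather than fixed at the number of cores.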