Descriptor: "Computer Science - Hardware Architecture" / Publisher: ieee - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Computer Science - Hardware Architecture"' showing total 311 results

1. Dustin: A 16-Cores Parallel Ultra-Low-Power Cluster With 2b-to-32b Fully Flexible Bit-Precision and Vector Lockstep Execution Mode

Author: Angelo Garofalo, Francesco Conti, Davide Rossi, Giuseppe Tagliavini, Gianmarco Ottavi, Luca Benini, and Alfio Di Mauro
Subjects: FOS: Computer and information sciences, Hardware and Architecture, Hardware Architecture (cs.AR), Electrical and Electronic Engineering, Computer Science - Hardware Architecture, QNN inference, mixed-precision, SIMD, MIMD, RISC-V
Abstract: Computationally intensive algorithms such as Deep Neural Networks (DNNs) are becoming killer applications for edge devices. Porting heavily data-parallel algorithms on resource-constrained and battery-powered devices poses several challenges related to memory footprint, computational throughput, and energy efficiency. Low-bitwidth and mixed-precision arithmetic have been proven to be valid strategies for tackling these problems. We present Dustin, a fully programmable compute cluster integrating 16 RISC-V cores capable of 2- to 32-bit arithmetic and all possible mixed-precision permutations. In addition to a conventional Multiple-Instruction Multiple-Data (MIMD) processing paradigm, Dustin introduces a Vector Lockstep Execution Mode (VLEM) to minimize power consumption in highly data-parallel kernels. In VLEM, a single leader core fetches instructions and broadcasts them to the 15 follower cores. Clock gating Instruction Fetch (IF) stages and private caches of the follower cores leads to 38% power reduction with minimal performance overhead (, 13 pages, 17 figures, 2 tables, Journal
Published: 2023

2. MetaML: automating customizable cross-stage design-flow for deep learning acceleration

Author: Que, Zhiqiang, Liu, Shuo, Rognlien, Markus, Guo, Ce, Coutinho, Jose G. F., and Luk, Wayne
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: This paper introduces a novel optimization framework for deep neural network (DNN) hardware accelerators, enabling the rapid development of customized and automated design flows. More specifically, our approach aims to automate the selection and configuration of low-level optimization techniques, encompassing DNN and FPGA low-level optimizations. We introduce novel optimization and transformation tasks for building design-flow architectures, which are highly customizable and flexible, thereby enhancing the performance and efficiency of DNN accelerators. Our results demonstrate considerable reductions of up to 92% in DSP usage and 89% in LUT usage for two networks, while maintaining accuracy and eliminating the need for human effort or domain expertise. In comparison to state-of-the-art approaches, our design achieves higher accuracy and utilizes three times fewer DSP resources, underscoring the advantages of our proposed framework., Comment: 5 pages, Accepted at FPL'23
Published: 2023

3. HoloBeam: Paper-Thin Near-Eye Displays

Author: Akşit, Kaan and Itoh, Yuta
Subjects: I.3.1, FOS: Computer and information sciences, I.3.2, I.3.7, Computer Science - Human-Computer Interaction, FOS: Physical sciences, Graphics (cs.GR), Human-Computer Interaction (cs.HC), Computer Science - Graphics, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Physics - Optics, Optics (physics.optics)
Abstract: An emerging alternative to conventional Augmented Reality (AR) glasses designs, Beaming displays promise slim AR glasses free from challenging design trade-offs, including battery-related limits or computational budget-related issues. These beaming displays remove active components such as batteries and electronics from AR glasses and move them to a projector that projects images to a user from a distance (1-2 meters), where users wear only passive optical eyepieces. However, earlier implementations of these displays delivered poor resolutions (7 cycles per degree) without any optical focus cues and were introduced with a bulky form-factor eyepiece (50 mm thick). This paper introduces a new milestone for beaming displays, which we call HoloBeam. In this new design, a custom holographic projector populates a micro-volume located at some distance (1-2 meters) with multiple planes of images. Users view magnified copies of these images from this small volume with the help of an eyepiece that is either a Holographic Optical Element (HOE) or a set of lenses. Our HoloBeam prototypes demonstrate the thinnest AR glasses to date with a submillimeter thickness (e.g., HOE film is only 120 um thick). In addition, HoloBeam prototypes demonstrate near retinal resolutions (24 cycles per degree) with a 70 degrees-wide field of view., Comment: 15 pages, 18 Figures, 1 Table, 1 Listing
Published: 2023

4. Versatile and Concurrent FPGA-Based Architecture for Practical Quantum Communication Systems

Author: Andrea Stanco, Francesco B. L. Santagiustina, Luca Calderaro, Marco Avesani, Tommaso Bertapelle, Daniele Dequal, Giuseppe Vallone, and Paolo Villoresi
Subjects: FOS: Computer and information sciences, quantum key distribution (QKD), Standards, Computer Science - Emerging Technologies, FOS: Physical sciences, Receivers, Optical transmitters, embedded system, quantum random number generator (QRNG), field-programmable gate array (FPGA), Hardware Architecture (cs.AR), Computer architecture, Atomic physics. Constitution and properties of matter, Computer Science - Hardware Architecture, Materials of engineering and construction. Mechanics of materials, Quantum Physics, quantum communication (QC), Field programmable gate arrays, Quantum system, Security, Commercial off-the-shelf (COTS), Emerging Technologies (cs.ET), TA401-492, Quantum Physics (quant-ph), QC170-197
Abstract: This work presents a hardware and software architecture which can be used in those systems that implement practical Quantum Key Distribution (QKD) and Quantum Random Number Generation (QRNG) schemes. This architecture fully exploits the capability of a System-on-a-Chip (SoC) which comprises both a Field Programmable Gate Array (FPGA) and a dual-core CPU unit. By assigning the time-related tasks to the FPGA and the management to the CPU, we built a flexible system with optimized resource sharing on a commercial off-the-shelf (COTS) evaluation board which includes a SoC. Furthermore, by changing the dataflow direction, the versatile system architecture can be exploited as a QKD transmitter, QKD receiver and QRNG control-acquiring unit. Finally, we exploited the dual-core functionality and realized a concurrent stream device to implement a practical QKD transmitter where one core continuously receives fresh data at a sustained rate from an external QRNG source while the other operates with the FPGA to drive the qubit transmission to the QKD receiver. The system was successfully tested on a long-term run proving its stability and security. This demonstration paves the way towards a more secure QKD implementation, with fully unconditional security as the QKD states are entirely generated by a true random process and not by deterministic expansion algorithms. Eventually, this enables the realization of a standalone quantum transmitter, including both the random numbers and the qubits generation., Comment: 7 pages, 4 figures
Published: 2022

5. ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design

Author: You, Haoran, Sun, Zhanyi, Shi, Huihong, Yu, Zhongzhi, Zhao, Yang, Zhang, Yongan, Li, Chaojian, Li, Baopu, and Lin, Yingyan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Hardware Architecture (cs.AR), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively., Accepted to HPCA 2023
Published: 2023

6. Duet: Creating Harmony between Processors and Embedded FPGAs

Author: Li, Ang, Ning, August, and Wentzlaff, David
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: The demise of Moore's Law has led to the rise of hardware acceleration. However, the focus on accelerating stable algorithms in their entirety neglects the abundant fine-grained acceleration opportunities available in broader domains and squanders host processors' compute power. This paper presents Duet, a scalable, manycore-FPGA architecture that promotes embedded FPGAs (eFPGA) to be equal peers with processors through non-intrusive, bi-directionally cache-coherent integration. In contrast to existing CPU-FPGA hybrid systems in which the processors play a supportive role, Duet unleashes the full potential of both the processors and the eFPGAs with two classes of post-fabrication enhancements: fine-grained acceleration, which partitions an application into small tasks and offloads the frequently-invoked, compute-intensive ones onto various small accelerators, leveraging the processors to handle dynamic control flow and less accelerable tasks; hardware augmentation, which employs eFPGA-emulated hardware widgets to improve processor efficiency or mitigate software overheads in certain execution models. An RTL-level implementation of Duet is developed to evaluate the architecture with high fidelity. Experiments using synthetic benchmarks show that Duet can reduce the processor-accelerator communication latency by up to 82% and increase the bandwidth by up to 9.5x. The RTL implementation is further evaluated with seven application benchmarks, achieving 1.5-24.9x speedup., Comment: Accepted to HPCA 2023
Published: 2023

7. A Storage-Effective BTB Organization for Servers

Author: Asheim, Truls, Grot, Boris, and Kumar, Rakesh
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: Many contemporary applications feature multimegabyte instruction footprints that overwhelm the capacity of branch target buffers (BTB) and instruction caches (L1-I), causing frequent front-end stalls that inevitably hurt performance. BTB capacity is crucial for performance as a sufficiently large BTB enables the front-end to accurately resolve the upcoming execution path and steer instruction fetch appropriately. Moreover, it also enables highly effective fetch-directed instruction prefetching that can eliminate a large portion of L1-I misses. For these reasons, commercial processors allocate vast amounts of storage capacity to BTBs. This work aims to reduce BTB storage requirements by optimizing the organization of BTB entries. Our key insight is that storing branch target offsets, instead of full or compressed targets, can drastically reduce BTB storage cost as the vast majority of dynamic branches have short offsets requiring just a handful of bits to encode. Based on this insight, we size the ways of a set-associative BTB to hold different numbers of target offset bits such that each way stores offsets within a particular range. Doing so enables a dramatic reduction in storage for target addresses. Our final design, called BTB-X, uses an 8-way set-associative BTB with differently sized ways that enables it to track about 2.24x more branches than a conventional BTB and 1.3x more branches than a storage-optimized state-of-the-art BTB organization, called PDede, with the same storage budget.
Published: 2023
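
Illustration: the offset-based organization described in the abstract above can be sketched roughly as follows. This is a minimal sketch, not the authors' design: it assumes the offset is the signed distance from the branch PC to its target, and the per-way widths in WAY_OFFSET_BITS, along with the helper names offset_bits and pick_way, are hypothetical values chosen only for illustration.

```python
# Hypothetical offset widths (in bits) for an 8-way set; not taken from the paper.
WAY_OFFSET_BITS = [4, 6, 8, 10, 12, 16, 24, 32]


def offset_bits(pc, target):
    """Bits needed to hold the signed distance (target - pc) in two's complement.

    Assumption: the stored offset is a signed PC-relative delta.
    """
    delta = target - pc
    magnitude = delta if delta >= 0 else -delta - 1
    return magnitude.bit_length() + 1  # +1 for the sign bit


def pick_way(pc, target):
    """Return the index of the narrowest way that can hold the offset, or None."""
    need = offset_bits(pc, target)
    for way, width in enumerate(WAY_OFFSET_BITS):
        if need <= width:
            return way
    return None  # offset too large for any way in this sketch


if __name__ == "__main__":
    print(pick_way(0x1000, 0x1004))  # short forward branch
    print(pick_way(0x1000, 0x0FF0))  # short backward branch
```

Because most dynamic branches have short offsets, most entries in such a scheme would land in the narrow ways, which is where the storage saving would come from.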

8. SGCN: Exploiting Compressed-Sparse Features in Deep Graph Convolutional Network Accelerators

Author: Yoo, Mingi, Song, Jaeyong, Lee, Jounghoo, Kim, Namhyung, Kim, Youngsok, and Lee, Jinho
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: Graph convolutional networks (GCNs) are becoming increasingly popular as they overcome the limited applicability of prior neural networks. A GCN takes as input an arbitrarily structured graph and executes a series of layers which exploit the graph's structure to calculate their output features. One recent trend in GCNs is the use of deep network architectures. As opposed to the traditional GCNs which only span around two to five layers deep, modern GCNs now incorporate tens to hundreds of layers with the help of residual connections. From such deep GCNs, we find an important characteristic that they exhibit very high intermediate feature sparsity. We observe that with deep layers and residual connections, the number of zeros in the intermediate features sharply increases. This reveals a new opportunity for accelerators to exploit in GCN executions that was previously not present. In this paper, we propose SGCN, a fast and energy-efficient GCN accelerator which fully exploits the sparse intermediate features of modern GCNs. SGCN suggests several techniques to achieve significantly higher performance and energy efficiency than the existing accelerators. First, SGCN employs a GCN-friendly feature compression format. We focus on reducing the off-chip memory traffic, which often is the bottleneck for GCN executions. Second, we propose microarchitectures for seamlessly handling the compressed feature format. Third, to better handle locality in the existence of the varying sparsity, SGCN employs sparsity-aware cooperation. Sparsity-aware cooperation creates a pattern that exhibits multiple reuse windows, such that the cache can capture diverse sizes of working sets and therefore adapt to the varying level of sparsity. We show that SGCN achieves 1.71x speedup and 43.9% higher energy efficiency compared to the existing accelerators., To appear at HPCA'23
Published: 2023

9. DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling

Author: Mei, Linyan, Goetschalckx, Koen, Symons, Arne, and Verhelst, Marian
Subjects: FOS: Computer and information sciences, Computer Science - Distributed, Parallel, and Cluster Computing, Hardware Architecture (cs.AR), Distributed, Parallel, and Cluster Computing (cs.DC), Computer Science - Hardware Architecture
Abstract: DNN workloads can be scheduled onto DNN accelerators in many different ways: from layer-by-layer scheduling to cross-layer depth-first scheduling (a.k.a. layer fusion, or cascaded execution). This results in a very broad scheduling space, with each schedule leading to varying hardware (HW) costs in terms of energy and latency. To rapidly explore this vast space for a wide variety of hardware architectures, analytical cost models are crucial to estimate scheduling effects on the HW level. However, state-of-the-art cost models are lacking support for exploring the complete depth-first scheduling space, for instance focusing only on activations while ignoring weights, or modeling only DRAM accesses while overlooking on-chip data movements. These limitations prevent researchers from systematically and accurately understanding the depth-first scheduling space. After formalizing this design space, this work proposes a unified modeling framework, DeFiNES, for layer-by-layer and depth-first scheduling to fill in the gaps. DeFiNES enables analytically estimating the hardware cost for possible schedules in terms of both energy and latency, while considering data access at every memory level. This is done for each schedule and HW architecture under study by optimally choosing the active part of the memory hierarchy per unique combination of operand, layer, and feature map tile. The hardware costs are estimated, taking into account both data computation and data copy phases. The analytical cost model is validated against measured data from a taped-out depth-first DNN accelerator, DepFiN, showing good modeling accuracy at the end-to-end neural network level. A comparison with generalized state-of-the-art demonstrates up to 10X better solutions found with DeFiNES., Comment: Accepted by HPCA 2023
Published: 2023

10. MERCURY: Accelerating DNN Training By Exploiting Input Similarity

Author: Janfaza, Vahid, Weston, Kevin, Razavi, Moein, Mandal, Shantanu, Mahmud, Farabi, Hilty, Alex, and Muzahid, Abdullah
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: Deep Neural Networks (DNN) are computationally intensive to train. Training consists of a large number of multidimensional dot products between many weights and input vectors. However, there can be significant similarity among input vectors. If one input vector is similar to another, its computations with the weights are similar to those of the other and, therefore, can be skipped by reusing the already-computed results. We propose a novel scheme, called MERCURY, to exploit input similarity during DNN training in a hardware accelerator. MERCURY uses Random Projection with Quantization (RPQ) to convert an input vector to a bit sequence, called Signature. A cache (MCACHE) stores signatures of recent input vectors along with the computed results. If the Signature of a new input vector matches that of an already existing vector in the MCACHE, the two vectors are found to have similarities. Therefore, the already-computed result is reused for the new vector. To the best of our knowledge, MERCURY is the first work that exploits input similarity using RPQ for accelerating DNN training in hardware. The paper presents a detailed design, workflow, and implementation of MERCURY. Our experimental evaluation with twelve different deep learning models shows that MERCURY saves a significant number of computations and speeds up the model training by an average of 1.97X with an accuracy similar to the baseline system., 13 pages, 18 figures, 4 tables
Published: 2023
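
Illustration: the signature-and-reuse idea in the abstract above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the MERCURY hardware design: quantization is assumed to be one sign bit per random Gaussian hyperplane, the cache is an unbounded dictionary tied to a single weight vector, and the names make_projection, signature, and SignatureCache are hypothetical.

```python
import random


def make_projection(dim, bits, seed=0):
    """Build `bits` random Gaussian hyperplanes of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vec, planes):
    """Quantize each random projection to one bit (assumed: sign quantization)."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


class SignatureCache:
    """Reuses a stored dot product when a new input has a matching signature.

    Assumption: one cache per weight vector, with no eviction policy.
    """

    def __init__(self, planes):
        self.planes = planes
        self.table = {}
        self.hits = 0
        self.misses = 0

    def dot(self, vec, weights):
        sig = signature(vec, self.planes)
        if sig in self.table:
            self.hits += 1
            return self.table[sig]  # reuse; the result is approximate
        self.misses += 1
        result = sum(w * x for w, x in zip(weights, vec))
        self.table[sig] = result
        return result


if __name__ == "__main__":
    planes = make_projection(dim=4, bits=8, seed=1)
    cache = SignatureCache(planes)
    v = [0.5, -1.0, 2.0, 0.25]
    w = [1.0, 2.0, 3.0, 4.0]
    print(cache.dot(v, w))                    # computed
    print(cache.dot([2 * x for x in v], w))   # same signature, reused
    print(cache.hits, cache.misses)
```

In this sketch a scaled copy of a vector maps to the same signature, so its result is reused even though the exact dot product differs. That is the accuracy-for-computation trade the abstract describes, and more signature bits make matches stricter.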

11. TensorFHE: Achieving Practical Computation on Encrypted Data Using GPGPU

Author: Fan, Shengyu, Wang, Zhiwei, Xu, Weizhi, Hou, Rui, Meng, Dan, and Zhang, Mingzhe
Subjects: FOS: Computer and information sciences, Computer Science - Cryptography and Security, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Cryptography and Security (cs.CR)
Abstract: In this paper, we propose TensorFHE, an FHE acceleration solution based on GPGPU for real applications on encrypted data. TensorFHE utilizes Tensor Core Units (TCUs) to boost the computation of Number Theoretic Transform (NTT), which is the part of FHE with the highest time cost. Moreover, TensorFHE focuses on performing as many FHE operations as possible in a certain time period rather than reducing the latency of one operation. Based on such an idea, TensorFHE introduces operation-level batching to fully utilize the data parallelism in GPGPU. We experimentally prove that it is possible to achieve comparable performance with GPGPU as with state-of-the-art ASIC accelerators. TensorFHE performs 913 KOPS and 88 KOPS for NTT and HMULT (key FHE kernels) within NVIDIA A100 GPGPU, which is 2.61x faster than state-of-the-art FHE implementation on GPGPU; moreover, TensorFHE provides comparable performance to the ASIC FHE accelerators, which makes it even 2.9x faster than the F1+ with a specific workload. Such a pure software acceleration based on commercial hardware with high performance can open up usage of state-of-the-art FHE algorithms for a broad set of applications in real systems., To appear in the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29), 2023
Published: 2023

12. Unsupervised Recycled FPGA Detection Using Symmetry Analysis

Author: Tarique, Tanvir Ahmad, Ahmed, Foisal, Jenihhin, Maksim, and Ali, Liakot
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: Recently, recycled field-programmable gate arrays (FPGAs) pose a significant hardware security problem due to the proliferation of the semiconductor supply chain. The ring oscillator (RO)-based frequency analysis technique is one of the popular detection methods, where most studies used the known fresh FPGAs (KFFs) in machine learning-based detection, which is not a realistic approach. In this paper, we present a novel recycled FPGA detection method by examining the symmetry information of the RO frequency using an unsupervised anomaly detection method. Due to the symmetrical array structure of the FPGA, some adjacent logic blocks on an FPGA have comparable RO frequencies; hence, our method simply analyzes the RO frequencies of those blocks to determine how similar they are. The proposed approach efficiently categorizes recycled FPGAs by utilizing direct density ratio estimation through outlier detection. Experiments using Xilinx Artix-7 FPGAs demonstrate that the proposed method accurately classifies recycled FPGAs from 10 fresh FPGAs by x fewer computations compared with the conventional method.
Published: 2022

13. FADEC: FPGA-based Acceleration of Video Depth Estimation by HW/SW Co-design

Author: Hashimoto, Nobuho and Takamaeda-Yamazaki, Shinya
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: 3D reconstruction from videos has become increasingly popular for various applications, including navigation for autonomous driving of robots and drones, augmented reality (AR), and 3D modeling. This task often combines traditional image/video processing algorithms and deep neural networks (DNNs). Although recent developments in deep learning have improved the accuracy of the task, the large number of calculations involved results in low computation speed and high power consumption. Although there are various domain-specific hardware accelerators for DNNs, it is not easy to accelerate the entire process of applications that alternate between traditional image/video processing algorithms and DNNs. Thus, FPGA-based end-to-end acceleration is required for such complicated applications in low-power embedded environments. This paper proposes a novel FPGA-based accelerator for DeepVideoMVS, a DNN-based depth estimation method for 3D reconstruction. We employ HW/SW co-design to appropriately utilize heterogeneous components in modern SoC FPGAs, such as programmable logic (PL) and CPU, according to the inherent characteristics of the method. As some operations are unsuitable for hardware implementation, we determine the operations to be implemented in software through analyzing the number of times each operation is performed and its memory access pattern, and then considering comprehensive aspects: the ease of hardware implementation and degree of expected acceleration by hardware. The hardware and software implementations are executed in parallel on the PL and CPU to hide their execution latencies. The proposed accelerator was developed on a Xilinx ZCU104 board by using NNgen, an open-source high-level synthesis (HLS) tool. Experiments showed that the proposed accelerator operates 60.2 times faster than the software-only implementation on the same FPGA board with minimal accuracy degradation., Comment: 9 pages, 8 figures, 3 tables, FPT 2022 (Full paper), Program: https://fpt22.hkust.edu.hk/program#tools, GitHub: https://github.com/casys-utokyo/fadec, Slides: https://speakerdeck.com/hashi0203/sw-co-design-fpt-2022-8082a83d-3167-461c-8560-60f77959a3d5, Movie: https://youtu.be/NFULXQeu6Vw, Profile: https://n-hassy.info
Published: 2022

14. LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning

Author: Yang, Je, Kim, JaeUk, and Kim, Joo-Young
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: Multi-agent reinforcement learning (MARL) is a powerful technology to construct interactive artificial intelligent systems in various applications such as multi-robot control and self-driving cars. Unlike supervised models or single-agent reinforcement learning, which actively exploit network pruning, it is unclear how pruning will work in multi-agent reinforcement learning with its cooperative and interactive characteristics. In this paper, we present a real-time sparse training acceleration system named LearningGroup, which adopts network pruning on the training of MARL for the first time with an algorithm/architecture co-design approach. We create sparsity using a weight grouping algorithm and propose an on-chip sparse data encoding loop (OSEL) that enables fast encoding with efficient implementation. Based on the OSEL's encoding format, LearningGroup performs efficient weight compression and computation workload allocation to multiple cores, where each core handles multiple sparse rows of the weight matrix simultaneously with vector processing units. As a result, the LearningGroup system reduces the cycle time and memory footprint for sparse data generation by up to 5.72x and 6.81x, respectively. Its FPGA accelerator shows 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W energy efficiency for various conditions in MARL, which are 7.13x higher and 12.43x more energy efficient than Nvidia Titan RTX GPU, thanks to the fully on-chip training and highly optimized dataflow/data format provided by FPGA. Most importantly, the accelerator shows speedup up to 12.52x for processing sparse data over the dense case, which is the highest among state-of-the-art sparse training accelerators., Comment: This paper will be published in IEEE International Conference on Field-Programmable Technology 2022
Published: 2022

15. Spatiotemporal 2-D Channel Coding for Very Low Latency Reliable MIMO Transmission

Author: You, Xiaohu, Zhang, Chuan, Sheng, Bin, Huang, Yongming, Ji, Chen, Shen, Yifei, Zhou, Wenyue, and Liu, Jian
Subjects: FOS: Computer and information sciences, Computer Science - Information Theory, Information Theory (cs.IT), Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: To fully support vertical industries, 5G and its corresponding channel coding are expected to meet requirements of different applications. However, for applications of 5G and beyond 5G (B5G) such as URLLC, the transmission latency is required to be much shorter than that in eMBB. Therefore, the resulting channel code length reduces drastically. In this case, the traditional 1-D channel coding suffers considerable performance degradation and fails to deliver strong reliability with very low latency. To remove this bottleneck, a new channel coding scheme beyond the existing 1-D one is urgently needed. By making full use of the spatial freedom of massive MIMO systems, this paper proposes a spatiotemporal 2-D channel coding for very low latency reliable transmission. For a very short time-domain code length $N^{\text{time}}=16$, a $64 \times 128$ MIMO system employing the proposed spatiotemporal 2-D coding scheme successfully shows more than $3$\,dB performance gain at $\text{FER}=10^{-3}$, compared to the 1-D time-domain channel coding. It is noted that the proposed coding scheme is suitable for different channel codes and enjoys high flexibility to adapt to different scenarios. By appropriately selecting the code rate, code length, and the number of codewords in the time and space domains, the proposed coding scheme can achieve a good trade-off between the transmission latency and reliability.
Published: 2022

16. Gradient descent-based programming of analog in-memory computing cores

Author: Büchel, Julian, Vasilopoulos, Athanasios, Kersting, Benedikt, Odermatt, Frederic, Brew, Kevin, Ok, Injo, Choi, Sam, Saraf, Iqbal, Chan, Victor, Philip, Timothy, Saulnier, Nicole, Narayanan, Vijay, Gallo, Manuel Le, and Sebastian, Abu
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: The precise programming of crossbar arrays of unit-cells is crucial for obtaining high matrix-vector-multiplication (MVM) accuracy in analog in-memory computing (AIMC) cores. We propose a radically different approach based on directly minimizing the MVM error using gradient descent with synthetic random input data. Our method significantly reduces the MVM error compared with conventional unit-cell by unit-cell iterative programming. It also eliminates the need for high-resolution analog-to-digital converters (ADCs) to read the small unit-cell conductance during programming. Our method improves the experimental inference accuracy of ResNet-9 implemented on two phase-change memory (PCM)-based AIMC cores by 1.26%.
Published: 2022

17. Design and Evaluation of a Rack-Scale Disaggregated Memory Architecture For Data Centers

Author: Puri, Amit, Jose, John, and Venkatesh, Tamarapalli
Subjects: FOS: Computer and information sciences, Computer Science - Distributed, Parallel, and Cluster Computing, Hardware Architecture (cs.AR), Distributed, Parallel, and Cluster Computing (cs.DC), Computer Science - Hardware Architecture
Abstract: Memory disaggregation is being considered as a strong alternative to traditional architecture to deal with the memory under-utilization in data centers. Disaggregated memory can adapt to dynamically changing memory requirements for data center applications such as data analytics and big data that require in-memory processing. However, such systems can face high remote memory access latency due to the interconnect speeds. In this paper, we explore a rack-scale disaggregated memory architecture and discuss the various design aspects. We design a trace-driven simulator that combines an event-based interconnect and a cycle-accurate memory simulator to evaluate the performance of a disaggregated memory system at the rack scale. Our study shows that not only the interconnect but also the contention in the remote memory queues adds significantly to remote memory access latency. We introduce a memory allocation policy to reduce the latency compared to the conventional policies. We conduct experiments using various benchmarks with diverse memory access patterns. Our study shows encouraging results towards the rack-scale memory disaggregation and acceptable average memory access latency.
Published: 2022

18. An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks

Author: Seshadri, Kiran, Akin, Berkin, Laudon, James, Narayanaswami, Ravi, and Yazdanbakhsh, Amir
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: Edge TPUs are a domain of accelerators for low-power, edge devices and are widely used in various Google products such as Coral and Pixel devices. In this paper, we first discuss the major microarchitectural details of Edge TPUs. Then, we extensively evaluate three classes of Edge TPUs, covering different computing ecosystems, that are either currently deployed in Google products or are in the product pipeline, across 423K unique convolutional neural networks. Building upon this extensive study, we discuss critical and interpretable microarchitectural insights about the studied classes of Edge TPUs. Mainly, we discuss how Edge TPU accelerators perform across convolutional neural networks with different structures. Finally, we present our ongoing efforts in developing high-accuracy learned machine learning models to estimate the major performance metrics of accelerators such as latency and energy consumption. These learned models enable significantly faster (in the order of milliseconds) evaluations of accelerators as an alternative to time-consuming cycle-accurate simulators and establish an exciting opportunity for rapid hardware/software co-design., Comment: 13 pages, 15 figures, 8 tables, published in IISWC 2022
Published: 2022

19. GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation

Author: Sykora, Ondrej, Phothilimthana, Phitchaya Mangpo, Mendis, Charith, and Yazdanbakhsh, Amir
Subjects: Performance (cs.PF), FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Performance, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: Analytical hardware performance models yield swift estimation of desired hardware performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of the target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic blocks across different microarchitectures. GRANITE uses a graph representation of basic blocks that captures both structural and data dependencies between instructions. This representation is processed using a graph neural network that takes advantage of the relational information captured in the graph and learns a rich neural representation of the basic block that allows more precise throughput estimation. Our results establish a new state-of-the-art for basic block performance estimation with an average test error of 6.9% across a wide range of basic blocks and microarchitectures for the x86-64 target. Compared to recent work, this reduced the error by 1.7% while improving training and inference throughput by approximately 3.0x. In addition, we propose the use of multi-task learning with independent multi-layer feed forward decoder networks. Our results show that this technique further improves precision of all learned models while significantly reducing per-microarchitecture training costs. We perform an extensive set of ablation studies and comparisons with prior work, concluding a set of methods to achieve high accuracy for basic block performance estimation., 13 pages; 5 figures; published at IISWC 2022; Included IEEE copyright
Published: 2022

20. BonnBot-I: A Precise Weed Management and Crop Monitoring Platform

Author: Ahmadi, Alireza, Halstead, Michael, and McCool, Chris
Subjects: Computer Science - Robotics, Computer Science - Multiagent Systems, Computer Science - Hardware Architecture, Electrical Engineering and Systems Science - Systems and Control
Abstract: Cultivation and weeding are two of the primary tasks performed by farmers today. A recent challenge for weeding is the desire to reduce herbicide and pesticide treatments while maintaining crop quality and quantity. In this paper we introduce BonnBot-I, a precise weed management platform which can also perform field monitoring. Driven by crop monitoring approaches which can accurately locate and classify plants (weed and crop), we further improve their performance by fusing the platform's available GNSS and wheel odometry. This improves tracking accuracy of our crop monitoring approach from a normalized average error of 8.3% to 3.5%, evaluated on a new publicly available corn dataset. We also present a novel arrangement of weeding tools mounted on linear actuators evaluated in simulated environments. We replicate weed distributions from a real field, using the results from our monitoring approach, and show the validity of our work-space division techniques which require significantly less movement (a 50% reduction) to achieve similar results. Overall, BonnBot-I is a significant step forward in precise weed management with a novel method of selectively spraying and controlling weeds in an arable field, Comment: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Published: 2022

21. Integral Sampler and Polynomial Multiplication Architecture for Lattice-based Cryptography

Author: Wang, Antian, Tan, Weihang, Parhi, Keshab K., and Lao, Yingjie
Subjects: FOS: Computer and information sciences, Computer Science - Cryptography and Security, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Cryptography and Security (cs.CR)
Abstract: With the surge of the powerful quantum computer, lattice-based cryptography proliferated the latest cryptography hardware implementation due to its resistance against quantum computers. Among the computational blocks of lattice-based cryptography, the random errors produced by the sampler play a key role in ensuring the security of these schemes. This paper proposes an integral architecture for the sampler, which can reduce the overall resource consumption by reusing the multipliers and adders within the modular polynomial computation. For instance, our experimental results show that the proposed design can effectively reduce the DSP usage of the discrete Ziggurat sampling method., 6 pages, accepted by 35th IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems
Published: 2022

22. Logic and Reduction Operation based Hardware Trojans in Digital Design

Author: Das, Mayukhmali, Dutta, Sounak, and Chatterjee, Sayan
Subjects: FOS: Computer and information sciences, Computer Science - Cryptography and Security, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Cryptography and Security (cs.CR)
Abstract: In this paper, we will demonstrate Hardware Trojan Attacks on four different digital designs implemented on FPGA. The hardware trojan is activated based on special logical and reduction-based operations on vectors which makes the trojan-activity as silent and effective as possible. In this paper, we have introduced 5 novel trojan attack methodologies., 2 pages, 2 figures, accepted in ISOCC 2022
Published: 2022

23. Hardware-in-the-Loop Simulation for Evaluating Communication Impacts on the Wireless-Network-Controlled Robots

Author: Lv, Honghao, Pang, Zhibo, Xiao, Ming, and Yang, Geng
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Robotics (cs.RO)
Abstract: More and more robot automation applications have changed to wireless communication, and network performance has a growing impact on robotic systems. This study proposes a hardware-in-the-loop (HiL) simulation methodology for connecting the simulated robot platform to real network devices. This project seeks to provide robotic engineers and researchers with the capability to experiment without heavily modifying the original controller and get more realistic test results that correlate with actual network conditions. We deployed this HiL simulation system in two common cases for wireless-network-controlled robotic applications: (1) safe multi-robot coordination for mobile robots, and (2) human-motion-based teleoperation for manipulators. The HiL simulation system is deployed and tested under various network conditions in all circumstances. The experiment results are analyzed and compared with the previous simulation methods, demonstrating that the proposed HiL simulation methodology can identify a more reliable communication impact on robot systems., Comment: 6 pages, 11 figures, to appear in 48th Annual Conference of the Industrial Electronics Society IECON 2022 Conference
Published: 2022

24. Real-Time Scheduling of Machine Learning Operations on Heterogeneous Neuromorphic SoC

Author: Das, Anup
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: Neuromorphic Systems-on-Chip (NSoCs) are becoming heterogeneous by integrating general-purpose processors (GPPs) and neural processing units (NPUs) on the same SoC. For embedded systems, an NSoC may need to execute user applications built using a variety of machine learning models. We propose a real-time scheduler, called PRISM, which can schedule machine learning models on a heterogeneous NSoC either individually or concurrently to improve their system performance. PRISM consists of the following four key steps. First, it constructs an interprocessor communication (IPC) graph of a machine learning model from a mapping and a self-timed schedule. Second, it creates a transaction order for the communication actors and embeds this order into the IPC graph. Third, it schedules the graph on an NSoC by overlapping communication with the computation. Finally, it uses a Hill Climbing heuristic to explore the design space of mapping operations on GPPs and NPUs to improve the performance. Unlike existing schedulers which use only the NPUs of an NSoC, PRISM improves performance by enabling batch, pipeline, and operation parallelism via exploiting a platform's heterogeneity. For use-cases with concurrent applications, PRISM uses a heuristic resource sharing strategy and a non-preemptive scheduling to reduce the expected wait time before concurrent operations can be scheduled on contending resources. Our extensive evaluations with 20 machine learning workloads show that PRISM significantly improves the performance per watt for both individual applications and use-cases when compared to state-of-the-art schedulers., To appear at MEMOCODE 2022
Published: 2022

25. HLS-based Optimization of Tau Triggering Algorithm for LHC: a case study

Author: Cherezova, Natalia, Mihhailov, Dmitri, Devadze, Sergei, and Jutman, Artur
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: With the current increase in the data produced by the Large Hadron Collider (LHC) at CERN, it becomes important to process this data in a corresponding manner. To begin with, it is necessary to efficiently select events that contain relevant information from a massive flow of data. This is the task of the tau lepton decay triggering algorithm. The implementation is based on the High-Level Synthesis (HLS) approach that allows generating a hardware description of the design from the algorithm written in a high-level programming language like C++. HLS tools are intended to decrease the time and complexity of hardware design development; however, their capabilities are limited. The development of an efficient application requires substantial knowledge of the hardware design and HLS specifics. This paper presents the optimizations introduced to the algorithm that improved latency and area and, more importantly, solved the problems with the routing, making it possible to implement the algorithm on the FPGA fabric., 6 pages, 5 figures, 2022 18th Biennial Baltic Electronics Conference (BEC)
Published: 2022

26. Gradient Backpropagation based Feature Attribution to Enable Explainable-AI on the Edge

Author: Bhat, Ashwin, Assoa, Adou Sangbone, and Raychowdhury, Arijit
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Hardware Architecture (cs.AR), Image and Video Processing (eess.IV), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: There has been a recent surge in the field of Explainable AI (XAI) which tackles the problem of providing insights into the behavior of black-box machine learning models. Within this field, feature attribution encompasses methods which assign relevance scores to input features and visualize them as a heatmap. Designing flexible accelerators for multiple such algorithms is challenging since the hardware mapping of these algorithms has not been studied yet. In this work, we first analyze the dataflow of gradient backpropagation based feature attribution algorithms to determine the resource overhead required over inference. The gradient computation is optimized to minimize the memory overhead. Second, we develop a High-Level Synthesis (HLS) based configurable FPGA design that is targeted for edge devices and supports three feature attribution algorithms. Tile based computation is employed to maximally use on-chip resources while adhering to the resource constraints. Representative CNNs are trained on CIFAR-10 dataset and implemented on multiple Xilinx FPGAs using 16-bit fixed-point precision demonstrating flexibility of our library. Finally, through efficient reuse of allocated hardware resources, our design methodology demonstrates a pathway to repurpose inference accelerators to support feature attribution with minimal overhead, thereby enabling real-time XAI on the edge., To appear in 30th IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC 2022)
Published: 2022

27. Hardware-Efficient Template-Based Deep CNNs Accelerator Design

Author: Alhussain, Azzam and Lin, Mingjie
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Image and Video Processing (eess.IV), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Hardware Architecture
Abstract: Acceleration of Convolutional Neural Network (CNN) on edge devices has recently achieved a remarkable performance in image classification and object detection applications. This paper proposes an efficient and scalable CNN-based SoC-FPGA accelerator design that takes pre-trained weights with a 16-bit fixed-point quantization and target hardware specification to generate an optimized template capable of achieving higher performance versus resource utilization trade-off. The template analyzed the computational workload, data dependency, and external memory bandwidth and utilized loop tiling transformation along with dataflow modeling to convert convolutional and fully connected layers into vector multiplication between input and output feature maps, which resulted in a single compute unit on-chip. Furthermore, the accelerator was examined among AlexNet, VGG16, and LeNet networks and ran at 200-MHz with a peak performance of 230 GOP/s depending on ZYNQ boards and state-space exploration of different compute unit configurations during simulation and synthesis. Lastly, our proposed methodology was benchmarked against the previous development on Ultra96 for higher performance measurement., 4 pages, 4 figures, 16th IEEE International Conference on Networking, Architecture, and Storage (IEEE NAS 2022), 3-4 October 2022
Published: 2022

28. Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design

Author: Fan Hongxiang, Thomas Chau, Stylianos Venieris, Royson Lee, Alexandros Kouris, Wayne Luk, Nicholas D. Lane, and Mohamed Abdelfattah
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Butterfly Sparsity, Hardware Architecture (cs.AR), Attention-based NNs, Computer Science - Hardware Architecture, FPGA, Machine Learning (cs.LG)
Abstract: Attention-based neural networks have become pervasive in many AI tasks. Despite their excellent algorithmic performance, the use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources, which often compromises their hardware performance. Although various sparse variants have been introduced, most approaches only focus on mitigating the quadratic scaling of attention on the algorithm level, without explicitly considering the efficiency of mapping their methods on real hardware designs. Furthermore, most efforts only focus on either the attention mechanism or the FFNs without jointly optimizing both parts, causing most of the current designs to lack scalability when dealing with different input lengths. This paper systematically considers the sparsity patterns in different variants from a hardware perspective. On the algorithmic level, we propose FABNet, a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs. On the hardware level, a novel adaptable butterfly accelerator is proposed that can be configured at runtime via dedicated hardware control to accelerate different butterfly layers using a single unified hardware engine. On the Long-Range-Arena dataset, FABNet achieves the same accuracy as the vanilla Transformer while reducing the amount of computation by 10 to 66 times and the number of parameters by 2 to 22 times. By jointly optimizing the algorithm and hardware, our FPGA-based butterfly accelerator achieves 14.2 to 23.2 times speedup over state-of-the-art accelerators normalized to the same computational budget. Compared with optimized CPU and GPU designs on Raspberry Pi 4 and Jetson Nano, our system is up to 273.8 and 15.1 times faster under the same power budget., Comment: Paper accepted by MICRO'22
Published: 2022

29. DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Author: Hong, Seongmin, Moon, Seungjae, Kim, Junsoo, Lee, Sungjae, Kim, Minsub, Lee, Dongsoo, and Kim, Joo-Young
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Hardware Architecture (cs.AR), FOS: Electrical engineering, electronic engineering, information engineering, Systems and Control (eess.SY), Computer Science - Hardware Architecture, Electrical Engineering and Systems Science - Systems and Control, Machine Learning (cs.LG)
Abstract: Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which needs the processing of a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. The conventional platforms such as GPU are specialized for the parallel processing of large inputs in the summarization stage, but their performance significantly degrades in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages. DFX uses model parallelism and optimized dataflow that is model-and-hardware-aware for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters., Comment: Extension of HOTCHIPS 2022 and accepted in MICRO 2022
Published: 2022

30. LEAPER: Fast and Accurate FPGA-based System Performance Prediction via Transfer Learning

Author: Singh, Gagandeep, Diamantopoulos, Dionysios, Gómez-Luna, Juan, Stuijk, Sander, Corporaal, Henk, and Mutlu, Onur
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: Machine learning has recently gained traction as a way to overcome the slow accelerator generation and implementation process on an FPGA. It can be used to build performance and resource usage models that enable fast early-stage design space exploration. However, such models face three limitations. First, training requires large amounts of data (features extracted from design synthesis and implementation tools), which is cost-inefficient because of the time-consuming accelerator design and implementation process. Second, a model trained for a specific environment cannot predict performance or resource usage for a new, unknown environment. In a cloud system, renting a platform for data collection to build an ML model can significantly increase the total cost of ownership (TCO) of a system. Third, ML-based models trained using a limited number of samples are prone to overfitting. To overcome these limitations, we propose LEAPER, a transfer learning-based approach for prediction of performance and resource usage in FPGA-based systems. The key idea of LEAPER is to transfer an ML-based performance and resource usage model trained for a low-end edge environment to a new, high-end cloud environment to provide fast and accurate predictions for accelerator implementation. Experimental results show that LEAPER (1) provides, on average across six workloads and five FPGAs, 85% accuracy when we use our transferred model for prediction in a cloud environment with 5-shot learning and (2) reduces design-space exploration time for accelerator implementation on an FPGA by 10x, from days to only a few hours.
Published: 2022

31. HiRA: Hidden Row Activation for Reducing Refresh Latency of Off-the-Shelf DRAM Chips

Author: Yağlıkçı, Abdullah Giray, Olgun, Ataberk, Patel, Minesh, Luo, Haocong, Hassan, Hasan, Orosa, Lois, Ergin, Oğuz, and Mutlu, Onur
Subjects: FOS: Computer and information sciences, Computer Science - Cryptography and Security, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Cryptography and Security (cs.CR)
Abstract: DRAM is the building block of modern main memory systems. DRAM cells must be periodically refreshed to prevent data loss. Refresh operations degrade system performance by interfering with memory accesses. As DRAM chip density increases with technology node scaling, refresh operations also increase because: 1) the number of DRAM rows in a chip increases; and 2) DRAM cells need additional refresh operations to mitigate bit failures caused by RowHammer, a failure mechanism that becomes worse with technology node scaling. Thus, it is critical to enable refresh operations at low performance overhead. To this end, we propose a new operation, Hidden Row Activation (HiRA), and the HiRA Memory Controller (HiRA-MC). HiRA hides a refresh operation's latency by refreshing a row concurrently with accessing or refreshing another row within the same bank. Unlike prior works, HiRA achieves this parallelism without any modifications to off-the-shelf DRAM chips. To do so, it leverages the new observation that two rows in the same bank can be activated without data loss if the rows are connected to different charge restoration circuitry. We experimentally demonstrate on 56 real off-the-shelf DRAM chips that HiRA can reliably parallelize a DRAM row's refresh operation with refresh or activation of any of the 32% of the rows within the same bank. By doing so, HiRA reduces the overall latency of two refresh operations by 51.4%. HiRA-MC modifies the memory request scheduler to perform HiRA when a refresh operation can be performed concurrently with a memory access or another refresh. Our system-level evaluations show that HiRA-MC increases system performance by 12.6% and 3.73x as it reduces the performance degradation due to periodic refreshes and refreshes for RowHammer protection (preventive refreshes), respectively, for future DRAM chips with increased density and RowHammer vulnerability., Comment: To appear in the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022
Published: 2022

32. Revisiting Residue Codes for Modern Memories

Author: Evgeny Manzhosov, Adam Hastings, Meghna Pancholi, Ryan Piersma, Mohamed Tarek Ibn Ziad, and Simha Sethumadhavan
Subjects: FOS: Computer and information sciences, Hardware and Architecture, Hardware Architecture (cs.AR), Electrical and Electronic Engineering, Computer Science - Hardware Architecture, Software
Abstract: Residue codes have been traditionally used for compute error correction rather than storage error correction. In this paper, we use these codes for storage error correction with surprising results. We find that adapting residue codes to modern memory systems offers a level of error correction comparable to traditional schemes such as Reed-Solomon with fewer bits of storage. For instance, our adaptation of residue code -- MUSE ECC -- can offer ChipKill protection using approximately 30% fewer bits. We show that the storage gains can be used to hold metadata needed for emerging security functionality such as memory tagging or to provide better detection capabilities against Rowhammer attacks. Our evaluation shows that memory tagging in a MUSE-enabled system shows a 12% reduction in memory bandwidth utilization while providing the same level of error correction as a traditional ECC baseline without a noticeable loss of performance. Thus, our work demonstrates a new, flexible primitive for co-designing reliability with security and performance.
Published: 2022

33. Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

Author: Abdelkhalik, Hamdy, Arafa, Yehia, Santhi, Nandakishore, and Badawy, Abdel-Hameed
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: Graphics processing units (GPUs) are now considered the leading hardware to accelerate general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitecture features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and build more efficient workloads and applications. Many works have studied the recent Nvidia architectures, such as Volta and Turing, comparing them to their successor, Ampere. However, some microarchitecture features, such as the clock cycles for the different instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction with various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. Using microbenchmarks, we measure the clock cycles for PTX ISA instructions and their SASS ISA instruction counterparts. We further calculate the clock cycles needed to access each memory unit. We also demystify the new version of the tensor core unit found in the Ampere architecture by using the WMMA API and measuring its clock cycles per instruction and throughput for the different data types and input shapes. The results found in this work should guide software developers and hardware architects. Furthermore, the clock cycles per instruction are widely used by performance modeling simulators and tools to model and predict the performance of the hardware.
Published: 2022

34. An FPGA framework for Interferometric Vision-Based Navigation (iVisNav)

Author: Bhaskara, Ramchander Rao, Sung, Kookjin, and Majji, Manoranjan
Subjects: Signal Processing (eess.SP), FOS: Computer and information sciences, Computer Science - Robotics, Hardware Architecture (cs.AR), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Signal Processing, Computer Science - Hardware Architecture, Robotics (cs.RO)
Abstract: Interferometric Vision-Based Navigation (iVisNav) is a novel optoelectronic sensor for autonomous proximity operations. iVisNav employs laser emitting structured beacons and precisely characterizes six degrees of freedom relative motion rates by measuring changes in the phase of the transmitted laser pulses. iVisNav's embedded package must efficiently process high frequency dynamics for robust sensing and estimation. A new embedded system for least squares-based rate estimation is developed in this paper. The resulting system is capable of interfacing with the photonics and implementing the estimation algorithm in a field-programmable gate array. The embedded package is shown to be a hardware/software co-design handling the estimation procedure using finite precision arithmetic for high-speed computation. The accuracy of the finite precision FPGA hardware design is compared with the floating-point software evaluation of the algorithm on MATLAB to benchmark its performance and statistical consistency with the error measures. Implementation results demonstrate the utility of FPGA computing capabilities for high-speed proximity navigation using iVisNav., Replacement comments: 1. added information on timing and latency, 2. corrected error percentages in results, 3. corrected typos, 4. Plots and results unchanged
Published: 2022

35. MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V Cores

Author: Bertaccini, Luca, Paulin, Gianna, Fischer, Tim, Mach, Stefan, and Benini, Luca
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: Low-precision formats have recently driven major breakthroughs in neural network (NN) training and inference by reducing the memory footprint of the NN models and improving the energy efficiency of the underlying hardware architectures. Narrow integer data types have been vastly investigated for NN inference and have successfully been pushed to the extreme of ternary and binary representations. In contrast, most training-oriented platforms use at least 16-bit floating-point (FP) formats. Lower-precision data types such as 8-bit FP formats and mixed-precision techniques have only recently been explored in hardware implementations. We present MiniFloat-NN, a RISC-V instruction set architecture extension for low-precision NN training, providing support for two 8-bit and two 16-bit FP formats and expanding operations. The extension includes sum-of-dot-product instructions that accumulate the result in a larger format and three-term additions in two variations: expanding and non-expanding. We implement an ExSdotp unit to efficiently support in hardware both instruction types. The fused nature of the ExSdotp module prevents precision losses generated by the non-associativity of two consecutive FP additions while saving around 30% of the area and critical path compared to a cascade of two expanding fused multiply-add units. We replicate the ExSdotp module in a SIMD wrapper and integrate it into an open-source floating-point unit, which, coupled to an open-source RISC-V core, lays the foundation for future scalable architectures targeting low-precision and mixed-precision NN training. A cluster containing eight extended cores sharing a scratchpad memory, implemented in 12 nm FinFET technology, achieves up to 575 GFLOPS/W when computing FP8-to-FP16 GEMMs at 0.8 V, 1.26 GHz., This work has been submitted to the ARITH22 - IEEE Symposium on Computer Arithmetic for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. 8 pages
Published: 2022
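
Note: the abstract above attributes precision loss to the non-associativity of two consecutive FP additions. The sketch below illustrates that effect only. It uses ordinary Python doubles rather than the paper's 8-bit and 16-bit formats, and the function names are hypothetical, not taken from the paper. Exact rational arithmetic with a single final rounding stands in for the behavior of a fused unit.

```python
from fractions import Fraction


def left_to_right(x, y, z):
    # Two consecutive roundings: (x + y) is rounded, then (result + z) is rounded.
    return (x + y) + z


def right_to_left(x, y, z):
    # Same three operands, different grouping, so a different intermediate rounding.
    return x + (y + z)


def single_rounding(x, y, z):
    # Illustrative stand-in for a fused three-term addition: sum exactly,
    # then round once when converting back to a float.
    return float(Fraction(x) + Fraction(y) + Fraction(z))


if __name__ == "__main__":
    operands = (1e16, -1e16, 1.0)
    print(left_to_right(*operands))
    print(right_to_left(*operands))
    print(single_rounding(*operands))
```

With these operands the two groupings disagree, because the small term is absorbed when it is first added to the large one. The single-rounding version does not depend on the grouping.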

36. EQUAL: Improving the Fidelity of Quantum Annealers by Injecting Controlled Perturbations

Author: Ramin Ayanzadeh, Poulami Das, Swamit Tannu, and Moinuddin Qureshi
Subjects: FOS: Computer and information sciences, Quantum Physics, Emerging Technologies (cs.ET), Hardware Architecture (cs.AR), FOS: Physical sciences, Computer Science - Emerging Technologies, Quantum Physics (quant-ph), Computer Science - Hardware Architecture
Abstract: Quantum computing is an information processing paradigm that uses quantum-mechanical properties to speed up computationally hard problems. Although promising, existing gate-based quantum computers consist of only a few dozen qubits and are not large enough for most applications. On the other hand, existing quantum annealers (QAs) with a few thousand qubits have the potential to solve some domain-specific optimization problems. QAs are single instruction machines and to execute a program, the problem is cast to a Hamiltonian, embedded on the hardware, and a single quantum machine instruction (QMI) is run. Unfortunately, noise and imperfections in hardware result in sub-optimal solutions on QAs even if the QMI is run for thousands of trials. The limited programmability of QAs means that the user executes the same QMI for all trials. This subjects all trials to a similar noise profile throughout the execution, resulting in a systematic bias. We observe that systematic bias leads to sub-optimal solutions and cannot be alleviated by executing more trials or using existing error-mitigation schemes. To address this challenge, we propose EQUAL (Ensemble Quantum Annealing). EQUAL generates an ensemble of QMIs by adding controlled perturbations to the program QMI. When executed on the QA, the ensemble of QMIs steers the program away from encountering the same bias during all trials and thus, improves the quality of solutions. Our evaluations using the 2041-qubit D-Wave QA show that EQUAL bridges the difference between the baseline and the ideal by an average of 14% (and up to 26%), without requiring any additional trials. EQUAL can be combined with existing error mitigation schemes to further bridge the difference between the baseline and ideal by an average of 55% (and up to 68%).
Published: 2022

37. Chameleon Cache: Approximating Fully Associative Caches with Random Replacement to Prevent Contention-Based Cache Attacks

Author: Unterluggauer, Thomas, Harris, Austin, Constable, Scott, Liu, Fangfei, and Rozas, Carlos
Subjects: FOS: Computer and information sciences, Computer Science - Cryptography and Security, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Cryptography and Security (cs.CR)
Abstract: Randomized, skewed caches (RSCs) such as CEASER-S have recently received much attention to defend against contention-based cache side channels. By randomizing and regularly changing the mapping(s) of addresses to cache sets, these techniques are designed to obfuscate the leakage of memory access patterns. However, new attack techniques, e.g., Prime+Prune+Probe, soon demonstrated the limits of RSCs as they allow attackers to more quickly learn which addresses contend in the cache and use this information to circumvent the randomization. To still maintain side-channel resilience, RSCs must change the random mapping(s) more frequently with adverse effects on performance and implementation complexity. This work aims to make randomization-based approaches more robust to allow for reduced re-keying rates and presents Chameleon Cache. Chameleon Cache extends RSCs with a victim cache (VC) to decouple contention in the RSC from evictions observed by the user. The VC allows Chameleon Cache to make additional use of the multiple mappings RSCs provide to translate addresses to cache set indices: when a cache line is evicted from the RSC to the VC under one of its mappings, the VC automatically reinserts this evicted line back into the RSC by using a different mapping. As a result, the effects of previous RSC set contention are hidden and Chameleon Cache exhibits side-channel resistance and eviction patterns similar to fully associative caches with random replacement. We show that Chameleon Cache has performance overheads of < 1% and stress that VCs are more generally helpful to increase side-channel resistance and re-keying intervals of randomized caches., 12 pages, 9 figures, 6 algorithms, 1 table
Published: 2022

38. EmuNoC: Hybrid Emulation for Fast and Flexible Network-on-Chip Prototyping on FPGAs

Author: Tan, Yee Yang, Staudigl, Felix, Jünger, Lukas, Drewes, Anna, Leupers, Rainer, and Joseph, Jan Moritz
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: Networks-on-Chips (NoCs) recently became widely used, from multi-core CPUs to edge-AI accelerators. Emulation on FPGAs promises to accelerate their RTL modeling compared to slow simulations. However, realistic test stimuli are challenging to generate in hardware for diverse applications. In other words, a design framework that is both fast and flexible is required. The most promising solution is hybrid emulation, in which parts of the design are simulated in software, and the other parts are emulated in hardware. This paper proposes a novel hybrid emulation framework called EmuNoC. We introduce a clock-synchronization method and software-only packet generation that improves the emulation speed by 36.3x to 79.3x over state-of-the-art frameworks while retaining the flexibility of a pure-software interface for stimuli simulation. We also increased the area efficiency to model an NoC with up to 169 routers on a single FPGA, while previous frameworks only achieved 64 routers.
Published: 2022

39. RAD-Sim: Rapid Architecture Exploration for Novel Reconfigurable Acceleration Devices

Author: Boutros, Andrew, Nurvitadhi, Eriko, and Betz, Vaughn
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: With the continued growth in field-programmable gate array (FPGA) capacity and their incorporation into new environments such as datacenters, we have witnessed the introduction of a new class of reconfigurable acceleration devices (RADs) that go beyond conventional FPGA architectures. These devices combine a reconfigurable fabric with coarse-grained domain-specialized accelerator blocks all connected via a high-performance packet-switched network-on-chip (NoC) for efficient system-wide communication. However, we lack the tools necessary to efficiently explore the huge design space for RADs, study the complex interactions between their different components and evaluate various combinations of design choices. In this work, we develop RAD-Sim, a cycle-level architecture simulator that allows rapid application-driven exploration of the design space of novel RADs. To showcase the capabilities of RAD-Sim, we map and simulate a state-of-the-art deep learning (DL) inference overlay on a RAD instance incorporating an FPGA fabric and a complex of hard matrix-vector multiplication engines, communicating over a system-wide NoC. Through this example, we show how RAD-Sim can help architects quantify the effect of changing specific architecture parameters on end-to-end application performance., Published in the 2022 proceedings of the International Conference on Field-Programmable Logic and Applications (FPL)
Published: 2022

40. CoNLoCNN: Exploiting Correlation and Non-Uniform Quantization for Energy-Efficient Low-precision Deep Convolutional Neural Networks

Author: Muhammad Abdullah Hanif, Giuseppe Maria Sarda, Alberto Marchisio, Guido Masera, Maurizio Martina, and Muhammad Shafique
Subjects: Energy consumption, FOS: Computer and information sciences, Computer Science - Machine Learning, Hardware Architecture (cs.AR), Deep learning, Quantization (signal), Correlation, Costs, Neural networks, Energy resolution, Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: In today's era of smart cyber-physical systems, Deep Neural Networks (DNNs) have become ubiquitous due to their state-of-the-art performance in complex real-world applications. The high computational complexity of these networks, which translates to increased energy consumption, is the foremost obstacle towards deploying large DNNs in resource-constrained systems. Fixed-Point (FP) implementations achieved through post-training quantization are commonly used to curtail the energy consumption of these networks. However, the uniform quantization intervals in FP restrict the bit-width of data structures to large values due to the need to represent most of the numbers with sufficient resolution and avoid high quantization errors. In this paper, we leverage the key insight that (in most of the scenarios) DNN weights and activations are mostly concentrated near zero and only a few of them have large magnitudes. We propose CoNLoCNN, a framework to enable energy-efficient low-precision deep convolutional neural network inference by exploiting: (1) non-uniform quantization of weights enabling simplification of complex multiplication operations; and (2) correlation between activation values enabling partial compensation of quantization errors at low cost without any run-time overheads. To significantly benefit from non-uniform quantization, we also propose a novel data representation format, Encoded Low-Precision Binary Signed Digit, to compress the bit-width of weights while ensuring direct use of the encoded weight for processing using a novel multiply-and-accumulate (MAC) unit design., Comment: 8 pages, 15 figures, 2 tables
Published: 2022

41. LOSTIN: Logic Optimization via Spatio-Temporal Information with Hybrid Graph Models

Author: Wu, Nan, Lee, Jiwon, Xie, Yuan, and Hao, Cong
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: Despite the strides made by machine learning (ML) based performance modeling, two major concerns that may impede production-ready ML applications in EDA are stringent accuracy requirements and generalization capability. To this end, we propose hybrid graph neural network (GNN) based approaches towards highly accurate quality-of-result (QoR) estimations with great generalization capability, specifically targeting logic synthesis optimization. The key idea is to simultaneously leverage spatio-temporal information from hardware designs and logic synthesis flows to forecast performance (i.e., delay/area) of various synthesis flows on different designs. The structural characteristics inside hardware designs are distilled and represented by GNNs; the temporal knowledge (i.e., relative ordering of logic transformations) in synthesis flows can be imposed on hardware designs by combining a virtually added supernode or a sequence processing model with conventional GNN models. Evaluation on 3.3 million data points shows that the testing mean absolute percentage error (MAPE) on designs seen and unseen during training are no more than 1.2% and 3.1%, respectively, which are 7-15X lower than existing studies.
Published: 2022
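
Note: the abstract above reports accuracy as mean absolute percentage error (MAPE). As a reminder of how that metric is conventionally defined, here is a minimal sketch. The function name and the sample values are illustrative assumptions, not data from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes both sequences have the same non-zero length and that no
    actual value is zero (the metric is undefined in that case).
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical delay values: ground truth versus model estimates.
    measured = [100.0, 200.0]
    estimated = [99.0, 204.0]
    print(mape(measured, estimated))  # roughly 1.5 (percent)
```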

42. PiDRAM: An FPGA-based Framework for End-to-end Evaluation of Processing-in-DRAM Techniques

Author: Olgun, Ataberk, Luna, Juan Gomez, Kanellopoulos, Konstantinos, Salami, Behzad, Hassan, Hasan, Ergin, Oguz, and Mutlu, Onur
Subjects: FOS: Computer and information sciences, Hardware_MEMORYSTRUCTURES, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: DRAM-based main memory is used in nearly all computing systems as a major component. One way of overcoming the main memory bottleneck is to move computation near memory, a paradigm known as processing-in-memory (PiM). Recent PiM techniques provide a promising way to improve the performance and energy efficiency of existing and future systems at no additional DRAM hardware cost. We develop the Processing-in-DRAM (PiDRAM) framework, the first flexible, end-to-end, and open source framework that enables system integration studies and evaluation of real PiM techniques using real DRAM chips. We demonstrate a prototype of PiDRAM on an FPGA-based platform (Xilinx ZC706) that implements an open-source RISC-V system (Rocket Chip). To demonstrate the flexibility and ease of use of PiDRAM, we implement two PiM techniques: (1) RowClone, an in-DRAM copy and initialization mechanism (using command sequences proposed by ComputeDRAM), and (2) D-RaNGe, an in-DRAM true random number generator based on DRAM activation-latency failures. Our end-to-end evaluation of RowClone shows up to 14.6X speedup for copy and 12.6X speedup for initialization operations over CPU copy (i.e., conventional memcpy) and initialization (i.e., conventional calloc) operations. Our implementation of D-RaNGe provides high throughput true random numbers, reaching 8.30 Mb/s throughput. Over the Verilog and C++ basis provided by PiDRAM, implementing the required hardware and software components, implementing RowClone end-to-end takes 198 (565) and implementing D-RaNGe end-to-end takes 190 (78) lines of Verilog (C++) code. PiDRAM is open sourced on GitHub: https://github.com/CMU-SAFARI/PiDRAM., To appear in ISVLSI 2022 Special Session on Processing in Memory. arXiv admin note: text overlap with arXiv:2111.00082
Published: 2022

43. Heterogeneous Data-Centric Architectures for Modern Data-Intensive Applications: Case Studies in Machine Learning and Databases

Author: Oliveira, Geraldo F., Boroumand, Amirali, Ghose, Saugata, Gómez-Luna, Juan, and Mutlu, Onur
Subjects: Computer Science - Machine Learning, Computer Science - Databases, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Hardware Architecture
Abstract: Today's computing systems require moving data back-and-forth between computing resources (e.g., CPUs, GPUs, accelerators) and off-chip main memory so that computation can take place on the data. Unfortunately, this data movement is a major bottleneck for system performance and energy consumption. One promising execution paradigm that alleviates the data movement bottleneck in modern and emerging applications is processing-in-memory (PIM), where the cost of data movement to/from main memory is reduced by placing computation capabilities close to memory. Naively employing PIM to accelerate data-intensive workloads can lead to sub-optimal performance due to the many design constraints PIM substrates impose. Therefore, many recent works co-design specialized PIM accelerators and algorithms to improve performance and reduce the energy consumption of (i) applications from various application domains; and (ii) various computing environments, including cloud systems, mobile systems, and edge devices. We showcase the benefits of co-designing algorithms and hardware in a way that efficiently takes advantage of the PIM paradigm for two modern data-intensive applications: (1) machine learning inference models for edge devices and (2) hybrid transactional/analytical processing databases for cloud systems. We follow a two-step approach in our system design. In the first step, we extensively analyze the computation and memory access patterns of each application to gain insights into its hardware/software requirements and major sources of performance and energy bottlenecks in processor-centric systems. In the second step, we leverage the insights from the first step to co-design algorithms and hardware accelerators to enable high-performance and energy-efficient data-centric architectures for each application.
Published: 2022

44. TNN7: A Custom Macro Suite for Implementing Highly Optimized Designs of Neuromorphic TNNs

Author: Nair, Harideep, Vellaisamy, Prabhu, Bhasuthkar, Santha, and Shen, John Paul
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Emerging Technologies (cs.ET), Hardware Architecture (cs.AR), Computer Science - Emerging Technologies, Computer Science - Neural and Evolutionary Computing, Neural and Evolutionary Computing (cs.NE), Computer Science - Hardware Architecture, Machine Learning (cs.LG)
Abstract: Temporal Neural Networks (TNNs), inspired by the mammalian neocortex, exhibit energy-efficient online sensory processing capabilities. Recent works have proposed a microarchitecture framework for implementing TNNs and demonstrated competitive performance on vision and time-series applications. Building on these previous works, this work proposes TNN7, a suite of nine highly optimized custom macros developed using a predictive 7nm Process Design Kit (PDK), to enhance the efficiency, modularity and flexibility of the TNN design framework. TNN prototypes for two applications are used for evaluation of TNN7. An unsupervised time-series clustering TNN delivering competitive performance can be implemented within 40 uW power and 0.05 mm^2 area, while a 4-layer TNN that achieves an MNIST error rate of 1% consumes only 18 mW and 24.63 mm^2. On average, the proposed macros reduce power, delay, area, and energy-delay product by 14%, 16%, 28%, and 45%, respectively. Furthermore, employing TNN7 significantly reduces the synthesis runtime of TNN designs (by more than 3x), allowing for highly-scaled TNN implementations to be realized., To be published in ISVLSI 2022
Published: 2022

45. Automatic datapath optimization using e-graphs

Author: Coward, Samuel, Constantinides, George A., Drane, Theo, and Intel Corporation
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: Manual optimization of Register Transfer Level (RTL) datapath is commonplace in industry but holds back development as it can be very time consuming. We utilize the fact that a complex transformation of one RTL into another equivalent RTL can be broken down into a sequence of smaller, localized transformations. By representing RTL as a graph and deploying modern graph rewriting techniques we can automate the circuit design space exploration, allowing us to discover functionally equivalent but optimized architectures. We demonstrate that modern rewriting frameworks can adequately capture a wide variety of complex optimizations performed by human designers on bit-vector manipulating code, including significant error-prone subtleties regarding the validity of transformations under complex interactions of bitwidths. The proposed automated optimization approach is able to reproduce the results of typical industrial manual optimization, resulting in a reduction in circuit area by up to 71%. Not only does our tool discover optimized RTL, but also correctly identifies that the optimal architecture to implement a given arithmetic expression can depend on the width of the operands, thus producing a library of optimized designs rather than the single design point typically generated by manual optimization. In addition, we demonstrate that prior academic work on maximally exploiting carry-save representation and on multiple constant multiplication are both generalized and extended, falling out as special cases of this paper.
Published: 2022

46. A Flexible HLS Hoeffding Tree Implementation for Runtime Learning on FPGA

Author: Luis Miguel Sousa, Nuno Paulino, Joao Canas Ferreira, and Joao Bispo
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture
Abstract: Decision trees are often preferred when implementing Machine Learning in embedded systems for their simplicity and scalability. Hoeffding Trees are a type of Decision Trees that take advantage of the Hoeffding Bound to allow them to learn patterns in data without having to continuously store the data samples for future reprocessing. This makes them especially suitable for deployment on embedded devices. In this work we highlight the features of an HLS implementation of the Hoeffding Tree. The implementation parameters include the feature size of the samples (D), the number of output classes (K), and the maximum number of nodes to which the tree is allowed to grow (Nd). We target a Xilinx MPSoC ZCU102, and evaluate: the design's resource requirements and clock frequency for different numbers of classes and feature size, the execution time on several synthetic datasets of varying sample sizes (N), number of output classes and the execution time and accuracy for two datasets from UCI. For a problem size of D3, K5, and N40000, a single decision tree operating at 103MHz is capable of 8.3x faster inference than the 1.2GHz ARM Cortex-A53 core. Compared to a reference implementation of the Hoeffding tree, we achieve comparable classification accuracy for the UCI datasets.
Published: 2022
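
Note: the abstract above relies on the Hoeffding Bound to decide when enough samples have been seen to split a node. The sketch below shows the standard form of that bound as it is commonly used in Hoeffding tree learners. It is not the paper's HLS implementation, and the function and parameter names are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range, delta, n_samples):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean observed
    over n_samples independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n_samples))


def should_split(best_gain, second_gain, value_range, delta, n_samples):
    # Split once the observed advantage of the best attribute over the
    # runner-up exceeds the bound, so the ranking is unlikely to change.
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n_samples)


if __name__ == "__main__":
    # Hypothetical numbers: gain range 1.0, confidence parameter 1e-7.
    for n in (100, 1000, 10000):
        print(n, hoeffding_bound(1.0, 1e-7, n))
    print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))
```

Because the bound shrinks as the sample count grows, a node only needs to keep running statistics rather than the samples themselves. This matches the abstract's point about not storing data for reprocessing.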

47. e-G2C: A 0.14-to-8.31 µJ/Inference NN-based Processor with Continuous On-chip Adaptation for Anomaly Detection and ECG Conversion from EGM

Author: Zhao, Yang, Zhang, Yongan, Fu, Yonggan, Ouyang, Xu, Wan, Cheng, Wu, Shang, Banta, Anton, John, Mathews M., Post, Allison, Razavi, Mehdi, Cavallaro, Joseph, Aazhang, Behnaam, and Lin, Yingyan
Subjects: Signal Processing (eess.SP), FOS: Computer and information sciences, Hardware Architecture (cs.AR), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Signal Processing, Computer Science - Hardware Architecture
Abstract: This work presents the first silicon-validated dedicated EGM-to-ECG (G2C) processor, dubbed e-G2C, featuring continuous lightweight anomaly detection, event-driven coarse/precise conversion, and on-chip adaptation. e-G2C utilizes neural network (NN) based G2C conversion and integrates 1) an architecture supporting anomaly detection and coarse/precise conversion via time multiplexing to balance the effectiveness and power, 2) an algorithm-hardware co-designed vector-wise sparsity resulting in a 1.6-1.7× speedup, 3) hybrid dataflows for enhancing near 100% utilization for normal/depth-wise(DW)/point-wise(PW) convolutions (Convs), and 4) an on-chip detection threshold adaptation engine for continuous effectiveness. The achieved 0.14-8.31 µJ/inference energy efficiency outperforms prior arts under similar complexity, promising real-time detection/conversion and possibly life-critical interventions., Comment: Accepted by 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits)
Published: 2022

48. STBPU: A Reasonably Secure Branch Prediction Unit

Author: Zhang, Tao, Lesch, Timothy, Koltermann, Kenneth, and Evtyushkin, Dmitry
Subjects: FOS: Computer and information sciences, Computer Science - Cryptography and Security, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture, Cryptography and Security (cs.CR)
Abstract: Modern processors have suffered a deluge of threats exploiting branch instruction collisions inside the branch prediction unit (BPU), from eavesdropping on secret-related branch operations to triggering malicious speculative executions. Protecting branch predictors tends to be challenging from both security and performance perspectives. For example, partitioning or flushing BPU can stop certain collision-based exploits but only to a limited extent. Meanwhile, such mitigations negatively affect branch prediction accuracy and further CPU performance. This paper proposes Secret Token Branch Prediction Unit (STBPU), a secure BPU design to defend against collision-based transient execution attacks and BPU side channels while incurring minimal performance overhead. STBPU resolves the challenges above by customizing data representation inside BPU for each software entity requiring isolation. In addition, to prevent an attacker from using brute force techniques to trigger malicious branch instruction collisions, STBPU actively monitors the prediction-related events and preemptively changes BPU data representation., Comment: 14 pages
Published: 2022

49. Understanding RowHammer Under Reduced Wordline Voltage: An Experimental Study Using Real DRAM Devices

Author: Yağlıkçı, A. Giray, Luo, Haocong, de Oliveira, Geraldo F., Olgun, Ataberk, Patel, Minesh, Park, Jisung, Hassan, Hasan, Kim, Jeremie S., Orosa, Lois, and Mutlu, Onur
Subjects: FOS: Computer and information sciences, Hardware_MEMORYSTRUCTURES, Computer Science - Cryptography and Security, Hardware Architecture (cs.AR), Hardware_INTEGRATEDCIRCUITS, Hardware_PERFORMANCEANDRELIABILITY, Computer Science - Hardware Architecture, Cryptography and Security (cs.CR)
Abstract: RowHammer is a circuit-level DRAM vulnerability, where repeatedly activating and precharging a DRAM row, and thus alternating the voltage of a row's wordline between low and high voltage levels, can cause bit flips in physically nearby rows. Recent DRAM chips are more vulnerable to RowHammer: with technology node scaling, the minimum number of activate-precharge cycles to induce a RowHammer bit flip reduces and the RowHammer bit error rate increases. Therefore, it is critical to develop effective and scalable approaches to protect modern DRAM systems against RowHammer. To enable such solutions, it is essential to develop a deeper understanding of the RowHammer vulnerability of modern DRAM chips. However, even though the voltage toggling on a wordline is a key determinant of RowHammer vulnerability, no prior work experimentally demonstrates the effect of wordline voltage (VPP) on the RowHammer vulnerability. Our work closes this gap in understanding. This is the first work to experimentally demonstrate on 272 real DRAM chips that lowering VPP reduces a DRAM chip's RowHammer vulnerability. We show that lowering VPP 1) increases the number of activate-precharge cycles needed to induce a RowHammer bit flip by up to 85.8% with an average of 7.4% across all tested chips and 2) decreases the RowHammer bit error rate by up to 66.9% with an average of 15.2% across all tested chips. At the same time, reducing VPP marginally worsens a DRAM cell's access latency, charge restoration, and data retention time within the guardbands of system-level nominal timing parameters for 208 out of 272 tested chips. We conclude that reducing VPP is a promising strategy for reducing a DRAM chip's RowHammer vulnerability without requiring modifications to DRAM chips., To appear in DSN 2022
Published: 2022

50. DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

Author: Lois Orosa, Juan Gómez-Luna, Nandita Vijaykumar, Geraldo F. Oliveira, Saugata Ghose, Mohammad Sadrosadati, Onur Mutlu, and Ivan Fernandez
Subjects: Instruction prefetch, FOS: Computer and information sciences, General Computer Science, Relation (database), Computer science, Distributed computing, Data type, Hardware Architecture (cs.AR), General Materials Science, Computer Science - Hardware Architecture, Profiling (computer programming), Computer Science - Performance, General Engineering, data movement, memory systems, TK1-9971, Performance (cs.PF), Benchmarking, near-data processing, Memory management, Computer Science - Distributed, Parallel, and Cluster Computing, Scalability, Benchmark (computing), Cache, Distributed, Parallel, and Cluster Computing (cs.DC), Electrical engineering. Electronics. Nuclear engineering, performance, energy
Abstract: Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Prior NDP works investigate the root causes of data movement bottlenecks using different profiling methodologies and tools. However, there is still a lack of understanding about the key metrics that can identify different data movement bottlenecks and their relation to traditional and emerging data movement mitigation mechanisms. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques (e.g., caching and prefetching) to more memory-centric techniques (e.g., NDP), thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV., IEEE Access, 9, ISSN:2169-3536
Published: 2021
