Descriptor: "computer architecture" / Database: OpenDissertations - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"computer architecture"' showing total 192 results

1. Architecting Efficient, Large-Scale AI: An Algorithm-System Co-Design Approach

Author: Hsia, Samuel Cheng-Yuan
Subjects: Computer Architecture, Generative AI, Hardware Accelerators, Hardware-Software Co-Design, Machine Learning, Recommender Systems, Computer science, Computer engineering, Electrical engineering
Abstract: Driven by significant advancements in algorithmic techniques and the emergence of new multimodal generative applications, deep learning has entered the era of "large-scale AI". As leading models dramatically increase in size and complexity, the hardware and software requirements also become significantly more demanding. If efficient solutions are not developed in a timely manner, model exploration will grind to a halt and at-scale serving will be infeasible. End-to-end co-design solutions must address three key themes: the unique technical challenges posed by the large-scale nature of these models, the distinct requirements for training versus inference, and the critical need for efficiency. This dissertation presents three case studies for navigating the complexities of large-scale AI. The first case involves a cross-stack characterization of large-scale models, identifying performance bottlenecks and potential avenues for optimization across different system layers. The second case study explores redesigning embedding-centric models through data and hardware-aware observations, aiming for substantial improvements from novel embedding representations. The third case study develops tools that help researchers gain better insights into mapping of increasingly complex models onto physical infrastructures, addressing the logistical and operational challenges of deploying large-scale AI systems in data centers. Looking forward, the dissertation identifies areas for future research, including co-design strategies tailored for embedding-driven, multimodal AI models and the role of reliability versus resiliency in data center-scale training environments. Collectively, this work contributes to the foundational understanding and practical advancement of large-scale AI technology, setting a course for future innovations in the field.
Published: 2024

2. Digital Beamforming Implemented in Hardware

Author: Dulieu, Nicole
Subjects: phased arrays, array signal processing, beamforming, digital beamforming, signal processing algorithms, computer architecture
Abstract: Digital beamforming is a popular method used in modern communication systems. The ability to track and locate a transmitting signal adaptively is necessary in communication systems. Beamforming is one solution to this problem. Beamforming uses an array or matrix of isotropic antenna elements. This eliminates the need to create a physically larger antenna to achieve the same radiation pattern and gain as a phased array of antenna elements. Additionally, the antennas are electronically controlled, allowing the radiation pattern and gain to adapt quickly. It is necessary to use a digital platform for beamforming because hardware can digitize analog signals efficiently. The research done in this paper starts with a system created in MATLAB’s Simulink environment. This system has both software and hardware beamforming algorithms. The results from the two algorithms are verified in waveforms. The hardware beamforming system is converted to hardware description language (HDL) using Simulink’s HDL Coder application. The HDL files are used in a simulation environment using Cadence Incisive simulator to verify the beamforming results. The digital beamforming HDL project is implemented on different application specific integrated circuit (ASIC) technologies. Using Synopsys compiler suites, the project is synthesized, placed and routed on the ASIC technologies. This paper will analyze the results from implementing a beamforming algorithm on digital hardware.
Published: 2024

3. Safety-Aware System Optimization for Autonomous Machines

Author: Hsiao, Yu-Shun
Subjects: Autonomous Systems, Computer Architecture, Robotics, Artificial intelligence, Computer science
Abstract: Autonomous machines such as vehicles, drones, and robotic manipulators promise to transform the world by unleashing humans from repetitive, dangerous, and labor-intensive tasks. However, their widespread deployment requires advances in safety, real-time performance, and resilience. This thesis tackles these key challenges through end-to-end system optimization that maintains safety while improving performance and fault tolerance. We optimize the entire Perception-Planning-Control (PPC) computing pipeline that takes in sensor readings and outputs control commands. First, this thesis develops a model to quantify the perception processing rate requirements for safe autonomous driving in complex scenarios, connecting the varying real-time latency requirements with the operating scenarios. Second, we accelerate the time-consuming 3D mapping in perception. A specialized accelerator is designed that achieves substantially higher throughput and energy efficiency over a CPU, enabling real-time perception for 3D mapping. Third, we accelerate the computationally expensive optimization-based motion planning algorithms with a variable precision search method that reduces memory bandwidth pressure without sacrificing positional and orientational precision. Lastly, we assess autonomous machines' fault tolerance characteristics against real-world noises and errors to generate reliable control commands. We propose a fault characterization framework that evaluates the impact of silent data corruptions (SDCs) on application-level metrics. To mitigate SDCs, we propose lightweight anomaly detection techniques to recover from failures in the computing pipeline with insignificant overhead. This dissertation enables the development of safe, real-time, and resilient autonomous machines. The contributions chart a path toward robust deployment of autonomous machines that can transform society.
Published: 2024

4. Prototyping Hardware-compressed Memory for Multi-tenant Systems

Author: Liu, Yuqing
Subjects: Computer Architecture, Memory Controller, DRAM, Compression, Cloud
Abstract: Software memory compression has been a common practice among operating systems. Since then, prior works have explored hardware memory compression to reduce the load on the CPU by offloading memory compression to hardware. However, prior works on hardware memory compression cannot provide critical isolation in multi-tenant systems like cloud servers. Our evaluation of prior work (TMCC) shows that a tenant can be slowed down by more than 12x due to the lack of isolation. This work, Compressed Memory Management Unit (CMMU), prototypes hardware compression for multi-tenant systems. CMMU provides critical isolation for multi-tenant systems. First, CMMU allows the OS to control individual tenants' usage of physical memory. Second, CMMU compresses a tenant's memory to an OS-specified physical usage target. Finally, CMMU notifies the OS to start swapping the memory to the storage if it fails to compress the memory to the target. We prototype CMMU with a real compression module on an FPGA board. CMMU runs with a Linux kernel modified to support CMMU. The prototype virtually expands the memory capacity to 4X. CMMU stably supports the modified Linux kernel with multiple tenants and applications. While achieving this, CMMU only requires several extra cycles of overhead besides the essential data structure accesses. ASIC synthesis results show CMMU fits within 0.00931 mm² of silicon and operates at 3 GHz while consuming 36.90 mW of power. It is a negligible cost to modern server systems.
Published: 2023

5. Understanding and Benchmarking CPU and GPU-based spatial databases

Author: Parvanov, Martin
Subjects: database, Geospatial, computer architecture
Abstract: The spatial join is one of the most expensive operations in modern spatial database systems. Throughout the years a number of indices, join algorithms, and different approaches have been proposed, yet no clear optimal strategy for performing the join has been found. Recent research has been heavily focused on introducing a number of system level optimisations and utilizing hardware acceleration to speed up the spatial join. In this work we set out to better understand the spatial join landscape, in order to best support current research into the spatial join carried out at the ETH Systems Group. We benchmark a number of existing open source CPU and GPU-based systems, diving into design decisions and tradeoffs. We also implement our own single- and multi-threaded versions of two well-known spatial algorithms that have been shown to be the best performing in their class, namely R-tree Synchronous Traversal and Partition Based Sweep Merge. We then benchmark all these systems on datasets consisting of millions of entries. These datasets include both synthetic and real-world data. Eventually, we find that our multi-threaded Synchronous Traversal implementation is the best performing algorithm.
Published: 2023

6. LANGUAGE–BASED TECHNIQUES FOR BUILDING TIMING CHANNEL SECURE HARDWARE–SOFTWARE SYSTEMS

Author: Zagieboylo, Drew
Subjects: Computer Architecture, Computer Security, Information Flow Control, Programming Languages
Abstract: We rely on a deep stack of abstractions to efficiently build software applications without having to completely understand the nuance of language runtimes, operating systems, and processor architectures. Each layer in the stack relies on the guarantees of the layer below, with all software relying on the functionality provided by the hardware on which it executes. Similarly, when we build secure software, we define security in terms of high level application policies and rely on a stack of abstractions to enforce those policies. Therefore, all of software security relies on the guarantees provided by processor hardware. However, those guarantees offer less protection than we have traditionally assumed, and real processor implementations routinely exhibit vulnerabilities that undermine traditional assumptions about hardware behavior. Modern processors incorporate a host of optimizations to execute software as quickly and efficiently as possible; unfortunately, these optimizations are at the root of some serious security weaknesses. In particular, researchers have recently discovered easily exploitable timing-channel vulnerabilities that arise due to processor speculation, like Spectre, Meltdown, and the many variants that have since been uncovered. Concerningly, these vulnerabilities are not the result of cutting-edge, untested optimizations; they are fundamental to the designs of almost all processors in the last 20 years. The existence of these vulnerabilities highlights the need for a well-defined contract between software and hardware that does not allow the hardware to leak software’s secrets arbitrarily, especially via timing channels. Furthermore, we need tools to enable the construction and verification of secure processors that adhere to these new contracts. As functional processor correctness is already a difficult verification problem, we likely need new approaches to prove processor security.
This dissertation addresses the above concerns by applying Information Flow Control (IFC) to both the hardware–software interface and to Hardware Description Languages (HDL) themselves. By using IFC as the de facto language of security, we can define a hardware–software contract capable of providing timing-channel security without exposing extraneous details about processor internals. Intuitively, using IFC as a tool to then build processors also enables proving that real processor implementations refine this IFC contract. This dissertation also addresses the problem of constructing correct processors by introducing a high-level HDL that targets the design of efficient processor pipelines. By raising the abstraction of hardware design, we can more easily connect the implementation’s semantics to the hardware–software contract. We can also reason statically about complex optimizations such as speculation by providing abstractions that generate correct circuitry by construction. We hope that future processors and interfaces are designed with timing-channel security in mind, and that these new abstractions will percolate back up the software stack to make timing-channel security available and efficient for all applications.
Published: 2023

7. Kokkos-enhanced ExaMPI

Author: Suggs, Evan Drake
Subjects: High performance computing--Research, Computer architecture, C++ (Computer program language), Application program interfaces (Computer software)
Abstract: Kokkos provides in-memory advanced data structures, concurrency, and algorithms to support advanced C++ parallel programming. MPI provides the most widely used message passing model for inter-node communication. Many programmers use both Kokkos and the Message Passing Interface (MPI) together. In this thesis, Kokkos is integrated within an MPI implementation to obtain performance and functionality benefits both for the MPI itself, and for applications that use both Kokkos+MPI. For instance, it will be possible in this model to pass first-class Kokkos objects directly to extended C++-based MPI APIs. In particular, efforts to achieve this type of integrated model are expressed using ExaMPI, a C++17-based subset implementation of MPI-4 developed at UTC with collaborators. Working with C++-friendly APIs, and Kokkos extensions, examples of the benefits of functionality and performance are shown. We explain why direct use of Kokkos within certain parts of the MPI implementation is crucial to getting added performance in addition to expressivity. We also motivate why making Kokkos memory spaces visible to the MPI implementation generalizes the idea of “CPU memory” and “GPU memory” in ways that provide for further optimizations in heterogeneous Exascale architectures. Besides showing the current state of the prototype, we describe future goals, and show how these mesh both with a possible future C++ API for MPI-5 as well as the potential for accelerating MPI on architectures that incorporate accelerators.
Published: 2023

8. Disaggregated Heterogeneous System for Retrieval-Augmented Language Models

Author: Zeller, Marco
Subjects: COMPUTER SYSTEMS, natural language processing, FPGA, computer architecture
Abstract: Retrieval-augmented language models (RALMs) present an evolving frontier in natural language processing, combining the capabilities of conventional large language models (LLMs) with vector-based content retrieval from extensive knowledge databases. Efficiently deploying RALMs remains challenging due to the distinct workload characteristics of RALM components and the vast size of knowledge databases. This thesis discusses the deployment of RALMs in disaggregated heterogeneous architectures, juxtaposing benefits with inherent complexities. We propose and prototype a disaggregated system, comprising a CPU-based retrieval coordinator, a GPU-based language model, and an FPGA-based retrieval system. For the transformer inference we lean on Fairseq and employ FAISS to implement a vector search system which we use as a baseline model. Comparing this baseline system against our newly implemented FPGA-based retrieval system, we unequivocally illustrate that the FPGA-centric system outperforms the baseline in latency and throughput in combination with several language models. Furthermore, for all evaluated workloads, it performs at a minimum on par with the baseline. Through meticulous accelerator ratio analyses, we discern the optimal configurations for various model sizes and retrieval intervals, revealing a dynamic interplay between FPGAs and GPUs. Disaggregated architectures, unconstrained by traditional limitations such as PCIe bandwidth, offer substantial flexibility, allowing FPGA-retrieval systems to cater to multiple GPU servers. Furthermore, these architectures provide unparalleled adaptability to evolving workloads. In conclusion, while disaggregation introduces certain complexities, it emerges as imperative for achieving optimal RALM performance when combined with FPGA-based retrieval systems.
Published: 2023

9. Advancing Synthesizable Verilog/SystemVerilog Education with Open-Source Tools and Autograders

Author: Sifferman, Ethan Joseph
Subjects: Computer engineering, Computer Aided Design, Computer Architecture, Education, Open-source, SystemVerilog, Very Large Scale Integration
Abstract: In the rapidly expanding semiconductor industry, there is an increasing demand for skilled chip developers. Yet, the steep learning curve associated with Hardware Description Languages (HDLs) often acts as a significant barrier for students hoping to pursue a career in digital design. Drawing upon my experience as an HDL educator, which includes teaching Verilog to UCSB's IEEE student chapter and serving as a Teaching Assistant for UCSB's Verilog courses, I have meticulously developed and refined a comprehensive set of methods and resources for Verilog education. My objective encompassed two key facets: equipping students with quality industry-preparation and kindling passion for exploring hardware design. Through a strategic blend of approaches consisting of the integration of accessible open-source tools, the enforcement of popular coding style guides, the implementation of autograders for personalized feedback, and the incorporation of open-source IP blocks into lessons, students can attain proficiency in designing RTL (Register Transfer Level) for rigorously verified hardware systems. These strategies help reduce Verilog's steep learning curve while also expediting the introduction of more advanced topics in digital design and computer architecture. The methods and resources detailed in this thesis will prepare students for the expectations of the semiconductor industry, enhance their coding skills, and promote an accessible and engaging learning environment, ultimately meeting the growing demand for chip developers.
Published: 2023

10. Self-aware Memory Management for Emerging Architectures

Author: Maity, Biswadip
Subjects: Computer science, Autonomous Vehicles, Computer Architecture, Datacenters, Memory Systems, Self-Awareness
Abstract: The ever-increasing demands of data-intensive applications and the rapid evolution of computer architectures have posed significant challenges in memory performance and energy efficiency. Efficient memory management is crucial to meet the requirements of these applications while optimizing the utilization of memory resources. Traditional approaches that rely on workload-specific optimizations and static memory configurations are no longer sufficient to address the dynamic nature of modern computing systems. To overcome these challenges, the concept of computational self-awareness (CSA) has emerged as a promising approach. Computational self-awareness draws inspiration from psychology and neuroscience and aims to develop intelligent systems that can learn from past experiences, reason about their current state, and make informed decisions at runtime. In this thesis, I explore the application of computational self-awareness in the context of memory management. I investigate the different degrees of self-awareness applied across the memory subsystem and examine their benefits on memory performance and energy consumption. The results highlight the potential of computational self-awareness in addressing the challenges posed by data-intensive applications and evolving computer architectures, paving the way for improved performance, energy efficiency, and bandwidth utilization in memory systems.
Published: 2023

11. FPGA-based range-limited molecular dynamics acceleration

Author: Wu, Chunshu
Subjects: Computer engineering, Computer architecture, FPGA, Molecular dynamics
Abstract: Molecular Dynamics (MD) is a computer simulation technique that executes iteratively over discrete, infinitesimal time intervals. It has been a widely utilized application in the fields of material sciences and computer-aided drug design for many years, serving as a crucial benchmark in high-performance computing (HPC). Numerous MD packages have been developed and effectively accelerated using GPUs. However, as the limits of Moore's Law are reached, the performance of an individual computing node has reached its bottleneck, while the performance of multiple nodes is primarily hindered by scalability issues, particularly when dealing with small datasets. In this thesis, the acceleration with respect to small datasets is the main focus. With the recent COVID-19 pandemic, drug discovery has gained significant attention, and MD has emerged as a crucial tool in this process. Particularly, in the critical domain of drug discovery, small simulations involving approximately 50K particles are frequently employed. However, it is important to note that small simulations do not necessarily translate to faster results, as long-term simulations comprising billions of MD iterations and more are essential in this context. In addition to dataset size, the problem of interest is further constrained. Referred to as the most computationally demanding aspect of MD, the evaluation of range-limited (RL) forces not only accounts for 90% of the MD computation workload but also involves irregular mapping patterns of 3-D data onto 2-D processor networks. To emphasize, this thesis centers around the acceleration of RL MD specifically for small datasets. In order to address the single-node bottleneck and multi-node scaling challenges, the thesis is organized into two progressive stages of investigation.
The first stage delves extensively into enhancing single-node efficiency by examining various factors such as workload mapping from 3-D to 2-D, data routing, and data locality. The second stage focuses on studying multi-node scalability, with a particular emphasis on strong scaling, bandwidth demands, and the synchronization mechanisms between nodes. Through our study, the results show our design on a Xilinx U280 FPGA achieves 51.72x and 4.17x speedups with respect to an Intel Xeon Gold 6226R CPU and a Quadro RTX 8000 GPU, respectively. Our research towards strong scaling also demonstrates that 8 Xilinx U280 FPGAs connected to a switch achieve a 4.67x speedup compared to an Nvidia V100 GPU.
Published: 2023

12. Temperature-aware 3D-integrated systolic array DNN accelerators

Author: Shukla, Prachi
Subjects: Computer engineering, Computer architecture, Deep neural networks, Die-stacking, Monolithic 3D, Systolic arrays, Thermal awareness
Abstract: Deep neural networks (DNNs) are extensively used for inference in a wide range of emerging mobile and edge application domains, including autonomous vehicles, drones, augmented and virtual reality (AR/VR), etc. Due to the increasing popularity of these applications, there has been an increasing demand for mobile/edge DNN accelerators to achieve low inference latency and high efficiency. Furthermore, these mobile/edge applications also need to execute multi-DNN workloads, where multiple independent DNNs execute subtasks to complete one large task. This thesis aims to optimize the efficiency of systolic arrays for DNN acceleration because they are among the most popular architectures for DNN inference in mobile/edge systems due to their straightforward design and dataflow. Systolic arrays provide several degrees of freedom to co-optimize performance, power, area, and temperature–namely, die/chiplet architecture (number of processing elements, on-chip memory capacity and its architecture), quantity, placement, and dataflow. While recent works have focused on 2D DNN systolic arrays, 2D scaling has been saturating and, thus, improving the performance and power characteristics of computing systems is becoming increasingly challenging. To overcome traditional scaling bottlenecks, 3D integration has emerged as a promising integration technology. 3D technology provides several benefits over 2D systems such as high integration density, high bandwidth, high energy efficiency, and footprint savings. This thesis focuses on two 3D integration technologies: (i) die-stacked 3D (TSV3D), and (ii) monolithic 3D (MONO3D). Both of these 3D technologies provide significant performance and power benefits over 2D systems and thus, are potent technologies for energy efficient design of systolic arrays for DNNs. 
However, the dense integration in 3D causes high power densities and inter-tier thermal coupling, further escalating thermal issues and resulting in hot spots across tiers. Furthermore, mobile/edge devices have tight area, power, and thermal constraints due to the absence of heat sinks and fans. Thus, temperature is a critical design concern in 3D DNN accelerators for mobile/edge devices. This thesis states that to glean the benefits of 3D technology in mobile/edge devices to improve energy efficiency and satisfy performance and power constraints, it is imperative to design thermally-aware 3D systolic arrays for DNNs. To realize this statement, this thesis makes the following contributions: (i) it designs a thermally-aware optimization flow to select a near-optimal MONO3D DNN systolic array for a given DNN and an optimization goal under a performance constraint. The optimizer is facilitated by circuit and architecture-level cross-layer performance/power models that are developed as part of this thesis. (ii) It introduces thermal awareness in tuning a given TSV3D systolic array chiplet architecture and the chiplet’s placement in a multi-chip module (MCM) executing a multi-DNN workload to balance both cost and power of the MCM, while satisfying latency, area, power, thermal packaging, and workload constraints. (iii) It optimizes a dataflow implementation by utilizing the massive bandwidth available in MONO3D systolic arrays with a dense on-chip resistive RAM to improve energy efficiency while satisfying the thermal and performance constraints. Results demonstrate 81% improvement in inference per second per watt over 2D systolic arrays due to high-density and high-bandwidth resistive RAM interface using monolithic inter-tier vias (MIVs). We also demonstrate up to 44% MCM cost savings and 63% DRAM power savings over temperature-unaware optimization at iso-frequency and iso-MCM area for TSV3D MCMs. 
In addition, we show that optimization without thermal awareness leads to over-estimation of efficiency gains and thermal violations in both MONO3D and TSV3D systolic arrays.
Published: 2023

13. Mitigating Microarchitectural Vulnerabilities to Improve Cloud Security and Reliability

Author: Loughlin, Kevin
Subjects: hardware security, systems security, operating systems, computer microarchitecture, hardware-software co-design, computer architecture
Abstract: Cloud providers must isolate each execution context (e.g., a virtual machine, or VM) atop shared hardware. Unfortunately, commodity hardware only strongly enforces context isolation at the architectural level, failing to enforce isolation in the microarchitectural implementation of hardware. The lack of microarchitectural isolation yields a wide range of threats to system security and reliability, including denial-of-service, data loss, data leakage, and even system subversion. Accordingly, this dissertation presents mitigations for two of the most prominent classes of modern microarchitectural vulnerabilities: transient execution attacks on CPUs, which allow arbitrary data to be leaked from processors via mis-speculation and timing side channels, and Rowhammer, which corrupts and potentially leaks data in DRAM via memory access patterns that produce silicon-level disturbance effects. In particular, DOLMA provides the first hardware mitigation against all demonstrated transient execution attacks at the time of publication. Stop! Hammer Time presents hardware primitives upon which scalable and flexible software defenses can be built across the taxonomy of Rowhammer mitigations. MOESI-prime introduces coherence-induced hammering, the first form of hammering shown to occur in non-malicious code, and provides a corresponding coherence protocol-based mitigation. Finally, Siloz isolates different VMs to private DRAM subarray groups (across which Rowhammer attacks are ineffective), thereby preventing inter-VM Rowhammer bit flips.
Published: 2023

14. Achieving Security and Privacy via Encrypted Architectures

Author: Biernacki, Lauren
Subjects: Hardware Security, Computer Architecture, Control-Flow Attacks, Encrypted Architecture, Data Privacy, Data-Oblivious Programming
Abstract: There are increasing incidences of high-profile data breaches and clever new attacks that exploit weaknesses throughout the software stack, with recent attacks moving into the hardware layer (e.g., Spectre [16] and Meltdown [17]). Yet, the security landscape consists not only of these novel exploits but also of exploits we have known about for decades that leverage vulnerabilities equally as pervasive. Despite the prevalence of these known vulnerabilities and significant efforts to defend against them, exploits remain widespread. Understanding the landscape of security attacks can aid in the design of more durable defenses. Security attacks take on a similar structure, despite their diverse forms. Attackers leverage one or more vulnerabilities and system information assets to synthesize their exploit. Looking across attacks, we view these components as forming an inverted pyramid, with a small group of information assets leveraged alongside a larger group of vulnerabilities to commit an even larger number of exploits. Ultimately, classes of information assets that are instrumental for attacks (e.g., pointers, data layout, cache organization) are fewer in number than vulnerabilities. Thus, by applying protections to a few critical pieces of information, defenses can potentially achieve broad coverage against security exploits. Hardware provides an advantageous place to situate information asset protections as they can be applied systematically, regardless of program-level semantics. Namely, low-level hardware implementations isolate critical information assets from higher-level layers of the stack, providing broad coverage against exploits. Architectural approaches enable us to design vulnerability-agnostic systems, as any software running atop the architecture, including programs that contain vulnerabilities, is insulated from attack.
Further, hardware-based approaches can often be optimized to enable more efficient implementations, reducing runtime overheads that degrade system performance. Based on these insights, this dissertation explores how encrypted architectures—processors that encrypt information domains (e.g., memory addresses, instructions, or data) directly in hardware—can provide comprehensive security and privacy guarantees. Our research has evolved from using encryption minimally in an ensemble of moving target defenses to comprehensively applying encryption with small but powerful architectural extensions. The first half of this dissertation studies the protection of code and pointers to thwart control-flow attacks, following the evolution of the Morpheus secure architecture. The second half of this dissertation discusses how encrypted architectures can comprehensively protect sensitive data and be safely optimized. Vulnerability-tolerant design undercuts all our proposed defenses, as we aim to protect systems in the presence of pervasive software vulnerabilities. Further, we work toward employing strong encryption and building side-channel resilience while maintaining reasonable performance overheads. Our work demonstrates that architectural approaches can emerge as dynamic, expressive, and performant security and privacy solutions.
Published: 2023

15. Near-Memory Computing Architectures and Circuits for Ultra-Low Power Near-Sensor Processors

Author: Eggimann, Manuel
Subjects: near-memory computing, computer architecture, digital circuits, hyper dimensional computing, vector symbolic architectures (VSAs), hardware acceleration, Artificial Intelligence, memory hierarchy, Wake-up circuit, Electric engineering, Data processing, computer science
Abstract: Artificial intelligence has started to permeate the entire technological fabric of our interconnected world and has long found its way to the network edge. Be it as part of novel biomedical devices that constantly monitor patient health, smart sensors in industrial settings where they drive the transformation from reactive- to pro-active maintenance in industry 4.0 or integrated into the next evolution of human-machine interfaces like XR, ML-enacting near-sensor circuits are omnipresent. The tight power and latency constraints of these applications fuel a paradigm shift in the hardware architecture domain, away from conventional von-Neumann-based computing, which is bound by the memory bandwidth bottleneck of the data- and control path. The "new golden age of computer architecture", as the famous computer pioneer David Patterson calls it, is marked by innovative Near- and In-memory architectures around conventional CMOS as well as novel "beyond-CMOS" technologies like PCM or ReRAM. However, many of these techniques are explored with an isolated, device-level-focused view, whereas advances at the system-level demand a holistic multi-objective optimization approach that involves hardware-software co-design and the exploration of new computing paradigms. This thesis investigates energy-efficient digital hardware architectures at both the circuit- and the system level and develops adequate strategies to enable energy-proportionality for general-purpose near-sensor analytics, i.e. the proportionality of energy consumption to vastly varying dynamic changes in workload compute intensity. We follow a multi-stage architectural approach where highly energy-efficient circuits based on the compute framework of Vector-Symbolic Architectures make up the first, always-on stage of our architecture "stack". 
In the second part of this thesis, we shift focus to the next, more computationally performant stage around heterogeneous compute cluster architectures, with an emphasis on the memory hierarchy. Here we propose a novel architectural design pattern to tightly couple NVM to the hardware accelerator. Both aspects are demonstrated and evaluated in several silicon realizations in 65nm bulk, 22nm FDSOI and 16nm FinFET technology.
Published: 2023

16. Efficient Machine Learning Acceleration at the Edge

Author: Romaszkan, Wojciech
Subjects: Computer engineering, Electrical engineering, Computer architecture, Domain-specific acceleration, Machine Learning, Stochastic Computing
Abstract: My thesis is a result of a confluence of several trends that have emerged in recent years. First, the rapid proliferation of deep learning across the application and hardware landscapes is creating an immense demand for computing power. Second, the waning of Moore's Law is paving the way for domain-specific acceleration as a means of delivering performance improvements. Third, deep learning's inherent error tolerance is reviving long-forgotten approximate computing paradigms. Fourth, latency, energy, and privacy considerations are increasingly pushing deep learning towards edge inference, with its stringent deployment constraints. All of the above have created a unique, once-in-a-generation opportunity for accelerated widespread adoption of new classes of hardware and algorithms, provided they can deliver fast, efficient, and accurate deep learning inference within a tight area and energy envelope. One approach towards efficient machine learning acceleration that I have explored attempts to push a neural network model size to its absolute minimum. 3PXNet - Pruned, Permuted, Packed XNOR Networks combines two widely used model compression techniques: binarization and sparsity to deliver usable models with a size down to single kilobytes. It uses an innovative combination of weight permutation and packing to create structured sparsity that can be implemented efficiently in both software and hardware. 3PXNet has been deployed as an open-source library targeting microcontroller-class devices with various software optimizations, further improving runtime and storage requirements. The second line of work I have pursued is the application of stochastic computing (SC). It is an approximate, stream-based computing paradigm enabling extremely area-efficient implementations of basic arithmetic operations such as multiplication and addition. SC has been enjoying a renaissance over the past few years due to its unique synergy with deep learning. 
On the one hand, SC makes it possible to implement extremely dense multiply-accumulate (MAC) computational fabric well suited to computing large linear algebra kernels, which are the bread-and-butter of deep neural networks. On the other hand, those neural networks exhibit immense approximation tolerance levels, making SC a viable implementation candidate. However, several issues need to be solved to make the SC acceleration of neural networks feasible. The area efficiency comes at the cost of long stream processing latency. The conversion cost between fixed-point and stochastic representations can cancel out the gains from computation efficiency if not managed correctly. The above issues lead to the question of how to design an accelerator architecture that best takes advantage of SC's benefits and minimizes its shortcomings. To address this, I proposed the ACOUSTIC (Accelerating Convolutional Neural Networks through Or-Unipolar Skipped Stochastic Computing) architecture and its extension - GEO (Generation and Execution Optimized Stochastic Computing Accelerator for Neural Networks). ACOUSTIC is an architecture that tries to maximize SC's compute density to amortize conversion costs and memory accesses, delivering system-level reduction in inference energy and latency. It has been taped out and demonstrated in silicon, using a 14nm fabrication process. GEO addresses some of the shortcomings of ACOUSTIC. Through the introduction of near-memory computation fabric, GEO enables a more flexible selection of dataflows. A novel progressive buffering scheme unique to SC lowers the reliance on high memory bandwidth. Overall, my work tries to approach accelerator design from the systems perspective, making it stand apart from most recent SC publications targeting point improvements in the computation itself. As an extension to the above line of work, I have explored the combination of SC and sparsity, to apply it to new classes of applications and enable further benefits. 
I have proposed the first SC accelerator that supports weight sparsity - SASCHA (Sparsity-Aware Stochastic Computing Hardware Architecture for Neural Network Acceleration), which can improve performance on pruned neural networks, while maintaining the throughput when processing dense ones. SASCHA solves a series of unique, non-trivial challenges of combining SC with sparsity. On the other hand, I have also designed an architecture for accelerating event-based camera object tracking - SCIMITAR. Event-based cameras are relatively new imaging devices which only transmit information about pixels that have changed in brightness, resulting in very high input sparsity. SCIMITAR combines SC with computing-in-memory (CIM), and, through a series of architectural optimizations, is able to take advantage of this new data format to deliver low-latency object detection for tracking applications.
Published: 2023
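
Illustrative note on entry 16: the abstract describes stochastic computing (SC) as a stream-based paradigm in which multiplication becomes very cheap, at the cost of long stream latency. The sketch below shows the textbook unipolar form of that idea, where a value in [0, 1] is encoded as a random bitstream and a product is taken with a bitwise AND. It is a generic illustration only, not the ACOUSTIC, GEO, or SASCHA design; the function names (`encode`, `decode`, `sc_multiply`, `sc_product`) and the use of Python's `random` module as the stream generator are assumptions made for this example.

```python
import random

def encode(value, length, rng):
    """Unipolar encoding: each bit is 1 with probability `value` (0 <= value <= 1)."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def decode(stream):
    """Recover the encoded value as the fraction of 1 bits in the stream."""
    return sum(stream) / len(stream)

def sc_multiply(a, b):
    """Multiply two independent unipolar streams with a bitwise AND per position."""
    return [x & y for x, y in zip(a, b)]

def sc_product(x, y, length, seed=0):
    """Encode x and y, multiply in the stochastic domain, and decode the result.

    Hypothetical helper for this example: both streams are drawn from one
    seeded generator, so they are (pseudo-)independent of each other.
    """
    rng = random.Random(seed)
    return decode(sc_multiply(encode(x, length, rng), encode(y, length, rng)))

if __name__ == "__main__":
    # Longer streams give a closer approximation of 0.5 * 0.4 = 0.2,
    # which mirrors the area-versus-latency tradeoff named in the abstract.
    for n in (16, 256, 20000):
        print(n, sc_product(0.5, 0.4, n))
```

The approximation error shrinks roughly with the square root of the stream length, which is why the abstract treats stream latency and fixed-point/stochastic conversion cost as the central design problems.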

17. Hardware implementation and analysis of memory interfaces to integrate a vector accelerator into a manycore Network-on-Chip

Author: Lu, Ci Chian
Subjects: Computer engineering, Computer science, Electrical engineering, Computer Architecture, Network-on-Chip, System-on-Chip, Vector Processor
Abstract: In recent years, there has been a growing demand for vector processors due to their increasing use in deep-learning applications. On the other hand, with the strong need for energy efficiency and high performance, heterogeneous architecture plays an important role and becomes increasingly complex. However, the way the memory hierarchy is connected to the vector processor in an SoC (System-on-Chip) is critical to the system’s performance [8]. This work presents a tile design based on OpenPiton and BYOC [4] [3]. The tile consists of a 64-bit, single-issue, in-order RISC-V core, Ariane [14], along with a 64-bit vector processor, ARA [7] [13], which implements the RISC-V V extension version 1.0. This work makes the following contributions. First, it involves the design and implementation of an adapter (bridge) that converts memory requests from AMBA AXI to the OpenPiton NoC. This adapter enables ARA memory access functionality and facilitates the integration of future accelerators into OpenPiton. Second, a tile design is presented, which includes ARA (a RISC-V vector processor), Ariane (a RISC-V core), an L1.5 cache, an L2 cache, and the implemented bridge. The performance of the tile is evaluated using different versions of the bridge connected to the last-level cache (LLC) or off-chip memory. The analysis indicates that a wide data width bridge does not necessarily improve performance significantly. Several factors, such as NoC traffic conflicts or unused data fetches, can narrow the performance gap between small and large width bridges. Furthermore, the experiments demonstrate that memory exhibits advantages when dealing with large data widths, and memory saturation also occurs during LLC access. Finally, the thesis proposes the implementation of MSHRs (Miss Status Handling Registers) and extends this design to manycore architectures to enhance performance.
Published: 2023

18. Developing, Synthesizing, and Automating Domain-Specific Accelerator

Author: Weng, Jian
Subjects: Computer science, Compiler, Computer Architecture, Design Automation
Abstract: The once-exponential speedup growth of general-purpose processors (e.g., CPUs), driven by transistor scaling, is fading, which urges both industry and academia to find more energy-efficient and performant architecture organizations. Therefore, research on accelerators specialized for applications of interest has emerged because of their promising speedup and energy savings while retaining flexibility. To design and implement specialized accelerators, intensive human effort is required to study the target applications and determine tradeoffs between performance and cost. In addition, newly proposed hardware often implies lagging compilation techniques, which hinders programming productivity. All these facts significantly limit programmable accelerator adoption. Moreover, prior development effort can hardly be reused in other applicable domains, because current software/hardware co-designed innovations seldom consider modularity for future integration. Therefore, the research projects presented in this dissertation aim at significantly reforming the full-stack reconfigurable accelerator design paradigm: Ideally, each software/hardware co-design feature can be included in a universal design space for further accelerator composition so that people no longer build accelerators from scratch. Further, an accelerator can be automatically generated based on the given applications of interest written in a unified high-level programming interface. To achieve this goal, this dissertation develops the framework, DSAGEN, including an accelerator design space with rich software/hardware co-design features, a compiler that targets accelerators with arbitrary design points within this space, and a design automation algorithm that efficiently searches this space. According to our evaluation, the compiler can robustly target multiple application suites on hardware with arbitrary feature combinations. 
The framework-generated accelerators can achieve perf/mm² comparable to prior handcrafted domain-specific accelerators. In addition, to demonstrate the wide applicability of our approach, the insights and principles learned in pursuing this goal are also applied to related research questions: By deploying the DSAGEN-generated accelerator as a reconfigurable overlay on FPGA, it saves orders-of-magnitude time on compilation and reconfiguration compared with conventional high-level synthesis, while retaining flexibility. This approach suggests that a deeply specialized programmable overlay accelerator can potentially supplement the existing FPGA’s high-level programming paradigm. Also, the compilation techniques for spatial architectures developed in DSAGEN can be applied to compiling an emerging instruction paradigm specialized for tensor operations: a productive and extensible compilation framework, UNIT, is presented for these instructions. The extensibility of this framework allows developers to easily integrate new instructions by describing the instruction semantics. High-performance code for end-to-end inference, which outperforms vendor-provided libraries by up to 2.2×, can be generated by tensorized rewriting, accompanied by our automated tuning strategies.
Published: 2023

19. Formal Specification and Verification of Secure Information Flow for Hardware Platforms

Author: Cheang, Kevin
Subjects: Computer science, Computer architecture, Formal methods, Hardware platforms, Non-interference, Secure information flow, Security
Abstract: Hardware platforms, such as microprocessors and Trusted Execution Environments (TEEs), aim to provide strong memory isolation properties. However, in recent years, this has been shown not to be the case through hardware attacks such as the class of transient execution attacks. These attacks affect programs executing on widely-used microprocessor designs in our present-day devices. Although mitigations have been proposed, many have not been adopted and lack formal guarantees. As a result, security-critical applications have been conservative in using hardware platforms without some form of cryptographic approach for secure computation, despite the additional computational overhead. One approach to ensure safety for this class of attacks is to use formal methods to prove information flow properties. Yet, there is limited work in verifying attacks on hardware platforms that are heterogeneous in nature, namely those that contain hardware and software in the trusted computing base. This thesis defines a notion of secure information flow for hardware platforms and proposes methods to formally verify non-interference-based properties efficiently using abstractions and composition. To accomplish the former, we formalize the trace property-dependent observational determinism property for capturing a new class of non-interference properties. This property is motivated by verifying transient execution attacks and the need for secure speculation. To enable efficient verification on hardware platforms, we introduce an efficient proof system, SymboTaint, and the formalism of information flow state machines to reason about secure information flow compositionally. Finally, we explore a complementary method to enforce secure information flow for general programs by relaxing the programming model of a family of TEE designs and by formally verifying them. 
This direction builds on top of existing abstractions of TEEs to provide memory isolation guarantees with an efficient memory-sharing scheme on TEEs through combined design and verification. Together, this provides a methodology for enforcing memory isolation for heterogeneous systems, where joint modeling and analysis of hardware and software have become imperative for security.
Published: 2023

20. Extracting Reusable Primitives of Key-Value Operations and Efficient Architecture Support

Author: Wang, Bangyan
Subjects: Computer engineering, Computer Architecture, key-value, SIMD, Sparse Computing
Abstract: The advancement of general-purpose architecture has reached a juncture where the continuing investment to improve instruction per cycle (IPC) yields diminishing returns. While domain-specific architecture more efficiently converts silicon resources into throughput, economic viability hinders their wider adoption, except in a few areas. A more feasible way is to extract reusable operations that can be used across multiple domains and then find efficient architecture to support them. This dissertation focuses on operations involving pairs of keys and values. The application spans a wide range of domains, including database, graph computing, genomics, and sparse computing. The processing of key-value pairs is divided into two categories: ordered and unordered. For the ordered category, we optimized the general merge style operations on a sorted key-value array by creating a set of highly composable primitives. Next, we show that many widely used ordered data structures and algorithms, such as heap and binary search, can be accelerated by rewriting them to use merge operation as a building block. For the unordered ones, we observe that reduce-by-key is a common bottleneck in many domains. We propose the design of the Reduce-By-Key core and introduce a new algorithm to accelerate this operation. We also analytically prove that our method is close to optimal. Lastly, we investigate the decomposition operation on sparse tensors - a special form of key-value pairs. We show how a PE-interactive architecture can be used to significantly increase data reuse.
Published: 2023
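
Illustrative note on entry 20: the abstract singles out reduce-by-key as a common bottleneck for unordered key-value pairs. The sketch below is a plain software reference for what the operation computes: all values that share a key are combined with a binary operator. It does not reproduce the dissertation's Reduce-By-Key core or its new algorithm; the function name `reduce_by_key` and its parameters are assumptions made for this example.

```python
import operator

def reduce_by_key(pairs, op=operator.add):
    """Combine the values of all pairs that share a key using `op`.

    `pairs` is any iterable of (key, value) tuples in arbitrary order.
    Returns a dict mapping each distinct key to its reduced value.
    """
    out = {}
    for key, value in pairs:
        # First occurrence of a key stores the value; later ones are folded in.
        out[key] = op(out[key], value) if key in out else value
    return out

if __name__ == "__main__":
    # Example: accumulating duplicate coordinates of a sparse vector.
    updates = [(3, 1.0), (7, 2.5), (3, 0.5), (0, 4.0), (7, 1.5)]
    print(reduce_by_key(updates))
    # Any associative operator can be used, for example max instead of sum.
    print(reduce_by_key(updates, max))
```

The same pattern appears in the domains the abstract lists, for instance summing partial products that land on the same output index in sparse computing or aggregating edge contributions per vertex in graph computing.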

21. Advancing architecture optimizations with Bespoke Analysis and Machine Learning

Author: Sethumurugan, Subhash
Subjects: Bespoke processors, Computer Architecture, Machine Learning, Reinforcement Learning, Symbolic Simulation
Abstract: With transistor scaling nearing atomic dimensions and leakage power dissipation imposing strict energy limitations, it has become increasingly difficult to improve energy efficiency in modern processors without sacrificing performance and functionality. One way to avoid this tradeoff and reduce energy without reducing performance or functionality is to take a cue from application behavior and eliminate energy in areas that will not impact application performance. This approach is especially relevant in embedded systems, which often have ultra-low power and energy requirements and typically run a single application over and over throughout their operational lifetime. In such processors, application behavior can be effectively characterized and leveraged to identify opportunities for ``free'' energy savings. We find that in addition to instruction-level sequencing, constraints imposed by program-level semantics can be used to automate processor customization and further improve energy efficiency. This dissertation describes automated techniques to identify, form, propagate, and enforce application-based constraints in gate-level simulation to reveal opportunities to optimize a processor at the design level. While this can significantly improve energy efficiency, if the goal is truly to maximize energy efficiency, it is important to consider not only design-level optimizations but also architectural optimizations. That being said, architectural optimization presents several challenges. First, the symbolic simulation tool used to characterize gate-level behavior of an application must be written anew for each new architecture. Given the expansiveness of the architectural parameter space, this is not feasible. To overcome this barrier, we developed a generic symbolic simulation tool that can handle any design, technology, or architecture, making it possible to explore application-specific architectural optimizations. 
However, exploring each parameter variation still requires synthesizing a new design and performing application-specific optimizations, which again becomes infeasible due to the large architecture parameter space. Given the wide usage of Machine Learning (ML) for effective design space exploration, we sought the aid of ML to efficiently explore the architectural parameter space. We built a tool that takes into account the impacts of architectural optimizations on an application and predicts the architectural parameters that result in near-optimal energy efficiency for an application. This dissertation explores the objective, training, and inference of the ML model in detail. Inspired by the ability of ML-based tools to automate architecture optimization, we also apply ML-guided architecture design and optimization for other challenging problems. Specifically, we target cache replacement, which has historically been a difficult area to improve performance. Furthermore, improvements have historically been ad hoc and highly based on designer skill and creativity. We show that ML can be used to automate the design of a policy that meets or exceeds the performance of the current state-of-art.
Published: 2023

22. Ephemeral Vector Engines

Author: Al-Hawaj, Khalid
Subjects: Compute-in-Memory, Computer Architecture, Processing-in-Memory, SIMD, Vector, VLSI
Abstract: With the recent end of Dennard’s scaling and slowdown of Moore’s law, computer architects have turned to specialization to retain the regular improvements in performance and efficiency conventionally obtained through process advancements. Although traditional fixed-function acceleration is able to achieve high performance and efficiency by leveraging specialization, it struggles with low programmability and lack of flexibility. Previous work has shown the ability of next-generation vector abstraction to balance programmability and specialization. The recent rise in popularity of next-generation vector architectures highlights the inherent tension between its two traditional micro-architectures: integrated vector unit and dedicated vector engine. While integrated vector unit achieves modest performance with low area-overhead, dedicated vector engine achieves higher performance at the expense of higher area-overhead. This thesis leverages recent advancements in in-situ compute-in-memory to address this tension. The culmination of this thesis, ephemeral vector engines (EVE), aims at solving this tension. EVE is a novel next-generation vector micro-architecture leveraging SRAM-based compute-in-memory (S-CIM) circuits to reconfigure private L2 caches on-the-fly to support next-generation vector execution. While previous work on S-CIM has explored bit-serial execution, this thesis further explores bit-parallel execution with the following conclusion: bit-serial achieves high-throughput but high-latency, while bit-parallel lowers the latency greatly at the expense of lower throughput. This thesis considers a bit-hybrid approach instead to balance throughput and latency. To evaluate the area and cycle-time of EVE, this thesis presents a detailed circuit template that enables bit-hybrid S-CIM with varying parallelization factor. 
To evaluate the performance of EVE, this thesis leverages high-fidelity cycle-approximate models for an integrated vector unit, a decoupled vector engine, and EVE. By leveraging S-CIM, EVE increases performance by 4.59x over an integrated vector unit, thus matching the performance of a decoupled vector engine while incurring a tenth of its area-overhead. In summary, EVE leverages S-CIM to achieve a performance comparable to the decoupled vector engine, while incurring an area-overhead equivalent to that of the integrated vector unit.
Published: 2022
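
Illustrative note on entry 22: the abstract contrasts bit-serial execution (high throughput, high latency) with bit-parallel execution. The sketch below models only the bit-serial side in software: every lane is advanced by one bit position per step, so an addition takes as many sequential steps as the operand width regardless of how many lanes there are. This is a behavioral illustration under stated assumptions, not the EVE circuit or its S-CIM template; the function name `bit_serial_add` and the choice to wrap results modulo 2^width are assumptions made for this example.

```python
def bit_serial_add(xs, ys, width):
    """Element-wise addition processed one bit position at a time.

    Each iteration of the outer loop is one sequential step; the inner loop
    stands in for lanes that hardware would update simultaneously.
    Results wrap modulo 2**width, like a fixed-width adder.
    """
    sums = [0] * len(xs)
    carries = [0] * len(xs)
    for bit in range(width):             # `width` sequential steps
        for lane in range(len(xs)):      # conceptually parallel lanes
            a = (xs[lane] >> bit) & 1
            b = (ys[lane] >> bit) & 1
            c = carries[lane]
            sums[lane] |= (a ^ b ^ c) << bit        # full-adder sum bit
            carries[lane] = (a & b) | (c & (a ^ b)) # full-adder carry out
    return sums

if __name__ == "__main__":
    print(bit_serial_add([3, 200, 255], [5, 100, 1], width=8))
```

Because the step count depends on the bit width and not on the number of lanes, adding more lanes raises throughput without changing latency, which is the tradeoff the abstract's bit-hybrid approach sets out to balance.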

23. Using convolutional neural networks to improve branch prediction

Author: Zangeneh Kamali, Siavash
Subjects: Branch prediction, Computer architecture, High-performance processors, Machine learning, Neural networks, Convolutional neural networks, Microarchitecture
Abstract: The state-of-the-art branch predictor, TAGE, remains inefficient at identifying correlated branches deep in a noisy global branch history. This dissertation argues this inefficiency is a fundamental limitation of runtime branch prediction and not a coincidental artifact due to the design of TAGE. To further improve branch prediction, we need to relax the constraint of runtime only training and adopt more sophisticated prediction mechanisms. To this end, I propose using convolutional neural networks (CNNs) that are trained at compile-time to accurately predict branches that TAGE cannot. Given enough profiling coverage, CNNs learn input-independent branch correlations that can accurately predict branches when running a program with unseen inputs. I describe two practical approaches for using CNNs. First, I build on the work of Tarsa et al. and introduce BranchNet, a CNN with a storage-efficient on-chip inference engine tailored to the needs of branch prediction. At runtime, BranchNet predicts a few hard-to-predict branches, while TAGE-SC-L predicts the remaining branches. This hybrid approach reduces the MPKI of SPEC2017 Integer benchmarks by 9.6% (and up to 17.7%) compared to a 64KB TAGE-SC-L without increasing the prediction latency. Alternatively, instead of using BranchNet as a black-box predictor, I use it to explicitly identify correlated branches and filter the global branch history of TAGE to include only the outcomes of correlated branches. Filtering the branch history leads to less allocation pressure and faster warmup time in TAGE, resulting in improved prediction accuracy and better storage-efficiency. Filtering TAGE histories achieves a notable fraction of BranchNet's accuracy improvements (average 3.7% MPKI reduction, up to 9.4%) with a simpler predictor design.
Published: 2022

24. Maintaining high performance in the presence of impossible-to-predict branches

Author: Pruett, Stephen M.
Subjects: Computer architecture, Microarchitecture, Branch prediction, Pre-computation, Control independence, Merge point prediction, Reconvergence
Abstract: High performance microprocessors have relied on accurate branch predictors to maintain high instruction supply for over 30 years. Unfortunately, as instruction windows and pipeline widths have continued to grow, misprediction penalties have gotten worse. Branch predictors have failed to improve at a fast enough rate to counteract these penalties. Impossible-to-predict branches, such as data-dependent branches, have become the worst offender since, so far, no viable predictor exists for these branches. I propose to identify such branches at runtime, and replace the inaccurate branch prediction with a more accurate merge point prediction. Doing so enables techniques that can either pre-compute the result of the branch, as is the case for Branch Runahead, or avoid the misprediction altogether by dynamically predicating instructions, or fetching instructions out-of-order; i.e., from the merge point until the branch direction has been determined. This dissertation presents a new merge point prediction algorithm that achieves a higher accuracy and coverage than prior work, and uses it to enable three mechanisms for dealing with impossible-to-predict branches: Branch Runahead, Dynamic Predication, and Delayed Fetch.
Published: 2022

25. A System-Level Framework for Privacy

Author: Dangwal, Deeksha
Subjects: Computer science, computer architecture, privacy, private architecture, private traces, safer sharing
Abstract: Privacy in the digital age has become increasingly difficult to achieve. While there is consensus on the importance of building privacy into systems that deal with sensitive information, our ability to reason about system-level privacy is severely limited. In this work, I introduce wringing, a new computer architecture approach for building privacy in systems to minimize information leakage. I detail how wringing enhances the privacy of program traces and how it opens up a new optimization space between privacy and utility. Next, I demonstrate how wringing generalizes beyond traces: in computer vision pipelines that rely on streaming user data for localization tasks in augmented reality settings. We discover a new reverse engineering attack on localization pipelines that can compromise user privacy and show that data minimizing wringing serves as a mitigation for such attacks. Finally, I present a new architecture that builds privacy into personal devices. Our architecture supports both data minimizing techniques like wringing and differential privacy to protect streaming data being crowd-sourced by a central aggregator. With this hardware implementation, we can enforce the user's privacy settings and prevent unintended data leakage.
Published: 2022

26. Photonic Deep Neural Network Accelerators for Scaling to the Next Generation of High-Performance Processing

Author: Shiflett, Kyle D.
Subjects: Electrical Engineering, Computer Engineering, Computer Science, computer architecture, hardware accelerators, networks-on-chip, machine learning, neural networks, silicon photonics, emerging technology, analog computing, mixed-signal computing
Abstract: Improvements from electronic processor and interconnect performance scaling are narrowing due to fundamental challenges faced at the device level. Compounding the issue, increasing demand for large, accurate deep neural network models has placed significant pressure on the current generation of processors. The slowing of Moore’s law and the breakdown of Dennard scaling leave no room for innovative solutions in traditional digital architectures to meet this demand. To address these scaling issues, architectures have moved away from general-purpose computation towards fixed-function hardware accelerators to handle demanding computation. Although electronic accelerators alleviate some of the pressure of deep neural network workloads, they are still burdened by electronic device and interconnect scaling problems. There is potential to further scale computer architectures by utilizing emerging technology, such as photonics. The low-loss interconnects and energy-efficient modulators provided by photonics could help drive future performance scaling. This could innovate the next generation of high-bandwidth, bandwidth-dense interconnects, and high-speed, energy-efficient processors by taking advantage of the inherent parallelism of light. This dissertation investigates photonic architectures for communication and computation acceleration to meet the machine learning processing requirements of future systems. The benefits of photonics are explored for bit-level parallelism, data-level parallelism, and in-network computation. 
The research performed in this dissertation shows that photonics has the potential to enable the next generation of deep neural network application performance by improving energy-efficiency and reducing compute latency. The evaluations in this dissertation conclude that photonic accelerators can: (1) Reduce energy-delay product by 73.9% at the bit-level on convolutional neural network workloads; (2) Improve throughput by 110× and improve energy-delay product by 74× on convolutional neural network workloads by exploiting data-level parallelism; (3) Improve network utilization while giving a 3.6× speedup, and reducing energy-delay product by 9.3× by performing in-network computation.
Published: 2022

27. In-SRAM Computing for Neural Network Acceleration

Author: Eckert, Charles
Subjects: Computer Architecture, Neural Network Accelerator, In-memory computing
Abstract: For decades, the computing paradigm has been composed of separate memory and compute units. Processing-in-Memory(PIM) has often been proposed as a solution to break past the memory wall. With PIM, compute logic is moved near the memory, which can reduce the data movement. In-memory computing expands on PIM by morphing the memory into hybrid memory compute units, where data can be stored and computed on in-place. Recent work has modified SRAM arrays to allow logical operations to be performed directly inside the arrays. Our work extends basic logical operations and additionally adds support for arithmetic operations. Coinciding with the rise of increasing memory on-chip and more focus on near and in-memory computing is the ascendance of neural networks. Neural networks are highly data-parallel applications that are challenging to accelerate due to being data-bound, compute-bound, or both. In-memory computation reduces on-chip data movement and will increase the amount of compute available as well as the amount of storage in a custom chip. These factors can greatly alleviate compute and data bottlenecks. First, this thesis observes that SRAM memory has increasingly dominated the on-chip area for general-purpose processors. This area comes at the cost of compute potential and can be repurposed to function as a dual storage and compute unit. The benefits of such repurposing are greatly expanding the parallel compute capability of the chip while also reducing the on-chip data movement, all with minimal area increase. When SRAM is repurposed, the storage area can be reclaimed with minimal overhead. Modifications to the SRAM arrays are presented that allow the SRAM to function as hybrid compute/storage units capable of arithmetic operations. Additionally, this work presents a mapping strategy for supporting CNNs in the hybrid SRAM storage compute arrays. 
Second, this thesis proposes a custom ASIC called Eidetic that utilizes hybrid compute/storage SRAM arrays as both its primary storage and compute units. Repurposing a processor's SRAM is hamstrung by maintaining the cache's original functions and area footprint. Too many modifications to the cache would render the solution undesirable to chip designers. By further customizing the SRAM we can create more efficient PE units. Additionally, the increased SRAM storage allows more weights to be stored on-chip. Finally, the custom ASIC allows for a control logic that supports a graph-based programming model that further reduces off-chip data movement. These customizations allow Eidetic to target data-bound applications such as RNNs and MLPs. Third, we present a detailed comparison between the in-cache and ASIC approaches to ML acceleration. Between repurposing the cache and creating an SRAM-based custom ASIC, in-SRAM computing offers multiple viable approaches. In-Cache computing is cheaper but comes with limitations, while an ASIC design is more expensive due to the total cost of ownership (TCO). We compare the performance and energy efficiency of our repurposed cache with a server-class GPU and the baseline CPU. We similarly evaluate our custom ASIC against other state-of-the-art ASIC DNN accelerators. For both the repurposed cache and the ASIC, we develop cycle-accurate simulators to determine the performance.
Published: 2022

28. In-Memory Acceleration for General Data Parallel Applications

Author: Fujiki, Daichi
Subjects: In-memory computing, Processing in memory, Data parallel computing, Computer architecture
Abstract: General purpose processors and accelerators including system-on-a-chip and graphics processing units are composed of three principal components: processor, memory, and interconnection of these two. This simple but powerful architecture model has been the basis of computer architecture for decades. However, the recent data-intensive trend in computation workloads has exposed bottlenecks in this fundamental paradigm of computers. Studies show that data communication takes 1,000x time and 40x power compared to arithmetic performed in the processors. Processing-in-Memory (PIM) has long been an attractive idea that has the potential to break the well known memory wall problem. PIM moves compute logic near the memory, and thereby reduces data movement. In contrast, certain memories have been shown to be able to morph themselves into compute units by exploiting the physical properties of the memory cells, making them intrinsically more efficient than PIM. Modern computing systems devote a large portion (more than 90%) of aggregate die area to passive memories; thus, re-purposing them as active computing units brings substantial benefits. However, prior work has only provided low-level interfaces for computation or relied on a manual mapping of machine learning kernels to the compute-capable memories. The main goal of this dissertation is to extend the compute capability of memory arrays and make them applicable to a wide range of data-parallel applications. First, a processor architecture is proposed that re-purposes resistive memory to support data-parallel in-memory computation. The proposed execution model seeks to expose the available parallelism in a memory array by supporting a programming model that merges the concepts of data-flow and vector processing. This is empowered by a compiler that transforms Data Flow Graphs of tensor programs to a set of data-parallel code modules with memory ISA.
Second, this dissertation presents Duality Cache architecture that flexibly transforms caches on demand into an in-memory accelerator that can execute arbitrary data-parallel programs. The proposed architecture adopts the SIMT execution model and uses CUDA/OpenACC framework as the programming frontend. We develop a backend compiler that compiles PTX, the intermediate representation for CUDA, for the proposed architecture. Finally, this dissertation presents a multi-layer in-memory computing framework. In-memory computing can be implemented across multiple layers of the memory hierarchy, and in such a system figuring out the right place to compute is an important question to be answered. We propose a framework that determines the appropriate level of memory hierarchy for in-memory computing and maximizes resource utilization. We compare the performance and energy efficiency of our in-memory accelerators with server class CPU and GPU using a variety of data-parallel applications. Our experimental results show that in-ReRAM computing achieves 7.5x average speedup for PARSEC applications and in-SRAM computing achieves 3.6x average speedup for Rodinia applications. Multi-layer in-memory computing can provide an overall speedup of 4.8x for Graph Neural Networks applications with a significant workload dynamism. Our multi-faceted approaches, mainly composed of enhanced arithmetic operations, parallel programming models with compilers, and parallel execution models, unlock massive compute capabilities and energy efficiency of in-memory computing for general data-parallel applications.
Published: 2022

29. Optimizing Emerging Graph Applications Using Hardware-Software Co-Design

Author: Talati, Nishil Rakeshkumar
Subjects: Computer architecture, Graph Analytics, CPU microarchitecture, Compiler analysis, Near-data processing, Hardware accelerators
Abstract: A graph is a ubiquitous data structure that models entities and their interactions through the collections of nodes and edges. It is widely employed in several important application domains ranging from social media, navigation tools, search engines, physics simulations, and biology. Despite its prevalence, the performance of graph algorithms on commercial platforms is limited. This is mainly due to the irregular memory accesses and convoluted control flow instructions used in graph algorithms while accessing large volumes of graph data (with billions of nodes/edges). Therefore, there is a pressing need for optimizing the performance of graph workloads. In this thesis, I present a systematic optimization study of a variety of graph workloads running on both static and dynamic graphs. At a high level, I first analyze the unique challenges and execution bottlenecks of the state-of-the-art graph software frameworks running on commercial hardware platforms. I then use the insights obtained from this analysis to propose design optimizations catered to the unique workload characteristics of a diversity of graph workloads. Specifically, first, I propose Prodigy, a hardware-software co-design solution to improve the performance of traditional graph processing algorithms (e.g., PageRank and SSSP) on multi-core CPUs. Second, I present an in-depth study of random walk–based graph learning algorithms on temporal graphs (a type of dynamic graph). Specifically, this study delivers high-performance, open-source CPU and GPU implementations of important graph learning applications, conducts a detailed performance analysis, and makes recommendations for future optimizations. Third, I showcase NDMiner, a domain-specialized Near Data Processing (NDP) architecture that significantly improves the performance of Graph Pattern Mining (GPM) workloads.
Last, I present Mint—a novel hardware accelerator architecture and an accompanying programming model for efficiently mining motifs in temporal graphs.
Published: 2022

30. A Loosely-Timed TLM-2.0 Model of a JPEG Encoder on a Checkerboard GPC

Author: Daroui, Arya
Subjects: Computer engineering, Computer science, Electrical engineering, Checkerboard, Computer architecture, Grid of processing cells, Memory bottleneck, System-level modeling, SystemC
Abstract: Common, classical computer architectures are based upon few computational cores that collaborate and communicate through larger, slower system memory. In this work, we introduce a configurable, checkerboard grid of processing cells architecture with distributed cores and memories designed to maximize the benefits of parallelization. We explore the checkerboard model and a classical model at a high level to compare their behaviors in a moderately parallelized JPEG encoder application benchmark. The models are simulated with a Loosely-Timed, SystemC TLM-2.0 test platform with timing by processor core, memory, and memory controller, and transaction. Our experimental results show a 66% faster execution speed and higher memory bandwidth headroom for the checkerboard architecture, compared to the classical architecture.
Published: 2022

31. Applied Machine Learning for Analyzing and Defending against Side Channel Threats

Author: Wang, Han
Subjects: Electrical engineering, Applied Machine Learning, Computer Architecture, Security, Side-Channel Attacks
Abstract: The sharing of hardware components in modern processors helps to achieve high performance and meet the increasing computation demand. Though isolation has been done among users and applications at the operating system level, recent research shows that attackers can leverage sophisticated approaches to observe the behaviors of the shared hardware components and infer secrets such as passwords and secret keys. Such observations and the corresponding attacks are called side channels and side-channel attacks (SCAs). A number of SCAs have been discovered, including Flush+Reload, Flush+Flush, Prime+Probe, Spectre, Meltdown, Fallout, RIDL, and ZombieLoad. SCAs have threatened the security of billions of hardware devices, including chips manufactured by Intel, Apple, ARM, etc. Therefore, it is urgent to address the security threats caused by SCAs. This dissertation pursues the use of machine learning to design effective defense mechanisms and obtain a comprehensive understanding of the side channel threats for emerging applications. In particular, we propose to tackle the problem from three aspects: detection, mitigation and vulnerability analysis. For the detection part, we leverage microarchitecture-level information, i.e. hardware performance counters, to build machine learning-based SCA detectors. Eventually, we propose two customized machine learning classification models to capture SCAs in real time and detect zero-day SCAs respectively. As the number of edge devices deployed in the network increases, we also investigate machine learning-based detectors against malware and SCAs on autonomous vehicles, mobile devices, and laptops. We find that hardware performance counters can effectively capture SCAs with machine learning techniques. A second aspect of the dissertation is exploring the existing system level and hardware level settings for designing light-weight SCA mitigation approaches.
We find that randomizing the frequency and prefetchers can obfuscate side channel traces and protect against secret leakage. Based on the effectiveness of machine learning-based SCA detection and randomization-based mitigation, we developed a combined detection-mitigation defense approach to further minimize the performance overhead incurred by adjusting hardware and system level parameters. In the last part of this dissertation, we evaluated the side channel leakage in more general applications, which are mostly neglected in the prior side channel research community. We find that hardware performance counters can also be used by attackers to fingerprint the websites users visited. Besides, we also discover that the input labels of deep learning models are susceptible to leakage via side-channel attacks such as Flush+Reload. To the best of our knowledge, we are the first group to identify the correlation between label information and side channel observations, highlighting the importance of reexamining the side channel vulnerability in general applications.
Published: 2022

32. Building Trusted Execution Environments

Author: Lee, Dayeol
Subjects: Computer science, Computer Architecture, Computer Security, Confidential Computing, Formal Methods, Hardware Enclave, Trusted Execution Environment
Abstract: Trusted Execution Environments (TEEs) offer hardware-based isolation, which protects the integrity and confidentiality of the in-use data of programs against various threats. Many hardware vendors have produced various TEE-enabled chips. However, there has been only a little public research on building TEEs. Building a TEE with different threat models and functionalities relies on design-space exploration. For example, a TEE must quickly adapt to various evolving threat models. In addition, a TEE can have different functionality requirements, which should not impact security guarantees. This thesis discusses research challenges in exploring the TEE design space. First, this thesis motivates why a TEE should not have a fixed threat model by demonstrating a novel off-chip side-channel attack on a TEE. Next, this thesis proposes Keystone, a software framework that enables building TEEs based on various needs, such as threat models and functionality requirements. Furthermore, this thesis discusses how to extend TEE functionality without breaking security guarantees using incremental verification.
Published: 2022

33. Rethinking the Programming Interface in Future Heterogeneous Computers

Author: Liu, Yu-Chia
Subjects: Computer science, Computer Architecture, Heterogeneous Computers, In-Storage Processing, Intelligent Storage Device, Programming Interface
Abstract: Computer systems have become more heterogeneous due to the breakdown of Dennard Scaling and the rapid growth of application demands. In addition to just having general-purpose processors, both factors have pushed modern computers to embrace hardware accelerators that are specialized for domains such as graphics and AI/ML. Besides hardware accelerators, because of the limited bandwidth provided by interconnection among the hardware components, we have seen the development of in-memory processing units and computational storage that also help with performance and thus diminish the boundary between processing units and memory in heterogeneous systems. Even though emerging hardware components in heterogeneous computers provide rich opportunities for performance improvement, programming frameworks that lack flexible programmability and proper interfaces limit the power of heterogeneous systems. In this dissertation, we envision an efficient and effective programming framework for future heterogeneous computers, and we propose that the framework should have the following characteristics. First, the interface for the heterogeneous systems must fulfill the demand of applications while maintaining the generality for a broad spectrum of applications to minimize the overhead of data representations in different system modules. Second, the programming framework for heterogeneous systems should intelligently identify the opportunities of using available hardware resources to deliver better performance and provide easy programmability. Finally, the programming interface must make it easy for applications to adopt future accelerators or processing units. I have proposed three different works based on this vision. First, I have proposed NDS, an efficient storage interface that fulfills the various application demands of data objects and gauges the underlying memory-device architectures from application demands to minimize the overhead of transforming data representations.
Second, I have proposed ActivePy, a programming framework that automatically identifies the potential code regions for computational storage, generates efficient code, and distributes tasks for the best performance without any programmer’s intervention. Lastly, I proposed UDSL, a potential programming paradigm that allows a program to scale easily with the advance of hardware accelerators or any future hardware.
Published: 2022

34. Architectural Support for Securing Systems Against Micro-Architectural Attacks

Author: Mohammadian Koruyeh, Esmaeil
Subjects: Computer science, Computer Architecture, Hardware Security, System Security
Abstract: Cybersecurity threats continue to grow, as the number of attacks on all layers of computing systems by motivated and sophisticated attackers has risen over the past several years. The recent Meltdown and Spectre attacks have shown that computer architecture and hardware also offer software-exploitable attack surfaces that can be used to compromise systems. This dissertation investigates the boundary between hardware and software with respect to computer security, exploring attacks that originate in the hardware, and conversely architecture support for securing systems and software. In this dissertation, we introduce SpectreRSB, a new Spectre attack that we developed targeting the return stack buffer used to optimize the execution of return instructions on modern CPUs. We show that both local attacks (within the same process such as Spectre 1) and attacks on SGX are possible by constructing proof of concept attacks. We also analyze additional types of the attack on the kernel or across address spaces and show that under some practical and widely used conditions they are possible. Having demonstrated the possibility of Spectre attacks, the dissertation explores general defense approaches to counter this important vulnerability class. The first defense we contribute is SpecCFI, a new CPU design principle that secures modern processors against Spectre attacks with the help of program analysis while retaining the benefits of speculative execution. SpecCFI represents a new approach to securing architecture by using techniques that protect software to enforce secure operation even during speculative execution. We extended the idea of using program analysis during speculation to defend against more variants of transient execution attacks. More specifically, we proposed the Speculative Execution Regulation (SER) as a general class of defense.
Since speculative execution states are accessible to an attacker, SER seeks to ensure that security invariants are enforced even during speculation. The third contribution of the dissertation is a general approach to securing processors against transient execution attacks by making speculation leakage free in a principled way, enabling CPUs to retain the performance advantages of speculation while removing the security vulnerabilities it exposes. Our defense, SafeSpec, is a design principle where speculative state is stored in temporary shadow structures that are not accessible to committed instructions. The final contribution of my dissertation is an exploration of side-channel attacks on emerging memories to find potential vulnerabilities. More specifically, we showed the possibility of side-channel attacks when Intel Optane persistent memory operates as the main memory in the system and DRAM is considered as the last level cache. The timing difference between accessing the DRAM and Non-Volatile RAM (NVRAM) can create a side channel.
Published: 2022

35. PRODUCTIVE AND EXTENSIBLE HARDWARE MODELING, SIMULATION, AND VERIFICATION METHODOLOGIES

Author: Jiang, Shunning
Subjects: computer architecture, hardware modeling, productive hardware design methodology
Abstract: As Dennard scaling broke down in the 2000s and Moore’s Law slowed down in the 2010s, computer engineers have been exploring new ways to extract more computing performance without increasing the power density or the transistor count. Various specialized hardware accelerators are integrated into existing multi-core architectures, creating heterogeneous system-on-chips (SoC). However, as more heterogeneous SoCs are built, the number of different hardware blocks in a single SoC is rapidly increasing. This trend significantly increases the non-recurring engineering (NRE) cost required to build new SoCs. Maximizing the reuse of hardware blocks across and inside SoC designs is one of the key ways to reduce the NRE cost. This requires both flexible parameterization of a single hardware design block and versatile composition of numerous different hardware design blocks. To enable and maximize such reuse of hardware blocks, productive hardware modeling methodologies play a critical role in the modern computer engineering workflow. This thesis takes an engineering research approach to explore productive and extensible hardware modeling, simulation, and verification methodologies. I identify four major challenges in state-of-the-art productive hardware modeling methodologies and formulate each challenge into a stand-alone research question.
Then, I propose several techniques to address these research questions: (1) native in-memory intermediate representation (NIMIR), a novel modular framework architecture, to improve the flexibility and extensibility of hardware generation and simulation frameworks (HGSF); (2) unified modular ordering constraints (UMOC), a novel modeling technique coupled with scheduling algorithms, to unify cycle- and register-transfer-level modeling and achieve high model fidelity with little effort; (3) Mamba++, a series of HGSF-aware just-in-time compilation (JIT) techniques and JIT-aware HGSF design techniques, to close the simulation performance gap in HGSFs; and (4) PyH2, our vision and techniques for testing various hardware designs leveraging open-source software, to reduce testing/verification time for agile hardware design flows. Finally, in addition to addressing each individual research question, I created PyMTL3, a new hardware generation and simulation framework which incorporates the techniques proposed in this thesis. By implementing the techniques inside a real hardware modeling framework, the practicality of the proposed techniques is demonstrated. PyMTL3 has been used in courses at Cornell University, in various research projects, and in several advanced-node chip tape-outs.
Published: 2021

36. iLORE: Discovering a Lineage of Microprocessors

Author: Furman, Samuel Lewis
Subjects: Computer history, systems, computer architecture, microprocessors
Abstract: Researchers, benchmarking organizations, and hardware manufacturers maintain repositories of computer component and performance information. However, this data is split across many isolated sources and is stored in a form that is not conducive to analysis. A centralized repository of said data would arm stakeholders across industry and academia with a tool to more quantitatively understand the history of computing. We propose iLORE, a data model designed to represent intricate relationships between computer system benchmarks and computer components. We detail the methods we used to implement and populate the iLORE data model using data harvested from publicly available sources. Finally, we demonstrate the validity and utility of our iLORE implementation through an analysis of the characteristics and lineage of commercial microprocessors. We encourage the research community to interact with our data and visualizations at csgenome.org.
Published: 2021

37. EFFICIENT FINE-GRAIN COOPERATIVE EXECUTION OF DYNAMIC TASK PARALLELISM ON HETEROGENEOUS MULTI/MANYCORE SYSTEMS

Author: Wang, Moyang
Subjects: cache coherence, computer architecture, parallel programming, task-based, work stealing
Abstract: Since the end of Dennard’s scaling, computer architects have fully embraced parallelism to continue improving the performance and energy efficiency of general-purpose processors. Multicore processors with a few to tens of high performance processor cores have been the centerpiece of many computing platforms ranging from mobile devices to data centers. Manycore processors with hundreds or thousands of simple processing elements have demonstrated their ability to achieve even higher throughput and energy efficiency when abundant explicit parallelism exists in the workloads. However, large-scale manycore processors often lack hardware-based cache coherence. There is a growing trend towards a tighter integration between multicore and manycore processors, forming heterogeneous multi/manycore systems. These systems use heterogeneous cache coherence (HCC) with hardware-based cache coherence within the multicore and software-centric cache coherence within the manycore. Unfortunately, programming heterogeneous multi/manycore systems to enable collaborative execution is challenging, especially when considering dynamic task parallelism. This thesis uses a combination of light-weight software and hardware techniques to elegantly address this problem. It provides a detailed description of how to implement a work-stealing runtime to enable dynamic task parallelism on heterogeneous cache-coherent systems with a unified task-based programming model. This thesis also proposes direct task stealing (DTS), a new technique based on user-level interrupts to bypass the memory system and thus improve the performance and energy efficiency of work stealing.
The cycle-level results in this thesis demonstrate that executing dynamic task-parallel applications on a 64-core system (4 big, 60 tiny) with complexity-effective HCC and DTS can achieve: 7x speedup over a single big core; 1.4x speedup over an area-equivalent eight big-core system with hardware-based cache coherence; and 21% better performance and similar energy efficiency compared to a 64-core system (4 big, 60 tiny) with full-system hardware-based cache coherence. This thesis also describes a realistic hardware implementation of heterogeneous multi/manycore systems based on an open-source hardware prototyping framework, OpenPiton. Using a VLSI methodology, this thesis shows that the heterogeneous multi/manycore approach achieves 3x hardware parallelism with the same area compared to a traditional homogeneous manycore.
Published: 2021

38. Design Space Exploration and Architecture Design for Inference and Training Deep Neural Networks

Author: Qi, Yangjie
Subjects: Electrical Engineering, Computer Engineering, Artificial Intelligence, deep neural network, DNN, computer architecture, DNN accelerator, design space exploration, edge computing, hardware architecture
Abstract: Deep Neural Networks (DNNs) are widely used in various application domains and achieve remarkable results. However, DNNs require a large number of computations for both the inference and training phases. Hardware accelerators are designed and implemented to compute DNN models efficiently. Many accelerators have been proposed for DNN inference, while only a limited set of DNN training accelerators has been proposed. Almost all of these accelerators are highly custom-designed and limited in the types of networks they can process. This dissertation focuses on designing novel architectures and tools for efficient training of deep neural networks, particularly for edge applications. We proposed several novel architectures and a design space exploration tool. Our proposed architecture can be used for efficient processing of DNNs, and the design space exploration model could help DNN architects explore the design space of DNN architecture design for both inference and training and help home in on the optimal architecture under different hardware constraints in applications. The first area of contribution in this dissertation is the design of Socrates-D-1, a digital multicore on-chip learning architecture for deep neural networks. This processing unit design demonstrates the capability to process the training phase of DNNs efficiently. A statically time-multiplexed routing mechanism and a co-designed mapping method are also introduced to improve overall throughput and energy efficiency. The experimental results show 6.8 to 22.3 times speedup and more than a thousand times energy efficiency over a GPGPU. The proposed architecture is also compared with several DNN training accelerators and achieves the best energy and area efficiencies. The second area of contribution in this dissertation is the design of Socrates-D-2, which is an enhanced version of Socrates-D-1. This architecture presents a novel neural processing unit design.
A dual-ported eDRAM memory replaces the double eDRAM memory design used in Socrates-D-1. In addition, a new mapping method utilizing neural network pruning techniques is introduced and evaluated with several datasets. The co-designed mapping methods helped the architecture achieve both throughput and energy efficiency without loss of accuracy. Compared with Socrates-D-1, this new architecture shows an average of 1.2 times higher energy efficiency and 1.25 times better area efficiency. The third area of contribution in this dissertation is the development of TRIM, a design space exploration model for DNN accelerators. TRIM is an infrastructure model and can explore the design space of DNN accelerators for training and inference. It utilizes a very flexible hardware template, which can model a wide range of architectures. TRIM explores the design space of data partition and reuse strategies for each hardware architecture and estimates the optimal time and energy. Our experimental results show that TRIM can achieve more than eighty percent accuracy on time and energy estimations. To the best of our knowledge, TRIM is the first infrastructure to model and explore the design space of DNN accelerators for training and inference. The fourth area of contribution in this dissertation is a set of design space explorations using TRIM. Through several case studies, we explored the design space of DNN accelerators for training and inference. We compared different dataflows and showed the impact of dataflow on efficient processing of DNNs. We showed how to use TRIM to optimize the dataflow. We explored the design space of spatial architectures and showed the results of varying different hardware choices. Based on the exploration results, several high throughput and energy-efficient DNN training accelerators were presented. The fifth area of contribution in this dissertation is the design of an FPGA-based training accelerator for edge devices.
We designed a CPU-FPGA accelerator that can operate under 5W. TRIM is utilized for dataflow optimization and hardware parameter selection. The experimental results show that we could achieve a 1.93 times speedup and 1.43 times energy efficiency for end-to-end training over a CPU implementation.
Published: 2021

39. Enabling Hyperscale Web Services

Author: Sriraman, Akshitha
Subjects: Hyperscale computing, Data center, Web service, Computer architecture, Software systems
Abstract: Modern web services such as social media, online messaging, web search, video streaming, and online banking often support billions of users, requiring data centers that scale to hundreds of thousands of servers, i.e., hyperscale. In fact, the world continues to expect hyperscale computing to drive more futuristic applications such as virtual reality, self-driving cars, conversational AI, and the Internet of Things. This dissertation presents technologies that will enable tomorrow’s web services to meet the world’s expectations. The key challenge in enabling hyperscale web services arises from two important trends. First, over the past few years, there has been a radical shift in hyperscale computing due to an unprecedented growth in data, users, and web service software functionality. Second, modern hardware can no longer support this growth in hyperscale trends due to a decline in hardware performance scaling. To enable this new hyperscale era, hardware architects must become more aware of hyperscale software needs and software researchers can no longer expect unlimited hardware performance scaling. In short, systems researchers can no longer follow the traditional approach of building each layer of the systems stack separately. Instead, they must rethink the synergy between the software and hardware worlds from the ground up. This dissertation establishes such a synergy to enable futuristic hyperscale web services. This dissertation bridges the software and hardware worlds, demonstrating the importance of that bridge in realizing efficient hyperscale web services via solutions that span the systems stack. The specific goal is to design software that is aware of new hardware constraints and architect hardware that efficiently supports new hyperscale software requirements. 
This dissertation spans two broad thrusts: (1) a software and (2) a hardware thrust to analyze the complex hyperscale design space and use insights from these analyses to design efficient cross-stack solutions for hyperscale computation. In the software thrust, this dissertation contributes uSuite, the first open-source benchmark suite of web services built with a new hyperscale software paradigm, that is used in academia and industry to study hyperscale behaviors. Next, this dissertation uses uSuite to study software threading implications in light of today’s hardware reality, identifying new insights in the age-old research area of software threading. Driven by these insights, this dissertation demonstrates how threading models must be redesigned at hyperscale by presenting an automated approach and tool, uTune, that makes intelligent run-time threading decisions. In the hardware thrust, this dissertation architects both commodity and custom hardware to efficiently support hyperscale software requirements. First, this dissertation characterizes commodity hardware’s shortcomings, revealing insights that influenced commercial CPU designs. Based on these insights, this dissertation presents an approach and tool, SoftSKU, that enables cheap commodity hardware to efficiently support new hyperscale software paradigms, improving the efficiency of real-world web services that serve billions of users, saving millions of dollars, and meaningfully reducing the global carbon footprint. This dissertation also presents a hardware-software co-design, uNotify, that redesigns commodity hardware with minimal modifications by using existing hardware mechanisms more intelligently to overcome new hyperscale overheads. Next, this dissertation characterizes how custom hardware must be designed at hyperscale, resulting in industry-academia benchmarking efforts, commercial hardware changes, and improved software development. 
Based on this characterization’s insights, this dissertation presents Accelerometer, an analytical model that estimates gains from hardware customization. Multiple hyperscale enterprises and hardware vendors use Accelerometer to make well-informed hardware decisions.
Published: 2021

40. Memory and System Aware Architectures for Real-Time Machine Learning

Author: Pinkham, Reid
Subjects: Machine Learning, Computer Architecture, Real-time computation, augmented and virtual reality ar/vr
Abstract: There has been an explosion of growth in the field of Machine Learning (ML) enabled by the widespread availability of continually faster computing hardware. These new ML algorithms are increasingly used in real-time applications which introduces new challenges for the computational hardware. The real-time requirement requires moving the computation closer to the data source which places more importance on the efficiency and latency of computation. This thesis is comprised of four parts which introduce a mix of algorithmic and hardware advancements to enable real-time ML computation. The first part focuses on the task of performing a k-Nearest-Neighbor (kNN) search on un-ordered point clouds for autonomous vehicle applications. I present QuickNN, an FPGA based accelerator which can handle real-time kNN point cloud processing. QuickNN uses strategically placed caches and ordered external memory placement to alleviate the limited external memory bandwidth. It also introduces an efficient tree storage and accompanying traversal method to reduce the bottleneck of tree manipulation. The second part introduces a lightweight CNN-based compression algorithm which can be used on high-frame-rate streamed video data. This work aims to address the challenge of compressing video data quickly with relatively low overhead using a Convolutional Neural Network (CNN). Compared to previous CNN-based compression schemes, the presented method has comparable compression complexity to state of the art traditional compression schemes, but has the advantage of a near-zero overhead in decompression. The third part presents an in-depth design space exploration of the multi-processor Augmented and Virtual Reality (AR/VR) device. This work introduces an example AR/VR platform and performs analysis of the trade offs associated with splitting CNN computation between the multiple processors on the device, including the small on-sensor processors. 
Using these insights, some straightforward design rules are presented and shown to yield nearly optimal processor specification and algorithm mapping. Finally, two real-world processor limitations are discussed and how they impact the algorithm mapping and most suitable types of processors. The fourth part presents a design for a near-sensor CNN processing architecture which is adept to a dynamically varying workload. The presented architecture is intended for near-sensor compute comprised of a scalable processor and stacked high density non-volatile memories (NVM) which store the CNN weights and can be power gated at run time to save energy. The processing architecture consists of multiple connected tiles, each with multiple vector-matrix multiplier (VMM) units. Through supporting multiple mapping methods, dataflow schemes, and fine-grained power gating, the processing architecture can efficiently adapt to a wide range of real-time workloads. We demonstrate that the same architecture can be scaled in size to fit the design envelope of the system while maintaining efficiency, as well as quantify the impact of individual architecture improvements over a standard SIMD-based design. Together, these four works tie together aspects of real-time system design which are important for a diverse set of future applications. As the applications of ML algorithms continue to expand, so must the supporting compute architectures. Advancement of real-time architectures will enable the next wave of computing platforms, from autonomous vehicles to wearable AR devices, which will continue to lead to a safer and more connected world.
Published: 2021

41. Hardware / Software System for Portable and Low-Cost Genome Assembly

Author: Gnanasambandapillai, Vikkitharan ; https://orcid.org/0000-0003-0306-1952
Subjects: Hardware / Software co-design, Genome Assembly, Bio-informatics, Application-specific instruction-set processors, Computer architecture, Pipelined processor architecture, anzsrc-for: 460612 Service oriented computing
Abstract: “The enjoyment of the highest attainable standard of health is one of the fundamental rights of every human being without distinction of race, religion, political belief, economic or social condition” [56]. Genomics (the study of the entire DNA) provides such a standard of health for people with rare diseases and helps control the spread of pandemics. Still, millions of human beings are unable to access genomics due to its cost, and portability. In genomics, DNA sequencers digitise DNA information, and computers analyse the digitised information. We have desktop and thumb-sized DNA sequencers, that digitise the DNA data rapidly. But computations necessary for the analysis of this data are inevitably performed on high-performance computers (HPCs) and cloud computers. These computations not only require powerful computers but also necessitate high-speed networks since the data generated are in the hundreds of gigabytes. Relying on HPCs and high-speed networks, deny the benefits that can be reaped by genomics for the masses who live in remote areas and in poorer nations. Developing a low-cost and portable genomics computation platform would provide personalised treatment based on an individual’s DNA and identify the source of the fast-spreading epidemics in remote areas and areas without HPC or network infrastructure. But developing a low-cost and portable genome analysing computing platform is a challenging task. This thesis develops novel computer architecture solutions to assemble the whole human DNA and COVID-19 virus RNA on a low-cost and portable platform. The first phase of the solution describes a ring-pipelined processor architecture for a key genome assembly algorithm. The human genome is partitioned to fit into the small memory footprint of embedded processors. These techniques allow an entire human genome to be assembled using highly portable and low-cost embedded processor cores. These processor cores can be housed within a single chip. 
Each processor was only 0.08 mm² and consumed just 37.5 mW. It has only 2 GB of memory, a 32-bit instruction width, and a clock with a 1 GHz frequency. The second phase of the solution describes how application-specific instruction-set processors can be sped up to execute a key genome assembly algorithm. A fully automated design system is presented, which improves the performance of large applications (such as the genome assembly algorithm) and generates application-specific instructions for a commercial processor design tool (Xtensa). The tool enhances the base processor, which was used in the ring pipeline processor architecture. Thus, the alignment algorithms execute 2.1 times faster with only 11% additional hardware. The energy-delay product was reduced by 7.3× compared to the base processor. This tool is the only one of its type which can handle applications which are large. The third phase of the solution designs a portable low-cost genome assembly computer (PGA). PGA enhances the ring pipeline architecture with the customised processor found in phase two and with improved inter-processor communication. The results show that the COVID-19 virus RNA can be assembled in under 10 minutes and the whole human genome can be assembled in 11 days on a portable platform (HPCs take around two days) for 30× coverage. PGA has an area footprint of just 5.68 mm² in a 28 nm technology node and is far smaller than a high-performance computer processor chip. The PGA consumes only 4 W of power, which is lower than the power requirement of a high-performance processor chip. The manufacturing cost of the PGA would also be much lower than the high-performance system cost when produced in volume. The developed solution can be powered by a USB port of a laptop. This thesis is the first of its type to show the design of a single-chip solution able to process a complex genomic problem.
This thesis contributes to attaining one of the fundamental rights of every human being wherever they may live.
Published: 2021

42. Architecture Supports and Optimizations for Memory-Centric Processing System

Author: Gu, Peng
Subjects: Computer engineering, accelerator, computer architecture, memory system
Abstract: For the past two decades, the scaling of main memory has lagged behind the advancement of computation in both bandwidth and capacity. First, conventional compute-centric architecture faces challenges in scaling memory bandwidth due to the limitation of off-chip interconnect resources and the energy inefficiency of long-distance data movement. Also, emerging big data workloads have an increasing demand for higher memory capacity, which cannot be satisfied by traditional DRAM technology scaling. To address these challenges, this dissertation focuses on exploring memory-centric architectures and design optimizations for higher memory bandwidth and larger memory capacity. Three categories of memory-centric designs have been researched. The analog process-in-memory architecture merges computation logic inside memory arrays. It employs the in-situ computing capabilities of resistive memory arrays to eliminate data movements and benefits from massive data parallelism. The digital process-near-memory architecture integrates computation units near memory arrays. The near-memory lightweight components can utilize the abundant bandwidth of the internal memory arrays while the optimizations maintain hardware programmability. The enhanced memory design develops a simulation framework for emerging non-volatile memory technologies, which can greatly boost memory capacity. Using both emerging non-volatile memory and 3D stacking memory technologies, this dissertation investigates four architectures and one simulation framework, covering a wide spectrum of application domains including deep learning, image processing, and high-performance parallel computing.
Published: 2021

43. Load Driven Branch Predictor (LDBP)

Author: Sridhar, Akash
Subjects: Computer engineering, Branch Prediction, Computer Architecture, Microarchitecture
Abstract: A larger instruction window on Out-of-Order (OoO) cores facilitates better exploitation of inherent Instruction Level Parallelism (ILP). The branch misspeculation penalty restricts scaling to larger instruction windows in OoO cores. Branch instructions dependent on hard-to-predict load data are the leading misprediction contributors. Computer architects continuously strive to optimize branch prediction algorithms and increase predictor size to mitigate mispredictions. Current state-of-the-art history-based branch predictors have low prediction accuracy for these branches. Prior research backs this observation by showing that increasing the size of a 256-Kbit history-based branch predictor to its 1-Mbit variant yields just a 10% reduction in branch mispredictions. In this dissertation, I present the novel Load Driven Branch Predictor (LDBP), specifically targeting hard-to-predict branches dependent on a load instruction. Though random load data determines these branches’ outcomes, most of the load addresses for these data follow a predictable pattern. This is an observable template in data structures like arrays and maps. The LDBP predictor model exploits this behavior to trigger future loads associated with branches ahead of time and uses their data to predict the branch outcomes. The predictable loads are tracked, and the branch instruction’s precomputed outcomes are buffered for making predictions. The experimental results show that on a modern Zen2-like OoO core, augmenting a standalone 256-Kbit IMLI predictor with LDBP reduces average branch mispredictions by 12% and improves average IPC by 7.14% for benchmarks from the SPEC CINT2006 and GAP benchmark suites.
Published: 2021

44. Enabling Non-Volatile Memory for Data-intensive Applications

Author: Liu, Xiao
Subjects: Computer science, Computer architecture, Data-intensive applications, Memory hierarchy, Non-volatile memory
Abstract: Emerging Non-Volatile Memory (NVM) technologies are reshaping computer architecture. NVM's advantages include a byte-addressable interface, low latency, high capacity, and in-memory computing capability. However, data-intensive applications today demand compound features rather than just better performance. For instance, big data applications require high availability and reliability, while neural network applications require scalability and power efficiency. Despite all the advantages of NVM, simply attaching NVM to the memory hierarchy is unable to meet these demands. The decoupled reliability schemes among NVM and other devices fail to provide sufficient reliability. Vulnerability to overheating and hardware underutilization limit the performance and scalability of in-memory computing NVM. Using NVM for data-intensive applications requires redesign and customization. In this thesis, we focus on discussing the architecture designs that enable NVM for data-intensive applications. Our study includes two major types of data-intensive applications – big data applications and neural network applications. We first conduct a characterization study of persistent memory applications. Persistent memory is implemented over NVM-based main memory and guarantees crash consistency. We explore the performance interaction across applications, persistent memory system software, and hardware components. Based on our characterization results, we provide a set of implications and recommendations for optimizing persistent memory designs. Second, we propose Binary Star for generic data-intensive applications, which coordinates the reliability schemes and consistent cache writeback between the 3D-stacked DRAM last-level cache and NVM main memory to maintain the reliability of the memory hierarchy.
Binary Star significantly reduces the performance and storage overhead of consistent cache writeback by coordinating it with NVM wear leveling. For neural network applications, our first design explores the thermal effect over one representative NVM – resistive memory (RRAM). We find heat-induced interference decreases the computational accuracy in the RRAM-based neural network accelerator. We propose HR3AM, a heat resilience design, which improves accuracy and optimizes the thermal distribution. Results show that HR3AM improves classification accuracy and decreases both the maximum and average chip temperatures. Lastly, we present Mirage to improve parallelism and flexibility for pipeline-enabled RRAM-based accelerators. Mirage is a hardware/software co-design that addresses the data dependencies and inflexibility issues of existing accelerators. Our evaluation shows that Mirage achieves low inference latency and high throughput compared to state-of-the-art RRAM-based accelerators.
Published: 2021

45. Adaptive AI Algorithms for Generic Hardware and Unified Hardware Acceleration Architecture

Author: Shi, Feng
Subjects: Computer science, Electrical engineering, artificial intelligence, computer architecture
Abstract: We are now in an era of the Big Bang of artificial intelligence (AI). In this wave of revolution, both industry and academia have invested substantial funds and resources. Machine learning, especially Deep Learning, has been widely deployed to replace traditional algorithms in many domains, from the Euclidean data domain to the non-Euclidean domain. As the complexity and scale of AI algorithms increase, the systems hosting these algorithms require more computational power and resources than before. Using the design of the modules of the video analytic platform as the use cases, we analyze the workload cost for computational resource and memory allocation during the execution of the system. The video analytic platform is a complex system that comprises various computer vision and decision-making tasks. Every module accomplishing a specific task is a stage in the pipeline of the video analytic platform. With the analyses mentioned above, we synthesize the adaptive AI algorithms from availability and variability perspectives, such as optimization with tensorization or matricization. We conceive the sparse Transformer and segmented linear Transformer as the critical components for the human action recognition task. The Constraint Satisfaction Problem is employed to assist the decision-making in the scene parsing stage. To facilitate the fulfillment of this task, we designed a hybrid model for a graph learning-based SAT solver. Graph matching is employed at the final stage for the scene understanding task. We implemented a hybrid model of GNN and Transformer architecture. Finally, we design the unified hardware acceleration architecture for both dense and sparse data based on the optimizations of algorithms. Our design of the architecture targets the arithmetic operation kernels, such as matrix multiplications, with the help of data transformation and rearrangement.
We first transform the inputs and weights with the Winograd transform for dense convolution operations, then we feed the transformed data to the matrix multiplication accelerator. For sparse data, we need to use the indices of nonzero elements to fetch data; therefore, indexation, scattering, and gathering are crucial components, and an effective implementation will dramatically improve the system's overall performance. To improve the matrix multiplication accelerator's efficiency and reduce the number of heavy arithmetic operations and the number of memory accesses, we also implement a hardware-based recursive algorithm, i.e., Strassen's algorithm for matrix multiplication.
Published: 2021

46. Near Memory Processing in Hybrid Memory System: 3D-DRAM vs. 3D-NVM

Author: S. Hosseini, Maryam
Subjects: Computer engineering, Computer Architecture, Near Memory Processing
Abstract: The cost of transferring data between the off-chip memory system and compute unit is the fundamental energy and performance bottleneck in conventional multi-core computing systems. Furthermore, in the era of big data and with the advent of emerging data-intensive applications, such as graph processing, machine learning, deep learning, media processing, data mining, computer vision, computational biology, and speech recognition, this bottleneck has continuously increased. For such applications, the expensive data movement between memory and compute unit dominates both execution time and energy/power consumption, which impedes future performance scaling. Moreover, technology scaling (the end of Moore's law and failure of Dennard scaling) has made all compute units energy and power constrained. In order to satisfy the energy and power constraints, researchers are forced to stop further increasing the frequency and to reduce the chip utilization. Thus, to continue scaling the performance, energy overhead must be minimized for every operation. To overcome these difficulties, different approaches, either algorithmic-level or architectural-level, can be applied. The latter promising approach, commonly referred to as Near Memory Processing (NMP), has become a potential and practical technology to transform computation-centric systems towards memory-centric systems. The introduction of 3D die stacking technology and, more importantly, hybrid memory systems have revolutionized the concept of NMP. 3D die stacking, built using Through-Silicon Vias (TSVs), offers higher bandwidth, shorter wire lengths, lower power (due to short-length low-capacitance wires), and better performance compared to traditional 2D planar memories. This memory technology allows architects to implement practical NMP systems by vertically stacking multiple memory layers on top of a logic die in the same package.
The logic layer is typically the bottommost layer, which provides an area for adding a wide range of processing logic (general-purpose cores, FPGAs, ASICs, or a combination of all types). It enables higher-density many-core architectures and helps improve the power-performance characteristics to increase the capabilities of modern integrated circuits. The focus of this dissertation is to explore and evaluate the feasibility and efficacy of an NMP architecture constructed based on an emerging Non-Volatile Memory (NVM) technology in a 3D structure, and to compare it with the conventional NMP architecture built based on 3D-DRAM in terms of performance and power consumption. To this purpose, first, a set of NMP-centric performance metrics are redefined in order to analyze the efficacy of mapping a given processing unit to a specific application. Leveraging the proposed metrics, a comprehensive characterization is conducted on a wide range of multi-threaded applications (various computation and memory patterns) from different domains as a case study to reveal their performance bottleneck. Then, two different NMP architectures are explored and the impact of constructing an NMP architecture based on an emerging non-volatile memory technology (3D-NVM) is analyzed. The feasibility of having an NMP subsystem on a hybrid 3D memory system is also motivated in this dissertation. Finally, the experimental results demonstrate that executing certain data-intensive (memory-intensive) applications on the evaluated NMP architectures (3D-PCM and 3D-DRAM) improves performance by 1.3x to 5x and reduces memory power/energy consumption by an average of 47% compared to executing them on a conventional multi-core host CPU system. These improvements make the hybrid NMP system a great design technique for acceleration in performance and power across a wide range of data-intensive applications.
Published: 2021

47. Machine learning for prediction problems in computer architecture

Author: Shi, Zhan
Subjects: Computer architecture, Machine learning, Cache replacement, Prefetching, Deep learning accelerator, Bayesian optimization
Abstract: The solutions to many problems in computer architecture involve predictions, which are often based on heuristics. Given the success of machine learning in solving prediction problems, it is natural to wonder if machine learning can better solve architectural prediction problems. Unfortunately, despite vastly outperforming traditional heuristics in various fields, machine learning has seen limited impact on prediction problems in computer architecture. The main challenge is that each architectural prediction problem exhibits unique constraints that prevent off-the-shelf machine learning algorithms from being more effective than heuristics. For example, hardware prediction problems, such as branch prediction and cache replacement, impose severe latency and area constraints that make multi-layer neural networks largely infeasible. In this thesis, we propose machine learning solutions to three important problems in computer architecture, namely cache replacement, data prefetching, and the automatic design of neural network accelerators. In our solutions, we focus on not only the design of learning algorithms, but also the use of learning algorithms under the unique constraints of each problem. In particular, to deal with the extremely tight area and latency constraints of replacement policies and data prefetchers, we propose to first design powerful yet impractical neural network models, from which we derive important insights that can be used to design practical predictors. To deal with the highly constrained search space in the automated design of neural network accelerators, we propose a new constrained Bayesian optimization framework to effectively explore the search space where over 90% of designs are infeasible.
Published: 2020

48. Design and prototyping of Hardware-Accelerated Locality-aware Memory Compression

Author: Srinivas, Raghavendra
Subjects: Computer Architecture, Memory Controller, DRAM, Accelerator, Compression
Abstract: Hardware acceleration is the most sought-after technique in chip design to achieve better performance and power efficiency for critical functions that may be inefficiently handled by traditional OS/software. As technology has advanced, with 7nm products already in the market that provide better power and performance in a small area, latency-critical functions that were traditionally handled by software have started moving into acceleration units on the chip. This thesis describes the accelerator architecture, implementation, and prototype for one such function, namely "Locality-Aware memory compression", which is part of the "OS-controlled memory compression" scheme that has been actively deployed in today's OSes. In brief, OS-controlled memory compression is a new memory management feature that transparently, dramatically, and adaptively increases effective main memory capacity on demand as software-level memory usage increases beyond physical memory system capacity. OS-controlled memory compression has been adopted across almost all OSes (e.g., Linux, Windows, macOS, AIX) and almost all classes of computing systems (e.g., smartphones, PCs, data centers, and cloud). The OS-controlled memory compression scheme is locality aware. Even so, under OS-controlled memory compression today, applications experience long-latency page faults when accessing compressed memory. To solve this performance bottleneck, an acceleration technique has been proposed to manage "Locality Aware Memory compression" within hardware, thereby enabling applications to access their OS-compressed memory directly. This accelerator is referred to as HALK throughout this work, which stands for "Hardware-accelerated Locality-aware Memory Compression". The literal meaning of the word HALK in English is 'a hidden place'. As such, this accelerator is neither exposed to the OS nor to the running applications.
It is hidden entirely in the memory controller hardware and incurs minimal hardware cost. This thesis work explores developing an FPGA design prototype and gives the proof of concept for the functionality of HALK by running non-trivial micro-benchmarks. This work also provides and analyzes the power, performance, and area of HALK for ASIC designs (at a technology node of 7nm) and the selected FPGA prototype design.
Published: 2020

49. Practical irregular prefetching

Author: Wu, Hao (Ph. D. in computer science)
Subjects: Computer architecture, Cache, Memory system, Temporal prefetching
Abstract: Memory accesses continue to be a performance bottleneck for many programs, and prefetching is an effective and widely used method for alleviating the memory bottleneck. However, prefetching can be difficult for irregular workloads, for which the hardware sees no clear patterns such as sequential or strided patterns. For irregular workloads, one promising approach is to perform temporal prefetching, which memorizes temporal correlations that happened in the past and uses them to predict future memory accesses. Storing these correlations requires megabytes of metadata, which cannot be feasibly stored on-chip. As a result, previous temporal prefetchers store metadata off-chip in DRAM, which introduces hardware implementation difficulties, increases DRAM latencies and increases DRAM traffic overhead. For example, the STMS prefetcher proposed by Wenisch et al. has 3.42x DRAM traffic overhead for irregular SPEC2006 workloads. These problems make previous temporal prefetchers impractical to implement in commercial hardware. In this thesis, we propose three methods to alleviate the metadata storage problems in temporal prefetching and make it practical in hardware. First, we propose MISB, a new scheme that uses a metadata prefetcher to manage on-chip metadata. With only 1/5 of the traffic overhead of STMS, MISB achieves a 22.7% performance speedup over a baseline with no prefetching, compared to 10.6% for an idealized STMS and 4.5% for a realistic ISB. Second, we present Triage, the first temporal prefetcher that stores its entire metadata on chip, which reduces hardware complexity and DRAM traffic by re-purposing part of the last-level cache to store metadata. Triage reduces traffic by 60% compared to MISB and achieves a 13.9% performance speedup over a baseline with no prefetching. In a bandwidth-constrained 8-core environment, Triage has an 11.4% speedup compared to 8.0% for MISB. Third, we present a new resource management scheme for Triage's on-chip metadata.
This scheme integrates ISP's compressed metadata representation and makes several improvements. For irregular benchmarks, this scheme reduces on-chip metadata storage requirement by 38% and achieves 29.6% speedup compared to Triage's 25.3%.
Published: 2020

50. Towards Using Free Memory to Improve Microarchitecture Performance

Author: Panwar, Gagandeep
Subjects: Computer Architecture, Memory, DRAM, HPC systems
Abstract: A computer system's memory is designed to accommodate the worst-case workloads with the highest memory requirement; as such, memory is underutilized when a system runs workloads with common-case memory requirements. Through a large-scale study of four production HPC systems, we find that the memory underutilization problem in HPC systems is very severe. As unused memory is wasted memory, we propose exposing a compute node's unused memory to its CPU(s) through a user-transparent CPU-OS codesign. This can enable many new microarchitecture techniques that transparently leverage unused memory locations to help improve microarchitecture performance. We refer to these techniques as Free-memory-aware Microarchitecture Techniques (FMTs). In the context of HPC systems, we present a detailed example of an FMT called Free-memory-aware Replication (FMR). FMR replicates in-use data to unused memory locations to effectively reduce average memory read latency. On average across five HPC benchmark suites, FMR provides a 13% performance improvement and an 8% system-level energy improvement.
Published: 2020
