Author: "Horowitz, Mark" / Topic: computer science - hardware architecture - Searchworks@Jio Institute Digital Library Search Results

1. Retrospective: EIE: Efficient Inference Engine on Sparse and Compressed Neural Network

Author: Han, Song, Liu, Xingyu, Mao, Huizi, Pu, Jing, Pedram, Ardavan, Horowitz, Mark A., and Dally, William J.
Subjects: Computer Science - Hardware Architecture
Abstract: EIE proposed to accelerate pruned and compressed neural networks, exploiting weight sparsity, activation sparsity, and 4-bit weight-sharing in neural network accelerators. Since its publication at ISCA'16, it has opened a new design space to accelerate pruned and sparse neural networks and spawned many algorithm-hardware co-designs for model compression and acceleration, both in academia and commercial AI chips. In retrospect, we review the background of this project, summarize the pros and cons, and discuss new opportunities where pruning, sparsity, and low precision can accelerate emerging deep learning workloads.
Comment: Invited retrospective paper at ISCA 2023
Published: 2023

2. Hardware Abstractions and Hardware Mechanisms to Support Multi-Task Execution on Coarse-Grained Reconfigurable Arrays

Author: Kong, Taeyoung, Koul, Kalhan, Raina, Priyanka, Horowitz, Mark, and Torng, Christopher
Subjects: Computer Science - Hardware Architecture
Abstract: Domain-specific accelerators are used in various computing systems ranging from edge devices to data centers. Coarse-grained reconfigurable arrays (CGRAs) represent an architectural midpoint between the flexibility of an FPGA and the efficiency of an ASIC and are a promising candidate for servicing multi-tasked workloads within an application domain. Unfortunately, scheduling multiple tasks onto a CGRA is challenging. CGRAs lack abstractions that capture hardware resources, leaving workload schedulers unable to reason about performance, energy, and utilization for different schedules. This work first proposes a CGRA architecture that can flexibly partition key resources, including the global buffer memory capacity, the global buffer memory bandwidth, and the compute resources. Partitioned resources serve as hardware abstractions that decouple compilation and resource allocation. The compiler uses these abstractions for coarse-grained resource mapping, and the scheduler uses them for flexible resource allocation at run time. We then propose two hardware mechanisms to support multi-task execution. A flexible-shape execution region increases the overall resource utilization by mapping multiple tasks with different resource requirements. Dynamic partial reconfiguration (DPR) enables a CGRA to update the hardware configuration as the scheduler makes decisions rapidly. We show that our abstraction can help automatic and efficient scheduling of multi-tasked workloads onto our target CGRA with high utilization, resulting in 1.05x-1.24x higher throughput and a 23-28% lower latency in a multi-tasked cloud workload and 60.8% reduced latency in an autonomous system workload when compared to a baseline CGRA running single tasks at a time.
Published: 2023

3. Vision Transformer Computation and Resilience for Dynamic Inference

Author: Sreedhar, Kavya, Clemons, Jason, Venkatesan, Rangharajan, Keckler, Stephen W., and Horowitz, Mark
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Hardware Architecture
Abstract: State-of-the-art deep learning models for computer vision tasks are based on the transformer architecture and often deployed in real-time applications. In this scenario, the resources available for every inference can vary, so it is useful to be able to dynamically adapt execution to trade accuracy for efficiency. To create dynamic models, we leverage the resilience of vision transformers to pruning and switch between different scaled versions of a model. Surprisingly, we find that most FLOPs are generated by convolutions, not attention. These relative FLOP counts are not a good predictor of GPU performance since GPUs have special optimizations for convolutions. Some models are fairly resilient and their model execution can be adapted without retraining, while all models achieve better accuracy with retraining alternative execution paths. These insights mean that we can leverage CNN accelerators and these alternative execution paths to enable efficient and dynamic vision transformer inference. Our analysis shows that leveraging this type of dynamic execution can lead to saving 28% of energy with a 1.4% accuracy drop for SegFormer (63 GFLOPs), with no additional training, and 53% of energy for ResNet-50 (4 GFLOPs) with a 3.3% accuracy drop by switching between pretrained Once-For-All models.
Published: 2022

4. Canal: A Flexible Interconnect Generator for Coarse-Grained Reconfigurable Arrays

Author: Melchert, Jackson, Zhang, Keyi, Mei, Yuchen, Horowitz, Mark, Torng, Christopher, and Raina, Priyanka
Subjects: Computer Science - Hardware Architecture
Abstract: The architecture of a coarse-grained reconfigurable array (CGRA) interconnect has a significant effect on not only the flexibility of the resulting accelerator, but also its power, performance, and area. Design decisions that have complex trade-offs need to be explored to maintain efficiency and performance across a variety of evolving applications. This paper presents Canal, a Python-embedded domain-specific language (eDSL) and compiler for specifying and generating reconfigurable interconnects for CGRAs. Canal uses a graph-based intermediate representation (IR) that allows for easy hardware generation and tight integration with place and route tools. We evaluate Canal by constructing both a fully static interconnect and a hybrid interconnect with ready-valid signaling, and by conducting design space exploration of the interconnect architecture by modifying the switch box topology, the number of routing tracks, and the interconnect tile connections. Through the use of a graph-based IR for CGRA interconnects, the eDSL, and the interconnect generation system, Canal enables fast design space exploration and creation of CGRA interconnects.
Comment: Preprint version
Published: 2022

5. Cascade: An Application Pipelining Toolkit for Coarse-Grained Reconfigurable Arrays

Author: Melchert, Jackson, Mei, Yuchen, Koul, Kalhan, Liu, Qiaoyi, Horowitz, Mark, and Raina, Priyanka
Subjects: Computer Science - Hardware Architecture
Abstract: While coarse-grained reconfigurable arrays (CGRAs) have emerged as promising programmable accelerator architectures, pipelining applications running on CGRAs is required to ensure high maximum clock frequencies. Current CGRA compilers either lack pipelining techniques resulting in low performance or perform exhaustive pipelining resulting in high energy and resource consumption. We introduce Cascade, an application pipelining toolkit for CGRAs, including a CGRA application frequency model, automated pipelining techniques for CGRA application compilers that work with both dense and sparse applications, and hardware optimizations for improving application frequency. Cascade enables 7 - 34x lower critical path delays and 7 - 190x lower EDP across a variety of dense image processing and machine learning workloads, and 2 - 4.4x lower critical path delays and 1.5 - 4.2x lower EDP on sparse workloads, compared to a compiler without pipelining.
Comment: Preprint version
Published: 2022

6. The Sparse Abstract Machine

Author: Hsu, Olivia, Strange, Maxwell, Sharma, Ritvik, Won, Jaeyeon, Olukotun, Kunle, Emer, Joel, Horowitz, Mark, and Kjolstad, Fredrik
Subjects: Computer Science - Hardware Architecture, Computer Science - Programming Languages
Abstract: We propose the Sparse Abstract Machine (SAM), an abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse primitives that encompass a large space of scheduled tensor algebra expressions. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate arbitrary iteration orderings and many hardware-specific optimizations. We also present Custard, a compiler from a high-level language to SAM that demonstrates SAM's usefulness as an intermediate representation. We automatically bind from SAM to a streaming dataflow simulator. We evaluate the generality and extensibility of SAM, explore the performance space of sparse tensor algebra optimizations using SAM, and show SAM's ability to represent dataflow hardware.
Comment: 18 pages, 17 figures, 3 tables
Published: 2022
Full Text: View/download PDF

7. Bringing Source-Level Debugging Frameworks to Hardware Generators

Author: Zhang, Keyi, Asgar, Zain, and Horowitz, Mark
Subjects: Computer Science - Hardware Architecture
Abstract: High-level hardware generators have significantly increased the productivity of design engineers. They use software engineering constructs to reduce the repetition required to express complex designs and enable more composability. However, these benefits are undermined by a lack of debugging infrastructure, requiring hardware designers to debug generated, usually incomprehensible, RTL code. This paper describes a framework that connects modern software source-level debugging frameworks to RTL created from hardware generators. Our working prototype offers an Integrated Development Environment (IDE) experience for generators such as RocketChip (Chisel), allowing designers to set breakpoints in complex source code, relate RTL simulation state back to source-level variables, and do forward and backward debugging, with almost no simulation overhead (less than 5%).
Comment: Design Automation Conference (DAC) 2022
Published: 2022

8. Enabling Reusable Physical Design Flows with Modular Flow Generators

Author: Carsello, Alex, Thomas, James, Nayak, Ankita, Chen, Po-Han, Horowitz, Mark, Raina, Priyanka, and Torng, Christopher
Subjects: Computer Science - Hardware Architecture
Abstract: Achieving high code reuse in physical design flows is challenging but increasingly necessary to build complex systems. Unfortunately, existing approaches based on parameterized Tcl generators support very limited reuse and struggle to preserve reusable code as designers customize flows for specific designs and technologies. We present a vision and framework based on modular flow generators that encapsulates coarse-grain and fine-grain reusable code in modular nodes and assembles them into complete flows. The key feature is a flow consistency and instrumentation layer embedded in Python, which supports mechanisms for rapid and early feedback on inconsistent composition. The approach gradually types the Tcl language and allows both automatic and user-annotated static assertion checks. We evaluate the design flows of successive generations of silicon prototypes designed in TSMC16, TSMC28, TSMC40, SKY130, and IBM180 technologies, showing how our approach can enable significant code reuse in future flows.
Published: 2021

9. Automating System Configuration

Author: Tsiskaridze, Nestan, Strange, Maxwell, Mann, Makai, Sreedhar, Kavya, Liu, Qiaoyi, Horowitz, Mark, and Barrett, Clark
Subjects: Computer Science - Formal Languages and Automata Theory, Computer Science - Hardware Architecture
Abstract: The increasing complexity of modern configurable systems makes it critical to improve the level of automation in the process of system configuration. Such automation can also improve the agility of the development cycle, allowing for rapid and automated integration of decoupled workflows. In this paper, we present a new framework for automated configuration of systems representable as state machines. The framework leverages model checking and satisfiability modulo theories (SMT) and can be applied to any application domain representable using SMT formulas. Our approach can also be applied modularly, improving its scalability. Furthermore, we show how optimization can be used to produce configurations that are best according to some metric and also more likely to be understandable to humans. We showcase this framework and its flexibility by using it to configure a CGRA memory tile for various image processing applications.
Published: 2021

10. Compiling Halide Programs to Push-Memory Accelerators

Author: Liu, Qiaoyi, Huff, Dillon, Setter, Jeff, Strange, Maxwell, Feng, Kathleen, Sreedhar, Kavya, Wang, Ziheng, Zhang, Keyi, Horowitz, Mark, Raina, Priyanka, and Kjolstad, Fredrik
Subjects: Computer Science - Hardware Architecture
Abstract: Image processing and machine learning applications benefit tremendously from hardware acceleration, but existing compilers target either FPGAs, which sacrifice power and performance for flexible hardware, or ASICs, which rapidly become obsolete as applications change. Programmable domain-specific accelerators have emerged as a promising middle-ground between these two extremes, but such architectures have traditionally been difficult compiler targets. The main obstacle is that these accelerators often use a different memory abstraction than CPUs and GPUs: push memories that send a data stream from one computation kernel to other kernels, possibly reordered. To address the compilation challenges caused by push memories, we propose that the representation of memory in the middle and backend of the compiler be altered to combine storage with address generation and control logic in a single structure -- a unified buffer. We show that this compiler abstraction can be implemented efficiently on a programmable accelerator, and design a memory mapping algorithm that combines polyhedral analysis and software vectorization techniques to target our accelerator. Our evaluation shows that the compiler supports programmability while maintaining high performance. It can compile a wide range of image processing and machine learning applications to our accelerator with 4.7x better runtime and 4.3x better energy-efficiency as compared to an FPGA.
Published: 2021

11. Automated Design Space Exploration of CGRA Processing Element Architectures using Frequent Subgraph Analysis

Author: Melchert, Jackson, Feng, Kathleen, Donovick, Caleb, Daly, Ross, Barrett, Clark, Horowitz, Mark, Hanrahan, Pat, and Raina, Priyanka
Subjects: Computer Science - Hardware Architecture
Abstract: The architecture of a coarse-grained reconfigurable array (CGRA) processing element (PE) has a significant effect on the performance and energy efficiency of an application running on the CGRA. This paper presents an automated approach for generating specialized PE architectures for an application or an application domain. Frequent subgraphs mined from a set of applications are merged to form a PE architecture specialized to that application domain. For the image processing and machine learning domains, we generate specialized PEs that are up to 10.5x more energy efficient and consume 9.1x less area than a baseline PE.
Published: 2021

12. Open-Source Synthesizable Analog Blocks for High-Speed Link Designs: 20-GS/s 5b ENOB Analog-to-Digital Converter and 5-GHz Phase Interpolator

Author: Kim, Sung-Jin, Myers, Zachary, Herbst, Steven, Lim, ByongChan, and Horowitz, Mark
Subjects: Computer Science - Hardware Architecture
Abstract: Using digital standard cells and digital place-and-route (PnR) tools, we created a 20 GS/s, 8-bit analog-to-digital converter (ADC) for use in high-speed serial link applications with an ENOB of 5.6, a DNL of 0.96 LSB, and an INL of 2.39 LSB, which dissipated 175 mW in 0.102 mm2 in a 16nm technology. The design is entirely described by HDL so that it can be ported to other processes with minimal effort and shared as open source.
Comment: 2020 IEEE Symposium on VLSI Circuits
Published: 2020
Full Text: View/download PDF

13. fault: A Python Embedded Domain-Specific Language For Metaprogramming Portable Hardware Verification Components

Author: Truong, Lenny, Herbst, Steven, Setaluri, Rajsekhar, Mann, Makai, Daly, Ross, Zhang, Keyi, Donovick, Caleb, Stanley, Daniel, Horowitz, Mark, Barrett, Clark, and Hanrahan, Pat
Subjects: Computer Science - Software Engineering, Computer Science - Hardware Architecture
Abstract: While hardware generators have drastically improved design productivity, they have introduced new challenges for the task of verification. To effectively cover the functionality of a sophisticated generator, verification engineers require tools that provide the flexibility of metaprogramming. However, flexibility alone is not enough; components must also be portable in order to encourage the proliferation of verification libraries as well as enable new methodologies. This paper introduces fault, a Python embedded hardware verification language that aims to empower design teams to realize the full potential of generators.
Comment: CAV 2020: 32nd International Conference on Computer-Aided Verification
Published: 2020

14. Fast FPGA emulation of analog dynamics in digitally-driven systems

Author: Herbst, Steven, Lim, Byong Chan, and Horowitz, Mark
Subjects: Computer Science - Hardware Architecture, Electrical Engineering and Systems Science - Signal Processing
Abstract: In this paper, we propose an architecture for FPGA emulation of mixed-signal systems that achieves high accuracy at a high throughput. We represent the analog output of a block as a superposition of step responses to changes in its analog input, and the output is evaluated only when needed by the digital subsystem. Our architecture is therefore intended for digitally-driven systems; that is, those in which the inputs of analog dynamical blocks change only on digital clock edges. We implemented a high-speed link transceiver design using the proposed architecture on a Xilinx FPGA. This design demonstrates how our approach breaks the link between simulation rate and time resolution that is characteristic of prior approaches. The emulator is flexible, allowing for the real-time adjustment of analog dynamics, clock jitter, and various design parameters. We demonstrate that our architecture achieves 1% accuracy while running 3 orders of magnitude faster than a comparable high-performance CPU simulation.
Comment: ICCAD '18: Proceedings of the International Conference on Computer-Aided Design
Published: 2020
Full Text: View/download PDF
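
The superposition idea in this abstract can be illustrated in software. The sketch below is a minimal Python example, assuming a hypothetical first-order low-pass block with time constant `tau` that starts at rest with zero input. The names `step_response` and `output_at` are illustrative and do not come from the paper, whose emulator runs on an FPGA rather than in Python.

```python
import math
from typing import List, Tuple


def step_response(t: float, tau: float) -> float:
    """Unit step response of an assumed first-order low-pass block."""
    # Before the step arrives (t < 0) the block has not reacted yet.
    return 1.0 - math.exp(-t / tau) if t >= 0.0 else 0.0


def output_at(t: float, changes: List[Tuple[float, float]], tau: float) -> float:
    """Analog output at time t as a superposition of step responses.

    `changes` holds (time, delta) pairs: the instants at which the input
    changed (e.g. digital clock edges) and the size of each change.
    """
    return sum(delta * step_response(t - t_k, tau) for t_k, delta in changes)


# A unit pulse: the input rises by 1 at t = 0 and falls by 1 at t = 2.
pulse = [(0.0, 1.0), (2.0, -1.0)]
y_during = output_at(1.0, pulse, tau=1.0)  # only the rising edge contributes
y_after = output_at(3.0, pulse, tau=1.0)   # both edges contribute
```

Because the output is computed directly from the recorded input changes, it can be evaluated only at the instants the digital side asks for it, without stepping through a fixed-resolution time grid.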

15. FPMax: a 106GFLOPS/W at 217GFLOPS/mm2 Single-Precision FPU, and a 43.7GFLOPS/W at 74.6GFLOPS/mm2 Double-Precision FPU, in 28nm UTBB FDSOI

Author: Pu, Jing, Galal, Sameh, Yang, Xuan, Shacham, Ofer, and Horowitz, Mark
Subjects: Computer Science - Hardware Architecture
Abstract: FPMax implements four FPUs optimized for latency or throughput workloads in two precisions, fabricated in 28nm UTBB FDSOI. Each unit's parameters, e.g., pipeline stages, Booth encoding, etc., were optimized to yield 1.42ns latency at 110GFLOPS/W (SP) and 1.39ns latency at 36GFLOPS/W (DP). At 100% activity, body-bias control improves the energy efficiency by about 20%; at 10% activity this saving is almost 2x.
Keywords: FPU, energy efficiency, hardware generator, SOI
Published: 2016

16. Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era

Author: Pedram, Ardavan, Richardson, Stephen, Galal, Sameh, Kvatinsky, Shahar, and Horowitz, Mark A.
Subjects: Computer Science - Hardware Architecture, Computer Science - Performance
Abstract: The key challenge to improving performance in the age of Dark Silicon is how to leverage transistors when they cannot all be used at the same time. In modern SOCs, these transistors are often used to create specialized accelerators which improve energy efficiency for some applications by 10-1000X. While this might seem like the magic bullet we need, for most CPU applications more energy is dissipated in the memory system than in the processor: these large gains in efficiency are only possible if the DRAM and memory hierarchy are mostly idle. We refer to this desirable state as Dark Memory, and it only occurs for applications with an extreme form of locality. To show our findings, we introduce Pareto curves in the energy/op and mm²/(ops/s) metric space for compute units, accelerators, and on-chip memory/interconnect. These Pareto curves allow us to solve the power, performance, area constrained optimization problem to determine which accelerators should be used, and how to set their design parameters to optimize the system. This analysis shows that memory accesses create a floor to the achievable energy-per-op. Thus high performance requires Dark Memory, which in turn requires co-design of the algorithm for parallelism and locality, with the hardware.
Comment: 8 pages, To appear in IEEE Design and Test Journal
Published: 2016
Full Text: View/download PDF
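
The claim that memory accesses set a floor on energy-per-op can be made concrete with a small calculation. The Python sketch below uses a deliberately simple additive model, and every number in it is a hypothetical placeholder chosen for illustration, not a measurement from the paper. The function name `energy_per_op` is likewise an assumption.

```python
def energy_per_op(compute_pj: float, mem_access_pj: float,
                  accesses_per_op: float) -> float:
    """Assumed additive model: compute energy plus memory energy per op."""
    return compute_pj + mem_access_pj * accesses_per_op


# Hypothetical baseline: memory (60 pJ/op) already outweighs compute (40 pJ/op).
baseline = energy_per_op(compute_pj=40.0, mem_access_pj=120.0, accesses_per_op=0.5)

# A 100x more efficient compute unit with unchanged memory traffic.
accel_only = energy_per_op(compute_pj=0.4, mem_access_pj=120.0, accesses_per_op=0.5)

# The same accelerator when locality keeps the memory system almost idle.
accel_dark_memory = energy_per_op(compute_pj=0.4, mem_access_pj=120.0,
                                  accesses_per_op=0.001)

gain_without_locality = baseline / accel_only      # stays below 2x
gain_with_locality = baseline / accel_dark_memory  # exceeds 100x
```

Under these assumed numbers, the 100x compute improvement yields less than a 2x system-level gain until memory traffic per operation is also reduced, which is the situation the abstract calls Dark Memory.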

17. EIE: Efficient Inference Engine on Compressed Deep Neural Network

Author: Han, Song, Liu, Xingyu, Mao, Huizi, Pu, Jing, Pedram, Ardavan, Horowitz, Mark A., and Dally, William J.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Hardware Architecture
Abstract: State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving; Exploiting sparsity saves 10x; Weight sharing gives 8x; Skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102GOPS/s working directly on a compressed network, corresponding to 3TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.
Comment: External Links: TheNextPlatform: http://goo.gl/f7qX0L ; O'Reilly: https://goo.gl/Id1HNT ; Hacker News: https://goo.gl/KM72SV ; Embedded-vision: http://goo.gl/joQNg8 ; Talk at NVIDIA GTC'16: http://goo.gl/6wJYvn ; Talk at Embedded Vision Summit: https://goo.gl/7abFNe ; Talk at Stanford University: https://goo.gl/6lwuer. Published as a conference paper in ISCA 2016
Published: 2016
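
The computation this abstract describes, a sparse matrix-vector product with shared weights that skips zero activations, can be sketched in a few lines of Python. This is a simplified software illustration only: the column-wise (CSC-style) layout, the three-entry codebook, and the function name `sparse_matvec_shared` are assumptions made for the example, and the sketch does not model EIE's hardware organization.

```python
from typing import List, Sequence


def sparse_matvec_shared(
    col_ptr: Sequence[int],
    row_idx: Sequence[int],
    weight_code: Sequence[int],
    codebook: Sequence[float],
    activations: Sequence[float],
    num_rows: int,
) -> List[float]:
    """Compute y = W @ a for a pruned matrix W stored column by column.

    Each stored nonzero keeps only a small index into `codebook`
    (weight sharing) instead of a full-precision weight.
    """
    y = [0.0] * num_rows
    for j, a in enumerate(activations):
        if a == 0.0:
            continue  # skip zero activations, e.g. those produced by ReLU
        # Visit only the surviving (unpruned) weights of column j.
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += codebook[weight_code[k]] * a
    return y


# Hypothetical 3x4 pruned matrix:
#   [[0.5, 0.0,  0.0, 2.0],
#    [0.0, 0.0, -1.0, 0.0],
#    [2.0, 0.0,  0.0, 0.5]]
codebook = [0.5, -1.0, 2.0]    # shared weight values
col_ptr = [0, 2, 2, 3, 5]      # where each column starts in row_idx
row_idx = [0, 2, 1, 0, 2]      # row of each stored nonzero
weight_code = [0, 2, 1, 2, 0]  # codebook index of each stored nonzero
activations = [1.0, 3.0, 0.0, 2.0]

y = sparse_matvec_shared(col_ptr, row_idx, weight_code, codebook,
                         activations, num_rows=3)
```

In this example the second column holds no weights and the third activation is zero, so only the first and fourth columns do any work. With 4-bit weight sharing, the codebook would hold 16 entries.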

17 results on '"Horowitz, Mark"'
