Author: "Jose Maria Arnau" - Searchworks@Jio Institute Digital Library Search Results

26 results on '"Jose Maria Arnau"'

1. SHARP: An adaptable, energy-efficient accelerator for recurrent neural networks

Author: Reza Yazdani Aminabadi, Olatunji Ruwase, Minjia Zhang, Yuxiong He, Jose-Maria Arnau, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Neural networks (Computer science), Long-Short-Term Memory (LSTM), Hardware and Architecture, Scheduling, Low power, Recurrent Neural Network (RNN), Accelerator, Xarxes neuronals (Informàtica), Reconfigurability, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Software
Abstract: The effectiveness of Recurrent Neural Networks (RNNs) for tasks such as Automatic Speech Recognition has fostered interest in RNN inference acceleration. Due to the recurrent nature and data dependencies of RNN computations, prior work has designed customized architectures specifically tailored to the computation pattern of RNN, getting high computation efficiency for certain chosen model sizes. However, given that the dimensionality of RNNs varies a lot for different tasks, it is crucial to generalize this efficiency to diverse configurations. In this work, we identify adaptiveness as a key feature that is missing from today’s RNN accelerators. In particular, we first show the problem of low resource utilization and low adaptiveness for the state-of-the-art RNN implementations on GPU, FPGA, and ASIC architectures. To solve these issues, we propose an intelligent tiled-based dispatching mechanism for increasing the adaptiveness of RNN computation, in order to efficiently handle the data dependencies. To do so, we propose Sharp as a hardware accelerator, which pipelines RNN computation using an effective scheduling scheme to hide most of the dependent serialization. Furthermore, Sharp employs dynamic reconfigurable architecture to adapt to the model’s characteristics. Sharp achieves 2×, 2.8×, and 82× speedups on average, considering different RNN models and resource budgets, compared to the state-of-the-art ASIC, FPGA, and GPU implementations, respectively. Furthermore, we provide significant energy reduction with respect to the previous solutions, due to the low power dissipation of Sharp (321 GFLOPS/Watt). This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.
Published: 2023

2. E-BATCH: Energy-efficient and high-throughput RNN batching

Author: Franyell Silfa, Jose Maria Arnau, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: FOS: Computer and information sciences, Batching, Computer Science - Machine Learning, Energia -- Consum, Recurrent neural network, Hardware accelerators, Machine Learning (cs.LG), Long short term memory, Energy consumption, Neural networks (Computer science), Computer Science - Distributed, Parallel, and Cluster Computing, Hardware and Architecture, Hardware Architecture (cs.AR), Xarxes neuronals (Informàtica), Distributed, Parallel, and Cluster Computing (cs.DC), Computer Science - Hardware Architecture, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Software, Information Systems
Abstract: Recurrent Neural Network (RNN) inference exhibits low hardware utilization due to the strict data dependencies across time-steps. Batching multiple requests can increase throughput. However, RNN batching requires a large amount of padding since the batched input sequences may vastly differ in length. Schemes that dynamically update the batch every few time-steps avoid padding. However, they require executing different RNN layers in a short time span, decreasing energy efficiency. Hence, we propose E-BATCH, a low-latency and energy-efficient batching scheme tailored to RNN accelerators. It consists of a runtime system and effective hardware support. The runtime concatenates multiple sequences to create large batches, resulting in substantial energy savings. Furthermore, the accelerator notifies it when the evaluation of an input sequence is done. Hence, a new input sequence can be immediately added to a batch, thus largely reducing the amount of padding. E-BATCH dynamically controls the number of time-steps evaluated per batch to achieve the best trade-off between latency and energy efficiency for the given hardware platform. We evaluate E-BATCH on top of E-PUR and TPU. E-BATCH improves throughput by 1.8× and energy efficiency by 3.6× in E-PUR, whereas in TPU, it improves throughput by 2.1× and energy efficiency by 1.6×, over the state-of-the-art. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.
Published: 2022
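
Illustrative note: the abstract above contrasts padding every sequence to the longest one in a batch with concatenating sequences so that batch lanes stay full. The sketch below only quantifies that padding overhead in software. The function names and the greedy lane-filling policy are assumptions made for illustration, not the E-BATCH runtime or its hardware support.

```python
def padding_fraction(lengths):
    """Fraction of time-step slots wasted when every sequence in the
    batch is padded to the length of the longest one."""
    longest = max(lengths)
    total_slots = longest * len(lengths)
    return (total_slots - sum(lengths)) / total_slots


def concatenate_into_lanes(lengths, num_lanes):
    """Hypothetical greedy policy: append each incoming sequence to the
    lane that currently holds the fewest time-steps."""
    lanes = [0] * num_lanes
    for n in lengths:
        shortest = lanes.index(min(lanes))
        lanes[shortest] += n
    return lanes


def concatenated_padding_fraction(lengths, num_lanes):
    """Padding that remains after concatenation, measured the same way."""
    lanes = concatenate_into_lanes(lengths, num_lanes)
    total_slots = max(lanes) * num_lanes
    return (total_slots - sum(lanes)) / total_slots
```

For the sequence lengths [10, 2, 3, 5], padding to the longest sequence wastes half of the slots, whereas concatenating the same sequences into two lanes leaves none unused.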

3. Energy-efficient stream compaction through filtering and coalescing accesses in GPGPU memory partitions

Author: Albert Segura, Jose-Maria Arnau, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Stream Compaction Unit (SCU), Computer science, Compaction, GPGPU architectures, Graph processing, 02 engineering and technology, Parallel computing, Unitats de processament gràfic, 020202 computer hardware & architecture, Theoretical Computer Science, Stream compaction, Computational Theory and Mathematics, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, General-purpose computing on graphics processing units, Irregular accesses Reorder Unit (IRU), Graphics processing units, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Software, Efficient energy use, Memory divergence
Abstract: Graph-based applications are essential in emerging domains such as data analytics or machine learning. Data gathering in a knowledge-based society requires great data processing efficiency. High-throughput GPGPU architectures are key to enabling efficient graph processing. Nonetheless, irregular and sparse memory access patterns present in graph-based applications induce high memory divergence and contention, which result in poor GPGPU efficiency for graph processing. Recent work has pointed out the importance of stream compaction operations, and has proposed a Stream Compaction Unit (SCU) to offload them to specialized hardware. On the other hand, memory contention caused by high divergence has been tackled with the Irregular accesses Reorder Unit (IRU), delivering improved memory coalescing. In this paper, we propose a new unit, the IRU-enhanced SCU (ISCU), that leverages the strengths of both approaches. The ISCU employs the efficient mechanisms of the IRU to improve the SCU's stream compaction efficiency and overcome its throughput limitations, achieving a synergistic effect for graph processing. We evaluate the ISCU for a wide variety of state-of-the-art graph-based algorithms and applications. Results show that the ISCU achieves a performance speedup of 2.2x and 90% energy savings derived from a 78% reduction in memory accesses, while incurring an 8.5% area overhead.
Published: 2021

4. Boosting LSTM Performance Through Dynamic Precision Selection

Author: Jose-Maria Arnau, Franyell Silfa, Antonio González, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Signal Processing (eess.SP), FOS: Computer and information sciences, Computer Science - Machine Learning, Boosting (machine learning), Speedup, Computer science, 010501 environmental sciences, 01 natural sciences, Machine Learning (cs.LG), Neural networks (Computer science), Set (abstract data type), RNNs, Quantization, 0103 physical sciences, FOS: Electrical engineering, electronic engineering, information engineering, Xarxes neuronals (Informàtica), Overhead (computing), Electrical Engineering and Systems Science - Signal Processing, Quantization (image processing), Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], 0105 earth and related environmental sciences, 010302 applied physics, Profiling (computer programming), Artificial neural network, Supercomputer, Long short term memory, High performance computing, Càlcul intensiu (Informàtica), Algorithm, Accelerators
Abstract: The use of low numerical precision is a fundamental optimization included in modern accelerators for Deep Neural Networks (DNNs). The number of bits of the numerical representation is set to the minimum precision that is able to retain accuracy based on an offline profiling, and it is kept constant for DNN inference. In this work, we explore the use of dynamic precision selection during DNN inference. We focus on Long Short Term Memory (LSTM) networks, which represent the state-of-the-art networks for applications such as machine translation and speech recognition. Unlike conventional DNNs, LSTM networks remember information from previous evaluations by storing data in the LSTM cell state. Our key observation is that the cell state determines the amount of precision required: time-steps where the cell state changes significantly require higher precision, whereas time-steps where the cell state is stable can be computed with lower precision without any loss in accuracy. We propose a novel hardware scheme that tracks the evolution of the elements in the LSTM cell state and dynamically selects the appropriate precision on each time-step. For a set of popular LSTM networks, it chooses the lowest precision for 57% of the time, outperforming systems that fix the precision statically. We evaluate our proposal on top of a modern highly-optimized LSTM accelerator, and show that it provides 1.46x speedup and 19.2% energy savings on average without degrading the model accuracy. Our scheme has an overhead of less than 8%. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU's Horizon 2020 program (grant No 833057), the Spanish State Research Agency under grant TIN2016-75344-R (AEI/FEDER, EU), the ICREA Academia program, and a scholarship from the Fundación Carolina and PUCMM.
Published: 2020
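
Illustrative note: the key observation in the abstract above is that time-steps with a rapidly changing cell state need higher precision than time-steps with a stable one. A minimal software sketch of such a selection rule follows. The threshold, the two bit widths, and the use of the largest per-element change as the metric are assumptions chosen for illustration, not the hardware scheme evaluated in the paper.

```python
def select_precision(prev_cell, cell, threshold=0.1, low_bits=4, high_bits=8):
    """Return the bit width for a time-step, given the cell state before
    and after the previous update (all parameters are hypothetical)."""
    # Largest absolute change of any cell-state element.
    change = max(abs(c - p) for c, p in zip(cell, prev_cell))
    return high_bits if change > threshold else low_bits


def precision_trace(cell_states, **kwargs):
    """Apply the rule to consecutive pairs of recorded cell states."""
    return [
        select_precision(prev, cur, **kwargs)
        for prev, cur in zip(cell_states, cell_states[1:])
    ]
```

With the recorded states [[0.0, 0.0], [0.5, 0.0], [0.52, 0.01]], the first transition changes an element by 0.5 and selects the high precision, while the second changes elements by about 0.02 at most and selects the low precision.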

5. Irregular Accesses Reorder Unit: Improving GPGPU Memory Coalescing for Graph-Based Workloads

Author: Albert Segura, Jose Maria Arnau, Antonio Gonzalez, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: FOS: Computer and information sciences, Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Energia -- Consum, GPGPU, Graph processing, Gestió de memòria (Informàtica), Theoretical Computer Science, Energy consumption, Parallel architectures, Memory management (Computer science), Hardware and Architecture, Hardware Architecture (cs.AR), Computer architecture, Computer Science - Hardware Architecture, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Software, Information Systems
Abstract: GPGPU architectures have become the dominant platform for massively parallel workloads, delivering high performance and energy efficiency for popular applications such as machine learning, computer vision or self-driving cars. However, irregular applications, such as graph processing, fail to fully exploit GPGPU resources due to their divergent memory accesses that saturate the memory hierarchy. To reduce the pressure on the memory subsystem for divergent memory-intensive applications, programmers must take into account the SIMT execution model and memory coalescing in GPGPUs, devoting significant effort to complex optimization techniques. Despite these efforts, we show that irregular graph processing still suffers from low GPGPU performance. We observe that in many irregular applications the mapping of data to threads can be safely changed. In other words, it is possible to relax the strict relationship between thread and data processed to reduce memory divergence. Based on this observation, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension tightly integrated in the GPGPU pipeline. The IRU reorders data processed by the threads on irregular accesses to improve memory coalescing, i.e., it tries to assign data elements to threads so as to produce coalesced accesses in SIMT groups. Furthermore, the IRU is capable of filtering and merging duplicated accesses, significantly reducing the workload. Programmers can easily utilize the IRU with a simple API, or let the compiler issue instructions from our extended ISA. We evaluate our proposal for state-of-the-art graph-based algorithms and a wide selection of applications. Results show that the IRU achieves a memory coalescing improvement of 1.32x and a 46% reduction in the overall traffic in the memory hierarchy, which results in 1.33x speedup and 13% energy savings on average, while incurring a small 5.6% area overhead. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00 and the ICREA Academia program.
Published: 2020
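
Illustrative note: the effect of reassigning data elements to threads can be modeled by counting how many distinct memory lines each SIMT group touches. The sketch below is a software model only. The warp size, the line size, and the use of a full sort with duplicate removal are simplifying assumptions, and the IRU itself is a hardware unit whose reordering is not claimed to be a global sort.

```python
def memory_transactions(indices, warp_size=4, line_elems=4):
    """Count memory-line requests when consecutive groups of `warp_size`
    threads each access one element index (toy sizes, assumed)."""
    total = 0
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        # One request per distinct memory line touched by the group.
        total += len({i // line_elems for i in warp})
    return total


def reorder_and_filter(indices):
    """Idealized stand-in for reordering: merge duplicates, then place
    neighbouring elements in neighbouring threads."""
    return sorted(set(indices))
```

For the irregular index stream [0, 9, 1, 12, 8, 2, 13, 3, 9, 0], this model issues 8 line requests in the original order and 3 after duplicates are merged and elements are regrouped.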

6. Demystifying Power and Performance Bottlenecks in Autonomous Driving Systems

Author: Pedro Henrique Exenberger Becker, Antonio González, Jose-Maria Arnau, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: 0209 industrial biotechnology, Safe driving, business.industry, Computer science, Vehicles autònoms, Distributed computing, Performance analysis, Automotive industry, Autonomous vehicles, 02 engineering and technology, Energy consumption, Workload characterization, Computing systems, 020901 industrial engineering & automation, Software, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), 020201 artificial intelligence & image processing, Predictability, Latency (engineering), business, Informàtica::Robòtica [Àrees temàtiques de la UPC], Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC]
Abstract: ©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Autonomous Vehicles (AVs) have the potential to radically change the automotive industry. However, computing solutions for AVs have to meet severe performance and power constraints to guarantee a safe driving experience. Current solutions either exhibit high cost and power dissipation or fail to meet the stringent latency constraints. Therefore, the popularization of AVs requires a low-cost yet effective computing system. Understanding the sources of latency and energy consumption is key in order to improve autonomous driving systems. In this paper, we present a detailed characterization of Autoware, a modern self-driving car system. We analyze the performance and power of the different components and leverage hardware counters to identify the main bottlenecks. Our approach to AV characterization avoids pitfalls of previous works: profiling individual components in isolation and neglecting LiDAR-related components. We base our characterization on a rigorous methodology that considers the entire software stack. Profiling the end-to-end system accounts for interference and contention among different components that run in parallel, also including memory transfers to communicate data. We show that all these factors have a high impact on latency and cannot be measured by profiling isolated modules. Our characterization provides novel insights; some of the interesting findings are the following. First, contention among different modules drastically impacts latency and performance predictability. Second, LiDAR-related components are important contributors to the latency of the system. Finally, a modern platform with a high-end CPU and GPU cannot achieve real-time performance when considering the entire end-to-end system. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057) and the Spanish State Research Agency under grant TIN2016-75344-R (AEI/FEDER, EU).
Published: 2020

7. LAWS: Locality-AWare Scheme for automatic speech recognition

Author: Antonio Gonzalez, Reza Yazdani, Jose-Maria Arnau, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: WFST, Low-power architecture, Computer science, Speech recognition, Reconeixement automàtic de la parla, 02 engineering and technology, Viterbi algorithm, Theoretical Computer Science, symbols.namesake, 0202 electrical engineering, electronic engineering, information engineering, Memory-efficient, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Mobile computing, Locality, Process (computing), Automatic speech recognition, Energy consumption, 020202 computer hardware & architecture, Informàtica mòbil, Memory management, Computational Theory and Mathematics, Hardware and Architecture, Law, symbols, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Viterbi beam search, Hardware accelerator, Software
Abstract: Automatic Speech Recognition (ASR) systems are changing the way people interact with different applications on mobile devices. Fulfilling such user-interactivity requires not only a highly accurate, large-vocabulary recognition system, but also a real-time, energy-efficient solution. However, these ASR systems need high memory bandwidth and power budget, which may be impractical for most small form-factor battery-operated devices. In this article, we propose two combined techniques implemented on top of a state-of-the-art ASR accelerator in order to significantly reduce its energy consumption and memory requirements. First, by leveraging the locality among consecutive segments of the speech signal, we develop a Locality-AWare-Scheme (LAWS) which exploits the on-chip recently-explored data while removing most of the off-chip accesses during the ASR's decoding process. As a result, we remove up to 60 percent of ASR's workload. As the second step, we introduce an approach to improve LAWS's effectiveness by selectively adapting the amount of ASR's workload, based on run-time feedback. In particular, we exploit the fact that the confidence of the ASR system varies along the recognition process. When confidence is high, the ASR system can be more restrictive and reduce the amount of work. The end design including both techniques provides a saving of more than 87 percent in the memory requests and 2.3x reduction in energy consumption, and a speedup of 2.1x with respect to a state-of-the-art baseline design. This work has been supported in part by the CoCoUnit ERC Advanced Grant of the EU's Horizon 2020 program under Grant 833057, in part by the Spanish State Research Agency under Grant TIN2016-75344-R (AEI/FEDER, EU), and in part by the ICREA Academia program.
Published: 2020

8. POSTER: Leveraging Run-Time Feedback for Efficient ASR Acceleration

Author: Antonio González, Jose-Maria Arnau, and Reza Yazdani
Subjects: symbols.namesake, Memory management, Speedup, Computer science, Speech recognition, Process (computing), symbols, Hardware acceleration, Workload, Energy consumption, Viterbi algorithm, ComputingMethodologies_ARTIFICIALINTELLIGENCE, Decoding methods
Abstract: In this work, we propose Locality-AWare-Scheme (LAWS) for an Automatic Speech Recognition (ASR) accelerator in order to significantly reduce its energy consumption and memory requirements, by leveraging the locality among consecutive segments of the speech signal. LAWS diminishes ASR's workload by up to 60% by removing most of the off-chip accesses during the ASR's decoding process. We furthermore improve LAWS's effectiveness by selectively adapting the amount of ASR's workload, based on run-time feedback. In particular, we exploit the fact that the confidence of the ASR system varies along the recognition process. When confidence is high, the ASR system can be more restrictive and reduce the amount of work. The end design provides a saving of 87% in memory requests, 2.3x reduction in energy consumption, and a speedup of 2.1x with respect to the state-of-the-art ASR accelerator.
Published: 2019

9. CGPA: Coarse-Grained Pruning of Activations for Energy-Efficient RNN Inference

Author: Jose-Maria Arnau, Marc Riera, Antonio González, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Speedup, Informàtica::Intel·ligència artificial::Aprenentatge automàtic [Àrees temàtiques de la UPC], Low energy, Computer science, 02 engineering and technology, 01 natural sciences, RNN, Set (abstract data type), Histogram, 0103 physical sciences, Machine learning, Aprenentatge automàtic, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Pruning (decision trees), Electrical and Electronic Engineering, 010302 applied physics, Recurrent neural nets, 020202 computer hardware & architecture, Recurrent neural network, Hardware and Architecture, Algorithm, Software, Energy (signal processing), Accelerators, Efficient energy use
Abstract: © 2019 IEEE. Recurrent neural networks (RNNs) perform element-wise multiplications across the activations of gates. We show that a significant percentage of activations are saturated and propose coarse-grained pruning of activations (CGPA) to avoid the computation of entire neurons, based on the activation values of the gates. We show that CGPA can be easily implemented on top of a TPU-like architecture with negligible area overhead, resulting in 12% speedup and 12% energy savings on average for a set of widely used RNNs.
Published: 2019
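
Illustrative note: in an LSTM cell update of the form c_t = f * c_(t-1) + i * g, a neuron whose input gate i is saturated near zero contributes almost nothing through i * g, so the dot product that produces g can be skipped for that neuron. The sketch below shows this idea under stated assumptions: the saturation threshold `eps`, the weight layout, and the choice of the input gate as the pruning criterion are hypothetical, and the code is not the CGPA hardware mechanism.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(weights, inputs):
    return sum(w * v for w, v in zip(weights, inputs))


def cell_update_with_pruning(x, c_prev, w_i, w_f, w_g, eps=0.01):
    """Update the cell state, skipping the candidate computation for
    neurons whose input gate is saturated near zero.
    `x` stands for the concatenated input and previous hidden state."""
    new_c, skipped = [], 0
    for n in range(len(c_prev)):
        i = sigmoid(dot(w_i[n], x))
        f = sigmoid(dot(w_f[n], x))
        if i < eps:
            # Gate is effectively closed: i * g is negligible, so the
            # whole candidate neuron (one dot product) is not evaluated.
            skipped += 1
            new_c.append(f * c_prev[n])
        else:
            g = math.tanh(dot(w_g[n], x))
            new_c.append(f * c_prev[n] + i * g)
    return new_c, skipped
```

The second return value counts how many candidate neurons were skipped, which is the quantity a speedup or energy estimate would be based on in this toy model.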

10. A Low-Power, High-Performance Speech Recognition Accelerator

Author: Jose-Maria Arnau, Antonio González, Reza Yazdani, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: WFST, Computer science, Speech recognition, Decoding, Reconeixement automàtic de la parla, 02 engineering and technology, Viterbi algorithm, CAS latency, Bottleneck, Theoretical Computer Science, Reduction (complexity), CUDA, symbols.namesake, Hardware, 0202 electrical engineering, electronic engineering, information engineering, hardware accelerator, Automatic speech recognition, Acoustics, Automatic Speech Recognition (ASR), 020202 computer hardware & architecture, Computational Theory and Mathematics, Viterbi search, Hardware and Architecture, symbols, Hardware acceleration, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Central processing unit, low-power architecture, Graphics processing units, Software, Central Processing Unit
Abstract: © 2019 IEEE. Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost that is not affordable for mobile devices with tiny power budgets. Hardware acceleration reduces the energy consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in the design of these accelerators. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, with negligible impact on area. Additionally, we introduce a novel bandwidth-saving technique that reduces off-chip memory accesses by 20 percent. Finally, we present a power saving technique that significantly reduces the leakage power of the accelerator's scratchpad memories, providing between 8.5 and 29.2 percent reduction in total power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on a GeForce GTX 980 GPU, while reducing the energy by 123-454x.
Published: 2019

11. SCU: a GPU stream compaction unit for graph processing

Author: Antonio González, Albert Segura, Jose-Maria Arnau, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: 010302 applied physics, Computer science, Computers, GPGPU, Degree of parallelism, Compaction, Graph processing, 02 engineering and technology, Parallel computing, Enginyeria de la telecomunicació::Processament del senyal::Processament de la imatge i del senyal vídeo [Àrees temàtiques de la UPC], Small unit, 01 natural sciences, Ordinadors, Graph, 020202 computer hardware & architecture, Memory access pattern, Stream compaction, Imatges -- Processament -- Tècniques digitals, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Data analysis, General-purpose computing on graphics processing units, Image processing -- Digital techniques, Informàtica::Hardware [Àrees temàtiques de la UPC]
Abstract: Graph processing algorithms are key in many emerging applications in areas such as machine learning and data analytics. Although the processing of large scale graphs exhibits a high degree of parallelism, the memory access pattern tends to be highly irregular, leading to poor GPGPU efficiency due to memory divergence. To ameliorate this issue, GPGPU applications perform a stream compaction operation in each iteration of the algorithm to extract the subset of active nodes/edges, so subsequent steps work on a compacted dataset. We show that GPGPU architectures are inefficient for stream compaction, and propose to offload this task to a programmable Stream Compaction Unit (SCU) tailored to the requirements of this kernel. The SCU is a small unit tightly integrated in the GPU that efficiently gathers the active nodes/edges into a compacted array in memory. Applications can make use of it through a simple API. The remaining steps of the graph-based algorithm are executed on the GPU cores, benefiting from the large amount of parallelism in the GPU, but they operate on the SCU-prepared data and achieve larger memory coalescing and, hence, much higher efficiency. Besides, the SCU performs filtering of repeated and already visited nodes during the compaction process, significantly reducing GPGPU workload, and writes the compacted nodes/edges in an order that improves memory coalescing by reducing memory divergence. We evaluate the performance of a state-of-the-art GPGPU architecture extended with our SCU for a wide variety of applications. Results show that for high-performance and for low-power GPU systems the SCU achieves speedups of 1.37x and 2.32x, 84.7% and 69% energy savings, and an area increase of 3.3% and 4.1% respectively.
Published: 2019
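
Illustrative note: stream compaction gathers the active elements of a sparse array into a dense one, and the abstract above adds filtering of repeated and already visited nodes during that step. A sequential sketch of the operation follows. The function name and argument layout are assumptions, and the code does not model the SCU hardware or its output ordering for coalescing.

```python
def stream_compact(nodes, active, visited):
    """Gather active nodes into a dense list, dropping duplicates and
    nodes that were already visited in earlier iterations."""
    seen = set()
    compacted = []
    for node, is_active in zip(nodes, active):
        if is_active and node not in visited and node not in seen:
            seen.add(node)
            compacted.append(node)
    return compacted
```

For nodes [5, 3, 5, 7, 2] with activity flags [True, False, True, True, True] and node 7 already visited, the compacted frontier is [5, 2], so later steps process two elements instead of five.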

12. E-PUR

Author: Gem Dot, Franyell Silfa, Jose-Maria Arnau, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: FOS: Computer and information sciences, Informàtica::Intel·ligència artificial::Aprenentatge automàtic [Àrees temàtiques de la UPC], Computer science, Parallel programming (Computer science), 02 engineering and technology, Programació en paral·lel (Informàtica), 01 natural sciences, Machine learning, Aprenentatge automàtic, Hardware Architecture (cs.AR), 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, Neural and Evolutionary Computing (cs.NE), Computer Science - Hardware Architecture, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], 010302 applied physics, Multi-core processor, Locality, Computer Science - Neural and Evolutionary Computing, Energy consumption, Long short term memory, 020202 computer hardware & architecture, Term (time), Recurrent neural network, Recurrent neural networks, Computer engineering, Mobile device, Accelerators, Efficient energy use
Abstract: Recurrent Neural Networks (RNNs) are a key technology for emerging applications such as automatic speech recognition, machine translation or image description. Long Short Term Memory (LSTM) networks are the most successful RNN implementation, as they can learn long term dependencies to achieve high accuracy. Unfortunately, the recurrent nature of LSTM networks significantly constrains the amount of parallelism and, hence, multicore CPUs and many-core GPUs exhibit poor efficiency for RNN inference. In this paper, we present E-PUR, an energy-efficient processing unit tailored to the requirements of LSTM computation. The main goal of E-PUR is to support large recurrent neural networks for low-power mobile devices. E-PUR provides an efficient hardware implementation of LSTM networks that is flexible to support diverse applications. One of its main novelties is a technique that we call Maximizing Weight Locality (MWL), which improves the temporal locality of the memory accesses for fetching the synaptic weights, reducing the memory requirements by a large extent. Our experimental results show that E-PUR achieves real-time performance for different LSTM networks, while reducing energy consumption by orders of magnitude with respect to general-purpose processors and GPUs, and it requires a very small chip area. Compared to a modern mobile SoC, an NVIDIA Tegra X1, E-PUR provides an average energy reduction of 88x.
Published: 2018

13. Performance analysis and optimization of automatic speech recognition

Author: Jordi Tubella, Hamid Tabani, Antonio González, Jose-Maria Arnau, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Mobile processor, Speedup, Computer science, Speech recognition, Automatic speech recognition, 020206 networking & telecommunications, Memory bandwidth, Reconeixement automàtic de la parla, 02 engineering and technology, Bottleneck, Microarchitecture, 030507 speech-language pathology & audiology, 03 medical and health sciences, Hardware and Architecture, Control and Systems Engineering, Vectorization (mathematics), Vectorization, 0202 electrical engineering, electronic engineering, information engineering, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Central processing unit, Gaussian mixture models, 0305 other medical science, Hidden Markov model, Information Systems
Abstract: © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Fast and accurate Automatic Speech Recognition (ASR) is emerging as a key application for mobile devices. Delivering ASR on such devices is challenging due to the compute-intensive nature of the problem and the power constraints of embedded systems. In this paper, we provide a performance and energy characterization of Pocketsphinx, a popular toolset for ASR that targets mobile devices. We identify the computation of the Gaussian Mixture Model (GMM) as the main bottleneck, consuming more than 80 percent of the execution time. The CPI stack analysis shows that branches and main memory accesses are the main performance limiting factors for GMM computation. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. First, we use a refactored implementation of the innermost loop of the GMM evaluation code to ameliorate the impact of branches. Second, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. Third, we compute the Gaussians for multiple frames in parallel, so means and variances can be fetched once in the on-chip caches and reused across multiple frames, significantly reducing memory bandwidth usage.
We evaluate our optimizations using both hardware counters on real CPUs and simulations. Our experimental results show that the proposed optimizations provide 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61 percent energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59 percent energy savings without any loss in the accuracy of the ASR system.
Published: 2018

14. Computation Reuse in DNNs by Exploiting Input Similarity

Author: Jose-Maria Arnau, Antonio González, Marc Riera, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Degree of similarity, Computer science, Video classification, Computation reuse, Inference, Back-to-back execution, Input similarity, Reconeixement automàtic de la parla, 02 engineering and technology, Speech recognition, Reuse, Energy conservation, 01 natural sciences, Different layers, Deep neural networks, 0103 physical sciences, Memory architecture, 0202 electrical engineering, electronic engineering, information engineering, 010302 applied physics, Network architecture, Artificial neural network, Automatic speech recognition, Process (computing), Input similarity Decision making, Network layers, Hardware accelerators, Program processors, 020202 computer hardware & architecture, Energy efficiency, Computer engineering, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Hardware acceleration, Hardware accelerator, Energy efficient, DNN, Efficient energy use
Abstract: In recent years, Deep Neural Networks (DNNs) have achieved tremendous success for diverse problems such as classification and decision making. Efficient support for DNNs on CPUs, GPUs and accelerators has become a prolific area of research, resulting in a plethora of techniques for energy-efficient DNN inference. However, previous proposals focus on a single execution of a DNN. Popular applications, such as speech recognition or video classification, require multiple back-to-back executions of a DNN to process a sequence of inputs (e.g., audio frames, images). In this paper, we show that consecutive inputs exhibit a high degree of similarity, causing the inputs/outputs of the different layers to be extremely similar for successive frames of speech or images of a video. Based on this observation, we propose a technique to reuse some results of the previous execution, instead of computing the entire DNN. Computations related to inputs with negligible changes can be avoided with minor impact on accuracy, saving a large percentage of computations and memory accesses. We propose an implementation of our reuse-based inference scheme on top of a state-of-the-art DNN accelerator. Results show that, on average, more than 60% of the inputs of any neural network layer tested exhibit negligible changes with respect to the previous execution. Avoiding the memory accesses and computations for these inputs results in 63% energy savings on average.
Published: 2018

15. A novel register renaming technique for out-of-order processors

Author: Hamid Tabani, Jordi Tubella, Antonio González, Jose-Maria Arnau, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors, and Universitat Politècnica de Catalunya. CERCLE - Cercle d'Arquitectura
Subjects: Scheme (programming language), In-flight instructions, Speedup, Computer science, Register renaming supercomputers, Register file, 02 engineering and technology, Physical registers, 01 natural sciences, Instruction set, Precise exceptions, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Register renaming, Computer architecture, Arithmetic, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION, computer.programming_language, 010302 applied physics, Out-of-order execution, Superscalar processor, 020202 computer hardware & architecture, Producer consumers, Register (music), Out-of-order processors, High performance computing, Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING, computer, Càlcul intensiu (Informàtica), SPECint, Register files
Abstract: Modern superscalar processors support a large number of in-flight instructions, which requires sizeable register files. Conventional register renaming techniques allocate a new storage location, i.e. physical register, for every instruction whose destination is a logical register in order to remove false dependences. Physical registers are released in a conservative manner when the same logical register is redefined. For this reason, many cycles may elapse between the last read and the release of a physical register, leading to suboptimal utilization of the register file. We have observed that for more than 50% of the instructions in SPECfp and more than 30% of the instructions in SPECint that have a destination register, the produced value has only a single consumer. In this case, the RAW dependence guarantees that the producer-consumer instruction pair will be executed in program order and, hence, the same physical register can be used to store the value produced by both instructions. In this paper, we propose a renaming technique that exploits this property to reduce the pressure on the register file. Our technique leverages physical register sharing by introducing minor changes in the register map table and the issue queue. We also describe how our renaming scheme supports precise exceptions. We evaluated our renaming technique on top of a modern out-of-order processor. Our experimental results show that it provides 6% speedup on average for the SPEC2006 benchmarks. Alternatively, our renaming scheme achieves the same performance while reducing the number of physical registers by 10.5%.
Published: 2018

16. The dark side of DNN pruning

Author: Marc Riera, Jose-Maria Arnau, Reza Yazdani, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Minera, Industrial i TIC, Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors, and Universitat Politècnica de Catalunya. CERCLE - Cercle d'Arquitectura
Subjects: Speedup, Computer science, Speech recognition, Word error rate, 02 engineering and technology, Viterbi algorithm, Energy conservation, 01 natural sciences, State of the art, symbols.namesake, Hardware, 0103 physical sciences, Deep neural networks, N-best hypothesis, 0202 electrical engineering, electronic engineering, information engineering, Pruning (decision trees), Computer architecture, Automatic speech recognition (ASR), Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], DNN pruning, 010302 applied physics, business.industry, Deep learning, Redundant connections, Automatic speech recognition, Computer hardware, Hardware accelerators, Hash table, Arquitectura d'ordinadors, 020202 computer hardware & architecture, Solution approach, Energy efficiency, Viterbi decoder, Viterbi search, symbols, Hardware acceleration, Artificial intelligence, Hardware accelerator, business
Abstract: DNN pruning has been recently proposed as an effective technique to improve the energy-efficiency of DNN-based solutions. It is claimed that by removing unimportant or redundant connections, the pruned DNN delivers higher performance and energy-efficiency with negligible impact on accuracy. However, DNN pruning has an important side effect: it may reduce the confidence of DNN predictions. We show that, although top-1 accuracy may be maintained with DNN pruning, the likelihood of the class in the top-1 is significantly reduced when using the pruned models. For applications such as Automatic Speech Recognition (ASR), where the DNN scores are consumed by a successive stage, the workload of this stage can be dramatically increased due to the loss of confidence in the DNN. An ASR system consists of a DNN for computing acoustic scores, followed by a Viterbi beam search to find the most likely sequence of words. We show that, when pruning the DNN model used for acoustic scoring, the Word Error Rate (WER) is maintained but the execution time of the ASR system is increased by 33%. Although pruning improves the efficiency of the DNN, it results in a huge increase of activity in the Viterbi search since the output scores of the pruned model are less reliable. Based on this observation, we propose a novel hardware-based ASR system that effectively integrates a DNN accelerator for pruned models with a Viterbi accelerator. In order to avoid the aforementioned increase in Viterbi search workload, our system loosely selects the N-best hypotheses at every time step, exploring only the N most likely paths. To avoid an expensive sort of the hypotheses based on their likelihoods, our accelerator employs a set-associative hash table to keep track of the best paths mapped to each set. In practice, this solution approaches the selection of N-best, but it requires much simpler hardware.
Our approach manages to efficiently combine both DNN pruning and Viterbi search, and achieves 9x energy savings and 4.2x speedup with respect to the state-of-the-art ASR solutions.
Published: 2018

17. Low-power automatic speech recognition through a mobile GPU and a Viterbi accelerator

Author: Antonio González, Albert Segura, Jose-Maria Arnau, Reza Yazdani, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: 010302 applied physics, Audio signal, business.industry, Computer science, Speech recognition, Pipeline (computing), Automatic speech recognition, Accelerator, Reconeixement automàtic de la parla, 02 engineering and technology, Viterbi algorithm, 01 natural sciences, 020202 computer hardware & architecture, symbols.namesake, Viterbi search, Hardware and Architecture, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, symbols, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Mobile telephony, Electrical and Electronic Engineering, business, Software
Abstract: © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Automatic speech recognition (ASR) has become a core technology for mobile devices. Delivering real-time and accurate ASR has a huge computational cost, which is challenging to achieve in tightly energy-constrained platforms such as mobile devices. A state-of-the-art ASR pipeline consists of a deep neural network (DNN) that converts the audio signal into phonemes' probabilities, followed by a Viterbi search that uses these probabilities to generate a sequence of words. In this article, the authors propose an ASR system for low-power devices that combines a mobile GPU for the DNN with a dedicated hardware accelerator for the Viterbi search. DNN evaluation is easy to parallelize and, hence, it achieves high energy efficiency on a mobile GPU. On the other hand, the Viterbi search is difficult to parallelize, and it represents the main bottleneck for ASR, so the authors propose a hardware accelerator to dramatically reduce its energy requirements while increasing performance. Their proposal outperforms traditional solutions running on the CPU by orders of magnitude. Compared to a GPU-only system, their hybrid scheme combining the GPU and the accelerator improves performance by 5.25 times, while reducing energy by 2.05 times.
Published: 2017

18. UNFOLD: a memory-efficient speech recognizer using on-the-fly WFST composition

Author: Antonio González, Reza Yazdani, Jose-Maria Arnau, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: 010302 applied physics, Computer science, Speech recognition, Automatic speech recognition, Acoustic model, Memory bandwidth, Reconeixement automàtic de la parla, 02 engineering and technology, 01 natural sciences, Bottleneck, 020202 computer hardware & architecture, Special purpose systems, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Memory footprint, Hardware acceleration, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Language model, Decoding methods
Abstract: Accurate, real-time Automatic Speech Recognition (ASR) requires huge memory storage and computational power. The main bottleneck in state-of-the-art ASR systems is the Viterbi search on a Weighted Finite State Transducer (WFST). The WFST is a graph-based model created by composing an Acoustic Model (AM) and a Language Model (LM) offline. Offline composition simplifies the implementation of a speech recognizer as only one WFST has to be searched. However, the size of the composed WFST is huge, typically larger than a Gigabyte, resulting in a large memory footprint and memory bandwidth requirements. In this paper, we take a completely different approach and propose a hardware accelerator for speech recognition that composes the AM and LM graphs on-the-fly. In our ASR system, the fully-composed WFST is never generated in main memory. On the contrary, only the subset required for decoding each input speech fragment is dynamically generated from the AM and LM models. In addition to the direct benefits of this on-the-fly composition, the resulting approach is more amenable to further reduction in storage requirements through compression techniques. The resulting accelerator, called UNFOLD, performs the decoding in real-time using the compressed AM and LM models, and reduces the size of the datasets from more than one Gigabyte to less than 40 Megabytes, which can be very important in small form factor mobile and wearable devices. Besides, UNFOLD improves energy-efficiency by orders of magnitude with respect to CPUs and GPUs. Compared to state-of-the-art Viterbi search accelerators, the proposed ASR system provides a 31x reduction in memory footprint and 28% energy savings on average. CCS Concepts: Computer systems organization → Special purpose systems; Computing methodologies → Speech recognition
Published: 2017

19. Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization

Author: Polychronis Xekalakis, Jose-Maria Arnau, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Informàtica::Infografia [Àrees temàtiques de la UPC], Speedup, Computer science, Memoization, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Software rendering, General Medicine, Parallel computing, Animation, Rendering (computer graphics), Redundancy, Redundancy (engineering), Rendering (Computer graphics), Locality of reference, Animació per ordinador, Graphics processing units, Shader
Abstract: Redundancy is at the heart of graphical applications. In fact, generating an animation typically involves the succession of extremely similar images. In terms of rendering these images, this behavior translates into the creation of many fragment programs with the exact same input data. We have measured this fragment redundancy for a set of commercial Android applications, and found that more than 40% of the fragments used in a frame have been already computed in a prior frame. In this paper we try to exploit this redundancy, using fragment memoization. Unfortunately, this is not an easy task as most of the redundancy exists across frames, rendering most HW based schemes unfeasible. We thus first take a step back and try to analyze the temporal locality of the redundant fragments, their complexity, and the number of inputs typically seen in fragment programs. The result of our analysis is a task-level memoization scheme that easily outperforms the current state-of-the-art in low power GPUs. More specifically, our experimental results show that our scheme is able to remove 59.7% of the redundant fragment computations on average. This materializes to a significant speedup of 17.6% on average, while also improving the overall energy efficiency by 8.9% on average.
Published: 2014

20. An ultra low-power hardware accelerator for automatic speech recognition

Author: Reza Yazdani, Albert Segura, Jose-Maria Arnau, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Power aware computing, Parallel processing (Electronic computers), Microprocessor chips, Processament en paral·lel (Ordinadors), Automatic speech recognition, Storage management, Speech recognition, Parallel architectures, Power consumption, Search problems, Microprocessadors, Processament de la parla, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Microprocessors, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC]
Abstract: Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost which is not affordable for the tiny power budget of mobile devices. Hardware acceleration can reduce power consumption of ASR systems, while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is identified as the main bottleneck for performance and power in the design of these accelerators. We propose a prefetching scheme tailored to the needs of an ASR system that hides main memory latency for a large fraction of the memory accesses with a negligible impact on area. In addition, we introduce a novel bandwidth saving technique that removes 20% of the off-chip memory accesses issued during the Viterbi search. The proposed design outperforms software implementations running on the CPU by orders of magnitude and achieves 1.7x speedup over a highly optimized CUDA implementation running on a high-end GeForce GTX 980 GPU, while reducing by two orders of magnitude (287x) the energy required to convert the speech into text.
Published: 2016
Full Text: View/download PDF

21. Boosting mobile GPU performance with a decoupled access/execute fragment processor

Author: Jose-Maria Arnau, Joan-Manuel Parcerisa, and Polychronis Xekalakis
Subjects: Real-time computer graphics, Computer architecture, Computer science, Multithreading, Software rendering, Decoupled architecture, General Medicine, Cache, General-purpose computing on graphics processing units, Texture memory, CAS latency, Efficient energy use, Rendering (computer graphics)
Abstract: Smartphones represent one of the fastest growing markets, providing significant hardware/software improvements every few months. However, supporting these capabilities reduces the operating time per battery charge. The CPU/GPU component is only left with a shrinking fraction of the power budget, since most of the energy is consumed by the screen and the antenna. In this paper, we focus on improving the energy efficiency of the GPU since graphical applications constitute an important part of the existing market. Moreover, the trend towards better screens will inevitably lead to a higher demand for improved graphics rendering. We show that the main bottleneck for these applications is the texture cache and that traditional techniques for hiding memory latency (prefetching, multithreading) do not work well or come at a high energy cost. We thus propose the migration of GPU designs towards the decoupled access-execute concept. Furthermore, we significantly reduce bandwidth usage in the decoupled architecture by exploiting inter-core data sharing. Using commercial Android applications, we show that the end design can achieve 93% of the performance of a heavily multithreaded GPU while providing energy savings of 34%.
Published: 2012

22. Neither more nor less: optimizing thread-level parallelism for GPGPUs

Author: Polychronis Xekalakis, Jose-Maria Arnau, and Joan-Manuel Parcerisa
Subjects: Memory management, Computer science, business.industry, Bandwidth (computing), Task parallelism, Mobile telephony, Parallel computing, business, Scheduling (computing)
Published: 2013

23. TEAPOT: a toolset for evaluating performance, power and image quality on mobile graphics systems

Author: Joan-Manuel Parcerisa, Polychronis Xekalakis, Jose-Maria Arnau, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Mobile computing, Informàtica::Infografia [Àrees temàtiques de la UPC], Computer science, Image quality, Opengl es, Computer graphics, Informàtica mòbil, Low-power graphics, Computer engineering, Computer graphics (images), Simulation infrastructure, Android (operating system), Graphics, Mobile gpu, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: In this paper we present TEAPOT, a full system GPU simulator, whose goal is to allow the evaluation of the GPUs that reside in mobile phones and tablets. To this extent, it has a cycle accurate GPU model for evaluating performance, power models for the GPU, the memory subsystem and for OLED screens, and image quality metrics. Unlike prior GPU simulators, TEAPOT supports the OpenGL ES 1.1/2.0 API, so that it can simulate all commercial graphical applications available for Android systems. To illustrate potential uses of this simulating infrastructure, we perform two case studies. We first turn our attention to evaluating the impact of the OS when simulating graphical applications. We show that the overall GPU power/performance is greatly affected by common OS tasks, such as image composition, and argue that application level simulation is not sufficient to understand the overall GPU behavior. We then utilize the capabilities of TEAPOT to perform studies that trade image quality for energy. We demonstrate that by allowing for small distortions in the overall image quality, a significant amount of energy can be saved.
Published: 2013
Full Text: View/download PDF

24. Design and Evaluation of an Ultra Low-power Human-quality Speech Recognition System

Author: Antonio González, Jose-Maria Arnau, Dennis Pinto, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Vocabulary, Computer science, media_common.quotation_subject, Speech recognition, 02 engineering and technology, Neural networks (Computer science), 030507 speech-language pathology & audiology, 03 medical and health sciences, Software, Low-power hardware, 0202 electrical engineering, electronic engineering, information engineering, Xarxes neuronals (Informàtica), Quality (business), media_common, Ultra low power, business.industry, Automatic speech recognition, Hardware accelerators, 020202 computer hardware & architecture, Power (physics), Hardware and Architecture, Software deployment, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Processament de la parla, Language model, 0305 other medical science, business, Decoding methods, Information Systems
Abstract: Automatic Speech Recognition (ASR) has experienced a dramatic evolution since the pioneering development of Bell Labs’ single-digit recognizer more than 50 years ago. Current ASR systems have taken advantage of the tremendous improvements in AI during the past decade by incorporating Deep Neural Networks into the system and pushing their accuracy to levels comparable to that of humans. This article describes and characterizes a representative ASR system with state-of-the-art accuracy and proposes a hardware platform capable of decoding speech in real-time with a power dissipation close to 1 Watt. The software is based on the so-called hybrid approach with a vocabulary of 200K words and RNN-based language model re-scoring, whereas the hardware consists of a commercially available low-power processor along with two accelerators used for the most compute-intensive tasks. The article shows that high performance can be obtained with very low power, enabling the deployment of these systems in extremely power-constrained environments such as mobile and IoT devices. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No. 833057), the Spanish State Research Agency under grant TIN2016-75344-R (AEI/FEDER, EU), the ICREA Academia program, and the Spanish MICINN Ministry under grant BES-2017-080605.
Full Text: View/download PDF

25. Neuron-level fuzzy memoization in RNNs

Author: Gem Dot, Antonio González, Jose-Maria Arnau, Franyell Silfa, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Scheme (programming language), Binary networks, Speedup, Computer science, Memoization, Reconeixement automàtic de la parla, 02 engineering and technology, 01 natural sciences, Fuzzy logic, Neural networks (Computer science), Informàtica [Àrees temàtiques de la UPC], 0103 physical sciences, Machine learning, 0202 electrical engineering, electronic engineering, information engineering, medicine, Xarxes neuronals (Informàtica), computer.programming_language, 010302 applied physics, Sequence, Artificial neural network, Automatic speech recognition, 3. Good health, 020202 computer hardware & architecture, Long short term memory, Recurrent neural network, medicine.anatomical_structure, Recurrent neural networks, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Neuron, Algorithm, computer
Abstract: The final publication is available at ACM via http://dx.doi.org/10.1145/3352460.3358309. Recurrent Neural Networks (RNNs) are a key technology for applications such as automatic speech recognition or machine translation. Unlike conventional feed-forward DNNs, RNNs remember past information to improve the accuracy of future predictions and, therefore, they are very effective for sequence processing problems. For each application run, each recurrent layer is executed many times for processing a potentially large sequence of inputs (words, images, audio frames, etc.). In this paper, we make the observation that the output of a neuron exhibits small changes in consecutive invocations. We exploit this property to build a neuron-level fuzzy memoization scheme, which dynamically caches the output of each neuron and reuses it whenever it is predicted that the current output will be similar to a previously computed result, avoiding in this way the output computations. The main challenge in this scheme is determining whether the new neuron's output for the current input in the sequence will be similar to a recently computed result. To this end, we extend the recurrent layer with a much simpler Bitwise Neural Network (BNN), and show that the BNN and RNN outputs are highly correlated: if two BNN outputs are very similar, the corresponding outputs in the original RNN layer are likely to exhibit negligible changes. The BNN provides a low-cost and effective mechanism for deciding when fuzzy memoization can be applied with a small impact on accuracy. We evaluate our memoization scheme on top of a state-of-the-art accelerator for RNNs, for a variety of different neural networks from multiple application domains. We show that our technique avoids more than 24.2% of computations, resulting in 18.5% energy savings and 1.35x speedup on average.

26. An ultra low-power hardware accelerator for acoustic scoring in speech recognition

Author: Jordi Tubella, Jose-Maria Arnau, Antonio González, Hamid Tabani, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: 010302 applied physics, Mobile processor, Speedup, Computer science, Memoization, Speech recognition, Acoustic scoring, Automatic speech recognition, Acoustic model, Memory bandwidth, Reconeixement automàtic de la parla, 02 engineering and technology, 01 natural sciences, 020202 computer hardware & architecture, CUDA, Gaussian mixture model, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Hardware acceleration, Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic [Àrees temàtiques de la UPC], Lazy evaluation, Hardware accelerator
Abstract: Accurate, real-time Automatic Speech Recognition (ASR) comes at a high energy cost, so accuracy has often to be sacrificed in order to fit the strict power constraints of mobile systems. However, accuracy is extremely important for the end-user, and today's systems are still unsatisfactory for many applications. The most critical component of an ASR system is the acoustic scoring, as it has a large impact on the accuracy of the system and takes up the bulk of execution time. The vast majority of ASR systems implement the acoustic scoring by means of Gaussian Mixture Models (GMMs), where the acoustic scores are obtained by evaluating multidimensional Gaussian distributions. In this paper, we propose a hardware accelerator for GMM evaluation that reduces the energy required for acoustic scoring by three orders of magnitude compared to solutions based on CPUs and GPUs. Our accelerator implements a lazy evaluation scheme where Gaussians are computed on demand, avoiding 50% of the computations. Furthermore, it employs a novel clustering scheme to reduce the size of the acoustic model, which results in 8x memory bandwidth savings with a negligible impact on accuracy. Finally, it includes a novel memoization scheme that avoids 74.88% of floating-point operations. The end design provides a 164x speedup and 3532x energy reduction when compared with a highly-tuned implementation running on a modern mobile CPU. Compared to a state-of-the-art mobile GPU, the GMM accelerator achieves 5.89x speedup over a highly optimized CUDA implementation, while reducing energy by 241x.
