76 results for "Benini, L"
Search Results
2. Comparative analysis of NoCs for two-dimensional versus three-dimensional SoCs supporting multiple voltage and frequency islands
- Author
-
Seiculescu, C., Murali, S., Benini, L., and De Micheli, G.
- Subjects
Delay lines -- Analysis, Voltage -- Measurement, Embedded systems -- Design and construction, Embedded system, System on a chip, Business, Computers and office automation industries, Electronics, Electronics and electrical industries
- Published
- 2010
3. Thermal balancing policy for multiprocessor stream computing platforms
- Author
-
Mulas, F., Atienza, D., Acquaviva, A., Carta, S., Benini, L., and De Micheli, G.
- Subjects
Multiprocessors -- Design and construction, Digital integrated circuits -- Analysis, Embedded systems -- Analysis, System on a chip, Programmable logic array, Embedded system, Business, Computers and office automation industries, Electronics, Electronics and electrical industries
- Published
- 2009
4. Design of a solar-harvesting circuit for batteryless embedded systems
- Author
-
Brunelli, D., Moser, C., Thiele, L., and Benini, L.
- Subjects
Electric current converters -- Design and construction, Voltage -- Measurement, Embedded systems -- Design and construction, Wireless sensor networks -- Analysis, Electric current converter, Embedded system, System on a chip, Business, Computers and office automation industries, Electronics, Electronics and electrical industries
- Published
- 2009
5. Design of a collective communication infrastructure for barrier synchronization in cluster-based nanoscale MPSoCs
- Author
-
Abellán, J. L., Fernández, J., Acacio, M. E., Bertozzi, D., Bortolotti, D., Marongiu, A., and Benini, L.
- Subjects
embedded systems, shared memory systems, system-on-chip
- Abstract
Barrier synchronization is a key programming primitive for shared memory embedded MPSoCs. As the core count increases, software implementations cannot provide the needed performance and scalability, making hardware acceleration critical. In this paper we describe an interconnect extension implemented with standard cells and a mainstream industrial toolflow. We show that the area overhead is marginal with respect to the performance improvements of the resulting hardware-accelerated barriers. We integrate our HW barrier into the OpenMP programming model and discuss synchronization efficiency compared with traditional software implementations.
- Published
- 2012
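The scalability problem that entry 5 attributes to software barriers can be illustrated with a minimal centralized (sense-reversing) barrier. The sketch below is an illustrative Python analogue, not the paper's MPSoC implementation; the class name and structure are assumptions for illustration only.

```python
import threading

class CentralBarrier:
    """Minimal shared-counter (sense-reversing) barrier: the kind of
    software primitive the paper's hardware extension replaces."""

    def __init__(self, n):
        self.n = n            # number of participants
        self.count = 0        # arrivals in the current phase
        self.sense = False    # flips once per phase
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_sense = not self.sense
            self.count += 1
            if self.count == self.n:
                # Last arrival resets the counter and releases everyone.
                self.count = 0
                self.sense = my_sense
                self.cond.notify_all()
            else:
                while self.sense != my_sense:
                    self.cond.wait()
```

Every arrival serializes on one shared counter, so the cost of reaching the barrier grows with the core count; that is the software bottleneck motivating the paper's move of the barrier into the interconnect.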
6. Analysis of Error Recovery Schemes for Networks on Chips
- Author
-
Murali, S., Theocharides, T., Vijaykrishnan, N., Irwin, M. J., Benini, L., and De Micheli, G.
- Subjects
Computer science, Energy conservation, Data link, Hardware and Architecture, Power consumption, Embedded system, Electronic engineering, System on a chip, Electrical and Electronic Engineering, Error detection and correction, Software, Efficient energy use
- Abstract
In this article, we discuss design constraints to characterize efficient error recovery mechanisms for the NoC design environment. We explore error control mechanisms at the data link and network layers and present the schemes' architectural details. We investigate the energy efficiency, error protection efficiency, and performance impact of various error recovery mechanisms.
- Published
- 2005
7. Telescopic Units: Increasing the Average Throughput of Pipelined Designs by Adaptive Latency Control
- Author
-
Benini, L., De Micheli, G., Macii, E., and Poncino, M.
- Subjects
Combinational logic, Logic synthesis, Adaptive control, Computer engineering, Computer science, Embedded system, Latency (engineering), Programmable logic array, Electronic circuit
- Abstract
This paper presents a technique, alternative to performance-driven synthesis, that drastically increases the average throughput of combinational logic blocks by transforming fixed-latency units into variable-latency ones that run with a faster clock cycle. The transformation is fully automatic and can be used in conjunction with traditional design techniques, such as pipelining, to improve the overall performance of speed-critical systems. Results obtained on a large set of benchmark circuits are very promising.
- Published
- 1997
8. Improving Autonomous Nano-Drones Performance via Automated End-to-End Optimization and Deployment of DNNs
- Author
-
Francesco Conti, Lorenzo Lamberti, Luca Benini, Vlad Niculescu, and Daniele Palossi
- Subjects
convolutional neural network (CNN), Computer science, nano-drone, Drone, End-to-end principle, Software deployment, Embedded system, Unmanned aerial vehicle (UAV), Electrical and Electronic Engineering, autonomous navigation, ultra-low-power (ULP)
- Abstract
The evolution of energy-efficient ultra-low-power (ULP) parallel processors and the diffusion of convolutional neural networks (CNNs) are fueling the advent of autonomous nano-sized unmanned aerial vehicles (UAVs). These sub-10 cm robotic platforms are envisioned as next-generation ubiquitous smart sensors and unobtrusive robotic helpers. However, the limited computational/memory resources available aboard nano-UAVs introduce the challenge of minimizing and optimizing vision-based CNNs, which to date requires error-prone, labor-intensive iterative development flows. This work explores methodologies and software tools to streamline and automate the full deployment of vision-based CNN navigation on a ULP multicore system-on-chip acting as a mission computer on a Crazyflie 2.1 nano-UAV. We focus on the deployment of PULP-Dronet (Palossi et al., 2019), a state-of-the-art CNN for autonomous navigation of nano-UAVs, from initial training to final closed-loop evaluation. Compared to the original hand-crafted CNN, our results show a 2× reduction in memory footprint and a 1.6× speedup in inference time while guaranteeing the same prediction accuracy and significantly improving in-field behavior, achieving: i) obstacle avoidance with a peak braking speed of 1.65 m/s, improving the speed/braking-space ratio of the baseline; ii) free flight in a familiar environment at up to 1.96 m/s (0.5 m/s for the baseline); and iii) lane following on a path featuring a 90° turn, all while using less than 1.6% of the drone's power budget for computation. To foster new applications and future research, we open-source the whole software design in a ready-to-run project compatible with the Crazyflie 2.1.
- Published
- 2021
9. Adaptive Random Forests for Energy-Efficient Inference on Microcontrollers
- Author
-
Enrico Macii, Francesco Daghero, Massimo Poncino, Daniele Jahier Pagliari, Alessio Burrello, Andrea Calimera, Luca Benini, and Chen Xie
- Subjects
Computer science, Decision tree, Inference, Energy consumption, Random forest, Machine Learning, Microcontroller, Computer engineering, Embedded Systems, Latency (engineering), Energy (signal processing), Efficient energy use
- Abstract
Random Forests (RFs) are widely used Machine Learning models in low-power embedded devices, due to their hardware-friendly operation and high accuracy on practically relevant tasks. The accuracy of an RF often increases with the number of internal weak learners (decision trees), but at the cost of a proportional increase in inference latency and energy consumption. Such costs can be mitigated considering that, in most applications, inputs are not all equally difficult to classify. Therefore, a large RF is often necessary only for (few) hard inputs, and wasteful for easier ones. In this work, we propose an early-stopping mechanism for RFs, which terminates the inference as soon as a high-enough classification confidence is reached, reducing the number of weak learners executed for easy inputs. The early-stopping confidence threshold can be controlled at runtime, in order to favor either energy saving or accuracy. We apply our method to three different embedded classification tasks, on a single-core RISC-V microcontroller, achieving an energy reduction from 38% to more than 90% with a drop of less than 0.5% in accuracy. We also show that our approach outperforms previous adaptive ML methods for RFs.
Published in: 2021 IFIP/IEEE 29th International Conference on Very Large Scale Integration (VLSI-SoC), 2021
- Published
- 2022
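The early-stopping mechanism described in entry 9 can be sketched in a few lines. The function below is an illustrative Python sketch under stated assumptions (trees modeled as callables returning a class index, majority-vote confidence, a minimum of two executed trees); it is not the paper's MCU implementation.

```python
def predict_early_stop(trees, x, threshold, n_classes):
    """Run weak learners one at a time; stop as soon as the running
    majority class reaches `threshold` confidence.

    Returns (predicted_class, number_of_trees_executed)."""
    votes = [0] * n_classes
    for executed, tree in enumerate(trees, start=1):
        votes[tree(x)] += 1
        confidence = max(votes) / executed
        # A higher threshold favors accuracy; a lower one saves energy
        # (the runtime-controllable knob described in the abstract).
        if executed >= 2 and confidence >= threshold:
            return votes.index(max(votes)), executed
    return votes.index(max(votes)), len(trees)
```

Easy inputs, on which the first few trees agree, terminate almost immediately; hard inputs with split votes fall through to the full forest.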
10. PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-core MCUs Through Performance-Driven Autotuning
- Author
-
Davide Nadalini, Manuele Rusci, Giuseppe Tagliavini, Leonardo Ravaglia, Luca Benini, and Francesco Conti
- Subjects
distributed systems, parallel processing systems, engineering, microprocessor chips, computer systems, computer science, artificial intelligence, computer hardware, distributed computer systems, processors, computer programming, embedded systems, machine learning, computer networks, internet, signal processing
- Abstract
An open challenge in making Internet-of-Things sensor nodes "smart" and self-adaptive is to enable on-chip Deep Neural Network (DNN) training on Ultra-Low-Power (ULP) microcontroller units (MCUs). To this aim, we present a framework, based on PULP-TrainLib, to deploy DNN training tasks on RISC-V-based Parallel-ULP (PULP) MCUs. PULP-TrainLib is a library of parallel software DNN primitives enabling the execution of forward and backward steps on PULP MCUs. To optimize PULP-TrainLib's kernels, we propose a strategy to automatically select and configure (autotune) the fastest among a set of tiling options and optimized floating-point matrix multiplication kernels, according to the tensor shapes of every DNN layer. Results on an 8-core RISC-V MCU show that our auto-tuned primitives improve MAC/clk by up to 2.4× compared to "one-size-fits-all" matrix multiplication, achieving up to 4.39 MAC/clk, 36.6× better than a commercial STM32L4 MCU executing the same DNN layer training workload. Furthermore, our strategy proves to be 30.7× faster than AIfES, a state-of-the-art training library for MCUs, while training a complete TinyML model.
- Published
- 2022
11. Hardware-In-The-Loop Emulation for Agile Co-Design of Parallel Ultra-Low Power IoT Processors
- Author
-
Luca Valente, Luca Benini, and Davide Rossi
- Subjects
IoT, Emulation, Computer science, Verification, Hardware-in-the-loop simulation, Power (physics), Embedded system, Full cycle, FPGA emulation, Field-programmable gate array, CNN, Agile software development
- Abstract
Simulation of computing systems plays a crucial role in state-of-the-art design validation and optimization methodologies. Traditionally, Register Transfer-Level (RTL) simulation is a well-established approach to perform performance analysis as well as functional validation. However, more agile simulation and emulation methodologies are required for architectural exploration and optimization, given the ever-increasing complexity of parallel processors, including those designed for ultra-low-power IoT end-nodes. Architectural simulators are the most commonly used tools to explore parallel computing architectures. Nevertheless, they are often not accurate enough to identify and quantify system-level performance bottlenecks, such as access to external peripherals or contention on shared resources. In this context, FPGA logic emulation is gaining increasing popularity. In this paper, we introduce a "Hardware-in-the-loop" framework that eases HW/SW co-design of parallel ultra-low-power IoT processors by enabling the analysis of the full cyber-physical loop from sensing to actuation. We use the proposed methodology to carry out microarchitectural optimizations and to demonstrate performance improvements on two real-life end-to-end applications with full hardware-in-the-loop emulation. Our results show that we can run workloads on real-life applications more than 3000 times faster than cycle-accurate RTL, at speeds similar to instruction-accurate simulators but with full cycle accuracy.
- Published
- 2021
12. Towards an Open, Flexible, Wearable Ultrasound Probe for Musculoskeletal Monitoring
- Author
-
Christoph Leitner, Michele Magno, Sergei Vostrikov, Luca Benini, Andrea Cossettini, and Christian Vogt
- Subjects
Battery (electricity), Ultrasonic, Modalities, Muscle, Computer science, Wearable computer, Smart probe, Frame rate, Embedded system, Monitoring, Low power design, Ultrasonic sensor, Computer hardware, Communication channel
- Abstract
Ultrasound (US) is a widely used tool to observe musculoskeletal features during movement. US signals provide information about the instantaneous length and activation of muscles, and these signals are used for new prosthetic control strategies, to increase the performance of human-machine interfaces, or to improve muscle force predictions. However, existing commercial systems and research platforms are bulky, non-wearable, not supporting high frame rates, or lack configurability. Our work presents a US prototype meeting all key requirements to measure muscle activity during motion: our platform is open (providing access to raw data), flexible (operating in different modalities and channel counts), features wireless connectivity supporting high frame rates (>100 Hz), and supports low power consumption.
- Published
- 2021
13. A Microcontroller is All You Need: Enabling Transformer Execution on Low-Power IoT Endnodes
- Author
-
Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini, and Marcello Zanghieri
- Subjects
TinyML, Computer science, Deep learning, Internet of Things, Latency, Energy consumption, Matrix multiplication, ARM architecture, Microcontroller, Statistical classification, Transformers, Embedded system, Artificial intelligence, Transformer (machine learning model)
- Abstract
Transformer networks have become state of the art for many tasks such as NLP and are closing the gap on other tasks like image recognition. Similarly, Transformers and attention methods are starting to attract attention on smaller-scale tasks, which fit the typical memory envelope of MCUs. In this work, we propose a new set of execution kernels tuned for efficient execution on MCU-class RISC-V and ARM Cortex-M cores. We focus on minimizing memory movements while maximizing data reuse in the attention layers. With our library, we obtain 3.4×, 1.8×, and 2.1× lower latency and energy on 8-bit attention layers, compared to previous state-of-the-art (SoA) linear and matrix multiplication kernels in the CMSIS-NN and PULP-NN libraries on the STM32H7 (Cortex-M7), STM32L4 (Cortex-M4), and GAP8 (RISC-V IMC-Xpulp) platforms, respectively. As a use case for our TinyTransformer library, we also demonstrate that we can fit a 263 kB Transformer on the GAP8 platform, outperforming the previous SoA convolutional architecture on the TinyRadarNN dataset, with a latency of 9.24 ms, 0.47 mJ energy consumption, and an accuracy improvement of 3.5%.
- Published
- 2021
14. A RISC-V in-network accelerator for flexible high-performance low-power packet processing
- Author
-
Andreas Kurth, Torsten Hoefler, Timo Schneider, Jakub Beránek, Luca Benini, Thomas Benz, Alexandru Calotoiu, and Salvatore Di Girolamo
- Subjects
Specialized architecture, packet processing, Computer science, Gigabit, in-network compute, Next-generation network, Computer architecture, Silicon-on-insulator, Network packet, Process (computing), Computational modeling, Open source software, sPIN, Embedded system, RISC-V, Task analysis, Programming paradigm, Host (network)
- Abstract
Offloading data and control tasks to the network is becoming increasingly important, especially as network speeds grow faster than CPU frequencies. In-network compute alleviates the host CPU load by running tasks directly in the network, enabling additional computation/communication overlap and potentially improving overall application performance. However, sustaining bandwidths provided by next-generation networks, e.g., 400 Gbit/s, can become a challenge. sPIN is a programming model for in-NIC compute, where users specify handler functions that are executed on the NIC for each incoming packet belonging to a given message or flow. It enables a CUDA-like acceleration, where the NIC is equipped with lightweight processing elements that process network packets in parallel. We investigate the architectural specialties that a sPIN NIC should provide to enable high-performance, low-power, and flexible packet processing. We introduce PsPIN, a first open-source sPIN implementation, based on a multi-cluster RISC-V architecture and designed according to the identified architectural specialties. We investigate the performance of PsPIN with cycle-accurate simulations, showing that it can process packets at 400 Gbit/s for several use cases, introducing minimal latencies (26 ns for 64 B packets) and occupying a total area of 18.5 mm² (22 nm FDSOI).
- Published
- 2021
15. Arnold: An eFPGA-Augmented RISC-V SoC for Flexible and Low-Power IoT End Nodes
- Author
-
Timothy Saxe, Alfio Di Mauro, Frank K. Gurkaynak, Ket Chong Yap, Luca Benini, Mao Wang, Davide Rossi, and Pasquale Davide Schiavone
- Subjects
Computer science, Encryption, Embedded systems, Open source, Gate array, Field-programmable gate array (FPGA), System on a chip, Electrical and Electronic Engineering, Edge computing, Internet of Things (IoT), Microcontroller, RISC-V, Hardware and Architecture, Interfacing, Analytics, Software
- Abstract
A wide range of Internet of Things (IoT) applications require powerful, energy-efficient, and flexible end nodes to acquire data from multiple sources, process and distill the sensed data through near-sensor data analytics algorithms, and transmit it wirelessly. This work presents Arnold: a 0.5-to-0.8-V, 46.83-μW/MHz, 600-MOPS fully programmable RISC-V microcontroller unit (MCU) fabricated in 22-nm Globalfoundries GF22FDX technology, coupling a state-of-the-art (SoA) microcontroller with an embedded field-programmable gate array (eFPGA). We demonstrate the flexibility of the system-on-chip (SoC) in tackling the challenges of many emerging IoT applications, such as interfacing sensors and accelerators with nonstandard interfaces, performing on-the-fly preprocessing tasks on data streamed from peripherals, and accelerating near-sensor analytics, encryption, and machine learning tasks. A unique feature of the proposed SoC is the exploitation of body biasing to reduce the leakage power of the eFPGA fabric by up to 18× at 0.5 V, achieving SoA bitstream-retentive sleep power for the eFPGA fabric, as low as 20.5 μW. The proposed SoC provides 3.4× better performance and 2.9× better energy efficiency than other fabricated heterogeneous reconfigurable SoCs of the same class.
- Published
- 2021
16. Analyzing Memory Interference of FPGA Accelerators on Multicore Hosts in Heterogeneous Reconfigurable SoCs
- Author
-
Björn Forsberg, Luca Benini, Maxim Mattheeuws, and Andreas Kurth
- Subjects
Multi-core processor, Computer science, Benchmarks, Interference, Byte, FLOPS, Heterogeneous, SoC, CPU, FPGA, Model, Memory management, Embedded system, Field-programmable gate array, Host (network)
- Abstract
Reconfigurable heterogeneous systems-on-chips (SoCs) integrating multiple accelerators are cost-effective and feature the processing power required for complex embedded applications. However, to enable their usage in real-time settings, it is crucial to control interference on the shared main memory for reliable performance. Interference causes performance degradation due to simultaneous memory requests by components such as CPUs, caches, accelerators, and DMAs. We propose a methodology to characterize the interference to multicore host processors caused by accelerators implemented in the FPGA fabric of reconfigurable heterogeneous SoCs. Based on it, we extend the roofline model to account for performance degradation of the computing platform. The extended model makes it possible to determine efficiently at which point memory interference becomes critical for a given platform and workload. We apply our methodology to a modern Xilinx UltraScale+ SoC integrating a multicore ARM Cortex-A CPU and a Kintex-grade FPGA. To the best of our knowledge, our results experimentally show for the first time that programs with intensities below 5 flop/byte (workloads with low cache locality) can suffer slowdowns of up to an order of magnitude.
Proceedings of the 2021 Design, Automation & Test in Europe (DATE 2021), ISBN 978-3-9819263-5-4, ISBN 978-1-7281-6336-9
- Published
- 2021
17. MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect
- Author
-
Matheus Cavalcante, Luca Benini, Samuel Riedel, and Antonio Pullini
- Subjects
Multi-core processor, Computer science, MIMD, Network topology, Many-core, Memory management, Memory bank, Embedded system, Latency (engineering), Networks-on-Chips, Throughput, Scratchpad memory
- Abstract
A key challenge in scaling shared-L1 multi-core clusters towards many-core (more than 16 cores) configurations is to ensure low-latency and efficient access to the L1 memory. In this work we demonstrate that it is possible to scale up the shared-L1 architecture: we present MemPool, a 32-bit many-core system with 256 fast RV32IMA "Snitch" cores featuring application-tunable execution units, running at 700 MHz in typical conditions (TT, 0.80 V, 25 °C). MemPool is easy to program, with all the cores sharing a global view of a large L1 scratchpad memory pool, accessible within at most 5 cycles. In MemPool's physical-aware design, we emphasized the exploration, design, and optimization of the low-latency processor-to-L1-memory interconnect. We compare three candidate topologies, analyzing them in terms of latency, throughput, and back-end feasibility. The chosen topology keeps the average latency at fewer than 6 cycles, even for a heavy injected load of 0.33 request/core/cycle. We also propose a lightweight addressing scheme that maps each core's private data to a memory bank accessible within one cycle, which leads to performance gains of up to 20% in real-world signal processing benchmarks. The addressing scheme is also highly efficient in terms of energy consumption, since requests to local banks consume only half of the energy required to access remote banks. Our design achieves competitive performance with respect to an ideal, non-implementable full-crossbar baseline.
Accepted for publication in the Design, Automation and Test in Europe (DATE) Conference 2021.
- Published
- 2021
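The bank-local addressing idea in entry 17 can be sketched as an address remapping: word addresses are interleaved across L1 banks, and each core's private region is laid out so that it always lands in that core's own bank. The constants and function names below are illustrative assumptions, not MemPool's actual parameters.

```python
# Illustrative parameters (assumed, not the chip's real configuration).
N_BANKS = 16          # number of L1 banks, one local bank per core
WORDS_PER_BANK = 256  # bank depth in words

def interleaved_bank(addr):
    """Plain word-level interleaving: consecutive word addresses hit
    consecutive banks. Returns (bank_index, row_within_bank)."""
    return addr % N_BANKS, addr // N_BANKS

def private_addr(core_id, offset):
    """Place word `offset` of a core's private region so that, under the
    interleaving above, it falls in the core's own (one-cycle) local bank."""
    return offset * N_BANKS + core_id
```

Because every private access resolves to the local bank, it completes in one cycle and, per the abstract, costs roughly half the energy of a remote-bank access.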
18. H-Watch: An Open, Connected Platform for AI-Enhanced COVID19 Infection Symptoms Monitoring and Contact Tracing
- Author
-
Tommaso Polonelli, Michele Magno, Luca Benini, Philipp Mayer, and Lukas Schulthess
- Subjects
Battery (electricity), Smart sensing, Computer science, Wearable device, Wearable computer, COVID-19, Wireless sensor networks, Tracing, Microcontroller, Software, Smart sensor, Low power design, Embedded system, Tiny Machine Learning, Contact tracing, Wearable technology
- Abstract
The novel COVID-19 disease has been declared a pandemic event. Early detection of infection symptoms and contact tracing are playing a vital role in containing the spread of COVID-19. As demonstrated by recent literature, multi-sensor and connected wearable devices might enable symptom detection and help trace contacts, while also acquiring useful epidemiological information. This paper presents the design and implementation of a fully open-source wearable platform called H-Watch. It has been designed to include several sensors for COVID-19 early detection, multi-radio for wireless transmission and tracking, a microcontroller for processing data on-board, and finally, an energy harvester to extend the battery lifetime. Experimental results demonstrated only 5.9 mW of average power consumption, leading to a lifetime of 9 days on a small watch battery. Finally, all the hardware and the software, including a machine-learning-on-MCU toolkit, are provided open-source, allowing the research community to build and use the H-Watch.
2021 IEEE International Symposium on Circuits and Systems (ISCAS), ISBN 978-1-7281-9201-7
- Published
- 2021
19. Microarchitectural Timing Channels and their Prevention on an Open-Source 64-bit RISC-V Core
- Author
-
Gernot Heiser, Nils Wistoff, Moritz Schneider, Frank K. Gurkaynak, and Luca Benini
- Subjects
timing channels, covert channel, Computer science, Security policy, Microarchitecture, Software, Stack (abstract data type), Embedded system, RISC-V, operating system, system security, State (computer science), Microkernel, computer architecture, Reset (computing), time protection
- Abstract
Microarchitectural timing channels use variations in the timing of events, resulting from competition for limited hardware resources, to leak information in violation of the operating system's security policy. Such channels also exist on a simple in-order RISC-V core, as we demonstrate on the open-source RV64GC Ariane core. Time protection, recently proposed and implemented in the seL4 microkernel, aims to prevent timing channels, but depends on a controlled reset of microarchitectural state. Using Ariane, we show that software techniques for performing such a reset are insufficient and highly inefficient. We demonstrate that adding a single flush instruction is sufficient to close all five evaluated channels at negligible hardware costs, while requiring only minor modifications to the software stack.
- Published
- 2021
20. Efficient Transform Algorithms for Parallel Ultra-Low-Power IoT End Nodes
- Author
-
Benedetta Mazzoni, Simone Benatti, Luca Benini, and Giuseppe Tagliavini
- Subjects
IoT, STFT, Speedup, General Computer Science, Computer science, Synchronization, Convolutional codes, Discrete wavelet transforms, DWT, Libraries, Parallel processing, parallel programming, Throughput, Transforms, Software, Software development, Shared memory, Parallel processing (DSP implementation), Control and Systems Engineering, Embedded system
- Abstract
Modern IoT end nodes must support computationally intensive workloads within a limited power budget. Parallel ultra-low-power architectures are a promising target for this scenario, and the availability of highly optimized software libraries is crucial to exploit parallelism and reduce software development costs. This letter proposes an efficient parallel design of the widely used STFT and DWT transforms targeting ultra-low-power IoT devices. We address key performance challenges related to fine-grained synchronization and banking conflicts in shared memory. We achieve high throughput (50.95 samples/μs on average), good parallel speedup (up to 6.79×), and high energy efficiency (up to 172.55 GOp/s/W) on a cluster of 8 RISC-V cores optimized for parallel ultra-low-power (PULP) operation.
- Published
- 2021
21. 4.4 A 1.3TOPS/W @ 32GOPS Fully Integrated 10-Core SoC for IoT End-Nodes with 1.7μW Cognitive Wake-Up from MRAM-Based State-Retentive Sleep Mode
- Author
-
Manuel Eggiman, Eric Flamand, Igor Loi, Jie Chen, Stefan Mach, Francesco Conti, Antonio Pullini, Davide Rossi, Luca Benini, Giuseppe Tagliavini, Marco Guermandi, and Alfio Di Mauro
- Subjects
Flexibility (engineering), Magnetoresistive random-access memory, Computer science, System on Chip (SoC), Digital Signal Processor (DSP), Cognitive Wake-Up (CWU), Internet of Things (IoT), Near-Sensor Analytic Applications (NSAA), Machine Learning (ML), Deep Neural Networks (DNN), RISC-V, Embedded system, Static random-access memory, State (computer science), SIMD, Sleep mode, Efficient energy use
- Abstract
The Internet of Things requires end-nodes with ultra-low-power always-on capability for long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an always-on IoT end-node SoC capable of scaling from a 1.7 μW fully retentive cognitive sleep mode up to 32.2 GOPS (at 49.4 mW) peak performance on NSAAs, including mobile DNN inference, exploiting 1.6 MB of state-retentive SRAM and 4 MB of non-volatile MRAM. To meet the performance and flexibility requirements of NSAAs, the SoC features 10 RISC-V cores: one core for SoC and IO management and a 9-core cluster supporting multi-precision SIMD integer and floating-point computation. Two programmable machine-learning (ML) accelerators boost energy efficiency in sleep and active state, respectively.
- Published
- 2021
22. RISC-V for Real-time MCUs - Software Optimization and Microarchitectural Gap Analysis
- Author
-
Luca Benini, Robert Balas, Balas R., and Benini L.
- Subjects
FreeRTOS ,Interrupt latency ,business.industry ,Computer science ,Firmware ,real-time ,RISC-V ,Program optimization ,computer.software_genre ,ARM architecture ,Software ,Embedded system ,Use case ,business ,computer ,Context switch ,open-source - Abstract
Processors using the RISC-V ISA are finding increasing real use in IoT and embedded systems in the MCU segment. However, many real-life use cases in this segment have real-time constraints. In this paper we analyze the current state of real-time support for RISC-V with respect to the ISA, available hardware, and software stack, focusing on the RV32IMC subset of the ISA. As a reference point, we use the CV32E40P, an open-source industrially supported RV32IMFC core, and FreeRTOS, a popular open-source real-time operating system, to do a baseline characterization. We perform a series of software optimizations on the vanilla RISC-V FreeRTOS port, where we also explore and make use of ISA and microarchitectural features, improving the context switch time by 25% and the interrupt latency by 33% on average and by 20% in the worst case on a CV32E40P, evaluated on a power control unit firmware and synthetic benchmarks. This improved version then serves in a comparison against the ARM Cortex-M series, which in turn allows us to highlight gaps and challenges to be tackled in the RISC-V ISA as well as in the hardware/software ecosystem to achieve competitive maturity.
- Published
- 2021
23. Using Low-Power, Low-Cost IoT Processors in Clinical Biosignal Research: an In-depth Feasibility Check
- Author
-
Luca Benini, Simone Benatti, Silvestro Micera, Victor Kartsch, Fiorenzo Artoni, Kartsch Victor, Artoni F., Benatti S., Micera S., and Benini L.
- Subjects
Computer science ,business.industry ,020208 electrical & electronic engineering ,02 engineering and technology ,Biosensing Techniques ,Feasibility Studies ,Wearable Electronic Devices ,Power (physics) ,Feasibility Studie ,03 medical and health sciences ,Biosensing Technique ,0302 clinical medicine ,Embedded system ,0202 electrical engineering, electronic engineering, information engineering ,Biosignal ,Wearable Electronic Device ,Internet of Things ,business ,030217 neurology & neurosurgery - Abstract
Research on biosignal (ExG) analysis is usually performed with expensive systems requiring connection with external computers for data processing. Consumer-grade low-cost wearable systems for bio-potential monitoring and embedded processing have been presented recently, but are not considered suitable for medical-grade analyses. This work presents a detailed quantitative comparative analysis of a recently presented fully-wearable low-power and low-cost platform (BioWolf) for ExG acquisition and embedded processing with two research-grade acquisition systems, namely, ANTNeuro (EEG) and the Noraxon DTS (EMG). Our preliminary results demonstrate that BioWolf offers competitive performance in terms of electrical properties and classification accuracy. This paper also highlights distinctive features of BioWolf, such as real-time embedded processing, improved wearability, and energy efficiency, which allow devising new types of experiments and usage scenarios for medical-grade biosignal processing in research and future clinical studies.
- Published
- 2020
24. An Accurate EEGNet-based Motor-Imagery Brain–Computer Interface for Low-Power Edge Computing
- Author
-
Burak Kaya, Michael Hersche, Michele Magno, Luca Benini, Xiaying Wang, Batuhan Tomekce, Wang X., Hersche M., Tomekce B., Kaya B., Magno M., and Benini L.
- Subjects
Signal Processing (eess.SP) ,FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer science ,Embedded systems ,Interface (computing) ,0206 medical engineering ,Real-time computing ,Computer Science - Human-Computer Interaction ,02 engineering and technology ,Convolutional neural network ,Human-Computer Interaction (cs.HC) ,Machine Learning (cs.LG) ,Reduction (complexity) ,Upsampling ,embedded system ,03 medical and health sciences ,0302 clinical medicine ,edge computing ,FOS: Electrical engineering, electronic engineering, information engineering ,Electrical Engineering and Systems Science - Signal Processing ,Brain–computer interface ,motor-imagery ,020601 biomedical engineering ,ARM architecture ,Microcontroller ,Brain-computer interface ,Memory footprint ,Motor-imagery ,CNN ,Edge computing ,030217 neurology & neurosurgery - Abstract
This paper presents an accurate and robust embedded motor-imagery brain–computer interface (MI-BCI). The proposed novel model, based on EEGNet [1], matches the requirements of memory footprint and computational resources of low-power microcontroller units (MCUs), such as the ARM Cortex-M family. Furthermore, the paper presents a set of methods, including temporal downsampling, channel selection, and narrowing of the classification window, to further scale down the model to relax memory requirements with negligible accuracy degradation. Experimental results on the Physionet EEG Motor Movement/Imagery Dataset show that standard EEGNet achieves 82.43%, 75.07%, and 65.07% classification accuracy on 2-, 3-, and 4-class MI tasks in global validation, outperforming the state-of-the-art (SoA) convolutional neural network (CNN) by 2.05%, 5.25%, and 6.49%. Our novel method further scales down the standard EEGNet at a negligible accuracy loss of 0.31% with 7.6× memory footprint reduction and a small accuracy loss of 2.51% with 15× reduction. The scaled models are deployed on a commercial Cortex-M4F MCU taking 101 ms and consuming 4.28 mJ per inference for operating the smallest model, and on a Cortex-M7 with 44 ms and 18.1 mJ per inference for the medium-sized model, enabling a fully autonomous, wearable, and accurate low-power BCI. © 2020 IEEE., 2020 IEEE International Symposium on Medical Measurements and Applications (MeMeA), ISBN:978-1-7281-5386-5, ISBN:978-1-7281-5387-2
- Published
- 2020
25. Mixed-data-model heterogeneous compilation and OpenMP offloading
- Author
-
Luca Benini, Koen Wolters, Björn Forsberg, Andrea Marongiu, Alessandro Capotondi, Andreas Kurth, Tobias Grosser, Kurth A., Wolters K., Forsberg B., Capotondi A., Marongiu A., Grosser T., and Benini L.
- Subjects
Data Model ,050101 languages & linguistics ,Computer science ,Compiler ,02 engineering and technology ,Compilers ,Data Models ,Heterogeneous Computer Architectures ,Memory Sharing ,Offloading ,OpenMP ,Runtime Libraries ,computer.software_genre ,Data modeling ,0202 electrical engineering, electronic engineering, information engineering ,0501 psychology and cognitive sciences ,Programmer ,business.industry ,Suite ,Heterogeneous Computer Architecture ,05 social sciences ,Multiple data ,Data model ,Embedded system ,020201 artificial intelligence & image processing ,business ,computer ,Host (network) ,Efficient energy use - Abstract
Heterogeneous computers combine a general-purpose host processor with domain-specific programmable many-core accelerators, uniting high versatility with high performance and energy efficiency. While the host manages ever-more application memory, accelerators are designed to work mainly on their local memory. This difference in addressed memory leads to a discrepancy between the optimal address width of the host and the accelerator. Today 64-bit host processors are commonplace, but few accelerators exceed 32-bit addressable local memory, a difference expected to increase with 128-bit hosts in the exascale era. Managing this discrepancy requires support for multiple data models in heterogeneous compilers. So far, compiler support for multiple data models has not been explored, which hampers the programmability of such systems and inhibits their adoption. In this work, we perform the first exploration of the feasibility and performance of implementing a mixed-data-model heterogeneous system. To support this, we present and evaluate the first mixed-data-model compiler, supporting arbitrary address widths on host and accelerator. To hide the inherent complexity and to enable high programmer productivity, we implement transparent offloading on top of OpenMP. The proposed compiler techniques are implemented in LLVM and evaluated on a 64+32-bit heterogeneous SoC. Results on benchmarks from the PolyBench-ACC suite show that memory can be transparently shared between host and accelerator at overheads below 0.7% compared to 32-bit-only execution, enabling mixed-data-model computers to execute at near-native performance.
- Published
- 2020
26. An Open-Source Scalable Thermal and Power Controller for HPC Processors
- Author
-
Andrea Bartolini, Robert Balas, Andrea Tilli, Simone Benatti, Luca Benini, Giovanni Bambini, Christian Conficoni, Bambini G., Balas R., Conficoni C., Tilli A., Benini L., Benatti S., and Bartolini A.
- Subjects
Computer science ,business.industry ,Overhead (engineering) ,02 engineering and technology ,Chip ,Parallel microcontroller ,7. Clean energy ,020202 computer hardware & architecture ,Power (physics) ,Microcontroller ,HPC Processor ,Embedded system ,Thermal Control ,Scalability ,Power Control ,0202 electrical engineering, electronic engineering, information engineering ,Real-time OS ,Scalable ,business ,Real-time operating system ,Power control - Abstract
In the last decade, high performance multi-core processor designs have followed an increase in number of cores, interfaces, heterogeneity, and System-on-Chip (SoC) complexity. HPC applications also require tailored chip designs with specific operating points and performance indexes. In this scenario, an advanced and configurable Power Controller System (PCS) is needed to meet power and thermal constraints without static, ultra-conservative margins on the operating points. In this paper, we propose an open-source PCS design, based on a parallel ultra-low power microcontroller with RISC-V cores, and an open-source software environment based on a Real-time operating system (RTOS) with a configurable power-thermal control algorithm. Considering a 1 ms control interval, the overhead of the RTOS is about 6% of the cycles in the nominal case. The control algorithm is able to limit temperature and power consumption within given bounds while maximizing performance. The PCS is able to control up to 76 different cores/computing units with headroom for larger core counts.
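The core idea of the abstract above — a periodic control loop that keeps power and temperature within bounds while maximizing performance — can be illustrated with a minimal proportional-controller sketch. This is a toy model under stated assumptions, not the paper's algorithm; the function name, gain, and clamp values are all hypothetical.

```python
def power_cap_step(freq, power_w, p_max_w, k=0.1, f_min=0.4, f_max=1.0):
    """One control step: nudge the normalized core frequency proportionally
    to the power headroom, then clamp to the allowed operating range."""
    freq += k * (p_max_w - power_w)  # back off when over budget, recover when under
    return max(f_min, min(f_max, freq))
```

Run once per control interval (e.g. every 1 ms as in the paper), the frequency converges toward the highest setting that respects the cap; a real PCS would add a thermal bound and per-core actuation.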
- Published
- 2020
27. Optimizing Temporal Convolutional Network inference on FPGA-based accelerators
- Author
-
Gianfranco Deriu, Paolo Meloni, Luigi Raffo, Luca Benini, Marco Carreras, Carreras M., Deriu G., Raffo L., Benini L., and Meloni P.
- Subjects
Signal Processing (eess.SP) ,FOS: Computer and information sciences ,Speedup ,Computer science ,Inference ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Convolutional neural network ,Scheduling (computing) ,embedded system ,Field programmable gate arrays ,Task analysis ,Computer architecture ,Acceleration ,Kernel ,Quantization (signal) ,Neural networks ,Temporal convolutional network ,TCN ,hardware accelerator ,FPGA ,embedded systems ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,FOS: Electrical engineering, electronic engineering, information engineering ,Segmentation ,Electrical Engineering and Systems Science - Signal Processing ,Electrical and Electronic Engineering ,Computer Science - Hardware Architecture ,0105 earth and related environmental sciences ,Artificial neural network ,Computer engineering ,Kernel (image processing) ,Hardware acceleration ,020201 artificial intelligence & image processing - Abstract
Convolutional Neural Networks (CNNs) are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition and segmentation. Recent research results demonstrate that multi-layer (deep) networks involving mono-dimensional convolutions and dilation can be effectively used in time series and sequence classification and segmentation, as well as in tasks involving sequence modeling. These structures, commonly referred to as Temporal Convolutional Networks (TCNs), represent an extremely promising alternative to recurrent architectures, commonly used across a broad range of sequence modeling tasks. While FPGA-based inference accelerators for classic CNNs are widespread, the literature lacks a quantitative evaluation of their usability for inference on TCN models. In this paper we present such an evaluation, considering a CNN accelerator with specific features supporting TCN kernels as a reference and a set of state-of-the-art TCNs as a benchmark. Experimental results show that, during TCN execution, operational intensity can be critical for the overall performance. We propose a convolution scheduling based on batch processing that can boost efficiency up to 96% of theoretical peak performance. Overall we can achieve up to 111.8 GOPS and a power efficiency of 33.8 GOPS/W on an Ultrascale+ ZU3EG (up to 10× speedup and 3× power efficiency improvement with respect to a pure software implementation).
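The dilated mono-dimensional convolution mentioned in the abstract above is the core TCN operation. A minimal sketch, assuming a causal formulation and exponentially growing dilation (the function names are illustrative, not from the paper):

```python
def dilated_causal_conv1d(x, weights, dilation):
    """Causal 1-D convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    k = len(weights)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i in range(k):
            j = t - i * dilation  # reach back i*dilation samples, never forward
            if j >= 0:
                acc += weights[i] * x[j]
        y.append(acc)
    return y

def receptive_field(kernel_size, num_layers):
    """Receptive field of a TCN whose dilation doubles per layer (1, 2, 4, ...)."""
    return 1 + (kernel_size - 1) * (2 ** num_layers - 1)
```

The exponential receptive-field growth is what lets TCNs cover long sequences with few layers; on an accelerator, the strided access pattern of the dilated kernel is exactly what makes operational intensity and scheduling critical.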
- Published
- 2020
- Full Text
- View/download PDF
28. Neuro-PULP: A Paradigm Shift Towards Fully Programmable Platforms for Neural Interfaces
- Author
-
Simone Benatti, Pasquale Davide Schiavone, Song Luan, Yan Liu, Davide Rossi, Timothy G. Constandinou, Ian Williams, Luca Benini, Schiavone P.D., Rossi D., Liu Y., Benatti S., Luan S., Williams I., Benini L., and Constandinou T.
- Subjects
0209 industrial biotechnology ,Computer science ,business.industry ,02 engineering and technology ,Microcontroller ,020901 industrial engineering & automation ,Embedded system ,Neuro-PULP, Designing systems, brain-machine interfaces, FPGA, Parallel Ultra-Low-Power ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Field-programmable gate array ,Cluster analysis ,business ,Decoding methods ,Efficient energy use ,Neural decoding - Abstract
Designing systems with many recording channels is a major challenge in brain-machine interfaces. Power, bandwidth, and size requirements impose tight design constraints for implementing the required processing within an acceptable latency and battery life. Moreover, the variety of brain decoding algorithms requires highly versatile systems, such as microcontrollers (MCUs), that can be rapidly adapted to execute different tasks from experiment to experiment. However, state-of-the-art MCUs lack the performance and consume too much power to be used as generic platforms for neural decoding applications. To overcome the aforementioned limitations, this paper presents an MCU-based system consisting of a 64-channel event-based neural interface and a Parallel Ultra-Low-Power (PULP) platform that acquires and processes the neural activity. The flexibility of the system (called Neuro-PULP) has been demonstrated through the deployment of two applications: one compressing raw data in streaming mode for wireless transmission, and one generating the cluster and time-stamp of detected spikes, leveraging a low-power event-mode. The event-based approach, coupled with the energy efficiency of the PULP architecture, leads to a more than 4x improvement in energy efficiency with respect to state-of-the-art systems based on FPGAs, with an average power consumption of 114 μW/channel, while retaining the flexibility of fully programmable processor-based architectures.
- Published
- 2020
29. The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology
- Author
-
Luca Benini, Florian Zaruba, Zaruba F., and Benini L.
- Subjects
FOS: Computer and information sciences ,Architectural analysi ,Computer science ,02 engineering and technology ,7. Clean energy ,Instruction set ,RISCV ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,SIMD ,Electrical and Electronic Engineering ,Computer Science - Hardware Architecture ,Multi-core processor ,Out-of-order execution ,business.industry ,Microarchitecture ,Out of order ,Multicore processing ,Open source software ,Hardware ,Architectural analysis ,cost application-class ,020202 computer hardware & architecture ,Microcontroller ,Hardware and Architecture ,Embedded system ,RISC-V ,Virtual memory ,cost application-cla ,business ,Software - Abstract
The open-source RISC-V ISA is gaining traction, both in industry and academia. The ISA is designed to scale from micro-controllers to server-class processors. Furthermore, openness promotes the availability of various open-source and commercial implementations. Our main contribution in this work is a thorough power, performance, and efficiency analysis of the RISC-V ISA targeting baseline "application class" functionality, i.e. supporting the Linux OS and its application environment, based on our open-source single-issue in-order implementation of the 64-bit ISA variant (RV64GC) called Ariane. Our analysis is based on a detailed power and efficiency analysis of the RISC-V ISA extracted from silicon measurements and calibrated simulation of an Ariane instance (RV64IMC) taped-out in GlobalFoundries 22 FDX technology. Ariane runs at up to 1.7 GHz and achieves up to 40 Gop/s/W peak efficiency. We give insight into the interplay between functionality required for application-class execution (e.g. virtual memory, caches, multiple modes of privileged operation) and energy cost. Our analysis indicates that ISA heterogeneity and simpler cores with a few critical instruction extensions (e.g. packed SIMD) can significantly boost a RISC-V core's compute energy efficiency.
- Published
- 2019
30. Increasing the energy efficiency of microcontroller platforms with low-design margin co-processors
- Author
-
Jose Pineda da Gyvez, Andrea Bartolini, Davide Rossi, Andres Gomez, Luca Benini, Hamed Fatemi, Barış Can Kara, Electronic Systems, Gomez, A., Bartolini, A., Rossi, D., Can Kara, B., Fatemi, H., Pineda de Gyvez, J., and Benini, L.
- Subjects
Computer Networks and Communications ,business.industry ,Computer science ,Reliability (computer networking) ,020208 electrical & electronic engineering ,02 engineering and technology ,Energy consumption ,020202 computer hardware & architecture ,Core (optical fiber) ,Microcontroller ,Computer Networks and Communication ,Artificial Intelligence ,Hardware and Architecture ,Embedded system ,0202 electrical engineering, electronic engineering, information engineering ,SDG 7 - Affordable and Clean Energy ,business ,Software ,Energy (signal processing) ,Computer hardware ,SDG 7 – Betaalbare en schone energie ,Power density ,Efficient energy use - Abstract
Reducing the energy consumption of low-cost, performance-constrained microcontroller units (MCUs) cannot be achieved with complex energy minimization techniques (e.g., fine-grained DVFS, thermal management) due to their high overheads. To this end, we propose an energy-efficient, multi-core architecture combining two homogeneous cores with different design margins. One is a performance-guaranteed core, also called Heavy Core (HC), fabricated with a worst-case design margin. The other is a low-power core, called Light Core (LC), which has only a typical-corner design margin. Post-silicon measurements show that the Light core has a 30% lower power density compared to the Heavy core, with only a small loss in reliability. Furthermore, we derive the energy-optimal workload distribution and propose a runtime environment for Heavy/Light MCU platforms. The runtime decreases the overall energy by exploiting available parallelism to minimize the platform's active time. Results show that, depending on the core-to-peripherals power ratio and the Light core's operating frequency, the expected energy savings range from 10 to 20%.
- Published
- 2017
31. An Energy-Efficient IoT node for HMI applications based on an ultra-low power Multicore Processor
- Author
-
Luca Benini, Marco Guermandi, Victor Kartsch, Simone Benatti, Fabio Montagna, Kartsch V., Guermandi M., Benatti S., Montagna F., and Benini L.
- Subjects
Multi-core processor ,Computer science ,business.industry ,PULP ,Node (networking) ,Wearable computer ,EMG ,02 engineering and technology ,Embedded system ,Ultra-low power ,Multi-core ,Power budget ,020202 computer hardware & architecture ,Embedded systems ,multi-core ,ultra-low power ,Software ,Gesture recognition ,Sensor node ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,business ,Efficient energy use - Abstract
Developing wearable sensing technologies and unobtrusive devices is paving the way to the design of compelling applications for the next generation of systems for a smart IoT node for Human Machine Interaction (HMI). In this paper we present a smart sensor node for IoT and HMI based on a programmable Parallel Ultra-Low-Power (PULP) platform. We tested the system on a hand gesture recognition application, which is a preferred way of interaction in HMI design. A wearable armband with 8 EMG sensors is controlled by our IoT node, running a machine learning algorithm in real-time, recognizing up to 11 gestures with a power envelope of 11.84 mW. As a result, the proposed approach supports 35 hours of continuous operation and 1000 hours in standby. The resulting platform effectively minimizes the power required to run the software application, leaving more power budget for a high-quality AFE.
- Published
- 2019
32. Online Learning and Classification of EMG-Based Gestures on a Parallel Ultra-Low Power Platform Using Hyperdimensional Computing
- Author
-
Benatti, Simone, Montagna, Fabio, Kartsch, Victor, Rahimi, Abbas, Rossi, Davide, Benini, Luca, Benatti S., Montagna F., Kartsch V., Rahimi A., Rossi D., and Benini L.
- Subjects
Human-Machine Interface ,Automated ,hyperdimensional computing ,Computer science ,Embedded systems ,Biomedical Engineering ,Wearable computer ,02 engineering and technology ,Pattern Recognition ,Pattern Recognition, Automated ,Wearable Electronic Devices ,PULP platform ,Computer-Assisted ,0202 electrical engineering, electronic engineering, information engineering ,Humans ,Electrical and Electronic Engineering ,Hidden Markov model ,Embedded system ,HMI ,Gestures ,Electromyography ,gesture recognition ,020208 electrical & electronic engineering ,Signal Processing, Computer-Assisted ,Electromyography (EMG) ,Online learning ,Support vector machine ,Computer architecture ,Gesture recognition ,Scalability ,Personal computer ,Pattern recognition (psychology) ,Signal Processing ,Algorithms ,Gesture - Abstract
This work presents a wearable EMG gesture recognition system based on the hyperdimensional (HD) computing paradigm, running on a programmable Parallel Ultra-Low-Power (PULP) platform. The processing chain includes efficient on-chip training, which leads to a fully embedded implementation with no need to perform any offline training on a personal computer. The proposed solution has been tested on 10 subjects in a typical gesture recognition scenario, achieving 85% average accuracy on 11-gesture recognition, which is aligned with the State-of-the-Art (SoA), with the unique capability of performing online learning. Furthermore, by virtue of the hardware (HW)-friendly algorithm and of the efficient PULP System-on-Chip (SoC) (Mr. Wolf) used for prototyping and evaluation, the energy budget required to run the learning part with 11 gestures is 10.04 mJ, and 83.2 μJ per classification. The system works with an average power consumption of 10.4 mW in classification, ensuring around 29 h of autonomy with a 100 mAh battery. Finally, the scalability of the system is explored by increasing the number of channels (up to 256 electrodes), demonstrating the suitability of our approach as a universal, energy-efficient wearable biopotential recognition framework., IEEE Transactions on Biomedical Circuits and Systems, 13 (3), ISSN:1932-4545, ISSN:1940-9990
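The HD computing paradigm named in the abstract above builds classifiers from high-dimensional binary vectors using only XOR, majority voting, and Hamming distance, which is what makes it hardware-friendly. A minimal sketch, assuming binary hypervectors (all names are illustrative):

```python
import random

def rand_hv(dim, rng):
    """Random binary hypervector; high dimension makes random pairs near-orthogonal."""
    return [rng.randrange(2) for _ in range(dim)]

def bind(a, b):
    """XOR binding, e.g. associating a channel vector with a value vector."""
    return [x ^ y for x, y in zip(a, b)]

def bundle(hvs):
    """Bitwise majority vote: superimposes several hypervectors into one."""
    n = len(hvs)
    return [1 if sum(bits) * 2 > n else 0 for bits in zip(*hvs)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def classify(query, prototypes):
    """Nearest class prototype in Hamming distance."""
    return min(prototypes, key=lambda label: hamming(query, prototypes[label]))
```

Training a new gesture is just bundling its encoded samples into a prototype, which is why on-chip online learning is cheap compared with retraining an SVM or neural network.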
- Published
- 2019
33. FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things
- Author
-
Xiaying Wang, Lukas Cavigelli, Michele Magno, Luca Benini, Wang X., Magno M., Cavigelli L., and Benini L.
- Subjects
Signal Processing (eess.SP) ,FOS: Computer and information sciences ,Computer Science - Machine Learning ,Speedup ,Computer Networks and Communications ,Computer science ,neural networks (NNs) ,Machine Learning (stat.ML) ,02 engineering and technology ,multilayer perceptron (MLP) ,computer.software_genre ,wearable ,Machine Learning (cs.LG) ,embedded system ,low-power device ,Statistics - Machine Learning ,0202 electrical engineering, electronic engineering, information engineering ,FOS: Electrical engineering, electronic engineering, information engineering ,Electrical Engineering and Systems Science - Signal Processing ,Edge computing ,TinyML ,Artificial neural network ,business.industry ,Firmware ,Energy consumption ,Perceptron ,020202 computer hardware & architecture ,Computer Science Applications ,Internet of Things (IoT) ,Microcontroller ,machine learning ,Hardware and Architecture ,Embedded system ,Signal Processing ,020201 artificial intelligence & image processing ,business ,Edge AI ,computer ,Information Systems ,Efficient energy use - Abstract
The growing number of low-power smart devices in the Internet of Things is coupled with the concept of "Edge Computing", that is, moving some of the intelligence, especially machine learning, towards the edge of the network. Enabling machine learning algorithms to run on resource-constrained hardware, typically on low-power smart devices, is challenging in terms of hardware (optimized and energy-efficient integrated circuits), algorithmic and firmware implementations. This paper presents FANN-on-MCU, an open-source toolkit built upon the Fast Artificial Neural Network (FANN) library to run lightweight and energy-efficient neural networks on microcontrollers based on both the ARM Cortex-M series and the novel RISC-V-based Parallel Ultra-Low-Power (PULP) platform. The toolkit takes multi-layer perceptrons trained with FANN and generates code targeted at execution on low-power microcontrollers either with a floating-point unit (i.e., ARM Cortex-M4F and M7F) or without (i.e., ARM Cortex M0-M3 or PULP-based processors). This paper also provides an architectural performance evaluation of neural networks on the most popular ARM Cortex-M family and the parallel RISC-V processor called Mr. Wolf. The evaluation includes experimental results for three different applications using a self-sustainable wearable multi-sensor bracelet. Experimental results show a measured latency in the order of only a few microseconds and a power consumption of a few milliwatts while keeping the memory requirements below the limitations of the targeted microcontrollers. In particular, the parallel implementation on the octa-core RISC-V platform reaches a speedup of 22x and a 69% reduction in energy consumption with respect to a single-core implementation on Cortex-M4 for continuous real-time classification.
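The multi-layer perceptron inference that FANN-on-MCU generates code for reduces, per layer, to a matrix-vector product plus an activation. A float sketch of that forward pass is below; this is an illustrative simplification under our own naming, not the toolkit's generated code, which uses fixed-point or FPU-specific kernels.

```python
def relu(v):
    return v if v > 0 else 0

def mlp_forward(x, layers):
    """layers: list of (weight_matrix, bias_vector) pairs.
    ReLU on hidden layers, linear output layer."""
    for li, (W, b) in enumerate(layers):
        # one dense layer: y = W @ x + b, computed row by row
        x = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if li < len(layers) - 1:
            x = [relu(v) for v in x]
    return x
```

On an MCU without an FPU the same loop runs over fixed-point integers, which is why the toolkit emits different code for Cortex-M4F/M7F versus M0-M3 and PULP cores.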
- Published
- 2019
- Full Text
- View/download PDF
34. Idleness-Aware Dynamic Power Mode Selection on the i.MX 7ULP IoT Edge Processor
- Author
-
Luca Benini, Jose Pineda de Gyvez, Alfio Di Mauro, Hamed Fatemi, Mauro A.D., Fatemi H., de Gyvez J.P., and Benini L.
- Subjects
Power management ,Edge device ,Computer science ,02 engineering and technology ,01 natural sciences ,Reduction (complexity) ,Edge devices ,Energy efficiency ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,010302 applied physics ,business.industry ,lcsh:Applications of electric power ,Workload ,edge devices ,lcsh:TK4001-4102 ,020202 computer hardware & architecture ,Variable (computer science) ,Embedded system ,Dynamic demand ,Enhanced Data Rates for GSM Evolution ,business ,Efficient energy use - Abstract
Power management is a crucial concern in micro-controller platforms for the Internet of Things (IoT) edge. Many applications present a variable and hard-to-predict workload profile, usually driven by external inputs. The dynamic tuning of power consumption to the application requirements is indeed a viable approach to save energy. In this paper, we propose the implementation of a power management strategy for a novel low-cost low-power heterogeneous dual-core SoC for the IoT edge fabricated in 28 nm FD-SOI technology. As with more complex power management policies implemented on high-end application processors, we propose a power management strategy where the power mode is dynamically selected to ensure user-specified target idleness. We demonstrate that the dynamic power mode selection introduced by our power manager achieves more than 43% power consumption reduction with respect to static worst-case power mode selection, without any significant penalty in the performance of a running application., Journal of Low Power Electronics and Applications, 10 (2)
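Idleness-driven mode selection of the kind described above can be sketched as a lookup: each deeper sleep mode saves more power but only pays off when the observed idle fraction is high enough to amortize its wake-up cost. The mode names, power numbers, and thresholds below are illustrative placeholders, not the i.MX 7ULP's actual figures.

```python
# (name, sleep_power_mW, minimum idle fraction for the mode to pay off)
MODES = [
    ("RUN",  20.0, 0.0),
    ("WAIT",  5.0, 0.3),
    ("STOP",  0.5, 0.7),
    ("VLLS", 0.05, 0.95),
]

def select_mode(idle_fraction):
    """Deepest mode whose idleness requirement the observed workload meets."""
    best = MODES[0]
    for mode in MODES:  # list is ordered by increasing idleness requirement
        if idle_fraction >= mode[2]:
            best = mode
    return best[0]
```

A runtime power manager would re-evaluate this choice periodically from a measured idle fraction, so the mode tracks the workload instead of being fixed at design time to the worst case.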
- Published
- 2020
35. PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors
- Author
-
Manuele Rusci, Luca Benini, Angelo Garofalo, Francesco Conti, Davide Rossi, Garofalo A., Rusci M., Conti F., Rossi D., and Benini L.
- Subjects
FOS: Computer and information sciences ,Computer science ,General Mathematics ,General Physics and Astronomy ,Edge processing ,02 engineering and technology ,Parallel computing ,Data type ,Low power ,0202 electrical engineering, electronic engineering, information engineering ,Neural and Evolutionary Computing (cs.NE) ,Embedded system ,Digital signal processing ,Operating point ,Artificial neural network ,business.industry ,020208 electrical & electronic engineering ,General Engineering ,Computer Science - Neural and Evolutionary Computing ,Byte ,Articles ,020202 computer hardware & architecture ,Embedded systems ,Quantized neural networks ,Edge Processing ,Coupled cluster ,RISC-V ,business ,Efficient energy use - Abstract
We present PULP-NN, an optimized computing library for a parallel ultra-low-power tightly coupled cluster of RISC-V processors. The key innovation in PULP-NN is a set of kernels for Quantized Neural Network (QNN) inference, targeting byte and sub-byte data types, down to INT-1, tuned for the recent trend toward aggressive quantization in deep neural network inference. The proposed library exploits both the digital signal processing (DSP) extensions available in the PULP RISC-V processors and the cluster's parallelism, achieving up to 15.5 MACs/cycle on INT-8 and improving performance by up to 63x with respect to a sequential implementation on a single RISC-V core implementing the baseline RV32IMC ISA. Using PULP-NN, a CIFAR-10 network on an octa-core cluster runs in 30x and 19.6x less clock cycles than the current state-of-the-art ARM CMSIS-NN library, running on STM32L4 and STM32H7 MCUs, respectively. The proposed library, when running on GAP-8 processor, outperforms by 36.8x and by 7.45x the execution on energy efficient MCUs such as STM32L4 and high-end MCUs such as STM32H7 respectively, when operating at the maximum frequency. The energy efficiency on GAP-8 is 14.1x higher than STM32L4 and 39.5x higher than STM32H7, at the maximum efficiency operating point.
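The quantized MAC loop at the heart of kernels like those in the abstract above can be sketched as follows. This is a simplified stand-in under our own assumptions (power-of-two requantization, symmetric int8 range), not PULP-NN's actual kernel, which vectorizes the inner product with the PULP DSP extensions.

```python
def qmatvec_int8(W, x, scale, zero_point=0):
    """int8 weights/activations, wide accumulation, requantize back to int8.
    `scale` is a right-shift amount (power-of-two requantization)."""
    out = []
    for row in W:
        acc = 0  # a 32-bit accumulator in real kernels
        for w, xi in zip(row, x):
            acc += w * xi  # the MAC that the DSP extensions execute 4-at-a-time
        q = (acc >> scale) + zero_point  # cheap requantization
        out.append(max(-128, min(127, q)))  # saturate to the int8 range
    return out
```

Sub-byte variants (INT-4, INT-2, INT-1) pack several weights per byte and unpack them before the same accumulate-shift-saturate pattern, trading unpacking instructions for memory footprint.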
- Published
- 2019
36. Torpor: A Power-Aware HW Scheduler for Energy Harvesting IoT SoCs
- Author
-
J. Pineda de Gyvez, Hamed Fatemi, Andres Gomez, Lothar Thiele, Pascal A. Hager, P. Anagnostou, Luca Benini, Anagnostou, P., Gomez, A., Hager, P.A., Fatemi, H., De Gyvez, J. Pineda, Thiele, L., and Benini, L.
- Subjects
Control and Optimization ,Computer science ,business.industry ,Continuous monitoring ,Energy Engineering and Power Technology ,020206 networking & telecommunications ,02 engineering and technology ,Dynamic priority scheduling ,Torpor ,020202 computer hardware & architecture ,Scheduling (computing) ,Software ,Modeling and Simulation ,Embedded system ,Available energy ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,business ,Field-programmable gate array ,Energy harvesting - Abstract
2018 28th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), ISBN:978-1-5386-6365-3, ISBN:978-1-5386-6364-6, ISBN:978-1-5386-6366-0
- Published
- 2018
37. Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded Systems
- Author
-
Luca Benini, Nuria Pazos, Miguel de Prado, Prado M.D., Pazos N., and Benini L.
- Subjects
010302 applied physics ,FOS: Computer and information sciences ,Speedup ,business.industry ,Computer science ,Deep learning ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Inference ,02 engineering and technology ,01 natural sciences ,Convolutional neural network ,Bottleneck ,020202 computer hardware & architecture ,Libraries, Acceleration, Engines, Reinforcement learning, Space exploration, Optimization, Computer architecture ,Random search ,Embedded system ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Reinforcement learning ,Artificial intelligence ,Inference engine ,business - Abstract
Deep Learning is increasingly being adopted by industry for computer vision applications running on embedded devices. While Convolutional Neural Networks' accuracy has reached a mature and remarkable state, inference latency and throughput remain a major concern, especially when targeting low-cost and low-power embedded platforms. CNNs' inference latency may become a bottleneck for Deep Learning adoption by industry, as it is a crucial specification for many real-time processes. Furthermore, deployment of CNNs across heterogeneous platforms presents major compatibility issues due to vendor-specific technology and acceleration libraries. In this work, we present QS-DNN, a fully automatic search based on Reinforcement Learning which, combined with an inference engine optimizer, efficiently explores the design space and empirically finds the optimal combinations of libraries and primitives to speed up the inference of CNNs on heterogeneous embedded devices. We show that an optimized combination can achieve a 45x speedup in inference latency on CPU compared to a dependency-free baseline, and 2x on average on GPGPU compared to the best vendor library. Further, we demonstrate that the quality of results and time-to-solution are much better than with Random Search, achieving up to 15x better results for a short-time search.
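The search described above can be approximated, in spirit, by an epsilon-greedy loop over per-layer primitive choices, accepting a candidate assignment when it lowers measured total latency. The sketch below is a toy stand-in; the function names, the latency table and the epsilon-greedy policy are assumptions, not QS-DNN's actual algorithm:

```python
import random

def greedy_search(layers, primitives, latency, episodes=200, eps=0.2, seed=0):
    """Toy RL-style search: per-layer epsilon-greedy exploration,
    keeping the assignment with the lowest total measured latency."""
    rng = random.Random(seed)
    best = {l: rng.choice(primitives) for l in layers}
    for _ in range(episodes):
        # explore each layer with probability eps, else keep the incumbent
        cand = {l: (rng.choice(primitives) if rng.random() < eps else best[l])
                for l in layers}
        if (sum(latency(l, p) for l, p in cand.items())
                < sum(latency(l, p) for l, p in best.items())):
            best = cand
    return best

# Toy latency model: 'winograd' is fastest on conv layers, 'gemm' elsewhere.
def latency(layer, prim):
    table = {('conv', 'winograd'): 1, ('conv', 'gemm'): 3, ('conv', 'direct'): 2,
             ('fc', 'winograd'): 4, ('fc', 'gemm'): 1, ('fc', 'direct'): 2}
    return table[(layer, prim)]

print(greedy_search(['conv', 'fc'], ['winograd', 'gemm', 'direct'], latency))
```

In a real deployment the `latency` callback would be an on-device measurement, which is what makes the search empirical rather than model-based.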
- Published
- 2018
- Full Text
- View/download PDF
38. Hero: An open-source research platform for HW/SW exploration of heterogeneous manycore systems
- Author
-
Pirmin Vogel, Andreas Kurth, Andrea Marongiu, Luca Benini, Alessandro Capotondi, Kurth A., Capotondi A., Vogel P., Benini L., Marongiu A., Bartolini, Andrea, Cardoso, João M.P., and Silvano, Cristina
- Subjects
Heterogeneous soc ,Computer science ,02 engineering and technology ,Parallel Architectures ,Heterogeneous (hybrid) systems ,System on a chip ,Reconfigurable Logic and FPGAs ,Parallel programming language ,Software ,0202 electrical engineering, electronic engineering, information engineering ,HERO ,Field-programmable gate array ,Multi-core processor ,Shared virtual memory ,business.industry ,020208 electrical & electronic engineering ,Heterogeneous socs ,Multi- and many-core architectures ,020202 computer hardware & architecture ,Parallel processing (DSP implementation) ,Embedded system ,Scalability ,Parallel programming model ,business ,Multi- and many-core architecture - Abstract
Heterogeneous systems on chip (HeSoCs) co-integrate a high-performance multicore host processor with programmable manycore accelerators (PMCAs) to combine "standard platform" software support (e.g. the Linux OS) with energy-efficient, domain-specific, highly parallel processing capabilities. In this work, we present HERO, a HeSoC platform that tackles this challenge in a novel way: HERO's host processor is an industry-standard ARM Cortex-A multicore complex, while its PMCA is a scalable, silicon-proven, open-source manycore processing engine based on the extensible, open RISC-V ISA. We evaluate a prototype implementation of HERO, where the PMCA, implemented on an FPGA fabric, is coupled with a hard ARM Cortex-A host processor, and show that the runtime overhead compared to manually written PMCA code operating on private physical memory is below 10% for pivotal benchmarks and operating conditions. Thus, HERO demonstrates that ARM and RISC-V can productively coexist in a dual-ISA HW-SW platform., ISBN:978-1-4503-6591-8
- Published
- 2018
39. Modeling and Evaluation of Application-Aware Dynamic Thermal Control in HPC Nodes
- Author
-
Luca Benini, Daniele Cesarini, Andrea Bartolini, Alma Mater Studiorum Università di Bologna [Bologna] (UNIBO), Eidgenössische Technische Hochschule - Swiss Federal Institute of Technology [Zürich] (ETH Zürich), Michail Maniatakos, Ibrahim (Abe) M. Elfadel, Matteo Sonza Reorda, H. Fatih Ugurdag, José Monteiro, Ricardo Reis, TC 10, WG 10.5, Cesarini D., Bartolini A., and Benini L.
- Subjects
ILP ,Workload model ,Computer science ,Power model ,Power saving ,02 engineering and technology ,01 natural sciences ,Thermal constraint ,Energy saving ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Leverage (statistics) ,[INFO]Computer Science [cs] ,010302 applied physics ,Operating point ,business.industry ,Workload ,Thermal model ,Thermal control ,020202 computer hardware & architecture ,DTM ,Runtime ,Embedded system ,HPC ,Quantum ESPRESSO ,MPI ,business - Abstract
As side effects of the end of Dennard scaling, power and thermal technological walls stand in the way of the evolution of supercomputers towards the exaflops era. Energy and temperature are major challenges to face in order to ensure constant performance growth in the future. New-generation HPC architectures implement HW and SW components to address energy and thermal issues and to enable efficient computing on scientific workloads. In thermal-bound HPC machines, workload-aware runtimes can leverage hardware knobs to guarantee the best operating point in terms of performance and power saving without violating thermal constraints. In this paper, we present an integer linear programming formulation for job mapping and frequency selection on thermal-bound HPC nodes. We use a fast solver and workload traces extracted from a real supercomputer to test our methodology. Our runtime is integrated into the MPI library and is capable of assigning high-performance cores to performance-critical processes. Critical processes are identified at execution time through a mathematical formulation which relies on the characterization of the application workload and on the global synchronization barriers. We demonstrate that by combining long- and short-horizon predictions with information on the critical processes retrieved from the programming model, we can drastically improve the performance of the target application w.r.t. state-of-the-art DTM solutions.
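A drastically simplified, greedy relaxation of such a frequency-selection problem can be sketched as follows: start every core at the top frequency, then step down the least-loaded cores until total power fits under a cap. The heuristic, the toy power table and every name are illustrative assumptions, not the paper's ILP formulation:

```python
def assign_frequencies(loads, freqs, power, cap):
    """Greedy frequency selection under a power cap (toy sketch).
    loads: per-core utilization in [0, 1]; freqs: sorted ascending;
    power: per-core power (W) at each frequency."""
    levels = [len(freqs) - 1] * len(loads)           # start all cores at f_max
    while sum(power[freqs[l]] for l in levels) > cap:
        # step down the least-loaded core that can still be lowered
        cands = [i for i, l in enumerate(levels) if l > 0]
        if not cands:
            break                                    # cap infeasible even at f_min
        i = min(cands, key=lambda i: loads[i])
        levels[i] -= 1
    return [freqs[l] for l in levels]

freqs = [1.2, 1.8, 2.4]                              # GHz
power = {1.2: 10.0, 1.8: 18.0, 2.4: 30.0}            # W per core (toy numbers)
print(assign_frequencies([0.9, 0.2, 0.5], freqs, power, cap=70.0))
```

An ILP solver, as used in the paper, would instead optimize all core/frequency assignments jointly and can trade off the critical-process constraints this greedy loop ignores.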
- Published
- 2017
40. Towards a Novel HMI Paradigm Based on Mixed EEG and Indoor Localization Platforms
- Author
-
Michela Milano, Victor Kartsch, Simone Benatti, Mattia Salvaro, Luca Benini, Salvaro, M., Kartsch, V., Benatti, S., Milano, M., and Benini, L.
- Subjects
EEG, Indoor location, HMI, wearable, android ,0209 industrial biotechnology ,Engineering ,business.industry ,Wearable computer ,Location awareness ,020206 networking & telecommunications ,Cloud computing ,02 engineering and technology ,computer.software_genre ,020901 industrial engineering & automation ,Human–computer interaction ,Embedded system ,Location-based service ,0202 electrical engineering, electronic engineering, information engineering ,Global Positioning System ,Human–machine system ,business ,computer ,Decoding methods ,Wearable technology - Abstract
Location Based Services (LBS) have been gaining a great deal of attention thanks to their capability to enhance mobile services with location awareness. While outdoor localization is almost universally achieved via the Global Positioning System (GPS), indoor localization is still challenging and a general solution is yet to be found. In a vision where wearable devices are taking over smartphones' leading role as the gateway to the cyber world, new paradigms of interactive Human Machine Interfaces (iHMI) are arising. Among others, one of the most intriguing alternative iHMIs is based on decoding brain signals. Combining EEG activity data and indoor localization could dramatically improve the pervasiveness of the interaction between humans, devices and the environment. For these reasons, we propose a portable hardware-software platform that acquires brain EEG signals using a dedicated board along with position information from a cloud service. The positive results of the preliminary analysis successfully show the correlation between the EEG signal and motion. Understanding that this is one of the first attempts to merge these two sources of information, we intend to share the ever-growing dataset publicly to allow other researchers to better investigate the interaction between subjects and environments, and to lay the foundations of new paradigms in HMI.
- Published
- 2017
41. Aging-Aware Energy-Efficient Workload Allocation for Mobile Multimedia Platforms
- Author
-
Luca Benini, Francesco Paterna, Andrea Acquaviva, Paterna F., Acquaviva A., and Benini L.
- Subjects
Computer science ,Distributed computing ,Context (language use) ,02 engineering and technology ,multicore/single-chip multiprocessor ,computer.software_genre ,0202 electrical engineering, electronic engineering, information engineering ,Resource management ,SCHEDULING AND TASK PARTITIONING ,Multi-core processor ,Multimedia ,business.industry ,Quality of service ,020208 electrical & electronic engineering ,Workload ,Energy consumption ,020202 computer hardware & architecture ,Energy conservation ,Allocator ,Computational Theory and Mathematics ,Hardware and Architecture ,Embedded system ,RELIABILITY ,Signal Processing ,Resource allocation ,business ,computer ,Efficient energy use - Abstract
Multicore platforms are characterized by increasing variability and aging effects, which imply heterogeneity in core performance, energy consumption and reliability. In particular, wear-out effects such as Negative Bias Temperature Instability (NBTI) require run-time adaptation of system resource utilization to time-varying and uneven platform degradation, so as to prevent premature chip failure. In this context, task allocation techniques can be used to deal with heterogeneous cores and extend chip lifetime while minimizing energy and preserving Quality of Service (QoS). We propose a new formulation of the task allocation problem for variability-affected platforms, which manages per-core utilization to achieve a target lifetime while minimizing energy consumption during the execution of rate-constrained multimedia applications. We devise an adaptive solution that can be applied on-line and approximates the result of an optimal, off-line version. Our allocator has been implemented and tested on real-life functional workloads running on a timing-accurate simulator of a next-generation industrial multicore platform. We extensively assess the effectiveness of the on-line strategy against both the optimal solution and alternative state-of-the-art policies. The proposed policy outperforms state-of-the-art strategies in terms of lifetime preservation, while saving up to 20% of energy consumption without impacting timing constraints.
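The core idea of steering utilization by wear can be illustrated with a toy worst-fit allocator: each core gets a utilization budget that shrinks with its accumulated damage, and tasks go to the core with the most remaining headroom. The linear budget model and all names below are illustrative assumptions, not the paper's formulation:

```python
def aging_aware_allocate(tasks, damage):
    """Toy aging-aware allocator.
    tasks: per-task utilizations in [0, 1]; damage: per-core wear in [0, 1).
    Returns a list of (task_utilization, core_index) placements."""
    budget = [1.0 - d for d in damage]               # toy budget: less wear, more budget
    used = [0.0] * len(damage)
    placement = []
    for u in sorted(tasks, reverse=True):            # place largest tasks first
        i = max(range(len(damage)), key=lambda c: budget[c] - used[c])
        if used[i] + u > budget[i]:
            raise ValueError("no feasible allocation")
        used[i] += u
        placement.append((u, i))
    return placement

print(aging_aware_allocate([0.4, 0.3, 0.2], [0.5, 0.1]))
```

A real allocator, as in the paper, would additionally weigh per-core energy cost and adapt budgets on-line as degradation estimates are refreshed.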
- Published
- 2013
42. Computing Accurate Performance Bounds for Best Effort Networks-on-Chip
- Author
-
G. De Micheli, Srinivasan Murali, Federico Angiolini, Luca Benini, Dara Rahmati, Hamid Sarbazi-Azad, Rahmati D., Murali S., Benini L., Angiolini F., De Micheli G., and Sarbazi-Azad H.
- Subjects
Computer science ,best-effort analysis ,QoS ,real-time ,02 engineering and technology ,Integrated circuit design ,Network topology ,01 natural sciences ,Theoretical Computer Science ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Networks on chip ,Wormhole switching ,010302 applied physics ,NoC, QoS, Soc ,real time ,business.industry ,Quality of service ,analytical model ,wormhole switching ,020202 computer hardware & architecture ,Network on a chip ,Computational Theory and Mathematics ,Hardware and Architecture ,Embedded system ,SoC ,business ,NoC ,best-effort analysi ,performance ,Software - Abstract
Real-time (RT) communication support is a critical requirement for many complex embedded applications which are currently targeted to Network-on-Chip (NoC) platforms. In this paper, we present novel methods to efficiently calculate worst-case bandwidth and latency bounds for RT traffic streams on wormhole-switched NoCs with arbitrary topology. The proposed methods apply to best-effort NoC architectures, with no extra hardware dedicated to RT traffic support. By applying our methods to several realistic NoC designs, we show substantial improvements (more than 30 percent in bandwidth and 50 percent in latency, on average) in bound tightness with respect to existing approaches.
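For contrast with the tighter analysis the paper derives, a naive worst-case latency bound for a wormhole-switched flow can be computed from hop count, serialization time and the number of interfering flows per link. This is a back-of-the-envelope sketch under strong assumptions (full blocking by every interferer on every hop), not the paper's method:

```python
def naive_latency_bound(hops, flit_cycles, flits, interferers):
    """Pessimistic wormhole latency bound (toy sketch):
    head-flit routing delay per hop, plus packet serialization time,
    inflated by assuming every interfering flow blocks the packet once."""
    serialization = flits * flit_cycles              # cycles to stream the body
    return hops * flit_cycles + serialization * (1 + interferers)

# 4-hop route, 1 cycle per flit per hop, 8-flit packets, 2 interfering flows
print(naive_latency_bound(hops=4, flit_cycles=1, flits=8, interferers=2))
```

Such naive bounds grow quickly with contention, which is precisely why tighter analytical bounds of the kind the abstract reports (30 to 50 percent improvements) matter for best-effort NoCs.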
- Published
- 2013
43. Scalable EEG seizure detection on an ultra low power multi-core architecture
- Author
-
Luca Benini, Fabio Montagna, Davide Rossi, Simone Benatti, Benatti, S., Montagna, F., Rossi, D., and Benini, L.
- Subjects
Smart system ,Signal processing ,Computer science ,business.industry ,020208 electrical & electronic engineering ,Biomedical Engineering ,Wearable computer ,02 engineering and technology ,Power budget ,020202 computer hardware & architecture ,Parallel processing (DSP implementation) ,Embedded system ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,Electrical and Electronic Engineering ,business ,Instrumentation ,Efficient energy use - Abstract
Energy-efficient processing architectures are key elements of wearable and implantable medical devices. Signal processing of neural data is a challenge in new designs of Brain Machine Interfaces (BMI). A highly efficient multi-core platform designed for ultra-low-power processing allows the execution of complex algorithms while complying with real-time requirements. This paper describes the implementation and optimization of a seizure detection algorithm on a multi-core digital integrated circuit designed for energy-efficient applications. The proposed architecture is able to perform ultra-low-power parallel seizure detection on 23 electrodes within a power budget of 1 mW, outperforming implementations on commercial MCUs by up to 100 times in terms of performance and up to 80 times in terms of energy efficiency, while still providing high versatility and scalability, opening the way to the development of efficient implantable and wearable smart systems.
- Published
- 2016
44. Long-Term ECG monitoring with zeroing Compressed Sensing approach
- Author
-
Riccardo Rovatti, Daniele Bortolotti, Andrea Bartolini, Gianluca Setti, Luca Benini, Fabio Pareschi, Mauro Mangia, Mangia, M., Bortolotti, D., Bartolini, A., Pareschi, F., Benini, L., Rovatti, R., and Setti, G.
- Subjects
Zero current switching ,Signal processing ,Standards ,Monitoring ,business.industry ,Computer science ,Nonvolatile memory ,Electrocardiography, Digital signal processing, Zero current switching,Monitoring,Nonvolatile memory,Standards, Biomedical monitoring ,Data compression ratio ,Digital signal processing ,NO ,Non-volatile memory ,Electrocardiography ,Compressed sensing ,Hardware and Architecture ,Embedded system ,Wireless ,Latency (engineering) ,Electrical and Electronic Engineering ,business ,Computer hardware ,Energy (signal processing) ,Biomedical monitoring - Abstract
Novel low-voltage, low-latency, non-volatile memory (NVM) technologies allow long-term wearable biomedical monitors to benefit from large storage capability, avoiding costly wireless transmissions and enabling, along with proper signal processing and architectural optimization, minimal-energy operation and extended battery life. The recently proposed rakeness-based Compressed Sensing (RCS) offers a high compression rate at low computational cost. This allows an energy trade-off between the compression stage and the storage stage. In this paper we introduce a novel approach, namely zeroing CS, which reduces RCS computational requirements to extremely low levels. The new energy trade-off is analyzed, considering a suitable multi-core DSP and different NVM technologies for local storage. According to our analysis, the proposed zeroing approach is up to 80% more efficient than a standard CS solution and up to 70% more efficient than RCS when the overall energy requirement is not dominated by storage.
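The arithmetic saving offered by a sparse sensing stage can be illustrated with a ternary sensing matrix, where zero entries cost no multiply-accumulate at all. This is one illustrative reading of why "zeroing" cuts computation, not the paper's exact construction:

```python
import numpy as np

def cs_measure(x, phi):
    """Compressed-sensing measurement y = phi @ x. With a sparse ternary
    sensing matrix, the zero entries require no multiply-accumulate."""
    return phi @ x

rng = np.random.default_rng(0)
n, m = 256, 64                               # 4x compression ratio
# Ternary sensing matrix: 80% of entries are zero (illustrative density)
phi = rng.choice([-1, 0, 1], size=(m, n), p=[0.1, 0.8, 0.1])
x = rng.standard_normal(n)                   # stand-in for an ECG window
y = cs_measure(x, phi)
macs = np.count_nonzero(phi)                 # MACs actually needed per window
print(y.shape, macs, "of", m * n, "MACs")
```

With the density above, roughly four in five multiply-accumulates vanish relative to a dense antipodal matrix, which is the kind of computation-versus-storage trade-off the abstract analyzes.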
- Published
- 2015
45. Specification and analysis of power-managed systems
- Author
-
G. De Micheli, Luca Benini, Emanuele Lattanzi, Alessandro Bogliolo, BOGLIOLO A., BENINI L., LATTANZI E., and DE MICHELI G.
- Subjects
Power management ,Engineering ,business.industry ,Energy management ,Semantics (computer science) ,Distributed computing ,System requirements specification ,Microarchitecture ,Nondeterministic algorithm ,Power system simulation ,Embedded system ,Electrical and Electronic Engineering ,business ,Real-time operating system - Abstract
Dynamic power management encompasses several techniques for reducing energy dissipation in electronic systems by selective slowdown or shutdown of components. We present a theoretical framework for explaining and classifying different approaches to power management. Within this framework, we model power-manageable components, workloads, and controllers as discrete-event systems (DESs). The structure of these DESs is specified in terms of physical states (representing operation modes) and events (triggering state transitions), while system behavior is specified in terms of next-event and next-state functions. In particular, nondeterministic next-event and next-state functions are modeled by conditional probability distributions, according to generalized semi-Markov processes (GSMPs). The modeling framework provides a general denotational model for system specification and a rigorous execution semantics that enables event-driven simulation. We introduce a modeling framework, built on top of MathWorks' Simulink, supporting the specification and execution of our model. In particular, we present templates for the Simulink simulator to execute GSMP models, and we describe how to use such templates for specifying, analyzing, and optimizing dynamic power-managed systems. Finally, we demonstrate the expressive power and versatility of the proposed approach by using the modeling framework and the simulator for the analysis of representative real-life case studies, including the Intel XScale processor architecture, a multitasking real-time system, and a sensor network.
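A timeout-based shutdown policy of the kind such DES models capture can be sketched in a few lines of event-driven simulation: the component serves requests, idles, and drops to a sleep state once a timeout expires without new arrivals. The state powers, the policy and all names are illustrative assumptions:

```python
def simulate(events, timeout, p_run=1.0, p_idle=0.3, p_sleep=0.01):
    """Toy event-driven simulation of a timeout power-management policy.
    events: (arrival, service) pairs sorted by arrival time.
    Returns (completion_time, energy) in arbitrary units."""
    t, energy = 0.0, 0.0
    for arrival, service in events:
        idle = max(0.0, arrival - t)
        energy += min(idle, timeout) * p_idle        # idle until timeout fires
        energy += max(0.0, idle - timeout) * p_sleep # then sleep until arrival
        t = max(t, arrival) + service
        energy += service * p_run                    # serve the request
    return t, energy

print(simulate([(0, 2), (10, 1), (11, 3)], timeout=4))
```

Replacing the fixed arrival list with draws from conditional distributions turns this deterministic trace into the GSMP-style stochastic simulation the framework formalizes.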
- Published
- 2004
46. Aging-aware compiler-directed VLIW assignment for GPGPU architectures
- Author
-
Abbas Rahimi, Rajesh Gupta, Luca Benini, Rahimi A., Benini L., and Gupta R.K.
- Subjects
010302 applied physics ,business.industry ,Computer science ,Reliability (computer networking) ,02 engineering and technology ,Parallel computing ,computer.software_genre ,01 natural sciences ,020202 computer hardware & architecture ,Instruction set ,Software ,Very long instruction word ,Embedded system ,Kernel (statistics) ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Compiler ,General-purpose computing on graphics processing units ,business ,computer ,graphics processing units instruction sets negative bias temperature instability operating system kernels parallel architectures program compilers - Abstract
Negative bias temperature instability (NBTI) adversely affects the reliability of a processor by introducing new delay-induced faults. However, the effect of these delay variations is not uniformly spread across functional units and instructions: some are affected more (hence less reliable) than others. This paper proposes an NBTI-aware compiler-directed very long instruction word (VLIW) assignment scheme that uniformly distributes the stress of instructions with the aim of minimizing aging of a GPGPU architecture without any performance penalty. The proposed solution is an entirely software technique based on static workload characterization and online execution with NBTI monitoring, which equalizes the expected lifetime of each processing element by regenerating aging-aware healthy kernels that respond to the specific health state of the GPGPU. We demonstrate our approach on the AMD Evergreen architecture, where iso-throughput executions of the healthy kernels reduce NBTI-induced threshold-voltage shift by up to 49% (11%) compared to naïve kernel executions, with (without) architectural support for power gating. The kernel adaptation flow takes an average of 13 milliseconds on a typical host machine, making it suitable for practical implementation.
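The uniform-stress idea can be illustrated by rotating which functional-unit slot each operation occupies instead of always packing instructions from slot 0, which would age slot 0 fastest. The sketch below is a toy stand-in, not the paper's compiler pass:

```python
def balanced_assignment(instr_stream, n_units):
    """Toy aging-aware slot assignment: rotate the VLIW slot each
    instruction occupies so stress spreads evenly across functional units."""
    usage = [0] * n_units
    schedule = []
    for i, instr in enumerate(instr_stream):
        slot = i % n_units                  # round-robin rotation across slots
        usage[slot] += 1
        schedule.append((instr, slot))
    return schedule, usage

sched, usage = balanced_assignment(["add", "mul", "sub", "mac", "add", "mul"], 4)
print(usage)
```

The actual scheme goes further: it weights assignments by per-unit health estimates from NBTI monitors rather than rotating blindly, so already-degraded units receive less stress.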
- Published
- 2013
47. Combined methods to extend the lifetime of power hungry WSN with multimodal sensors and nanopower wakeups
- Author
-
Davide Brunelli, Emanuel Popovici, Stevan Marinkovic, Michele Magno, Luca Benini, Magno M., Marinkovic S., Brunelli D., Benini L., and Popovici E.
- Subjects
Power management ,surveilance ,pyroelectric sensor ,low power ,business.industry ,Computer science ,Node (networking) ,Energy consumption ,multimodal network ,Bottleneck ,Intelligent sensor ,Embedded system ,wake-up radio trigger ,Wireless ,power management ,Smart camera ,business ,Wireless sensor network - Abstract
In recent years, there has been growing interest in wireless sensor networks (WSNs) and in the opportunities opened up by this technology. Since energy consumption is a bottleneck in WSNs, reducing it has a significant impact on the applicability of this technology. Typically, the energy consumed by wireless communication and by power-hungry sensors such as CMOS imagers or gas sensors dominates the power required for computation or other activities of the node. Hence, efficient resource management that reduces unnecessary communication and minimizes the use of power-hungry sensors while keeping the same performance is desirable to extend the lifetime of the network. In this paper we address the challenges of exploiting wake-up receivers and heterogeneous sensors in WSN applications to reduce the average power consumption of individual nodes. In particular, we show how to configure a WSN that includes Pyroelectric InfraRed (PIR) sensors, smart camera sensors and a nano-watt wake-up radio as a secondary radio receiver to efficiently extend the autonomy of the system. The evaluation of the proposed approach shows a significant reduction in the activity of the primary radio and of the high-power sensors while keeping the same accuracy. We prototyped and tested the nodes, and used their characterization to demonstrate through simulations the power consumption reduction and the lifetime extension of the network in a typical surveillance application.
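The lifetime gain from a nano-power wake-up receiver follows from the standard duty-cycle average-power formula: the main radio's on-time shrinks to the rare occasions a trigger fires. The numbers below are illustrative, not measurements from the paper:

```python
def avg_power(p_active, p_sleep, t_active, period):
    """Average power of a duty-cycled node: time-weighted mean of the
    active and sleep power draws over one cycle."""
    duty = t_active / period
    return duty * p_active + (1 - duty) * p_sleep

# Duty-cycled main radio: listen 10 ms every second (illustrative figures)
print(avg_power(p_active=60e-3, p_sleep=5e-6, t_active=0.01, period=1.0))
# Wake-up-radio node: main radio on only ~0.1 ms/s, nW-class listener otherwise
print(avg_power(p_active=60e-3, p_sleep=400e-9, t_active=1e-4, period=1.0))
```

Even with these rough figures, the always-listening nano-watt receiver cuts average power by roughly two orders of magnitude, which is the motivation for using it as a secondary radio.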
- Published
- 2012
48. Platform 2012, a many-core computing accelerator for embedded SoCs
- Author
-
Germain Haugou, Bruno Jego, Fabien Clermidy, Thierry Lepley, Diego Melpignano, Eric Flamand, Denis Dutoit, Luca Benini, Melpignano D., Benini L., Flamand E., Jego B., Lepley T., Haugou G., Clermidy F., and Dutoit D.
- Subjects
Power management ,Visual analytics ,Computer science ,business.industry ,CMOS ,power consumption ,P2012 ,Synchronization ,Asynchronous communication ,Embedded system ,Programming paradigm ,System on a chip ,power management ,business - Abstract
P2012 is an area- and power-efficient many-core computing accelerator based on multiple globally asynchronous, locally synchronous processor clusters. Each cluster features up to 16 processors with independent instruction streams sharing a multi-banked one-cycle-access L1 data memory, a multi-channel DMA engine and specialized hardware for synchronization and aggressive power management. P2012 is 3D-stacking ready and can be customized to achieve extreme area and energy efficiency by adding domain-specific HW IPs to the cluster. The first P2012 SoC prototype in 28nm CMOS will sample in Q3, featuring four 16-processor clusters and a 1MB L2 memory, and delivering 80GOPS (with 32-bit single-precision floating-point support) in 18 mm2 with 2W power consumption (worst case). P2012 can run standard OpenCL and proprietary Native Programming Model SW components to achieve the highest level of control over application-to-resource mapping. A dedicated version of the OpenCV vision library is provided in the P2012 SW Development Kit to enable visual analytics acceleration. This paper discusses preliminary performance measurements of common feature extraction and tracking algorithms, parallelized on P2012, versus sequential execution on ARM CPUs.
- Published
- 2012
49. Low-power processor architecture exploration for online biomedical signal analysis
- Author
-
Jeremy Constantin, Ahmed Yasir Dogan, Andreas Burg, David Atienza, Luca Benini, Dogan A.Y., Constantin J., Atienza D., Burg A., and Benini L.
- Subjects
Signal processing ,Interconnection ,BIOMEDICAL ELECTRONICS ,business.industry ,Computer science ,Computation ,Microarchitecture ,parallel architecture ,Instruction set ,Microcontroller ,Control and Systems Engineering ,microcontroller ,Embedded system ,Biosignal ,Electrical and Electronic Engineering ,Crossbar switch ,business ,medical signal processing ,low-power electronic - Abstract
In this study, the authors explore sequential and parallel processing architectures, utilising a custom ultra-low-power (ULP) processing core, to extend the lifetime of health monitoring systems, where slow biosignal events and highly parallel computations exist. To this end, a single- and a multi-core architecture are proposed and compared. The single-core architecture is composed of one ULP processing core, an instruction memory (IM) and a data memory (DM), while the multi-core architecture consists of several ULP processing cores, individual IMs for each core, a shared DM and an interconnection crossbar between the cores and the DM. These architectures are compared with respect to power/performance trade-offs for different target workloads of online biomedical signal analysis, while exploiting near-threshold computing. The results show that with respect to the single-core architecture, the multi-core solution consumes 62% less power for high computation requirements (167 MOps/s), while consuming 46% more power for extremely low computation needs, where power consumption is dominated by leakage. Additionally, the authors show that the proposed ULP processing core, using a simplified instruction set architecture (ISA), achieves energy savings of 54% compared to a reference microcontroller ISA (PIC24).
- Published
- 2012
50. Static Thermal Model Learning for High-Performance Multicore Servers
- Author
-
Francesco Beneventi, Luca Benini, Andrea Bartolini, Beneventi F., Bartolini A., and Benini L.
- Subjects
multiprocessing system ,Multi-core processor ,Computer science ,business.industry ,Linux ,power aware computing ,Hardware_PERFORMANCEANDRELIABILITY ,Thermal management of electronic devices and systems ,Dissipation ,Optimal control ,Temperature measurement ,Reliability engineering ,least squares approximation ,Robustness (computer science) ,thermal management (packaging) ,Server ,Embedded system ,Thermal ,Hardware_INTEGRATEDCIRCUITS ,business ,temperature measurement - Abstract
Aggressive thermal management is a critical feature for high-end computing platforms, as worst-case thermal budgeting is becoming unaffordable. Reactive thermal management, which sets temperature thresholds to trigger thermal capping actions, is too "near-sighted", and it may lead to severe performance degradation and thermal overshoots. More aggressive proactive thermal management minimizes the performance penalty with smooth optimal control, but it requires precise knowledge of the system's thermal model. Unfortunately, in practice these models are not provided by equipment manufacturers, and they strongly depend on the deployment environment. Hence, we need to develop procedures to derive thermal models automatically in the field. In this paper, we focus on static thermal model learning. We tackle the problem in a real-life context: we developed a complete infrastructure for model building and thermal data collection in the Linux environment, and we tested it on an Intel Nehalem-based server CPU. Model building is based on a least-squares procedure which extracts the model linking power dissipation with temperature in steady-state conditions. Our results show high accuracy and robustness even in the presence of a complex thermal environment and the limited-precision power and temperature measurements typical of today's commercial servers.
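The steady-state learning step reduces to an ordinary least-squares fit of temperature against per-core power plus an intercept (ambient temperature). A sketch on noise-free synthetic data; the helper name and numbers are assumptions, and the real pipeline adds data collection and filtering:

```python
import numpy as np

def fit_thermal_model(P, T):
    """Fit the steady-state model T = A @ P + c by least squares.
    P: (samples, cores) measured powers; T: (samples,) temperatures."""
    X = np.hstack([P, np.ones((P.shape[0], 1))])     # append intercept column
    coef, *_ = np.linalg.lstsq(X, T, rcond=None)
    return coef[:-1], coef[-1]                       # per-core gains, ambient

# Synthetic ground truth: T = 0.5*P0 + 0.2*P1 + 40 (deg C)
rng = np.random.default_rng(1)
P = rng.uniform(5, 30, size=(50, 2))                 # measured core powers (W)
T = 0.5 * P[:, 0] + 0.2 * P[:, 1] + 40
A, c = fit_thermal_model(P, T)
print(np.round(A, 3), round(float(c), 3))
```

With real sensor data the recovered gains are only approximate, which is why the paper emphasizes robustness to the limited-precision measurements of commercial servers.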
- Published
- 2011