189 results for "Per Stenström"
Search Results
2. Cooperative Slack Management: Saving Energy of Multicore Processors by Trading Performance Slack Between QoS-Constrained Applications
- Author
-
Mehrzad Nejat, Madhavan Manivannan, Miquel Pericàs, and Per Stenström
- Subjects
Hardware and Architecture ,Software ,Information Systems - Abstract
Processor resources can be adapted at runtime according to the dynamic behavior of applications to reduce the energy consumption of multicore processors without affecting the Quality-of-Service (QoS). To achieve this, an online resource management scheme is needed to control processor configurations such as cache partitioning, dynamic voltage-frequency scaling, and dynamic adaptation of core resources. Prior state-of-the-art has shown the potential for reducing energy without any performance degradation by coordinating the control of different resources. However, in this article, we show that by allowing short-term variations in processing speed (e.g., in the instructions-per-second rate), in a controlled fashion, we can enable substantial improvements in energy savings while maintaining QoS. We keep track of such variations in the form of performance slack. Slack can be generated, at some energy cost, by processing faster than the performance target. On the other hand, it can be utilized to save energy by allowing a temporary relaxation of the performance target. Based on this insight, we present Cooperative Slack Management (CSM). During runtime, CSM finds opportunities to generate slack at low energy cost by estimating the performance and energy for different resource configurations using analytical models. This slack is used later when it enables larger energy savings. CSM performs such trade-offs across multiple applications, which means that the slack collected for one application can be used to reduce the energy consumption of another. This cooperative approach significantly increases the opportunities to reduce system energy compared with independent slack management for each application. For example, we show that CSM can potentially save up to 41% of system energy (on average, 25%) in a scenario in which both prior art and an extended version with local slack management for each core are ineffective. An illustrative slack-accounting sketch follows this entry.
- Published
- 2022
- Full Text
- View/download PDF
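The following sketch illustrates the slack bookkeeping idea described in the CSM entry above: slack earned by running ahead of a QoS target is pooled and later spent, possibly by a different application, to choose a cheaper configuration. The class names, the shared pool, and all numbers are assumptions made for illustration; this is not CSM's actual algorithm or analytical model.

```python
# A minimal sketch of cooperative slack accounting (illustrative assumptions only).

class SlackPool:
    def __init__(self):
        self.slack = 0.0          # instructions of headroom shared by all applications

    def deposit(self, executed, target):
        self.slack += executed - target   # ran ahead (+) or fell behind (-)

def choose_config(pool, configs):
    """Pick the lowest-energy configuration whose slowdown the pool can absorb.
    configs: list of (energy_joules, expected_instructions, target_instructions)."""
    affordable = [c for c in configs if pool.slack + (c[1] - c[2]) >= 0.0]
    if not affordable:                    # no headroom: run at (or above) the target rate
        return max(configs, key=lambda c: c[1])
    return min(affordable, key=lambda c: c[0])

# Example: a 1.0 J config meets the target exactly, a 0.6 J config falls 1e8 instructions short.
pool = SlackPool()
pool.deposit(executed=1.2e9, target=1.0e9)   # an earlier interval ran ahead of its target
print(choose_config(pool, [(1.0, 1.0e9, 1.0e9), (0.6, 0.9e9, 1.0e9)]))  # -> the 0.6 J config
```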
3. Task-RM: A Resource Manager for Energy Reduction in Task-Parallel Applications under Quality of Service Constraints
- Author
-
M. Waqar Azhar, Miquel Pericàs, and Per Stenström
- Subjects
Hardware and Architecture ,Software ,Information Systems - Abstract
Improving energy efficiency is an important goal of computer system design. This article focuses on a general model of task-parallel applications under quality-of-service requirements on the completion time. Our technique, called Task-RM, exploits the variance in task execution times and the imbalance between tasks to allocate just enough resources, in terms of voltage-frequency settings and core allocation, so that the application completes before the deadline. Moreover, we provide a solution that can harness additional energy savings with the availability of additional processors. We observe that, for the proposed run-time resource manager to allocate resources, it requires the specification of soft deadlines for the tasks. This is accomplished by analyzing the energy-saving scenarios offline and by providing Task-RM with the performance requirements of the tasks. The evaluation shows an energy saving of 33% compared to race-to-idle and 22% compared to dynamic slack allocation (DSA), with an overhead of less than 1%. An illustrative frequency-selection sketch follows this entry.
- Published
- 2022
- Full Text
- View/download PDF
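As a rough illustration of the per-task decision sketched in the Task-RM entry above, the snippet below picks the lowest-energy voltage-frequency level whose estimated execution time still meets a task's soft deadline. The frequency levels, timing estimates, and energy numbers are invented for the example and do not come from the paper.

```python
# Pick the cheapest voltage-frequency level that still meets a task's soft deadline
# (illustrative sketch; values are made up).

def pick_vf_level(exec_time_at, energy_at, soft_deadline):
    """exec_time_at / energy_at: dicts keyed by frequency in GHz."""
    feasible = [f for f, t in exec_time_at.items() if t <= soft_deadline]
    if not feasible:                     # cannot meet the deadline: run as fast as possible
        return max(exec_time_at)
    return min(feasible, key=lambda f: energy_at[f])

exec_time_at = {1.0: 12.0, 1.5: 8.5, 2.0: 6.8}   # milliseconds
energy_at    = {1.0: 3.0, 1.5: 4.1, 2.0: 6.2}    # millijoules
print(pick_vf_level(exec_time_at, energy_at, soft_deadline=9.0))  # -> 1.5
```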
4. Federated Scheduling of Sporadic DAGs on Unrelated Multiprocessors
- Author
-
Per Stenström, Risat Mahmud Pathan, and Petros Voudouris
- Subjects
Job shop scheduling ,Hardware and Architecture ,Computer science ,Heuristic (computer science) ,Value (computer science) ,Multiprocessing ,Parallel computing ,Software ,Scheduling (computing) - Abstract
This paper presents a federated scheduling algorithm for implicit-deadline sporadic DAGs that execute on an unrelated heterogeneous multiprocessor platform. We consider a global work-conserving scheduler to execute a single DAG exclusively on a subset of the unrelated processors. A formal schedulability analysis to find the makespan of a DAG on its dedicated subset of the processors is proposed. The problem of determining each subset of dedicated unrelated processors for each DAG such that the DAG meets its deadline (i.e., designing the federated scheduling algorithm) is tackled by proposing a novel processors-to-task assignment heuristic using a new concept called processor value. An empirical evaluation is presented to show the effectiveness of our approach.
- Published
- 2021
- Full Text
- View/download PDF
5. Bounding the execution time of parallel applications on unrelated multiprocessors
- Author
-
Per Stenström, Petros Voudouris, and Risat Mahmud Pathan
- Subjects
Schedule ,Control and Optimization ,Computer Networks and Communications ,Computer science ,Multiprocessing ,02 engineering and technology ,Parallel computing ,Scheduling (computing) ,Computer Science::Hardware Architecture ,Bounding overwatch ,Computer Systems ,0202 electrical engineering, electronic engineering, information engineering ,Real-time Scheduling ,Computer Engineering ,heterogeneous ,scheduling ,Electrical and Electronic Engineering ,Heterogeneous multiprocessors ,020203 distributed computing ,Makespan ,Job shop scheduling ,Directed acyclic graph ,020202 computer hardware & architecture ,Computer Science Applications ,Task (computing) ,Control and Systems Engineering ,Modeling and Simulation ,Benchmark (computing) ,Parallel Applications ,Embedded Systems - Abstract
Heterogeneous multiprocessors can offer high performance at low energy expenditures. However, to be able to use them in hard real-time systems, timing guarantees need to be provided, and the main challenge is to determine the worst-case schedule length (also known as makespan) of an application. Previous works that estimate the makespan focus mainly on the independent-task application model or the related multiprocessor model, which limits the applicability of the resulting bounds. On the other hand, the directed acyclic graph (DAG) application model and the unrelated multiprocessor model are general and can cover most of today’s platforms and applications. In this work, we propose a simple work-conserving method for scheduling the tasks of a DAG and two new approaches to finding the makespan. A set of representative OpenMP task-based parallel applications from the BOTS benchmark suite and synthetic DAGs are used to evaluate the proposed method. Based on the empirical results, the proposed approach calculates a makespan close to that of the exhaustive method and with low pessimism compared to a lower bound on the actual makespan.
- Published
- 2022
6. CBP: Coordinated management of cache partitioning, bandwidth partitioning and prefetch throttling
- Author
-
Miquel Pericas, Per Stenström, Nadja Ramhoj Holtryd, and Madhavan Manivannan
- Subjects
Instruction prefetch ,FOS: Computer and information sciences ,Hardware_MEMORYSTRUCTURES ,Channel allocation schemes ,Computer science ,Distributed computing ,Bandwidth throttling ,Bandwidth allocation ,Hardware Architecture (cs.AR) ,Bandwidth (computing) ,Cache ,Computer Science - Hardware Architecture ,Resource management (computing) ,Average memory access time - Abstract
Reducing the average memory access time is crucial for improving the performance of applications running on multicore architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention. Techniques for partitioning of shared resources (cache and bandwidth) and for prefetch throttling have been proposed to mitigate contention and reduce the average memory access time. However, existing proposals only employ a single technique or a subset of them and are therefore not able to exploit the full potential of coordinated management of cache, bandwidth and prefetching. Our characterization results show that application performance, in several cases, is sensitive to prefetching, cache and bandwidth allocation altogether. Furthermore, the results show that managing these together provides higher performance potential during workload consolidation, as it enables more resource trade-offs. In this paper, we propose CBP, a coordination mechanism for dynamically managing prefetch throttling, cache partitioning and bandwidth partitioning in order to reduce the average memory access time and improve performance. CBP works by employing individual resource managers to determine the appropriate setting for each resource, and a coordinating mechanism to enable inter-resource trade-offs. Our evaluation on a 16-core CMP shows that CBP, on average, improves performance by 11% compared to the state-of-the-art technique that manages cache partitioning and prefetching, and by 50% compared to the baseline without cache partitioning, bandwidth partitioning and prefetch throttling. An illustrative coordination-loop sketch follows this entry.
- Published
- 2021
- Full Text
- View/download PDF
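The sketch below mirrors the coordination structure described in the CBP entry above: per-resource managers each propose a setting for their own knob, and a coordinator accepts a proposal only if a predicted cost (here, average memory access time) does not worsen. The manager interfaces and the toy cost model are assumptions for illustration, not CBP's actual design.

```python
# Coordinated per-resource managers with a simple accept/reject rule (illustrative sketch).

def coordinate(managers, predict_amat, setting):
    """managers: callables taking the current setting and returning a (knob, value) proposal."""
    for propose in managers:
        knob, value = propose(setting)
        trial = dict(setting, **{knob: value})
        # Accept the proposal only if the predicted average memory access time does not worsen.
        if predict_amat(trial) <= predict_amat(setting):
            setting = trial
    return setting

# Toy cost model and managers (purely illustrative, not a real performance model).
predict_amat = lambda s: 100 / s["ways"] + 50 / s["bw_share"] + (5 if s["prefetch"] else 20)
managers = [
    lambda s: ("ways", s["ways"] + 1),            # cache-partitioning manager asks for one more way
    lambda s: ("bw_share", s["bw_share"] * 2),    # bandwidth manager asks for a larger share
    lambda s: ("prefetch", True),                 # prefetch manager asks to enable prefetching
]
print(coordinate(managers, predict_amat, {"ways": 2, "bw_share": 1, "prefetch": False}))
```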
7. A GPU Register File using Static Data Compression
- Author
-
Alexandra Angerd, Per Stenström, and Erik Sintorn
- Subjects
FOS: Computer and information sciences ,020203 distributed computing ,Computer science ,Register file ,020206 networking & telecommunications ,02 engineering and technology ,Static analysis ,computer.software_genre ,Operand ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,Operating system ,Leverage (statistics) ,Computer Science - Hardware Architecture ,computer ,Throughput (business) ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,Integer (computer science) - Abstract
GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. This paper introduces a new register file organization for efficient register-packing of narrow integer and floating-point operands, designed to leverage advances in static analysis. We show that the hardware/software co-designed register file organization yields a performance improvement of up to 79%, and of 18.6% on average, at a modest output-quality degradation. A toy register-packing sketch follows this entry.
- Published
- 2020
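The entry above describes packing narrow operands so that several logical registers share one physical register. The toy snippet below shows the basic bit-level idea for two 16-bit operands in a 32-bit register; the actual field widths, metadata, and static analysis used in the paper are not modeled here.

```python
# Pack two narrow (<= 16-bit) operands into one 32-bit register (toy illustration).

MASK16 = 0xFFFF

def pack(lo_val, hi_val):
    assert 0 <= lo_val <= MASK16 and 0 <= hi_val <= MASK16, "operands must be narrow"
    return (hi_val << 16) | lo_val          # one 32-bit register holds both operands

def unpack(reg):
    return reg & MASK16, (reg >> 16) & MASK16

reg = pack(0x12, 0x0345)
print(hex(reg), unpack(reg))                # -> 0x3450012 (18, 837)
```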
8. DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors
- Author
-
Per Stenström, Madhavan Manivannan, Nadja Ramhoj Holtryd, and Miquel Pericas
- Subjects
010302 applied physics ,Computer science ,Locality ,Temporal isolation among virtual machines ,02 engineering and technology ,Parallel computing ,Chip ,01 natural sciences ,020202 computer hardware & architecture ,Asynchronous communication ,visual_art ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,visual_art.visual_art_medium ,Cache ,Tile ,Latency (engineering) - Abstract
Cache partitioning in tile-based CMP architectures is a challenging problem because of i) the need to determine capacity allocations with low computational overhead and ii) the need to place allocations close to where they are used, in order to reduce access latency. Although previous solutions have addressed the problem of reducing the computational overhead and incorporating locality-awareness, they suffer from the overheads of centrally determining allocations. In this paper, we propose DELTA, a novel distributed and locality-aware cache partitioning solution which works by exchanging asynchronous challenges among cores. The distributed nature of the algorithm coupled with the low computational complexity allows for frequent reconfigurations at negligible cost and for the scheme to be implemented directly in hardware. The allocation algorithm is supported by an enforcement mechanism which enables locality-aware placement of data. We evaluate DELTA on 16- and 64-core tiled CMPs with multi-programmed workloads. Our evaluation shows that DELTA improves performance by 9% and 16%, respectively, on average, compared to an unpartitioned shared last-level cache.
- Published
- 2020
- Full Text
- View/download PDF
9. Global Dead-Block Management for Task-Parallel Programs
- Author
-
Per Stenström, Miquel Pericas, Madhavan Manivannan, and Vassilis Papaefstathiou
- Subjects
010302 applied physics ,Hardware_MEMORYSTRUCTURES ,Computer science ,Distributed computing ,Task parallelism ,02 engineering and technology ,01 natural sciences ,020202 computer hardware & architecture ,Task (computing) ,Runtime system ,Shared memory ,Hardware and Architecture ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,Cache hierarchy ,Multicore architecture ,Software ,Information Systems ,Block (data storage) - Abstract
Task-parallel programs inefficiently utilize the cache hierarchy due to the presence of dead blocks in caches. Dead blocks may occupy cache space in multiple cache levels for a long time without providing any utility until they are finally evicted. Existing dead-block prediction schemes take decisions locally for each cache level and do not efficiently manage the entire cache hierarchy. This article introduces runtime-orchestrated global dead-block management, in which static and dynamic information about tasks available to the runtime system is used to effectively detect and manage dead blocks across the cache hierarchy. In the proposed global management schemes, static information (e.g., when tasks start/finish, and what data regions tasks produce/consume) is combined with dynamic information to detect when/where blocks become dead. When memory regions are deemed dead at some cache level(s), all the associated cache blocks are evicted from the corresponding level(s). We extend the cache controllers at both private and shared cache levels to use the aforementioned information to evict dead blocks. The article does an extensive evaluation of both inclusive and non-inclusive cache hierarchies and shows that the proposed global schemes outperform existing local dead-block management schemes.
- Published
- 2018
- Full Text
- View/download PDF
10. Scheduling Parallel Real-Time Recurrent Tasks on Multicore Platforms
- Author
-
Petros Voudouris, Risat Mahmud Pathan, and Per Stenström
- Subjects
Multi-core processor ,Computational Theory and Mathematics ,Hardware and Architecture ,Computer science ,Distributed computing ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,020206 networking & telecommunications ,02 engineering and technology ,Dynamic priority scheduling ,Parallel computing ,020202 computer hardware & architecture ,Scheduling (computing) - Abstract
We consider the scheduling of a real-time application that is modeled as a collection of parallel and recurrent tasks on a multicore platform. Each task is a directed-acyclic graph (DAG) having a set of subtasks (i.e., nodes) with precedence constraints (i.e., directed edges) and must complete the execution of all its subtasks by some specified deadline. Each task generates a potentially infinite number of instances where the releases of consecutive instances are separated by some minimum inter-arrival time. Each DAG task and each subtask of that DAG task is assigned a fixed priority. A two-level preemptive global fixed-priority scheduling (GFP) policy is proposed: a task-level scheduler first determines the highest-priority ready task and a subtask-level scheduler then selects its highest-priority subtask for execution. To our knowledge, no earlier work considers a two-level GFP scheduler to schedule recurrent DAG tasks on a multicore platform. We derive a schedulability test for our proposed two-level GFP scheduler. If this test is satisfied, then it is guaranteed that all the tasks will meet their deadlines under GFP. We show that our proposed test is not only theoretically better but also empirically performs much better than the state-of-the-art test in scheduling randomly generated parallel DAG task sets. A compact dispatch sketch follows this entry.
- Published
- 2018
- Full Text
- View/download PDF
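The two-level dispatch rule described in the entry above (highest-priority ready task first, then that task's highest-priority ready subtask) can be sketched as follows. The data structures and the priority encoding (lower number means higher priority) are assumptions made only for illustration.

```python
# Two-level fixed-priority dispatch: pick the task first, then one of its subtasks.

def dispatch(ready, task_priority, subtask_priority):
    """ready: dict mapping task -> set of ready subtasks (lower number = higher priority)."""
    candidates = [t for t, subs in ready.items() if subs]
    if not candidates:
        return None
    task = min(candidates, key=task_priority)            # task-level scheduler
    subtask = min(ready[task], key=subtask_priority)     # subtask-level scheduler
    ready[task].remove(subtask)
    return task, subtask

ready = {"T1": {"a", "b"}, "T2": {"c"}}
print(dispatch(ready, {"T1": 2, "T2": 1}.get, {"a": 1, "b": 0, "c": 0}.get))  # -> ('T2', 'c')
```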
11. SLOOP
- Author
-
M. Waqar Azhar, Per Stenström, and Vassilis Papaefstathiou
- Subjects
010302 applied physics ,Scheme (programming language) ,Schedule ,Exploit ,Computer science ,Quality of service ,Distributed computing ,Work (physics) ,02 engineering and technology ,Energy consumption ,01 natural sciences ,020202 computer hardware & architecture ,Task (project management) ,Hardware and Architecture ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,computer ,Software ,Energy (signal processing) ,Information Systems ,computer.programming_language - Abstract
Most systems allocate computational resources to each executing task without any actual knowledge of the application’s Quality-of-Service (QoS) requirements. Such best-effort policies lead to overprovisioning of the resources and increase energy loss. This work assumes applications with soft QoS requirements and exploits the inherent timing slack to minimize the allocated computational resources and thereby reduce energy consumption. We propose a lightweight progress-tracking methodology based on the outer loops of application kernels. It builds on online history and uses it to estimate the total execution time. The prediction of the execution time and the QoS requirements are then used to schedule the application on a heterogeneous architecture with big out-of-order cores and small (LITTLE) in-order cores and to select the minimum operating frequency, using DVFS, that meets the deadline. Our scheme is effective in exploiting the timing slack of each application. We show that it can reduce the energy consumption by more than 20% without missing any computational deadlines. A simplified progress-tracking sketch follows this entry.
- Published
- 2017
- Full Text
- View/download PDF
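The snippet below illustrates the progress-tracking idea in the entry above: time the completed outer-loop iterations of a kernel, extrapolate the remaining time, and lower the frequency only while the extrapolated finish time still meets the deadline. The assumption that execution time scales inversely with frequency is a common first-order DVFS model used here for illustration, not the paper's exact predictor.

```python
# Extrapolate remaining time from outer-loop progress and pick the slowest frequency that fits.

def estimate_remaining(elapsed, iterations_done, iterations_total):
    per_iteration = elapsed / iterations_done
    return per_iteration * (iterations_total - iterations_done)

def pick_frequency(freqs, f_now, remaining_at_fnow, time_left_to_deadline):
    # Assume remaining time scales as f_now / f (illustrative first-order model).
    ok = [f for f in freqs if remaining_at_fnow * (f_now / f) <= time_left_to_deadline]
    return min(ok) if ok else max(freqs)

remaining = estimate_remaining(elapsed=2.0, iterations_done=40, iterations_total=100)  # 3.0 s left
print(pick_frequency([0.8, 1.2, 1.6, 2.0], f_now=2.0, remaining_at_fnow=remaining,
                     time_left_to_deadline=6.0))   # -> 1.2, the slowest frequency that still fits
```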
12. A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs
- Author
-
Per Stenström, Erik Sintorn, and Alexandra Angerd
- Subjects
010302 applied physics ,Floating point ,Exploit ,Computer science ,Register file ,02 engineering and technology ,Thread (computing) ,Parallel computing ,computer.software_genre ,01 natural sciences ,020202 computer hardware & architecture ,Microarchitecture ,Computer graphics ,Hardware and Architecture ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Compiler ,Graphics ,computer ,Software ,Information Systems - Abstract
Reducing the precision of floating-point values can improve performance and/or reduce energy expenditure in computer graphics, among other applications. However, reducing the precision level of floating-point values in a controlled fashion needs support both at the compiler and at the microarchitecture level. At the compiler level, a method is needed to automate the reduction of precision of each floating-point value. At the microarchitecture level, a lower precision of each floating-point register can allow more floating-point values to be packed into a register file. This, however, calls for new register file organizations. This article proposes an automated precision-selection method and a novel GPU register file organization that can store floating-point register values at arbitrary precisions densely. The automated precision-selection method uses a data-driven approach for setting the precision level of floating-point values, given a quality threshold and a representative set of input data. By allowing a small, but acceptable, degradation in output quality, our method can remove a significant amount of the bits needed to represent floating-point values in the investigated kernels (between 28% and 60%). Our proposed register file organization exploits these lower-precision floating-point values by packing several of them into the same physical register. This reduces the register pressure per thread by up to 48%, and by 27% on average, for a negligible output-quality degradation. This can enable GPUs to keep up to twice as many threads in flight simultaneously. A toy mantissa-truncation sketch follows this entry.
- Published
- 2017
- Full Text
- View/download PDF
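As a toy version of the data-driven precision selection described in the entry above, the snippet below drops low-order mantissa bits of a double and keeps the smallest mantissa width whose relative error on representative inputs stays below a quality threshold. The bit layout follows IEEE-754 double precision; the selection loop and thresholds are illustrative assumptions, not the paper's method.

```python
# Truncate mantissa bits and pick the smallest width that meets an error threshold (toy sketch).

import struct

def truncate_mantissa(x, keep_bits):
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    drop = 52 - keep_bits                       # IEEE-754 double has a 52-bit mantissa
    bits &= ~((1 << drop) - 1)                  # zero the low-order mantissa bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def smallest_acceptable_width(samples, max_rel_error):
    for keep in range(4, 53, 4):
        worst = max(abs(truncate_mantissa(x, keep) - x) / abs(x) for x in samples)
        if worst <= max_rel_error:
            return keep
    return 52

print(smallest_acceptable_width([3.14159, 2.71828, 1.41421], max_rel_error=1e-3))
```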
13. Runtime-Assisted Global Cache Management for Task-Based Parallel Programs
- Author
-
Madhavan Manivannan, Miquel Pericas, Vassilis Papaefstathiou, and Per Stenström
- Subjects
010302 applied physics ,Hardware_MEMORYSTRUCTURES ,Cache coloring ,Computer science ,Distributed computing ,Global Assembly Cache ,02 engineering and technology ,Parallel computing ,Cache pollution ,Cache-oblivious algorithm ,01 natural sciences ,020202 computer hardware & architecture ,Smart Cache ,Hardware and Architecture ,Cache invalidation ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,Cache algorithms - Abstract
Dead blocks are handled inefficiently in multi-level cache hierarchies because the decision as to whether a block is dead has to be taken locally at each cache level. This paper introduces runtime-assisted global cache management to quickly deem blocks dead across cache levels in the context of task-based parallel programs. The scheme is based on a cooperative hardware/software approach that leverages static and dynamic information about future data region reuse(s) available to runtime systems for task-based parallel programming models. We show that our proposed runtime-assisted global cache management approach outperforms previously proposed local dead-block management schemes for task-based parallel programs.
- Published
- 2017
- Full Text
- View/download PDF
14. Coordinated Management of Processor Configuration and Cache Partitioning to Optimize Energy under QoS Constraints
- Author
-
Mehrzad Nejat, Per Stenström, Madhavan Manivannan, and Miquel Pericas
- Subjects
010302 applied physics ,FOS: Computer and information sciences ,Multi-core processor ,Computer science ,Distributed computing ,Quality of service ,02 engineering and technology ,Energy consumption ,01 natural sciences ,020202 computer hardware & architecture ,Resource (project management) ,0103 physical sciences ,Memory-level parallelism ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,Resource management ,Cache ,Computer Science - Hardware Architecture ,Efficient energy use - Abstract
An effective way to improve energy efficiency is to throttle hardware resources to meet a certain performance target, specified as a QoS constraint, associated with all applications running on a multicore system. Prior art has proposed resource management (RM) frameworks in which the share of the last-level cache (LLC) assigned to each processor and the voltage-frequency (VF) setting for each processor are managed in a coordinated fashion to reduce energy. A drawback of such a scheme is that, while one core gives up LLC resources for another core, the performance drop must be compensated by a higher VF setting, which leads to a quadratic increase in energy consumption. By allowing each core to be adapted to exploit instruction and memory-level parallelism (ILP/MLP), substantially higher energy savings are enabled. This paper proposes a coordinated RM for LLC partitioning, processor adaptation, and per-core VF scaling. A first contribution is a systematic study of the resource trade-offs enabled when trading between the three classes of resources in a coordinated fashion. A second contribution is a new RM framework that utilizes these trade-offs to save more energy. Finally, a challenge in accurately modeling the impact of resource throttling on performance is to predict the amount of MLP with high accuracy. To this end, the paper contributes with a mechanism that estimates the effect of MLP over different processor configurations and LLC allocations. Overall, we show that up to 18% of energy, and on average 10%, can be saved using the proposed scheme.
- Published
- 2019
15. SaC
- Author
-
M. Waqar Azhar, Miquel Pericas, and Per Stenström
- Subjects
Multi-core processor ,Computer science ,Quality of service ,Distributed computing ,Key (cryptography) ,Brute-force search ,Resource management ,Point (geometry) ,Configuration space ,Computer Science::Operating Systems ,Energy (signal processing) ,Efficient energy use - Abstract
Reducing the energy to carry out computational tasks is key to almost any computing application. We focus in this paper on iterative applications that have explicit computational deadlines per iteration. Our objective is to meet the computational deadlines while minimizing energy. We leverage the vast configuration space offered by heterogeneous multicore platforms, which typically expose three dimensions of energy-saving configurability: voltage/frequency levels, thread count and core type (e.g. ARM big/LITTLE). We note that when choosing the most energy-efficient configuration that meets the computational deadline, an iteration will typically finish before the deadline and execution-time slack will build up across iterations. Our proposed slack management policy, SaC (Slack as a Currency), proactively explores the configuration space to select configurations that can save substantial amounts of energy. To avoid the overheads of an exhaustive search of the configuration space, our proposal also comprises a low-overhead, on-line method by which one can assess each point in the configuration space by linearly interpolating between the endpoints in each configuration-space dimension. Overall, we show that our proposed slack management policy and linear-interpolation configuration assessment method can yield 62% energy savings on top of race-to-idle without missing any deadlines. An illustrative interpolation sketch follows this entry.
- Published
- 2019
- Full Text
- View/download PDF
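The sketch below shows the kind of endpoint interpolation mentioned in the SaC entry above: measure a metric only at the corners of a two-dimensional configuration space (here, frequency and thread count) and estimate interior points by interpolating linearly in each dimension. The two dimensions and all numbers are illustrative assumptions, not SaC's actual configuration space.

```python
# Estimate a metric at an interior configuration by interpolating between measured endpoints.

def lerp(lo, hi, t):
    return lo + (hi - lo) * t

def estimate(metric_at_corners, f, f_lo, f_hi, n, n_lo, n_hi):
    """metric_at_corners: {(f_lo,n_lo): v, (f_lo,n_hi): v, (f_hi,n_lo): v, (f_hi,n_hi): v}."""
    tf = (f - f_lo) / (f_hi - f_lo)
    tn = (n - n_lo) / (n_hi - n_lo)
    low_f  = lerp(metric_at_corners[(f_lo, n_lo)], metric_at_corners[(f_lo, n_hi)], tn)
    high_f = lerp(metric_at_corners[(f_hi, n_lo)], metric_at_corners[(f_hi, n_hi)], tn)
    return lerp(low_f, high_f, tf)

corners = {(1.0, 1): 10.0, (1.0, 4): 4.0, (2.0, 1): 6.0, (2.0, 4): 2.5}  # execution time in seconds
print(estimate(corners, f=1.5, f_lo=1.0, f_hi=2.0, n=2, n_lo=1, n_hi=4))
```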
16. QoS-Driven Coordinated Management of Resources to Save Energy in Multi-core Systems
- Author
-
Mehrzad Nejat, Per Stenström, and Miquel Pericas
- Subjects
010302 applied physics ,Multi-core processor ,Computer science ,Quality of service ,Distributed computing ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,02 engineering and technology ,Cache ,01 natural sciences ,Partition (database) ,020202 computer hardware & architecture ,Efficient energy use - Abstract
Applications that are run on multicore systems without performance targets can waste significant energy. This paper considers, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache partitions and the per-core voltage-frequency settings to save energy while respecting the QoS requirements of individual applications in multi-programmed workloads run on multi-core systems. It does so by performing configuration-space exploration across the spectrum of LLC partition sizes and DVFS settings at runtime with negligible overhead. Compared to DVFS and cache partitioning alone, we show that our proposed coordinated RMA is capable of saving, on average, 20% energy, as compared to 15% for DVFS alone and 7% for cache partitioning alone, when the performance target is set to 70% of the baseline system performance.
- Published
- 2019
- Full Text
- View/download PDF
17. ProFess: A Probabilistic Hybrid Main Memory Management Framework for High Performance and Fairness
- Author
-
Per Stenström, Vassilis Papaefstathiou, and Dmitry Knyaginin
- Subjects
010302 applied physics ,Random access memory ,Computer science ,Distributed computing ,Probabilistic logic ,Topology (electrical circuits) ,02 engineering and technology ,Storage management ,01 natural sciences ,020202 computer hardware & architecture ,Non-volatile memory ,Memory management ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Computer multitasking ,Dram - Abstract
Non-Volatile Memory (NVM) technologies enable cost-effective hybrid main memories with two partitions: M1 (DRAM) and slower but larger M2 (NVM). This paper considers a flat migrating organization of hybrid memories. A challenging and open issue of managing such memories is to allocate M1 among co-running programs such that high fairness is achieved at the same time as high performance. This paper introduces ProFess: a Probabilistic hybrid main memory management Framework for high performance and fairness. It comprises: i) a Relative-Slowdown Monitor (RSM) that enables fair management by indicating which program suffers the most from competition for M1; and ii) a probabilistic Migration-Decision Mechanism (MDM) that unlocks high performance by realizing cost-benefit analysis that is individual for each pair of data blocks considered for migration. Within ProFess, RSM guides MDM towards high fairness. We show that for the multiprogrammed workloads evaluated, ProFess improves fairness by 15% (avg.; up to 29%), compared to the state-of-the-art, while outperforming it by 12% (avg.; up to 29%).
- Published
- 2018
- Full Text
- View/download PDF
18. PATer: A Hardware Prefetching Automatic Tuner on IBM POWER8 Processor
- Author
-
Yonghua Lin, Dian Zhou, Guancheng Chen, Per Stenström, Peter Hofstee, Qijun Wang, and Minghua Li
- Subjects
Instruction prefetch ,Scheme (programming language) ,020203 distributed computing ,Memory hierarchy ,Computer science ,business.industry ,POWER8 ,02 engineering and technology ,020202 computer hardware & architecture ,Set (abstract data type) ,Computer architecture ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,IBM ,business ,computer ,Selection algorithm ,Computer hardware ,computer.programming_language - Abstract
Hardware prefetching on IBM's latest POWER8 processor is able to improve the performance of many applications significantly, but it can also cause performance loss for others. The IBM POWER8 processor provides one of the most sophisticated hardware prefetching designs, which supports 225 different configurations. Obviously, it is a big challenge to find the optimal or near-optimal hardware prefetching configuration for a specific application. We present a dynamic prefetching tuning scheme in this paper, named prefetch automatic tuner (PATer). PATer uses a prediction model based on machine learning to dynamically tune the prefetch configuration based on the values of hardware performance monitoring counters (PMCs). By developing a two-phase prefetching selection algorithm and a prediction accuracy optimization algorithm in this tool, we identify a set of selected key hardware prefetch configurations that matter most to performance as well as a set of PMCs that maximize the machine learning prediction accuracy. We show that PATer is able to accelerate the execution of diverse workloads by up to 1.4×.
- Published
- 2016
- Full Text
- View/download PDF
19. A Primer on Compression in the Memory Hierarchy
- Author
-
Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
- Subjects
Hardware and Architecture - Published
- 2015
- Full Text
- View/download PDF
20. Rock
- Author
-
Dmitry Knyaginin and Per Stenström
- Subjects
Engineering ,Flat memory model ,Resource (project management) ,Theoretical computer science ,business.industry ,Computation ,Distributed computing ,Resource allocation (computer) ,Point (geometry) ,Sensitivity (control systems) ,Pruning (decision trees) ,business ,Dram - Abstract
Combining DRAM and a denser but slower technology into hybrid main memory is a promising way to address the demand for larger and still fast memory, driven by the growing working sets of contemporary workloads. The design space of such hybrid memory systems is multi-dimensional and large. Since simulating and/or prototyping each design point implies high computation and/or implementation efforts, these approaches alone are inefficient for finding the design point with the highest performance. This paper introduces Rock, a generalized framework for pruning the design space of hybrid memory systems. It is the first framework to mutually consider such important design dimensions as resource partitioning (the amount of each memory technology), resource allocation (distribution of the available memory capacity among co-running programs), and data placement (decisions about data promotions and demotions). Rock helps system architects to quickly infer trends, specific to the design space, and to validate them via sensitivity analysis. Trends that pass validation can be used for design-space pruning. Rock thus helps system architects to timely identify the most promising design points. This paper demonstrates the power of Rock by applying it to two design spaces formed by two carefully chosen example workloads and a system employing DRAM, phase-change memory, and solid-state disk. Rock prunes these design spaces to just a few design points.
- Published
- 2017
- Full Text
- View/download PDF
21. Timing-Anomaly Free Dynamic Scheduling of Task-Based Parallel Applications
- Author
-
Risat Mahmud Pathan, Per Stenström, and Petros Voudouris
- Subjects
Multi-core processor ,020203 distributed computing ,Job shop scheduling ,Computer science ,Distributed computing ,Suite ,Processor scheduling ,Dynamic priority scheduling ,Parallel computing ,02 engineering and technology ,Upper and lower bounds ,Scheduling (computing) ,020202 computer hardware & architecture ,0202 electrical engineering, electronic engineering, information engineering - Abstract
Multicore architectures can provide high predictable performance through parallel processing. Unfortunately, computing the makespan of parallel applications is overly pessimistic either due to load-imbalance issues plaguing static scheduling methods or due to timing anomalies plaguing dynamic scheduling methods. This paper contributes with an anomaly-free dynamic scheduling method, called Lazy, which is non-preemptive and non-greedy in the sense that some ready tasks may not be dispatched for execution even if some processors are idle. Assuming parallel applications using contemporary task-based parallel programming models, such as OpenMP, the general idea of Lazy is to avoid timing anomalies by assigning fixed priorities to the tasks and then dispatching selected highest-priority ready tasks for execution at each scheduling point. We formally prove that Lazy is timing-anomaly free. Unlike all the commonly-used dynamic schedulers like breadth-first and depth-first schedulers (e.g., CilkPlus) that rely on analytical approaches to determine an upper bound on the makespan of a parallel application, a safe makespan of a parallel application is computed by simulating Lazy. Our experimental results show that the makespan computed by simulating Lazy is much tighter and scales better, as demonstrated by four parallel benchmarks from a task-parallel benchmark suite, in comparison to the state-of-the-art.
- Published
- 2017
- Full Text
- View/download PDF
22. Characterizing and Exploiting Small-Value Memory Instructions
- Author
-
Per Stenström and Mafijul Md. Islam
- Subjects
Power management ,Speedup ,Computer science ,Locality ,Parallel computing ,Operand ,computer.software_genre ,Theoretical Computer Science ,Memory management ,Computational Theory and Mathematics ,Hardware and Architecture ,Server ,Overhead (computing) ,Compiler ,computer ,Software - Abstract
This paper exploits small-value locality to accelerate the execution of memory instructions. We find that small-value loads—loads with small-value operands of 8 bits or less—are common across 52 applications from the desktop, embedded, and media domains. We show that the relative occurrences of small-value loads remain fairly stable during program execution. Moreover, we establish that the frequency of small-value loads is almost independent of compiler and input data. We then introduce the concept of small-value caches (SVC) to compactly store small-value memory words. We show that SVCs provide significant speedup and reduce the overall energy dissipation with negligible chip-area overhead. An illustrative small-value filter follows this entry.
- Published
- 2014
- Full Text
- View/download PDF
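The snippet below illustrates the "small-value load" notion used in the entry above: a loaded word whose value fits in 8 bits (either as an unsigned value or as a small sign-extended two's-complement value) could be served from a compact small-value cache. The exact definition used in the paper may differ; this check is an assumption for illustration.

```python
# Classify whether a loaded word qualifies as a "small value" (toy definition).

def is_small_value(word, width_bits=32):
    unsigned_small = 0 <= word <= 0xFF
    # Interpret the word as a signed two's-complement number of the given width.
    signed = word - (1 << width_bits) if word >= (1 << (width_bits - 1)) else word
    signed_small = -128 <= signed <= 127
    return unsigned_small or signed_small

print([is_small_value(v) for v in (0x03, 0xFFFFFFFE, 0x1234)])  # -> [True, True, False]
```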
23. SC2
- Author
-
Per Stenström and Angelos Arelakis
- Subjects
symbols.namesake ,Hardware_MEMORYSTRUCTURES ,Computer science ,Locality ,symbols ,Code generation ,General Medicine ,Cache ,Energy consumption ,Parallel computing ,Cache pollution ,Huffman coding ,Cache algorithms - Abstract
Low utilization of on-chip cache capacity limits performance and wastes energy because of the long latency, limited bandwidth, and energy consumption associated with off-chip memory accesses. Value replication is an important source of low capacity utilization. While prior cache compression techniques manage to code frequent values densely, they trade off a high compression ratio for low decompression latency, thus missing opportunities to utilize capacity more effectively. This paper presents, for the first time, a detailed design-space exploration of caches that utilize statistical compression. We show that more aggressive approaches like Huffman coding, which have been neglected in the past due to the high processing overhead for (de)compression, are suitable techniques for caches and memory. Based on our key observation that value locality varies little over time and across applications, we first demonstrate that the overhead of statistics acquisition for code generation is low because new encodings are needed rarely, making it possible to off-load it to software routines. We then show that the high compression ratio obtained by Huffman coding makes it possible to utilize the performance benefits of 4X larger last-level caches with about 50% lower power consumption than such larger caches. A minimal Huffman-coding sketch follows this entry.
- Published
- 2014
- Full Text
- View/download PDF
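The sketch below shows the principle behind statistical cache compression as described in the entry above: gather value statistics, build a Huffman code over frequent words, and encode with it. Real designs use canonical codes and hardware (de)compressors; this pure-software version, with made-up data, only illustrates how a skewed value distribution yields short codes.

```python
# Build a Huffman code over cache-line word values and measure the compressed size (toy sketch).

import heapq
from collections import Counter

def huffman_code(values):
    counts = Counter(values)
    heap = [(n, i, {v: ""}) for i, (v, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {v: "0" + code for v, code in c1.items()}
        merged.update({v: "1" + code for v, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]

words = [0, 0, 0, 0, 1, 1, 0xDEADBEEF]            # a skewed value distribution
code = huffman_code(words)
compressed = "".join(code[w] for w in words)
print(code, len(compressed), "bits vs", 32 * len(words), "bits uncompressed")
```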
24. ZEBRA: Data-Centric Contention Management in Hardware Transactional Memory
- Author
-
Ruben Titos-Gil, Per Stenström, Manuel E. Acacio, Anurag Negi, and José M. García
- Subjects
Computer science ,Transaction processing ,Distributed computing ,Transactional memory ,Deadlock ,Concurrency control ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,Concurrent computing ,Cache ,Optimistic concurrency control ,Cache coherence ,Block (data storage) - Abstract
Transactional contention management policies show considerable variation in relative performance with changing workload characteristics. Consequently, incorporation of fixed-policy Transactional Memory (TM) in general purpose computing systems is suboptimal by design and renders such systems susceptible to pathologies. Of particular concern are Hardware TM (HTM) systems where traditional designs have hardwired policies in silicon. Adaptive HTMs hold promise, but pose major challenges in terms of design and verification costs. In this paper, we present the ZEBRA HTM design, which lays down a simple yet high-performance approach to implement adaptive contention management in hardware. Prior work in this area has associated contention with transactional code blocks. However, we discover that by associating contention with data (cache blocks) accessed by transactional code rather than the code block itself, we achieve a neat match in granularity with that of the cache coherence protocol. This leads to a design that is very simple and yet able to track closely or exceed the performance of the best performing policy for a given workload. ZEBRA, therefore, brings together the inherent benefits of traditional eager HTMs (parallel commits) and lazy HTMs (good optimistic concurrency without deadlock-avoidance mechanisms), combining them into a low-complexity design.
- Published
- 2014
- Full Text
- View/download PDF
25. A Case for a Value-Aware Cache
- Author
-
Per Stenström and Angelos Arelakis
- Subjects
Smart Cache ,Hardware_MEMORYSTRUCTURES ,Hardware and Architecture ,Computer science ,Cache invalidation ,Cache coloring ,Page cache ,Data_CODINGANDINFORMATIONTHEORY ,Cache ,Parallel computing ,Cache pollution ,Cache-oblivious algorithm ,Cache algorithms - Abstract
Replication of values causes poor utilization of on-chip cache memory resources. This paper addresses the question: How much cache resources can be theoretically and practically saved if value replication is eliminated? We introduce the concept of value-aware caches and show that a sixteen times smaller value-aware cache can yield the same miss rate as a conventional cache. We then make a case for a value-aware cache design using Huffman-based compression. Since the value set is rather stable across the execution of an application, one can afford to reconstruct the coding tree in software. The decompression latency is kept short by our proposed novel pipelined Huffman decoder that uses canonical codewords. While the (loose) upper-bound compression factor is 5.2×, we show that, by eliminating cache-block alignment restrictions, it is possible to achieve a compression factor of 3.4× for practical designs.
- Published
- 2014
- Full Text
- View/download PDF
26. Moving from petaflops to petadata
- Author
-
Goran Rakocevic, Per Stenström, Mateo Valero, Michael J. Flynn, Veljko Milutinovic, Roman Trobec, and Oskar Mencer
- Subjects
General Computer Science ,Computer science ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,02 engineering and technology ,Parallel computing ,020202 computer hardware & architecture - Abstract
The race to build ever-faster supercomputers is on, with more contenders than ever before. However, the current goals set for this race may not lead to the fastest computation for particular applications.
- Published
- 2013
- Full Text
- View/download PDF
27. Adaptive Row Addressing for Cost-Efficient Parallel Memory Protocols in Large-Capacity Memories
- Author
-
Per Stenström, Vassilis Papaefstathiou, and Dmitry Knyaginin
- Subjects
Hardware_MEMORYSTRUCTURES ,Flat memory model ,Cost efficiency ,Computer science ,business.industry ,Distributed computing ,Large capacity ,Multiplexing ,Physical address ,Address bus ,Latency (engineering) ,business ,Efficient energy use ,Computer network - Abstract
Modern commercial workloads drive a continuous demand for larger and still low-latency main memories. JEDEC member companies indicate that parallel memory protocols will remain key to such memories, though widening the bus (increasing the pin count) to address larger capacities would cause multiple issues ultimately reducing the speed (the peak data rate) and cost-efficiency of the protocols. Thus to stay high-speed and cost-efficient, parallel memory protocols should address larger capacities using the available number of pins. This is accomplished by multiplexing the pins to transfer each address in multiple bus cycles, implementing Multi-Cycle Addressing (MCA). However, additional address-transfer cycles can significantly worsen performance and energy efficiency. This paper contributes with the concept of adaptive row addressing that comprises row-address caching to reduce the number of address-transfer cycles, enhanced by row-address prefetching and an adaptive row-access priority policy to improve state-of-the-art memory schedulers. For a case-study MCA protocol, the paper shows that the proposed concept improves: i) the read latency by 7.5% on average and up to 12.5%, and ii) the system-level performance and energy efficiency by 5.5% on average and up to 6.5%. This way, adaptive row addressing makes the MCA protocol as efficient as an idealistic protocol of the same speed but with enough pins to transfer each row address in a single bus cycle.
- Published
- 2016
- Full Text
- View/download PDF
28. RADAR: Runtime-assisted dead region management for last-level caches
- Author
-
Madhavan Manivannan, Vassilis Papaefstathiou, Miquel Pericas, and Per Stenström
- Subjects
010302 applied physics ,Scheme (programming language) ,Hardware_MEMORYSTRUCTURES ,Computer science ,Real-time computing ,02 engineering and technology ,01 natural sciences ,Bridge (nautical) ,020202 computer hardware & architecture ,law.invention ,Runtime system ,law ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Programming paradigm ,Granularity ,Cache ,Radar ,computer ,Energy (signal processing) ,computer.programming_language - Abstract
Last-level caches (LLCs) bridge the processor/memory speed gap and reduce energy consumed per access. Unfortunately, LLCs are poorly utilized because of the relatively large occurrence of dead blocks. We propose RADAR, a hybrid static/dynamic dead-block management technique that can accurately predict and evict dead blocks in LLCs. RADAR does dead-block prediction and eviction at the granularity of address regions supported in many of today's task-parallel programming models. The runtime system utilizes static control-flow information about future region accesses in conjunction with past region access patterns to make accurate predictions about dead regions. The runtime system instructs the cache to demote and eventually evict blocks belonging to such dead regions. This paper considers three RADAR schemes to predict dead regions: a scheme that uses control-flow information provided by the programming model (Look-ahead), a history-based scheme (Look-back) and a combined scheme (Look-ahead and Look-back). Our evaluation shows that, on average, all RADAR schemes outperform state-of-the-art hardware dead-block prediction techniques, whereas the combined scheme always performs best.
- Published
- 2016
- Full Text
- View/download PDF
29. Removal of Conflicts in Hardware Transactional Memory Systems
- Author
-
M. M. Waliullah and Per Stenström
- Subjects
Root (linguistics) ,Hardware_MEMORYSTRUCTURES ,Computer science ,Distributed computing ,False sharing ,Transactional memory ,Energy consumption ,Theoretical Computer Science ,Theory of computation ,Conflict resolution ,Cache ,Software ,Software versioning ,Information Systems - Abstract
This paper analyzes the sources of performance losses in hardware transactional memory and investigates techniques to reduce the losses. It dissects the root causes of data conflicts in hardware transactional memory systems (HTM) into four classes of conflicts: true sharing, false sharing, silent store, and write-write conflicts. These conflicts can cause performance and energy losses due to aborts and extra communication. To quantify losses, the paper proposes the 5C cache-miss classification model that extends the well-established 4C model with a new class of cache misses known as contamination misses. The paper also contributes with two techniques for removal of data conflicts: One for removal of false sharing conflicts and another for removal of silent store conflicts. In addition, it revisits and adapts a technique that is able to reduce losses due to both true and false conflicts. All of the proposed techniques can be accommodated in a lazy versioning and lazy conflict resolution HTM built on top of a MESI cache-coherence infrastructure with quite modest extensions. Their ability to reduce performance is quantitatively established, individually as well as in combination. Performance and energy consumption are improved substantially.
- Published
- 2012
- Full Text
- View/download PDF
30. SimWattch and learn
- Author
-
Jianwei Chen, Per Stenström, and Michel Dubois
- Subjects
Computer science ,Programming language ,Strategy and Management ,computer.software_genre ,Simics ,Tool design ,Education ,Microarchitecture ,Data modeling ,Instruction set ,Instruction set simulation ,Compiler ,Electrical and Electronic Engineering ,computer ,Design space - Abstract
We have introduced SimWattch, a complete system simulation environment that can be used to conduct performance and power-oriented microarchitecture explorations or revisit existing techniques from a more complete perspective taking into account operating system interactions. As a combination and extension of Simics and Wattch, SimWattch facilitates the analysis of a wider design space for computer architects and application, compiler and OS developers. In addition, it explores a new approach to cost-effective simulation tool design.
- Published
- 2009
- Full Text
- View/download PDF
31. FlexCore: Utilizing Exposed Datapath Control for Efficient Computing
- Author
-
Per Stenström, Martin Thuresson, Lars Svensson, Per Larsson-Edefors, Magnus Själander, and Magnus Björk
- Subjects
Very-large-scale integration ,Speedup ,Computer science ,Pipeline (computing) ,Parallel computing ,Theoretical Computer Science ,Instruction set ,Computer architecture ,Hardware and Architecture ,Control and Systems Engineering ,Modeling and Simulation ,Signal Processing ,Datapath ,Memory footprint ,Code generation ,Word (computer architecture) ,Information Systems - Abstract
We introduce FlexCore, the first exemplar of an architecture based on the FlexSoC framework. Comprising the same datapath units found in a conventional five-stage pipeline, the FlexCore has an exposed datapath control and a flexible interconnect to allow the datapath to be dynamically reconfigured as a consequence of code generation. Additionally, the FlexCore allows specialized datapath units to be inserted and utilized within the same architecture and compilation framework. This study shows that, in comparison to a conventional five-stage general-purpose processor, the FlexCore is up to 40% more efficient in terms of cycle count on a set of benchmarks from the embedded application domain. We show that both the fine-grained control and the flexible interconnect contribute to the speedup. Furthermore, according to our VLSI implementation study, the FlexCore architecture offers both time and energy savings. The exposed FlexCore datapath requires a wide control word. The conducted evaluation confirms that this increases the instruction bandwidth and memory footprint. This calls for efficient instruction decoding, as proposed in the FlexSoC framework.
- Published
- 2008
- Full Text
- View/download PDF
32. Introduction
- Author
-
Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
- Published
- 2016
- Full Text
- View/download PDF
33. Concluding Remarks
- Author
-
Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
- Published
- 2016
- Full Text
- View/download PDF
34. Compression Algorithms
- Author
-
Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
- Published
- 2016
- Full Text
- View/download PDF
35. A Primer on Compression in the Memory Hierarchy
- Author
-
Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
- Published
- 2016
- Full Text
- View/download PDF
36. Cache/Memory Link Compression
- Author
-
Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
- Published
- 2016
- Full Text
- View/download PDF
37. Memory Compression
- Author
-
Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
- Published
- 2016
- Full Text
- View/download PDF
38. Cache Compression
- Author
-
Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
- Published
- 2016
- Full Text
- View/download PDF
39. Improving power efficiency of D-NUCA caches
- Author
-
Cosimo Antonio Prete, Per Stenström, Giacomo Gabrielli, Pierfrancesco Foglia, and Alessandro Bardine
- Subjects
Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,Cache coloring ,General Medicine ,Parallel computing ,Cache pollution ,Cache invalidation ,Bus sniffing ,Page cache ,Cache ,business ,Cache algorithms ,Computer network - Abstract
D-NUCA caches are cache memories that, thanks to a banked organization, broadcast search, and a promotion/demotion mechanism, are able to tolerate the increasing wire-delay effects introduced by technology scaling. As a consequence, they will outperform conventional caches (UCA, Uniform Cache Architectures) in future-generation cores. Due to the promotion/demotion mechanism, we have found that, in a D-NUCA cache, the distribution of hits over the ways varies across applications as well as across different execution phases within a single application. In this paper, we show how such behavior can be utilized to improve D-NUCA power efficiency as well as to decrease its access latencies. In particular, we propose a new D-NUCA structure, called Way Adaptable D-NUCA cache, in which the number of active (i.e. powered-on) ways is dynamically adapted to the needs of the running application. Our initial evaluation shows that a consistent reduction of both the average number of active ways (42% on average) and the number of bank access requests (29% on average) is achieved, without significantly affecting the IPC.
- Published
- 2007
- Full Text
- View/download PDF
40. Effectiveness of caching in a distributed digital library system
- Author
-
Anders Ardö, Jochen Hollmann, and Per Stenström
- Subjects
Exploit ,business.industry ,Computer science ,Distributed computing ,Locality ,Digital library ,Hardware and Architecture ,Server ,Hit rate ,Locality of reference ,Overhead (computing) ,Cache ,business ,Software ,Computer network - Abstract
Today, independent publishers are offering digital libraries with fulltext archives. In an attempt to provide a single user-interface to a large set of archives, the studied Article-Database-Service offers a consolidated interface to a geographically distributed set of archives. While this approach offers a tremendous functional advantage to a user, the fulltext download delays caused by the network and queuing in servers make the user-perceived interactive performance poor. This paper studies how effective caching of articles can be achieved at the client level as well as at intermediate points, as manifested by gateways that implement the interfaces to the many fulltext archives. A central research question in this approach is: what is the nature of locality in the user access stream to such a digital library? Based on access logs that drive the simulations, it is shown that client-side caching can result in a 20% hit rate. Even at the gateway level temporal locality is observable, but published replacement algorithms are unable to exploit it. Additionally, spatial locality can be exploited by loading into the cache all articles in an issue, volume, or journal whenever a single article is accessed. However, our experiments showed that this improvement introduced a lot of overhead. Finally, it is shown that the reason for this cache behavior is the long time distance between re-accesses, which makes caching rather infeasible.
- Published
- 2007
- Full Text
- View/download PDF
41. SimWattch: Integrating Complete-System and User-Level Performance and Power Simulators
- Author
-
Per Stenström, Michel Dubois, and Jianwei Chen
- Subjects
Power management ,Computer architecture ,Hardware and Architecture ,Computer science ,Low-power electronics ,System-level simulation ,Electrical and Electronic Engineering ,Simics ,Software ,Power (physics) - Abstract
Evaluating the impact of applications with significant operating-system interactions requires detailed microarchitectural simulation combined with system-level simulation. A cost-effective and practical approach is to combine two widely used simulators. SimWattch integrates Simics, a system-level tool, with Wattch, a user-level tool, to facilitate analysis of a wider design space for computer architects and system developers.
- Published
- 2007
- Full Text
- View/download PDF
42. Starvation-free commit arbitration policies for transactional memory systems
- Author
-
Per Stenström and M. M. Waliullah
- Subjects
Starvation ,Computer science ,Transactional memory ,General Medicine ,Commit ,Computer security ,computer.software_genre ,Three-phase commit protocol ,Arbitration ,medicine ,Operating system ,medicine.symptom ,Baseline (configuration management) ,computer ,Protocol (object-oriented programming) ,Database transaction - Abstract
In transactional memory systems like TCC, unordered transactions are committed on a first-come, first-served basis. If a transaction has read data that has been modified by the next transaction to commit, it will have to roll back and a lot of computation can potentially be wasted. Even worse, such simple commit arbitration policies are prone to starvation; in fact, the performance of Raytrace in SPLASH-2 suffered significantly because of this. This paper analyzes in detail the design issues for commit arbitration policies and proposes novel policies that reduce the amount of wasted computation due to roll-back and, most importantly, avoid starvation. We analyze in detail how to incorporate them in a TCC-like transactional memory protocol. We find that our proposed schemes have no impact on the common-case performance. In addition, they add modest complexity to the baseline protocol. An illustrative arbitration sketch follows this entry.
- Published
- 2007
- Full Text
- View/download PDF
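The following sketch illustrates one possible starvation-avoiding commit arbitration policy of the kind discussed above: conflicting transactions are arbitrated by how often they have already been rolled back (with age as a tie-breaker), so a repeatedly squashed transaction eventually wins. The data layout and policy details are illustrative assumptions, not the paper's exact mechanism.

```python
def choose_committer(candidates):
    """Pick which of several conflicting transactions may commit.
    Arbitrating on (rollback count, age) bounds how long any transaction
    can be starved: every squash raises its future priority."""
    return max(candidates, key=lambda t: (t["rollbacks"], t["age"]))

def arbitrate(committing, conflicting_readers):
    """Return the winner and the transactions that must roll back when
    `committing` reaches its commit point with conflicting readers."""
    everyone = [committing] + conflicting_readers
    winner = choose_committer(everyone)
    losers = [t for t in everyone if t is not winner]
    for t in losers:
        t["rollbacks"] += 1   # losing raises priority for the next arbitration
    return winner, losers
```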
43. HyComp
- Author
-
Fredrik Dahlgren, Per Stenström, and Angelos Arelakis
- Subjects
Lossless compression ,Theoretical computer science ,Computer science ,CPU cache ,Locality ,Huffman coding ,Data type ,symbols.namesake ,symbols ,Locality of reference ,Cache ,Cache algorithms ,Algorithm ,Data compression ,Image compression - Abstract
Proposed cache compression schemes make design-time assumptions about value locality to reduce decompression latency. For example, some schemes assume that common values are spatially close, whereas others assume that null blocks are common. Most schemes, however, assume that value locality is best exploited by fixed-size data types (e.g., 32-bit integers). This assumption falls short when other data types, such as floating-point numbers, are common. This paper makes two contributions. First, HyComp, a hybrid cache compression scheme, selects the best-performing compression method based on heuristics that predict the data type of each block. The data types considered are pointers, integers, floating-point numbers, and the special (and trivial) case of null blocks. Second, the paper contributes a compression method that exploits value locality in data types with predefined semantic value fields, e.g., the exponent and the mantissa of floating-point numbers. We show that HyComp, augmented with the proposed floating-point compression method, offers superior performance in comparison with prior art.
- Published
- 2015
- Full Text
- View/download PDF
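A highly simplified sketch of the selection idea in HyComp: predict the dominant data type of a cache block with cheap heuristics and dispatch the block to the compressor expected to perform best. The specific heuristics, thresholds, and the `compressors` dispatch table below are illustrative assumptions only.

```python
def predict_block_type(words):
    """Very simplified stand-in for HyComp-style heuristics: classify a
    cache block by inspecting its 32-bit words."""
    if all(w == 0 for w in words):
        return "null"
    high_bits = {w >> 16 for w in words}
    if len(high_bits) <= 2 and 0 not in high_bits:
        return "pointer"            # pointers tend to share high-order bits
    if all(w < 2**16 for w in words):
        return "integer"            # small magnitudes suggest narrow integers
    return "float"                  # fall back to the FP-specific compressor

def select_compressor(words, compressors):
    """Dispatch the block to the compressor predicted to perform best.
    `compressors` maps each type name to a compression function."""
    return compressors[predict_block_type(words)](words)
```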
44. Enhancing Garbage Collection Synchronization Using Explicit Bit Barriers
- Author
-
Ruben Titos-Gil, Per Stenström, and Jochen Hollmann
- Subjects
Multi-core processor ,Memory management ,Computer science ,Concurrency ,Synchronization (computer science) ,Virtual memory ,Operating system ,Overhead (computing) ,Transactional memory ,computer.software_genre ,computer ,Garbage collection - Abstract
Multicore architectures offer a convenient way to unlock concurrency between the application (called the mutator) and the garbage collector, yet efficient synchronization between the two by means of barriers is critical to exploiting this concurrency. Hardware Transactional Memory (HTM), now commercially available, opens up new ways to synchronize with dramatically lower overhead for the mutator. Unfortunately, HTM-based schemes proposed to date either require specialized hardware support or impose severe overhead through the invocation of OS-level trap handlers. This paper proposes Explicit Bit Barriers (EBB), a novel approach for fast synchronization between the mutator and HTM-encapsulated relocation tasks. We compare the efficiency of EBBs with read barriers based on virtual memory that rely on OS-level trap handlers. We show that EBBs are nearly as efficient as schemes needing specialized hardware, yet run on commodity Intel processors with TSX extensions.
- Published
- 2015
- Full Text
- View/download PDF
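The following conceptual sketch (purely illustrative, not the paper's implementation) models the explicit-bit idea: the collector marks heap regions under relocation in a bit map, and the mutator tests the corresponding bit on reference loads instead of relying on virtual-memory protection and trap handlers. It does not model the HTM-encapsulated relocation tasks themselves.

```python
class ExplicitBitBarrier:
    """Conceptual model of an explicit-bit read barrier: the collector sets
    a bit per heap region it is currently relocating; the mutator tests it."""

    def __init__(self, num_regions):
        self.relocating = [False] * num_regions

    def collector_begin(self, region):
        self.relocating[region] = True    # mutator must take the slow path here

    def collector_end(self, region):
        self.relocating[region] = False   # region is safe to access directly again

    def mutator_load(self, ref, region_of, slow_path):
        if self.relocating[region_of(ref)]:   # cheap explicit-bit test
            return slow_path(ref)             # e.g., wait for or redirect to the new copy
        return ref                            # fast path: no extra work
```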
45. Performance Impact of Batching Web-Application Requests Using Hot-Spot Processing on GPUs
- Author
-
Tobias Fjalling and Per Stenström
- Subjects
Web server ,Speedup ,business.industry ,Data parallelism ,Computer science ,Task parallelism ,Parallel computing ,Thread (computing) ,Program optimization ,computer.software_genre ,Code refactoring ,Server ,Web application ,SIMD ,business ,computer - Abstract
Web applications are a good fit for many-core servers because of their inherently high degree of request-level parallelism. Yet, processing-intensive web-server requests can lead to low quality-of-service due to hot-spots, which calls for methods that improve single-thread performance. This paper explores how to use off-chip GPUs to speed up web-application hot-spots written in productivity-friendly environments (e.g., C#). First, we apply a number of straightforward optimizations by refactoring a commercial-strength web-application code base. This yields a speedup of 7.6x in a multi-threaded, multi-core CPU test. Second, we gather similar requests from different threads of the optimized code, applying a technique called batching, to exploit the SIMD parallelism provided by GPUs. Surprisingly, there is ample parallelism left to exploit in the already optimized code, yielding a further speedup of between 2x and 3x compared to the best optimized CPU version.
- Published
- 2015
- Full Text
- View/download PDF
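A minimal sketch of the batching idea described above: requests from different web-server threads that invoke the same hot-spot routine are grouped so that each group can be dispatched to the GPU as one data-parallel batch. The request fields (`kernel`, `payload`) and the batch size are assumptions for illustration.

```python
from collections import defaultdict

def batch_requests(pending, batch_size):
    """Group pending requests by the hot-spot routine they invoke so each
    group can be dispatched to the GPU as one data-parallel batch."""
    groups = defaultdict(list)
    for req in pending:
        groups[req["kernel"]].append(req["payload"])
    batches = []
    for kernel, payloads in groups.items():
        for i in range(0, len(payloads), batch_size):
            batches.append((kernel, payloads[i:i + batch_size]))
    return batches

# Requests from different server threads that hit the same hot-spot end up
# in the same batch and can execute in lock-step on the GPU.
```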
46. A cache block reuse prediction scheme
- Author
-
Per Stenström and Jonas Jalminger
- Subjects
Computer Networks and Communications ,Computer science ,Locality ,Parallel computing ,Reuse ,Branch predictor ,Reduction (complexity) ,Set (abstract data type) ,Artificial Intelligence ,Hardware and Architecture ,Cache invalidation ,Cache ,Cache algorithms ,Software ,Block (data storage) - Abstract
We introduce a novel approach that predicts, upon a miss, whether a block should be allocated in the cache based on its past reuse behavior during earlier lifetimes in the cache. It introduces a new reuse model that makes a single-entry bypass buffer sufficient to exploit the spatial locality in non-allocated blocks. It also applies classical two-level branch prediction to the reuse history patterns to predict whether the block should be allocated. Our evaluation of the scheme, based on five benchmarks from SPEC'95 and a set of six multimedia and database applications, shows that the prediction accuracy is between 66% and 94% across the applications and can result in a miss-rate reduction of between 1% and 32%, with an average of 12% (using the ideal implementation). We also consider cost/performance aspects of several implementations of the scheme. We find that with a modest hardware cost (essentially a table of about 300 bytes) the miss rate can be cut by up to 14% compared to a cache with an always-allocate strategy.
- Published
- 2004
- Full Text
- View/download PDF
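The sketch below illustrates a two-level reuse predictor in the spirit of the scheme above: each block keeps a short history of whether its past cache lifetimes saw a reuse, and that history indexes a table of 2-bit saturating counters that decides whether to allocate the block on the next miss or bypass it. The table sizes and update policy are illustrative assumptions.

```python
class ReusePredictor:
    """Two-level prediction applied to reuse history: a per-block history
    register indexes a table of 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.histories = {}                        # block address -> history register
        self.counters = [2] * (1 << history_bits)  # initialized to weakly "allocate"

    def should_allocate(self, block_addr):
        hist = self.histories.get(block_addr, 0)
        return self.counters[hist] >= 2            # predict allocate vs. bypass

    def update(self, block_addr, was_reused):
        hist = self.histories.get(block_addr, 0)
        if was_reused:
            self.counters[hist] = min(3, self.counters[hist] + 1)
        else:
            self.counters[hist] = max(0, self.counters[hist] - 1)
        mask = (1 << self.history_bits) - 1
        self.histories[block_addr] = ((hist << 1) | int(was_reused)) & mask
```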
47. A comparative evaluation of hardware-only and software-only directory protocols in shared-memory multiprocessors
- Author
-
Håkan Grahn and Per Stenström
- Subjects
Hardware_MEMORYSTRUCTURES ,business.industry ,Computer science ,Byte ,Directory ,computer.software_genre ,Software ,Shared memory ,Hardware and Architecture ,Lightweight Directory Access Protocol ,Embedded system ,Directory service ,Operating system ,Cache ,business ,computer ,Cache coherence ,Computer hardware - Abstract
The hardware complexity of hardware-only directory protocols in shared-memory multiprocessors has motivated many researchers to emulate directory management with software handlers executed on the compute processors, called software-only directory protocols. In this paper, we evaluate the performance and design trade-offs between these two approaches in the same architectural simulation framework, driven by eight applications from the SPLASH-2 suite. Our evaluation reveals some common-case operations that can be supported by simple hardware mechanisms and can make the performance of software-only directory protocols competitive with that of hardware-only protocols. These mechanisms aim at either reducing the software handler latency or hiding it by overlapping it with the message latencies associated with inter-node memory transactions. Further, we evaluate the effects of cache block sizes between 16 and 256 bytes as well as two different page placement policies. Overall, we find that a software-only directory protocol enhanced with these mechanisms can reach between 63% and 97% of the baseline hardware-only protocol performance at a lower design complexity.
- Published
- 2004
- Full Text
- View/download PDF
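To illustrate what a software directory handler maintains, the following minimal sketch models per-block directory state (a sharer set and an optional exclusive owner) and the invalidations a handler would have to issue on remote read and write misses. It abstracts away the latency-reducing hardware mechanisms evaluated in the paper and is not its exact protocol.

```python
class SoftwareDirectory:
    """Minimal model of the directory state a software handler keeps per
    memory block: a set of sharing nodes and an optional exclusive owner."""

    def __init__(self):
        self.entries = {}   # block -> {"sharers": set, "owner": node or None}

    def read_miss(self, block, requester):
        e = self.entries.setdefault(block, {"sharers": set(), "owner": None})
        previous_owner = e["owner"]         # must write back / downgrade, if any
        e["owner"] = None
        e["sharers"].add(requester)
        return previous_owner

    def write_miss(self, block, requester):
        e = self.entries.setdefault(block, {"sharers": set(), "owner": None})
        owner = {e["owner"]} if e["owner"] is not None else set()
        to_invalidate = (e["sharers"] | owner) - {requester}
        e["sharers"] = set()
        e["owner"] = requester
        return to_invalidate                # invalidation messages the handler sends
```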
48. 2015 Maurice Wilkes Award Given to Christos Kozyrakis
- Author
-
Per Stenström
- Subjects
010302 applied physics ,ComputingMilieux_THECOMPUTINGPROFESSION ,business.industry ,Computer science ,Art history ,02 engineering and technology ,01 natural sciences ,GeneralLiterature_MISCELLANEOUS ,020202 computer hardware & architecture ,ComputingMilieux_GENERAL ,Software ,Hardware and Architecture ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,Electrical and Electronic Engineering ,business - Abstract
This column discusses the career and work of Christos Kozyrakis, the 2015 recipient of the ACM SIGARCH Maurice Wilkes award, which is given annually to an early-career researcher for an outstanding contribution to computer architecture.
- Published
- 2016
- Full Text
- View/download PDF
49. Improvement of energy-efficiency in off-chip caches by selective prefetching
- Author
-
Per Stenström and Jonas Jalminger
- Subjects
Artificial Intelligence ,Computer Networks and Communications ,Hardware and Architecture ,Computer science ,Locality ,Mobile computing ,Byte ,Parallel computing ,Line (text file) ,Chip ,Software ,Efficient energy use - Abstract
We revisit the line-size/performance trade-offs in off-chip second-level caches in light of energy efficiency. Based on a mix of applications representing server and mobile computer system usage, we show that while the large line sizes (128 bytes) typically used maximize performance, they result in high power dissipation owing to limited exploitation of spatial locality. In contrast, small blocks (32 bytes) are found to cut the energy-delay product by more than a factor of 2 with only a moderate performance loss of less than 25%. As a remedy, prefetching, if applied selectively, is shown to avoid the performance losses of small blocks while keeping power consumption low.
- Published
- 2002
- Full Text
- View/download PDF
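A small sketch of what selective prefetching could look like with 32-byte lines: a pair of adjacent lines is prefetched together only if that pair exhibited spatial locality in the past, so energy is not wasted on useless off-chip transfers. The pair-based predictor below is an illustrative assumption, not the paper's exact mechanism.

```python
class SelectivePrefetcher:
    """Illustrative selective prefetcher for small cache lines: prefetch the
    neighbor of a missing line only when the (even, odd) line pair has
    shown spatial locality before."""

    def __init__(self):
        self.pair_used_together = set()

    def on_miss(self, line, issue_prefetch):
        pair = line & ~1                      # align to the (even, odd) line pair
        if pair in self.pair_used_together:
            issue_prefetch(line ^ 1)          # fetch the other half of the pair

    def on_access(self, line, other_half_resident):
        if other_half_resident:               # both halves of the pair were used
            self.pair_used_together.add(line & ~1)
```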
50. Crystal: A Design-Time Resource Partitioning Method for Hybrid Main Memory
- Author
-
Dmitry Knyaginin, Georgi Gaydadjiev, and Per Stenström
- Subjects
Non-volatile memory ,Random access memory ,Computer science ,Universal memory ,Interleaved memory ,Registered memory ,Non-volatile random-access memory ,Parallel computing ,Energy consumption ,Dram ,Computer memory ,Efficient energy use - Abstract
Non-Volatile Memory (NVM) technologies can be used to reduce system-level execution time, energy, or cost, but they add a new design dimension. Finding the best amounts of DRAM and NVM in a hybrid main memory system is a nontrivial design-time issue whose best solution depends on many factors. Such resource partitioning between DRAM and NVM can be framed as an optimization problem in which the minimum of a target metric is sought, trends matter more than absolute values, and thus the precision of detailed modeling is overkill. Here we present Crystal, an analytic approach to early and rapid design-time resource partitioning of hybrid main memories. Crystal provides first-order estimates of system-level execution time and energy, sufficient to enable an exhaustive search for the best amount and type of NVM for given workloads and partitioning goals. Crystal thus helps system designers quickly find the most promising hybrid configurations for detailed evaluation. For example, Crystal shows how, for specific workloads, higher system-level performance and energy efficiency can be achieved by employing an NVM with the speed and energy consumption of NAND Flash instead of a much faster and more energy-efficient NVM such as phase-change memory.
- Published
- 2014
- Full Text
- View/download PDF
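The following sketch shows the kind of exhaustive design-time search Crystal enables: given a first-order cost model (supplied by the caller and assumed here), every DRAM/NVM split of a fixed-capacity main memory is evaluated and the split minimizing the target metric is returned. The model and capacity granularity are illustrative assumptions.

```python
def best_partition(total_gb, step_gb, workloads, metric):
    """Exhaustively search DRAM/NVM splits of a fixed-capacity main memory
    and return the split that minimizes the target metric (e.g., a
    first-order estimate of execution time or energy) over all workloads."""
    best = None
    for dram in range(0, total_gb + 1, step_gb):
        nvm = total_gb - dram
        cost = sum(metric(w, dram, nvm) for w in workloads)
        if best is None or cost < best[0]:
            best = (cost, dram, nvm)
    return best   # (metric value, DRAM GB, NVM GB)

# Usage: best_partition(256, 16, workloads, my_first_order_model) returns the
# most promising configuration for subsequent detailed evaluation.
```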