20 results for "Sreenivas Subramoney"
Search Results
2. Segment-Fusion: Hierarchical Context Fusion for Robust 3D Semantic Segmentation
- Author
- Anirud Thyagharajan, Benjamin Ummenhofer, Prashant Laddha, Om Ji Omer, and Sreenivas Subramoney
- Published
- 2022
3. Compute-In-Memory Using 6T SRAM for a Wide Variety of Workloads
- Author
- Pramod Kumar Bharti, Saurabh Jain, Kamlesh R. Pillai, Sagar Varma Sayyaparaju, Gurpreet S. Kalsi, Joycee Mekie, and Sreenivas Subramoney
- Published
- 2022
4. RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU
- Author
- Christopher J. Hughes, Ananda Samajdar, Hyesoon Kim, Sreenivas Subramoney, Eric Qin, Geonhwa Jeong, and Tushar Krishna
- Subjects
- Computer Science - Machine Learning (cs.LG), Computer Science - Artificial Intelligence (cs.AI), Computer Science - Hardware Architecture (cs.AR), Systolic array, Datapath, Electronic design automation, Embedded system, Energy efficiency
- Abstract
As AI-based applications become pervasive, CPU vendors are starting to incorporate matrix engines within the datapath to boost efficiency. Systolic arrays have been the premier architectural choice as matrix engines in offload accelerators. However, we demonstrate that incorporating them inside CPUs can introduce under-utilization and stalls due to limited register storage to amortize the fill and drain times of the array. To address this, we propose RASA, a Register-Aware Systolic Array. We develop techniques to divide an execution stage into several sub-stages and overlap instructions to hide overheads and run them concurrently. RASA-based designs improve performance significantly with negligible area and power overhead. (This paper was accepted to DAC 2021.)
- Published
- 2021
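To make the fill/drain argument in the RASA abstract concrete, the following back-of-the-envelope model contrasts array utilization with and without overlapping consecutive matrix instructions. It is an illustrative sketch only; the array dimensions, cycle counts and instruction counts are assumptions, not figures from the paper.

```python
def utilization(rows, cols, compute_cycles, n_instr, overlap):
    """Fraction of cycles a systolic array does useful work over a burst of matrix instructions."""
    fill, drain = rows, cols              # cycles spent streaming operands in / results out
    useful = n_instr * compute_cycles
    if overlap:
        # Register-aware pipelining: the fill of instruction i+1 is hidden under
        # the compute/drain of instruction i, so fill/drain is paid only once.
        total = fill + n_instr * compute_cycles + drain
    else:
        total = n_instr * (fill + compute_cycles + drain)
    return useful / total

for overlap in (False, True):
    u = utilization(rows=8, cols=8, compute_cycles=16, n_instr=32, overlap=overlap)
    print(f"overlap={overlap}: utilization={u:.0%}")   # roughly 50% without overlap vs 97% with it
```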
5. REDUCT: Keep it Close, Keep it Cool!: Efficient Scaling of DNN Inference on Multi-core CPUs with Near-Cache Compute
- Author
- Sreenivas Subramoney, Shankar Balachandran, Rahul Bera, Joydeep Rakshit, Anant V. Nori, Avishaii Abuhatzera, Belliappa Kuttanna, and Om Ji Omer
- Subjects
- Multi-core processor, Out-of-order execution, Computer science, Pipeline (computing), Bandwidth, Inference, Central processing unit, Parallel computing, Cache
- Abstract
Deep Neural Networks (DNN) are used in a variety of applications and services. With the evolving nature of DNNs, the race to build optimal hardware (both in datacenter and edge) continues. General purpose multi-core CPUs offer unique attractive advantages for DNN inference at both datacenter [60] and edge [71]. Most of the CPU pipeline design complexity is targeted towards optimizing general-purpose single thread performance, and is overkill for relatively simpler, but still hugely important, data parallel DNN inference workloads. Addressing this disparity efficiently can enable both raw performance scaling and overall performance/Watt improvements for multi-core CPU DNN inference. We present REDUCT, where we build innovative solutions that bypass traditional CPU resources which impact DNN inference power and limit its performance. Fundamentally, REDUCT's "Keep it close" policy enables consecutive pieces of work to be executed close to each other. REDUCT enables instruction delivery/decode close to execution and instruction execution close to data. Simple ISA extensions encode the fixed-iteration count loop-y workload behavior enabling an effective bypass of many power-hungry front-end stages of the wide Out-of-Order (OoO) CPU pipeline. Per core performance scales efficiently by distributing lightweight tensor compute near all caches in a multi-level cache hierarchy. This maximizes the cumulative utilization of the existing architectural bandwidth resources in the system and minimizes movement of data. Across a number of DNN models, REDUCT achieves a 2.3X increase in convolution performance/Watt with a 2X to 3.94X scaling in raw performance. Similarly, REDUCT achieves a 1.8X increase in inner-product performance/Watt with 2.8X scaling in performance. REDUCT performance/power scaling is achieved with no increase to cache capacity or bandwidth and a mere 2.63% increase in area. Crucially, REDUCT operates entirely within the CPU programming and memory model, simplifying software development, while achieving performance similar to or better than state-of-the-art Domain Specific Accelerators (DSA) for DNN inference, providing fresh design choices in the AI era.
- Published
- 2021
6. Look-Up Table based Energy Efficient Processing in Cache Support for Neural Network Acceleration
- Author
- Gurpreet S. Kalsi, Akshay Krishna Ramanathan, Vijaykrishnan Narayanan, Kamlesh R. Pillai, Tarun Makesh Chandran, Srivatsa Srinivasa, Sreenivas Subramoney, and Om Ji Omer
- Subjects
- Speedup, Artificial neural network, Computer science, Parallel computing, Data flow, Lookup table, Overhead (computing), Static random-access memory, Cache, Energy efficiency
- Abstract
This paper presents a Look-Up Table (LUT) based Processing-In-Memory (PIM) technique with the potential for running Neural Network inference tasks. We implement a bitline-computing-free technique that avoids frequent bitline accesses to the cache sub-arrays and thereby considerably reduces the memory access energy overhead. The LUT, in conjunction with the compute engines, enables sub-array level parallelism while executing complex operations through data lookup that would otherwise require multiple cycles. Sub-array level parallelism and a systolic input data flow ensure that data movement is confined to the SRAM slice. Our proposed LUT-based PIM methodology exploits substantial parallelism using look-up tables without altering the memory structure or organization, that is, it preserves the bit-cells and peripherals of the existing SRAM monolithic arrays. Our solution achieves 1.72x higher performance and 3.14x lower energy as compared to a state-of-the-art processing-in-cache solution. Sub-array level design modifications to incorporate the LUT along with the compute engines increase the overall cache area by 5.6%. We achieve a 3.97x speedup over a neural network systolic accelerator of similar area. The re-configurable nature of the compute engines enables various neural network operations, thereby supporting sequential networks (RNNs) and transformer models. Our quantitative analysis demonstrates 101x and 3x faster execution, and 91x and 11x better energy efficiency, than a CPU and a GPU respectively while running the transformer model BERT-Base.
- Published
- 2020
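The central idea above, executing operations through data lookup rather than arithmetic or bitline computing, can be illustrated with a toy software analogue. The 4-bit operand width, table layout and lut_dot helper below are assumptions for illustration, not the paper's in-cache hardware design.

```python
# Precomputed 4-bit x 4-bit product table: 256 entries, indexed by (a << 4) | b.
LUT = [a * b for a in range(16) for b in range(16)]

def lut_dot(xs, ws):
    """Dot product of two 4-bit integer vectors using only table lookups and adds."""
    assert all(0 <= v < 16 for v in xs + ws)
    return sum(LUT[(x << 4) | w] for x, w in zip(xs, ws))

print(lut_dot([3, 7, 1, 15], [2, 5, 9, 4]))   # 3*2 + 7*5 + 1*9 + 15*4 = 110
```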
7. Descriptor Scoring for Feature Selection in Real-Time Visual SLAM
- Author
- Dipan Kumar Mandal, Gurpreet S. Kalsi, Om Ji Omer, Sreenivas Subramoney, and Prashant Laddha
- Subjects
- Computer science, Detector, Feature extraction, Feature selection, Invariance, Simultaneous localization and mapping, Visualization, Extended Kalman filter, Odometry, Feature descriptor, Computer vision, Artificial intelligence
- Abstract
Many emerging applications of Visual SLAM running on resource-constrained hardware platforms impose very aggressive pose accuracy requirements and highly demanding latency constraints. To achieve the required pose accuracy under a constrained compute budget, real-time SLAM implementations have to work with few but highly repeatable and invariant features. While many state-of-the-art techniques proposed for selecting good features to track do address some of these concerns, they are computationally complex and therefore not suitable for power-, latency- and cost-sensitive edge devices. On the other hand, simpler feature selection methods based on detector (corner) score fall short in identifying features with the required invariance and trackability. We present the notion of a feature descriptor score as a measure of invariance under distortions. We further propose a feature selection method based on the descriptor score that requires very minimal compute, and demonstrate its performance with binary descriptors on an EKF-based visual inertial odometry (VIO). Compared to detector-score-based methods, our method provides an improvement of up to 10% in ATE (Absolute Trajectory Error) on the EuRoC dataset.
- Published
- 2020
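A minimal sketch of the descriptor-score idea described in the abstract above: score each feature by how stable its binary descriptor stays under synthetic distortions, then keep only the highest-scoring features. The descriptor length, distortion set and selection budget are illustrative assumptions, not the paper's pipeline.

```python
import random

def hamming(d1, d2):
    return bin(d1 ^ d2).count("1")

def descriptor_score(descriptors, bits=256):
    """Higher score = the descriptor changes little across distorted views of the patch."""
    ref = descriptors[0]
    worst = max(hamming(ref, d) for d in descriptors[1:])
    return 1.0 - worst / bits

def select_features(features, budget):
    """features: list of (feature_id, [binary descriptors across distortions])."""
    scored = sorted(((descriptor_score(ds), fid) for fid, ds in features), reverse=True)
    return [fid for _, fid in scored[:budget]]

# Toy example: feature 0 is perfectly stable, feature 1 flips many bits under distortion.
random.seed(0)
stable = [0xABCDEF] * 4
unstable = [0xABCDEF] + [random.getrandbits(256) for _ in range(3)]
print(select_features([(0, stable), (1, unstable)], budget=1))   # -> [0]
```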
8. Characterization of Data Generating Neural Network Applications on x86 CPU Architecture
- Author
- Virendra Singh, Antara Ganguly, Shankar Balachandran, Anant V. Nori, and Sreenivas Subramoney
- Subjects
- Hardware architecture, Instruction prefetch, Computer engineering, Artificial neural network, Computer science, Bandwidth, x86, Convolution
- Abstract
This paper analyzes the performance of two contemporary data-generating neural-network-based workloads, Neural Style Transfer and Super Resolution GAN, running on the x86 hardware architecture. In understanding the impact of data readiness, we find that certain layers benefit from forced data warming-up. In examining the bandwidth utilization of these layers, we identify several memory-bound layers that are not necessarily bandwidth-bound, hinting at the feasibility of prefetch-based solutions for improved performance. We also observe layers with specific kernel sizes performing poorly because of their unoptimized library kernel implementations. Based on our findings, we suggest directions for removing these performance bottlenecks by utilizing the available bandwidth margins (≥ 90%) and realizing convolution operations through vector-based functional units, with scope for at least 20x more such software-to-hardware mappings than the existing implementation.
- Published
- 2020
9. Focused Value Prediction
- Author
- Sumeet Bandishte, Zeev Sperber, Jayesh Gaur, Adi Yoaz, Lihu Rappoport, and Sreenivas Subramoney
- Subjects
- Out-of-order execution, Speedup, Computer science, Computer engineering, Instruction-level parallelism
- Abstract
Value prediction was proposed to speculatively break true data dependencies, thereby allowing Out-of-Order (OOO) processors to achieve higher instruction-level parallelism (ILP) and gain performance. State-of-the-art value predictors try to maximize the number of instructions that can be value predicted, with the belief that higher coverage will unlock more ILP and increase performance. Unfortunately, this comes at the cost of increased complexity, with implementations that require multiple different types of value predictors working in tandem, incurring substantial area and power cost. In this paper we argue for lower-coverage but focused value prediction. Instead of aggressively increasing the coverage of value prediction at the cost of higher area and power, we refocus value prediction as a mechanism to achieve early execution of instructions that frequently create performance bottlenecks in the OOO processor. Since we do not aim for high coverage, our implementation is light-weight, needing just 1.2 KB of storage. Simulation results on 60 diverse workloads show that we deliver a 3.3% performance gain over a baseline similar to the Intel Skylake processor. This performance gain increases substantially to 8.6% when we simulate a futuristic up-scaled version of Skylake. In contrast, for the same storage, state-of-the-art value predictors deliver a much lower speedup of 1.7% and 4.7% respectively. Notably, our proposal is similar to these predictors in performance even when they are given nearly eight times the storage and have 60% more prediction coverage than our solution.
- Published
- 2020
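A minimal sketch (assumptions throughout, not Intel's implementation) of what "focused" value prediction means in practice: a small last-value table that only tracks instructions flagged as performance-critical and only speculates once a value has repeated enough times.

```python
class FocusedValuePredictor:
    def __init__(self, entries=128, confidence_threshold=3):
        self.table = {}                     # pc -> (last_value, confidence)
        self.entries = entries
        self.threshold = confidence_threshold

    def predict(self, pc, is_critical):
        if not is_critical or pc not in self.table:
            return None                     # no speculation for non-critical instructions
        value, conf = self.table[pc]
        return value if conf >= self.threshold else None

    def train(self, pc, is_critical, actual_value):
        if not is_critical:
            return
        if pc not in self.table and len(self.table) >= self.entries:
            self.table.pop(next(iter(self.table)))        # crude FIFO-style eviction
        value, conf = self.table.get(pc, (actual_value, 0))
        # Build confidence on repeats, reset it when the value changes.
        self.table[pc] = (value, conf + 1) if value == actual_value else (actual_value, 0)

vp = FocusedValuePredictor()
for _ in range(4):
    vp.train(pc=0x400123, is_critical=True, actual_value=42)
print(vp.predict(pc=0x400123, is_critical=True))          # -> 42
```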
10. Towards Noise Resilient SLAM
- Author
- Anirud Thyagharajan, Sreenivas Subramoney, Om Ji Omer, and Dipan Kumar Mandal
- Subjects
- Adaptive algorithm, Computer science, Feature extraction, Filter (signal processing), Simultaneous localization and mapping, Noise, Outlier, RGB color model, Computer vision, Artificial intelligence
- Abstract
Sparse-indirect SLAM systems have been dominantly popular due to their computational efficiency and photometric invariance properties. Depth sensors are critical to SLAM frameworks for providing scale information about the 3D world, yet they are known to be plagued by a wide variety of noise sources possessing lateral and axial components. In this work, we demonstrate the detrimental impact of these depth noise components on the performance of the state-of-the-art sparse-indirect SLAM system (ORB-SLAM2). We propose (i) Map-Point Consensus based Outlier Rejection (MC-OR) to counter lateral noise, and (ii) an Adaptive Virtual Camera (AVC) to accurately combat axial noise. MC-OR utilizes consensus information between multiple sightings of the same landmark to disambiguate noisy depth and filter it out before pose optimization. In AVC, we introduce an error vector as an accurate representation of the axial depth error. We additionally propose an adaptive algorithm to find the virtual camera location for projecting the error used in the objective function of the pose optimization. Our techniques work equally well for stereo image pairs and for RGB-D input directly used by sparse-indirect SLAM systems. Our methods were tested on the TUM (RGB-D) and EuRoC (stereo) datasets, and we show that they outperform the state-of-the-art ORB-SLAM2 by 2-3x, especially in sequences critically affected by depth noise.
- Published
- 2020
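A minimal sketch of the map-point consensus idea (MC-OR) summarized above: gather the depth estimates from multiple sightings of the same landmark and reject sightings that disagree with the consensus before they reach pose optimization. The robust statistic and tolerance are illustrative assumptions.

```python
from statistics import median

def consensus_filter(depth_estimates, rel_tolerance=0.1):
    """Keep sightings whose depth agrees with the median within a relative tolerance."""
    consensus = median(depth_estimates)
    inliers = [d for d in depth_estimates
               if abs(d - consensus) <= rel_tolerance * consensus]
    return consensus, inliers

depths = [2.02, 1.98, 2.05, 3.40, 2.01]      # one sighting corrupted by depth noise
consensus, inliers = consensus_filter(depths)
print(consensus, inliers)                     # 3.40 is rejected as an outlier
```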
11. PSB-RNN: A Processing-in-Memory Systolic Array Architecture using Block Circulant Matrices for Recurrent Neural Networks
- Author
- Gurpreet S. Kalsi, Makesh Chandran, Sahithi Rampalli, Vijaykrishnan Narayanan, Jack Sampson, Sreenivas Subramoney, and Nagadastagiri Challapalle
- Subjects
- Artificial neural network, Dataflow, Computer science, Systolic array, Parallel computing, Fourier transform, Recurrent neural network, Crossbar switch, Throughput, Circulant matrix
- Abstract
Recurrent Neural Networks (RNNs) are widely used in Natural Language Processing (NLP) applications as they inherently capture contextual information across spatial and temporal dimensions. Compared to other classes of neural networks, RNNs have more weight parameters as they primarily consist of fully connected layers. Recently, several techniques such as weight pruning, zero-skipping, and block circulant compression have been introduced to reduce the storage and access requirements of RNN weight parameters. In this work, we present a ReRAM crossbar based processing-in-memory (PIM) architecture with systolic dataflow incorporating block circulant compression for RNNs. The block circulant compression decomposes the operations in a fully connected layer into a series of Fourier transforms and point-wise operations resulting in reduced space and computational complexity. We formulate the Fourier transform and point-wise operations into in-situ multiply-and-accumulate (MAC) operations mapped to ReRAM crossbars for high energy efficiency and throughput. We also incorporate systolic dataflow for communication within the crossbar arrays, in contrast to broadcast and multicast communications, to further improve energy efficiency. The proposed architecture achieves average improvements in compute efficiency of 44x and 17x over a custom FPGA architecture and conventional crossbar based architecture implementations, respectively.
- Published
- 2020
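The block circulant compression summarized above stores each (b x b) block of a fully connected layer as a single length-b vector and applies it with Fourier transforms instead of a dense matrix-vector product. The sketch below checks that identity with NumPy; the block size and random data are illustrative assumptions, and nothing here models the ReRAM crossbars or the systolic dataflow.

```python
import numpy as np

def block_circulant_matvec(blocks, x, b):
    """blocks[i][j] is the defining (first-column) vector of circulant block (i, j)."""
    x_parts = x.reshape(-1, b)
    y_parts = []
    for row in blocks:
        acc = np.zeros(b)
        for w_ij, x_j in zip(row, x_parts):
            # Circulant block times vector == circular convolution, computed via FFT.
            acc += np.real(np.fft.ifft(np.fft.fft(w_ij) * np.fft.fft(x_j)))
        y_parts.append(acc)
    return np.concatenate(y_parts)

b = 4
rng = np.random.default_rng(0)
blocks = [[rng.standard_normal(b) for _ in range(2)] for _ in range(2)]   # an 8x8 weight stored as 4 vectors
x = rng.standard_normal(8)

# Reference: expand each circulant block into its dense form and compare.
dense = np.block([[np.array([np.roll(w, k) for k in range(b)]).T for w in row]
                  for row in blocks])
print(np.allclose(block_circulant_matvec(blocks, x, b), dense @ x))       # True
```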
12. Bandwidth-Aware Last-Level Caching: Efficiently Coordinating Off-Chip Read and Write Bandwidth
- Author
- Sreenivas Subramoney, Mainak Chaudhuri, and Jayesh Gaur
- Subjects
- Computer science, Write buffer, Scheduling (computing), Cache, DRAM
- Abstract
The last two decades have witnessed a large number of proposals on the last-level cache (LLC) replacement policy aiming to minimize the number of LLC read misses. Another independent large body of work has explored mechanisms to address the inefficiencies arising from the DRAM writes introduced by the LLC replacement policy. These DRAM scheduling proposals, however, leave the LLC replacement policy unchanged and, as a result, miss the opportunity of synergistically shaping and scheduling the DRAM write bandwidth demand. In this paper, we argue that DRAM read and write bandwidth demands must be coordinated carefully from the LLC side and hence, introduce bandwidth-awareness in the LLC policy. Our bandwidth-aware LLC policy proposal enables long uninterrupted stretches of DRAM reads while maintaining the efficiency of the last-level cache and controlling precisely when and for how long writes can demand DRAM bandwidth. Our proposal comfortably outperforms the state-of-the-art eager DRAM write scheduling proposals and bridges 75% of the performance gap between the baseline and a hypothetical system that deploys an unbounded DRAM write buffer.
- Published
- 2019
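A heavily simplified sketch of the coordination idea above, with all structures and thresholds assumed for illustration: the LLC holds back dirty victims and releases them to DRAM only in long bursts (or when reads are idle), so reads see long uninterrupted stretches of DRAM bandwidth.

```python
class BandwidthAwareLLC:
    def __init__(self, burst_threshold=32):
        self.pending_writes = []
        self.burst_threshold = burst_threshold

    def evict_dirty(self, block):
        """A dirty victim leaves the LLC but is not yet allowed to demand DRAM bandwidth."""
        self.pending_writes.append(block)

    def writes_to_schedule(self, read_queue_empty):
        """Release writes only as a long burst, or opportunistically while reads are idle."""
        if read_queue_empty or len(self.pending_writes) >= self.burst_threshold:
            burst, self.pending_writes = self.pending_writes, []
            return burst
        return []

llc = BandwidthAwareLLC(burst_threshold=4)
for addr in range(3):
    llc.evict_dirty(addr)
print(llc.writes_to_schedule(read_queue_empty=False))   # [] -> reads keep the DRAM bus
print(llc.writes_to_schedule(read_queue_empty=True))    # [0, 1, 2]
```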
13. Visual Inertial Odometry At the Edge: A Hardware-Software Co-design Approach for Ultra-low Latency and Power
- Author
- Sreenivas Subramoney, Hong Wang, Belliappa Kuttanna, Eagle Jones, Jim Radford, Gopi Neela, Biji George, Srivatsava Jandhyala, Om Ji Omer, Dipan Kumar Mandal, Gurpreet S. Kalsi, Santhosh Kumar Rethinagiri, and Lance Hacking
- Subjects
- Computer science, Real-time computing, Image processing, Acceleration, Odometry, Inertial measurement unit, Robustness, Robot, Pose
- Abstract
Visual Inertial Odometry (VIO) is used for estimating the pose and trajectory of a system and is a foundational requirement in many emerging applications like AR/VR and autonomous navigation in cars, drones and robots. In this paper, we analyze key compute bottlenecks in VIO and present a highly optimized VIO accelerator based on a hardware-software co-design approach. We detail a set of novel micro-architectural techniques that optimize compute, data movement, bandwidth and dynamic power to make it possible to deliver high-quality VIO at the ultra-low latency and power required for budget-constrained edge devices. By offloading the computation of the critical linear algebra algorithms from the CPU, the accelerator enables high-sample-rate IMU usage in VIO processing, while acceleration of the image processing pipeline increases precision and robustness and reduces IMU-induced drift in the final pose estimate. The proposed accelerator requires a small silicon footprint (1.3 mm² in a 28nm process at 600 MHz), utilizes a modest on-chip shared SRAM (560 KB) and achieves a 10x speedup over a software-only implementation in terms of image-sample-based pose update latency, while consuming just 2.2 mW of power. In an FPGA implementation using the EuRoC VIO dataset (VGA 30fps images and 100Hz IMU), the accelerator design achieves pose estimation accuracy (loop closure error) comparable to a software-based VIO implementation.
- Published
- 2019
14. Density Tradeoffs of Non-Volatile Memory as a Replacement for SRAM Based Last Level Cache
- Author
- Sreenivas Subramoney, Jayesh Gaur, Sasikanth Manipatruni, Huichu Liu, Kunal Korgaonkar, Hong Wang, Tanay Karnik, Ishwar Bhati, Ian A. Young, and Steven Swanson
- Subjects
- Computer science, Non-volatile memory, Static random-access memory, Cache, Hit rate, Latency, Performance improvement
- Abstract
Increasing the capacity of the Last Level Cache (LLC) can help scale the memory wall. Due to prohibitive area and leakage power, however, growing a conventional SRAM LLC already delivers diminishing returns. Emerging Non-Volatile Memory (NVM) technologies like Spin Torque Transfer RAM (STTRAM) promise high density and low leakage, thereby offering an attractive alternative for building large-capacity LLCs. However, these technologies have significantly longer write latency compared to SRAM, which interferes with reads and severely limits their performance potential. Despite recent work showing write latency reduction at the NVM technology level, practical considerations like high yield and low bit error rates will result in a significant loss of NVM density when these techniques are implemented. Therefore, improving the write latency while compromising on density results in sub-optimal usage of the NVM technology. In this paper we present a novel STTRAM LLC design that mitigates the long write latency, thereby delivering SRAM-like performance while preserving the benefits of high density. Based on a light-weight learning mechanism, our solution relieves LLC congestion through two schemes. First, we propose a write-congestion-aware bypass that eliminates a large fraction of writes. Although dropping LLC hit rates could severely degrade performance in a conventional LLC, our policy smartly modulates the bypass, overcomes the hit-rate loss and delivers a significant performance gain. Furthermore, our solution establishes a virtual hybrid cache that absorbs and eliminates redundant writes, which otherwise might be repeatedly and slowly written to the NVM LLC. Detailed simulation of the traditional SPEC CPU 2006 suite as well as important industry workloads running on a 4-core system shows that our proposal delivers on average a 26% performance improvement over a baseline LLC design using 8MB of STTRAM, while reducing memory system energy by 10%. Our design outperforms a similar-area SRAM LLC by nearly 18%, thereby making NVM technology an attractive alternative for future high performance computing.
- Published
- 2018
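A minimal sketch of the write-congestion-aware bypass described in the abstract: when slow STTRAM writes back up, a learned probability steers incoming fills around the LLC instead of queuing more writes. The occupancy threshold and adaptation step are illustrative assumptions standing in for the paper's light-weight learning mechanism.

```python
import random

class WriteCongestionAwareBypass:
    def __init__(self, write_queue_capacity=64, step=0.05):
        self.capacity = write_queue_capacity
        self.bypass_prob = 0.0
        self.step = step

    def observe(self, write_queue_occupancy):
        """Raise the bypass probability under congestion, decay it otherwise."""
        if write_queue_occupancy > 0.75 * self.capacity:
            self.bypass_prob = min(1.0, self.bypass_prob + self.step)
        else:
            self.bypass_prob = max(0.0, self.bypass_prob - self.step)

    def should_bypass_fill(self):
        return random.random() < self.bypass_prob

policy = WriteCongestionAwareBypass()
for occupancy in (10, 60, 62, 63):            # write congestion builds up
    policy.observe(occupancy)
print(round(policy.bypass_prob, 2))            # 0.15 after three congested samples
```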
15. Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-Level Cache Hierarchies
- Author
- Sreenivas Subramoney, Anant V. Nori, Hong Wang, Siddharth Rai, and Jayesh Gaur
- Subjects
- Out-of-order execution, Computer science, CPU cache, Energy consumption, Cache, Latency, Critical path
- Abstract
On-die caches are a popular method to help hide the main memory latency. However, it is difficult to build large caches without substantially increasing their access latency, which in turn hurts performance. To overcome this difficulty, on-die caches are typically built as a multi-level cache hierarchy. One such popular hierarchy that has been adopted by modern microprocessors is the three-level cache hierarchy. Building a three-level cache hierarchy enables a low average hit latency since most requests are serviced from faster inner-level caches. This has motivated recent microprocessors to deploy large level-2 (L2) caches that can help further reduce the average hit latency. In this paper, we do a fundamental analysis of the popular three-level cache hierarchy and understand its performance delivery using program criticality. Through our detailed analysis we show that the current trend of increasing L2 cache sizes to reduce average hit latency is, in fact, an inefficient design choice. We instead propose the Criticality Aware Tiered Cache Hierarchy (CATCH), which utilizes an accurate detection of program criticality in hardware and, using a novel set of inter-cache prefetchers, ensures that on-die data accesses that lie on the critical path of execution are served at the latency of the fastest level-1 (L1) cache. The last level cache (LLC) serves the purpose of reducing slow memory accesses, thereby making the large L2 cache redundant for most applications. The area saved by eliminating the L2 cache can then be used to create more efficient processor configurations. Our simulation results show that CATCH outperforms the three-level cache hierarchy with a large 1 MB L2 and exclusive LLC by an average of 8.4%, and a baseline with 256 KB L2 and inclusive LLC by 10.3%. We also show that CATCH enables a powerful framework to explore broad chip-level area, performance and power tradeoffs in cache hierarchy design. Supported by CATCH, we evaluate radical architecture directions such as eliminating the L2 altogether and show that such architectures can yield a 4.5% performance gain over the baseline at nearly 30% less area, or improve performance by 7.3% at the same area while reducing energy consumption by 11%.
- Published
- 2018
16. Overcoming interconnect scaling challenges using novel process and design solutions to improve both high-speed and low-power computing modes
- Author
- Lavanya Subramanian, Hong Wang, Jayesh Gaur, Huichu Liu, Ian A. Young, Sreenivas Subramoney, Daniel H. Morris, Tanay Karnik, Ishwar Bhati, Uygar E. Avci, and Kaushik Vaidyanathan
- Subjects
- Interconnect, Computer science, Processor design, Reconfiguration, CMOS, Performance improvement, Energy efficiency
- Abstract
Interconnect scaling in future CMOS technology nodes is projected to cause an unprecedented increase in resistance, making interconnects the key performance limiter instead of transistors. Both high-speed and low-power computing modes based on dynamically changing VDD and frequency will be the most impacted by this trend. To overcome this dilemma, we present device-circuit-architecture solutions based on the reconfiguration of (i) buffered interconnects and (ii) the execution architecture. Interconnect is dynamically reconfigured by either CMOS transistors or novel Insulator-Metal-Transition "via" devices. Processor execution resources are dynamically re-provisioned based on operating mode and activity metrics. With projected interconnect resistance, a simulated processor design shows 18% and 15% performance improvements from interconnect and execution architecture reconfiguration, respectively. When combined, these techniques provide a 35% performance improvement in high-speed mode and improved energy efficiency in low-power mode.
- Published
- 2017
17. A coordinated multi-agent reinforcement learning approach to multi-level cache co-partitioning
- Author
- Preeti Ranjan Panda, Sreenivas Subramoney, and Rahul Jain
- Subjects
- Multi-core processor, Computer science, CPU cache, Simultaneous multithreading, Computer architecture, Cache hierarchy, Cache algorithms
- Abstract
The widening gap between processor and memory performance has led to the inclusion of multiple levels of caches in modern multi-core systems. Processors with simultaneous multithreading (SMT) support multiple hardware threads on the same physical core, which results in shared private caches. Any inefficiency in the cache hierarchy can negatively impact system performance, which motivates the need to co-optimize multiple cache levels by trading off individual application throughput for better system throughput and energy-delay product (EDP). We propose a novel coordinated multi-agent reinforcement learning technique for performing Dynamic Cache Co-partitioning, called Machine Learned Caches (MLC). MLC has low implementation overhead and does not require any special hardware data profilers. We have validated our proposal with fifteen 8-core workloads created using SPEC 2006 benchmarks and found it to be an effective co-partitioning technique. MLC exhibited system throughput and EDP improvements of up to 14% (gmean: 9.35%) and 19.2% (gmean: 13.5%), respectively. We believe this is the first attempt at addressing the problem of multi-level cache co-partitioning.
- Published
- 2017
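A minimal sketch of the coordinated multi-agent idea described above: one epsilon-greedy agent per cache level picks a partition, and all agents learn from the same system-level reward so they co-optimize rather than act independently. The action spaces, learning rule and synthetic reward are illustrative assumptions, not MLC's actual formulation.

```python
import random

class PartitionAgent:
    def __init__(self, actions, epsilon=0.1, lr=0.2):
        self.q = {a: 0.0 for a in actions}
        self.epsilon, self.lr = epsilon, lr

    def act(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, action, shared_reward):
        self.q[action] += self.lr * (shared_reward - self.q[action])

# One agent per cache level; an action is the number of ways given to core 0.
l2_agent = PartitionAgent(actions=range(1, 8))
llc_agent = PartitionAgent(actions=range(1, 16))

def system_reward(l2_ways, llc_ways):
    # Stand-in for a measured metric; a real system would sample performance counters.
    return -(abs(l2_ways - 5) + abs(llc_ways - 10))

random.seed(1)
for _ in range(500):
    a_l2, a_llc = l2_agent.act(), llc_agent.act()
    reward = system_reward(a_l2, a_llc)
    l2_agent.update(a_l2, reward)          # both levels learn from the same shared reward
    llc_agent.update(a_llc, reward)
print(max(l2_agent.q, key=l2_agent.q.get), max(llc_agent.q, key=llc_agent.q.get))
```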
18. Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources
- Author
- Sreenivas Subramoney, Jayesh Gaur, Mainak Chaudhuri, and Pradeep Ramachandran
- Subjects
- Computer science, CPU cache, Parallel computing, Computer architecture, Cache, Cache algorithms
- Abstract
The memory wall continues to be a major performance bottleneck. While small on-die caches have been effective so far in hiding this bottleneck, the ever-increasing footprint of modern applications renders such caches ineffective. Recent advances in memory technologies like embedded DRAM (eDRAM) and High Bandwidth Memory (HBM) have enabled the integration of large memories on the CPU package as an additional source of bandwidth other than the DDR main memory. Because of limited capacity, these memories are typically implemented as a memory-side cache. Driven by traditional wisdom, many of the optimizations that target improving system performance have tried to maximize the hit rate of the memory-side cache. A higher hit rate enables better utilization of the cache, and is therefore believed to result in higher performance. In this paper, we challenge this traditional wisdom and present DAP, a Dynamic Access Partitioning algorithm that sacrifices cache hit rates to exploit under-utilized bandwidth available at main memory. DAP achieves a near-optimal bandwidth partitioning between the memory-side cache and main memory by using a light-weight learning mechanism that needs just sixteen bytes of additional hardware. Simulation results show a 13% average performance gain when DAP is implemented on top of a die-stacked memory-side DRAM cache. We also show that DAP delivers large performance benefits across different implementations, bandwidth points, and capacity points of the memory-side cache, making it a valuable addition to any current or future systems based on multiple heterogeneous bandwidth sources beyond the on-chip SRAM cache hierarchy.
- Published
- 2017
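A minimal sketch of the dynamic access partitioning idea summarized above: deliberately steer a fraction of accesses to main memory whenever its bandwidth sits idle while the memory-side cache is saturated, even if those accesses would have hit in the cache. The counters, thresholds and epoch-based update are illustrative assumptions standing in for DAP's light-weight learning mechanism.

```python
class DynamicAccessPartitioner:
    def __init__(self, step=0.05):
        self.to_main_memory_fraction = 0.0
        self.step = step
        self.count = 0

    def epoch_update(self, cache_bw_utilization, main_memory_bw_utilization):
        """Rebalance once per epoch based on observed bandwidth utilization."""
        if cache_bw_utilization > 0.9 and main_memory_bw_utilization < 0.5:
            self.to_main_memory_fraction = min(1.0, self.to_main_memory_fraction + self.step)
        elif main_memory_bw_utilization > 0.9:
            self.to_main_memory_fraction = max(0.0, self.to_main_memory_fraction - self.step)

    def route(self):
        """Round-robin approximation of 'send this fraction of accesses to main memory'."""
        self.count += 1
        if not self.to_main_memory_fraction:
            return "memory_side_cache"
        period = max(1, round(1 / self.to_main_memory_fraction))
        return "main_memory" if self.count % period == 0 else "memory_side_cache"

dap = DynamicAccessPartitioner()
for _ in range(4):                            # cache saturated, main memory idle
    dap.epoch_update(cache_bw_utilization=0.97, main_memory_bw_utilization=0.3)
print(round(dap.to_main_memory_fraction, 2))  # 0.2
print([dap.route() for _ in range(10)].count("main_memory"))   # 2 of 10 accesses steered to main memory
```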
19. Base-Victim Compression: An Opportunistic Cache Compression Architecture
- Author
- Sreenivas Subramoney, Alaa R. Alameldeen, and Jayesh Gaur
- Subjects
- CPU cache, Computer science, Parallel computing, Cache algorithms, Adaptive replacement cache, Hit rate, Cache
- Abstract
The memory wall has motivated many enhancements to cache management policies aimed at reducing misses. Cache compression has been proposed to increase effective cache capacity, which potentially reduces capacity and conflict misses. However, complexity in cache compression implementations could increase cache power and access latency. On the other hand, advanced cache replacement mechanisms use heuristics to reduce misses, leading to significant performance gains. Both cache compression and replacement policies should collaborate to improve performance. In this paper, we demonstrate that cache compression and replacement policies can interact negatively. In many workloads, performance gains from replacement policies are lost due to the need to alter the replacement policy to accommodate compression. This leads to sub-optimal replacement policies that could lose performance compared to an uncompressed cache. We introduce a novel, opportunistic cache compression mechanism, Base-Victim, based on an efficient cache design. Our compression architecture improves performance on top of advanced cache replacement policies, and guarantees a hit rate at least as high as that of an uncompressed cache. For cache-sensitive applications, Base-Victim achieves an average 7.3% performance gain for single-threaded workloads, and 8.7% gain for four-thread multi-program workload mixes.
- Published
- 2016
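A simplified sketch of the opportunistic guarantee described above (a software abstraction, not the Base-Victim hardware): the set always holds exactly what an uncompressed LRU cache would hold, and evicted lines are retained only if they compress well enough to fit in the leftover space, so the hit rate can never fall below the uncompressed baseline. The way count, spare budget and compressibility test are assumptions.

```python
from collections import OrderedDict

class BaseVictimSet:
    def __init__(self, ways=4, spare_budget=2, compressible=lambda addr: addr % 2 == 0):
        self.baseline = OrderedDict()       # exactly what an uncompressed LRU set would hold
        self.victims = OrderedDict()        # evicted lines kept opportunistically if they compress
        self.ways, self.spare_budget = ways, spare_budget
        self.compressible = compressible    # stand-in for a real compressor's size check

    def access(self, addr):
        hit = addr in self.baseline or addr in self.victims
        self.victims.pop(addr, None)
        self.baseline[addr] = True
        self.baseline.move_to_end(addr)
        if len(self.baseline) > self.ways:  # evict exactly as the uncompressed baseline would
            victim, _ = self.baseline.popitem(last=False)
            if self.compressible(victim):   # keep the victim only if it fits in the spare space
                self.victims[victim] = True
                while len(self.victims) > self.spare_budget:
                    self.victims.popitem(last=False)
        return hit

s = BaseVictimSet()
print([s.access(a) for a in (0, 1, 2, 3, 4, 5, 0, 2)])
# The last two accesses hit only because the evicted lines were retained compressed;
# an uncompressed 4-way LRU set would have missed both.
```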
20. Array scalarization in high level synthesis
- Author
- Preeti Ranjan Panda, Gummidipudi Krishnaiah, Sreenivas Subramoney, Namita Sharma, Arun Kumar Pilania, and Ashok Jagannathan
- Subjects
- Loop unrolling, Computer science, High-level synthesis, Parallel computing, Static random-access memory, Integrated circuit design, Latency
- Abstract
Parallelism across loop iterations present in behavioral specifications can typically be exposed and optimized using well-known techniques such as loop unrolling. However, since behavioral arrays are usually mapped to memories (SRAM) during synthesis, performance bottlenecks arise due to memory port constraints. We study array scalarization, the transformation of an array into a group of scalar variables. We propose a technique for selectively scalarizing arrays to improve the performance of synthesized designs, taking into consideration the latency benefits as well as the area overhead of using discrete registers instead of denser SRAM for storing array elements. Our experiments on several benchmark examples indicate promising speedups of more than 10x for several designs due to scalarization.
- Published
- 2014
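A back-of-the-envelope model of why scalarization helps (illustrative assumptions, not the paper's cost model): after unrolling, the parallel accesses to an array mapped to SRAM must share a limited number of memory ports per cycle, while scalarized elements live in discrete registers with no port constraint.

```python
import math

def cycles_for_unrolled_body(parallel_array_accesses, sram_ports=2, scalarized=False):
    if scalarized:
        return 1                                   # registers: no memory-port constraint
    return math.ceil(parallel_array_accesses / sram_ports)

unroll_factor = 8                                  # 8 independent array reads per unrolled iteration
print(cycles_for_unrolled_body(unroll_factor))                    # 4 cycles, memory-port limited
print(cycles_for_unrolled_body(unroll_factor, scalarized=True))   # 1 cycle
```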