6,859 results on '"CPU cache"'
Search Results
2. A Survey of Research on Eviction Set Construction for Conflict-Based Cache Side-Channel Attacks (冲突型缓存侧信道攻击的构建驱逐集研究综述).
- Author
-
李真真 and 宋威
- Published
- 2024
- Full Text
- View/download PDF
3. Understanding Perception of Cache-Based Side-Channel Attack on Cloud Environment
- Author
-
Ainapure, Bharati S., Shah, Deven, Rao, A. Ananda, Kacprzyk, Janusz, Series editor, Pal, Nikhil R., Advisory editor, Bello Perez, Rafael, Advisory editor, Corchado, Emilio S., Advisory editor, Hagras, Hani, Advisory editor, Kóczy, László T., Advisory editor, Kreinovich, Vladik, Advisory editor, Lin, Chin-Teng, Advisory editor, Lu, Jie, Advisory editor, Melin, Patricia, Advisory editor, Nedjah, Nadia, Advisory editor, Nguyen, Ngoc Thanh, Advisory editor, Wang, Jun, Advisory editor, Sa, Pankaj Kumar, editor, Sahoo, Manmath Narayan, editor, Murugappan, M., editor, Wu, Yulei, editor, and Majhi, Banshidhar, editor
- Published
- 2018
- Full Text
- View/download PDF
4. PAPP: Prefetcher-Aware Prime and Probe Side-channel Attack.
- Author
-
Daimeng Wang, Zhiyun Qian, Abu-Ghazaleh, Nael, and Krishnamurthy, Srikanth V.
- Subjects
CACHE memory ,REVERSE engineering ,BANDWIDTH allocation ,CIPHERS ,PEARSON correlation (Statistics) - Abstract
CPU memory prefetchers can substantially interfere with prime and probe cache side-channel attacks, especially on in-order CPUs which use aggressive prefetching. This interference is not accounted for in previous attacks. In this paper, we propose PAPP, a Prefetcher-Aware Prime Probe attack that can operate even in the presence of aggressive prefetchers. Specifically, we reverse engineer the prefetcher and replacement policy on several CPUs and use these insights to design a prime and probe attack that minimizes the impact of the prefetcher. We evaluate PAPP using Cache Side-channel Vulnerability (CSV) metric and demonstrate the substantial improvements in the quality of the channel under different conditions. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
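For orientation on the kind of measurement the entry above builds on, the sketch below shows the generic prime-and-probe loop over one cache set. It assumes an x86 compiler providing the `__rdtscp` and `_mm_lfence` intrinsics and a pre-built eviction set of addresses mapping to the targeted set; the prefetcher-aware access ordering and reverse-engineered replacement-policy insights that PAPP itself contributes are not shown.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtscp, _mm_lfence */

#define WAYS 8           /* associativity of the targeted cache set (assumption) */

/* eviction_set[i] must point into distinct cache lines that all map to the
 * same cache set; building such a set is itself a research problem. */
static volatile uint8_t *eviction_set[WAYS];

/* Prime: fill the target set with the attacker's own lines. */
static void prime(void)
{
    for (int i = 0; i < WAYS; i++)
        (void)*eviction_set[i];
}

/* Probe: re-access the lines and time them; slow accesses suggest the
 * victim evicted some of our lines, i.e. it touched this cache set. */
static uint64_t probe(void)
{
    uint64_t total = 0;
    unsigned aux;
    for (int i = 0; i < WAYS; i++) {
        uint64_t start = __rdtscp(&aux);
        (void)*eviction_set[i];
        _mm_lfence();
        total += __rdtscp(&aux) - start;
    }
    return total;   /* compare against a calibrated hit/miss threshold */
}

int main(void)
{
    /* Stand-in eviction set: WAYS lines spaced 4 KiB apart in one buffer
     * (a real attack derives addresses from the cache indexing function). */
    uint8_t *buf = malloc(WAYS * 4096);
    for (int i = 0; i < WAYS; i++)
        eviction_set[i] = buf + i * 4096;

    prime();
    /* ... victim code would run here ... */
    printf("probe cycles: %llu\n", (unsigned long long)probe());
    free(buf);
    return 0;
}
```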
5. Zero-serialization, Zero-copy memory pooling in compute clusters: Disaggregated memory made accessible
- Author
-
Groet, Philip (author)
- Abstract
With the rise of the new interconnect standards CXL and, previously, OpenCAPI has come a great number of possibilities to step away from the classical approach in which CPUs are in charge of moving data between external devices and local memory. Specifically, OpenCAPI allows attached devices to interface directly with the host memory bus in a near cache-coherent way. IBM has developed the ThymesisFlow system, which allows servers to access each other's random access memory through this OpenCAPI link. ThymesisFlow, however, is not fully coherent in some cases. ThymesisFlow is designed for the situation where a borrower is able to access a lender's memory while the lender does not access that borrowed memory. Coherency problems arise when both the lender of the memory and a borrower of the memory write to the lender's memory. This thesis proposes the use of the Apache Arrow in-memory data format to access memory not only in a near coherent fashion, but in a fully coherent fashion. This will allow compute clusters to use memory resources more efficiently, allow applications to dynamically hotplug memory, and allow data sharing without copying over an Ethernet connection. The protocols devised in this thesis are able to create disaggregated Arrow objects which are readable by all nodes in a cluster in a coherent fashion. The creation of these coherent disaggregated objects is the only performance penalty in making them coherent; after initialization, all nodes use their local CPU caches to cache remote objects. A working proof-of-concept has been created which is able to share Apache Arrow objects stored in the memory of a single node. It is also possible to create Arrow objects which span the memory of multiple nodes, allowing for objects bigger than the memory of a single node. The proof-of-concept could be run thanks to the setup provided by the Hasso Plattner Institute.
- Published
- 2023
6. Multi-GPU Efficient Indexing For Maximizing Parallelism of High Dimensional Range Query Services
- Author
-
Ling Liu, Mincheol Kim, and Wonik Choi
- Subjects
Information Systems and Management ,Range query (data structures) ,Computer Networks and Communications ,Computer science ,CPU cache ,Node (networking) ,Search engine indexing ,Sorting ,Parallel computing ,Computer Science Applications ,Hardware and Architecture ,Parallelism (grammar) ,Massively parallel ,Access structure - Abstract
Numerous research efforts have been proposed for efficient processing of range queries in high-dimensional space, either by redesigning the R-tree access structure to exploit massive parallelism on a single GPU or by exploring a distributed framework for the R-tree. However, none of the existing efforts explores integrating the parallelization of the R-tree on a single GPU with a distributed framework for the R-tree. The problem of designing an efficient multi-GPU indexing method, which can effectively combine parallelism maximization with distributed processing of the R-tree, remains an open challenge. In this paper, we present a novel multi-GPU efficient parallel and distributed indexing method, called the LBPG-tree. The rationale of the LBPG-tree is to combine the advantages of an instruction pipeline in the CPU with the massive parallel processing potential of multiple GPUs by introducing two new optimization strategies: First, we exploit the GPU L2 cache to accelerate both index search and index node access on GPUs. Second, we further improve the utilization of the L2 cache on GPUs by compacting and sorting candidate nodes, a step called Compact-and-Sort. Our experimental results show that the LBPG-tree outperforms G-tree, the previous representative GPU index method, and effectively supports multiple GPUs in providing efficient high-dimensional range query services.
- Published
- 2022
7. Evolutionary Design of the Memory Subsystem
- Author
-
J. Manuel Colmenar, Josefa Díaz Álvarez, and José L. Risco-Martín
- Subjects
FOS: Computer and information sciences ,Flat memory model ,Cache coloring ,CPU cache ,Computer science ,Computer Science - Artificial Intelligence ,Real-time computing ,0211 other engineering and technologies ,Register file ,02 engineering and technology ,Overlay ,law.invention ,law ,Hardware Architecture (cs.AR) ,0202 electrical engineering, electronic engineering, information engineering ,Computing with Memory ,Neural and Evolutionary Computing (cs.NE) ,Computer Science - Hardware Architecture ,021106 design practice & management ,Dynamic random-access memory ,Memory hierarchy ,business.industry ,Uniform memory access ,Computer Science - Neural and Evolutionary Computing ,020202 computer hardware & architecture ,Physical address ,Memory management ,Artificial Intelligence (cs.AI) ,Shared memory ,Embedded system ,Distributed memory ,Cache ,business ,Software - Abstract
The memory hierarchy has a high impact on the performance and power consumption of the system. Moreover, current embedded systems, including those in mobile devices, are specifically designed to run multimedia applications, which are memory intensive. This increases the pressure on the memory subsystem and affects performance and energy consumption. In this regard, thermal problems, performance degradation and high energy consumption can cause irreversible damage to the devices. We address the optimization of the whole memory subsystem with three approaches integrated as a single methodology. Firstly, the thermal impact of the register file is analyzed and optimized. Secondly, the cache memory is addressed by optimizing the cache configuration according to the running applications, improving both performance and power consumption. Finally, we simplify the design and evaluation process of general-purpose and customized dynamic memory managers in the main memory. To this aim, we apply different evolutionary algorithms in combination with memory simulators and profiling tools. This way, we are able to evaluate the quality of each candidate solution and take advantage of the exploration of solutions given by the optimization algorithm. We also provide an experimental study in which our proposal is assessed using well-known benchmark applications.
- Published
- 2023
8. Optimizing L1 cache for embedded systems through grammatical evolution
- Author
-
J. Manuel Colmenar, Josefa Díaz Álvarez, José L. Risco-Martín, Oscar Garnica, and Juan Lanchares
- Subjects
FOS: Computer and information sciences ,Computer science ,CPU cache ,Cache coloring ,Computer Science - Artificial Intelligence ,Pipeline burst cache ,02 engineering and technology ,Parallel computing ,Cache pollution ,Cache-oblivious algorithm ,Theoretical Computer Science ,Cache invalidation ,Write-once ,0202 electrical engineering, electronic engineering, information engineering ,Neural and Evolutionary Computing (cs.NE) ,Cache algorithms ,Hardware_MEMORYSTRUCTURES ,business.industry ,Computer Science - Neural and Evolutionary Computing ,020207 software engineering ,Smart Cache ,Artificial Intelligence (cs.AI) ,Bus sniffing ,Embedded system ,020201 artificial intelligence & image processing ,Geometry and Topology ,Cache ,business ,Software - Abstract
Nowadays, embedded systems are provided with cache memories that are large enough to influence both performance and energy consumption as never before in this kind of system. In addition, the cache memory system has been identified as a component that can improve those metrics by adapting its configuration to the memory access patterns of the applications being run. However, given that cache memories have many parameters which may be set to a high number of different values, designers are faced with a wide and time-consuming exploration space. In this paper, we propose an optimization framework based on Grammatical Evolution (GE) which is able to efficiently find the best cache configurations for a given set of benchmark applications. This metaheuristic allows an important reduction of the optimization runtime, obtaining good results in a low number of generations. This reduction is further increased by the efficient storage of already evaluated cache configurations. Moreover, we selected GE because the plasticity of the grammar eases the creation of phenotypes that form the call to the cache simulator required for the evaluation of the different configurations. Experimental results for the Mediabench suite show that our proposal is able to find cache configurations that obtain an average improvement of 62% versus a real-world baseline configuration.
- Published
- 2023
9. Low-Complexity High-Performance Cyclic Caching for Large MISO Systems
- Author
-
Emanuele Parrinello, Seyed Pooya Shariatpanahi, Mohammad Javad Salehi, Petros Elia, and Antti Tolli
- Subjects
FOS: Computer and information sciences ,Beamforming ,Computer science ,CPU cache ,Computer Science - Information Theory ,multi-antenna communication ,Duality (mathematics) ,Data_CODINGANDINFORMATIONTHEORY ,02 engineering and technology ,Multiplexing ,Bottleneck ,finite-SNR ,0202 electrical engineering, electronic engineering, information engineering ,coded caching ,Electrical and Electronic Engineering ,Computer Science::Information Theory ,low-subpacketization ,Multicast ,Information Theory (cs.IT) ,Applied Mathematics ,020206 networking & telecommunications ,Coding gain ,Computer Science Applications ,Exponential function ,optimized beamforming ,Computer engineering - Abstract
Multi-antenna coded caching is known to combine a global caching gain that is proportional to the cumulative cache size found across the network, with an additional spatial multiplexing gain that stems from using multiple transmitting antennas. However, a closer look reveals two severe bottlenecks; the well-known exponential subpacketization bottleneck that dramatically reduces performance when the communicated file sizes are finite, and the considerable optimization complexity of beamforming multicast messages when the SNR is finite. We here present an entirely novel caching scheme, termed cyclic multi-antenna coded caching, whose unique structure allows for the resolution of the above bottlenecks in the crucial regime of many transmit antennas. For this regime, where the multiplexing gain can exceed the coding gain, our new algorithm is the first to achieve the exact one-shot linear optimal DoF with a subpacketization complexity that scales only linearly with the number of users, and the first to benefit from a multicasting structure that allows for exploiting uplink-downlink duality in order to yield optimized beamformers ultra-fast. In the end, our novel solution provides excellent performance for networks with finite SNR, finite file sizes, and many users.
- Published
- 2022
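As background for the "global caching gain" and "multiplexing gain" the abstract above refers to, the sketch below restates the standard single-antenna coded caching load and the multi-antenna degrees-of-freedom expression from the general literature; the notation ($K$ users, $N$ files, per-user cache of $M$ files, $L$ transmit antennas, integer $t = KM/N$) is an assumption for illustration and is not necessarily the paper's own scheme or notation.

```latex
% Standard (Maddah-Ali--Niesen) coded caching delivery load with K users,
% N files, per-user cache of M files, and integer t = KM/N:
R(t) \;=\; \frac{K - t}{1 + t}, \qquad t = \frac{KM}{N}.
% With an L-antenna transmitter, the achievable one-shot linear sum DoF
% combines the coding gain and the spatial multiplexing gain:
\mathrm{DoF} \;=\; \min\{\, t + L,\; K \,\}.
```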
10. Energy-Aware Coded Caching Strategy Design With Resource Optimization for Satellite-UAV-Vehicle-Integrated Networks
- Author
-
Xinyi Sun, Shushi Gu, Wei Xiang, Keping Yu, Tao Huang, and Zhihua Yang
- Subjects
Multicast ,Computer Networks and Communications ,business.industry ,Computer science ,CPU cache ,Energy consumption ,Computer Science Applications ,Backhaul (telecommunications) ,Hardware and Architecture ,Server ,Signal Processing ,Cache ,business ,Heterogeneous network ,Information Systems ,Efficient energy use ,Computer network - Abstract
The Internet of Vehicles (IoV) can offer a safe and comfortable driving experience through the enhanced advantages of space-air-ground integrated networks (SAGINs), i.e., global seamless access, wide-area coverage and flexible traffic scheduling. However, due to the huge volume of popular traffic, the limited cache/power resources and the heterogeneous network infrastructures, the burden on the backhaul link is seriously enlarged, degrading the energy efficiency of the IoV in SAGINs. In this paper, to serve popular content to multiple vehicle users (VUs), we consider a Cache-enabled Satellite-UAV-Vehicle Integrated Network (CSUVIN), where a geosynchronous earth orbit (GEO) satellite is regarded as a cloud server and unmanned aerial vehicles (UAVs) are deployed as edge caching servers. We then propose an energy-aware coded caching strategy employed in our system model to provide more multicast opportunities and to reduce the backhaul transmission volume, considering the effects of file popularity, cache size, request frequency, and mobility in different road sections (RSs). Furthermore, we derive closed-form expressions of the total energy consumption in both single-RS and multi-RS scenarios with asynchronous and synchronous service schemes, respectively. An optimization problem is formulated to minimize the total energy consumption, and the optimal content placement matrix, power allocation vector and coverage deployment vector are obtained by well-designed algorithms. We finally show, numerically, that our coded caching strategy can greatly improve energy efficiency in CSUVINs compared with other benchmark caching schemes under heterogeneous network conditions.
- Published
- 2022
11. On the Reliability of FeFET On-Chip Memory
- Author
-
Jorg Henkel, Victor M. van Santen, Hussam Amrouch, and Paul R. Genssler
- Subjects
business.industry ,CPU cache ,Computer science ,Transistor ,Theoretical Computer Science ,law.invention ,Reliability (semiconductor) ,Computational Theory and Mathematics ,Hardware and Architecture ,Memory cell ,law ,Power consumption ,Embedded system ,System level ,business ,Cmos process ,Software ,Voltage - Abstract
FeFET is a promising technology for non-volatile on-chip memories and is rapidly attracting ever-increasing attention from industry. The advantage of FeFETs is full compatibility with the existing CMOS process, besides their low power consumption. To enable ultra-dense memories, 1-FeFET AND arrays were proposed in which a memory cell is formed from a single FeFET. All access transistors, which are traditionally needed to operate memory cells, are removed. This imposes a new reliability challenge due to indirect write disturbances. In this work, we are the first to investigate the reliability of FeFET memories from the device level to the system level. We develop a unified model capturing the impact of both indirect disturbances and direct writes on the reliability of FeFET cells. We then investigate different array sizes, write voltages, and write methods under the effect of a wide range of workloads, using a CPU cache as an example of on-chip memory. We demonstrate that indirect write disturbances are the dominant effect degrading the reliability of FeFET memories: for most cells, they contribute over 90% of the overall induced degradation. This provides guidelines for researchers at both the device and circuit levels to optimize FeFET reliability further while considering the hidden impact of workloads.
- Published
- 2022
12. Area and Energy Efficient Joint 2T SOT-MRAM-Based on Diffusion Region Sharing With Adjacent Cells
- Author
-
Jongsun Park and Yunho Jang
- Subjects
Magnetoresistive random-access memory ,Hardware_MEMORYSTRUCTURES ,CPU cache ,Computer science ,Transistor ,Capacitance ,law.invention ,Computational science ,Hardware_GENERAL ,law ,x86 ,Electrical and Electronic Engineering ,Diffusion (business) ,Joint (audio engineering) ,Efficient energy use - Abstract
In this paper, we present a novel low-area joint 2T spin orbit torque magnetic random access memory (SOT-MRAM) cell architecture. The proposed joint 2T cell achieves up to 15% SOT-MRAM cell area reduction by sharing the diffusion regions of transistors between adjacent cells. In addition, the small bit-line capacitance of the proposed SOT-MRAM can lead to a 27% read energy reduction in comparison to the conventional SOT-MRAM. When the proposed 1 MB SOT-MRAM is used as the L2 cache of an x86 processor, gem5 simulation results show an average of 18% dynamic energy savings across various workloads of the SPEC2006 benchmarks.
- Published
- 2022
13. Observing the Invisible: Live Cache Inspection for High-Performance Embedded Systems
- Author
-
Renato Mancuso, Dharmesh Tarapore, Steven Brzozowski, and Shahin Roozhkhosh
- Subjects
FOS: Computer and information sciences ,Profiling (computer programming) ,Structure (mathematical logic) ,Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,CPU cache ,Other Computer Science (cs.OH) ,Linux kernel ,Theoretical Computer Science ,Computational Theory and Mathematics ,Computer Science - Other Computer Science ,Hardware and Architecture ,Embedded system ,Leverage (statistics) ,System on a chip ,Cache ,Cache hierarchy ,business ,Software - Abstract
The vast majority of high-performance embedded systems implement multi-level CPU cache hierarchies. But the exact behavior of these CPU caches has historically been opaque to system designers. Absent expensive hardware debuggers, an understanding of cache makeup remains tenuous at best. This enduring opacity further obscures the complex interplay among applications and OS-level components, particularly as they compete for the allocation of cache resources. Notwithstanding the relegation of cache comprehension to proxies such as static cache analysis, performance counter-based profiling, and cache hierarchy simulations, the underpinnings of cache structure and evolution continue to elude software-centric solutions. In this paper, we explore a novel method of studying cache contents and their evolution via snapshotting. Our method complements extant approaches for cache profiling to better formulate, validate, and refine hypotheses on the behavior of modern caches. We leverage cache introspection interfaces provided by vendors to perform live cache inspections without the need for external hardware. We present CacheFlow, a proof-of-concept Linux kernel module which snapshots cache contents on an NVIDIA Tegra TX1 SoC (system on chip).
- Published
- 2022
14. Asymmetric Decentralized Caching With Coded Prefetching Under Nonuniform Requests
- Author
-
Jun Li, Long Shi, Kui Cai, Zhe Wang, and Wentu Song
- Subjects
Hardware_MEMORYSTRUCTURES ,Computer Networks and Communications ,CPU cache ,business.industry ,Computer science ,Context (language use) ,Computer Science Applications ,Set (abstract data type) ,Rate analysis ,Control and Systems Engineering ,Code (cryptography) ,Cache ,Electrical and Electronic Engineering ,business ,Information Systems ,Computer network - Abstract
We investigate a basic decentralized caching network with coded prefetching under nonuniform requests and arbitrary file popularities, where a server containing $N$ files is connected through a shared link to $K$ users, each with a limited cache memory of $M$ files. In the decentralized placement phase, the server encodes all files with maximum distance separable (MDS) codes of different rates, and each user allocates different cache weights to different files, so that each user randomly prefetches coded subfiles of diverse sizes. In this context, the symmetric delivery used in existing decentralized caching networks with coded prefetching cannot be directly applied. To address this problem, we develop an asymmetric delivery procedure for the decentralized caching network with arbitrary MDS code rates and cache weights. Furthermore, we characterize the expected normalized rate induced by the asymmetric delivery using the concepts of user grouping and leader set. Following the proposed asymmetric delivery and rate analysis, we derive the exact rate–memory tradeoff for a decentralized two-user, two-file caching network and optimize the cache weights to minimize the expected normalized rate. Finally, numerical results corroborate our analytical results in the two-user two-file caching scenario.
- Published
- 2022
15. DBI, debuggers, VM: gotta catch them all: How to escape or fool debuggers with internal architecture CPU flaws?
- Author
-
Plumerault, François and David, Baptiste
- Published
- 2021
- Full Text
- View/download PDF
16. Optimal caching scheme in D2D networks with multiple robot helpers
- Author
-
Yu Lin, Faming Cai, Feng Ke, Zhikai Liu, Hui Song, and Weizhao Yan
- Subjects
Scheme (programming language) ,Computer Networks and Communications ,CPU cache ,Wireless network ,Computer science ,Software deployment ,Distributed computing ,Robot ,Particle swarm optimization ,Mobile robot ,Energy consumption ,computer ,computer.programming_language - Abstract
Mobile robots are playing an important role in modern industries. The deployment of robots acting as mobile helpers in a wireless network is rarely considered in existing studies of device-to-device (D2D) caching schemes. In this paper, we investigate the optimal caching scheme for D2D networks with multiple robot helpers with large cache sizes. An improved caching scheme named robot helper aided caching (RHAC) is proposed to optimize system performance by moving the robot helpers to their optimal positions. The optimal locations of the robot helpers are found using a partitioned adaptive particle swarm optimization (PAPSO) algorithm. Based on these two algorithms, we propose a mobility-aware optimization strategy for the robot helpers. The simulation results demonstrate that, compared with other conventional caching schemes, the proposed RHAC scheme can bring significant performance improvements in terms of hit probability, cost, delay and energy consumption. Furthermore, the location distribution and mobility of the robot helpers are studied, which provides a reference for introducing robot helpers into different scenarios such as smart factories.
- Published
- 2022
17. Dual-Port Content Addressable Memory for Cache Memory Applications
- Author
-
Mujahed Eleyat, Allam Abumwais, Kaan Uyar, and Adil Amirjanov
- Subjects
Biomaterials ,Mechanics of Materials ,Computer science ,business.industry ,CPU cache ,Modeling and Simulation ,Port (circuit theory) ,Electrical and Electronic Engineering ,DUAL (cognitive architecture) ,Content-addressable memory ,business ,Computer hardware ,Computer Science Applications - Published
- 2022
18. Analysis and visualization of proxy caching using LRU, AVL tree and BST with supervised machine learning
- Author
-
P. Ambily Pramitha, Anurag Shrivastava, John T. Abraham, Deepak Kumar Gupta, Munindra Lunagaria, and Jitendra Singh Kushwah
- Subjects
Hardware_MEMORYSTRUCTURES ,AVL tree ,Computer science ,CPU cache ,Parallel computing ,Python (programming language) ,computer.software_genre ,Proxy server ,Cache ,Proxy (statistics) ,Cache algorithms ,computer ,Access time ,computer.programming_language - Abstract
Proxy caching speeds up access and decreases load time, yet the level-2 cache is often neglected. This research investigates the L1 (primary) cache together with an L2 cache acting as a secondary proxy server cache. LRU is typically utilised for cache replacement; evicting entries from a cold cache with LRU used to be a time-consuming process and is still not particularly efficient today. With the proposed solutions, the performance of a cache in which L1 uses LRU and L2 uses LRU_AVL improves. The output compares LRU_AVL against other approaches that use plain LRU, presented as tables and graphs. The average access times of the proxy cache under LRU, LRU_AVL, and LRU_BST are calculated, and the median access time is estimated using Python tools including Pandas and MatPlotLib for LRU, LRU_AVL, and LRU_BST. This research also anticipates the average access time of the LRU, LRU_AVL, and LRU_BST cache algorithms.
- Published
- 2022
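Since the entry above compares LRU-based replacement structures, a minimal LRU cache is sketched below for orientation. It is a generic fixed-capacity, linear-scan illustration in C; the capacity, string keys, and trace are assumptions, and it is not the LRU_AVL or LRU_BST structure evaluated in the paper.

```c
#include <stdio.h>
#include <string.h>

#define CAPACITY 4            /* number of cache slots (assumption) */

typedef struct {
    char key[64];
    unsigned long last_used;  /* smaller = older */
} Slot;

static Slot cache[CAPACITY];
static int used = 0;
static unsigned long clock_tick = 0;

/* Returns 1 on hit, 0 on miss; on a miss, evicts the least-recently-used slot. */
int lru_access(const char *key)
{
    int victim = 0;
    clock_tick++;
    for (int i = 0; i < used; i++) {
        if (strcmp(cache[i].key, key) == 0) {      /* hit: refresh recency */
            cache[i].last_used = clock_tick;
            return 1;
        }
    }
    if (used < CAPACITY) {                         /* cold miss: use a free slot */
        victim = used++;
    } else {
        for (int i = 1; i < CAPACITY; i++)         /* capacity miss: evict oldest */
            if (cache[i].last_used < cache[victim].last_used)
                victim = i;
    }
    strncpy(cache[victim].key, key, sizeof cache[victim].key - 1);
    cache[victim].key[sizeof cache[victim].key - 1] = '\0';
    cache[victim].last_used = clock_tick;
    return 0;
}

int main(void)
{
    const char *trace[] = { "a", "b", "c", "a", "d", "e", "b" };
    int hits = 0, n = sizeof trace / sizeof trace[0];
    for (int i = 0; i < n; i++)
        hits += lru_access(trace[i]);
    printf("hits: %d / %d\n", hits, n);
    return 0;
}
```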
19. PROWL: A Cache Replacement Policy for Consistency Aware Renewable Powered Devices
- Author
-
Mohammad Abbasinia, Alireza Ejlali, and Ali Hoseinghorban
- Subjects
Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,CPU cache ,Distributed computing ,Response time ,Computer Science Applications ,Renewable energy ,Human-Computer Interaction ,Consistency (database systems) ,Computer Science (miscellaneous) ,Cache ,State (computer science) ,business ,Energy harvesting ,Information Systems ,Efficient energy use - Abstract
Energy harvesting systems powered by renewable energy sources employ hybrid volatile-nonvolatile memory to enhance energy efficiency and forward progress. These systems have unreliable power sources and energy buffers with limited capacity, so they complete long-running applications across multiple power outages. However, a power outage might cause data inconsistency, because the data in nonvolatile memories are persistent while the data in volatile memories are lost. State-of-the-art studies have proposed various memory architectures and compiler-based techniques to tackle data inconsistency in these systems. These approaches impose many unnecessary checkpoints on the system to avoid data inconsistency, and they do not consider the ability of the cache memory to mask and postpone the checkpoints imposed on the system for the sake of consistency. In this paper, we utilize the cache memory and propose PROWL, a consistency-aware cache replacement policy that avoids data inconsistency with fewer checkpoints. The results show that PROWL requires up to 85% fewer checkpoints compared to the state-of-the-art approaches, and that PROWL improves the average response time of the system by up to 65%.
- Published
- 2022
20. Association and Caching in Relay-Assisted mmWave Networks: A Stochastic Geometry Perspective
- Author
-
Chang Wen Chen, Ming Zhang, Haizhou Sun, Hancheng Lu, and Zhuojia Gu
- Subjects
Computer science ,CPU cache ,business.industry ,Applied Mathematics ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Bandwidth (signal processing) ,Computer Science Applications ,law.invention ,Backhaul (telecommunications) ,Base station ,Relay ,law ,Convex optimization ,Electrical and Electronic Engineering ,business ,Stochastic geometry ,Selection algorithm ,Computer network - Abstract
Limited backhaul bandwidth and blockage effects are two main factors limiting the practical deployment of millimeter wave (mmWave) networks. To tackle these issues, we study the feasibility of relaying as well as caching in mmWave networks. A user association and relaying (UAR) criterion dependent on both caching status and maximum biased received power is proposed by considering the spatial correlation caused by the coexistence of base stations (BSs) and relay nodes (RNs). A joint UAR and caching placement problem is then formulated to maximize the backhaul offloading traffic. Using stochastic geometry tools, we decouple the joint UAR and caching placement problem by analyzing the relationship between UAR probabilities and caching placement probabilities. We then optimize the transformed caching placement problem based on polyblock outer approximation by exploiting the monotonic property in the general case and utilizing convex optimization in the noise-limited case. Accordingly, we propose a BS and RN selection algorithm where caching status at BSs and maximum biased received power are jointly considered. Experimental results demonstrate a significant enhancement of backhaul offloading using the proposed algorithms, and show that deploying more RNs and increasing cache size in mmWave networks is a more cost-effective alternative than increasing BS density to achieve similar backhaul offloading performance.
- Published
- 2021
21. Hybrid 2D/1D Blocking as Optimal Matrix-Matrix Multiplication
- Author
-
Gusev, Marjan, Ristov, Sasko, Velkoski, Goran, Markovski, Smile, editor, and Gusev, Marjan, editor
- Published
- 2013
- Full Text
- View/download PDF
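The "blocking" in the title above is the classic cache-blocking (tiling) of matrix-matrix multiplication; a plain 2D-blocked kernel is sketched below as a reference point. The row-major layout, square matrices, and single block size `bs` are assumptions for illustration, and the paper's hybrid 2D/1D scheme and CPU-specific tuning are not reproduced.

```c
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* C += A * B for row-major n x n matrices, processed in bs x bs tiles so
 * that each tile of A, B and C can be reused while it is resident in the
 * CPU cache, instead of streaming whole rows and columns from memory. */
void matmul_blocked(size_t n, size_t bs,
                    const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs)
                for (size_t i = ii; i < MIN(ii + bs, n); i++)
                    for (size_t k = kk; k < MIN(kk + bs, n); k++) {
                        double a = A[i * n + k];      /* reused across the j loop */
                        for (size_t j = jj; j < MIN(jj + bs, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```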
22. MeF-RAM: A New Non-Volatile Cache Memory Based on Magneto-Electric FET
- Author
-
Angizi, Shaahin, Fan, Deliang, Khoshavi, Navid, Dowben, Peter, and Marshall, Andrew
- Subjects
Non-volatile memory ,Hardware_MEMORYSTRUCTURES ,business.industry ,Computer science ,CPU cache ,Electrical engineering ,Electrical and Electronic Engineering ,business ,Computer Graphics and Computer-Aided Design ,Magneto ,Computer Science Applications - Abstract
Magneto-Electric FET (MEFET) is a recently developed post-CMOS FET, which offers intriguing characteristics for high-speed and low-power design in both logic and memory applications. In this article, we present MeF-RAM, a non-volatile cache memory design based on a 2-Transistor-1-MEFET (2T1M) memory bit-cell with separate read and write paths. We show that with proper co-design across the MEFET device, memory cell circuit, and array architecture, MeF-RAM is a promising candidate for fast non-volatile memory (NVM). To evaluate its cache performance in the memory system, we, for the first time, build a device-to-architecture cross-layer evaluation framework to quantitatively analyze and benchmark the MeF-RAM design against other memory technologies, including both volatile memory (i.e., SRAM, eDRAM) and other popular emerging non-volatile memory (i.e., ReRAM, STT-MRAM, and SOT-MRAM). The experimental results for the PARSEC benchmark suite indicate that, as an L2 cache memory, MeF-RAM reduces the Energy-Area-Latency (EAT) product on average by ~98% and ~70% compared with typical 6T-SRAM and 2T1R SOT-MRAM counterparts, respectively.
- Published
- 2021
23. Efficient classification of private memory blocks
- Author
-
Bhargavi R. Upadhyay, Alberto Ros, and Jalpa Shah
- Subjects
Scheme (programming language) ,Multi-core processor ,Hardware_MEMORYSTRUCTURES ,Computer Networks and Communications ,Computer science ,CPU cache ,Translation lookaside buffer ,Directory ,Theoretical Computer Science ,Computer architecture ,Shared memory ,Artificial Intelligence ,Hardware and Architecture ,Granularity ,Latency (engineering) ,computer ,Software ,computer.programming_language - Abstract
Shared memory architectures are pervasive in the multicore technology era. Still, sequential and parallel applications use most of their data as private in a multicore system. Recent proposals exploiting this observation, driven by a classification of memory data as private or shared, can reduce the coherence directory area or the memory access latency. The effectiveness of these proposals depends on the accuracy of the classification. Existing proposals perform the private/shared classification at page granularity, leading to misclassification and reducing the number of detected private memory blocks. We propose a mechanism able to accurately classify memory blocks using the existing translation lookaside buffers (TLBs), which increases the effectiveness of proposals relying on a private/shared classification. Our experimental results show that the proposed scheme reduces L1 cache misses by 25% compared to a page-grain classification approach, which translates into an improvement in system performance of 8.0% with respect to a page-grain approach.
- Published
- 2021
24. A novel low power hybrid cache using GC-EDRAM cells
- Author
-
Shlomo Weiss, Roger Kahn, and Junyi Zhou
- Subjects
Hardware_MEMORYSTRUCTURES ,business.industry ,CPU cache ,Computer science ,Energy consumption ,eDRAM ,CMOS ,Hardware and Architecture ,Embedded system ,Central processing unit ,Cache ,Static random-access memory ,Electrical and Electronic Engineering ,business ,Software ,Dram - Abstract
In a typical embedded CPU, large on-chip storage is critical to meet high performance requirements. However, the fast-increasing size of the on-chip storage based on traditional SRAM cells makes the area cost and energy consumption unsustainable for future embedded applications. Replacing SRAM with DRAM on the CPU’s chip is generally considered not worthwhile because DRAM is not compatible with the common CMOS logic and requires additional processing steps beyond what is required for CMOS. However, a special DRAM technology, Gain-Cell embedded-DRAM (GC-eDRAM) [1], [2], [3], is logic compatible and retains some of the good properties of DRAM (small and low power). In this paper we evaluate the performance of a novel hybrid cache memory where the data array, generally populated with SRAM cells, is replaced with GC-eDRAM cells while the tag array continues to use SRAM cells. Our evaluation of this cache demonstrates that, compared to the conventional SRAM-based designs, our novel architecture exhibits comparable performance with less energy consumption and smaller silicon area, enabling sustainable on-chip storage scaling for future embedded CPUs.
- Published
- 2021
25. Optimization of data allocation in hierarchical memory for blocked shortest paths algorithms
- Author
-
A. A. Prihozhy
- Subjects
hierarchical memory ,equitable coloring ,defective coloring ,CPU cache ,Computer science ,Information technology ,T58.5-58.64 ,direct mapped cache ,Upper and lower bounds ,Randomized algorithm ,Reduction (complexity) ,Matrix (mathematics) ,Equitable coloring ,Cache ,data allocation ,block conflict graph ,Algorithm ,shortest paths algorithm ,performance ,Block (data storage) - Abstract
This paper is devoted to the reduction of data transfer between the main memory and a direct-mapped cache for blocked shortest paths algorithms (BSPA), which represent data by a D[M×M] matrix of blocks. For large graphs, the cache size S = δ×M², δ < 1, is smaller than the matrix size. The cache assigns a group of main memory blocks to a single cache block. BSPA performs multiple recalculations of a block over one or two other blocks and may access up to three blocks simultaneously. If the blocks are assigned to the same cache block, conflicts occur among the blocks, which imply active transfer of data between memory levels. The distribution of blocks over groups and the block conflict count strongly depend on the allocation and ordering of the matrix blocks in main memory. To solve the problem of optimal block allocation, the paper introduces a block conflict weighted graph and recognizes two cases of block mapping: non-conflict and minimum-conflict. In the first case, it formulates an equitable color-class-size constrained coloring problem on the conflict graph and solves it by developing deterministic and random algorithms. In the second case, the paper formulates a problem of weighted defective color-count constrained coloring of the conflict graph and solves it by developing a random algorithm. Experimental results show that the equitable random algorithm provides an upper bound of the cache size that is very close to the lower bound estimated over the size of a complete subgraph, and that a non-conflict matrix allocation is possible at δ = 0.5 for M = 4 and at δ = 0.1 for M = 20. For a low cache size, the weighted defective algorithm gives a number of remaining conflicts that is up to 8.8 times less than the original BSPA. The proposed model and algorithms are applicable to set-associative caches as well.
- Published
- 2021
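The per-block recalculation described in the abstract above ("recalculations of a block over one or two other blocks") corresponds to the standard kernel of a blocked Floyd-Warshall algorithm; a generic version is sketched below. The block size S and the 2D-array block layout are assumptions for illustration, and the paper's conflict-graph-based allocation of blocks in main memory is not shown.

```c
#define S 64   /* block size in vertices (assumption) */

/* One blocked Floyd-Warshall step: update distance block C using blocks A
 * and B, i.e. C[i][j] = min(C[i][j], A[i][k] + B[k][j]) over all k.
 * In the diagonal phase A, B and C are the same block; in the row/column
 * phases two of them coincide; in the peripheral phase all three differ,
 * so up to three blocks are touched simultaneously, as the abstract notes. */
static void calc_block(double C[S][S], const double A[S][S], const double B[S][S])
{
    for (int k = 0; k < S; k++)
        for (int i = 0; i < S; i++)
            for (int j = 0; j < S; j++) {
                double d = A[i][k] + B[k][j];
                if (d < C[i][j])
                    C[i][j] = d;
            }
}
```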
26. Modeling cache performance for embedded systems
- Author
-
A M Ahaneku, C V Chijindu, N J Eneh, Ogechukwu Kingsley Ugwueze, Obinna M. Ezeja, C. C. Udeze, and Edward C. Anoliefo
- Subjects
Hardware_MEMORYSTRUCTURES ,Control and Optimization ,Mean squared error ,Computer Networks and Communications ,Computer science ,business.industry ,CPU cache ,Cumulative distribution function ,Fast Fourier transform ,Reuse ,Set (abstract data type) ,Hardware and Architecture ,Control and Systems Engineering ,Margin (machine learning) ,Embedded system ,Computer Science (miscellaneous) ,Cache ,Electrical and Electronic Engineering ,business ,Instrumentation ,Information Systems - Abstract
This paper presents a cache performance model for embedded systems. The need for efficient cache design in embedded systems has led to the exploration of various design methods for finding optimal cache configurations for embedded processors. Better user experience is realized by improving the performance parameters of embedded systems. This work presents a cache hit-rate estimation model for embedded systems that can be used to explore optimal cache configurations, based on Bernoulli's binomial cumulative probability applied to reuse distance profiles. The model was evaluated using three MiBench benchmarks, bitcount, basicmath and FFT, for cache sizes of 4 KB, 8 KB, 16 KB, 32 KB and 64 KB under 2-way, 4-way, 8-way and 16-way set-associative configurations, all using the least-recently-used (LRU) replacement policy. The results were compared with the results obtained using sim-cheetah from the SimpleScalar simulator suite. The mean errors for the bitcount, basicmath, and FFT benchmarks are 0.0263%, 2.4476%, and 1.9000%, respectively, so the mean error over the three benchmarks is 1.4579%. The margin of error in the results was below 5% and within acceptable limits, showing that the model can be used to estimate cache hit rates and to explore cache design options.
- Published
- 2021
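For orientation on how a binomial cumulative probability turns a reuse distance profile into a hit-rate estimate (the general idea the entry above refers to; the paper's exact formulation may differ), a common illustrative model is sketched below, assuming an $A$-way set-associative LRU cache with $s$ sets and uniform, independent mapping of blocks to sets.

```latex
% A reference with reuse distance d (distinct blocks touched since the last
% access to the same block) hits if fewer than A of those d blocks map to
% the same set (illustrative model, not the paper's exact equations):
P_{\mathrm{hit}}(d) \;=\; \sum_{k=0}^{A-1} \binom{d}{k}
    \left(\tfrac{1}{s}\right)^{k} \left(1-\tfrac{1}{s}\right)^{d-k}.
% The overall hit rate weights this by the measured reuse distance profile p(d):
\mathrm{HitRate} \;=\; \sum_{d} p(d)\, P_{\mathrm{hit}}(d).
```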
27. Performance and Area Trade-Off of 3D-Stacked DRAM Based Chip Multiprocessor with Hybrid Interconnect
- Author
-
Rakesh Pandey and Aryabartta Sahu
- Subjects
Interconnection ,Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,CPU cache ,Memory bandwidth ,Multiprocessing ,Chip ,Die (integrated circuit) ,Computer Science Applications ,Human-Computer Interaction ,Embedded system ,Hardware_INTEGRATEDCIRCUITS ,Computer Science (miscellaneous) ,Cache ,business ,Dram ,Information Systems - Abstract
Nowadays, the number of cores on a chip multiprocessor is growing to increase the system performance. However, inadequate on-chip interconnection and memory bandwidth have diminished the potential of these chip multiprocessors. High performance interconnects, 3D-stacked main memory and large on-chip caches are the architectural parameters used to tackle this issue. For a fixed die size, high performance interconnects, and the 3D-stacked memory fosters the growing rate of the number of cores on a chip multiprocessor whereas increasing the size of on-chip cache poses a restriction. In this work, we study the trade-off between the performance and overall chip area (evaluated using the number of cores, their types and cache size per core) of the chip multiprocessor for different combinations of the interconnection network, DRAM memory (off-chip or on-chip) and self-adaptive page mapping. Our experiments show that for the base-case (chip multiprocessor with the off-chip DRAM and without hybrid interconnect) architecture and without page mapping technique, to increase the core count for a fixed die size, merely shrinking the cache size degrades the performance. Whereas, reducing the cache size and increasing the chip multiprocessor core count along with our considered target architecture and page mapping technique scale up the performance.
- Published
- 2021
28. Spread Estimation With Non-Duplicate Sampling in High-Speed Networks
- Author
-
Qingjun Xiao, Yang Du, Haibo Wang, Shigang Chen, Chaoyi Ma, He Huang, and Yu-E Sun
- Subjects
Computer Networks and Communications ,CPU cache ,Computer science ,Probabilistic logic ,Sampling (statistics) ,Estimator ,Internet traffic ,Computer Science Applications ,Computer engineering ,Query throughput ,System on a chip ,Electrical and Electronic Engineering ,Throughput (business) ,Software - Abstract
Per-flow spread measurement in high-speed networks has many practical applications. It is a more difficult problem than the traditional per-flow size measurement. Most prior work is based on sketches, focusing on reducing their space requirements in order to fit in on-chip cache memory. This design allows the measurement to be performed at the line rate, but it suffers from expensive computation for spread queries (unsuitable for online operations) and large errors in spread estimation for small flows. This paper complements the prior art with a new spread estimator design based on an on-chip/off-chip model. By storing traffic statistics in off-chip memory, our new design faces a key technical challenge to design an efficient online module of non-duplicate sampling that cuts down the off-chip memory access. We first propose a two-stage solution for non-duplicate sampling, which is efficient but cannot handle well a sampling probability that is either too small or too big. We then address this limitation through a three-stage solution that is more space-efficient. Our analysis shows that the proposed spread estimator is highly configurable for a variety of probabilistic performance guarantees. We implement our spread estimator in hardware using FPGA. The experiment results based on real Internet traffic traces show that our estimator produces spread estimation with much better accuracy than the prior art, reducing the mean relative (absolute) error by about one order of magnitude. Moreover, it increases the query throughput by around three orders of magnitude, making it suitable for supporting online queries in real time.
- Published
- 2021
29. Write-Optimized B+ Tree Index Technology for Persistent Memory
- Author
-
Changsheng Xie, Ma Ruixiang, Wei-Jun Li, Bu-Rong Dong, Fei Wu, and Meng Zhang
- Subjects
Write amplification ,CPU cache ,business.industry ,Computer science ,Parallel computing ,Computer Science Applications ,Theoretical Computer Science ,Tree (data structure) ,Computational Theory and Mathematics ,Hardware and Architecture ,Node (computer science) ,Computer data storage ,Benchmark (computing) ,Overhead (computing) ,Latency (engineering) ,business ,Software - Abstract
Due to its low latency, byte-addressability, non-volatility, and high density, persistent memory (PM) is expected to be used to design high-performance storage systems. However, PM also has disadvantages such as limited endurance, which pose challenges to traditional index technologies such as the B+ tree. The B+ tree was originally designed for dynamic random access memory (DRAM)-based or disk-based systems and has a large write amplification problem. High write amplification is detrimental to a PM-based system. This paper proposes WO-tree, a write-optimized B+ tree for PM. WO-tree adopts an unordered write mechanism for the leaf nodes, which reduces the large number of write operations caused by maintaining the entry order in the leaf nodes. When a leaf node is split, WO-tree performs the cache-line flushing operation only after all write operations are completed, which reduces frequent data flushing operations. WO-tree adopts a partial logging mechanism and only writes the log for leaf nodes. Inner nodes recognize data inconsistency through read operations, and the data can be recovered using the leaf node information, thereby significantly reducing the logging overhead. Furthermore, WO-tree adopts a lock-free search for inner nodes, which reduces the locking overhead of concurrent operations. We evaluate WO-tree using the Yahoo! Cloud Serving Benchmark (YCSB) workloads. Compared with the traditional B+ tree, wB-tree, and Fast-Fair, the number of cache-line flushes caused by WO-tree insertion operations is reduced by 84.7%, 22.2%, and 30.8%, respectively, and the execution time is reduced by 84.3%, 27.3%, and 44.7%, respectively.
- Published
- 2021
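The "flush the cache line only after all writes complete" idea in the entry above can be illustrated with x86 persistence primitives. The sketch below is a generic append-then-flush pattern for an unordered leaf; the `_mm_clflush`/`_mm_sfence` intrinsics, the leaf layout, and the slot count are assumptions for illustration, not the WO-tree implementation.

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

#define LEAF_SLOTS 14    /* entries per leaf node (assumption) */

typedef struct {
    uint64_t keys[LEAF_SLOTS];
    uint64_t vals[LEAF_SLOTS];
    uint32_t count;      /* number of valid, unordered entries */
} Leaf;

/* Append a key/value pair to an unordered leaf: write the slot first, flush
 * it once the writes are done, and only then publish it by bumping count.
 * No entry shifting is needed, so fewer cache lines are dirtied and flushed. */
int leaf_append(Leaf *leaf, uint64_t key, uint64_t val)
{
    uint32_t i = leaf->count;
    if (i >= LEAF_SLOTS)
        return -1;                       /* caller must split the leaf */

    leaf->keys[i] = key;
    leaf->vals[i] = val;
    _mm_clflush(&leaf->keys[i]);         /* persist the slot ...            */
    _mm_clflush(&leaf->vals[i]);
    _mm_sfence();                        /* ... before making it reachable  */

    leaf->count = i + 1;                 /* publish the new entry           */
    _mm_clflush(&leaf->count);
    _mm_sfence();
    return 0;
}
```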
30. A Neighborhood Aware Caching and Interest Dissemination Scheme for Content Centric Networks
- Author
-
Krishna Kant and Amitangshu Pal
- Subjects
Scheme (programming language) ,Hardware_MEMORYSTRUCTURES ,Exploit ,Computer Networks and Communications ,Computer science ,business.industry ,CPU cache ,Bloom filter ,Popularity ,Server ,Cache ,Electrical and Electronic Engineering ,Latency (engineering) ,business ,computer ,computer.programming_language ,Computer network - Abstract
Content-Centric Networking (CCN) is a promising framework for the next-generation Internet architecture that exploits ubiquitous in-network caching to minimize content delivery latency and reduce network traffic. In this paper, we introduce a neighborhood-aware mechanism for content caching, named Neighborhood Aware Caching and Interest Dissemination (NACID), that accounts for the popularity of contents and how close the content copies are in the neighborhood. We use a very low-overhead, Bloom-filter-based dissemination of caching information in the neighborhood. Given the neighborhood cached contents, the proposed scheme decides when and how to handle the additional caching of content and its eviction. Simulation results show that NACID performs substantially better than the existing CCN caching policies. We also study different heterogeneous cache memory allocation strategies and show that the simpler homogeneous allocation strategies work almost as well.
- Published
- 2021
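The "very low-overhead, Bloom-filter-based dissemination of caching information" mentioned in the entry above rests on a simple data structure; a minimal Bloom filter is sketched below as a generic illustration. The bit-array size, the number of hashes, and the seeded FNV-1a hash are assumptions, not NACID's actual parameters.

```c
#include <stdint.h>
#include <stddef.h>

#define BF_BITS   4096u   /* bit-array size (assumption) */
#define BF_HASHES    4u   /* number of hash functions (assumption) */

typedef struct { uint8_t bits[BF_BITS / 8]; } Bloom;

/* Seeded FNV-1a hash, used to derive BF_HASHES indices per content name. */
static uint32_t bf_hash(const char *key, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (size_t i = 0; key[i]; i++) {
        h ^= (uint8_t)key[i];
        h *= 16777619u;
    }
    return h;
}

/* Record that a content name is cached (e.g., before advertising the
 * filter to neighboring nodes). */
void bloom_add(Bloom *bf, const char *name)
{
    for (uint32_t k = 0; k < BF_HASHES; k++) {
        uint32_t bit = bf_hash(name, k) % BF_BITS;
        bf->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Check a neighbor's filter: returns 0 if the name is definitely absent,
 * 1 if it is possibly cached (false positives possible, no false negatives). */
int bloom_query(const Bloom *bf, const char *name)
{
    for (uint32_t k = 0; k < BF_HASHES; k++) {
        uint32_t bit = bf_hash(name, k) % BF_BITS;
        if (!(bf->bits[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}
```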
31. LeaD: Large-Scale Edge Cache Deployment Based on Spatio-Temporal WiFi Traffic Statistics
- Author
-
Xuemin Sherman Shen, Minglu Li, Ju Ren, Feng Lyu, Nan Cheng, Peng Yang, and Yaoxue Zhang
- Subjects
Computer Networks and Communications ,Computer science ,CPU cache ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Mobile computing ,020206 networking & telecommunications ,02 engineering and technology ,Bottleneck ,Backhaul (telecommunications) ,Statistics ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Quality of experience ,Cache ,Enhanced Data Rates for GSM Evolution ,Electrical and Electronic Engineering ,Software - Abstract
Widespread and large-scale WiFi systems have been deployed in many corporate locations, while the backhaul capacity becomes the bottleneck in providing high-rate data services to a tremendous number of WiFi users. Mobile edge caching is a promising solution to relieve backhaul pressure and deliver quality services by proactively pushing contents to access points (APs). However, how to deploy caches in a large-scale WiFi system is not yet well studied and is quite challenging, since numerous APs can have heterogeneous traffic characteristics and future traffic conditions are unknown ahead of time. In this paper, given the cache storage budget, we explore cache deployment in a large-scale WiFi system, which contains 8,000 APs and serves more than 40,000 active users, to maximize the long-term caching gain. Specifically, we first collect two months of user association records and conduct intensive spatio-temporal analytics on WiFi traffic consumption, gaining two major observations. First, per-AP traffic consumption varies in a rather wide range and the proportion of APs is distributed evenly within the range, indicating that the cache size should be heterogeneously allocated in accordance with the underlying traffic demands. Second, compared to a single AP, the traffic consumption of a group of APs (clustered by physical location) is more stable, which means that short-term traffic statistics can be used to infer future long-term traffic conditions. We then propose our cache deployment strategy, named LeaD (Large-scale WiFi Edge cAche Deployment), in which we first cluster the large-scale set of APs into well-sized edge nodes, then conduct stationarity testing on edge-level traffic consumption and sample sufficient traffic statistics in order to precisely characterize long-term traffic conditions, and finally devise the TEG (Traffic-wEighted Greedy) algorithm to solve the long-term caching gain maximization problem. Extensive trace-driven experiments are carried out, and the results demonstrate that LeaD achieves near-optimal caching performance and can outperform other benchmark strategies significantly.
- Published
- 2021
32. On-path caching based on content relevance in Information-Centric Networking
- Author
-
Hanbo Wang, Wu Yang, Dapeng Man, Qi Lu, Jiguang Lv, and Jiafei Guo
- Subjects
Network architecture ,Computer Networks and Communications ,business.industry ,CPU cache ,Computer science ,Node (networking) ,020206 networking & telecommunications ,02 engineering and technology ,Telecommunications network ,Information-centric networking ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Relevance (information retrieval) ,The Internet ,Cache ,business ,Computer network - Abstract
The Internet of Things (IoT) is a new network concept. Based on the Internet, it proposes the idea of connecting everything and is likely to become the development direction of computing and communication in the future. Meanwhile, as a disruptive new communication network model, the Information-Centric Network (ICN) has become a hot spot in research on future network architectures in recent years. ICN is information-centric and uses the name of the information to implement data identification, retrieval, routing and forwarding. The information-centric network uses the cache as a built-in structure, and by default nodes store all the data that flows by, so that subsequent requests can be responded to as soon as possible. In an actual network, the spread of a piece of information constantly changes with time, and its popularity differs across time periods. However, the existing ICN native caching mechanism ignores the correlation between contents, and existing related research makes insufficient use of content relevance. In this paper, based on the fact that a certain correlation may exist between the contents that nodes have accessed, we design an on-path caching method based on content relevance: caching decisions are made by discovering the correlation between the target content and the content stored at nodes, while also considering each node's position in the path, making the cache memory more efficient. The feasibility and effectiveness of the proposed method are verified by simulation experiments under different parameters.
- Published
- 2021
33. Towards Enhanced System Efficiency while Mitigating Row Hammer
- Author
-
Shirshendu Das, Dip Sankar Banerjee, and Kaustav Goswami
- Subjects
010302 applied physics ,Hardware_MEMORYSTRUCTURES ,Computer science ,CPU cache ,Row hammer ,Probabilistic logic ,02 engineering and technology ,Parallel computing ,Blocking (statistics) ,01 natural sciences ,020202 computer hardware & architecture ,Hardware and Architecture ,0103 physical sciences ,Lookup table ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Row ,Software ,Dram ,Information Systems - Abstract
In recent years, DRAM-based main memories have become susceptible to the Row Hammer (RH) problem, which causes bits to flip in a row without accessing it directly. Frequent activation of a row, called an aggressor row, causes the bits of its adjacent (victim) rows to flip. The state-of-the-art solution is to refresh the victim rows explicitly to prevent bit flipping. Several proposals have been made to detect RH attacks, including both probabilistic and deterministic counter-based methods. The technique for handling RH attacks, however, remains the same. In this work, we propose an efficient technique for handling the RH problem and show that the mechanism is agnostic of the detection mechanism. Our RH handling technique omits the necessity of refreshing the victim rows. Instead, we use a small non-volatile Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM) that avoids unnecessary refreshes of the victim rows on the DRAM device and thus allows more time for normal applications on the same DRAM device. Our model relies on the migration of aggressor rows, which removes the blocking of DRAM operations due to the refreshing of victim rows incurred in the previous solution. After extensive evaluation, we found that, compared to conventional RH mitigation techniques, our model reduces the blocking time of the memory imposed by explicit refreshing by an average of 80.72% in the worst-case scenario and provides energy savings of about 15.82% on average across different types of RH-based workloads. A lookup table is necessary to pinpoint the location of a particular row, which, when combined with the STT-MRAM, limits the storage overhead to 0.39% of a 2 GB DRAM. Our proposed model prevents repeated refreshing of the same victim rows in different refresh windows on the DRAM device and leads to an efficient RH handling technique.
- Published
- 2021
34. Implementation of Meltdown Attack Simulation for Cybersecurity Awareness Material
- Author
-
Eka Chattra and Obrin Candra Brillyant
- Subjects
Password ,Scheme (programming language) ,Source code ,Out-of-order execution ,Exploit ,CPU cache ,Computer science ,Applied Mathematics ,media_common.quotation_subject ,Vulnerability ,Cyber-physical system ,Computer security ,computer.software_genre ,computer ,media_common ,computer.programming_language - Abstract
One of the rising risks in cybersecurity is an attack on cyber-physical systems. Today's computer systems have evolved through the development of processor technology, namely the use of optimization techniques such as out-of-order execution. Using this technique, processors can improve computing system performance without sacrificing manufacturing processes. However, the use of these optimization techniques has vulnerabilities, especially on Intel processors. The vulnerability takes the form of data exfiltration through the cache memory that can be exploited by an attack. Meltdown is an exploit attack that takes advantage of such vulnerabilities in modern Intel processors. This vulnerability can be used to extract data that is processed on a computer device using these processors, such as passwords, messages, or other credentials. In this paper, we use a qualitative research approach that aims to describe a simulation of the Meltdown attack in a safe environment, applying a known Meltdown attack scheme and source code to simulate the attack on an Intel Core i7 platform running Linux OS. We then modified the source code to prove the concept that the Meltdown attack can extract data on devices using Intel processors without consent from the authorized user.
- Published
- 2021
35. DOVA PRO: A Dynamic Overwriting Voltage Adjustment Technique for STT-MRAM L1 Cache Considering Dielectric Breakdown Effect
- Author
-
Xiaochen Guo, Yuanqing Cheng, Patrick Girard, Chengcheng Lu, Jiacheng Ni, Jinbo Chen, School of Electronics and Information Engineering, Beihang University, 100191 Beijing, China, Beihang University (BUAA), TEST (TEST), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), and School of Integrated Circuit Science and Technology [Beihang Univ] (SME BUAA)
- Subjects
CPU cache ,Computer science ,02 engineering and technology ,Power budget ,Bottleneck ,automatic write-back ,0202 electrical engineering, electronic engineering, information engineering ,Breakdown voltage ,Static random-access memory ,[SPI.NANO]Engineering Sciences [physics]/Micro and nanotechnologies/Microelectronics ,Electrical and Electronic Engineering ,signature improvement ,Magnetoresistive random-access memory ,Hardware_MEMORYSTRUCTURES ,business.industry ,physical unclonable function (PUF) ,020202 computer hardware & architecture ,ComputingMilieux_MANAGEMENTOFCOMPUTINGANDINFORMATIONSYSTEMS ,13. Climate action ,Hardware and Architecture ,Embedded system ,hardware security ,Cache ,Spin-Transfer Torque magnetic Cell (STT-mCell) ,business ,Software ,Voltage - Abstract
As device integration density increases exponentially as predicted by Moore’s law, power consumption becomes a bottleneck for system scaling, with the leakage power of on-chip caches occupying a large fraction of the total power budget. Spin-transfer torque magnetic random access memory (STT-MRAM) is a promising candidate to replace static random access memory (SRAM) as an on-chip last-level cache (LLC) due to its ultralow leakage power, high integration density, and nonvolatility. Moreover, with the prevalence of edge computing and Internet-of-Things (IoT) applications, it can be beneficial to build a fully nonvolatile cache hierarchy, including the L1 cache. However, building an L1 cache with STT-MRAM still faces severe challenges, particularly because reducing its relatively high write latency by increasing the write voltage can accelerate oxide breakdown of the MTJ device and, given the intensive accesses an L1 cache sees, significantly threaten its lifetime. In our previous work, we proposed a dynamic overwriting voltage adjustment (DOVA) technique to deal with this challenge. In this article, we improve on that technique with DOVA PRO, a DOVA promotion technique for the STT-MRAM L1 cache that considers cache write endurance and performance simultaneously. A high write voltage is used for performance-critical cache lines, while a low write voltage is used for other cache lines, approaching an optimal tradeoff between reliability and performance. Experimental results show that DOVA PRO improves cache performance by 23.5% on average compared to the DOVA technique, while the average degradation of cache lifetime remains almost unchanged. Furthermore, DOVA PRO supports flexible configurations to achieve various optimization targets, such as higher performance or a longer lifetime.
- Published
- 2021
36. Survey of CPU Cache-Based Side-Channel Attacks: Systematic Analysis, Security Models, and Countermeasures
- Author
-
Chao Su and Qingkai Zeng
- Subjects
Science (General) ,Computer Networks and Communications ,Computer science ,CPU cache ,0211 other engineering and technologies ,Cryptography ,0102 computer and information sciences ,02 engineering and technology ,Computer security ,computer.software_genre ,01 natural sciences ,Q1-390 ,T1-995 ,Side channel attack ,Technology (General) ,Vulnerability (computing) ,021110 strategic, defence & security studies ,business.industry ,Information security ,Attack surface ,Computer security model ,010201 computation theory & mathematics ,Cache ,business ,computer ,Information Systems - Abstract
Privacy protection is an essential part of information security. The use of shared resources demands more privacy and security protection, especially in cloud computing environments. Side-channel attacks based on the CPU cache exploit caches shared within the same physical device to compromise the system’s privacy (encryption keys, program status, etc.). Information is leaked through channels that are not intended to transmit information, jeopardizing system security. These attacks are both highly covert and high-risk. Although architectural improvements make it harder to achieve system intrusion and privacy leakage through traditional methods, side-channel attacks bypass those defenses because of the shared hardware, and being difficult to detect, they are all the more dangerous in modern computer systems. Although some researchers have surveyed side-channel attacks, their studies are limited to cryptographic modules such as elliptic curve cryptosystems; the discussions are based on real-world applications (e.g., Curve25519), and there is no systematic analysis of the related attack and security models. This paper first compares different types of cache-based side-channel attacks. Based on the comparison, a security model is proposed that describes the attacks along four key aspects, namely, vulnerability, cache type, pattern, and range. By reviewing the corresponding defense methods, it reveals from which perspectives defense strategies are effective against side-channel attacks. Finally, the challenges and research trends of CPU cache-based side-channel attacks, in both attacking and defending, are explored. The systematic analysis of CPU cache-based side-channel attacks highlights the fact that these attacks are more dangerous than expected. We believe our survey will draw developers’ attention to side-channel attacks and help to reduce the attack surface in the future.
- Published
- 2021
37. CacheInspector
- Author
-
Lotfi Benmohamed, Frederic J. de Vaulx, Hakim Weatherspoon, Zhiming Shen, Christina Delimitrou, Charif Mahmoudi, Robbert van Renesse, and Weijia Song
- Subjects
010302 applied physics ,Reverse engineering ,Multi-core processor ,Computer science ,business.industry ,CPU cache ,Distributed computing ,Cloud computing ,Throughput ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,CAS latency ,020202 computer hardware & architecture ,Hardware and Architecture ,Virtual machine ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,business ,computer ,Software ,Information Systems - Abstract
Infrastructure-as-a-Service cloud providers sell virtual machines that are only specified in terms of number of CPU cores, amount of memory, and I/O throughput. Performance-critical aspects such as cache sizes and memory latency are missing or reported in ways that make them hard to compare across cloud providers. It is difficult for users to adapt their application’s behavior to the available resources. In this work, we aim to increase the visibility that cloud users have into shared resources on public clouds. Specifically, we present CacheInspector , a lightweight runtime that determines the performance and allocated capacity of shared caches on multi-tenant public clouds. We validate CacheInspector ’s accuracy in a controlled environment, and use it to study the characteristics and variability of cache resources in the cloud, across time, instances, availability regions, and cloud providers. We show that CacheInspector ’s output allows cloud users to tailor their application’s behavior, including their output quality, to avoid suboptimal performance when resources are scarce.
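The abstract does not detail CacheInspector's measurement method, but tools of this kind typically build on the classic working-set timing experiment: chase a pointer chain of growing size and watch the average load latency jump as the working set spills out of each cache level. The sketch below illustrates that baseline technique only; the sizes, step count, and single-cycle permutation are illustrative choices, not CacheInspector's actual code.

// Classic cache-capacity probe: time dependent loads over a randomly permuted
// pointer chain of increasing size; latency plateaus mark the cache levels.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

static double avg_latency_ns(std::size_t n_elems) {
    // Build a single-cycle random permutation (Sattolo's algorithm) so the
    // chase visits every element before repeating.
    std::vector<std::size_t> next(n_elems);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng{42};
    for (std::size_t i = n_elems - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    volatile std::size_t idx = 0;
    const std::size_t steps = 1u << 22;               // dependent loads to time
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    // Working sets from 16 KiB to 64 MiB; latency plateaus mark cache levels.
    for (std::size_t kib = 16; kib <= 64 * 1024; kib *= 2)
        std::printf("%8zu KiB : %6.2f ns/access\n",
                    kib, avg_latency_ns(kib * 1024 / sizeof(std::size_t)));
}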
- Published
- 2021
38. PAVER
- Author
-
Daniel Wong, AmirAli Abdolrashidi, Devashree Tripathy, Laxmi N. Bhuyan, and Liang Zhou
- Subjects
Dependency graph ,Hardware and Architecture ,CPU cache ,Computer science ,Locality ,Thrashing ,Graph (abstract data type) ,Multiprocessing ,Thread (computing) ,Parallel computing ,Software ,Information Systems ,Scheduling (computing) - Abstract
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have tried to use the locality behavior of regular and structured applications in thread scheduling, but the difficult case of irregular and unstructured parallel applications remains to be explored. We present PAVER, a Priority-Aware Vertex schedulER, which takes a graph-theoretic approach to thread scheduling. We analyze the cache locality behavior among thread blocks (TBs) through just-in-time compilation and represent the problem as a graph of the TBs and the locality among them. This graph is then partitioned into TB groups that display maximum data sharing, which are assigned to the same streaming multiprocessor by the locality-aware TB scheduler. Through exhaustive simulation on the Fermi, Pascal, and Volta architectures using a number of scheduling techniques, we show that PAVER reduces L2 accesses by 43.3%, 48.5%, and 40.21% and increases the average performance benefit by 29%, 49.1%, and 41.2% for benchmarks with high inter-TB locality.
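As a rough illustration of the grouping step, the sketch below greedily merges thread blocks connected by the heaviest data-sharing edges into bounded-size groups, each of which would then be assigned to one streaming multiprocessor. PAVER's actual JIT-based locality analysis and graph partitioner are more sophisticated; the edge weights, group-size bound, and greedy pass here are assumptions made purely for illustration.

// Greedy grouping of thread blocks (TBs) by pairwise locality weight, in the
// spirit of graph-based locality-aware scheduling. Illustrative only.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Edge { int a, b; double sharedBytes; };

std::vector<int> groupTBs(int numTBs, std::vector<Edge> edges, int groupSize) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge &x, const Edge &y) { return x.sharedBytes > y.sharedBytes; });

    std::vector<int> group(numTBs, -1);
    std::vector<int> count;                       // members per group
    int nextGroup = 0;
    for (const Edge &e : edges) {                 // heaviest sharing first
        if (group[e.a] == -1 && group[e.b] == -1) {
            group[e.a] = group[e.b] = nextGroup;
            count.push_back(2);
            ++nextGroup;
        } else if (group[e.a] == -1 && count[group[e.b]] < groupSize) {
            group[e.a] = group[e.b]; ++count[group[e.a]];
        } else if (group[e.b] == -1 && count[group[e.a]] < groupSize) {
            group[e.b] = group[e.a]; ++count[group[e.b]];
        }
    }
    for (int &g : group)                          // isolated TBs get their own group
        if (g == -1) { g = nextGroup++; count.push_back(1); }
    return group;                                 // group id maps to a target SM
}

int main() {
    std::vector<Edge> edges = {{0, 1, 4096}, {1, 2, 2048}, {3, 4, 1024}};
    auto g = groupTBs(5, edges, 3);
    for (int tb = 0; tb < 5; ++tb) std::printf("TB %d -> group %d\n", tb, g[tb]);
}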
- Published
- 2021
39. An Approach to Reduce the Redundancy of Placement Delivery Array Schemes for Random Demands
- Author
-
Minquan Cheng, Kai Wan, Qiaoling Zhang, and Xiaojun Li
- Subjects
Scheme (programming language) ,Multicast ,business.industry ,Wireless network ,CPU cache ,Computer science ,05 social sciences ,050801 communication & media studies ,020206 networking & telecommunications ,02 engineering and technology ,Information theory ,0508 media and communications ,Control and Systems Engineering ,Multicast algorithms ,Server ,0202 electrical engineering, electronic engineering, information engineering ,Redundancy (engineering) ,Electrical and Electronic Engineering ,business ,computer ,Computer network ,computer.programming_language - Abstract
Centralized coded caching schemes have been widely studied in wireless networks. A placement delivery array (PDA) can be used to generate a coded caching scheme with low subpacketization when all requested files are different. However, all existing PDA-based schemes treat each user’s request as an independent file, without leveraging the multicast opportunities that arise when one file is requested by several users. In this letter, we propose an approach that, for any existing PDA-based scheme, removes the redundant multicast messages after the coded multicast messages have been generated.
- Published
- 2021
40. Comparative Analysis of Processor-FPGA Communication Performance in Low-Cost FPSoCs
- Author
-
Jose Farina, Roberto Fernandez Molanes, Juan J. Rodriguez-Andina, and Lucia Costas
- Subjects
business.industry ,Computer science ,CPU cache ,Controller (computing) ,020208 electrical & electronic engineering ,02 engineering and technology ,computer.software_genre ,Chip ,Computer Science Applications ,Control and Systems Engineering ,Embedded system ,0202 electrical engineering, electronic engineering, information engineering ,Compiler ,Electrical and Electronic Engineering ,Field-programmable gate array ,business ,computer ,Information Systems ,Data transmission - Abstract
Field-programmable system-on-chip (FPSoC) devices, combining high-performance processors and FPGA fabric in the same chip, are currently a leading technology in the design of complex digital systems. Since design times are longer than those of systems based on graphics processing units or standalone processors, many efforts are being devoted to developing efficient compilers from high-level languages. Even so, efficient processor-FPGA communication remains an important open issue. To contribute to this area, this article presents an extensive characterization of processor-FPGA communication delays in Zynq-7000 devices. Although partial analyses of communication performance in these devices have been reported, this is the first work to address very important issues such as the use of DMA for data transfers and the effect of L2 cache controller and external RAM controller settings. As a result, data transfer rates are analyzed considering all parameters that influence them. The performance of Zynq-7000 devices is also compared to that of Cyclone V devices, hence covering the two most important families that currently dominate the FPSoC market. This information is of utmost importance for designers to optimize processor-FPGA communication and, in turn, the performance of their FPSoC-based systems.
- Published
- 2021
41. A Novel Hilbert Curve for Cache-Locality Preserving Loops
- Author
-
Christian Bohm, Martin Perdacher, and Claudia Plant
- Subjects
Information Systems and Management ,Memory hierarchy ,CPU cache ,Computer science ,Z-order curve ,02 engineering and technology ,Parallel computing ,Cache-oblivious algorithm ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Cache ,Nested loop join ,Time complexity ,Information Systems ,Cholesky decomposition - Abstract
Modern microprocessors offer a rich memory hierarchy including various levels of cache and registers. Some of these memories (like main memory and the L3 cache) are big but slow and shared among all cores. Others (registers, L1 cache) are fast and exclusively assigned to a single core, but small. Only if data accesses have high locality can we avoid excessive data transfers across the memory hierarchy. In this paper we consider fundamental algorithms like matrix multiplication, K-Means, Cholesky decomposition, and the algorithm by Floyd and Warshall, which typically operate in two or three nested loops. We propose to traverse these loops, whenever possible, not in the canonical order but in an order defined by a space-filling curve. This traversal order dramatically improves data locality over a wide range of granularities, allowing us not only to efficiently support a cache of a single, known size (cache conscious) but also a hierarchy of caches whose effective size available to our algorithms may even be unknown (cache oblivious). We propose a new space-filling curve called Fast Unrestricted (FUR) Hilbert with the following advantages: (1) we overcome the usual limitation to square-like grid sizes whose side length is a power of 2 or 3; instead, our approach allows arbitrary loop boundaries for all variables. (2) FUR-Hilbert is non-recursive with a guaranteed constant worst-case time complexity per loop iteration (in contrast to O(log(grid-size)) for previous methods). (3) Our non-recursive approach makes the application of our cache-oblivious loops in any host algorithm as easy as conventional loops and facilitates automatic optimization by the compiler. (4) We demonstrate that crucial algorithms like Cholesky decomposition and the algorithm by Floyd and Warshall can be efficiently supported. (5) Extensive experiments on runtime efficiency, cache usage, and energy consumption demonstrate the benefit of our approach. We believe that future compilers could translate nested loops into cache-oblivious loops either fully automatically or through a user-guided analysis of the data dependencies.
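For intuition, the sketch below replaces a canonical two-level loop nest with a traversal in Hilbert order, using the standard power-of-two Hilbert mapping. This is not the paper's FUR-Hilbert curve, which lifts the power-of-two restriction and guarantees constant time per iteration; the example only shows what "iterating a loop nest along a space-filling curve" means in code.

// Hilbert-order traversal of an n x n iteration space (n a power of two),
// replacing the canonical "for i / for j" order with a locality-preserving one.
#include <cstdio>
#include <utility>

// Map position d along the curve to grid coordinates (x, y).
static void d2xy(int n, int d, int &x, int &y) {
    x = y = 0;
    for (int s = 1; s < n; s *= 2) {
        int rx = 1 & (d / 2);
        int ry = 1 & (d ^ rx);
        if (ry == 0) {                       // rotate the quadrant
            if (rx == 1) { x = s - 1 - x; y = s - 1 - y; }
            std::swap(x, y);
        }
        x += s * rx;
        y += s * ry;
        d /= 4;
    }
}

int main() {
    const int n = 8;                         // 8 x 8 iteration space
    static double a[n][n], acc = 0.0;
    // Canonical order would be:  for (i) for (j) acc += a[i][j];
    // Hilbert order visits neighbouring cells consecutively instead:
    for (int d = 0; d < n * n; ++d) {
        int i, j;
        d2xy(n, d, i, j);
        acc += a[i][j];                      // same work, cache-friendlier order
    }
    std::printf("sum = %f\n", acc);
}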
- Published
- 2021
42. Spatial Concentration of Caching in Wireless Heterogeneous Networks
- Author
-
Derya Malak, Muriel Medard, and Jeffrey G. Andrews
- Subjects
Networking and Internet Architecture (cs.NI) ,FOS: Computer and information sciences ,Discrete mathematics ,Hardware_MEMORYSTRUCTURES ,CPU cache ,Computer Science - Information Theory ,Information Theory (cs.IT) ,Applied Mathematics ,020206 networking & telecommunications ,Throughput ,02 engineering and technology ,State (functional analysis) ,Computer Science Applications ,Computer Science - Networking and Internet Architecture ,Reduction (complexity) ,Knapsack problem ,Content (measure theory) ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,Electrical and Electronic Engineering ,Heterogeneous network ,Mathematics - Abstract
We propose a decentralized caching policy for wireless heterogeneous networks that makes content placement decisions based on pairwise interactions between cache nodes. We call our proposed scheme γ-exclusion cache placement (GEC), where a parameter γ controls an exclusion radius that discourages nearby caches from storing redundant content. GEC takes into account item popularity and the nodes' caching priorities and leverages negative dependence to relax the classic 0-1 knapsack problem to yield spatially balanced sampling across caches. We show that GEC guarantees a better concentration (reduced variance) of the required cache storage size than the state of the art, and that the cache size constraints can be satisfied with high probability. Given a cache hit probability target, we compare the 95% confidence intervals of the required cache sizes for three caching schemes: (i) independent placement, (ii) hard exclusion caching (HEC), and (iii) the proposed GEC approach. For uniform spatial traffic, we demonstrate that GEC provides approximately a 3x and 2x reduction in required cache size over (i) and (ii), respectively. For non-uniform spatial traffic based on realistic peak-hour variations in urban scenarios, the gains are even greater.
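The exclusion-radius idea can be illustrated with a toy placement pass: a cache node stores an item only if no node within distance γ already stores it. The real GEC policy additionally weights item popularity and caching priorities and uses randomized, negatively dependent sampling; none of that is modelled in the sketch below, whose node layout and greedy order are invented for illustration.

// Toy exclusion-radius placement test: store an item only if no node within
// radius gamma already stores it. Illustrative only, not the GEC algorithm.
#include <cmath>
#include <cstdio>
#include <vector>

struct Node { double x, y; bool stores = false; };

static bool withinRadius(const Node &a, const Node &b, double gamma) {
    return std::hypot(a.x - b.x, a.y - b.y) <= gamma;
}

// Greedy pass over nodes (e.g., in priority order) applying the exclusion rule.
static void placeItem(std::vector<Node> &nodes, double gamma) {
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        bool excluded = false;
        for (std::size_t j = 0; j < i && !excluded; ++j)
            excluded = nodes[j].stores && withinRadius(nodes[i], nodes[j], gamma);
        nodes[i].stores = !excluded;          // store only if no nearby copy exists
    }
}

int main() {
    std::vector<Node> nodes = {{0, 0}, {1, 0}, {5, 0}, {5.5, 0}};
    placeItem(nodes, 2.0);
    for (const auto &n : nodes)
        std::printf("node (%.1f, %.1f) stores: %s\n", n.x, n.y, n.stores ? "yes" : "no");
}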
- Published
- 2021
43. Performance comparison between OOD and DOD with multithreading in games
- Author
-
Wingqvist, David, Wickström, Filip, Wingqvist, David, and Wickström, Filip
- Abstract
Background. The frame rate of a game is important for both the end user and the developer. Maintaining at least 60 FPS in a PC game is the current standard, and demands for efficient game applications are rising. The current industry standard in programming is Object-Oriented Design (OOD), but with the trend toward larger games, this frame rate might not be maintainable using OOD. A design pattern that mitigates this is Data-Oriented Design (DOD), which focuses on utilizing the CPU and memory efficiently. These design patterns differ in how they lay out and handle the data associated with them. Objectives. In this thesis, two games were created, each in two versions using either OOD or DOD; the first game also included multithreading. Since modern hardware provides several CPU cores, this thesis compares both single-threaded and multithreaded versions of these design patterns. Methods. Experiments were conducted to measure execution time and CPU cache misses. Each experiment started with a baseline that was gradually increased to stress the systems under test. Results. The results showed that the sections of code using DOD were significantly faster than their OOD counterparts. DOD also had a better affinity with multithreading and, in certain parts, achieved up to 13 times the speed of the equivalent OOD code. In the special-case comparison, DOD proved faster than OOD even though it used larger objects. Conclusions. DOD has shown to be significantly faster in execution time with fewer cache misses compared to OOD, and using multithreading with DOD proved to be the most efficient.
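The core layout difference between the two patterns compared in the thesis can be sketched as follows: OOD typically stores each entity as one object (array of structs), while DOD stores each field in its own contiguous array (struct of arrays), so a hot loop touches only the bytes it needs and splits naturally across threads by index range. The entity fields and update rule below are invented for illustration and are not taken from the thesis.

// Array-of-structs (OOD-style) versus struct-of-arrays (DOD-style) update.
#include <cstddef>
#include <vector>

// Object-Oriented Design: each entity owns all of its data; updating positions
// drags the unused (cold) fields through the cache as well.
struct EntityOOD {
    float x, y, z;
    float vx, vy, vz;
    float health, armor;     // cold data mixed in with the hot data
};

void updateOOD(std::vector<EntityOOD> &entities, float dt) {
    for (auto &e : entities) { e.x += e.vx * dt; e.y += e.vy * dt; e.z += e.vz * dt; }
}

// Data-Oriented Design: fields live in separate contiguous arrays, so the
// position/velocity loop reads only what it needs and is trivially split
// into [begin, end) ranges for worker threads.
struct EntitiesDOD {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> health, armor;
};

void updateDOD(EntitiesDOD &e, float dt, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        e.x[i] += e.vx[i] * dt;
        e.y[i] += e.vy[i] * dt;
        e.z[i] += e.vz[i] * dt;
    }
}

int main() {
    std::vector<EntityOOD> oodEntities(1000);
    updateOOD(oodEntities, 0.016f);

    EntitiesDOD dod;
    dod.x.assign(1000, 0.0f);  dod.y.assign(1000, 0.0f);  dod.z.assign(1000, 0.0f);
    dod.vx.assign(1000, 1.0f); dod.vy.assign(1000, 1.0f); dod.vz.assign(1000, 1.0f);
    updateDOD(dod, 0.016f, 0, dod.x.size());   // in practice, split the range across threads
}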
- Published
- 2022
44. Accelerated superpixel image segmentation with a parallelized DBSCAN algorithm
- Author
-
Seng Cheong Loke, Burkhard C. Wünsche, Bruce A. MacDonald, and Matthew Parsons
- Subjects
DBSCAN ,CPU cache ,Computer science ,business.industry ,Process (computing) ,020207 software engineering ,Pattern recognition ,02 engineering and technology ,Image segmentation ,Computer graphics ,Pattern recognition (psychology) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Segmentation ,Noise (video) ,Artificial intelligence ,business ,Information Systems - Abstract
Segmentation of an image into superpixel clusters is a necessary part of many imaging pipelines. In this article, we describe a new routine for superpixel image segmentation (F-DBSCAN), based on the DBSCAN algorithm, that is six times faster than existing methods while being competitive in terms of segmentation quality and resistance to noise. The gains in speed are achieved through efficient parallelization of the cluster search process: limiting the size of each cluster enables the processes to operate in parallel without duplicating search areas. Calculations are performed in large consolidated memory buffers, which eliminates fragmentation and maximizes memory cache hits, improving performance. When tested on the Berkeley Segmentation Dataset, the average processing speed is 175 frames/s with a Boundary Recall of 0.797 and an Achievable Segmentation Accuracy of 0.944.
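The parallelization idea, growing each cluster only up to a fixed maximum size so that independent seeds never overlap in their search areas, can be sketched in simplified form as follows. The snippet is single-threaded, grayscale, and uses a plain 4-neighbourhood similarity test; the actual F-DBSCAN routine is considerably more elaborate.

// Simplified size-capped cluster growth for superpixel-style labelling.
#include <cmath>
#include <cstdio>
#include <queue>
#include <vector>

struct Image { int w, h; std::vector<float> px; };   // grayscale intensities

std::vector<int> growClusters(const Image &img, float eps, int maxSize) {
    std::vector<int> label(img.w * img.h, -1);
    int next = 0;
    for (int seed = 0; seed < img.w * img.h; ++seed) {
        if (label[seed] != -1) continue;             // already claimed by a cluster
        int size = 0;
        std::queue<int> frontier;
        frontier.push(seed);
        label[seed] = next;
        while (!frontier.empty() && size < maxSize) {  // cap bounds the search area
            int p = frontier.front(); frontier.pop();
            ++size;
            int x = p % img.w, y = p / img.w;
            const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
            for (int k = 0; k < 4; ++k) {            // 4-neighbourhood expansion
                int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || ny < 0 || nx >= img.w || ny >= img.h) continue;
                int q = ny * img.w + nx;
                if (label[q] == -1 && std::fabs(img.px[q] - img.px[p]) <= eps) {
                    label[q] = next;                 // claim before pushing: no duplicates
                    frontier.push(q);
                }
            }
        }
        ++next;
    }
    return label;                                    // superpixel id per pixel
}

int main() {
    Image img{4, 4, std::vector<float>(16, 0.5f)};
    auto labels = growClusters(img, 0.1f, 6);
    for (int y = 0; y < img.h; ++y) {
        for (int x = 0; x < img.w; ++x) std::printf("%2d ", labels[y * img.w + x]);
        std::printf("\n");
    }
}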
- Published
- 2021
45. A Survey on Cache Timing Channel Attacks for Multicore Processors
- Author
-
Jaspinder Kaur and Shirshendu Das
- Subjects
Multi-core processor ,business.industry ,Computer science ,CPU cache ,Key (cryptography) ,Overhead (computing) ,Cryptography ,Cache ,business ,Branch predictor ,Communication channel ,Computer network - Abstract
Cache timing channel attacks have attracted a great deal of attention in the last decade. These attacks exploit the timing channel created by the significant time gap between cache and main memory accesses. They have been successfully used to leak the secret keys of various cryptographic algorithms. The latest advancements in cache attacks also exploit other micro-architectural components, such as hardware prefetchers, the branch predictor, and the replacement engine, in addition to the cache memory. Detecting these attacks is difficult, as the attacker process running on the processor must be identified before a significant portion of the attack is complete. The major challenge for mitigation and defense mechanisms against these attacks is maintaining system performance while disabling or avoiding the attacks: the overhead caused by detection, mitigation, and defense mechanisms must not significantly degrade system performance. This paper discusses in detail the research carried out in three aspects of cache security: cache timing channel attacks, techniques for detecting these attacks, and defense mechanisms.
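As a reference point for the attacks surveyed, the sketch below shows one round of the Prime+Probe primitive: the attacker fills a cache set with its own lines, lets the victim run, and then times re-accesses, where elevated latency indicates that the victim touched the set. Construction of the eviction set and calibration of the hit/miss threshold are assumed to be done elsewhere; the interface and names are illustrative, and the code assumes an x86 compiler with intrinsics.

// One Prime+Probe round over a pre-built eviction set. Illustrative only.
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>

// Time one load with rdtscp so earlier instructions have retired.
static inline uint64_t timed_load(const volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - t0;
}

// Returns true if the victim likely accessed the monitored set during the wait.
bool prime_probe_round(const std::vector<volatile uint8_t *> &evictionSet,
                       uint64_t missThreshold) {
    for (auto *line : evictionSet) (void)*line;      // PRIME: fill the set with our lines

    // ... wait while the victim runs (omitted) ...

    uint64_t total = 0;
    for (auto *line : evictionSet)                   // PROBE: re-access and time
        total += timed_load(line);
    // If the victim touched the set, some of our lines were evicted and the
    // probe sees main-memory latency instead of cache hits.
    return total / evictionSet.size() > missThreshold;
}

int main() {
    static uint8_t lines[8 * 4096];                  // stand-in for a real eviction set
    std::vector<volatile uint8_t *> evictionSet;
    for (int i = 0; i < 8; ++i) evictionSet.push_back(&lines[i * 4096]);
    bool touched = prime_probe_round(evictionSet, /*missThreshold=*/200);
    std::printf("set touched by victim: %s\n", touched ? "yes" : "no");
}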
- Published
- 2021
46. A Novel Joint Mobile Cache and Power Management Scheme for Energy-Efficient Mobile Augmented Reality Service in Mobile Edge Computing
- Author
-
Jung-Yeon Hwang, Dusit Niyato, Joo-hyung Lee, Yong-jun Seo, Jun Kyun Choi, and Hong-Shik Park
- Subjects
Power management ,Hardware_MEMORYSTRUCTURES ,Mobile edge computing ,business.industry ,Computer science ,CPU cache ,020206 networking & telecommunications ,020302 automobile design & engineering ,02 engineering and technology ,Energy consumption ,0203 mechanical engineering ,Control and Systems Engineering ,0202 electrical engineering, electronic engineering, information engineering ,Wireless ,Augmented reality ,Cache ,Electrical and Electronic Engineering ,business ,Efficient energy use ,Computer network - Abstract
In this letter, we propose a novel joint mobile cache and power management scheme for energy-efficient mobile augmented reality (MAR) services in mobile edge computing (MEC). For this purpose, depending on whether the cache is hit or not, analytical models for 1) the energy consumption of an MAR device (MD) and 2) the service latency for the MAR are derived and investigated. Considering a tradeoff between energy and service latency in terms of the cache size in MAR with MEC, we design a theoretical framework for the mobile cache management of the MD to optimize the mobile cache size as well as transmission power while guaranteeing the required service latency. From the numerical experiments, we evaluate the performance of the proposed scheme and demonstrate insights regarding the optimization of the performance of MEC-assisted MAR services with a mobile cache.
- Published
- 2021
47. Decentralized Coded Caching for Shared Caches
- Author
-
Anoop Thomas and Monolina Dutta
- Subjects
Scheme (programming language) ,Hardware_MEMORYSTRUCTURES ,business.industry ,Computer science ,CPU cache ,020206 networking & telecommunications ,02 engineering and technology ,Variance (accounting) ,Computer Science Applications ,Large networks ,Modeling and Simulation ,Encoding (memory) ,Server ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,Electrical and Electronic Engineering ,business ,computer ,Computer network ,computer.programming_language ,Coding (social sciences) - Abstract
The demands of the clients in the client-server framework exhibit temporal variance, leading to congestion in the network at random intervals. To alleviate this problem, popular data is loaded into cache memories scattered across the network. In the conventional caching framework, each user has an associated cache and cache loading is centrally coordinated. For large networks, a more practical approach is to make the loading of the caches decentralized. This letter considers the shared caching problem, in which each cache can serve multiple clients. A new and optimal delivery scheme is proposed for the decentralized shared caching problem. The delivery scheme is shown to be optimal among all linear schemes, using techniques from index coding. It is shown that the rate achieved by the proposed scheme is comparable to that of the existing scheme, which uses centralized prefetching.
- Published
- 2021
48. Understanding the Insecurity of Processor Caches Due to Cache Timing-Based Vulnerabilities
- Author
-
Wenjie Xiong, Shuwen Deng, and Jakub Szefer
- Subjects
Program testing ,Hardware_MEMORYSTRUCTURES ,Computer Networks and Communications ,business.industry ,CPU cache ,Computer science ,05 social sciences ,Cryptography ,Coherence (statistics) ,01 natural sciences ,Data modeling ,ComputingMilieux_MANAGEMENTOFCOMPUTINGANDINFORMATIONSYSTEMS ,010104 statistics & probability ,0502 economics and business ,Test suite ,Cache ,0101 mathematics ,Electrical and Electronic Engineering ,business ,Law ,050205 econometrics ,Computer network - Abstract
This article discusses a recently developed test suite for checking timing-based vulnerabilities in processor caches, which has revealed the insecurity of today's processor caches. The susceptibility of caches to these vulnerabilities calls for more research on secure processor caches.
- Published
- 2021
49. A Compute Cache System for Signal Processing Applications
- Author
-
Joao Vieira, Pedro Tomás, Gabriel Falcao, and Nuno Roma
- Subjects
Signal processing ,Computer science ,CPU cache ,business.industry ,Distributed computing ,Big data ,020206 networking & telecommunications ,Context (language use) ,02 engineering and technology ,Bottleneck ,Theoretical Computer Science ,Hardware and Architecture ,Control and Systems Engineering ,Modeling and Simulation ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Cache ,Central processing unit ,Cluster analysis ,business ,Information Systems - Abstract
Nowadays, processing systems are constrained by the low efficiency of their memory subsystems. Although memories have evolved into faster and more efficient devices through the years, they are still unable to keep up with the computational power offered by processors, i.e., to feed the processors with the data they require at the rate it is consumed. Consequently, with the advent of Big Data, the need to fetch large amounts of data from memory became the most prominent performance bottleneck. Naturally, several approaches seeking to mitigate this problem have arisen through the years, such as application-specific accelerators and Near Data Processing (NDP) solutions. However, none has offered a satisfactory general-purpose solution without imposing rather limiting constraints; for instance, NDP solutions often require the programmer to have low-level knowledge of how data is physically stored in memory. In this paper, we propose an alternative mechanism that operates at the cache level, leveraging both proximity to the data and the parallelism enabled by accessing an entire cache line per cycle. We detail the internal architecture of the Cache Compute System (CCS) and demonstrate its integration with a conventional high-performance ARM Cortex-A53 Central Processing Unit (CPU). Furthermore, we assess the performance benefits of the novel CCS using an extensive set of microbenchmarks as well as six kernels widely used in the context of Convolutional Neural Networks (CNNs) and clustering algorithms. Results show that the CCS provides performance improvements ranging from 3.9× to 40.6× across the six tested kernels.
- Published
- 2021
50. The Root Tile Design for Level 1 Cache for Non Uniform Architecture
- Author
-
Manjudevi and Suma Sannamani
- Subjects
Root (linguistics) ,Hardware_MEMORYSTRUCTURES ,CPU cache ,Computer science ,General Mathematics ,Multiprocessing ,Parallel computing ,Education ,Computational Mathematics ,Computational Theory and Mathematics ,Search algorithm ,visual_art ,visual_art.visual_art_medium ,Cache ,Tile ,Latency (engineering) ,Architecture - Abstract
Non-Uniform Cache Architecture (NUCA) has become a solution to the wire-delay problem, in which wire delays increase on-chip latency in multiprocessor systems. In a NUCA design, the cache is divided into tiles, and each tile is accessed with a different latency, hence the term non-uniform. Locating data requires a search algorithm across the tiled architecture. This paper presents the design of root tiles, which accept requests from the processor and forward them to the child cache tiles. Xilinx simulation tools are used to analyze the performance of the design.
- Published
- 2021