1,060 results for "Locality of reference"
Search Results
2. Frozen Cache: Mitigating Filter Effect and Redundancy for Network of Caches
- Author
-
Saeid Montazeri Shahtouri, Mostafa Rezazad, and Richard T. B. Ma
- Subjects
Frozen-cache, network of caches, information centric networking, filter effect, locality of reference, coordinated cache scheme, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Information-Centric Networking (ICN) architecture leverages the network-of-caches idea to bring content closer to consumers, ultimately reducing the load on content servers and preventing unnecessary packet re-transmissions. Nevertheless, the performance of existing cache management schemes, mainly developed for a single cache, is inadequate for a network of caches. Many factors, such as data dependencies, data redundancy, the limited size of caches, and poor replacement policies, negatively impact a network of caches. Besides, traffic correlation among different caches on the network influences the performance of the network of caches. One of the essential correlations is the edge filtering effect, which becomes even more severe in the presence of data redundancy. The cache filtering effect happens when all arriving requests inspect the first cache for data, so the subsequent caches in the network receive only those requests that could not find data (cache misses) at the edge cache. In this paper, we propose Frozen-cache to mitigate the filtering effect. This policy repeatedly freezes content in a cache to allow subsequent caches to receive popular content. A lightweight coordinated scheme is incorporated with the Frozen-cache policy to cope with the data redundancy problem. In our experiments on realistic scenarios, the Frozen-cache idea substantially outperforms state-of-the-art caching schemes; depending on the network setup, this superiority varies from 25% to 700%.
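A minimal single-node sketch of the freezing idea (the class name, freeze/thaw periods, and the LRU baseline are illustrative assumptions, not details from the paper): while frozen, the cache still serves hits but stops admitting missed objects, so requests for popular content it never cached keep flowing to the next cache level.

```cpp
#include <list>
#include <string>
#include <unordered_map>

// Illustrative sketch of a "freezing" LRU cache: while frozen, lookups still
// hit, but missed objects are NOT admitted, so the miss stream seen by the
// next cache level retains the popular objects it would otherwise never see.
class FreezingLruCache {
public:
    FreezingLruCache(size_t capacity, size_t freezePeriod, size_t thawPeriod)
        : capacity_(capacity), freezePeriod_(freezePeriod), thawPeriod_(thawPeriod) {}

    // Returns true on hit. On miss, the caller forwards the request upstream.
    bool access(const std::string& key) {
        tick();
        auto it = index_.find(key);
        if (it != index_.end()) {                 // hit: refresh LRU position
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        if (!frozen_) admit(key);                 // admit only while thawed
        return false;
    }

private:
    void admit(const std::string& key) {
        if (lru_.size() == capacity_) {           // evict the LRU victim
            index_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(key);
        index_[key] = lru_.begin();
    }
    void tick() {                                 // alternate thaw/freeze windows
        size_t cycle = thawPeriod_ + freezePeriod_;
        frozen_ = (clock_++ % cycle) >= thawPeriod_;
    }

    size_t capacity_, freezePeriod_, thawPeriod_, clock_ = 0;
    bool frozen_ = false;
    std::list<std::string> lru_;
    std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```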
- Published
- 2021
- Full Text
- View/download PDF
3. CuWide: Towards Efficient Flow-Based Training for Sparse Wide Models on GPUs
- Author
-
Lele Yu, Xupeng Miao, Zhi Yang, Lingxiao Ma, Jiawei Jiang, Bin Cui, and Yingxia Shao
- Subjects
Schema (genetic algorithms), High memory, Computational Theory and Mathematics, Memory hierarchy, Computer science, Property (programming), Computation, Bandwidth (signal processing), Locality of reference, Parallel computing, Performance improvement, Computer Science Applications, Information Systems
- Abstract
Wide models such as generalized linear models and factorization-based models have been extensively used in various predictive applications, e.g., recommendation systems. Because these models are memory-bound, performance improvements on CPUs are reaching their limit. GPUs offer many computation units and high memory bandwidth, making them a promising platform for training machine learning models. However, GPU training of wide models is far from optimal due to their sparsity and irregularity; existing GPU-based wide models are even slower than their CPU counterparts. The classical training schema of wide models is not optimized for the GPU architecture and suffers from a large number of random memory accesses and redundant reads/writes of intermediate values. In this paper, we propose an efficient GPU training framework for large-scale wide models, named cuWide. To fully benefit from the memory hierarchy of the GPU, cuWide applies a new flow-based schema for training, which leverages the spatial and temporal locality of wide models to drastically reduce the amount of communication with GPU global memory. We show that cuWide can be more than 20× faster than state-of-the-art GPU and multi-core CPU solutions.
- Published
- 2022
- Full Text
- View/download PDF
4. A new temporal locality-based workload prediction approach for SaaS services in a cloud environment
- Author
-
Tarek Hamrouni and Wiem Matoussi
- Subjects
Service (systems architecture), General Computer Science, business.industry, Computer science, Software as a service, Quality of service, Distributed computing, 020206 networking & telecommunications, Context (language use), Workload, Cloud computing, 02 engineering and technology, Capacity planning, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, 020201 artificial intelligence & image processing, business
- Abstract
As the paradigm shift toward Software as a Service (SaaS) continues to gain the interest of companies and the scientific community, performance must be optimal. Indeed, cloud providers must deliver an optimal quality of service (QoS) to their users in order to survive in such a competitive cloud market. Workload forecasting techniques have been proposed to improve capacity planning, ensure efficient management of resources and, hence, maintain SLA contracts with end users. In this context, we propose a new approach to predict the number of requests arriving at a SaaS service in order to prepare the virtualized resources necessary to respond to user requests. The method is designed to achieve a twofold benefit: precise forecast results with optimized response time. In this regard, we control the computation time by dynamically adjusting the size of the sliding window over the recent history to be analyzed, since algorithmic complexity grows with the size of the prediction model's input. The prediction is then established based on the temporal locality principle and the dynamic assignment of weights to different data points in recent history. Moreover, the proposed method can be extended to cover other prediction use cases. Experiments were carried out to assess the performance of the proposed method using two real workload traces, against state-of-the-art methods. The proposed method offers a compromise between execution time and prediction accuracy.
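As a rough illustration of the two mechanisms the abstract names, recency-weighted prediction and a dynamically sized sliding window, here is a toy sketch; the decay factor, error tolerance, and resizing rule are invented for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <deque>

// Toy predictor in the spirit of the abstract: a sliding window over recent
// request counts, with recency-biased weights (temporal locality) and a
// window that shrinks when recent prediction error is low, capping cost.
class WorkloadPredictor {
public:
    explicit WorkloadPredictor(size_t maxWindow) : maxWindow_(maxWindow) {}

    double predictNext() const {
        double num = 0.0, den = 0.0, w = 1.0;
        // the newest observation gets the largest weight (decay per step back)
        for (auto it = window_.rbegin(); it != window_.rend(); ++it) {
            num += w * *it;
            den += w;
            w *= kDecay;
        }
        return den > 0.0 ? num / den : 0.0;
    }

    void observe(double actual) {
        double err = window_.empty() ? 0.0 : std::fabs(actual - predictNext());
        window_.push_back(actual);
        // illustrative dynamic sizing: low error -> shorter window, less work
        size_t target = err < kErrTol ? maxWindow_ / 2 : maxWindow_;
        while (window_.size() > std::max<size_t>(target, 1)) window_.pop_front();
    }

private:
    static constexpr double kDecay = 0.8;   // assumed recency decay
    static constexpr double kErrTol = 5.0;  // assumed error tolerance
    size_t maxWindow_;
    std::deque<double> window_;
};
```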
- Published
- 2022
- Full Text
- View/download PDF
5. E3X: Encrypt-Everything-Everywhere ISA eXtensions for Private Computation
- Author
-
Michail Maniatakos, Nektarios Georgios Tsoutsos, Eduardo Chielle, and Oleg Mazonka
- Subjects
Information privacy, business.industry, Computer science, OpenRISC, computer.software_genre, Encryption, Microarchitecture, Public-key cryptography, Operating system, Secure multi-party computation, Locality of reference, State (computer science), Electrical and Electronic Engineering, business, computer
- Abstract
The rapid increase of recent privacy attacks has significantly eroded users' trust. A root cause of these problems is that modern computer architectures have always been designed for performance, while security protections are traditionally addressed reactively. Practical security protections, such as Intel SGX, rely on processing unencrypted data in the architectural state, which leaves them exposed to software attacks (e.g., SGXpectre). This work revisits the traditional computation stack and introduces a novel computation paradigm in which data is never decrypted in the architectural state. In our architecture, data is protected with symmetric or asymmetric encryption and the programmer manipulates it directly in the encrypted domain. To increase performance, we exploit data locality by introducing decryption caches in the microarchitectural state. Our proposal addresses all abstraction levels in the computation stack: from microarchitecture to library support for high-level programming. The proposed architecture is instantiated through new assembly instructions, registers, and functional units operating on large integers. In our evaluation, we extend the OpenRISC 1000 architecture and develop open-source libraries for C++. As a case study, we employ data-oblivious benchmarks and observe that, for benchmarks with high temporal locality, our architecture can achieve performance comparable to processing unencrypted data.
- Published
- 2022
- Full Text
- View/download PDF
6. CBR and Neural Networks Based Technique for Predictive Prefetching
- Author
-
Sarwar, Sohail, Ul-Qayyum, Zia, Malik, Owais Ahmed, Sidorov, Grigori, editor, Hernández Aguirre, Arturo, editor, and Reyes García, Carlos Alberto, editor
- Published
- 2010
- Full Text
- View/download PDF
7. Impact of Wrapped System Call Mechanism on Commodity Processors
- Author
-
Yamada, Satoshi, Kusakabe, Shigeru, Taniguchi, Hideo, Filipe, Joaquim, editor, Shishkov, Boris, editor, and Helfert, Markus, editor
- Published
- 2008
- Full Text
- View/download PDF
8. Locality of Reference in an Hierarchy of Web Caches
- Author
-
Duarte, Fernando, Benevenuto, Fabrício, Almeida, Virgílio, Almeida, Jussara, Boavida, Fernando, editor, Plagemann, Thomas, editor, Stiller, Burkhard, editor, Westphal, Cedric, editor, and Monteiro, Edmundo, editor
- Published
- 2006
- Full Text
- View/download PDF
9. Automatic Sublining for Efficient Sparse Memory Accesses
- Author
-
Ibrahim Hur, Wim Heirman, Stijn Eyerman, and Kristof Du Bois
- Subjects
010302 applied physics, Hardware_MEMORYSTRUCTURES, Source code, Dense graph, Computer science, media_common.quotation_subject, Locality, Memory bandwidth, 02 engineering and technology, Parallel computing, Data structure, 01 natural sciences, 020202 computer hardware & architecture, Hardware and Architecture, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Bandwidth (computing), Locality of reference, Cache, Software, Information Systems, media_common
- Abstract
Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity make caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (a subline access), can solve these issues. Deciding which accesses to handle as sparse accesses and which as regular cached accesses is a challenging task, with a large potential impact on performance. Not only does treating sparse accesses as regular accesses reduce performance; failing to cache accesses that do have locality also hurts performance by significantly increasing their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal. We propose the Instruction Spatial Locality Estimator (ISLE), a hardware detector that finds instructions that access isolated words in a sea of unused data. These sparse accesses are dynamically converted into uncached subline accesses, while regular accesses remain cached. ISLE does not require modifying source code or binaries, and adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads, and show that ISLE outperforms no sublining, manual sublining, and prior work on detecting sparse accesses.
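A simplified software model of the detection idea (ISLE itself is a hardware structure; the table layout, averaging weights, and sparsity threshold below are illustrative guesses): track, per load PC, how many distinct 8 B words of each 64 B line the instruction touches, and flag PCs that use roughly one word per line as sparse.

```cpp
#include <bit>
#include <cstdint>
#include <unordered_map>

// Simplified model of a per-instruction spatial-locality detector: for each
// load PC, count how many distinct 8 B words of each 64 B line it touches.
// PCs that keep using ~1 word per line are flagged as sparse and could be
// served with uncached subline (8 B) fetches instead of full-line fills.
class SpatialLocalityEstimator {
public:
    void recordLoad(uint64_t pc, uint64_t addr) {
        Entry& e = table_[pc];
        uint64_t line = addr >> 6;                    // 64 B line address
        uint64_t wordBit = 1ull << ((addr >> 3) & 7); // which 8 B word in line
        if (line != e.lastLine) {                     // line changed: score it
            e.avgWords = 0.875 * e.avgWords
                       + 0.125 * std::popcount(e.usedWords);
            e.lastLine = line;
            e.usedWords = 0;
        }
        e.usedWords |= wordBit;
    }

    // Sparse if, on average, well under 2 of the 8 words per line are used.
    bool isSparse(uint64_t pc) const {
        auto it = table_.find(pc);
        return it != table_.end() && it->second.avgWords < 1.5;
    }

private:
    struct Entry {
        uint64_t lastLine = ~0ull;
        uint64_t usedWords = 0;
        double avgWords = 8.0;  // start optimistic: assume dense accesses
    };
    std::unordered_map<uint64_t, Entry> table_;
};
```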
- Published
- 2021
- Full Text
- View/download PDF
10. Intelligent forecast engine for short-term wind speed prediction based on stacked long short-term memory
- Author
-
Muhammad Javaid Iqbal, Farah Shahid, and Aneela Zameer
- Subjects
0209 industrial biotechnology, Mean squared error, Computer science, Computational intelligence, 02 engineering and technology, computer.software_genre, Wind speed, Random forest, Term (time), Support vector machine, 020901 industrial engineering & automation, Artificial Intelligence, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, 020201 artificial intelligence & image processing, Data mining, computer, Software
- Abstract
This paper presents machine learning techniques for short-term wind speed prediction, exploiting the computational intelligence capabilities of random forest regression (RFR), support vector regression (SVR), radial basis function neural networks (RBFNN), and long short-term memory (LSTM) on various wind farm datasets located in Pakistan. Initially, predictions are obtained by employing the baseline regressors RFR and RBFNN in terms of error indices. Later, a stacked LSTM forecast engine (SLFE) is proposed to improve the efficacy, accuracy, and prediction capability of the forecast engine by using leaky rectified linear units (Leaky ReLU) as the kernel function. SLFE has been implemented and tested on datasets acquired from four different wind farms having a temporal locality of 10 min for short-term wind speed prediction. Furthermore, an ensemble of stacked LSTM with RFR, SVR, and RBFNN has also been developed for comparison. The strength of the SLFE is evaluated in terms of various performance measures: mean absolute error (MAE), root mean squared error (RMSE), R2 score, and explained variance score (EVS). The proposed model achieves MAE, RMSE, R2 score, and EVS of 0.043, 0.065, 0.813, and 0.814, respectively, demonstrating the worth of the forecast engine. Additionally, a statistical one-way ANOVA over multiple independent executions of the proposed algorithm demonstrates the robustness, efficiency, and reliability of the model.
- Published
- 2021
- Full Text
- View/download PDF
11. Analysis of Worst-Case Data Dependent Temporal Approximation in Floating Point Units
- Author
-
Chandan Kumar Jha, Ishita Doshi, and Joycee Mekie
- Subjects
Discrete mathematics, Floating point, 020208 electrical & electronic engineering, Dot product, 02 engineering and technology, 020202 computer hardware & architecture, Normal distribution, symbols.namesake, Principal component analysis, 0202 electrical engineering, electronic engineering, information engineering, symbols, Locality of reference, Probability distribution, Multiplier (economics), Pareto distribution, Electrical and Electronic Engineering, Mathematics
- Abstract
In this brief, we study the impact of input data distribution on temporal approximation (TA) in floating point units (FPUs). In TA, rather than performing computations, previously computed results are reused as output to introduce approximation. Thus, temporal locality of inputs plays an important role in TA. We show that the efficacy of TA is strongly dependent on the input data distribution. While prior works use a uniform random input data distribution to perform worst-case analysis of approximate FPUs, it fails to capture the worst case for TA. We show that, contrary to conventional wisdom, input data sampled from a normal distribution with mean ($\mu$) equal to zero captures the worst case for TA irrespective of the FP operation. We evaluated TA in FP multipliers and FP dividers by studying four different data distributions: a) normal distribution ($\mu = 0$); b) normal distribution ($\mu = 1$); c) uniform distribution; d) power-law distribution. The inputs generated by sampling from these distributions were applied to five algorithms: dot product, principal component analysis, PageRank, vector normalization, and dot divide. On average, the normal distribution ($\mu = 0$) is more efficient at capturing the worst case than the widely used uniform distribution, by 8% and 12% for FP multipliers and FP dividers, respectively. We also highlight that prior knowledge of the input data distribution can be exploited to reduce the power-delay product.
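A minimal sketch of temporal approximation for a multiplier, with an invented operand tolerance: if the new operands are close to the previous ones, the previous product is reused instead of being recomputed.

```cpp
#include <cmath>

// Minimal model of temporal approximation (TA) for an FP multiplier: if the
// current operands are close to the previous ones, skip the computation and
// reuse the previous product. With inputs drawn from N(0, 1), successive
// operands often differ in sign and magnitude, which is why that distribution
// stresses TA harder than uniform inputs.
class TaMultiplier {
public:
    explicit TaMultiplier(float tol) : tol_(tol) {}

    float mul(float a, float b) {
        if (std::fabs(a - lastA_) <= tol_ && std::fabs(b - lastB_) <= tol_) {
            ++reuses_;
            return lastResult_;          // approximate: reuse the prior result
        }
        lastA_ = a;
        lastB_ = b;
        lastResult_ = a * b;             // exact path
        return lastResult_;
    }
    long reuses() const { return reuses_; }

private:
    float tol_;
    float lastA_ = NAN, lastB_ = NAN;   // NAN compares false -> first call exact
    float lastResult_ = 0.0f;
    long reuses_ = 0;
};
```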
- Published
- 2021
- Full Text
- View/download PDF
13. Characterizing File Accesses in Android Applications and Caching Implications
- Author
-
Soojung Lim and Hyokyung Bahn
- Subjects
Scheme (programming language), non-volatile memory, General Computer Science, Computer science, file access, General Engineering, computer.software_genre, Disk buffer, Write buffer, smartphone, TK1-9971, Non-volatile memory, buffer cache, Android, Operating system, Locality of reference, General Materials Science, Electrical engineering. Electronics. Nuclear engineering, Electrical and Electronic Engineering, Android (operating system), computer, application, Block (data storage), computer.programming_language
- Abstract
In this paper, we explore Android applications' file access characteristics, and find that smartphone file accesses differ from traditional desktop applications in the following aspects. 1) There exists a limited number of hot blocks, which are accessed consistently during the entire execution of an application. 2) Block accesses in Android are highly biased: the top 20% of blocks account for 80% of total accesses. 3) Hot blocks of the top 100 rankings are mostly involved in SQLite. 4) Unlike desktop applications, file accesses in Android applications are write-intensive. 5) In predicting future file accesses in Android applications, frequency is a better estimator than temporal locality. 6) The effect of the traditional buffer cache is limited in Android, as file I/O in Android involves many synchronous writes, which incur immediate storage flushing. Based on these analyses, this paper presents the implications for buffer cache management in Android. Specifically, we add a small non-volatile write buffer and present how this write buffer can be managed efficiently. Experimental results show that the proposed scheme reduces storage write traffic by an average of 21.7% and up to 48.1% compared to the conventional buffer cache system.
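A sketch of the paper's implication under stated assumptions (the buffer layout and eviction rule below are illustrative, not the authors' scheme): a small non-volatile write buffer that absorbs synchronous writes and evicts by write frequency rather than recency, following finding 5 above.

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative non-volatile write buffer: absorbs block writes and, since
// frequency predicts Android file accesses better than recency, evicts the
// block with the LOWEST write count when full. Sizes/types are assumptions.
class NvWriteBuffer {
public:
    explicit NvWriteBuffer(size_t capacity) : capacity_(capacity) {}

    // Returns the block number flushed to storage, or -1 if absorbed.
    int64_t write(int64_t block) {
        auto it = counts_.find(block);
        if (it != counts_.end()) { ++it->second; return -1; }  // buffer hit
        if (counts_.size() == capacity_) {
            auto victim = counts_.begin();
            for (auto i = counts_.begin(); i != counts_.end(); ++i)
                if (i->second < victim->second) victim = i;    // least frequent
            int64_t flushed = victim->first;
            counts_.erase(victim);
            counts_[block] = 1;
            return flushed;                                    // goes to flash
        }
        counts_[block] = 1;
        return -1;
    }

private:
    size_t capacity_;
    std::unordered_map<int64_t, uint32_t> counts_;  // block -> write count
};
```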
- Published
- 2021
14. Characterization of Android Memory References and Implication to Hybrid Memory Management
- Author
-
Hyokyung Bahn and Soyoon Lee
- Subjects
Scheme (programming language), General Computer Science, Computer science, 02 engineering and technology, memory reference, computer.software_genre, smartphone, write operation, Android, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, General Materials Science, Android (operating system), computer.programming_language, Random access memory, General Engineering, 020202 computer hardware & architecture, TK1-9971, Non-volatile memory, Memory management, Operating system, NVM, 020201 artificial intelligence & image processing, Electrical engineering. Electronics. Nuclear engineering, computer, application, Dram
- Abstract
In this article, we analyze Android applications' memory reference behaviors, and observe that smartphone memory accesses differ from traditional computer systems with respect to the following five aspects: 1) A limited number of hot pages account for a majority of memory writes, and these hot pages have similar logical addresses regardless of application type; 2) These hot pages belong to shared library, linker, and stack regions; 3) The memory access behaviors of hot pages do not change significantly over time, even after applications finish launching; 4) The skewness of memory write accesses in Android is far stronger than that of desktop systems; 5) In predicting the re-reference likelihood of hot pages, temporal locality is better than reference frequency. Based on these observations, we present a new smartphone memory management scheme for DRAM-NVM hybrid memory. Adopting NVM is effective for power-saving in smartphones, but NVM is weak at write operations. Thus, we aim to identify write-intensive pages and place them in DRAM. Unlike previous studies, we prevent migration of pages between DRAM and NVM, which eliminates the unnecessary NVM write traffic that accounts for 32-42% of total write traffic. By judiciously managing the admission of hot pages into DRAM, our scheme reduces write traffic to NVM by 42% on average without performance degradation.
- Published
- 2021
15. Data Reuse for Accelerated Approximate Warps
- Author
-
Mohsen Imani, Hamid Nejatollahi, Tajana Rosing, Daniel Peroni, and Nikil Dutt
- Subjects
Artificial neural network, Computer science, 02 engineering and technology, Thread (computing), Parallel computing, Content-addressable memory, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Instruction set, Lookup table, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, Electrical and Electronic Engineering, General-purpose computing on graphics processing units, Software
- Abstract
Many data-driven applications, including computer vision, machine learning, speech recognition, and medical diagnostics, show tolerance to computation error. These applications are often accelerated on GPUs, but the performance improvements come at high energy cost. In this article, we present DRAAW, an approximate computing technique capable of accelerating GPGPU applications at warp level. In GPUs, warps are groups of threads issued together across multiple cores. The slowest thread dictates the pace of the warp, so DRAAW identifies these bottlenecks and avoids them during approximation. We alleviate computation costs by using an approximate lookup table which tracks recent operations and reuses them to exploit temporal locality within applications. To improve neural network performance, we propose neuron-aware approximation, a technique which profiles operations within network layers and automatically configures DRAAW to ensure computations with more impact on output accuracy are subject to less approximation. We evaluate our design by placing DRAAW within each core of an Nvidia Kepler-architecture Titan. DRAAW improves throughput by up to 2.8× and improves energy-delay product (EDP) by 5.6× for six GPGPU applications while maintaining less than 5% output error. We show neuron-aware approximation accelerates the inference of six neural networks by 2.9× and improves EDP by 6.2× with less than 1% impact on prediction accuracy.
- Published
- 2020
- Full Text
- View/download PDF
16. Traffic-Aware Erasure-Coded Archival Schemes for In-Memory Stores
- Author
-
Jianzhong Huang, Qiang Cao, Bin Xu, and Xiao Qin
- Subjects
020203 distributed computing, Hardware_MEMORYSTRUCTURES, Distributed database, business.industry, Computer science, Replica, Locality, Fault tolerance, 02 engineering and technology, Load balancing (computing), Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Redundancy (engineering), Locality of reference, Erasure, Erasure code, business, Computer network
- Abstract
Redundancy schemes are introduced to in-memory stores to provide fault tolerance. To achieve a good trade-off between access performance and memory efficiency, it is appropriate to adopt replication and erasure coding to keep popular and unpopular data, respectively. Within such a hybrid-redundancy in-memory store, an issue of redundancy transition from replication to erasure coding (a.k.a. erasure-coded archival) should be addressed for unpopular in-memory datasets, since caching workloads exhibit long-tail distributions and most in-memory data are unpopular. If data replicas are distributed across nodes in randomly-selected racks, then subsequent data-block-replica retrieval for erasure-coded archival will create cross-rack traffic, and final parity-block relocation will cause extra cross-rack communications. In this article, we propose an encoding-oriented replica placement policy - ERP - by incorporating an interleaved declustering mechanism. We design two traffic-aware erasure-coded archival schemes - TEA-TL and TEA-SL - for ERP-powered in-memory stores by taking into account temporal locality and spatial locality, respectively. With ERP in place, both TEA-TL and TEA-SL schemes embrace the following three salient features: (i) they alleviate cross-rack traffic raised by retrieving required data-block replicas; (ii) they improve rack-level load balancing by distributing replicas via a load-aware primary-rack-selection approach; and (iii) they mitigate block-relocation operations launched to sustain rack-level and node-level fault-tolerance. We conduct quantitative performance evaluations using the YCSB benchmark. The empirical results show that both TEA-TL and TEA-SL schemes not only bring forth lower cross-rack traffic than the four candidate encoding schemes, but also exhibit superb archival-throughput and rack-level-balancing performance. In particular, within a group of comparative tests using the baseline configurations, TEA-TL and TEA-SL accelerate archival throughput by at least 36.3 and 70.8 percent, respectively; both TEA-TL and TEA-SL schemes improve rack-level load-balancing by a factor of more than 1.45× relative to the four candidate encoding schemes.
- Published
- 2020
- Full Text
- View/download PDF
17. Miss Penalty Aware Cache Replacement for Hybrid Memory Systems
- Author
-
Hai Jin, Di Chen, Rentong Guo, Yu Zhang, Haikun Liu, and Xiaofei Liao
- Subjects
Random access memory, Hardware_MEMORYSTRUCTURES, Computer science, business.industry, 02 engineering and technology, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Non-volatile memory, Memory management, Embedded system, Scalability, Memory architecture, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, Cache, Electrical and Electronic Engineering, business, Average memory access time, Software, Dram
- Abstract
Current DRAM-based memory systems face scalability challenges in terms of memory density, energy consumption, and monetary cost. Hybrid memory architectures composed of emerging non-volatile memory (NVM) and DRAM are a promising approach to large-capacity and energy-efficient main memory. However, hybrid memory systems pose a new challenge for on-chip cache management due to the asymmetric penalty of memory accesses to DRAM and NVM on cache misses. Cache hit rate is no longer an effective metric for evaluating memory access performance in hybrid memory systems, and current cache replacement policies that aim to improve the cache hit rate are not efficient either. In this article, we take into account the asymmetry of the cache miss penalty on DRAM and NVM, and advocate a more general metric, average memory access time (AMAT), to evaluate the performance of hybrid memories. We propose a miss-penalty-aware LRU-based cache replacement policy (MALRU) for hybrid memory systems. MALRU is aware of the source (DRAM or NVM) of missing blocks and preserves high-latency NVM blocks, as well as low-latency DRAM blocks with good temporal locality, in the last-level cache. The experimental results show that MALRU can improve system performance by up to 22.8% and 13.1% compared to LRU and the state-of-the-art hybrid-memory-aware cache partitioning technique, respectively.
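The abstract does not spell out MALRU's exact mechanism, so the following is only a plausible sketch of a miss-penalty-aware LRU: on eviction, a small window at the LRU end is scanned and a cheap-to-refetch DRAM block is preferred over an expensive NVM block; the window size is an invented knob.

```cpp
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

// Sketch of a miss-penalty-aware LRU for hybrid DRAM/NVM memory: on eviction,
// scan a small window at the LRU end and prefer evicting a cheap-to-refetch
// DRAM block over an expensive NVM block, keeping high-penalty blocks cached.
class PenaltyAwareLru {
    struct Block { uint64_t addr; bool fromNvm; };

public:
    PenaltyAwareLru(size_t capacity, size_t window)
        : capacity_(capacity), window_(window) {}

    void access(uint64_t addr, bool fromNvm) {
        auto it = index_.find(addr);
        if (it != index_.end()) {                       // hit: move to MRU
            blocks_.splice(blocks_.begin(), blocks_, it->second);
            return;
        }
        if (blocks_.size() == capacity_) evict();
        blocks_.push_front({addr, fromNvm});
        index_[addr] = blocks_.begin();
    }

private:
    void evict() {
        // default victim: the true LRU block; prefer a DRAM block in window_
        auto victim = std::prev(blocks_.end());
        auto it = victim;
        for (size_t i = 0; i < window_ && it != blocks_.begin(); ++i, --it)
            if (!it->fromNvm) { victim = it; break; }
        index_.erase(victim->addr);
        blocks_.erase(victim);
    }

    size_t capacity_, window_;
    std::list<Block> blocks_;                           // MRU at the front
    std::unordered_map<uint64_t, std::list<Block>::iterator> index_;
};
```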
- Published
- 2020
- Full Text
- View/download PDF
18. Optimization of Intercache Traffic Entanglement in Tagless Caches With Tiling Opportunities
- Author
-
Sumitha George, Vijaykrishnan Narayanan, Hariram Thirucherai Govindarajan, John Sampson, Jagadish B. Kotra, Madhu Mutyam, S. R. Swamy Saranam Chongala, and Mahmut Kandemir
- Subjects
Random access memory, Hardware_MEMORYSTRUCTURES, Computer science, 02 engineering and technology, Quantum entanglement, Parallel computing, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Non-volatile memory, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, Cache, Electrical and Electronic Engineering, Latency (engineering), Software
- Abstract
So-called “tagless” caches have become common as a means to deal with the vast L4 last-level caches (LLCs) enabled by increasing device density, emerging memory technologies, and advanced integration capabilities (e.g., 3-D). Tagless schemes often result in inter-cache entanglement between the tagless cache (L4) and the cache (L3) stewarding its metadata. We explore new cache organization policies that mitigate overheads stemming from this inter-cache-level replacement entanglement. We incorporate support for explicit tiling shapes that can better match software access patterns to improve the spatial and temporal locality of large block allocations in many essential computational kernels. To address entanglement overheads and pathologies, we propose new replacement policies and energy-friendly mechanisms for tagless LLCs, such as restricted block caching (RBC) and victim tag buffer caching (VBC), to efficiently incorporate L4 eviction costs into L3 replacement decisions. We evaluate our schemes on a range of software-tiled linear algebra kernels. RBC and VBC reduce memory traffic by 83/4.4/67% and 69/35.5/76% for 8/32/64 MB L4s, respectively. In addition, RBC and VBC provide speedups of 16/0.3/0.6% and 15.7/1.8/0.8%, respectively, for systems with 8/32/64 MB L4, over a tagless cache with an LRU policy in the L3. We also show that matching the shape of each tagless region's hardware superblock allocation to the access order of the software tile improves latency by 13.4% over the baseline tagless cache, with memory traffic reductions of 51% over linear superblocks.
- Published
- 2020
- Full Text
- View/download PDF
19. Temporal locality-aware sampling for accurate triangle counting in real graph streams
- Author
-
Dongjin Lee, Kijung Shin, and Christos Faloutsos
- Subjects
Computer science, Sampling (statistics), 02 engineering and technology, Graph, Hardware and Architecture, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, 020201 artificial intelligence & image processing, Algorithm, Streaming algorithm, Time complexity, MathematicsofComputing_DISCRETEMATHEMATICS, Information Systems
- Abstract
If we cannot store all edges in a dynamic graph, which edges should we store to estimate the triangle count accurately? Counting triangles (i.e., cliques of size three) is a fundamental graph problem with many applications in social network analysis, web mining, anomaly detection, etc. Recently, much effort has been made to accurately estimate the counts of global triangles (i.e., all triangles) and local triangles (i.e., all triangles incident to each node) in large dynamic graphs, especially with limited space. Although existing algorithms use sampling techniques without considering temporal dependencies in edges, we observe temporal locality in the formation of triangles in real dynamic graphs. That is, future edges are more likely to form triangles with recent edges than with older edges. In this work, we propose a family of single-pass streaming algorithms called Waiting-Room Sampling (WRS) for estimating the counts of global and local triangles in a fully dynamic graph, where edges are inserted and deleted over time, within a fixed memory budget. WRS exploits the temporal locality by always storing the most recent edges, which future edges are more likely to form triangles with, in the waiting room, while it uses reservoir sampling and its variant for the remaining edges. Our theoretical and empirical analyses show that WRS is: (a) Fast and ‘any time’: runs in linear time, always maintaining and updating estimates, while the input graph evolves, (b) Effective: yields up to 47% smaller estimation error than its best competitors, and (c) Theoretically sound: gives unbiased estimates with small variances under the temporal locality.
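A compact sketch of the waiting-room idea under stated assumptions (the capacities, seeding, and plain insertion-only reservoir variant are illustrative; the paper's deletion handling and the triangle counting itself are omitted): the newest edges always sit in a FIFO waiting room, and only edges aged out of it compete for reservoir slots.

```cpp
#include <deque>
#include <random>
#include <utility>
#include <vector>

// Sketch of Waiting-Room Sampling for an edge stream: the most recent edges
// are always kept in a FIFO "waiting room"; an edge evicted from it becomes
// a candidate for a classic reservoir sample over the older edges.
struct WrsSampler {
    using Edge = std::pair<int, int>;

    WrsSampler(size_t roomCap, size_t reservoirCap)
        : roomCap_(roomCap), reservoirCap_(reservoirCap), rng_(42) {}

    void insert(const Edge& e) {
        room_.push_back(e);
        if (room_.size() <= roomCap_) return;
        Edge old = room_.front();            // the oldest edge leaves the room
        room_.pop_front();
        ++seen_;                             // #edges offered to the reservoir
        if (reservoir_.size() < reservoirCap_) {
            reservoir_.push_back(old);
        } else {
            std::uniform_int_distribution<size_t> pick(0, seen_ - 1);
            size_t j = pick(rng_);
            if (j < reservoirCap_) reservoir_[j] = old;  // standard reservoir step
        }
    }

    std::deque<Edge> room_;                  // always the newest edges
    std::vector<Edge> reservoir_;            // uniform sample of older edges
    size_t roomCap_, reservoirCap_, seen_ = 0;
    std::mt19937 rng_;
};
```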
- Published
- 2020
- Full Text
- View/download PDF
20. Page Reusability-Based Cache Partitioning for Multi-Core Systems
- Author
-
Yongseok Son, Heon-Young Yeom, and Jiwoong Park
- Subjects
Multi-core processor, Hardware_MEMORYSTRUCTURES, Computer science, Cache coloring, Linux kernel, 02 engineering and technology, Parallel computing, Cache pollution, 020202 computer hardware & architecture, Theoretical Computer Science, Computational Theory and Mathematics, Shared memory, Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, Cache, Page table, Software
- Abstract
Most modern multi-core processors provide a shared last-level cache (LLC) where data from all cores are placed to improve performance. However, this opens a new challenge for cache management, owing to cache pollution. With cache pollution, data with weak temporal locality can evict other data with strong temporal locality when both are mapped into the same cache set. In this article, we propose page reusability-based cache partitioning (PRCP) for multi-core systems to maximize cache utilization by minimizing cache pollution. To achieve this, PRCP divides pages into two groups: (1) highly-reused pages and (2) lowly-reused pages. The reusability of each page is collected online via periodic page table scans. PRCP then dynamically partitions the shared cache into two corresponding areas using a page-coloring technique. We have implemented PRCP in the Linux kernel and evaluated it using SPEC CPU2006 benchmarks. The results show that our scheme can achieve comparable performance to the optimal offline MRC-guided process-based cache partitioning scheme without a priori knowledge of workloads.
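The arithmetic behind page coloring, with an illustrative cache geometry (the sizes and the 3:1 split below are assumptions, not the paper's setup): the color of a physical frame is the slice of LLC set-index bits lying above the page offset, so disjoint color ranges map to disjoint cache regions.

```cpp
#include <cstdint>

// Page-coloring arithmetic behind schemes like PRCP. Geometry is an example:
// 4 KiB pages, 64 B lines, and an LLC with 8192 sets.
constexpr uint64_t kPageShift = 12;   // 4 KiB pages
constexpr uint64_t kLineShift = 6;    // 64 B cache lines
constexpr uint64_t kSetBits   = 13;   // log2(8192 LLC sets)

// Colors = the set-index bits that lie above the page offset. Here:
// set index occupies address bits [6, 19), page offset bits [0, 12),
// so bits [12, 19) of the address = low 7 bits of the frame number.
constexpr uint64_t kColorBits = kSetBits + kLineShift - kPageShift; // 7
constexpr uint64_t kNumColors = 1ull << kColorBits;                 // 128

inline uint64_t colorOfFrame(uint64_t physFrameNumber) {
    return physFrameNumber & (kNumColors - 1);
}

// Example policy split: colors [0, 96) are reserved for highly-reused pages,
// the remaining colors for lowly-reused pages, i.e., a 3:1 LLC partition.
inline bool frameForHotPages(uint64_t physFrameNumber) {
    return colorOfFrame(physFrameNumber) < (kNumColors * 3) / 4;
}
```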
- Published
- 2020
- Full Text
- View/download PDF
21. CACHE MEMORY LOCALITY OPTIMIZATION FOR IMPLEMENTATION OF COMPUTER VISION AND IMAGE PROCESSING ALGORITHMS
- Author
-
A Al-Marakeby
- Subjects
Computer science, CPU cache, Computer vision and image processing, Locality of reference, Computer vision algorithms, Image processing, Algorithm
- Abstract
One of the main problems in developing fast image processing and computer vision systems is memory speed. Memory speed represents the performance bottleneck due to the large gap between processor and memory speeds. Cache memory is very fast, but it is too small to store all required data and instructions. In this paper, image processing and computer vision algorithms are optimized to enhance performance by increasing cache memory utilization. This optimization increases spatial locality and temporal locality and improves system performance. The proposed optimization is applied to a set of image processing operations such as image intensity transformation, image filtering, geometric transformation, and CNN. Time analysis of the systems has shown a speed improvement of 30% to 70% compared with direct algorithm implementations.
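A representative cache-blocking sketch in the spirit of the paper's image-filtering optimization (the tile size and the 3x3 mean filter are illustrative choices, not the paper's exact kernels):

```cpp
#include <algorithm>
#include <vector>

// Cache-blocking sketch for a 3x3 mean filter: processing the image in small
// tiles keeps each tile's rows resident in cache across the stencil's
// vertical reuse (temporal locality), while the inner loop walks each row
// contiguously (spatial locality). Tile size is a tuning knob.
void meanFilterTiled(const std::vector<float>& src, std::vector<float>& dst,
                     int width, int height, int tile = 64) {
    for (int ty = 1; ty < height - 1; ty += tile)
        for (int tx = 1; tx < width - 1; tx += tile) {
            int yEnd = std::min(ty + tile, height - 1);
            int xEnd = std::min(tx + tile, width - 1);
            for (int y = ty; y < yEnd; ++y)
                for (int x = tx; x < xEnd; ++x) {
                    float sum = 0.0f;
                    for (int dy = -1; dy <= 1; ++dy)
                        for (int dx = -1; dx <= 1; ++dx)
                            sum += src[(y + dy) * width + (x + dx)];
                    dst[y * width + x] = sum / 9.0f;  // interior pixels only
                }
        }
}
```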
- Published
- 2020
- Full Text
- View/download PDF
22. Memory-Optimized Wavefront Parallelism on GPUs
- Author
-
Yuanzhe Li and Loren Schwiebert
- Subjects
010302 applied physics, Computer science, CPU cache, Concurrency, Locality, 02 engineering and technology, Parallel computing, 01 natural sciences, 020202 computer hardware & architecture, Theoretical Computer Science, Shared memory, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, Edit distance, Nested loop join, Massively parallel, Software, Information Systems
- Abstract
Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from their massively parallel computing resources. To achieve optimal performance, tiling has been introduced as a popular solution to achieve a balanced workload. However, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with an extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and also introduce an inter-block lock to minimize the overhead of each synchronization. In addition, shared memory is used in place of the L1 cache; its well-tailored mapping of operations improves both spatial and temporal locality. We evaluate the performance of our proposed technique on four different applications: sequence alignment, edit distance, summed-area table, and 2D-SOR. The performance results demonstrate that our method achieves speedups of up to six times over the previous best-known hyperplane tiling-based GPU implementation.
- Published
- 2020
- Full Text
- View/download PDF
23. An Energy-Efficient Low-Latency 3D-CNN Accelerator Leveraging Temporal Locality, Full Zero-Skipping, and Hierarchical Load Balance
- Author
-
Yifan He, Siyuan Qiu, Changchun Zhou, Hailong Jiao, and Min Liu
- Subjects
Speedup, Computer engineering, Computer science, Frame (networking), Locality of reference, Latency (engineering), Convolutional neural network, Dropout (neural networks), Efficient energy use, Microarchitecture
- Abstract
Three-dimensional convolutional neural network (3D-CNN) has demonstrated outstanding classification performance in video recognition compared to two-dimensional CNN (2D-CNN), since 3D-CNN not only learns the spatial features of each frame, but also learns the temporal features across all frames. However, 3D-CNN suffers from intensive computation and data movement. To solve these issues, an energy-efficient low-latency 3D-CNN accelerator is proposed. Temporal locality and small differential value dropout are used to increase the sparsity of activation. Furthermore, to fully utilize the sparsity of weight and activation, a full zero-skipping convolutional microarchitecture is proposed. A hierarchical load-balancing scheme is also introduced to improve resource utilization. With the proposed techniques, a 3D-CNN accelerator is designed in a 55-nm low-power CMOS technology, bringing in up to 9.89× speedup compared to the baseline implementation. Benchmarked with C3D, the proposed accelerator achieves an energy efficiency of 4.66 TOPS/W at 100 MHz and 1.08 V supply voltage.
- Published
- 2021
- Full Text
- View/download PDF
24. On list update with locality of reference.
- Author
-
Albers, Susanne and Lauer, Sonja
- Subjects
ONLINE algorithms, BENCHMARK problems (Computer science), MATHEMATICAL proofs, MATHEMATICAL bounds, DATA structures, SELF-organizing systems
- Abstract
We present a comprehensive study of the list update problem with locality of reference. More specifically, we present a combined theoretical and experimental study in which the theoretically proven and experimentally observed performance guarantees of algorithms match or nearly match. Firstly, we introduce a new model of locality of reference that closely captures the concept of runs. Using this model we develop refined theoretical analyses of popular list update algorithms. Secondly, we present an extensive experimental study in which we have tested the algorithms on traces from benchmark libraries. The theoretical and experimental bounds differ by just a few percent. Our new theoretical bounds are substantially lower than those provided by standard competitive analysis. It shows that the well-known Move-To-Front strategy exhibits the best performance. Its refined competitive ratio tends to 1 as the degree of locality increases. This confirms that Move-To-Front is the method of choice in practice.
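For reference, Move-To-Front itself is only a few lines; this sketch uses a linked list and reports the standard access-cost model (the 1-based position of the accessed item):

```cpp
#include <list>

// Move-To-Front, the strategy the study singles out: after a successful
// lookup, the accessed item is moved to the head of the list, so a run of
// requests to the same item (locality of reference) costs 1 per access
// after the first.
template <typename T>
class MtfList {
public:
    // Returns the access cost (1-based position), or -1 if absent.
    int access(const T& key) {
        int pos = 1;
        for (auto it = items_.begin(); it != items_.end(); ++it, ++pos)
            if (*it == key) {
                items_.splice(items_.begin(), items_, it);  // move to front
                return pos;
            }
        return -1;
    }
    void insert(const T& key) { items_.push_front(key); }

private:
    std::list<T> items_;
};
```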
- Published
- 2016
- Full Text
- View/download PDF
25. Exploiting user activeness for data retention in HPC systems
- Author
-
Wei Zhang, Yong Chen, Suren Byna, Sangkeun Lee, Hyogi Sim, and Sudharshan S. Vazhkudai
- Subjects
Reduction (complexity), Process (engineering), Storage resource management, business.industry, Computer science, Distributed computing, Data management, Locality of reference, Data retention, business, Purge
- Abstract
HPC systems typically rely on the fixed-lifetime (FLT) data retention strategy, which considers only the temporal locality of data accesses to parallel file systems. However, our extensive analysis of leadership-class HPC system traces suggests that the FLT approach often fails to capture the dynamics of users' behavior and leads to undesired data purges. In this study, we propose an activeness-based data retention (ActiveDR) solution, which advocates approaching data retention from a holistic, activeness-based perspective. By evaluating the frequency and impact of users' activities, ActiveDR prioritizes the file purge process for inactive users and rewards active users with extended file lifetimes on parallel storage. Our extensive evaluations based on traces of the former Titan supercomputer show that, when reaching the same purge target, ActiveDR achieves up to 37% file miss reduction compared to the current FLT retention methodology.
- Published
- 2021
- Full Text
- View/download PDF
26. Efficient Shortest Distance Query Processing in Road Networks for Spatially Skewed Workloads
- Author
-
Jingyi Wan
- Subjects
Vertex (graph theory), Theoretical computer science, Speedup, Computer science, Skew, Process (computing), Reinforcement learning, Locality of reference, Preprocessor, Interval (mathematics), Computer Science::Databases
- Abstract
Shortest distance computing in road networks is an essential component in a range of applications. As a well-adopted method, 2-hop labeling assigns each vertex a label and enables the distance computation only by a sort-merge join on the labels. However, few existing 2-hop labeling based proposals consider the spatio-temporal characteristics of dynamic query workloads. To process massive-scale shortest distance query workloads, we propose a Workload-aware Core-Forest label index (WCF) to exploit spatial skew in workloads. In addition, we develop a Reinforcement Learning based Time Interval Partitioning (RL-TIP) that utilizes temporal locality to further improve the query performance. Extensive experiments on real-world data demonstrate that our proposal is capable of achieving a query processing speedup of an order of magnitude with less preprocessing time and space, when compared to the state-of-the-art proposals.
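A sketch of the 2-hop labeling query the abstract builds on (the struct layout is illustrative; WCF's core-forest structure and workload-aware parts are not reproduced): each vertex keeps a label sorted by hub id, and a query is a single sort-merge pass over two labels.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// 2-hop labeling query: each vertex stores a list of (hub, distance) pairs
// sorted by hub id; the shortest distance between u and v is the minimum of
// d(u,h) + d(h,v) over hubs h common to both labels, found by a sort-merge
// join over the two sorted lists.
struct LabelEntry { int hub; uint32_t dist; };
using Label = std::vector<LabelEntry>;   // sorted by hub id

uint32_t queryDistance(const Label& lu, const Label& lv) {
    uint32_t best = std::numeric_limits<uint32_t>::max();
    size_t i = 0, j = 0;
    while (i < lu.size() && j < lv.size()) {
        if (lu[i].hub < lv[j].hub)      ++i;
        else if (lu[i].hub > lv[j].hub) ++j;
        else {                           // common hub: candidate distance
            best = std::min(best, lu[i].dist + lv[j].dist);
            ++i; ++j;
        }
    }
    return best;                         // max() means no common hub found
}
```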
- Published
- 2021
- Full Text
- View/download PDF
27. Accurate Online Tensor Factorization for Temporal Tensor Streams with Missing Values
- Author
-
Seyun Kim, Dawon Ahn, and U Kang
- Subjects
Tensor factorization, Exploit, Computer science, Tensor (intrinsic definition), Locality of reference, Epidemic disease, Missing data, Regularization (mathematics), Algorithm, Task (project management)
- Abstract
Given a time-evolving tensor stream with missing values, how can we accurately discover latent factors in an online manner to predict missing values? Online tensor factorization is a crucial task with many important applications including the analysis of climate, network traffic, and epidemic disease. However, existing online methods have disregarded temporal locality and thus have limited accuracy. In this paper, we propose STF (Streaming Tensor Factorization), an accurate online tensor factorization method for real-world temporal tensor streams with missing values. We exploit an attention-based temporal regularization to learn inherent temporal patterns of the streams. We also propose an efficient online learning algorithm which allows each row of the temporal factor matrix to be updated from past and future information. Extensive experiments show that the proposed method gives the state-of-the-art accuracy, and quickly processes each tensor slice.
- Published
- 2021
- Full Text
- View/download PDF
28. RETRACTED ARTICLE: An improved memory adaptive up-growth to mine high utility itemsets from large transaction databases
- Author
-
D. Sharmila and D. Sathyavani
- Subjects
General Computer Science, Database, Computer science, Node (networking), Locality, 02 engineering and technology, Data structure, computer.software_genre, Set (abstract data type), Memory management, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Locality of reference, 020201 artificial intelligence & image processing, Depth-first search, Database transaction, computer
- Abstract
High utility itemset (HUI) mining identifies interesting patterns in transactional databases by taking into consideration the utility of each item in a transaction, for instance, margins or profits. Many candidates are generated for HUIs, which reduces mining performance. The frequent pattern-growth (FP-Growth) algorithm was widely used to discover frequent itemsets using an FP-Tree, whereas a UP-Tree was constructed to store high utility itemsets. Mining a UP-Tree with FP-Growth extracts high utility itemsets but generates too many candidates, so UP-Growth and UP-Growth+ were proposed to shorten the candidate itemsets. In UP-Growth, two tactics, discarding local unpromising items (DLU) and decreasing local node utilities (DLN), were used within FP-Growth. In UP-Growth+, two strategies, DLU with estimated node utilities (DNU) and DLN for the nodes of a local UP-Tree by evaluating the descendant nodes (DNN), were incorporated into FP-Growth. However, UP-Growth and UP-Growth+ suffer from poor spatial and temporal locality. Initially, the UP-Tree is created in available main memory and then extended to secondary memory when the transaction database is large; the storage of the data structure in secondary memory and its accessibility are not clearly addressed in existing algorithms. In this paper, the spatial locality issues of UP-Growth and UP-Growth+ are solved by rearranging the nodes of the UP-Tree in a depth-first manner, and the temporal locality issues are solved by a page blocking technique that reorganizes the execution of UP-Growth and UP-Growth+. In addition, a memory management strategy is introduced to minimize the space requirement of high utility itemsets. Thus, the proposed Improved Adaptive UP-Growth (IAUP-Growth) and Improved Adaptive UP-Growth+ (IAUP-Growth+) overcome the spatial and temporal locality problems and effectively reduce memory usage.
- Published
- 2020
- Full Text
- View/download PDF
29. A Cache Replacement Policy with Considering Fluctuation Patterns of Total Priority Value
- Author
-
Ryosuke Higashi and Jubee Tada
- Subjects
010302 applied physics, Hardware_MEMORYSTRUCTURES, Memory hierarchy, CPU cache, Computer science, Distributed computing, Demotion, Thrashing, Parallel computing, 030204 cardiovascular system & hematology, 01 natural sciences, Reduction (complexity), 03 medical and health sciences, 0302 clinical medicine, 0103 physical sciences, Overhead (computing), Locality of reference, State (computer science), Cache
- Abstract
In recent computer systems, a high-performance memory system is required to improve computing performance. To improve the performance of the memory system, the capacity and associativity of cache memories tend to be enlarged at all levels of the memory hierarchy. Therefore, a high-performance cache replacement policy is required that can be implemented with small overhead and can adapt to the behavior of various applications. Under such circumstances, the adaptive demotion policy with consideration of global fluctuations of priority values (ADP-G) was proposed. ADP-G is based on the adaptive demotion policy (ADP), which can adapt to the behavior of various applications by changing its insertion, promotion, demotion, and selection (IPDS) policies. ADP-G selects the appropriate IPDS policies from two alternatives: one suited to temporal locality and the other tolerant of scans and thrashing. For this selection, ADP-G uses the global fluctuations of the priority values of all blocks in the cache. Although ADP-G is a high-performance cache replacement policy compared with conventional ones, it causes a problem when the scan/thrashing-tolerant IPDS policies perform better than the temporal-locality IPDS policies. This paper discusses a novel cache replacement policy to solve this problem. First, the fluctuation patterns of the total priority value in several benchmarks are analyzed. As a result, two characteristics are observed that are useful for detecting whether the running application suits the temporal-locality IPDS policies or the scan/thrashing-tolerant IPDS policies. Based on this analysis, a cache replacement policy that considers fluctuation patterns of the total priority value is proposed, called ADP-P. ADP-P has two states, suited to temporal locality and to scan/thrashing tolerance, respectively. To detect the suitable state, ADP-P uses the total of the priority values of all blocks in the cache, stored in the global total register (GTR). In the temporal-locality state, the state is changed to the scan/thrashing-tolerant state if the reduction of the GTR is larger than a threshold value. In the scan/thrashing-tolerant state, the state is changed back to the temporal-locality state if no GTR fluctuation larger than the threshold value occurs for a fixed number of cache accesses. Experimental results show that the proposed cache replacement policy reduces cache misses compared to the LRU policy and ADP-G. With a 4 MB L3 cache, in the geometric mean over all benchmarks, the proposed policy achieves 6.0% and 11% MPKI reductions compared to ADP-G and the LRU policy, respectively.
- Published
- 2020
- Full Text
- View/download PDF
30. Online Multi-Object Tracking Using Joint Domain Information in Traffic Scenarios
- Author
-
Long Chen, Wei Tian, and Martin Lauer
- Subjects
050210 logistics & transportation, Computer science, business.industry, Mechanical Engineering, Association (object-oriented programming), 05 social sciences, Object (computer science), Computer Science Applications, Domain (software engineering), Video tracking, 0502 economics and business, Automotive Engineering, Trajectory, Locality of reference, Eye tracking, Computer vision, Artificial intelligence, Noise (video), business
- Abstract
Visual tracking of multiple objects is an essential component of a perception system in autonomous driving vehicles. One of the favored approaches is the tracking-by-detection paradigm, which links current detection hypotheses to previously estimated object trajectories (also known as tracks) by searching for appearance or motion similarities between them. As this search operation is usually based on very limited spatial or temporal locality, the association can fail in cases of motion noise or long-term occlusion. In this paper, we propose a novel tracking method that solves this problem by putting together information from both an enlarged structural and temporal domain. For efficiency without loss of optimality, the approach is decomposed into three stages, each dealing with only one constrained association task, thus following an alternating optimization fashion. In our approach, detections are first assembled into small tracklets based on meta-measurements of object affinity. The tracklet-to-track association task is solved using structural information based on the motion pattern between them. Here, we propose new rules to decouple the processing time from the tracklet length. Furthermore, constraints from the temporal domain are introduced to recover objects that disappear for long periods due to failed detection or long-term occlusion. By putting together this heterogeneous domain information, our approach exhibits improved state-of-the-art performance on standard benchmarks. With relatively little processing time, online and real-time tracking is also possible with our approach.
- Published
- 2020
- Full Text
- View/download PDF
31. MetaStrider
- Author
-
Anirudh Jain, Jeanine Cook, Joseph M. Lennon, Sriseshan Srikanth, Thomas M. Conte, and Erik P. DeBenedictis
- Subjects
Computer science, Distributed computing, 010103 numerical & computational mathematics, 02 engineering and technology, 01 natural sciences, 020202 computer hardware & architecture, Variety (cybernetics), Rendering (computer graphics), Reduction (complexity), Hardware and Architecture, Scalability, 0202 electrical engineering, electronic engineering, information engineering, Parallelism (grammar), Key (cryptography), Locality of reference, 0101 mathematics, Software, Information Systems, Sparse matrix
- Abstract
Reduction is an operation performed on the values of two or more key-value pairs that share the same key. Reduction of sparse data streams finds application in a wide variety of domains such as data and graph analytics, cybersecurity, machine learning, and HPC applications. However, these applications exhibit low locality of reference, rendering traditional architectures and data representations inefficient. This article presents MetaStrider, a significant algorithmic and architectural enhancement to the state-of-the-art, SuperStrider. Furthermore, these enhancements enable a variety of parallel, memory-centric architectures that we propose, resulting in demonstrated performance that scales near-linearly with available memory-level parallelism.
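The reduction primitive itself is simple when written in software; MetaStrider's contribution is performing it at scale in memory-centric hardware, which this plain-Python sketch does not capture.

```python
# Reduction of a sparse key-value stream: values sharing a key are
# combined with a reduction operator (here addition).

def reduce_stream(pairs, op=lambda a, b: a + b):
    table = {}
    for key, value in pairs:
        table[key] = op(table[key], value) if key in table else value
    return table

# Example: edge weights accumulated per destination vertex.
print(reduce_stream([("v1", 2), ("v2", 5), ("v1", 3)]))  # {'v1': 5, 'v2': 5}
```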
- Published
- 2019
- Full Text
- View/download PDF
32. Spin Summations
- Author
-
Devin A. Matthews, Paul Springer, and Paolo Bientinesi
- Subjects
G.4 ,FOS: Computer and information sciences ,D.1.3 ,Sequence ,Computer Science - Performance ,010304 chemical physics ,Computer science ,Applied Mathematics ,Perspective (graphical) ,010103 numerical & computational mathematics ,Supercomputer ,01 natural sciences ,Performance (cs.PF) ,0103 physical sciences ,Vectorization (mathematics) ,Memory footprint ,Locality of reference ,Computer Science - Mathematical Software ,Tensor ,0101 mathematics ,Mathematical Software (cs.MS) ,Algorithm ,Software ,Spin-½ - Abstract
In addition to tensor contractions, one of the most pronounced computational bottlenecks in the nonorthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms—including CCSD(T) and CCSDT(Q)—is spin summation. At first sight, spin summations are operations similar to tensor transpositions, but a closer look reveals additional challenges to high-performance calculations, including temporal locality and scattered memory accesses. This article explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g., caches, vectorization) or the problem itself (e.g., factorizability). The final algorithm combines the advantages of all the solutions while avoiding their drawbacks; it achieves high performance through parallelization and vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4× and 5.5× over the NCC quantum chemistry software package. In addition to this performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2× over an out-of-place variant.
- Published
- 2019
- Full Text
- View/download PDF
33. Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing
- Author
-
Hans Vandierendonck, Mohsen Koohi Esfahani, and Peter Kilpatrick
- Subjects
Sparse Matrix-Vector Multiplication ,Computer science ,Locality ,High Performance Computing ,Sparse matrix-vector multiplication ,Graph Algorithms ,Degree distribution ,Vertex (geometry) ,Tree traversal ,Graph traversal ,Locality of reference ,Multiplication ,Graph Traversal ,Memory Locality ,Algorithm ,MathematicsofComputing_DISCRETEMATHEMATICS - Abstract
The skewed degree distribution of real-world graphs is the main source of poor locality in traversing all edges of a graph, known as Sparse Matrix-Vector (SpMV) multiplication. Conventional graph traversal methods, such as push and pull, traverse all vertices in the same manner, and we show that applying a uniform traversal direction to all edges leads to sub-optimal memory locality and hence poor efficiency. This paper argues that different vertices in power-law graphs have different locality characteristics and that the traversal method should be adapted to these characteristics. To this end, we propose to inspect the number of destination and source vertices when selecting a cache-compatible traversal direction for each type of vertex. We introduce in-Hub Temporal Locality (iHTL), a structure-aware SpMV that combines push and pull in one graph traversal, applied to different vertex types. iHTL exploits temporal locality by traversing incoming edges to in-hubs in the push direction, while processing other edges in the pull direction. The evaluation shows iHTL is 1.5-2.4 times faster than pull and 4.8-9.5 times faster than push in state-of-the-art graph processing frameworks such as GraphGrind, GraphIt and Galois. More importantly, iHTL is 1.3-1.5 times faster than pull traversal combined with state-of-the-art locality-optimizing reordering algorithms such as SlashBurn, GOrder, and Rabbit-Order. https://blogs.qub.ac.uk/GraphProcessing/exploiting-in-hub-temporal-locality-in-spmv-based-graph-processing/
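The push/pull split can be sketched over a plain edge list. This is a schematic of the iHTL idea under assumed details: the hub threshold and the data layout are illustrative, not the paper's actual implementation.

```python
# Sketch of hybrid push/pull SpMV: incoming edges of high-in-degree
# vertices ("in-hubs") are processed push-style, so the small set of hub
# accumulators stays cache-resident; all other edges are processed
# pull-style. Hub threshold and layout are illustrative assumptions.

def hybrid_spmv(num_vertices, edges, x, hub_threshold):
    in_degree = [0] * num_vertices
    for src, dst in edges:
        in_degree[dst] += 1
    is_hub = [d >= hub_threshold for d in in_degree]

    y = [0.0] * num_vertices
    # Push direction for in-hub destinations.
    for src, dst in edges:
        if is_hub[dst]:
            y[dst] += x[src]
    # Pull direction for the remaining vertices: each non-hub vertex
    # gathers from its in-neighbours.
    incoming = [[] for _ in range(num_vertices)]
    for src, dst in edges:
        if not is_hub[dst]:
            incoming[dst].append(src)
    for dst in range(num_vertices):
        for src in incoming[dst]:
            y[dst] += x[src]
    return y

edges = [(0, 2), (1, 2), (3, 2), (0, 1)]     # vertex 2 is an in-hub
print(hybrid_spmv(4, edges, [1.0, 2.0, 3.0, 4.0], hub_threshold=3))
```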
- Published
- 2021
- Full Text
- View/download PDF
34. Ascetic: Enhancing Cross-Iterations Data Efficiency in Out-of-Memory Graph Processing on GPUs
- Author
-
Pen-Chung Yew, Tang Ruiqi, Jin Zhang, Ziyi Zhao, Kailun Wang, Wenwen Wang, and Xiaoli Gong
- Subjects
Out of memory ,Speedup ,Data efficiency ,Computer science ,Graph traversal ,Memory footprint ,Graph (abstract data type) ,Overhead (computing) ,Locality of reference ,Parallel computing - Abstract
Graph analytics are widely used in real-world applications, and GPUs are major accelerators for such applications. However, as graph sizes grow significantly beyond the capacity of GPU memory, performance can degrade substantially due to the heavy overhead of moving large amounts of graph data between CPU main memory and GPU memory. Some existing approaches have tried to exploit data locality and address the issue of memory oversubscription on GPUs, but they have yet to take advantage of data reuse across iterations because of the data sizes involved in most large-graph analytics. In our studies, we found that in most graph applications the graph traversal exhibits a roughly sequential scan over the graph data with an extremely large memory footprint. Based on this observation, we propose a novel framework, called Ascetic, to exploit temporal locality with very long reuse distances. In Ascetic, the GPU memory is divided into a Static Region and an On-demand Region. The static region exploits data reuse across iterations. The on-demand region is designed to load data requested during an iteration of the graph traversal that is not found in the static region. We have implemented a prototype of the Ascetic framework and conducted a series of performance evaluation experiments. The results show that Ascetic can significantly reduce data transfer overhead and allow more overlapped execution between the GPU and CPU, leading to an average 2.0x speedup over a state-of-the-art approach.
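A toy model makes the two-region split concrete. Everything here is an illustrative assumption: the chunk granularity, the FIFO staging policy, and the choice of which chunks are pinned static.

```python
# Toy model of Ascetic's memory split: a Static Region pinned with hot
# graph chunks reused across iterations, and an On-demand Region that
# stages whatever else an iteration touches.

class PartitionedMemory:
    def __init__(self, static_ids, on_demand_slots):
        self.static = set(static_ids)        # resident for the whole run
        self.on_demand = []                  # FIFO of currently staged chunks
        self.slots = on_demand_slots
        self.transfers = 0

    def access(self, chunk_id):
        if chunk_id in self.static or chunk_id in self.on_demand:
            return                            # hit: no CPU->GPU transfer
        self.transfers += 1                   # miss: stage the chunk
        if len(self.on_demand) == self.slots:
            self.on_demand.pop(0)             # evict oldest staged chunk
        self.on_demand.append(chunk_id)

mem = PartitionedMemory(static_ids=[0, 1], on_demand_slots=2)
for _ in range(3):                            # iterations re-touch chunks 0..4
    for chunk in [0, 1, 2, 3, 4]:
        mem.access(chunk)
print(mem.transfers)  # 9: only the non-static chunks 2-4 incur transfers
```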
- Published
- 2021
- Full Text
- View/download PDF
35. A Cache Policy Based on Request Association Analysis for Reliable NAND-Based Storage Systems
- Author
-
Chin-Hsien Wu and Chi-Hsiu Su
- Subjects
Technology ,Computer science ,Nand flash memory ,QH301-705.5 ,QC1-999 ,NAND gate ,02 engineering and technology ,computer.software_genre ,0202 electrical engineering, electronic engineering, information engineering ,cache ,Locality of reference ,General Materials Science ,Biology (General) ,solid-state drive (SSD) ,Instrumentation ,QD1-999 ,Fluid Flow and Transfer Processes ,Hardware_MEMORYSTRUCTURES ,Process Chemistry and Technology ,Physics ,Locality ,General Engineering ,NAND flash memory ,020206 networking & telecommunications ,Free space ,Engineering (General). Civil engineering (General) ,association analysis ,020202 computer hardware & architecture ,Computer Science Applications ,Chemistry ,Operating system ,Cache ,TA1-2040 ,computer ,Dram ,Garbage collection - Abstract
Compared with traditional hard-disk drives (HDDs), solid-state drives (SSDs) adopt NAND flash memory and have become today's popular storage devices. However, when the free space in NAND flash memory runs low, garbage collection is triggered to reclaim it. Garbage collection involves a large amount of data writing and time-consuming erase operations, which can reduce the performance of NAND flash memory. Therefore, DRAM is usually added to NAND flash memory as a cache to store frequently used data. Typical cache methods mainly exploit temporal and spatial locality to keep frequently used data in the cache as long as possible. In addition, we find that beyond temporal/spatial locality, there are also certain associations between the accessed data. We therefore suggest that a cache policy should consider not only temporal/spatial locality but also the association relationships between accessed data to improve the cache hit ratio. In this paper, we propose a cache policy based on request association analysis for reliable NAND-based storage systems. According to the experimental results, the cache hit ratio of the proposed method increases significantly when compared with typical cache methods.
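One way to act on such associations is sketched below: an LRU cache that also learns which addresses are requested close together and avoids evicting partners of the incoming request. The window size and the co-occurrence rule are illustrative assumptions, not the paper's policy.

```python
# Sketch of an association-aware cache: besides recency, the policy
# tracks which addresses tend to be requested together and spares the
# partners of the current request at eviction time.
from collections import OrderedDict, defaultdict

class AssociationCache:
    def __init__(self, capacity, window=4):
        self.cache = OrderedDict()            # LRU order
        self.capacity = capacity
        self.recent = []                      # last few requests
        self.window = window
        self.assoc = defaultdict(int)         # (a, b) -> co-occurrence count

    def access(self, addr):
        for prev in self.recent:              # learn associations online
            self.assoc[(prev, addr)] += 1
        self.recent = (self.recent + [addr])[-self.window:]
        if addr in self.cache:
            self.cache.move_to_end(addr)      # LRU hit handling
            return True
        if len(self.cache) >= self.capacity:
            self.evict(addr)
        self.cache[addr] = True
        return False

    def evict(self, incoming):
        # Evict the least-recently-used block NOT associated with the
        # incoming request; fall back to plain LRU if all are associated.
        for victim in self.cache:
            if self.assoc[(incoming, victim)] == 0:
                del self.cache[victim]
                return
        self.cache.popitem(last=False)
```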
- Published
- 2021
- Full Text
- View/download PDF
36. Anomaly Detection in Photovoltaic Production Factories via Monte Carlo Pre-Processed Principal Component Analysis
- Author
-
Roberto Ferulano, Eleonora Arena, Dario Alfio Iuvara, Nour Alhuda Sulieman, Eric Stefan Miele, Massimo Villari, Alessandro Corsini, and Lorenzo Ricciardi Celsi
- Subjects
Production line ,PV cell production line ,Technology ,Control and Optimization ,principal component analysis ,Computer science ,Monte Carlo method ,Energy Engineering and Power Technology ,02 engineering and technology ,010501 environmental sciences ,computer.software_genre ,01 natural sciences ,predictive maintenance ,Robustness (computer science) ,0202 electrical engineering, electronic engineering, information engineering ,Locality of reference ,Electrical and Electronic Engineering ,Engineering (miscellaneous) ,Monte Carlo simulation ,0105 earth and related environmental sciences ,Renewable Energy, Sustainability and the Environment ,Data stream mining ,anomaly detection ,Outlier ,Principal component analysis ,020201 artificial intelligence & image processing ,Anomaly detection ,Data mining ,computer ,Energy (miscellaneous) - Abstract
This paper investigates a use case of robust anomaly detection applied to the scenario of a photovoltaic production factory—namely, Enel Green Power’s 3SUN solar cell production plant in Catania, Italy—by considering a Monte Carlo based pre-processing technique as a valid alternative to other typically used methods. In particular, the proposed method exhibits the following advantages: (i) outlier replacement, in contrast with traditional methods that are limited to outlier detection only, and (ii) preservation of temporal locality with respect to the training dataset. After pre-processing, the authors trained an anomaly detection model based on principal component analysis and defined a suitable key performance indicator for each sensor in the production line based on the model errors. In this way, by running the algorithm on unseen data streams, it is possible to isolate anomalous conditions by monitoring the above-mentioned indicators and virtually trigger an alarm when they exceed a reference threshold. The proposed approach was tested on both standard operating conditions and an anomalous scenario. In the considered use case, it successfully anticipated an equipment fault almost two weeks in advance, and also demonstrated its robustness to false alarms during normal conditions.
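The detection stage (PCA reconstruction error as a per-sensor indicator, with a threshold alarm) can be sketched with plain numpy. The component count, the threshold rule, and the injected fault are illustrative; the Monte Carlo outlier-replacement step is not shown.

```python
# Minimal sketch of PCA-based anomaly scoring: fit PCA on normal data,
# score new samples by per-sensor reconstruction error, alarm on threshold.
import numpy as np

def fit_pca(X, k):
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]                      # top-k principal directions

def reconstruction_error(X, mean, components):
    Z = (X - mean) @ components.T            # project
    X_hat = Z @ components + mean            # reconstruct
    return np.abs(X - X_hat)                 # per-sensor error (the KPI)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 6))            # normal operating data
mean, comps = fit_pca(train, k=3)
threshold = reconstruction_error(train, mean, comps).max(axis=0)

sample = rng.normal(size=(1, 6))
sample[0, 2] += 10.0                         # inject a sensor anomaly
alarm = reconstruction_error(sample, mean, comps)[0] > threshold
print(alarm)                                 # a True entry should flag sensor 2
```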
- Published
- 2021
- Full Text
- View/download PDF
37. Research on Periodic Precaching Optimization Strategy Based on Access Mode
- Author
-
Liang Ye, Weijie Li, and Wenhao Zhu
- Subjects
Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,Locality ,Byte ,Parallel computing ,Software ,Factor (programming language) ,Hit rate ,Locality of reference ,Cache ,business ,Cache algorithms ,computer ,computer.programming_language - Abstract
Traditional and classic caching algorithms rely on temporal locality to improve cache performance. When the locality of data requests is weak and certain periodic access patterns are present, their performance may be worse. For this reason, incorporating periodic access patterns into the cache replacement algorithm is better suited to some real scenarios. To address this, based on the access pattern, this paper proposes a periodic precaching strategy, PRPC (periodically reinforce precaching), and combines it with two existing access strategies, LRU and LFU, to obtain two improved algorithms. Results in a real environment show that when the cache space is close to the median size of the requested collection objects, the precaching strategy achieves better hit performance. Compared with the baseline cache algorithms in our comparative tests, the improved algorithms with the PRPC precaching strategy achieve a higher cache hit rate and byte hit rate.
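A periodic precacher can be layered on an LRU cache as sketched below. The period detector (last observed interval) and the lead time are illustrative assumptions; the actual PRPC strategy may estimate periods differently.

```python
# Sketch of period-aware precaching over LRU: if an object recurs at a
# near-constant interval, fetch it back shortly before its predicted
# next request.
from collections import OrderedDict

class PeriodicPrecacher:
    def __init__(self, capacity, lead=1):
        self.cache = OrderedDict()
        self.capacity = capacity
        self.lead = lead
        self.last_seen = {}                  # object -> last request tick
        self.period = {}                     # object -> estimated period

    def _insert(self, obj):
        if obj not in self.cache and len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # evict LRU
        self.cache[obj] = True
        self.cache.move_to_end(obj)

    def request(self, obj, tick):
        if obj in self.last_seen:            # update the period estimate
            self.period[obj] = tick - self.last_seen[obj]
        self.last_seen[obj] = tick
        hit = obj in self.cache
        self._insert(obj)
        # Precache any object whose predicted next request is imminent.
        for other, p in self.period.items():
            if other != obj and self.last_seen[other] + p - tick <= self.lead:
                self._insert(other)
        return hit
```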
- Published
- 2021
- Full Text
- View/download PDF
38. FULL-W2V
- Author
-
Thomas Randall, Tyler Allen, and Rong Ge
- Subjects
Speedup ,Computer science ,Parallel algorithm ,02 engineering and technology ,Parallel computing ,Bottleneck ,Data access ,Shared memory ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Locality of reference ,020201 artificial intelligence & image processing ,Performance improvement ,Throughput (business) - Abstract
Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP): it represents latent grammatical and syntactical information in human text with dense vectors in a low dimension. Word2Vec has a high computational cost due to the algorithm’s inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated technologies to exploit parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture’s peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce accesses to low memory levels and improve temporal locality. FULL-W2V reduces accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvements that scale across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared-memory caching, together with high-throughput shared-memory reductions, leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
- Published
- 2021
- Full Text
- View/download PDF
39. CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs (Extended Abstract)
- Author
-
Jiawei Jiang, Lingxiao Ma, Xupeng Miao, Bin Cui, Yingxia Shao, Zhi Yang, and Lele Yu
- Subjects
Schema (genetic algorithms) ,Memory management ,Information engineering ,Memory hierarchy ,Flow (mathematics) ,Computer science ,Training (meteorology) ,Locality of reference ,Parallel computing ,Data modeling - Abstract
In this paper, we propose an efficient GPU-training framework for large-scale wide models, named cuWide. To fully benefit from the memory hierarchy of GPUs, cuWide applies a new flow-based schema for training, which leverages the spatial and temporal locality of wide models to drastically reduce the amount of communication with GPU global memory. Comprehensive experiments show that cuWide can be more than 20× faster than state-of-the-art GPU solutions and multi-core CPU solutions.
- Published
- 2021
- Full Text
- View/download PDF
40. RaFIO: a random forest I/O-aware algorithm
- Author
-
Stéphane Rubini, Camélia Slimani, Yuan-Hao Chang, Jalil Boukhobza, and Chun-Feng Wu
- Subjects
Computer science ,Process (computing) ,Decision tree ,020207 software engineering ,02 engineering and technology ,Workspace ,Execution time ,Random forest ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] ,Tree (data structure) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Locality of reference ,[INFO.INFO-ES]Computer Science [cs]/Embedded Systems ,[INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS] ,Algorithm ,ComputingMilieux_MISCELLANEOUS - Abstract
Random-forest-based classification is a widely used machine learning algorithm. Training a random forest consists of building several decision trees that classify elements of the input dataset according to their features. This process is memory intensive. When datasets are larger than the available memory, the number of I/O operations grows significantly, causing a dramatic performance drop. Our experiments showed that, for a dataset 8 times larger than the available memory workspace, training a random forest is 25 times slower than when the dataset fits in memory. In this paper, we revisit the tree building algorithm to optimize performance for datasets larger than the memory workspace. The proposed strategy aims at reducing the number of I/O operations by exploiting the temporal locality exhibited by the random forest building algorithm. Experiments showed that our method reduced the execution time of tree building by up to 90%, and by 60% on average, compared to a state-of-the-art method when the datasets are larger than the main memory workspace.
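The general I/O-reduction idea can be illustrated as follows: instead of streaming the dataset once per tree node, all frontier nodes of one level share a single pass over the data, so each chunk is read from storage once per level. This is a schematic of the locality argument, not RaFIO's actual algorithm; the chunking and the split statistic are simplified placeholders.

```python
# One storage pass per tree level: while a chunk is hot in memory, every
# frontier node's split statistics are updated from it.

def grow_level(frontier_nodes, dataset_chunks, route):
    """frontier_nodes: {node_id: evaluator}; route(row) -> node_id."""
    for chunk in dataset_chunks:             # one storage read per chunk
        for row in chunk:                    # chunk reused by all nodes
            node_id = route(row)
            if node_id in frontier_nodes:
                frontier_nodes[node_id].update(row)
    return {nid: ev.best_split() for nid, ev in frontier_nodes.items()}

class MeanEvaluator:
    # Toy "split statistic": just a running mean per node.
    def __init__(self):
        self.count, self.total = 0, 0.0
    def update(self, row):
        self.count += 1
        self.total += row[0]
    def best_split(self):
        return self.total / self.count if self.count else None

chunks = [[(1.0,), (2.0,)], [(3.0,), (4.0,)]]
nodes = {0: MeanEvaluator(), 1: MeanEvaluator()}
route = lambda row: 0 if row[0] < 2.5 else 1
print(grow_level(nodes, chunks, route))      # {0: 1.5, 1: 3.5}
```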
- Published
- 2021
- Full Text
- View/download PDF
41. Real-Time Characterization of Data Access Correlations
- Author
-
Michael Marzullo, Bryan T. Harris, and Nihat Altiparmak
- Subjects
Software ,Data access ,Memory management ,Computer science ,business.industry ,Distributed computing ,Computer data storage ,Locality ,Data analysis ,Locality of reference ,business ,Data structure - Abstract
Efficient and accurate detection of data access correlations can enable various storage system optimizations. However, existing offline detection techniques are costly in terms of computation and memory, and they rely on previously recorded disk I/O traces, thus wasting additional storage space and causing further I/O. Moreover, due to their offline nature, they are inadequate for automatic optimizations. In this paper, we propose a real-time data access characterization framework that eliminates the drawbacks of offline analysis with a slight compromise in accuracy. The proposed framework continuously monitors I/O requests and forms correlations by applying single-pass data analysis techniques and maintaining a synopsis data structure to efficiently characterize access behavior along various dimensions, including spatial locality, temporal locality, and frequency. Our evaluation using synthetic and real-world storage workloads indicates that the proposed framework can detect over 90% of data access correlations in real time, using limited memory. The proposed framework is crucial for enabling future self-optimizing storage systems that can automatically react to I/O bottlenecks.
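A single-pass correlation detector can be sketched with a small sliding window as the synopsis: pairs of blocks that co-occur within the window accumulate a score. The window size and reporting threshold are illustrative assumptions, not the paper's data structure.

```python
# Single-pass sketch: a bounded window of recent block accesses is the
# synopsis; co-occurring pairs accumulate counts, keeping memory limited.
from collections import deque, defaultdict

def correlated_pairs(access_stream, window=8, min_count=3):
    recent = deque(maxlen=window)
    counts = defaultdict(int)
    for block in access_stream:
        for prev in recent:
            if prev != block:
                counts[frozenset((prev, block))] += 1
        recent.append(block)
    return {tuple(sorted(p)): c for p, c in counts.items() if c >= min_count}

stream = [1, 7, 2, 1, 7, 3, 1, 7, 9, 1, 7]
print(correlated_pairs(stream, window=3))    # the pair (1, 7) dominates
```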
- Published
- 2021
- Full Text
- View/download PDF
42. HVSM: An approach to exploiting the locality of the multi-node heterogeneous multicore system.
- Author
-
Li, Liang, Dong, Xiaoshe, Zhang, Bao, Liu, Chao, Zhang, Xingjun, and Wang, Endong
- Abstract
One of the challenges in achieving good performance on a heterogeneous multicore architecture is the exploitation of locality of reference. While this is an issue for single-node systems, it is an even more critical problem for multi-node systems with PGAS distributed shared memory. In this paper, we address the exploitation of the locality of shared memory references in the memory hierarchy of the multi-node heterogeneous multicore system. First, based on PGAS, we propose HVSM, a hierarchical virtual shared memory, to model the structure of the complex memory hierarchy of the heterogeneous multicore system. We then incorporate a heuristic mechanism to exploit the locality among shared memory references. Both the analytical results and the experimental results on the prototype system Cell-HVSM show that applications on HVSM clearly benefit from our approach. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
43. Variable-Sized Blocks for Locality-Aware SpMV
- Author
-
Naveen Namashivavam, Pen-Chung Yew, and Sanyam Mehta
- Subjects
Multi-core processor ,Speedup ,Xeon ,Computer science ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,01 natural sciences ,Kernel (linear algebra) ,Vectorization (mathematics) ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Locality of reference ,020201 artificial intelligence & image processing ,0101 mathematics ,Sparse matrix - Abstract
Blocking is an important optimization for mitigating data movement overhead and improving temporal locality in SpMV, a sparse BLAS kernel with an irregular memory reference pattern. In this work, we propose an analytical model to determine an effective block size for highly irregular sparse matrices by factoring in the distribution of non-zeros in the sparse dataset. As a result, the blocks generated by our scheme are variable-sized, as opposed to the constant-sized blocks of most existing SpMV algorithms. We demonstrate our blocking scheme using Compressed Vector Blocks (CVB), a new column-based blocked data format, on an Intel Xeon Skylake-X multicore processor. We evaluated the performance of CVB-based SpMV with variable-sized blocks using an extensive set of matrices from the Stanford Network Analysis Platform (SNAP). Our evaluation shows speedups of up to 2.62X (1.73X on average) and 2.02X (1.18X on average) over the highly vendor-tuned SpMV implementation in Intel's Math Kernel Library (MKL) on single and multiple Intel Xeon cores, respectively.
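The variable-sizing idea can be sketched as a greedy column packer: columns are grouped until a per-block non-zero budget is reached, so dense regions get narrow blocks and sparse regions get wide ones. The fixed budget below is a stand-in for the paper's analytical model.

```python
# Greedy variable-sized column blocking driven by non-zero counts.

def variable_blocks(col_nnz, nnz_budget):
    """col_nnz[j] = non-zeros in column j; returns (start, end) ranges."""
    blocks, start, acc = [], 0, 0
    for j, nnz in enumerate(col_nnz):
        if acc and acc + nnz > nnz_budget:   # close the current block
            blocks.append((start, j))
            start, acc = j, 0
        acc += nnz
    blocks.append((start, len(col_nnz)))
    return blocks

# [(0, 2), (2, 3), (3, 4), (4, 8)]: dense columns get narrow blocks,
# sparse tails are grouped into wide ones.
print(variable_blocks([1, 2, 90, 80, 1, 1, 2, 1], nnz_budget=64))
```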
- Published
- 2021
- Full Text
- View/download PDF
44. Cache-Aware Dynamic Skewed Tree for Fast Memory Authentication
- Author
-
Saru Vig, Siew-Kei Lam, and Rohan Juneja
- Subjects
010302 applied physics ,Authentication ,Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,Skew ,02 engineering and technology ,01 natural sciences ,020202 computer hardware & architecture ,Parsec ,Tree (data structure) ,Overhead (business) ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Locality of reference ,System on a chip ,Cache ,business ,Computer network - Abstract
Memory integrity trees are widely used to protect external memories in embedded systems against bus attacks. However, existing methods often incur high performance overheads during memory authentication. To reduce memory accesses during authentication, the tree nodes are cached on-chip. In this paper, we propose a cache-aware technique that dynamically skews the integrity tree based on the application workload in order to reduce this performance overhead. The tree is initialized using a van Emde Boas (vEB) organization to take advantage of locality of reference. At run time, the nodes of the integrity tree are dynamically repositioned based on their memory access patterns. In particular, frequently accessed nodes are placed closer to the root to reduce memory access overheads. The proposed technique is compared with existing methods on Multi2Sim using benchmarks from SPEC-CPU2006, SPLASH-2 and PARSEC to demonstrate its performance benefits.
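The benefit of skewing can be illustrated with a Huffman-style construction: leaves (memory blocks) with higher access frequency end up closer to the root, so their authentication paths are shorter. This is an analogy to the paper's dynamic skewing, not its actual algorithm.

```python
# Frequency-skewed tree depths via a Huffman-style merge: hot blocks get
# short paths to the root, cold blocks get long ones.
import heapq, itertools

def skewed_depths(freqs):
    """freqs: {block: access count} -> {block: path length to root}."""
    counter = itertools.count()              # tie-breaker for the heap
    heap = [(f, next(counter), (name,)) for name, f in freqs.items()]
    heapq.heapify(heap)
    depths = {name: 0 for name in freqs}
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        for leaf in a + b:                   # every merge adds one level
            depths[leaf] += 1
        heapq.heappush(heap, (fa + fb, next(counter), a + b))
    return depths

# Hot block 'A' gets a shorter verification path than cold 'C' and 'D'.
print(skewed_depths({"A": 90, "B": 6, "C": 3, "D": 1}))
```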
- Published
- 2021
- Full Text
- View/download PDF
45. EHDSktch
- Author
-
Chandran Goodchild, Smruti R. Sarangi, and Priyanka Singla
- Subjects
Hardware architecture ,020203 distributed computing ,Speedup ,Computer science ,Distributed computing ,02 engineering and technology ,020202 computer hardware & architecture ,Power Architecture ,Memory management ,0202 electrical engineering, electronic engineering, information engineering ,Locality of reference ,System on a chip ,Energy harvesting ,Streaming algorithm - Abstract
Energy harvesting devices (EHDs) are becoming extremely prevalent in remote and hazardous environments. They sense ambient parameters and compute statistics on them, which are then sent to a remote server. Due to the resource-constrained nature of EHDs, it is challenging to perform exact computations on streaming data; however, if we are willing to tolerate a slight amount of inaccuracy, we can leverage the power of sketching algorithms to provide quick answers with significantly lower energy consumption. In this paper, we propose a novel hardware architecture called EHDSktch -- a set of IP blocks that can be used to implement most of the popular sketching algorithms. We demonstrate energy savings of 4-10X and a speedup of more than 10X over state-of-the-art software implementations. Further leveraging temporal locality provides an additional gain of 3-20% in energy and time and reduces the on-chip memory requirement by at least 50-75%.
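For reference, one of the popular sketching algorithms such hardware targets is the count-min sketch, written out here in software: it answers approximate frequency queries over a stream using a small fixed-size table. The width, depth, and hash construction are illustrative choices, not EHDSktch's parameters.

```python
# Count-min sketch: approximate stream frequencies in bounded memory.
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Never under-estimates; collisions can only inflate counts.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for reading in ["hot", "hot", "cold", "hot"]:
    cms.add(reading)
print(cms.estimate("hot"))                   # 3 (up to collision error)
```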
- Published
- 2021
- Full Text
- View/download PDF
46. StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
- Author
-
Torsten Hoefler, Dominic Hofer, Tal Ben-Nun, Andreas Kuster, Johannes de Fine Licht, and Tiziano De Matteis
- Subjects
FOS: Computer and information sciences ,CUDA ,Computer Science - Distributed, Parallel, and Cluster Computing ,Computer science ,Stratix ,Locality of reference ,Code generation ,Parallel computing ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Deadlock ,Directed acyclic graph ,Field-programmable gate array ,Stencil - Abstract
Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device settings, respectively, the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into the architecture characteristics required for their efficient execution in practice.
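A minimal software analogue of chaining stencil stages: each stage is a generator consuming its predecessor's stream, so intermediate values flow stage-to-stage without a round trip to external memory, which is the temporal-locality property the dataflow mapping exploits. The stencil shapes here are illustrative.

```python
# Two chained 3-point stencils as streaming generators: the intermediate
# stream is never materialized in bulk.

def stencil3(stream, weights):
    buf = []                                 # 3-point sliding window
    for value in stream:
        buf.append(value)
        if len(buf) == 3:
            yield sum(w * v for w, v in zip(weights, buf))
            buf.pop(0)

source = range(10)
stage1 = stencil3(source, (0.25, 0.5, 0.25))   # smooth
stage2 = stencil3(stage1, (-1.0, 2.0, -1.0))   # then sharpen
print(list(stage2))
```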
- Published
- 2021
47. Reused-Based Replacement Policy for Last-Level Cache with Minimum Hardware Cost
- Author
-
Purnendu Das and Bishwa Ranjan Roy
- Subjects
Multi-core processor ,Hardware_MEMORYSTRUCTURES ,business.industry ,Computer science ,CPU cache ,Locality ,Cache-only memory architecture ,Locality of reference ,Cache ,business ,Computer hardware ,Access time ,Block (data storage) - Abstract
In multicore processors, the demand for cache memory is high, in order to bring frequently accessed data close to the processor. Off-chip misses reduce processor performance. Nowadays, a multilevel cache architecture is used to increase the capacity of the cache memory. Upper-level caches dedicated to a particular core act as private caches, whereas the last-level cache is shared by the multiple cores of the processor. The last-level cache is larger than the upper-level caches, so the chance of a large number of dead blocks in the last-level cache is higher. An efficient replacement policy is required to manage the shared last-level cache at minimum hardware cost. The traditional LRU replacement policy fails to remove dead blocks early in large caches, and many dead-block detection algorithms have been proposed at high hardware cost. In this paper, we propose a replacement policy that utilizes reuse locality instead of temporal locality, while reducing the hardware cost significantly. The proposed technique has been experimentally compared with two different baseline systems. It is observed that the proposed technique is capable of detecting dead blocks early and reduces the memory access time of the system.
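A reuse-locality policy can be approximated with a single reuse bit per block, set on the first hit after insertion; at eviction time, never-reused (likely dead) blocks go first. This single-bit design is a hedged sketch of the general technique, not necessarily the paper's exact policy.

```python
# Reuse-aware replacement: evict never-reused blocks before reused ones,
# falling back to LRU when every resident block has been reused.
from collections import OrderedDict

class ReuseAwareCache:
    def __init__(self, capacity):
        self.blocks = OrderedDict()          # block -> reused? (LRU order)
        self.capacity = capacity

    def access(self, block):
        if block in self.blocks:
            self.blocks[block] = True        # first hit marks the block reused
            self.blocks.move_to_end(block)
            return True
        if len(self.blocks) >= self.capacity:
            victim = next((b for b, reused in self.blocks.items()
                           if not reused), None)
            if victim is None:               # all reused: fall back to LRU
                victim = next(iter(self.blocks))
            del self.blocks[victim]
        self.blocks[block] = False           # inserted, not yet reused
        return False
```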
- Published
- 2021
- Full Text
- View/download PDF
48. Parameterized Analysis of Paging and List Update Algorithms.
- Author
-
Dorrigiv, Reza, Ehmsen, Martin, and López-Ortiz, Alejandro
- Subjects
ONLINE algorithms, PAGING (Computer science), PARAMETERIZATION, ARBITRARY constants, LOCALIZATION (Mathematics)
It is well-established that input sequences for paging and list update have locality of reference. In this paper we analyze the performance of algorithms for these problems in terms of the amount of locality in the input sequence. We define a measure for locality that is based on Denning's working set model and express the performance of well known algorithms in terms of this parameter. This explicitly introduces parameterized-style analysis to online algorithms. The idea is that rather than normalizing the performance of an online algorithm by an (optimal) offline algorithm, we explicitly express the behavior of the algorithm in terms of two more natural parameters: the size of the cache and Denning's working set measure. This technique creates a performance hierarchy of paging algorithms which better reflects their experimentally observed relative strengths. It also reflects the intuition that a larger cache leads to a better performance. We also apply the parameterized analysis framework to list update and show that certain randomized algorithms which are superior to MTF in the classical model are not so in the parameterized case, which matches experimental results. [ABSTRACT FROM AUTHOR]
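The working-set flavor of such a locality measure can be computed in one pass, as sketched below: slide a window of length tau over the request sequence and average the number of distinct pages per window (lower means stronger locality). The exact normalization in the paper may differ; this shows the Denning-style idea only.

```python
# Average working-set size of a request sequence over windows of length tau.
from collections import Counter

def avg_working_set_size(requests, tau):
    window = Counter(requests[:tau])
    total, samples = len(window), 1
    for i in range(tau, len(requests)):
        window[requests[i]] += 1             # page entering the window
        old = requests[i - tau]
        window[old] -= 1                     # page leaving the window
        if window[old] == 0:
            del window[old]
        total += len(window)                 # distinct pages in this window
        samples += 1
    return total / samples

# A loop over two pages, then a scan: 2.5 distinct pages per window.
print(avg_working_set_size([1, 2, 1, 2, 1, 2, 3, 4, 5, 6], tau=3))
```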
- Published
- 2015
- Full Text
- View/download PDF
49. Popularity and correlation aware data replication strategy based on half‐life concept and clustering in cloud system
- Author
-
Maawya Chellouf and Tarek Hamrouni
- Subjects
Computer Networks and Communications ,Computer science ,Cloud systems ,business.industry ,Cloud computing ,Machine learning ,computer.software_genre ,Popularity ,Computer Science Applications ,Theoretical Computer Science ,Correlation ,Computational Theory and Mathematics ,CloudSim ,Locality of reference ,Artificial intelligence ,business ,Cluster analysis ,computer ,Software - Published
- 2020
- Full Text
- View/download PDF
50. Kakkot List- An Improved Variant of Skip List
- Author
-
Rajesh R
- Subjects
Information retrieval ,Skip list ,Computer science ,Locality of reference ,Linked list ,Cache ,Binary logarithm ,Data structure ,GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries) ,Time complexity ,Sequence (medicine) - Abstract
Kakkot list is a new data structure for quick searching in a well-ordered sequence, similar to a skip list. The ordered sequence is built using a linked-list data structure, and the maximum number of levels is limited to log n in all input cases. Each level holds at most half the items of the level below, which guarantees fast searching in the list. The basic difference between the Kakkot list and the skip list lies in when levels are created and when an item is promoted to higher levels. In a skip list, levels are created and items are added to each level when an item is inserted, whereas in a Kakkot list this is done when an item is searched. This modification has a drastic impact on the searching time complexity of the Kakkot list. Another issue with skip lists is that they are not cache friendly and do not optimize locality of reference; the Kakkot list addresses this problem as well.
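For contrast, the level-halving search that both structures rely on can be written over a plain sorted array: level k skips 2^k elements, so a search probes O(log n) items. How the Kakkot list builds its levels lazily at search time is not reproduced here.

```python
# Level-halving search: start at the sparsest level and drop one level
# (halving the step) whenever the next probe would overshoot.

def level_search(sorted_items, target):
    n = len(sorted_items)
    pos, step = -1, 1
    while step * 2 <= n:
        step *= 2                            # start at the sparsest level
    while step >= 1:
        while pos + step < n and sorted_items[pos + step] <= target:
            pos += step                      # advance on the current level
        step //= 2                           # drop to a denser level
    return pos if pos >= 0 and sorted_items[pos] == target else None

items = [2, 3, 5, 7, 11, 13, 17, 19]
print(level_search(items, 13))               # 5 (index of 13)
```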
- Published
- 2020
- Full Text
- View/download PDF