Descriptor: "Locality of reference" / Journal: journal of parallel and distributed computing - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Locality of reference"' showing total 9 results


1. Scalable training of 3D convolutional networks on multi- and many-cores

Author: H. Sebastian Seung, Aleksandar Zlateski, and Kisuk Lee
Subjects: Speedup, Computer Networks and Communications, Computer science, Node (networking), Parallel algorithm, Dynamic priority scheduling, Parallel computing, Convolutional neural network, Theoretical Computer Science, Artificial Intelligence, Hardware and Architecture, Scalability, Locality of reference, Cache, Software, Xeon Phi
Abstract: Convolutional networks (ConvNets) have become a popular approach to computer vision. Here we consider the parallelization of ConvNet training, which is computationally costly. Our novel parallel algorithm is based on decomposition into a set of tasks, most of which are convolutions or FFTs. Theoretical analysis suggests that linear speedup with the number of processors is attainable. To attain such performance on real shared-memory machines, our algorithm computes convolutions converging on the same node of the network with temporal locality to reduce cache misses, and sums the convergent convolution outputs via an almost wait-free concurrent method to reduce time spent in critical sections. Benchmarking with multi-core CPUs shows speedup roughly equal to the number of physical cores. We also demonstrate 90x speedup on a many-core CPU (Xeon Phi Knights Corner). Our algorithm can be either faster or slower than certain GPU implementations depending on specifics of the network architecture, kernel sizes, and density and size of the output patch.
Published: 2017

2. Hint-based cache design for reducing miss penalty in HBS packet classification algorithm

Author: Fang-Chen Kuo and Yeim-Kuan Chang
Subjects: Computer Networks and Communications, Computer science, Network packet, Cache coloring, Network processor, Locality, Parallel computing, Cache pollution, Cache-oblivious algorithm, Theoretical Computer Science, Smart Cache, Artificial Intelligence, Hardware and Architecture, Cache invalidation, Locality of reference, Page cache, Cache, Cache algorithms, Algorithm, Software
Abstract: In this paper, we implement some notable hierarchical or decision-tree-based packet classification algorithms such as extended grid of tries (EGT), hierarchical intelligent cuttings (HiCuts), HyperCuts, and hierarchical binary search (HBS) on an IXP2400 network processor. By using all six of the available processing microengines (MEs), we find that none of these existing packet classification algorithms achieve the line speed of OC-48 provided by IXP2400. To improve the search speed of these packet classification algorithms, we propose the use of software cache designs to take advantage of the temporal locality of the packets because IXP network processors have no built-in caches for fast path processing in MEs. Furthermore, we propose hint-based cache designs to reduce the search duration of the packet classification data structure when cache misses occur. Both the header and prefix caches are studied. Although the proposed cache schemes are designed for all the dimension-by-dimension packet classification schemes, they are, nonetheless, the most suitable for HBS. Our performance simulations show that the HBS enhanced with the proposed cache schemes performs the best in terms of classification speed and number of memory accesses when the memory requirement is in the same range as those of HiCuts and HyperCuts. Based on the experiments with all the high and low locality packet traces, five MEs are sufficient for the proposed rule cache with hints to achieve the line speed of OC-48 provided by IXP2400.
Published: 2013
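
The header-cache idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' IXP2400 implementation: the `HeaderCache` class, the two-field header, the rule table, and `linear_classify` are all hypothetical stand-ins, and the hint mechanism is not modeled. The sketch only shows how an LRU software cache lets repeated headers (temporal locality) skip the full classification search.

```python
from collections import OrderedDict


class HeaderCache:
    """Small LRU cache mapping a packet header tuple to its classification result."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def classify(self, header, slow_classify):
        # Hit: reuse the stored result and mark the entry as most recently used.
        if header in self.entries:
            self.entries.move_to_end(header)
            self.hits += 1
            return self.entries[header]
        # Miss: run the full classifier, then cache the result.
        self.misses += 1
        result = slow_classify(header)
        self.entries[header] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return result


def linear_classify(header):
    """Hypothetical stand-in for the full classifier: first matching rule wins."""
    src, dst = header
    rules = [(10, 20, "allow"), (10, None, "log")]  # None acts as a wildcard
    for rule_src, rule_dst, action in rules:
        if src == rule_src and (rule_dst is None or dst == rule_dst):
            return action
    return "drop"


cache = HeaderCache(capacity=2)
trace = [(10, 20), (10, 20), (10, 30), (10, 20), (1, 2), (10, 30)]
results = [cache.classify(h, linear_classify) for h in trace]
```

A trace with more repetition raises the hit count; a trace of all-distinct headers gains nothing from the cache, which is why the abstract distinguishes high- and low-locality packet traces.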

3. Predicting locality phases for dynamic memory optimization

Author: Xipeng Shen, Yutao Zhong, and Chen Ding
Subjects: Dynamic random-access memory, Computer Networks and Communications, Computer science, CPU cache, Locality, Parallel computing, Theoretical Computer Science, Dynamic programming, Artificial Intelligence, Hardware and Architecture, Locality of reference, Cache, Software, TRACE (psycholinguistics)
Abstract: Dynamic data, cache, and memory adaptation can significantly improve program performance when they are applied on long continuous phases of execution that have dynamic but predictable locality. To support phase-based adaptation, this paper defines the concept of locality phases and describes a four-component analysis technique. Locality-based phase detection uses locality analysis and signal processing techniques to identify phases from the data access trace of a program; frequency-based phase marking inserts code markers that mark phases in all executions of the program; phase hierarchy construction identifies the structure of multiple phases; and phase-sequence prediction predicts the phase sequence from program input parameters. The paper shows the accuracy and the granularity of phase and phase-sequence prediction as well as its uses in dynamic data packing, memory remapping, and cache resizing.
Published: 2007
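
As a rough illustration of locality analysis on a data access trace, the sketch below computes reuse distances. The abstract above does not say which locality measure the analysis uses, so reuse distance is an assumption here, chosen as one common metric; the function name and the quadratic-time approach are illustrative only.

```python
def reuse_distances(trace):
    """For each access, return the number of distinct addresses touched since
    the previous access to the same address (None for a first access)."""
    last_seen = {}   # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses strictly between the two accesses to addr.
            between = trace[last_seen[addr] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


trace = ["a", "b", "c", "a", "b", "b"]
distances = reuse_distances(trace)
```

Short distances indicate data that would still be cached on reuse; a sustained shift in the distance pattern along the trace is the kind of signal a phase detector could look for.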

4. Translating submachine locality into locality of reference

Author: Geppino Pucci, Carlo Fantozzi, and Andrea Pietracaprina
Subjects: Random access memory, Theoretical computer science, Computer Networks and Communications, Computer science, Locality, Parallel algorithm, Task parallelism, Parallel computing, Theoretical Computer Science, Temporal database, Parallel processing (DSP implementation), Artificial Intelligence, Hardware and Architecture, Locality of reference, Concurrent computing, Memory model, Software
Abstract: Summary form only given. The design of algorithms exhibiting a high degree of temporal and spatial locality of reference is crucial to attain good performance on current and foreseeable computing systems featuring ever deeper memory hierarchies. Previous work has demonstrated that task parallelism can be efficiently transformed into locality of reference in two-level hierarchies. Recently, we moved a step forward and showed how the more structured type of parallelism exposed by submachine locality can be efficiently turned into temporal locality on arbitrarily deep hierarchies. We complete and extend the above result by also encompassing spatial locality. Specifically, we present a scheme to simulate parallel algorithms designed for the decomposable BSP (a BSP variant which captures submachine locality) on the hierarchical memory model with block transfer. The simulation yields good hierarchy-conscious sequential algorithms from parallel ones, and provides evidence of the strict relation between submachine locality in parallel computation and locality of reference (both temporal and spatial) in the hierarchical memory setting.
Published: 2006

5. Improving whole-program locality using intra-procedural and inter-procedural transformations

Author: Mahmut Kandemir
Subjects: Computer Networks and Communications, CPU cache, Computer science, Locality, Optimizing compiler, Memory organisation, Parallel computing, Theoretical Computer Science, Loop fission, Artificial Intelligence, Hardware and Architecture, Loop nest optimization, Benchmark (computing), Locality of reference, Loop interchange, Nested loop join, Software
Abstract: Exploiting spatial and temporal locality is essential for obtaining high performance on modern computers. Writing programs that exhibit high locality of reference is difficult and error-prone. Compiler researchers have developed loop transformations that allow the conversion of programs to exploit locality. Recently, transformations that change the memory layouts of multi-dimensional arrays, called data transformations, have been proposed. Unfortunately, both data and loop transformations have some important drawbacks. In this work, we present an integrated framework that uses loop and data transformations in concert to exploit the benefits of both approaches while minimizing the impact of their disadvantages. Our approach works inter-procedurally on acyclic call graphs, uses profile data to eliminate layout conflicts, and is unique in its capability of resolving conflicting layout requirements of different references to the same array in the same nest and in different nests for regular array-based applications. The optimization technique presented in this paper has been implemented in a source-to-source translator. We evaluate its performance using standard benchmark suites and several math libraries (complete programs) with large input sizes. Our experimental results show that the proposed approach improves the performance of the applications optimized by using the current state-of-the-art techniques by 8.2% on average. This improvement comes from three important characteristics of the technique, namely, resolving layout conflicts between references to the same array in a loop nest, determining a suitable order to propagate layout modifications across loop nests, and propagating layouts between different procedures in the program, all in a unified framework. The locality optimization technique presented in this paper tries to exploit locality in the innermost loop positions. This strategy, in most cases, generates dependence-free outer loops, which can be safely parallelized.
Published: 2005
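
To see why the loop order matters for an array with a fixed memory layout, the toy model below counts cache-line fetches for two traversal orders of a row-major array. It is a sketch under simplifying assumptions (a cache that holds a single line, a hypothetical `count_line_fetches` helper) and is not the paper's framework; it only illustrates the effect that loop interchange, listed among the record's subjects, is meant to achieve.

```python
def count_line_fetches(n_rows, n_cols, line_size, row_outer):
    """Count line fetches when visiting every element of a row-major
    n_rows x n_cols array. The model cache holds one line, so a fetch
    happens whenever an access falls outside the current line."""
    fetches = 0
    current_line = None
    outer, inner = (n_rows, n_cols) if row_outer else (n_cols, n_rows)
    for a in range(outer):
        for b in range(inner):
            # Map loop indices to (row, column) for the chosen loop order.
            i, j = (a, b) if row_outer else (b, a)
            line = (i * n_cols + j) // line_size  # row-major address -> line
            if line != current_line:
                fetches += 1
                current_line = line
    return fetches


row_order = count_line_fetches(8, 8, 8, row_outer=True)   # inner loop walks a row
col_order = count_line_fetches(8, 8, 8, row_outer=False)  # inner loop walks a column
```

In this model the row-order traversal fetches each line once, while the column-order traversal fetches a new line on every access. Interchanging the loops, or equivalently switching the array to a column-major layout, removes that penalty.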

6. Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed-Memory Machines

Author: Mahmut Kandemir, Alok Choudhary, and J. Ramanujam
Subjects: Computer Networks and Communications, Computer science, Distributed computing, Locality, Multiprocessing, Parallel computing, Theoretical Computer Science, Shared memory, Artificial Intelligence, Hardware and Architecture, Scalability, Vectorization (mathematics), Locality of reference, Distributed memory, Compiler, Software
Abstract: Distributed-memory message-passing machines deliver scalable performance but are difficult to program. Shared-memory machines, on the other hand, are easier to program but obtaining scalable performance with a large number of processors is difficult. Recently, scalable machines based on logically shared physically distributed memory have been designed and implemented. While some of the performance issues like parallelism and locality are common to different parallel architectures, issues such as data distribution are unique to specific architectures. One of the most important challenges compiler writers face is the design of compilation techniques that can work well on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests such that, where possible, the iterations of the outermost loops can be run in parallel across processors; (2) optimizes memory locality by carefully distributing each array across processors; (3) optimizes interprocessor communication using message vectorization whenever possible; and (4) optimizes cache locality by assigning appropriate storage layout for each array and by transforming the iteration space. Depending on the machine architecture, some or all of these steps can be applied in a unified framework. We report empirical results on an SGI Origin 2000 distributed-shared-memory multiprocessor and an IBM SP-2 distributed-memory message-passing machine to validate the effectiveness of our approach.
Published: 2000

7. Modeling Communication Locality in Multiprocessors

Author: C. Salisbury, Zhixiong Chen, and Rami Melhem
Subjects: Artificial Intelligence, Computer Networks and Communications, Hardware and Architecture, Computer science, Distributed computing, Locality, Working set, Parallel algorithm, Locality of reference, Latency (engineering), Software, Theoretical Computer Science
Abstract: Locality of reference is an important aspect of many computer operations. It is often exploited to optimize the performance of computer functions. In this paper, we apply the locality concept to the communication patterns of parallel programs operating over an interconnection network with a fixed communication latency between any pair of attached nodes. Unbuffered multistage networks and all-optical networks are examples of these. We quantify the notions of spatial and temporal locality in this context, and combine them in a locality measure. This measure is used as the basis for identifying the communication working sets of a parallel program. We focus on programs with a looping structure and investigate conditions under which each working set consists of the complete set of paths required by a single loop.
Published: 1999

8. Mobile-Process-Based Parallel Simulation

Author: Vernon Rego, Janche Sang, and Edward Mascarenhas
Subjects: Computer Networks and Communications, Process (engineering), Computer science, Distributed computing, Message passing, Object (computer science), Theoretical Computer Science, Transmission (telecommunications), Artificial Intelligence, Hardware and Architecture, Locality of reference, Distributed memory, Timestamp, Software system, Software
Abstract: Our focus is on the novel use of a process-oriented methodology in software systems for parallel simulation on distributed memory. To the best of our knowledge, the few existing systems which adopt a process view strictly use message passing to effect process interaction in distributed-memory settings. As a result, these systems avoid scenarios in which processes directly access passive but shared components. This can restrict the manner in which a system is modeled and hinder the phase of distributed model construction. In this paper, we propose an approach which utilizes mobile processes in distributed-memory simulation systems. The approach entails the migration of a requesting process with its timestamp to the remote site hosting the requested passive object. Major advantages of this approach include one-time transmission, fixed communication topology, and increased locality of reference. Empirical results based on lightweight processes show that the mobile process paradigm can be as efficient as the message-passing paradigm.
Published: 1996

9. Exploiting Data Structure Locality in the Dataflow Model

Author: A. P. Wim Böhm, Walid Najjar, and W. Marcus Miller
Subjects: Computer Networks and Communications, Computer science, Dataflow, Locality, Parallel computing, Data structure, Operand, Execution time, Theoretical Computer Science, Microarchitecture, Vector processor, Instruction set, Data flow diagram, Artificial Intelligence, Hardware and Architecture, Hybrid system, Systems architecture, Locality of reference, Compiler, Software, Dataflow architecture
Abstract: Although the dataflow model has been shown to allow the exploitation of parallelism at all levels, research of the past decade has revealed several fundamental problems. Synchronization at the instruction level, token matching, coloring, and re-labeling operations have a negative impact on performance by significantly increasing the number of non-compute "overhead" cycles. Recently, many novel hybrid von-Neumann data driven machines have been proposed to alleviate some of these problems. The major objective has been to reduce or eliminate unnecessary synchronization costs through simplified operand matching schemes and increased task granularity. Moreover, the results from recent studies quantifying locality suggest sufficient spatial and temporal locality is present in dataflow execution to merit its exploitation. In this paper we present a data structure for exploiting locality in a data driven environment: the vector cell. A vector cell consists of a number of fixed length chunks of data elements. Each chunk is tagged with a presence bit, providing intra-chunk strictness and inter-chunk non-strictness to data structure access. We describe the semantics of the model, processor architecture and instruction set as well as a Sisal to dataflow vectorizing compiler back-end. The vector cell model is evaluated by comparing its performance to those of both a classical fine-grain dataflow processor employing I-structures and a conventional pipelined vector processor. Results indicate that the model is surprisingly resilient to long memory and communication latencies and is able to dynamically exploit the underlying parallelism across multiple processing elements at run time.
Published: 1995
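
The vector cell described in the abstract above can be pictured with a small sketch. The class and method names are hypothetical, and raising `LookupError` on an absent chunk is a simplification: a dataflow machine would presumably defer such a read until the chunk arrives rather than fail. The sketch only shows the presence-bit idea, where a chunk becomes readable as a whole (strict within the chunk) while other chunks may still be missing (non-strict across chunks).

```python
class VectorCell:
    """Toy vector cell: fixed-length chunks, each guarded by one presence bit."""

    def __init__(self, n_chunks, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [None] * n_chunks
        self.present = [False] * n_chunks  # one presence bit per chunk

    def write_chunk(self, chunk_index, values):
        # A chunk is written as a unit, so all of its elements arrive together.
        if len(values) != self.chunk_size:
            raise ValueError("a chunk must be written in full")
        self.chunks[chunk_index] = list(values)
        self.present[chunk_index] = True

    def read(self, index):
        chunk_index, offset = divmod(index, self.chunk_size)
        if not self.present[chunk_index]:
            # Simplification: real hardware would defer the read instead.
            raise LookupError("chunk not yet present")
        return self.chunks[chunk_index][offset]


cell = VectorCell(n_chunks=2, chunk_size=4)
cell.write_chunk(1, [10, 11, 12, 13])
value = cell.read(5)  # element 5 lives in chunk 1, which is already present
```

Reading any element of chunk 0 at this point would raise, even though chunk 1 is fully usable, which is the inter-chunk non-strictness the abstract refers to.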
