Author: "Hill, Mark D." / Journal: indrastra global - Searchworks@Jio Institute Digital Library Search Results

1. Programming Heterogeneous Computers and Improving Inter-Node Communication Across Xeon Phis

Author: Feilbach, Chris, Sperling, Adam, Sifakis, Eftychios, and Hill, Mark D.
Subjects: Bandwidth, Xeon Phi, Scientific Workloads, Accelerators, SCIF, PCIE
Abstract: Scientific computing workloads are well suited to parallel accelerators such as GPGPUs and the Intel Xeon Phi. While these accelerators can provide greater performance than traditional CPUs due to their parallel architectures and greater memory bandwidth, their maximum workload size is limited by relatively small memory capacity. To solve this problem, data can be split across multiple accelerators to utilize the combined memory capacity as well as increased compute capability. Combining multiple accelerators into heterogeneous systems introduces a new bottleneck. Communication bandwidth between accelerators over the PCIe interconnect is much slower than internal memory bandwidth. This project examines the inter-node bandwidth bottleneck using the Intel Xeon Phi in the context of scientific applications. We show the limitations of traditional MPI programming paradigms, and leverage Intel's Xeon Phi-specific SCIF communication API to achieve increased inter-node memory bandwidth. While small messages still incur communication overhead penalties, messages larger than 512KB are able to saturate the PCIe bus and achieve bandwidth utilization close to 90% of the theoretical maximum. This project also attempts to address the complexities of programming systems of multiple accelerators. We introduce a software interface wrapper over SCIF that coalesces groups of small messages into larger ones. This new interface eases the programming experience and provides greater interconnect bandwidth from coalescing.
Published: 2016
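
Illustration (not from the paper): the coalescing idea in the abstract above can be sketched as a small buffer that collects short messages and hands them to the transport as one larger payload. The class name `CoalescingSender`, the `send` callback, and the default threshold (taken from the 512KB figure in the abstract) are all assumptions made for this sketch; no real SCIF calls are used, and message framing for the receiver is omitted.

```python
from typing import Callable, List


class CoalescingSender:
    """Buffers small messages and passes them on as one larger payload."""

    def __init__(self, send: Callable[[bytes], None], threshold: int = 512 * 1024):
        self._send = send            # hypothetical transport callback, not a SCIF API
        self._threshold = threshold  # flush once this many bytes are buffered
        self._parts: List[bytes] = []
        self._size = 0

    def post(self, message: bytes) -> None:
        """Queue a message; flush automatically when the threshold is reached."""
        self._parts.append(message)
        self._size += len(message)
        if self._size >= self._threshold:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far as a single payload."""
        if self._parts:
            self._send(b"".join(self._parts))
            self._parts.clear()
            self._size = 0
```

A caller would `post` each small message and call `flush` at synchronization points so that no data is left waiting in the buffer.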

2. Probabilistic Directed Writebacks for Exclusive Caches

Author: Olson, Lena E. and Hill, Mark D.
Subjects: computer architecture, energy efficiency
Abstract: Energy is an increasingly important consideration in memory system design. Although caches can save energy in several ways, such as by decreasing execution time and reducing the number of main memory accesses, they also suffer from known inefficiencies: the last-level cache (LLC) tends to have a high miss ratio while simultaneously storing many blocks that are never referenced after being written back to LLC. These blocks contribute to dynamic energy while simultaneously causing cache pollution. Because these blocks are not referenced before they are evicted, we can write them directly to memory rather than to the LLC. To do so, we must predict which blocks will not be referenced. Previous approaches rely on additional state at the LLC and/or extra communication. We show that by predicting working set size per program counter (PC), we can decide which blocks have low probability of being referenced. Our approach makes the prediction based solely on the address stream as seen by the level-one data cache (L1D) and thus avoids storing or communicating PC values between levels of the cache hierarchy. We require no modifications to the LLC. We adapt Flajolet and Martin's probabilistic counting to keep the state small: two additional bits per L1D block, with an additional 6KB prediction table. This approach yields a large reduction in number of LLC writebacks: 25% fewer for SPEC on average, 80% fewer for graph500, and 67% fewer for an in-memory hash table.
Published: 2016
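
Illustration (not from the paper): the abstract above builds on Flajolet and Martin's probabilistic counting, which estimates the number of distinct items in a stream using only a small bitmap. The sketch below shows the general software technique with a single bitmap; it is not the paper's hardware design. The class name `FMSketch`, the use of SHA-256 as the hash, and the 32-bit hash width are assumptions chosen for this example.

```python
import hashlib

PHI = 0.77351  # correction factor from Flajolet and Martin's analysis


class FMSketch:
    """Single-bitmap probabilistic counter for distinct items (illustrative)."""

    def __init__(self) -> None:
        self.bitmap = 0

    def add(self, item: str) -> None:
        digest = hashlib.sha256(item.encode()).digest()
        h = int.from_bytes(digest[:4], "big")  # treat the first 4 bytes as a 32-bit hash
        # Index of the least-significant 1 bit; use 31 if the hash is all zeros.
        rho = (h & -h).bit_length() - 1 if h else 31
        self.bitmap |= 1 << rho

    def estimate(self) -> float:
        # R is the index of the lowest bit that is still zero in the bitmap.
        r = 0
        while self.bitmap & (1 << r):
            r += 1
        return (2 ** r) / PHI
```

A single bitmap gives a coarse estimate with high variance; software uses of this technique usually average several bitmaps. Re-adding an item never changes the state, which is what keeps the per-counter storage small.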

3. Revisiting Stack Caches for Energy Efficiency

Author: Olson, Lena, Eckert, Yasuko, Manne, Srilatha, and Hill, Mark D.
Subjects: computer architecture, energy efficiency
Abstract: With the growing focus on energy efficiency, it is important to find ways to reduce energy without sacrificing performance. The L1 data cache is a significant contributor to processor energy consumption. We advocate treating data from the program's stack differently from non-stack data to reduce energy. We characterize stack accesses to determine how they differ from general memory accesses in terms of footprint, frequency, and ratio of loads to stores. We then propose two ways to optimize for these characteristics. First, the implicit stack cache limits stack data to residing in designated ways of the data cache, reducing the energy required per stack access. We show that it can reduce data cache dynamic energy by 37% with no reduction in performance. Second, the explicit stack cache stores stack data in a separate L1 cache. In addition to reducing the energy per access, it also has additional benefits over the implicit policy in that it can be virtually tagged and have a different writeback policy. We show that this approach can lead to additional energy savings, with no performance impact. These optimizations are implemented purely in the hardware and thus require no changes to existing code.
Published: 2014

4. OS Support for Virtualizing Hardware Transactional Memory

Author: Swift, Michael M., Volos, Haris, Goyal, Neelam, Yen, Luke, Hill, Mark D., and Wood, David A.
Abstract: Transactional memory promises to simplify multithreaded programming. Hardware TM (HTM) implementations promise better performance by augmenting processors with transactional state. However, HTMs interact poorly with the operating system or virtual machine monitor. For example, they often do not tolerate OS actions that virtualize processors and memory, such as context switching and paging. Without support for these actions, an HTM may not execute programs correctly or guarantee forward progress. We investigate virtualizing transactional memory in the context of LogTM-SE. First, we describe an implementation of a kernel module in OpenSolaris that implements transactional virtualization and requires only 1120 lines of code. Second, we find that LogTM-SE interacts poorly with virtual machine monitors due to a reliance on physical addresses. We propose an extension to LogTM-SE, called LogTM-VSE, that addresses these problems and improves context-switching performance. Third, through application tracing on real hardware and full system simulation, we show virtualizing transactions can be necessary for system stability and to support code that voluntarily context switches. However, we find that aborting a transaction is generally faster than virtualizing it, and hence preferable in some cases.
Published: 2008

5. A Case for Deconstructing Hardware Transactional Memory Systems

Author: Hill, Mark D., Hower, Derek, Moore, Kevin E., Swift, Michael M., Volos, Haris, and Wood, David A.
Abstract: Major hardware and software vendors are curious about transactional memory (TM), but are understandably cautious about committing to hardware changes. Our thesis is that deconstructing transactional memory into separate, interchangeable components facilitates TM adoption in two ways. First, it aids hardware TM refinement, allowing vendors to adopt TM earlier, knowing that they can more easily refine aspects later. Second, it enables the components to be applied to other uses, including reliability, security, performance, and correctness, providing value even if TM is not widely used. We develop some evidence for our thesis via experience with LogTM variants and preliminary case studies of scalable watch-points and race recording for deterministic replay.
Published: 2007

6. Amdahl's Law in the Multicore Era

Author: Hill, Mark D. and Marty, Michael R.
Abstract: We apply Amdahl's Law to multicore chips using symmetric cores, asymmetric cores, and dynamic techniques that allow cores to work together on sequential execution. To Amdahl's simple software model, we add a simple hardware model based on fixed chip resources. A key result we find is that, even as we enter the multicore era, researchers should still seek methods of speeding sequential execution. Moreover, methods that appear locally inefficient (e.g., tripling sequential performance with a 9x resource cost) can still be globally efficient as they reduce the sequential phase when the rest of the chip's resources are idle.
Published: 2007
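
Illustration: the kind of model described in the abstract above can be sketched in a few lines. Here `f` is the parallelizable fraction of the work, `n` is the chip budget in base-core units, and `r` is the number of units spent on one larger core. The square-root performance function `perf` is an assumption of this sketch, chosen because it matches the abstract's example of tripling sequential performance at a 9x resource cost; the function names are likewise illustrative.

```python
import math


def perf(r: float) -> float:
    """Assumed sequential performance of a core built from r base-core units."""
    return math.sqrt(r)


def speedup_symmetric(f: float, n: float, r: float) -> float:
    """n / r identical cores, each built from r units."""
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))


def speedup_asymmetric(f: float, n: float, r: float) -> float:
    """One large core of r units plus (n - r) single-unit cores."""
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))


def speedup_dynamic(f: float, n: float, r: float) -> float:
    """Units act as one r-unit core for sequential code and as n cores for parallel code."""
    return 1.0 / ((1 - f) / perf(r) + f / n)
```

With `r = 1`, the symmetric and asymmetric forms both reduce to the classic Amdahl expression `1 / ((1 - f) + f / n)`, which is a quick sanity check on the sketch.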

7. Thread-Level Transactional Memory

Author: Moore, Kevin E., Hill, Mark D., and Wood, David A.
Abstract: This paper presents thread-level transactional memory (TTM), a memory system interface that separates the semantics of transactions (atomicity, consistency, and isolation) from the implementation. By making transactions a thread-level abstraction, TTM permits implementations using different combinations of high-level software, low-level software, and dedicated hardware. TTM tracks a transaction's read and write sets and creates a "before-image" log in the thread's virtual address space. We evaluate four TTM implementations (broadcast and directory coherence times two different transaction abort mechanisms) using full-system simulation. Like previous transactional memory systems, TTM implementations are competitive with or better than lock-based synchronization. TTM's ability to cache the before and after images both supports large transactions and enables low memory bandwidth on successful commits and fast rollback on aborts.
Published: 2005
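
Illustration (not from the paper): the "before-image" log mentioned in the abstract above can be pictured as an undo log. Each write first records the old value; commit simply discards the log, and abort replays it in reverse. The class `UndoLogTransaction` and the dictionary standing in for memory are hypothetical simplifications; conflict detection and read/write-set tracking are left out.

```python
from typing import Dict, List, Optional, Tuple


class UndoLogTransaction:
    """Toy transaction that updates memory in place and keeps a before-image log."""

    def __init__(self, memory: Dict[int, int]) -> None:
        self.memory = memory  # stand-in for the thread's address space
        self.log: List[Tuple[int, Optional[int]]] = []  # (address, old value or None)

    def write(self, addr: int, value: int) -> None:
        # Record the before-image, then update memory in place.
        self.log.append((addr, self.memory.get(addr)))
        self.memory[addr] = value

    def commit(self) -> None:
        # New values are already in place, so commit only drops the log.
        self.log.clear()

    def abort(self) -> None:
        # Restore before-images in reverse order to undo every write.
        while self.log:
            addr, old = self.log.pop()
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
```

Because new values are written in place, the commit path does almost no work in this sketch, while abort cost grows with the number of logged writes.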

8. A Survey of User-Level Network Interfaces for System Area Networks

Author: Mukherjee, Shubhendu S and Hill, Mark D
Published: 1997

9. Cost Effective Parallel Computing

Author: Wood, David A and Hill, Mark D
Published: 1994

10. Application-Specific Protocols for User-Level Shared-Memory

Author: Falsafi, Babak, Lebeck, Alvin R, Reinhardt, Steven K, Schoinas, Ioannis, Hill, Mark D, Larus, James R, Rogers, Anne, and Wood, David A
Published: 1994

11. Solving Microstructure Electrostatics on a Proposed Parallel Computer

Author: Traenkle, Frank, Hill, Mark D, and Kim, Sangtae
Published: 1994

12. Sufficient System Requirements for Supporting the PLpc Memory Model

Author: Adve, Sarita V, Gharachorloo, Kourosh, Gupta, Anoop, Hennessy, John L, and Hill, Mark D
Published: 1993

13. Specifying System Requirements for Memory Consistency Models

Author: Gharachorloo, Kourosh, Adve, Sarita V, Gupta, Anoop, Hennessy, John L, and Hill, Mark D
Published: 1993

14. Sufficient Conditions for Implementing the Data-Race-Free-1 Memory Model

Author: Adve, Sarita V and Hill, Mark D
Published: 1992

15. A Unified Formalization of Four Shared-Memory Models

Author: Adve, Sarita V and Hill, Mark D
Published: 1991

16. Performance Implications of Tolerating Cache Faults

Author: Pour, Farid and Hill, Mark D
Published: 1991

17. A Model for Estimating Trace-Sample Miss Ratios

Author: Wood, David A, Hill, Mark D, and Kessler, Richard E
Published: 1991

18. A Comparison of Trace-Sampling Techniques for Multi-Megabyte Caches

Author: Kessler, Richard E, Hill, Mark D, and Wood, David A
Published: 1991

19. Cache Performance of the SPEC Benchmark Suite

Author: Gee, Jeffrey D, Hill, Mark D, Pnevmatikatos, Dionisios N, and Smith, Alan Jay
Published: 1991

20. Miss Reduction in Large, Real-Indexed Caches

Author: Kessler, Richard E and Hill, Mark D
Published: 1990

21. A Case for Direct-Mapped Caches

Author: Hill, Mark D
Published: 1988

22. Evaluating Associativity in CPU Caches

Author: Hill, Mark D and Smith, Alan J
Abstract: Because of the infeasibility or expense of large fully-associative caches, cache memories are often designed to be set-associative or direct-mapped. This paper (1) presents new and efficient algorithms for simulating alternative direct-mapped and set-associative caches, and (2) uses those algorithms to quantify the effect of limited associativity on the cache miss ratio. We introduce a new algorithm, forest simulation, for simulating alternative direct-mapped caches and generalize one, which we call all-associativity simulation, for simulating alternative direct-mapped, set-associative and fully-associative caches. We find that while all-associativity simulation is theoretically less efficient than forest simulation or stack simulation (a commonly-used simulation algorithm), in practice, it is not much slower and allows the simulation of many more caches with a single pass through an address trace. We also provide data and insight into how varying associativity affects the miss ratio. We show: (1) how to use the simulations of alternative caches to isolate the cause of misses; (2) that the principal reason why set-associative miss ratios are larger than fully-associative ones is (as one might expect) that too many active blocks map to a fraction of the sets even when blocks map to sets in a uniform random manner; and (3) that reducing associativity from eight-way to four-way, from four-way to two-way, and from two-way to direct-mapped causes relative miss ratio increases in our data of about 5, 10, and 30 percent respectively, regardless of cache size.
Published: 1989
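
Illustration (not from the paper): the effect of associativity on miss ratio can be observed with a plain trace-driven simulator. The sketch below simulates one cache configuration per pass with LRU replacement; it is deliberately not forest simulation or all-associativity simulation, which evaluate many configurations in a single pass. The function name `miss_ratio` and its parameters are assumptions made for this example.

```python
from collections import OrderedDict
from typing import Iterable


def miss_ratio(trace: Iterable[int], num_sets: int, ways: int, block_size: int = 1) -> float:
    """Simulate one set-associative LRU cache over an address trace."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = total = 0
    for addr in trace:
        block = addr // block_size
        lru = sets[block % num_sets]  # set chosen by simple modulo indexing
        total += 1
        if block in lru:
            lru.move_to_end(block)    # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least recently used block
            lru[block] = True
    return misses / total if total else 0.0
```

For a fixed capacity, calling this with `ways=1` and then with larger `ways` (and proportionally fewer sets) shows how conflict misses shrink as associativity grows.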


22 results on '"Hill, Mark D."'
