Author: "Hiroyuki Takizawa" / Topic: parallel computing - Searchworks@Jio Institute Digital Library Search Results

1. Evaluating I/O Acceleration Mechanisms of SX-Aurora TSUBASA

Author: Hiroyuki Takizawa, Yuta Sasaki, Ayumu Ishizuka, and Mulya Agung
Subjects: Acceleration, Parallel processing (DSP implementation), Computer science, Symmetric multiprocessor system, System configuration, Parallel computing, Data transmission
Abstract: In a heterogeneous computing system, different kinds of processors might need to be involved in the execution of a file I/O operation. Since NEC SX-Aurora TSUBASA is one such system, two I/O acceleration mechanisms are offered to reduce the data transfer overheads among the processors for a file I/O operation. This paper first investigates the effects of the two mechanisms on the I/O performance of SX-Aurora TSUBASA. Considering the results, proper use of the two mechanisms is discussed via a real-world application of flood damage estimation. These results clearly demonstrate the demand for auto-tuning, i.e., adaptively selecting either of the two mechanisms while considering application behavior and system configuration.
Published: 2021
Full Text: View/download PDF

2. Potential of a modern vector supercomputer for practical applications: performance evaluation of SX-ACE

Author: Ryusuke Egawa, Shintaro Momose, Kazuhiko Komatsu, Yoko Isobe, Hiroyuki Takizawa, Akihiro Musa, and Hiroaki Kobayashi
Subjects: Scalar processor, ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION, Computer science, Memory bandwidth, Parallel computing, Software_PROGRAMMINGTECHNIQUES, 010502 geochemistry & geophysics, Supercomputer, Memory performance, 01 natural sciences, 010305 fluids & plasmas, Theoretical Computer Science, Vector processor, Computer architecture, Hardware_GENERAL, Hardware and Architecture, 0103 physical sciences, Software, 0105 earth and related environmental sciences, Information Systems
Abstract: Achieving a high sustained simulation performance is the most important concern in the HPC community. To this end, many kinds of HPC system architectures have been proposed, and the diversity of the HPC systems grows rapidly. Under this circumstance, a vector-parallel supercomputer SX-ACE has been designed to achieve a high sustained performance of memory-intensive applications by providing a high memory bandwidth commensurate with its high computational capability. This paper examines the potential of the modern vector-parallel supercomputer through the performance evaluation of SX-ACE using practical engineering and scientific applications. To improve the sustained simulation performances of practical applications, SX-ACE adopts an advanced memory subsystem with several new architectural features. This paper discusses how these features, such as MSHR, a large on-chip memory, and novel vector processing mechanisms, are beneficial to achieve a high sustained performance for large-scale engineering and scientific simulations. Evaluation results clearly indicate that the high sustained memory performance per core enables the modern vector supercomputer to achieve outstanding performances that are unreachable by simply increasing the number of fine-grain scalar processor cores. This paper also discusses the performance of the HPCG benchmark to evaluate the potentials of supercomputers with balanced memory and computational performance against heterogeneous and cutting-edge scalar parallel systems.
Published: 2017
Full Text: View/download PDF

3. Toward Dynamic Load Balancing across OpenMP Thread Teams for Irregular Workloads

Author: Shoichi Hirasawa, Xiong Xiao, Hiroyuki Takizawa, and Hiroaki Kobayashi
Subjects: 010302 applied physics, Coprocessor, Computer science, Distributed computing, General Engineering, 020207 software engineering, 02 engineering and technology, Parallel computing, Thread (computing), ComputerSystemsOrganization_PROCESSORARCHITECTURES, Load balancing (computing), Supercomputer, 01 natural sciences, Scheduling (computing), 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Xeon Phi, De facto standard
Abstract: In the field of high performance computing, massively-parallel many-core processors such as Intel Xeon Phi coprocessors are becoming popular because they can significantly accelerate various applications. In order to efficiently parallelize applications for such many-core processors, several high-level programming models have been proposed. The de facto standard programming model mainly for shared-memory parallel processing is OpenMP. For hierarchical parallel processing, OpenMP version 4.0 or later allows programmers to create multiple thread teams. Each thread team contains a bunch of newly-created synchronizable threads. When multiple thread teams are used to execute an application, it is important to have dynamic load balancing across thread teams, since static load balancing easily encounters load imbalance across teams, and thus degrades performance. In this paper, we first motivate our work by clarifying the benefit of using multiple thread teams to execute an irregular workload on a many-core processor. Then, we demonstrate that dynamic load balancing across those thread teams has the potential to significantly improve the performance of irregular workloads on a many-core processor, even when the scheduling overhead is taken into account. Although such a dynamic load balancing mechanism has not been provided by the current OpenMP specification, the benefits of dynamic load balancing across thread teams are discussed through experiments using the Intel Xeon Phi coprocessor. We evaluate the performance gain of dynamic load balancing across thread teams using a ray tracing code. The results show that such a dynamic load balancing mechanism can improve the performance by up to 14% compared to static load balancing across teams, with the scheduling overhead taken into account.
Published: 2017
Full Text: View/download PDF
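
The difference between static and dynamic load balancing across teams, as described in the abstract of record 3, can be illustrated with a small simulation. This is only a sketch and not the authors' OpenMP implementation: the function names, the task costs, and the team count are hypothetical, and scheduling overhead is ignored.

```python
import heapq

def static_makespan(costs, teams):
    """Give each team one contiguous, equal-count block; return the slowest team's time."""
    n = len(costs)
    bounds = [n * t // teams for t in range(teams + 1)]
    return max(sum(costs[bounds[t]:bounds[t + 1]]) for t in range(teams))

def dynamic_makespan(costs, teams, chunk=1):
    """Each team pulls the next chunk from a shared queue as soon as it becomes idle."""
    finish = [0.0] * teams  # min-heap of the times at which each team becomes idle
    heapq.heapify(finish)
    for start in range(0, len(costs), chunk):
        idle_at = heapq.heappop(finish)  # the team that becomes idle first
        heapq.heappush(finish, idle_at + sum(costs[start:start + chunk]))
    return max(finish)

# Hypothetical irregular workload: the expensive tasks are clustered at the end.
costs = [1] * 12 + [8] * 4
static_time = static_makespan(costs, 4)
dynamic_time = dynamic_makespan(costs, 4)
print(static_time, dynamic_time)
```

With this workload the static split leaves one team with all the expensive tasks, while the shared queue spreads them evenly. A real implementation would also pay a per-chunk scheduling cost, which is why the chunk size matters in practice.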

4. An Automatic MPI Process Mapping Method Considering Locality and Memory Congestion on NUMA Systems

Author: Hiroyuki Takizawa, Ryusuke Egawa, Mulya Agung, and Muhammad Alfian Amrizal
Subjects: Multi-core processor, Runtime system, Computer science, Scalability, Locality, Process (computing), Overhead (computing), Parallel computing, Performance improvement, Oracle
Abstract: MPI process mapping is an important step to achieve scalable performance on non-uniform memory access (NUMA) systems. Conventional approaches have focused only on improving the locality of communication. However, related studies have shown that on modern NUMA systems, the memory congestion problem could cause more severe performance degradation than the locality problem because a high number of processor cores in the systems can cause heavy congestion on shared caches and memory controllers. To optimize the process mapping, it is necessary to determine the communication behavior of the MPI processes. Previous methods rely on offline profiling to analyze the communication behavior, which incurs a high overhead and is potentially time-consuming. In this paper, we propose a method that automatically performs MPI process mapping for adapting to communication behaviors while considering both locality and memory congestion. Our method works at runtime during the execution of an MPI application. It does not require modifications to the application, previous knowledge of the communication behavior, or changes to the hardware and operating system. The proposed method has been evaluated with the NAS parallel benchmarks on a NUMA system. Experimental results show that our method can achieve performance close to an oracle-based mapping method with low overhead to the application execution. The performance improvement is up to 27.4% (13.4% on average) compared with the default mapping of the MPI runtime system.
Published: 2019
Full Text: View/download PDF

5. Scaling Performance for N-Body Stream Computation with a Ring of FPGAs

Author: Jens Huthmann, Kentaro Sano, Artur Podobas, Hiroyuki Takizawa, and Abiko Shin
Subjects: business.industry, Computation, Integrated circuit, Parallel computing, 030204 cardiovascular system & hematology, law.invention, 03 medical and health sciences, 0302 clinical medicine, law, Application domain, Scalability, Limit (music), Medicine, 030212 general & internal medicine, Field-programmable gate array, business, Scaling, Real-time operating system
Abstract: Field-Programmable Gate Arrays (FPGAs) offer a fairly non-invasive method to specialize custom architectures towards a specific application domain. Recent studies have successfully demonstrated that single-node FPGAs can be a rival to both CPUs and GPUs in performance. Unfortunately, most existing studies limit themselves to using a single FPGA device, and their scalability requires more investigation. In this work, we practically demonstrate how to scale the important n-body problem across a comparatively large FPGA cluster. Our design -- composed of up to 256 processing elements -- achieves near-linear strong scaling, with performance levels comparable to that of custom Application-Specific Integrated Circuits (ASICs). We further develop an analytical performance model, which we use to predict the performance of our solution on upcoming Intel Agilex systems. Today, our system reaches up to 47 Giga-Pairs/second, and using our performance model we predict that we can reach up to 0.142 Tera-Pairs/second peak performance with next-generation FPGAs.
Published: 2019
Full Text: View/download PDF

6. Performance Evaluation of Different Implementation Schemes of an Iterative Flow Solver on Modern Vector Machines

Author: Kenta Yamaguchi, Thorsten Reimann, Yoichi Shimomura, Kazuhiko Komatsu, Hiroyuki Takizawa, Takashi Soga, Hiroaki Kobayashi, Akihiro Musa, and Ryusuke Egawa
Subjects: Source code, Xeon, Computer Networks and Communications, Computer science, media_common.quotation_subject, Parallel computing, Solver, Computer Science Applications, Computational Theory and Mathematics, Hardware and Architecture, Code (cryptography), Image tracing, SIMD, Legacy code, Software, Xeon Phi, Information Systems, media_common
Abstract: Modern supercomputers consist of multi-core processors, and these processors have recently employed vector instructions, or so-called SIMD instructions, to improve performance. Numerical simulations need to be vectorized in order to achieve higher performance on these processors. Various legacy numerical simulation codes that have been utilized for a long time often contain two versions of source codes: a non-vectorized version and a vectorized version that is optimized for old vector supercomputers. It is important to clarify which version is better for modern supercomputers in order to achieve higher performance. In this paper, we evaluate the performance of a legacy fluid dynamics simulation code called FASTEST on modern supercomputers in order to provide a guidepost for migrating such codes to modern supercomputers. The solver has a non-vectorized version and a vectorized version, and the latter uses the hyperplane ordering method for vectorization. For the evaluation, we also implement the red-black ordering method, which is another way to vectorize the solver. Then, we examine the performance on NEC SX-ACE, SX-Aurora TSUBASA, Intel Xeon Gold, and Xeon Phi. The results show that the shortest execution times are with the red-black ordering method on SX-ACE and SX-Aurora TSUBASA, and with the non-vectorized version on Xeon Gold and Xeon Phi. Therefore, achieving a higher performance on multiple modern supercomputers potentially requires maintenance of multiple code versions. We also show that the red-black ordering method is more promising to achieve high performance on modern supercomputers.
Published: 2019
Full Text: View/download PDF
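
The hyperplane ordering mentioned in the abstract of record 6 can be sketched on a much simpler problem. The example below is an assumption-laden illustration, not the FASTEST solver: it uses a 2-D Gauss-Seidel sweep with a 5-point Laplace stencil, and both function names are hypothetical. Points on the same anti-diagonal `d = i + j` do not depend on each other, so the inner loop over one hyperplane could be vectorized while the result matches the ordinary row-major sweep.

```python
def lexicographic_sweep(u):
    """One Gauss-Seidel sweep of the 5-point Laplace stencil in row-major order."""
    n, m = len(u), len(u[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return u

def hyperplane_sweep(u):
    """The same sweep, visiting interior points one anti-diagonal (d = i + j) at a time."""
    n, m = len(u), len(u[0])
    for d in range(2, n + m - 3):
        # Points on hyperplane d read already-updated values from d - 1 and
        # old values from d + 1, never each other, so they are independent.
        for i in range(max(1, d - (m - 2)), min(n - 2, d - 1) + 1):
            j = d - i
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return u

# Hypothetical 5 x 6 test grid with arbitrary values.
grid = [[float((3 * i + 7 * j) % 5) for j in range(6)] for i in range(5)]
same_result = lexicographic_sweep([row[:] for row in grid]) == hyperplane_sweep([row[:] for row in grid])
print(same_result)
```

Because each point sees exactly the same neighbor values in both orderings, the two sweeps produce identical results; only the traversal order, and therefore the exposed parallelism, differs.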

7. Translation of Large-Scale Simulation Codes for an OpenACC Platform Using the Xevolver Framework

Author: Ryusuke Egawa, Ken'ichi Itakura, Shoichi Hirasawa, Hiroyuki Takizawa, Hiroaki Kobayashi, and Kazuhiko Komatsu
Subjects: Code Translation, Source code, Computer science, business.industry, media_common.quotation_subject, Maintainability, 020207 software engineering, 02 engineering and technology, Parallel computing, Directive, Translation (geometry), Software portability, Embedded system, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), 020201 artificial intelligence & image processing, Code generation, business, media_common
Abstract: As the diversity of high-performance computing (HPC) systems increases, even legacy HPC applications often need to use accelerators for higher performance. To migrate large-scale legacy HPC applications to modern HPC systems equipped with accelerators, a promising way is to use OpenACC because its directive-based approach can prevent drastic code modifications. This paper shows translation of a large-scale simulation code for an OpenACC platform while keeping the maintainability of the original code. Although OpenACC enables an application to use accelerators by adding a small number of directives, it requires modifying the original code to achieve a high performance in most cases, which tends to degrade the code maintainability and performance portability. To avoid such code modifications, this paper adopts a code translation framework, Xevolver. Instead of directly modifying a code, a pair of a custom code translation rule and a custom directive is defined, and is applied to the original code using the Xevolver framework. This paper first shows that simply inserting OpenACC directives does not lead to high performance and non-trivial code modifications are required in practice. In addition, the code modifications sometimes decrease the performance when migrating a code to other platforms, which leads to low performance portability. The direct code modifications can be avoided by using pairs of an externally-defined translation rule and a custom directive to keep the original code unchanged as much as possible. Finally, the performance evaluation shows that the performance portability can be improved by selectively applying translation with the Xevolver framework compared with directly modifying a code.
Published: 2016
Full Text: View/download PDF

8. A Light-Weight Rollback Mechanism for Testing Kernel Variants in Auto-Tuning

Author: Hiroaki Kobayashi, Shoichi Hirasawa, and Hiroyuki Takizawa
Subjects: Mechanism (engineering), Auto tuning, Artificial Intelligence, Hardware and Architecture, Computer science, Kernel (statistics), Computer Vision and Pattern Recognition, Parallel computing, Cache, Electrical and Electronic Engineering, Software, Rollback
Published: 2015
Full Text: View/download PDF

9. FLEXII: A Flexible Insertion Policy for Dynamic Cache Resizing Mechanisms

Author: Ryusuke Egawa, Masayuki Sato, Hiroaki Kobayashi, and Hiroyuki Takizawa
Subjects: business.industry, Computer science, Embedded system, Resizing, Cache, Energy consumption, Parallel computing, Electrical and Electronic Engineering, business, Electronic, Optical and Magnetic Materials
Published: 2015
Full Text: View/download PDF

10. Performance and Power Analysis of SX-ACE Using HP-X Benchmark Programs

Author: Hiroyuki Takizawa, Yoko Isobe, Kazuhiko Komatsu, Souya Fujimoto, Toshihiro Kato, Ryusuke Egawa, Hiroaki Kobayashi, and Akihiro Musa
Subjects: 020203 distributed computing, Computer science, 0211 other engineering and technologies, 021107 urban & regional planning, Memory bandwidth, 02 engineering and technology, Parallel computing, Supercomputer, Ranking (information retrieval), Vector processor, Power analysis, Memory management, 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), SIMD
Abstract: As the SIMD width of modern microprocessors has been widening to keep up with the computational demand for HPC systems, the vector architecture has recently come back into the spotlight. Besides, a modern vector architecture that has kept a large SIMD width and a high B/F ratio has survived and evolved in the HPC community. In this paper, to clarify the potential of the modern vector architecture, we present the performance and power analysis of a modern vector supercomputer SX-ACE using HP-X benchmark programs (HPL, HPCG, and HPGMG). Furthermore, the implementation and optimization of these benchmarks on SX-ACE are discussed. The evaluation results show that SX-ACE achieves the highest efficiencies in the HPGMG and HPCG ranking lists. These facts clearly indicate that the powerful vector processing mechanism with a high B/F ratio is mandatory to achieve a high sustained performance in future HPC systems.
Published: 2017
Full Text: View/download PDF

11. Vectorization-Aware Loop Optimization with User-Defined Code Transformations

Author: Ryusuke Egawa, Kazuhiko Komatsu, Akihiro Musa, Hiroyuki Takizawa, Hiroaki Kobayashi, Takashi Soga, and Thorsten Reimann
Subjects: 020203 distributed computing, Loop optimization, Source code, Dead code, Programming language, Computer science, Loop-invariant code motion, media_common.quotation_subject, Cyclomatic complexity, 010103 numerical & computational mathematics, 02 engineering and technology, Parallel computing, computer.software_genre, Code bloat, 01 natural sciences, Dead code elimination, Vector processor, Systematic code, 0202 electrical engineering, electronic engineering, information engineering, Code generation, Unreachable code, 0101 mathematics, Redundant code, computer, media_common
Abstract: The cost of maintaining an application code would significantly increase if the application code is branched into multiple versions, each of which is optimized for a different architecture. In this work, default and vector versions of a real-world application code are refactored to be a single version, and the differences between the versions are expressed as user-defined code transformations. As a result, application developers can maintain only the single version, and transform it to its vector version just before the compilation. Although code optimizations for a vector processor are sometimes different from those for other processors, application developers can enjoy the performance of the vector processor without increasing the code complexity. Evaluation results demonstrate that vectorization-aware loop optimization for a vector processor can be expressed as user-defined code transformation rules, and thereby significantly improve the performance of a vector processor without major code modifications.
Published: 2017
Full Text: View/download PDF

12. Optimizing Energy Consumption on HPC Systems with a Multi-Level Checkpointing Mechanism

Author: Muhammad Alfian Amrizal and Hiroyuki Takizawa
Subjects: File system, Iterative method, Computer science, Distributed computing, 020206 networking & telecommunications, 02 engineering and technology, Interval (mathematics), Parallel computing, Energy consumption, computer.software_genre, Convergence (routing), Data_FILES, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), 020201 artificial intelligence & image processing, computer, Energy (signal processing), Efficient energy use
Abstract: Coordinated checkpointing is a widely-used checkpoint/restart (CPR) technique for fault-tolerance in large-scale HPC systems. However, this CPR technique will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on multi-level checkpointing that allows the use of different kinds of fast but less reliable storage to reduce the checkpointing frequency to the parallel file system (PFS). This paper presents an energy model of multi-level checkpointing and proposes an iterative algorithm that minimizes energy consumption by optimizing the checkpoint interval of each level and selecting the best combination of checkpoint levels. It is confirmed that the algorithm is very fast and effective since it can reach convergence in a relatively small number of iteration steps. This paper also clarifies the fact that it is actually unnecessary to use all the available checkpoint levels in a multi-level CPR mechanism. By selectively using only appropriate checkpoint levels, a significant increase in energy efficiency (9 to 21%) is observed.
Published: 2017
Full Text: View/download PDF

13. A Capacity-Aware Thread Scheduling Method Combined with Cache Partitioning to Reduce Inter-Thread Cache Conflicts

Author: Masayuki Sato, Hiroaki Kobayashi, Hiroyuki Takizawa, and Ryusuke Egawa
Subjects: Computer science, Cache coloring, MESI protocol, Parallel computing, Cache pollution, Smart Cache, Artificial Intelligence, Hardware and Architecture, Cache invalidation, Page cache, Computer Vision and Pattern Recognition, Cache, Electrical and Electronic Engineering, Cache algorithms, Software
Published: 2013
Full Text: View/download PDF

14. The Importance of Dynamic Load Balancing among OpenMP Thread Teams for Irregular Workloads

Author: Hiroyuki Takizawa, Xiong Xiao, Shoichi Hirasawa, and Hiroaki Kobayashi
Subjects: 020203 distributed computing, Coprocessor, Computer science, Dynamic load balancing, 010103 numerical & computational mathematics, 02 engineering and technology, Thread (computing), Dynamic priority scheduling, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, computer.software_genre, 01 natural sciences, Instruction set, Load management, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Operating system, 0101 mathematics, computer, Xeon Phi
Abstract: Recently, massively-parallel many-core processors such as Intel Xeon Phi coprocessors have attracted researchers' attention because various applications are significantly accelerated with those processors. In the field of high-performance computing, OpenMP is a standard programming model commonly used to parallelize a kernel loop for many-core processors. For hierarchical parallel processing, OpenMP version 4.0 or later allows programmers to group threads into multiple thread teams. In this paper, we first show the performance gain of using multiple thread teams even for one many-core processor. Then, we demonstrate that dynamic load balancing among those thread teams has the potential to significantly improve the performance of irregular workloads on a many-core processor. Although the current OpenMP specification does not offer such a dynamic load balancing mechanism, we discuss possible benefits of dynamic load balancing among thread teams through experiments using the Intel Xeon Phi coprocessor.
Published: 2016
Full Text: View/download PDF

15. A User-Defined Code Transformation Approach to Overlapping MPI Communication with Computation

Author: Yasuharu Hayashi, Hiroaki Kobayashi, and Hiroyuki Takizawa
Subjects: 020203 distributed computing, Code Translation, Computer science, computer.internet_protocol, Programming language, Node (networking), 020206 networking & telecommunications, 02 engineering and technology, Dynamic priority scheduling, Parallel computing, computer.software_genre, Electronic mail, Instruction set, Software portability, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), computer, XML
Abstract: The Xevolver framework has been developed to enable application programmers to define their own code translation rules outside of their codes so that they can express platform-specific optimizations separately from algorithm-level application codes. Due to the diversity of HPC node architectures, the Xevolver framework has so far mainly been used to separate node-level code optimizations from application codes. However, user-defined code transformation rules are also potentially useful for optimizing MPI applications without messing up their codes. Therefore, this paper shows a case study of using the Xevolver framework to optimize MPI applications through customizable code transformations without loss of high performance portability, and discusses the benefits of the framework.
Published: 2016
Full Text: View/download PDF

16. A cache partitioning mechanism to protect shared data for CMPs

Author: Hiroaki Kobayashi, Ryusuke Egawa, Shin Nishimura, Masayuki Sato, and Hiroyuki Takizawa
Subjects: Hardware_MEMORYSTRUCTURES, Computer science, Cache coloring, Parallel computing, Cache pollution, computer.software_genre, Smart Cache, Cache invalidation, Bus sniffing, Operating system, Page cache, Cache, Cache algorithms, computer
Abstract: The last-level cache (LLC) of a modern chip-multiprocessor (CMP) keeps two kinds of data: shared data accessed by multiple cores and private data accessed by only one core. Although the former are likely to have a larger performance impact than the latter, the LLC manages both of those data in the same fashion. To realize a highly efficient execution on a CMP, this paper proposes a cache partitioning mechanism to protect shared data from excessive eviction. The evaluation results show that the proposed mechanism improves the performance by up to 76% and by 8% on average at a cost of less than 2% of the LLC hardware.
Published: 2016
Full Text: View/download PDF

17. Parallel processing of the Building-Cube Method on a GPU platform

Author: Ryusuke Egawa, Hiroyuki Takizawa, Takashi Soga, Hiroaki Kobayashi, Kazuhiro Nakahashi, Kazuhiko Komatsu, Shun Takahashi, and Daisuke Sasaki
Subjects: Theoretical computer science, General Computer Science, Computer science, Data parallelism, General Engineering, GPU cluster, Parallel computing, law.invention, Parallel processing (DSP implementation), law, Scalability, Polygon mesh, Cartesian coordinate system, Cube, Data transmission
Abstract: The Building-Cube Method (BCM) based on equally-spaced Cartesian meshes has been proposed as a next generation CFD method. Due to the equally-spaced meshes, it is well suited for highly parallel computation. This paper proposes a parallel implementation scheme of BCM on a GPU cluster system, which needs efficient hierarchical parallel processing to exploit the potential of the cluster system. The proposed scheme employs the Red-Black SOR method for the pressure calculations, which is the most time-consuming part of BCM, to obtain massive data parallelism of BCM. By exploiting the coarse-grain and fine-grain parallelism of BCM, the proposed scheme hierarchically assigns equally-divided tasks into the GPU cluster system. Furthermore, to exploit the computational power of GPUs in the cluster system, the proposed scheme employs an efficient data management such as coalesced data transfer and reusing data on an on-chip memory. Experimental results show that the single GPU implementation can achieve about three times higher performance than the single CPU one. Moreover, the multiple GPU implementation can achieve an almost ideal scalability. Finally, the possibility of further acceleration of not only the pressure calculation but also the whole BCM is discussed.
Published: 2011
Full Text: View/download PDF
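
The Red-Black SOR method named in the abstract of record 17 can be sketched in a few lines. This is a minimal pure-Python illustration on a 2-D Poisson problem, not the BCM pressure solver or its GPU implementation; the function name, grid size, relaxation factor, and sweep count are assumptions chosen for the example.

```python
def red_black_sor(u, f, h, omega=1.5, sweeps=200):
    """Relax laplace(u) = f in place on a 2-D grid whose outer ring holds fixed boundary values.

    u and f are lists of rows, h is the grid spacing, omega the relaxation factor.
    """
    n, m = len(u), len(u[0])
    for _ in range(sweeps):
        for color in (0, 1):  # 0 = "red" points, 1 = "black" points
            # Every point of one color has only neighbors of the other color,
            # so all updates within a color are independent (data parallel).
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 != color:
                        continue
                    neighbors = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                    gauss_seidel = 0.25 * (neighbors - h * h * f[i][j])
                    u[i][j] += omega * (gauss_seidel - u[i][j])
    return u

# Hypothetical test case: boundary values u = j with f = 0, whose exact solution is u = j.
n = 6
u = [[float(j) if i in (0, n - 1) or j in (0, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
f = [[0.0] * n for _ in range(n)]
red_black_sor(u, f, h=1.0)
print(u[2][3])
```

The two-color split is what exposes the massive data parallelism the abstract refers to: a GPU or vector unit can update all points of one color at once, at the cost of two passes over the grid per sweep.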

18. Performance of SOR methods on modern vector and scalar processors

Author: Shun Takahashi, Hiroaki Kobayashi, Takashi Soga, Kazuhiro Nakahashi, Akihiro Musa, Kazuhiko Komatsu, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Daisuke Sasaki
Subjects: Flow computation, Scalar processor, General Computer Science, Computer science, business.industry, General Engineering, Memory bandwidth, Parallel computing, Computational fluid dynamics, Pressure field, Successive over-relaxation, Performance improvement, business, Algorithm, Implementation
Abstract: The building-cube method (BCM) is a new generation algorithm for CFD simulations. The basic idea of BCM is to simplify the algorithm in all stages of flow computation to achieve large-scale simulations. Calculation of a pressure field using the Successive Over Relaxation (SOR) method consumes most of the total execution time required for BCM. In this paper, effective implementations on modern vector and scalar processors are investigated. NEC SX-9 and Intel Nehalem-EX are the latest vector and scalar processors. Those processors have much higher peak performances than their previous-generation processors. However, their memory bandwidth improvement cannot catch up with the performance improvement of processors. This is the so-called memory wall problem. In our paper, we discuss optimization techniques for implementation of the SOR method based on architectural characteristics of these modern processors, and evaluate their effects on the sustained performances of these processors for BCM.
Published: 2011
Full Text: View/download PDF

19. Characteristics of an On-Chip Cache on NEC SX Vector Architecture

Author: Akihiro Musa, Koki Okabe, Ryusuke Egawa, Hiroyuki Takizawa, Hiroaki Kobayashi, and Yoshiei Sato
Subjects: Smart Cache, Hardware_MEMORYSTRUCTURES, Cache invalidation, Computer science, Cache coloring, Page cache, Cache, Parallel computing, Cache pollution, Cache-oblivious algorithm, Cache algorithms
Abstract: Thanks to the highly effective memory bandwidth of the vector systems, they can achieve high computational efficiency for computation-intensive scientific applications. However, they have been encountering the memory wall problem and the effective memory bandwidth rate has decreased, resulting in the decrease in the bytes per flop rates of recent vector systems from 4 (SX-7 and SX-8) to 2 (SX-8R) and 2.5 (SX-9). The situation is getting worse as many function units and/or cores will be brought into a single chip, because the pin bandwidth is limited and does not scale. To solve the problem, we propose an on-chip cache, called vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers. The vector cache employs a bypass mechanism between the main memory and register files under software controls. We evaluate the performance of the vector cache on the NEC SX vector processor architecture with bytes per flop rates of 2 B/FLOP and 1 B/FLOP, to clarify the basic characteristics of the vector cache. For the evaluation, we use the NEC SX-7 simulator extended with the vector cache mechanism. Benchmark programs for performance evaluation are two DAXPY-like loops and five leading scientific applications. The results indicate that the vector cache boosts the computational efficiencies of the 2 B/FLOP and 1 B/FLOP systems up to the level of the 4 B/FLOP system. Especially, in the case where cache hit rates exceed 50%, the 2 B/FLOP system can achieve a performance comparable to the 4 B/FLOP system. The vector cache with the bypass mechanism can provide the data both from the main memory and the cache simultaneously. In addition, from the viewpoints of designing the cache, we investigate the impact of cache associativity on the cache hit rate, and the relationship between cache latency and the performance. The results also suggest that the associativity hardly affects the cache hit rate, and the effects of the cache latency depend on the vector loop length of applications. A shorter cache latency contributes to the performance improvement of the applications with shorter loop lengths, even in the case of the 4 B/FLOP system. In the case of longer loop lengths of 256 or more, the latency can effectively be hidden, and the performance is not sensitive to the cache latency. Finally, we discuss the effects of selective caching using the bypass mechanism and loop unrolling on the vector cache performance for the scientific applications. The selective caching is effective for efficient use of the limited cache capacity. The loop unrolling is also effective for the improvement of performance, resulting in a synergistic effect with caching. However, there are exceptional cases; the loop unrolling worsens the cache hit rate due to an increase in the working space to process the unrolled loops over the cache. In this case, an increase in the cache miss rate cancels the gain obtained by unrolling.
Published: 2009
Full Text: View/download PDF

20. A Case Study of User-Defined Code Transformations for Data Layout Optimizations

Author: Shoichi Hirasawa, Takeshi Yamada, Hiroaki Kobayashi, and Hiroyuki Takizawa
Subjects: Source code, Dead code, Computer science, Programming language, media_common.quotation_subject, Parallel computing, computer.software_genre, External Data Representation, Data structure, Code (cryptography), Code generation, KPI-driven code analysis, Redundant code, computer, media_common
Abstract: This paper reports a case study of using the Xevolver code transformation framework for data layout optimizations of high-performance computing (HPC) applications. Due to the variety of data structures used in individual applications, a code transformation rule for data layout optimizations is generally specific to a particular application. Since the Xevolver framework enables users to define their own code transformations, a custom code transformation can be defined so that a specific data representation in an existing code can mechanically and consistently be translated to another one. Our evaluation results clearly demonstrate that such a code transformation is effective to improve memory access efficiency and hence the performance of an HPC application without overcomplicating the code.
Published: 2015
Full Text: View/download PDF

21. A Verification Framework for Streamlining Empirical Auto-Tuning

Author: Hiroyuki Takizawa, Shoichi Hirasawa, and Hiroaki Kobayashi
Subjects: Auto tuning, Correctness, Computer science, Kernel (statistics), Computation, Process (computing), Code (cryptography), Parallel computing, Programmer, Field (computer science)
Abstract: Empirical auto-tuning is getting attention in the field of high-performance computing (HPC) because it effectively reduces programmers' burden to improve the execution performance of an application. In the tuning process, a programmer selects a high-performance kernel variant of the application by evaluating the performances of multiple kernel variants. Since HPC applications need quite a huge number of floating-point operations, not all kernel variants produce exactly the same computation result as the original code. Although it is possible to verify the correctness of each kernel variant by executing the whole application to the end, it takes a long time to verify the final computation results of all kernel variants especially in the cases of long-running applications. Therefore, this paper proposes a framework that reduces the time for verifying the computation result on tuning a large-scale application. The framework uses user-specified information of the final computation result of the application to verify the correctness of every kernel variant. The framework also automatically skips unnecessary verifications to reduce the overall verification time. As a result, the framework streamlines empirical auto-tuning.
Published: 2015
Full Text: View/download PDF
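
As a rough illustration of the empirical auto-tuning loop described in the abstract above, the Python sketch below times several kernel variants and keeps the fastest one whose result matches a reference within a tolerance. It is a generic sketch, not the framework proposed in the paper: the function names, the toy summation kernels, and the tolerance `rel_tol` are all assumptions made for illustration, and the paper's mechanism for skipping unnecessary verifications is not modelled.

```python
import math
import time


def reference_kernel(xs):
    """Reference result: accurately rounded floating-point sum."""
    return math.fsum(xs)


def variant_naive(xs):
    """Hypothetical variant: plain left-to-right accumulation."""
    total = 0.0
    for x in xs:
        total += x
    return total


def variant_wrong(xs):
    """Hypothetical broken variant: drops the last element on purpose."""
    return sum(xs[:-1])


def pick_variant(variants, data, expected, rel_tol=1e-9):
    """Return the name of the fastest variant whose result passes verification.

    A variant is accepted only if its result is within rel_tol of `expected`.
    Returns None when no variant passes.
    """
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        start = time.perf_counter()
        result = fn(data)
        elapsed = time.perf_counter() - start
        if not math.isclose(result, expected, rel_tol=rel_tol):
            continue  # verification failed: discard this variant
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

For example, with `data = [0.1] * 1000` and `expected = reference_kernel(data)`, calling `pick_variant({"naive": variant_naive, "wrong": variant_wrong}, data, expected)` rejects the broken variant and returns `"naive"`, even though the naive sum differs from the reference in its last few bits.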

22. Performance Evaluation of Compiler-Assisted OpenMP Codes on Various HPC Systems

Author: Kazuhiko Komatsu, Hiroaki Kobayashi, Hiroyuki Takizawa, and Ryusuke Egawa
Subjects: Structure (mathematical logic), Automatic parallelization, Data dependency, Computer science, Key (cryptography), Code (cryptography), Compiler, Parallel computing, Software_PROGRAMMINGLANGUAGES, Serial code, computer.software_genre, computer
Abstract: As automatic parallelization functions are different among compilers, a serial code is often modified so that a particular target compiler can easily understand its code structure and data dependency, resulting in effective automatic optimizations. However, these code modifications might not be effective for a different compiler because the different compiler cannot always parallelize the modified code. In this paper, in order to achieve effective parallelization on various HPC systems, compiler messages obtained from various compilers on different HPC systems are utilized for the OpenMP parallelization. Because the message about one system may be useful to identify key loop nests even for other systems, performance portable OpenMP parallelization can be achieved. This paper evaluates the performance of the compiler-assisted OpenMP codes using compiler messages from various compilers. The evaluation results clarified that, when a code is modified for its target compiler, the compiler message given by the target compiler is the most helpful to achieve appropriate OpenMP parallelization.
Published: 2015
Full Text: View/download PDF

23. Hierarchical parallel processing of large scale data clustering on a PC cluster with GPU co-processing

Author: Hiroyuki Takizawa and Hiroaki Kobayashi
Subjects: Computer science, Data parallelism, Nearest neighbor search, Message passing, Graphics processing unit, Parallel computing, Supercomputer, Theoretical Computer Science, Parallel processing (DSP implementation), Hardware and Architecture, Cluster analysis, Massively parallel, Software, Information Systems
Abstract: This paper presents an effective scheme for clustering a huge data set using a PC cluster system, in which each PC is equipped with a commodity programmable graphics processing unit (GPU). The proposed scheme is devised to achieve three-level hierarchical parallel processing of massive data clustering. The divide-and-conquer approach to parallel data clustering is employed to perform the coarse-grain parallel processing by multiple PCs with a message passing mechanism. By taking advantage of the GPU's parallel processing capability, moreover, the proposed scheme can exploit two types of the fine-grain data parallelism at the different levels in the nearest neighbor search, which is the most computationally-intensive part of the data-clustering process. The performance of our scheme is discussed in comparison with that of the implementation entirely running on CPU. Experimental results clearly show that the proposed hierarchical parallel processing can remarkably accelerate the data clustering task. Especially, GPU co-processing is quite effective to improve the computational efficiency of parallel data clustering on a PC cluster. Although data-transfer from GPU to CPU is generally costly, acceleration by GPU co-processing is significant to save the total execution time of data-clustering.
Published: 2006
Full Text: View/download PDF

24. Efficient parallel processing of competitive learning algorithms

Author: Hiroyuki Takizawa, Shintaro Momose, Hiroaki Kobayashi, Tadao Nakamura, and Kentaro Sano
Subjects: Self-organizing map, Analysis of parallel algorithms, Speedup, Computer Networks and Communications, Computer science, Competitive learning, Quantization (signal processing), Vector quantization, Parallel algorithm, Codebook, Parallel computing, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Artificial Intelligence, Hardware and Architecture, Scalability, Algorithm, Software
Abstract: Vector quantization (VQ) is an attractive technique for lossy data compression, which has been a key technology for data storage and/or transfer. So far, various competitive learning (CL) algorithms have been proposed to design optimal codebooks presenting quantization with minimized errors. Although algorithmic improvements of these CL algorithms have achieved faster codebook design than conventional ones, limitations of speedup still exist when large data sets are processed on a single processor. Considering a variety of CL algorithms, parallel processing on a flexible computing environment, like general-purpose parallel computers, is in demand for a large-scale codebook design. This paper presents a formulation for efficiently parallelizing CL algorithms, suitable for distributed-memory parallel computers with a message-passing mechanism. Based on this formulation, we parallelize three CL algorithms: the Kohonen learning algorithm, the MMPDCL algorithm and the LOJ algorithm. Experimental results indicate a high scalability of the parallel algorithms on three different types of commercially available parallel computers: IBM SP2, NEC AzusA and PC cluster.
Published: 2004
Full Text: View/download PDF
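
To make the codebook-design step in the abstract above concrete, the following Python sketch shows a basic serial winner-take-all competitive learning update for vector quantization. It is a minimal textbook-style sketch, not the parallel formulation of the paper and not the Kohonen, MMPDCL, or LOJ algorithms themselves; the function names, the learning rate `lr`, and the epoch count are assumptions chosen for illustration.

```python
def nearest_index(codebook, x):
    """Return the index of the code vector closest to x (squared Euclidean distance)."""
    best, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        d = sum((ci - xi) ** 2 for ci, xi in zip(c, x))
        if d < best_d:
            best, best_d = i, d
    return best


def competitive_learning(data, codebook, lr=0.1, epochs=10):
    """Winner-take-all training: move only the winning code vector toward each input.

    The input codebook is copied, so the caller's list is left unchanged.
    """
    codebook = [list(c) for c in codebook]
    for _ in range(epochs):
        for x in data:
            w = nearest_index(codebook, x)
            # Update rule: c_w <- c_w + lr * (x - c_w)
            codebook[w] = [ci + lr * (xi - ci) for ci, xi in zip(codebook[w], x)]
    return codebook
```

With two well-separated groups of points, such as `[[0.0, 0.0], [0.0, 0.2]]` and `[[1.0, 1.0], [1.0, 0.8]]`, and an initial codebook `[[0.1, 0.1], [0.9, 0.9]]`, each code vector drifts toward its own group. The nearest-vector search inside the loop is the part that dominates the cost for large codebooks and data sets.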

25. Performance Evaluation of an OpenMP Parallelization by Using Automatic Parallelization Information

Author: Ryusuke Egawa, Kazuhiko Komatsu, Hiroaki Kobayashi, and Hiroyuki Takizawa
Subjects: Automatic parallelization, Software portability, Exploit, Computer science, Key (cryptography), Code (cryptography), Compiler, Parallel computing, Serial code, computer.software_genre, Programmer, computer
Abstract: To exploit the potential of many-core processors, a serial code is generally optimized for a particular compiler called a target compiler, so that the compiler can understand the code structure for automatic parallelization. However, the performance of such a serial code is not always portable to a new system that uses a different compiler. To improve the performance portability, this paper proposes an OpenMP parallelization method by using compiler messages of the target compiler. Since the compiler messages from the target compiler are also useful to identify key loop nests even for the different system, a programmer can use the message to easily parallelize a serial code with low programming effort. Furthermore, programmer’s intention of the optimization can be migrated to other systems through the OpenMP parallelization, which results in high performance portability. The experimental results indicate that the OpenMP codes parallelized by the proposed method can achieve a comparable or even better performance than the automatically parallelized codes by various compilers.
Published: 2014
Full Text: View/download PDF

26. An energy optimization method for vector processing mechanisms

Author: Hiroaki Kobayashi, Ye Gao, Hiroyuki Takizawa, Masayuki Sato, and Ryusuke Egawa
Subjects: Pipeline transport, Low energy, Power consumption, Computer science, Benchmark (computing), Cache, Parallel computing, Energy minimization, Energy (signal processing), Vector processor
Abstract: In order to achieve a low energy execution for any multimedia applications (MMAs) on a vector processing mechanism (VPM), the number of parallel arithmetic pipelines and the number of cache ports of VPM must be properly configured for each MMA. Therefore, this paper proposes an energy optimization method for VPMs (EOM-VP), which finds the lowest energy configuration by using the greedy searching method and an analytical model. As the evaluation results suggest, EOM-VP could find the lowest or the second lowest energy configuration for all the benchmark programs in the evaluation.
Published: 2014
Full Text: View/download PDF

27. A Compiler-Assisted OpenMP Migration Method Based on Automatic Parallelizing Information

Author: Hiroaki Kobayashi, Hiroyuki Takizawa, Kazuhiko Komatsu, and Ryusuke Egawa
Subjects: Functional compiler, Automatic parallelization, Computer science, Intrinsic function, Programming language, Interprocedural optimization, Parallel computing, Compiler, Serial code, computer.software_genre, computer, Dead code elimination, Compiler correctness
Abstract: Performance of a serial code often relies on compilers' capabilities for automatic parallelization. In such a case, the performance is not portable to a new system because a new compiler on the new system may be unable to effectively parallelize the code originally developed assuming a particular target compiler. As the compiler messages from the target compiler are still useful to identify key kernels that should be optimized even for the different system, this paper proposes a method to migrate a serial code to the OpenMP programming model by using such compiler messages. The aim of the proposed method is to improve the performance portability across different systems and compilers. Experimental results indicate that the migrated OpenMP code can achieve a comparable or even better performance than the original code with automatic parallelization.
Published: 2014
Full Text: View/download PDF

28. A Comparison of Performance Tunabilities between OpenCL and OpenACC

Author: Hiroaki Kobayashi, Kazuhiko Komatsu, Hiroyuki Takizawa, Shoichi Hirasawa, and Makoto Sugawara
Subjects: Instruction set, Auto tuning, CUDA, Computer architecture, Computer science, Compiler directive, Synchronization (computer science), Programming paradigm, Software maintenance, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, Software_PROGRAMMINGTECHNIQUES, General-purpose computing on graphics processing units
Abstract: To design and develop any auto tuning mechanisms for OpenACC, it is important to clarify the differences between conventional GPU programming models and OpenACC in terms of available programming and tuning techniques, called performance tunabilities. This paper hence discusses the performance tunabilities of OpenACC and OpenCL. As OpenACC cannot synchronize threads running on GPUs, some important techniques are not available to OpenACC. Therefore, we also design an additional compiler directive for thread synchronization. Evaluation results show that both OpenCL and OpenACC need architecture-aware optimizations, and similar approaches to performance optimization are effective for both OpenCL and OpenACC. The additional directive can allow OpenACC to describe more tuning techniques in the same approach as OpenCL. As it is obvious that OpenACC is more productive than OpenCL especially for legacy application migration, OpenACC is a very promising programming model if it can achieve the same performance as the conventional GPU programming models such as CUDA and OpenCL.
Published: 2013
Full Text: View/download PDF

29. A flexible insertion policy for dynamic cache resizing mechanisms

Author: Masayuki Sato, Hiroyuki Takizawa, Hiroaki Kobayashi, Y. Tobo, and Ryusuke Egawa
Subjects: Hardware_MEMORYSTRUCTURES, Cache coloring, business.industry, Computer science, Parallel computing, Cache pollution, Cache-oblivious algorithm, Smart Cache, Cache invalidation, Embedded system, Page cache, Cache, business, Cache algorithms
Abstract: This paper proposes a novel cache replacement policy named a flexible insertion policy (FLEXII) for dynamic cache resizing mechanisms. FLEXII can reduce the number of dead-on-fill blocks, which are never reused in a cache memory, and help the mechanisms further reduce the energy consumption. The experimental results indicate that FLEXII can reduce the energy consumption of the cache memory by up to 68%, and 27% on average without significant performance degradation.
Published: 2013
Full Text: View/download PDF

30. ClMPI: An opencl extension for interoperation with the message passing interface

Author: Isaac Gelado, Makoto Sugawara, Shoichi Hirasawa, Wen-mei W. Hwu, Hiroaki Kobayashi, and Hiroyuki Takizawa
Subjects: 020203 distributed computing, Computer science, 020209 energy, Serialization, Message passing, Message Passing Interface, 02 engineering and technology, Parallel computing, ComputerSystemsOrganization_PROCESSORARCHITECTURES, Software_PROGRAMMINGTECHNIQUES, computer.software_genre, Instruction set, Software portability, Interoperation, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Operating system, Programmer, computer
Abstract: This paper proposes an OpenCL extension, clMPI, that allows a programmer to think as if GPUs communicate without any help of CPUs. The clMPI extension offers some OpenCL commands of inter-node data transfers that are executed in the same manner as the other OpenCL commands. Thus, clMPI naturally extends the conventional OpenCL programming model so as to improve the MPI interoperability. Unlike conventional joint programming of MPI and OpenCL, CPUs do not need to be blocked to serialize dependent operations of MPI and OpenCL. Hence, an application can easily use the opportunities to overlap parallel activities of CPUs and GPUs. In addition, the implementation details of data transfers are hidden behind the extension, and application programmers can use the optimized data transfers without any tricky programming techniques. As a result, the extension can improve not only the performance but also the performance portability across different system configurations. The evaluation results show that the clMPI extension can use the optimized data transfer implementation and thereby increase the sustained performance by about 14% for the Himeno benchmark if the communication time cannot be overlapped with the computation time.
Published: 2013
Full Text: View/download PDF

31. Analysing the Performance Improvements of Optimizations on Modern HPC Systems

Author: Kazuhiko Komatsu, Toshihide Sasaki, Ryusuke Egawa, Hiroyuki Takizawa, and Hiroaki Kobayashi
Subjects: Scalar processor, ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION, Exploit, Hardware_GENERAL, Computer science, Loop unswitching, Code (cryptography), Optimization methods, Parallel computing, Software_PROGRAMMINGTECHNIQUES, Supercomputer
Abstract: Recently, there are many types of supercomputing systems being equipped with vector processors, scalar processors, and accelerators as processing elements of the systems. Although all kinds of calculations cannot effectively be performed on one HPC system, a part of calculations can exploit the potential of an HPC system by considering both the calculations and the system. These tendencies that each HPC system is designed and suitable for specific fields of calculations continue in order to achieve higher performance for target HPC codes. Therefore, even though the same HPC code is executed on multiple HPC systems, the sustained performances on HPC systems are different. As characteristics of an HPC code mainly depend on optimization methods, clarifying the performances by the optimization methods on multiple HPC systems becomes important for developing performance-portable HPC codes, which can exploit the potential of every HPC system. By considering both the optimization methods and the HPC systems, this paper clarifies the performances of the optimization methods on multiple HPC systems.
Published: 2013
Full Text: View/download PDF

32. Improving the scalability of transparent checkpointing for GPU computing systems

Author: Hiroyuki Takizawa, Alfian Amrizal, Shoichi Hirasawa, Hiroaki Kobayashi, and Kazuhiko Komatsu
Subjects: Computer science, Distributed computing, Node (networking), Scalability, Data_FILES, Process (computing), Parallel computing, General-purpose computing on graphics processing units, Global file system
Abstract: As the number of nodes in a GPU computing system increases, checkpointing to a global file system becomes more time-consuming due to the I/O bottlenecks and network congestion. To solve this problem, in this paper, we propose a transparent and scalable checkpoint/restart mechanism for OpenCL applications, named Two-level CheCL. As its name implies, Two-level CheCL consists of two different checkpoint implementations, Local CheCL and Global CheCL. Local CheCL avoids checkpointing to the global file system by utilizing a node's local storage. Our experimental results show that Local CheCL can make the checkpointing process up to four times faster than a conventional checkpointing mechanism. We also implement Global CheCL, which utilizes a global file system, to make sure that we always have a global checkpoint file even in the case of a catastrophic failure. We discuss the performance of our proposed mechanism through an analysis with a two-level checkpoint model.
Published: 2012
Full Text: View/download PDF

33. Performance Evaluation of a Next-Generation CFD on Various Supercomputing Systems

Author: Ryusuke Egawa, Kazuhiko Komatsu, Hiroaki Kobayashi, Hiroyuki Takizawa, and Takashi Soga
Subjects: ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION, Computer science, business.industry, Computation, Parallel computing, GPU cluster, Computational fluid dynamics, Supercomputer, law.invention, law, Polygon mesh, Cartesian coordinate system, business, Implementation
Abstract: The Building-Cube Method (BCM) has been proposed as a new CFD method for an efficient three-dimensional flow simulation on large-scale supercomputing systems, and is based on equally-spaced Cartesian meshes. As a flow domain can be divided into equally-partitioned cells due to the equally-spaced meshes, the flow computations can be divided to partial computations of the same computational cost. To achieve a high sustained performance, architecture-aware implementations and optimizations considering characteristics of supercomputing systems are essential because there have been various types of supercomputing systems such as a scalar type, a vector type, and an accelerator type. This paper discusses the architecture-aware implementations and optimizations for various supercomputing systems such as an Intel Nehalem-EP cluster, an Intel Nehalem-EX cluster, Fujitsu FX-1, Hitachi SR16000 M1, NEC SX-9, and a GPU cluster, and analyses their sustained performance for BCM. The performance analysis shows that memory and network capabilities largely affect the performance of BCM rather than computational potentials.
Published: 2012
Full Text: View/download PDF

34. Performance and Scalability Analysis of a Chip Multi Vector Processor

Author: Koki Okabe, Ryusuke Egawa, Hiroaki Kobayashi, Yoshiei Sato, Akihiro Musa, and Hiroyuki Takizawa
Subjects: Computer science, Scalability, Performance tuning, Memory bandwidth, Cache, Parallel computing, Chip, Bottleneck, Vector processor, Euclidean vector
Abstract: To realize more efficient and powerful computations on a vector processor, a chip multi vector processor (CMVP) has been proposed as a next generation vector processor. However, the usefulness of CMVP for scientific applications has been unclear. The objective of this paper is to clarify the potential of CMVP. Although the computational performance of CMVP increases with the number of cores, the ratio of memory bandwidth to computational performance (B/F) will decrease. To cover the insufficient B/F, CMVP has a shared vector cache. Therefore, to exploit the potential of CMVP, applications for CMVP should be optimized not only with conventional tuning techniques to improve the efficiency of vector operations, but also with new techniques to effectively use the vector cache. Under this situation, this paper presents a performance tuning strategy for CMVP. The strategy analyzes the performance bottleneck of an application to find the best combination of tuning techniques. The performance and scalability improvements due to the tuning strategy are evaluated using real applications. The evaluation results clarify that performance tuning becomes more important as the number of cores increases.
Published: 2011
Full Text: View/download PDF

35. Power-Aware Dynamic Cache Partitioning for CMPs

Author: Kenta Abe, Hiroaki Kobayashi, Ryusuke Egawa, Isao Kotera, and Hiroyuki Takizawa
Subjects: Smart Cache, Hardware_MEMORYSTRUCTURES, Shared memory, Computer science, Cache coloring, Pipeline burst cache, Cache, Parallel computing, Cache pollution, Cache-oblivious algorithm, Cache algorithms
Abstract: Cache partitioning and power-gating schemes are major research topics to achieve a high-performance and low-power shared cache for next generation chip multiprocessors (CMPs). We propose a power-aware cache partitioning mechanism, which is a scheme to realize both low power and high performance using power-gating and cache partitioning at the same time. The proposed cache mechanism is composed of a way-allocation function and power control function; each function works based on the cache locality assessment. The performance evaluation results show that the proposed cache mechanism with a performance-oriented parameter setting can reduce energy consumption by 20% while keeping the performance, and the mechanism with an energy-oriented parameter setting can reduce energy consumption by 54% with a performance degradation of 13%. The hardware implementation results indicate that the delay and area overheads to control the proposed mechanism are negligible, and therefore hardly affect both the entire chip design and performance.
Published: 2011
Full Text: View/download PDF

36. Cache partitioning strategies for 3-D stacked vector processors

Author: Hiroaki Kobayashi, Yusuke Funaya, Ryusuke Egawa, and Hiroyuki Takizawa
Subjects: Smart Cache, Hardware_MEMORYSTRUCTURES, Cache coloring, Cache invalidation, Computer science, Bus sniffing, Page cache, Parallel computing, Cache, Cache pollution, Cache algorithms
Abstract: An on-chip cache memory for vector processors, named vector cache, has been proposed to realize a high sustained memory bandwidth, which is balanced with the high computational performance of future vector processors. In our previous research, from the viewpoint of architectural design, it is clearly shown that the 3D die-stacking technology can increase the capacity of the vector cache and thereby the performance of vector processors. However, detailed design of vector caches with the 3D die-stacking technology has not been discussed well yet. Therefore it is still unclear how much the vector cache can benefit from the 3D die-stacking technology in terms of cost, latency, and energy consumption. In this paper, the vector caches are designed in detail so as to exploit the advantages of the 3D die-stacking technologies, such as reductions in long wires and energy consumption. In the cache design, this paper examines two strategies to partition the vector caches into some blocks and to place them onto multiple layers. One cache partitioning strategy places more emphasis on the reduction in the number of long wires. The other strategy reduces the number of through-silicon vias (TSVs). This paper evaluates latency, energy consumption, and the number of TSVs used in each cache partitioning strategy. This paper also discusses the 3D cache configuration suitable for vector processors.
Published: 2010
Full Text: View/download PDF

37. Automatic Tuning of CUDA Execution Parameters for Stencil Processing

Author: Hiroaki Kobayashi, Kazuhiko Komatsu, Hiroyuki Takizawa, and Katsuto Sato
Subjects: Profiling (computer programming), CUDA, Kernel (image processing), Computer science, Parallel computing, Thread (computing), Performance improvement, Graphics, Programmer, Stencil
Abstract: Recently, Compute Unified Device Architecture (CUDA) has enabled Graphics Processing Units (GPUs) to accelerate various applications. However, to exploit the GPU’s computing power fully, a programmer has to carefully adjust some CUDA execution parameters even for simple stencil processing kernels. Hence, this paper develops an automatic parameter tuning mechanism based on profiling to predict the optimal execution parameters. This paper first discusses the scope of the parameter exploration space determined by GPU’s architectural restrictions. To find the optimal execution parameters, performance models are created by profiling execution times of kernel using each promising parameter configuration. The execution parameters are determined by using those performance models. This paper evaluates the performance improvement due to the proposed mechanism using two benchmark programs. From the evaluation results, it is clarified that the proposed mechanism can appropriately select a suboptimal Cooperative Thread Array (CTA) configuration whose performance is comparable to the optimal one.
Published: 2010
Full Text: View/download PDF

38. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications

Author: Katsuto Sato, Hiroyuki Takizawa, Hiroaki Kobayashi, and Kazuhiko Komatsu
Subjects: Coprocessor, Computer science, GPU cluster, Parallel computing, Software_PROGRAMMINGTECHNIQUES, computer.software_genre, Scheduling (computing), Computer graphics, CUDA, CUDA Pinned memory, High-level programming language, Operating system, Dependability, computer, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a part of basic CUDA driver API calls in order to record the status changes on the main memory. At checkpointing, CheCUDA stores the status changes in a file after copying all necessary data in the video memory to the main memory and then disabling the CUDA runtime. At restarting, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on GPUs so as to restart from the stored status. This paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs. This also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only to enhance the dependability of CUDA applications but also to enable dynamic task scheduling of CUDA applications required especially on heterogeneous GPU cluster systems. This paper also shows the timing overhead for checkpointing.
Published: 2009
Full Text: View/download PDF

39. Performance evaluation of NEC SX-9 using real science and engineering applications

Author: Ryusuke Egawa, Ken'ichi Itakura, Hiroaki Kobayashi, Akihiro Musa, Youichi Shimomura, Takashi Soga, Koki Okabe, and Hiroyuki Takizawa
Subjects: HPC Challenge Benchmark, Scalar processor, Computer science, business.industry, Scalability, Performance tuning, Memory bandwidth, Parallel computing, Cache, Supercomputer, business, Computer hardware, Vector processor
Abstract: This paper describes a new-generation vector parallel supercomputer, NEC SX-9 system. The SX-9 processor has an outstanding core to achieve over 100Gflop/s, and a software-controllable on-chip cache to keep the high ratio of the memory bandwidth to the floating-point operation rate. Moreover, its large SMP nodes of 16 vector processors with 1.6Tflop/s performance and 1TB memory are connected with dedicated network switches, which can achieve inter-node communication at 128GB/s per direction. The sustained performance of the SX-9 processor is evaluated using six practical applications in comparison with conventional vector processors and the latest scalar processor such as Nehalem-EP. Based on the results, this paper discusses the performance tuning strategies for new-generation vector systems. An SX-9 system of 16 nodes is also evaluated by using the HPC challenge benchmark suite and a CFD code. Those evaluation results clarify the highest sustained performance and scalability of the SX-9 system.
Published: 2009
Full Text: View/download PDF

40. Performance tuning and analysis of future vector processors based on the roofline model

Author: Ryusuke Egawa, Koki Okabe, Ryuichi Nagaoka, Hiroaki Kobayashi, Hiroyuki Takizawa, Akihiro Musa, and Yoshiei Sato
Subjects: Hardware_MEMORYSTRUCTURES, Cache coloring, Computer science, Scalar (mathematics), Performance tuning, Memory bandwidth, Cache, Parallel computing, Cache algorithms, Vector processor, Efficient energy use
Abstract: Because of a recent steep drop in the ratio of memory bandwidth to computational performance (B/F) of vector processors, their advantage against scalar ones regarding relatively high sustained performance is decaying. To cover the insufficient B/F rate, an on-chip vector cache mechanism is promising for the vector processors. Although the effectiveness of the vector cache has been evaluated, cache-conscious tuning of vector codes and the analysis of the obtained performance have not been discussed yet. Under this situation, the purpose of this paper is to establish a strategy for performance tuning of a vector processor with a cache to exploit its potential. To analyze its sustained performance, this paper uses the roofline model. Several optimization techniques are applied to real scientific and engineering applications, and their effects are assessed with the model. We confirm that the model can guide users to effective tuning so as to maximize its gain. We also discuss the energy efficiency of the on-chip vector cache.
Published: 2009
Full Text: View/download PDF

41. 3D on-chip memory for the vector architecture

Author: Yusuke Funaya, Ryusuke Egawa, Hiroaki Kobayashi, and Hiroyuki Takizawa
Subjects: business.industry, Computer science, Interleaved memory, Uniform memory access, Registered memory, Memory bandwidth, Semiconductor memory, Distributed memory, Computing with Memory, Parallel computing, business, Computer hardware, Extended memory
Abstract: Vector supercomputers play an important role in the high-performance computing area because vector systems can achieve a high computational efficiency for large-scale scientific applications. The most important factor of a vector supercomputer is its high memory bandwidth between the processor and the off-chip main memory. However, it is inevitable to decrease the ratio of memory bandwidth to floating-point operation rate due to several hardware limitations, which prevent future vector processors from obtaining higher sustained performance and lower energy consumption. Recently, three-dimensional (3D) die stacking technology has attracted much attention because it can relax several limitations of conventional processor design. Hence, this paper explores the design space of vector processors with a large on-chip memory by using the 3D die stacking technology. A processor design proposed in this paper achieves a 32 MB on-chip memory by stacking four memory layers onto a vector processor layer using the 3D die stacking technology. In addition, an optimal 3D on-chip memory configuration is discussed in this paper. The on-chip memory can reduce the number of off-chip main memory accesses, resulting in higher performance and lower energy consumption of a memory system. Simulation results show that the proposed vector processor can achieve a 55% higher performance and 40% lower energy consumption than a conventional vector processor.
Published: 2009
Full Text: View/download PDF

42. Effects of MSHR and Prefetch Mechanisms on an On-Chip Cache of the Vector Architecture

Author: Ryusuke Egawa, Yoshiei Sato, Takashi Soga, Hiroyuki Takizawa, Akihiro Musa, Koki Okabe, and Hiroaki Kobayashi
Subjects: Instruction prefetch, Hardware_MEMORYSTRUCTURES, Software, business.industry, Computer science, Bandwidth (computing), Memory bandwidth, Parallel computing, Cache, Architecture, FLOPS, business, Supercomputer
Abstract: Vector supercomputers have been encountering the memory wall problem and their memory bandwidth per flop/s rate has decreased. To cover the insufficient memory bandwidth per flop/s rate, an on-chip vector cache has been proposed for the vector processors. Although vector caching is effective to increase the sustained performance to a certain degree, it still needs software and hardware supporting mechanisms to extract its potential. To this end, we propose miss status handling registers (MSHR) and a prefetch mechanism. This paper evaluates the performance of the vector cache with the MSHR and the prefetch mechanism on the vector supercomputer across three leading scientific applications. The MSHR is an effective mechanism for handling subsequent vector loads of the same data, which frequently appear in different schemes. The experimental results indicate that the MSHR can improve the computational performance of scientific applications by 1.45×. Moreover, we examine the performance of the prefetch mechanism on the vector cache. The prefetch mechanism increases the computational performance by 1.6×. Accordingly, the MSHR and the prefetching mechanism are very effective optimization options for vector caching of future vector supercomputers even if the vector supercomputers cannot maintain the current memory bandwidth per flop/s rate.
Published: 2008
Full Text: View/download PDF

43. Modeling of cache access behavior based on Zipf's law

Author: Isao Kotera, Ryusuke Egawa, Hiroyuki Takizawa, and Hiroaki Kobayashi
Subjects: Smart Cache, Hardware_MEMORYSTRUCTURES, Computer science, Cache invalidation, Cache coloring, Page cache, Cache, Parallel computing, Cache-oblivious algorithm, Cache pollution, Cache algorithms
Abstract: Recently, chip multiprocessors (CMPs) that can simultaneously execute multiple workloads using multiple cores have become a key to achieve high-performance processing. To improve CMP performance, various shared resource management mechanisms have been proposed. In particular, cache partitioning is significantly effective to avoid resource conflicts at a shared cache memory. As most cache partitioning methods need to predict the changes in cache access characteristics of each workload when the cache partition moves, it is important for cache partitioning to establish an accurate prediction model. In this paper, we first analyze the cache access locality of various applications using stack distance profiling. We figure out that stack distance distributions incline to obey so-called Zipf's law. To achieve effective cache partitioning, then, we propose a model based on Zipf's law that predicts the changes in the stack distance distributions. Using the model, we also show the validity of a measure, which has been proposed in our previous work to quantify how much a workload demands the cache capacity.
Published: 2008
Full Text: View/download PDF

44. First Experiences with NEC SX-9

Author: Koki Okabe, Hiroyuki Takizawa, Ryusuke Egawa, Hiroaki Kobayashi, Akihiro Musa, Yoichi Shimomura, and Takashi Soga
Subjects: Speedup, Computer science, Cache, Parallel computing, Supercomputer
Abstract: This paper presents the new supercomputer system NEC SX-9 that has been installed at Tohoku University in March 2008. The performance of the system is evaluated by using six real application codes. The experimental results indicate that the SX-9 system achieves a speedup of up to 7 compared to our previous NEC SX-7 system for the single-CPU sustained performance. In addition, the paper examines the effects of an on-chip vector cache named ADB on the performance, and confirms performance increases between 20 and 70% by selective caching on ADB.
Published: 2008
Full Text: View/download PDF

45. An on-chip cache design for vector processors

Author: Hiroaki Kobayashi, Akihiro Musa, Ryusuke Egawa, Koki Okabe, Hiroyuki Takizawa, and Yoshiei Sato
Subjects: Hardware_MEMORYSTRUCTURES, Computer science, Cache coloring, business.industry, Cache-only memory architecture, Parallel computing, Cache pollution, Non-uniform memory access, Smart Cache, Page cache, Cache, business, Cache algorithms, Computer hardware
Abstract: This paper discusses the potential of an on-chip cache memory for modern vector supercomputers. The vector supercomputers can achieve the high computational efficiency for compute-intensive scientific applications. The most important factor affecting the computational performance is high memory bandwidth to provide a sufficient amount of data to the rich arithmetic units in time; the modern vector supercomputers such as NEC SX-7 and SX-8 have 4 bytes per flop (4B/FLOP) on the ratio of memory bandwidth to floating-point operations. However, the gap in performance between memory and processors has become remarkably exposed year by year in high performance computing. Therefore, it is getting harder to keep the 4B/FLOP memory bandwidth in design of future vector supercomputers. As a promising solution to cover a lack of the memory bandwidths of vector load/store units of the future vector supercomputers, we design an on-chip vector cache for the NEC SX vector processor architecture. This paper evaluates the performance of the on-chip cache memory system on the SX-7 system with 2B/FLOP or lower memory bandwidth across two kernel loops and five leading scientific applications. The results of the kernel loops demonstrate that a 2B/FLOP memory system with the on-chip cache whose hit ratio is 50% can achieve a performance comparable to that of a 4B/FLOP system without the cache. The results of the four applications indicate that the on-chip cache can improve sustained performance of the four applications by 20% to 98%. The experimental results regarding the last one show a conflicting effect of loop unrolling with vector caching, resulting in a poor hit rate. However, when loop-unrolling is disabled, its cache hit rate is improved, and the sustained performance comparable to that of the 4B/FLOP memory bandwidth without the loop-unrolling is obtained. In addition, selective caching, in which only a part of data with the high locality of reference are cached, is also effective for efficient use of the limited cache capacity.
Published: 2007
Full Text: View/download PDF

46. Implications of Memory Performance for Highly Efficient Supercomputing of Scientific Applications

Author: Akihiro Musa, Takashi Soga, Hiroaki Kobayashi, Koki Okabe, and Hiroyuki Takizawa
Subjects: Floating point, Memory bank, CPU cache, Computer science, Parallel algorithm, Memory bandwidth, Computing with Memory, Parallel computing, FLOPS, Supercomputer
Abstract: This paper examines the memory performance of the vector-parallel and scalar-parallel computing platforms across five applications of three scientific areas; electromagnetic analysis, CFD/heat analysis, and seismology. Our evaluation results show that the vector platforms can achieve the high computational efficiency and hence significantly outperform the scalar platforms in the areas of these applications. We did exhaustive experiments and quantitatively evaluated representative scalar and vector platforms using real applications from the viewpoint of the system designers and developers. These results demonstrate that the ratio of memory bandwidth to floating-point operation rate needs to reach 4-bytes/flop to preserve the computational performance with hiding the memory access latencies by pipelined vector operations in the vector platforms. We also confirm that the enough number of memory banks to handle stride memory accesses leads to an increase in the execution efficiency. On the scalar platforms, the cache hit rate needs to be almost 100% to achieve the high computational efficiency.
Published: 2006
Full Text: View/download PDF

47. A Stream Programming Language for GPU Computing

Author: Hiroyuki Takizawa
Subjects: Stream processing, Programming language, Computer science, Stream programming, Fourth-generation programming language, Parallel computing, General-purpose computing on graphics processing units, computer.software_genre, Programming language implementation, computer
Published: 2008
Full Text: View/download PDF
