82 results on "Mitsuhisa Sato"
Search Results
2. Performance analysis of a state vector quantum circuit simulation on A64FX processor
- Author
- Miwako Tsuji and Mitsuhisa Sato
- Published
- 2022
3. The Supercomputer 'Fugaku'
- Author
- Mitsuhisa Sato
- Published
- 2022
4. Performance Evaluation and Analysis of A64FX many-core Processor for the Fiber Miniapp Suite
- Author
- Mitsuhisa Sato and Miwako Tsuji
- Subjects
- Instruction set, Computer science, Computer cluster, Fiber (computer science), Process (computing), Instruction scheduling, Parallel computing, Thread (computing), Supercomputer
- Abstract
In recent years, there has been growing interest in Arm-based processors for high performance computing systems, such as the supercomputer Fugaku, which uses the A64FX Arm-based processor. We evaluated the performance of the A64FX processor using the Fiber Miniapp suite, investigating various numbers of MPI processes and OpenMP threads as well as different methods of assigning MPI processes and OpenMP threads. In addition to the performance evaluation, we present a performance comparison with other processors and some performance analysis. Our experiments suggest that while shorter OpenMP thread strides perform better in most mini applications, MPI process allocation methods do not have a large impact on performance. For some applications run "as-is" with small data sets, the A64FX shows poor performance, but it can be improved by enhancing SIMD vectorization and changing instruction scheduling during compilation. For other applications and data sets, the performance of the A64FX is better than or comparable with that of other processors.
- Published
- 2021
5. Evaluation of SPEC CPU and SPEC OMP on the A64FX
- Author
- Masaaki Kondo, Mitsuhisa Sato, and Yuetsu Kodama
- Subjects
- Instruction set, Xeon, Power demand, Computer science, Benchmark (computing), Bandwidth (computing), Parallel computing, Supercomputer, Power control
- Abstract
We evaluated the A64FX processor used in the supercomputer Fugaku using the SPEC CPU and SPEC OMP benchmark suites. As a result, we found the performance of the A64FX processor, which had 48 cores, was lower than that of the Xeon with dual sockets of 24 cores each in SPEC CPU int and fp. In SPEC OMP, owing to the effect of the Xeon's Hyper-Threading, the A64FX performance was lower than that of a single-socket Xeon with 28 cores. However, in several benchmarks in SPEC CPU fp and SPEC OMP, the A64FX performance was higher due to its high memory bandwidth. In addition, by comparing performance and power using the power control mechanism of the A64FX, it was confirmed that power can be reduced without affecting performance when not all cores are used.
- Published
- 2021
6. Power/Performance/Area Evaluations for Next-Generation HPC Processors using the A64FX Chip
- Author
- Eishi Arima, Tetsuya Odajima, Miwako Tsuji, Mitsuhisa Sato, and Yuetsu Kodama
- Subjects
- Computer science, Pipeline (computing), Embedded system, Node (circuits), SIMD, Performance improvement, Chip, FLOPS, Bottleneck
- Abstract
Future HPC systems, including post-exascale supercomputers, will face severe problems such as the slowing-down of Moore's law and the limitation of power supply. To achieve the desired system performance improvement while counteracting these issues, hardware design optimization is a key factor. In this paper, we investigate the future directions of SIMD-based processor architectures by using the A64FX chip and a customized version of power/performance/area simulators, i.e., Gem5 and McPAT. More specifically, based on the A64FX chip, we first customize various energy parameters in the simulators, and then evaluate the power and area reductions obtained by scaling the technology node down to 3nm. Moreover, we also investigate the achievable FLOPS improvement at 3nm by scaling the number of cores, SIMD width, and FP pipeline width under power/area constraints. The evaluation result indicates that no further SIMD/pipeline width scaling will help improve FLOPS due to the memory system bottleneck, especially on the L1 data caches and FP register files. Based on these observations, we discuss the future directions of SIMD-based HPC processors.
- Published
- 2021
7. Co-Design for A64FX Manycore Processor and 'Fugaku'
- Author
- Atsushi Furuya, Ikuo Miyoshi, Yuetsu Kodama, Naoyuki Shida, Hirofumi Tomita, Kouichi Hirai, Akira Asato, Mitsuhisa Sato, Tetsuya Odajima, Toshiyuki Shimizu, Yutaka Ishikawa, Masaki Aoki, Hisashi Yashiro, Miwako Tsuji, and Kuniki Morita
- Subjects
- Instruction set, Manycore processor, Computer architecture, Computer science, Scalability, Supercomputer, Exascale computing
- Abstract
We have been carrying out the FLAGSHIP 2020 Project to develop the Japanese next-generation flagship supercomputer, the Post-K, recently named "Fugaku". We designed an original many-core processor, the A64FX, based on the Armv8 instruction set with the Scalable Vector Extension (SVE), as well as a system including the interconnect and a storage subsystem, with the industry partner, Fujitsu. The "co-design" of the system and applications is a key to making it power-efficient and high-performance. We determined many architectural parameters by reflecting the analysis of a set of target applications provided by the application teams. In this paper, we present the pragmatic practice of our co-design effort for "Fugaku". As a result, the system has proven to be very power-efficient, and it is confirmed that the performance of some target applications using the whole system is more than 100 times that of the K computer.
- Published
- 2020
8. Preliminary Performance Evaluation of the Fujitsu A64FX Using HPC Applications
- Author
- Mitsuhisa Sato, Tetsuya Odajima, Miwako Tsuji, Yutaka Maruyama, Motohiko Matsuda, and Yuetsu Kodama
- Subjects
- Loop fission, Xeon, Computer science, Operating system, Bandwidth (computing), Compiler, Supercomputer
- Abstract
RIKEN Center for Computational Science has been installing the supercomputer Fugaku. The Fujitsu A64FX, based on the Armv8.2-A+SVE architecture, is used in the system. In this paper, we evaluated seven HPC applications and benchmarks on the A64FX. In a performance comparison with the Marvell (Cavium) ThunderX2 and Intel Xeon Skylake processors, the A64FX achieved higher performance in a memory bandwidth-intensive application thanks to its high memory bandwidth. However, we confirmed that the performance of the A64FX decreased owing to a lack of out-of-order resources. To mitigate this problem, the "loop fission" function of the Fujitsu compiler was used to improve the performance.
- Published
- 2020
9. Performance Evaluation of Supercomputer Fugaku using Breadth-First Search Benchmark in Graph500
- Author
- Koji Ueno, Masahiro Nakao, Katsuki Fujisawa, Mitsuhisa Sato, and Yuetsu Kodama
- Subjects
- Computer science, Breadth-first search, Benchmark (computing), Graph theory, Parallel computing, Supercomputer, Graph500
- Abstract
There is increasing demand for the high-speed processing of large-scale graphs in various fields. However, such graph processing requires irregular calculations, making it difficult to scale performance on large-scale distributed memory systems. Against this background, Graph500, a competition for evaluating the performance of large-scale graph processing, has been held. We developed breadth-first search (BFS), which is one of the benchmark kernels used in Graph500, and took the top spot a total of 10 times using the K computer. In this paper, we tune BFS performance and evaluate it using the supercomputer Fugaku, which is the successor to the K computer. The results of evaluating BFS for a large-scale graph composed of about 1.1 trillion vertices and 17.6 trillion edges using 92,160 nodes of Fugaku indicate that Fugaku has 2.27 times the performance of the K computer. Fugaku took the top spot on Graph500 in June 2020.
- Published
- 2020
10. Evaluation of Power Management Control on the Supercomputer Fugaku
- Author
- Mitsuhisa Sato, Eishi Arima, Yuetsu Kodama, and Tetsuya Odajima
- Subjects
- Power management, TOP500, Computer science, Clock rate, Supercomputer, Memory management, Embedded system, Low-power electronics, Central processing unit, Power control, Efficient energy use
- Abstract
The supercomputer “Fugaku”, which recently ranked number one on multiple supercomputing lists, including the Top500 in June 2020, has various power control features, such as (1) an eco mode that utilizes only one of two floating-point pipelines while decreasing the power supply to the chip; (2) a boost mode that increases clock frequency; and (3) a core retention function that turns unused cores into a low-power state. By orchestrating these power-performance features while considering the characteristics of currently running applications, we can potentially gain even better system-level energy efficiency. In this article, we report on the effectiveness of these features using the pre-evaluation environment for Fugaku. As a result, we confirmed several prominent results useful for the operation of the Fugaku system, including: remarkable power reduction and energy-efficiency improvement by coordinating the eco mode and the core retention feature in the memory intensive case; a 10% speed-up with a 17% power consumption increase using the boost mode in the CPU intensive case; and considerable power variations across over 20K nodes.
- Published
- 2020
11. The Supercomputer 'Fugaku' and Arm-SVE enabled A64FX processor for energy-efficiency and sustained application performance
- Author
- Mitsuhisa Sato
- Subjects
- Interconnection, Computer science, Supercomputer, Microarchitecture, Manycore processor, Computer architecture, Efficient energy use
- Abstract
We have been carrying out the FLAGSHIP 2020 Project to develop the Japanese next-generation flagship supercomputer, the Post-K, recently named "Fugaku". In the project, we designed a new Arm-SVE enabled processor, called A64FX, as well as the system including the interconnect, with the industry partner, Fujitsu. The processor is designed for energy efficiency and sustained application performance. In the design of the system, the "co-design" of the system and applications is key to making it efficient and high-performance. We analyzed a set of target applications provided by the application teams to design the processor architecture and decide many architectural parameters. Fugaku is being installed and is scheduled to be put into operation for public service around 2021. In this talk, several features and some preliminary performance results of the Fugaku system and the A64FX manycore processor will be presented, as well as an overview of the system.
- Published
- 2020
12. Metaprogramming Framework for Existing HPC Languages Based on the Omni Compiler Infrastructure
- Author
- Masahiro Nakao, Jinpil Lee, Mitsuhisa Sato, and Hitoshi Murai
- Subjects
- Loop unrolling, Computer science, Fortran, Supercomputer, Metaprogramming, Abstract syntax, Compiler, Software engineering
- Abstract
Recently, low productivity owing to increasingly complicated programs has become a serious problem in the field of High Performance Computing (HPC). Omni is a compiler infrastructure based on source-to-source translation for Fortran and C, developed by RIKEN and the University of Tsukuba. We are developing a metaprogramming framework for existing HPC languages, including Fortran, based on Omni, with the goal of higher productivity for HPC programs. In this paper, we show the design and prototype implementation of this framework, which is based on directives and abstract syntax trees, and evaluate its feasibility and effectiveness. Through case studies of loop unrolling and the data-layout optimization of derived types, it is verified that various kinds of code transformations can be specified with this framework to improve program productivity.
- Published
- 2018
13. Power performance analysis of ARM scalable vector extension
- Author
- Mitsuhisa Sato, Yuetsu Kodama, and Tetsuya Odajima
- Subjects
- Computer science, Scalability, Power performance, Energy consumption, Parallel computing, SIMD
- Abstract
Recent CPUs not only have multiple cores but also support wide SIMD (Single Instruction Multiple Data) instructions. This trend is expected to continue in the future. In this paper, we evaluate computing performance and energy consumption for multiple vector lengths using the ARM Scalable Vector Extension. From our evaluations, we confirm that a longer vector length with multi-cycle vector units results in up to approximately 38% better performance and 23% lower energy consumption than a shorter vector length.
- Published
- 2018
14. Implementing Lattice QCD Application with XcalableACC Language on Accelerated Cluster
- Author
- Taisuke Boku, Hidetoshi Iwashita, Masahiro Nakao, Hitoshi Murai, Mitsuhisa Sato, and Akihiro Tabuchi
- Subjects
- CUDA, Computer science, Programming complexity, Parallel computing, Lattice QCD
- Abstract
Accelerated clusters, which are distributed memory systems equipped with accelerators, have been used in various fields. For accelerated clusters, programmers often implement their applications using a combination of MPI and CUDA (MPI+CUDA). However, this approach suffers from programming complexity. This paper introduces the XcalableACC (XACC) language, which is a hybrid model of XcalableMP (XMP) and OpenACC. XMP is a directive-based language for distributed memory systems, and OpenACC is a directive-based language for accelerators. XACC enables programmers to develop applications on accelerated clusters with ease. To evaluate XACC performance and productivity levels, we implemented a lattice quantum chromodynamics (Lattice QCD) application using XACC on 64 compute nodes and 256 GPUs and found its performance was almost the same as that of MPI+CUDA. Moreover, we found that XACC requires much less change from the serial Lattice QCD code than MPI+CUDA to implement the parallel Lattice QCD code.
- Published
- 2017
15. Preliminary Performance Evaluation of Application Kernels Using ARM SVE with Multiple Vector Lengths
- Author
- Miwako Tsuji, Mitsuhisa Sato, Tetsuya Odajima, Motohiko Matsuda, Jinpil Lee, and Yuetsu Kodama
- Subjects
- Instruction set, Out-of-order execution, Computer science, Parallel computing, SIMD, Cache
- Abstract
Modern high performance processors are equipped with very wide SIMD instruction sets. SVE (Scalable Vector Extension) is an ARM SIMD technology that supports vector lengths from 128 bits to 2048 bits. One of its promising features is "vector-length agnostic" programming, which allows the same SVE code to run on hardware of any vector length without modification. This feature is useful for exploring the best vector length with appropriate hardware resources in the space of various combinations of hardware parameters, in order to make more efficient use of hardware resources, since the same vectorized SIMD code can be used. In this paper, we report the performance of application kernels using ARM SVE with multiple vector lengths while keeping the hardware resources the same. We have confirmed that when the performance of a program is limited by a bottleneck of a long chain of arithmetic operations or instruction issues, performance can be improved by increasing the vector length. However, a sufficient number of physical registers is necessary for this improvement; when the number of physical registers was too small, performance could even be reduced. When performance is limited by access bandwidth to cache and memory, the vector length does not affect performance significantly.
- Published
- 2017
16. A Performance Projection of Mini-Applications onto Benchmarks Toward the Performance Projection of Real-Applications
- Author
- William Kramer, Mitsuhisa Sato, and Miwako Tsuji
- Subjects
- Computer science, Parallel computing, Supercomputer, Computer engineering, Benchmark (computing), Performance metric
- Abstract
Widely used benchmarks, such as High Performance Linpack (HPL), do not always provide direct insight into the actual application performance of systems, and there have been criticisms that the performance of simplified benchmarks such as HPL no longer strongly correlates with real application performance. In contrast, performance evaluations based on real or mini applications may give a direct estimate of application performance. The Sustained System Performance (SSP) metric, which evaluates systems based on the performance at scale of various applications, has been successfully adopted to procure systems at the National Energy Research Scientific Computing Center (NERSC), the National Center for Supercomputing Applications (NCSA), and other facilities. However, significant effort is required to tune and optimize several mini applications for each system. In this paper, we propose a new performance metric, the Simplified Sustained System Performance (SSSP) metric, based on a suite of simple benchmarks, which enables performance projection onto a suite of mini applications. While the SSP metric is calculated over a set of applications, the SSSP metric applies the same methodology to a set of benchmarks. Weighting factors for the benchmarks are introduced so that the SSSP metric approximates the original SSP metric more accurately, and we determine them with a simple learning algorithm. Our preliminary experiments show that although our metric remains easy to measure, because it is based on a combination of simple benchmarks, it can provide projections of application performance.
- Published
- 2017
17. Implementation and Evaluation of One-Sided PGAS Communication in XcalableACC for Accelerated Clusters
- Author
- Mitsuhisa Sato, Taisuke Boku, Masahiro Nakao, Akihiro Tabuchi, and Hitoshi Murai
- Subjects
- Computer science, Fortran, Graphics processing unit, Parallel computing, Computational science, Synchronization (computer science), Programming paradigm, Benchmark (computing), Compiler, Partitioned global address space
- Abstract
Clusters equipped with accelerators such as graphics processing units (GPUs) and Many Integrated Core (MIC) are widely used. For such clusters, programmers write programs for their applications by combining MPI with one of the available accelerator programming models. In particular, OpenACC enables programmers to develop their applications easily, but overall productivity remains low owing to the complexity of MPI programming. XcalableACC (XACC) is a new programming model, which is an "orthogonal" integration of the partitioned global address space (PGAS) language XcalableMP (XMP) and OpenACC. While XMP enables distributed-memory programming in both global-view and local-view models, OpenACC allows operations to be offloaded to a set of accelerators. In the local-view model, programmers can describe communication with the coarray features adopted from Fortran 2008, and we extend them to communication between accelerators. We have designed and implemented an XACC compiler for NVIDIA GPUs and evaluated its performance and productivity using two benchmarks, the Himeno benchmark and NAS Parallel Benchmarks CG (NPB-CG). The performance of the XACC versions of the Himeno benchmark and NPB-CG in the local-view model is over 85% and 97%, respectively, of the MPI+OpenACC versions. Moreover, using non-blocking communication raises the local-view performance of the Himeno benchmark to over 89%. From the viewpoint of productivity, the local-view model provides an intuitive form of array assignment statement for communication.
- Published
- 2017
18. Preliminary Implementation of Coarray Fortran Translator Based on Omni XcalableMP
- Author
- Masahiro Nakao, Hidetoshi Iwashita, and Mitsuhisa Sato
- Subjects
- Source code, Programming language, Computer science, Fortran, Message Passing Interface, Parallel computing, Porting, Benchmark (computing), Partitioned global address space, Compiler, Coarray Fortran
- Abstract
XcalableMP (XMP) is a PGAS language for distributed memory environments. It employs Coarray Fortran (CAF) features as the local-view programming model. We implemented the main part of CAF in the form of a translator, i.e., a source-to-source compiler, as part of the Omni XMP compiler. The compiler uses GASNet and the Fujitsu RDMA interface to allocate static and allocatable coarrays and to get and put coindexed objects while avoiding ill effects in the backend Fortran compiler. The evaluation of the Himeno benchmark shows that ported CAF programs compiled with the Omni compiler offer high performance on par with the original Message Passing Interface (MPI) program, despite having 32% fewer lines of source code.
- Published
- 2015
19. Hybrid Communication with TCA and InfiniBand on a Parallel Programming Language XcalableACC for GPU Clusters
- Author
- Akihiro Tabuchi, Taisuke Boku, Tetsuya Odajima, Masahiro Nakao, Hitoshi Murai, Toshihiro Hanawa, and Mitsuhisa Sato
- Subjects
- Computer architecture, Computer science, Stencil code, Parallel programming model, Scalability, InfiniBand, GPU cluster, Partitioned global address space, Communications system, PCI Express
- Abstract
For the execution of parallel HPC applications on GPU-ready clusters, high communication latency between GPUs across nodes is a serious problem for strong scalability. To reduce the communication latency between GPUs, we proposed the Tightly Coupled Accelerator (TCA) architecture and developed the PEACH2 board as a proof-of-concept interconnect for TCA. Although PEACH2 provides very low communication latency, its PCIe-based implementation imposes hardware limitations, such as the practical number of nodes in a system, currently 16, which forms what we call a sub-cluster. Larger numbers of nodes must be connected by a conventional interconnect such as InfiniBand, so the entire network is configured as a hybrid of a global conventional network and local high-speed PEACH2 networks. For ease of programming, it is desirable to operate such a complicated communication system at the library or language level, which hides the system from the user. In this paper, we develop a hybrid interconnection network system combining PEACH2 and InfiniBand and implement it on a high-level PGAS language for accelerated clusters named XcalableACC (XACC). A preliminary performance evaluation confirms that the hybrid network improves the performance of the Himeno benchmark for stencil computation by up to 40% relative to MVAPICH2 with GDR on InfiniBand. Additionally, Allgather collective communication with the hybrid network improves performance by up to 50% for networks of 8 to 16 nodes. The combination of local communication, supported by the low latency of PEACH2, and global communication, supported by the high bandwidth and scalability of InfiniBand, improves overall performance.
- Published
- 2015
20. A Design of a Communication Library between Multiple Sets of MPI Processes for MPMD
- Author
- Hitoshi Murai, Mitsuhisa Sato, and Takenori Shimosaka
- Subjects
- Programming language, Computer science, Programming paradigm, Benchmark (computing), Parallel computing, Partitioned global address space
- Abstract
The MPMD programming model is widely used for master-worker programs and for coupling programs of multiple physical models. To utilize recent high-end parallel computers with more than several thousand nodes, we propose MPMPI, a communication library between multiple sets of MPI processes in the MPMD model. In particular, we present the MPMPI interfaces, including interfaces for a PGAS language, and the basic performance of MPMPI functions. As benchmark programs for the MPMPI library, we evaluated the performance of a master-worker program and a weak-coupling program. As a result, we found that Pack/Unpack has a large influence on the performance of MPMPI functions, that the MPMPI interfaces can easily be used in these benchmark programs written in the XcalableMP PGAS language, and that the performance of the master-worker and weak-coupling benchmark programs using the basic MPMPI functions is practical under the conditions of this paper.
- Published
- 2014
21. XcalableACC: Extension of XcalableMP PGAS Language Using OpenACC for Accelerator Clusters
- Author
- Akihiro Tabuchi, Hitoshi Murai, Masahiro Nakao, Taisuke Boku, Yuetsu Kodama, Mitsuhisa Sato, Toshihiro Hanawa, and Takenori Shimosaka
- Subjects
- Remote direct memory access, Source lines of code, Computer science, Message Passing Interface, InfiniBand, Programming paradigm, Benchmark (computing), Partitioned global address space, Compiler, Parallel computing
- Abstract
The present paper introduces the XcalableACC (XACC) programming model, which is a hybrid model of the XcalableMP (XMP) Partitioned Global Address Space (PGAS) language and OpenACC. XACC defines directives that enable programmers to mix XMP and OpenACC directives in order to develop applications that can use accelerator clusters with ease. Moreover, in order to improve the performance of stencil applications, the Omni XACC compiler provides functions that can transfer a halo region on accelerator memory via Tightly Coupled Accelerators (TCA), a proprietary network for transferring data directly among accelerators. In the present paper, we evaluate the productivity and performance of XACC through implementations of the Himeno benchmark. The results show that, thanks to its productivity improvements, XACC requires less than half the source lines of code compared to a combination of the Message Passing Interface (MPI) and OpenACC, which are commonly used together as a typical programming model. In terms of performance, XACC using TCA achieved up to 2.7 times faster performance than the combination of OpenACC and MPI using GPUDirect RDMA over InfiniBand.
- Published
- 2014
22. Grid-Oriented Process Clustering System for Partial Message Logging
- Author
- Yuki Todoroki, Hideyuki Jitsumoto, Yutaka Ishikawa, and Mitsuhisa Sato
- Subjects
- Mean time between failures, Computational complexity theory, Computer science, Computer cluster, Distributed computing, Graph partition, Fault tolerance, Grid, Lattice graph, Cluster analysis
- Abstract
In a computer cluster composed of many nodes, the mean time between failures becomes shorter as the number of nodes increases. This may mean that lengthy tasks cannot be performed, because they will be interrupted by failure. Therefore, fault tolerance has become an essential part of high-performance computing. Partial message logging forms clusters of processes, coordinating checkpoints within each cluster and logging messages between clusters. Our study proposes a system with two features to improve the efficiency of partial message logging: 1) the communication log used in clustering is recorded at runtime, and 2) a graph partitioning algorithm reduces the complexity of the system by geometrically partitioning a grid graph. The proposed system is evaluated by executing a scientific application, and the results of process clustering are compared to existing methods in terms of clustering performance and quality.
- Published
- 2014
23. Victim Selection and Distributed Work Stealing Performance: A Case Study
- Author
- Mitsuhisa Sato and Swann Perarnau
- Subjects
- Shared memory, Computer science, Work stealing, Computation, Distributed computing
- Abstract
Work stealing is a popular solution for dynamic load balancing of irregular computations on both shared memory and distributed memory systems. While the shared memory performance of work stealing is well understood, distributing this algorithm to several thousands of nodes can introduce new performance issues. In particular, most studies of work stealing assume that all participating processes are equidistant from each other in terms of communication latency. This paper presents a new performance evaluation of the popular UTS benchmark, in its work stealing implementation, at the scale of tens of thousands of compute nodes. Taking advantage of the physical scale of the K computer, we investigate in detail the performance impact of communication latencies on work stealing. In particular, we introduce a new performance metric to assess the time needed by the work stealing scheduler to distribute work among all processes. Using this metric, we identify a previously overlooked issue: the victim selection function used by the work stealing application can severely impact its performance at large scale. To solve this issue, we introduce a new strategy that takes the physical distance between nodes into account and achieves significant performance improvements.
- Published
- 2014
24. A PGAS Execution Model for Efficient Stencil Computation on Many-Core Processors
- Author
- Mitsuru Ikei and Mitsuhisa Sato
- Subjects
Instruction set ,Data access ,Shared memory ,Computer science ,Stencil code ,Global Arrays ,Parallel computing ,Partitioned global address space ,Execution model ,Blocking (computing) - Abstract
An efficient PGAS execution model for stencil computation on many-core processors is proposed and implemented. We use XcalableMP as a base language and modify its runtime to fit many-core processors well. The runtime uses processes for parallel execution, and the global arrays of the stencil codes are broken into blocked sub-arrays placed in shared memory. Using two stencil codes, Laplace and Himeno, we evaluated its performance. The evaluation shows that (1) blocking improves the locality of memory accesses during computation and therefore improves total CPU execution time, and (2) direct data access through shared memory can relieve the communication burden of sub-array halo exchanges.
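The blocked layout and halo exchange described above can be illustrated with a minimal 1-D Jacobi sketch (an illustration of the data layout, not the XcalableMP runtime itself): each block carries one halo cell on each side, neighbors swap boundary values, and then every block updates its interior independently.

```python
def jacobi_step(blocks):
    """One Jacobi sweep over a 1-D domain split into blocks, each with a
    one-cell halo at both ends. A real runtime placing the blocks in
    shared memory could read a neighbor's boundary directly instead of
    copying it."""
    # halo exchange: copy boundary cells between neighboring blocks
    for left, right in zip(blocks, blocks[1:]):
        left[-1] = right[1]      # fill right halo of `left`
        right[0] = left[-2]      # fill left halo of `right`
    # local update on interior cells only (halos and fixed ends untouched)
    for b in blocks:
        interior = [(b[i - 1] + b[i + 1]) / 2 for i in range(1, len(b) - 1)]
        b[1:-1] = interior
    return blocks

# two blocks over a 4-point domain with fixed boundary values 1.0 and 0.0
blocks = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
for _ in range(300):
    jacobi_step(blocks)   # converges to the linear profile 1.0 -> 0.0
```

Because all interior values are computed before any are written back within a block, the sweep is a true Jacobi update; the exchange-then-compute structure is exactly what the shared-memory placement optimizes.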
- Published
- 2014
25. Multiple-SPMD Programming Environment Based on PGAS and Workflow toward Post-petascale Computing
- Author
-
Maxime Hugues, Serge G. Petiton, Mitsuhisa Sato, and Miwako Tsuji
- Subjects
Petascale computing ,Workflow ,Computer science ,Distributed computing ,Parallel programming model ,Message passing ,Partitioned global address space ,Parallel computing ,SPMD ,Workflow management system ,Workflow technology - Abstract
In this paper, we propose a new development and execution environment based on workflow and PGAS methodologies for parallel programming on post-petascale systems. It is expected that post-petascale systems will have a huge and highly hierarchical architecture, with nodes of many-core processors and accelerators. With current parallel programming models such as MPI and MPI/OpenMP hybrids, it would sometimes be difficult to exploit post-petascale systems efficiently. The proposed environment, called FP2C (Framework for Post-Petascale Computing), supports multi-program methodologies across multiple architectural levels. It introduces a PGAS parallel programming language called XcalableMP (XMP) to describe tasks within a workflow environment called YML. FP2C is composed of three layers: (1) workflow programming, (2) distributed programming, and (3) shared-memory parallel programming/accelerators. Computational experiments suggest that effective use of cores and memories can be achieved by controlling the level of hierarchization using FP2C.
- Published
- 2013
26. Towards exascale with the ANR-JST Japanese-French Project FP3C
- Author
-
Alfredo Buttari, Mitsuhisa Sato, Serge G. Petiton, Nahid Emad, Satoshi Matsuoka, M. Dayde, P. Codognet, Tetsuya Sakurai, Yutaka Ishikawa, Gabriel Antoniu, Christophe Calvin, Raymond Namyst, Taisuke Boku, Hiroshi Nakashima, Kengo Nakajima, and G. Joslin
- Subjects
Runtime system ,Software ,Computer architecture ,Parallel processing (DSP implementation) ,Exploit ,business.industry ,Computer science ,Programming paradigm ,Parallel computing ,Architecture ,business ,Exascale computing - Abstract
The Japanese-French FP3C (Framework and Programming for Post-Petascale Computing) project ANR/JST-2010-JTIC-003 aims at studying the software technologies, languages, and programming models on the road to exascale computing. The ability to efficiently exploit these future systems is challenging because of their ultra-large scale and highly hierarchical architecture, with computational nodes including many-core processors and accelerators. We give an overview of some of the main issues explored within the project.
- Published
- 2013
27. Interconnection Network for Tightly Coupled Accelerators Architecture
- Author
-
Taisuke Boku, Mitsuhisa Sato, Yuetsu Kodama, and Toshihiro Hanawa
- Subjects
Interconnection ,Network packet ,Computer science ,business.industry ,Embedded system ,Architecture ,Latency (engineering) ,Chip ,business ,Field-programmable gate array ,Computing systems ,PCI Express - Abstract
In recent years, heterogeneous clusters using accelerators have entered widespread use in high-performance computing systems. In such clusters, inter-node communication between accelerators normally requires several memory copies via CPU memory, which results in communication latency that causes severe performance degradation. To address this problem, we propose the Tightly Coupled Accelerators (TCA) architecture, which is capable of reducing the communication latency between accelerators on different nodes. In the TCA architecture, PCI Express (PCIe) packets are used for direct inter-node communication between accelerators. In addition, we designed a communication chip, named PCI Express Adaptive Communication Hub Version 2 (PEACH2), to realize the proposed TCA architecture. In this paper, we introduce the design and implementation of the PEACH2 chip using a field-programmable gate array (FPGA), and present a PEACH2 board designed for use as a PCIe extension board. The results of evaluations using ping-pong programs on an eight-node TCA cluster demonstrate that the PEACH2 chip achieves 95% of the theoretical peak performance and a latency of 0.96 μsec.
- Published
- 2013
28. Tightly Coupled Accelerators Architecture for Minimizing Communication Latency among Accelerators
- Author
-
Mitsuhisa Sato, Toshihiro Hanawa, Yuetsu Kodama, and Taisuke Boku
- Subjects
Interconnection ,Remote direct memory access ,business.industry ,Computer science ,Embedded system ,GPU cluster ,Latency (engineering) ,General-purpose computing on graphics processing units ,business ,Chip ,Supercomputer ,PCI Express - Abstract
In recent years, heterogeneous clusters using accelerators have been widely used in high-performance computing systems. In such clusters, inter-node communication among accelerators requires several memory copies via CPU memory, and the resulting communication latency causes severe performance degradation. In order to address this problem, we propose the Tightly Coupled Accelerators (TCA) architecture to reduce the communication latency between accelerators on different nodes. In addition, we promote the HA-PACS project at the Center for Computational Sciences, University of Tsukuba, in order to build up the HA-PACS base cluster system as a commodity GPU cluster and to develop an experimental system based on the TCA architecture as a proprietary interconnection network connecting accelerators across nodes. In the present paper, we describe the TCA architecture and the design and implementation of PEACH2 for realizing it. We also evaluate the functionality and the basic performance of the PEACH2 chip; the results demonstrate that the chip delivers sufficient performance, achieving 93% of the theoretical peak and a latency between adjacent nodes of approximately 0.8 μsec.
- Published
- 2013
29. Model Checking Stencil Computations Written in a Partitioned Global Address Space Language
- Author
-
Mitsuhisa Sato, Tatsuya Abe, and Toshiyuki Maeda
- Subjects
Model checking ,Programming language ,Stencil code ,Computer science ,Computation ,media_common.quotation_subject ,Parallel computing ,Abstraction model checking ,computer.software_genre ,Stencil ,Programming style ,Partitioned global address space ,computer ,Formal verification ,media_common - Abstract
This paper proposes an approach to software model checking of stencil computations written in partitioned global address space (PGAS) languages. Although a stencil computation offers a simple and powerful programming style, it becomes error-prone when considering optimization and parallelization. In the proposed approach, the state explosion problem associated with model checking (that is, where the number of states to be explored increases dramatically) is avoided by introducing abstractions suitable for stencil computation. In addition, this paper also describes XMP-SPIN, our model checker for XcalableMP (XMP), a PGAS language that provides support for implementing parallelized stencil computations. One distinguishing feature of XMP-SPIN is that users are able to define their own abstractions in a simple and flexible way. The proposed abstractions are implemented as user-defined abstractions. This paper also presents experimental results for model checking stencil computations using XMP-SPIN. The results demonstrate the effectiveness and practicality of the proposed approach and XMP-SPIN.
- Published
- 2013
30. GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Parallelized Accelerated Computing
- Author
-
Taisuke Boku, Mitsuhisa Sato, Jinpil Lee, Tetsuya Odajima, and Toshihiro Hanawa
- Subjects
Parallel language ,CUDA ,Multi-core processor ,Computer science ,CUDA Pinned memory ,Graphics processing unit ,Compiler ,GPU cluster ,Parallel computing ,General-purpose computing on graphics processing units ,computer.software_genre ,computer - Abstract
In this paper, we propose a framework that enables work sharing of parallel processing through the coordination of CPUs and GPUs on hybrid PC clusters, based on the high-level parallel language XcalableMP-dev. Basic XcalableMP enables high-level parallel programming by adding directives to sequential code that support data distribution and loop/task distribution among multiple nodes of a PC cluster. XcalableMP-dev is an extension of XcalableMP for hybrid PC clusters, where each node is equipped with accelerated computing devices such as GPUs or many-core environments. Our new framework, named XcalableMP-dev/StarPU, enables the distribution of data and loop execution among multiple GPUs and multiple CPU cores on each node. We employ the StarPU run-time system for task management with dynamic load balancing. Because of the large performance gap between CPUs and GPUs, the key issue for work sharing among CPU and GPU resources is controlling the task size assigned to different devices. Since the compiler of the new system is still under construction, we evaluated the performance of hybrid work sharing on four nodes of a GPU cluster and confirmed that it achieves up to 1.4 times the performance of GPU-only execution with the traditional XcalableMP-dev system on NVIDIA CUDA.
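The task-size control issue raised above can be made concrete with a small sketch (a hypothetical helper, not part of XcalableMP-dev or StarPU): loop iterations are split among devices in proportion to their measured throughput, so a GPU several times faster than a CPU core receives a correspondingly larger chunk.

```python
def split_range(n, throughputs, min_chunk=1):
    """Split n loop iterations among devices in proportion to their
    measured throughput (iterations/second). Each device gets at least
    min_chunk iterations; the integer-division remainder goes to the
    last device so the chunks always cover the whole range."""
    total = sum(throughputs)
    chunks = [max(min_chunk, int(n * t / total)) for t in throughputs]
    chunks[-1] += n - sum(chunks)
    return chunks

# one GPU roughly 8x faster than each of two CPU cores
parts = split_range(1000, [8.0, 1.0, 1.0])
```

In a dynamic scheduler like StarPU the throughputs would be refined at runtime from observed task execution times rather than fixed up front.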
- Published
- 2012
31. On-the-Fly Synchronization Checking for Interactive Programming in XcalableMP
- Author
-
Tatsuya Abe and Mitsuhisa Sato
- Subjects
Source code ,Interactive programming ,Computer science ,On the fly ,Programming language ,media_common.quotation_subject ,Partitioned global address space ,computer.software_genre ,Directive ,computer ,Synchronization ,media_common - Abstract
XcalableMP (XMP) is a directive-based partitioned global address space language. In XMP, programmers can include explicit synchronizations by adding directives to their source code. In this sense, XMP provides programmers with performance awareness: part of the performance of a program can be attributed to the programmer, i.e., XMP requires interactive programming. In this paper, we introduce a tool that alerts programmers to missing synchronizations, detected by referring to non-local array indices, and to redundant synchronizations, which decrease program performance. The tool uses XMP directives, which make programs more structured, and checks on the fly whether directives are missing or redundant while programmers are editing their programs.
- Published
- 2012
32. An asynchronous parallel genetic algorithm for the maximum likelihood phylogenetic tree search
- Author
-
Akifumi S. Tanabe, Yuji Inagaki, Mitsuhisa Sato, Tetsuo Hashimoto, and Miwako Tsuji
- Subjects
education.field_of_study ,Theoretical computer science ,Phylogenetic tree ,Computer science ,Population ,Parallel algorithm ,Tree (data structure) ,Phylogenetics ,Asynchronous communication ,Tree rearrangement ,Computational phylogenetics ,Genetic algorithm ,ComputingMethodologies_GENERAL ,education - Abstract
A phylogenetic tree represents the evolutionary relationships among biological species. Although parallel computation is essential for phylogenetic tree searches, it is not easy to maintain the diversity of the population in a parallel genetic algorithm. In this paper, we design a new asynchronous parallel genetic algorithm for tree optimization that maintains the diversity of the population without any communication or synchronization.
- Published
- 2012
33. DS-Bench Toolset: Tools for dependability benchmarking with simulation and assurance
- Author
-
Shinpei Kato, Toshihiro Hanawa, Yutaka Ishikawa, Hajime Fujita, Mitsuhisa Sato, and Yutaka Matsuno
- Subjects
Exploit ,business.industry ,Computer science ,media_common.quotation_subject ,Cloud computing ,Benchmarking ,Fault injection ,computer.software_genre ,Reliability engineering ,Virtual machine ,Information system ,Dependability ,Software engineering ,business ,Function (engineering) ,computer ,media_common - Abstract
Today's information systems have become large and complex because they must interact with each other via networks. This makes testing and assuring the dependability of systems much more difficult than ever before. The DS-Bench Toolset has been developed to address this issue; it includes D-Case Editor, DS-Bench, and D-Cloud. D-Case Editor is an assurance case editor. It forms a tool chain with DS-Bench and D-Cloud, and exploits the test results as evidence of the dependability of the system. DS-Bench manages dependability benchmarking tools and anomaly loads according to benchmarking scenarios. D-Cloud is a test environment for performing rapid system tests controlled by DS-Bench. It combines a cluster of real machines, for performance-accurate benchmarks, with a cloud computing environment of virtual machines, for exhaustive function testing with a fault-injection facility. The DS-Bench Toolset enables us to test systems satisfactorily and to explain the dependability of the systems to stakeholders.
- Published
- 2012
34. Implementation of XcalableMP Device Acceleration Extension with OpenCL
- Author
-
Taisuke Boku, Daisuke Takahashi, Takuma Nomizu, Mitsuhisa Sato, and Jinpil Lee
- Subjects
Software portability ,CUDA ,Computer science ,Programming paradigm ,Runtime library ,Distributed memory ,Partitioned global address space ,Parallel computing - Abstract
Due to their outstanding computational performance, many acceleration devices, such as GPUs and the Cell Broadband Engine (Cell/B.E.), as well as multi-core computing, are attracting a lot of attention in the field of high-performance computing. Although there are many programming models and languages designed for programming accelerators, such as CUDA, AMD Accelerated Parallel Processing (AMD APP), and OpenCL, these models remain difficult and complex. Furthermore, when programming for accelerator-enhanced clusters, we have to use an inter-node programming interface such as MPI to coordinate the nodes. In order to address these problems and reduce complexity, an extension to XcalableMP (XMP), a PGAS language, for use on accelerator-enhanced clusters, called the XcalableMP Device Acceleration Extension (XMP-dev), is proposed. In XMP-dev, globally distributed data are mapped onto the distributed memory of each accelerator, and a fragment of code can be offloaded for execution on a set of accelerators. This eliminates the complex programming between nodes and accelerators as well as among nodes. In this paper, we present an implementation of the XMP-dev runtime library with the OpenCL APIs, whereas the previous implementation targeted CUDA only. Since OpenCL is a standardized interface supported by various kinds of accelerators, it improves the portability of XMP-dev and reduces the cost of development. Our performance evaluation shows that the OpenCL implementation of XMP-dev can generate portable programs that run not only on NVIDIA GPU-enhanced clusters but also on various other accelerator-enhanced clusters.
- Published
- 2012
35. Productivity and Performance of Global-View Programming with XcalableMP PGAS Language
- Author
-
Mitsuhisa Sato, Taisuke Boku, Jinpil Lee, and Masahiro Nakao
- Subjects
Programming language ,Computer science ,Fortran ,Parallel computing ,computer.software_genre ,Data modeling ,Parallel language ,Consistency (database systems) ,Unified Parallel C ,Programming paradigm ,Compiler ,Partitioned global address space ,computer ,computer.programming_language - Abstract
XcalableMP (XMP) is a PGAS parallel language with a directive-based extension of C and Fortran. While it supports "coarray" as a local-view programming model, the XMP global-view programming model is useful when parallelizing data-parallel programs by adding directives with minimum code modification. This paper considers the productivity and performance of the XMP global-view programming model. In the global-view programming model, a programmer describes data distributions and work-mapping to map the computations to the nodes where the computed data are located. Global-view communication directives are used to move parts of the distributed data globally and to maintain consistency in the shadow area. The rich set of XMP global-view directives can reduce the cost of parallelization significantly, and optimization by "privatization" is not necessary. For the productivity and performance study, the Omni XMP compiler and the Berkeley Unified Parallel C compiler are used. Experimental results show that XMP can implement the benchmarks with a smaller programming cost than UPC. Furthermore, XMP has higher access performance than UPC for global data that has affinity with its own process. In addition, the XMP coarray functionality can effectively tune an application's performance.
- Published
- 2012
36. XMCAPI: Inter-core Communication Interface on Multi-chip Embedded Systems
- Author
-
Toshihiro Hanawa, Taisuke Boku, Mitsuhisa Sato, and Shin'ichi Miura
- Subjects
Protocol stack ,Multi-core processor ,Core (game theory) ,Computer architecture ,Shared memory ,business.industry ,Computer science ,Embedded system ,MCAPI ,Distributed memory ,business ,Chip ,Communication interface - Abstract
Multi-core processor technology has been applied to the processors in embedded systems as well as in ordinary PC systems. In multi-core embedded processors, however, a processor may consist of heterogeneous CPU cores that are not configured with shared memory and do not have a mechanism for inter-core communication. MCAPI is a highly portable API standard that provides inter-core communication independent of the architectural heterogeneity. In this paper, we extend the current MCAPI to multiple chips in a distributed-memory configuration and propose a portable implementation, named XMCAPI, on a commodity network stack. With XMCAPI, the inter-core communication method for intra-chip cores is extended to inter-chip cores. We evaluate the XMCAPI implementation xmcapi/ip, built on a standard socket, in a portable software development environment.
- Published
- 2011
37. Audit: New Synchronization for the GET/PUT Protocol
- Author
-
Atsushi Hori, Mitsuhisa Sato, and Jinpil Lee
- Subjects
Computer science ,FLAGS register ,Process (computing) ,Operating system ,Information technology audit ,Data synchronization ,Parallel programing ,Audit ,computer.software_genre ,Protocol (object-oriented programming) ,computer ,Synchronization - Abstract
The GET/PUT protocol is considered an effective communication API for parallel computing. However, the one-sided nature of the GET/PUT protocol lacks synchronization functionality on the target process. Several techniques have been proposed to tackle this problem, but the synchronization APIs proposed so far have failed to hide the implementation details of the synchronization. In this paper, a new synchronization API for the GET/PUT protocol is proposed. The idea is to associate the synchronization flags with the GET/PUT memory regions. By doing this, the synchronization flags are hidden from users, and users are freed from managing the associations between memory regions and synchronization flags. The proposed API, named "Audit," does not incur additional programming and thus enables natural parallel programming. The evaluations show that Audit exhibits better performance than the Notify API proposed in ARMCI.
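The core idea of associating a hidden synchronization flag with each PUT target region can be sketched as follows (an illustrative single-process model with hypothetical names; the real Audit API operates on RDMA memory regions across processes):

```python
class Region:
    """A registered PUT target region that carries its own hidden
    synchronization flag, so the target waits on the region itself
    rather than on a separate user-managed flag."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self._arrived = False          # hidden flag, never exposed to users

    def put(self, data):
        """One-sided write from the origin; the flag is set together
        with the data, modeling a PUT whose completion is observable."""
        self.buf[:len(data)] = data
        self._arrived = True

    def audit(self):
        """Target-side test: returns True once after each PUT, then
        resets, so the region can be reused for the next transfer."""
        ok, self._arrived = self._arrived, False
        return ok
```

The point of the design is that user code never names a flag: registering the region is enough, which is what removes the bookkeeping that earlier GET/PUT synchronization APIs exposed.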
- Published
- 2011
38. PEARL and PEACH: A Novel PCI Express Direct Link and Its Implementation
- Author
-
Taisuke Boku, Toshihiro Hanawa, Mitsuhisa Sato, Shin'ichi Miura, and Kazutami Arimoto
- Subjects
business.industry ,M.2 ,Computer science ,Communication link ,Chip ,PEARL (programming language) ,Parallel processing (DSP implementation) ,Network interface controller ,Power consumption ,Embedded system ,business ,computer ,computer.programming_language ,PCI Express - Abstract
We have proposed PEARL, a power-aware, high-performance, dependable communication link using PCI Express as a direct communication device, for application in a wide range of parallel processing systems, from high-end embedded systems to small-scale high-performance clusters. The PEACH chip used to realize PEARL connects four ports of PCI Express Gen 2 with four lanes and uses an M32R processor with four cores and several DMACs. We also developed the PEACH board as a network interface card implementing the PEACH chip. The preliminary evaluation results indicate that the PEACH board achieves a maximum performance of 1.1 Gbytes/s. In addition, through power-aware control, the power consumption can be reduced by up to 0.7 watts, and both the time required to reduce the number of lanes and the time required to change from Gen 2 to Gen 1 are 10 μs.
- Published
- 2011
39. Efficient Work-Stealing Strategies for Fine-Grain Task Parallelism
- Author
-
Adnan and Mitsuhisa Sato
- Subjects
Computer science ,Distributed computing ,Message passing ,Task parallelism ,Parallel computing ,Software_PROGRAMMINGTECHNIQUES ,Cilk ,Task (computing) ,Tree (data structure) ,Work stealing ,Task analysis ,Benchmark (computing) ,computer ,computer.programming_language - Abstract
Herein, we describe extended work-stealing strategies for StackThreads/MP, in which thieves steal not just the bottommost task but multiple chained tasks from the bottom of a victim's logical stack. These new strategies offer two advantages: reducing the total cost of stealing tasks and reducing the total idle time. In addition, the strategies attempt to preserve the sequential execution order of the tasks in the chain. We evaluated the extended work-stealing strategies using the unbalanced tree search (UTS) benchmark and demonstrate their advantages over the original work-stealing strategy, other OpenMP task implementations, and a Cilk implementation. The extended work-stealing strategies exhibit significant improvement on the UTS benchmark, even when tasks are very fine-grain and non-uniform.
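The chained-steal idea can be sketched with a deque standing in for the victim's logical stack (an illustration of the strategy, not the StackThreads/MP implementation): the thief removes up to k tasks from the bottom in one operation, keeping their original order so it executes them in the victim's sequential order.

```python
from collections import deque

def steal_chain(victim_deque, k):
    """Take up to k chained tasks from the *bottom* (oldest end) of the
    victim's deque in a single steal operation, preserving order.
    Stealing several tasks at once amortizes the per-steal cost and
    keeps the thief busy longer, reducing total idle time."""
    stolen = deque()
    while victim_deque and len(stolen) < k:
        stolen.append(victim_deque.popleft())   # popleft() = bottom
    return stolen
```

The victim keeps working at the top of its deque while the thief takes the oldest tasks, so the two rarely contend for the same tasks.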
- Published
- 2011
40. PEARL: Power-Aware, Dependable, and High-Performance Communication Link Using PCI Express
- Author
-
Mitsuhisa Sato, Toshihiro Hanawa, Taisuke Boku, Shin'ichi Miura, and Kazutami Arimoto
- Subjects
business.industry ,Computer science ,Bandwidth (signal processing) ,Fault tolerance ,Chip ,PEARL (programming language) ,Parallel processing (DSP implementation) ,Network interface controller ,Embedded system ,Control system ,business ,computer ,PCI Express ,computer.programming_language - Abstract
We have proposed a power-aware, high-performance, dependable communication link using PCI Express as a direct communication device, referred to as PEARL, for application in a wide range of parallel processing systems, from high-end embedded systems to small-scale high-performance clusters. In the present study, we describe the structure and function of a communicator chip, referred to as the PEACH chip, for realizing PEARL. The PEACH chip connects four ports of PCI Express Gen 2 with four lanes, and uses an M32R processor with four cores and several DMACs. We also developed the PEACH board as the network interface card implementing the PEACH chip. The PEACH board provides a power-aware, dependable communication link with a theoretical peak bandwidth of 2 Gbytes/s per link.
- Published
- 2010
41. Customizing Virtual Machine with Fault Injector by Integrating with SpecC Device Model for a Software Testing Environment D-Cloud
- Author
-
Takayuki Banzai, Shin'ichi Miura, Hitoshi Koizumi, Tadatoshi Ishii, Mitsuhisa Sato, Hidehisa Takamizawa, and Toshihiro Hanawa
- Subjects
Cloud management ,business.industry ,Computer science ,SpecC ,Cloud computing ,Fault injection ,computer.software_genre ,Software ,Virtual machine ,Software fault tolerance ,Embedded system ,Operating system ,Scenario testing ,business ,computer - Abstract
D-Cloud is a software testing environment for dependable parallel and distributed systems that uses cloud computing technology. We use Eucalyptus as cloud management software to manage virtual machines based on QEMU, called FaultVM, which have a fault injection mechanism. D-Cloud automates test procedures on a large amount of computing resources in the cloud by interpreting the system configuration and the test scenario, written in XML, in the D-Cloud front end, and it enables tests involving hardware faults by having FaultVM emulate such faults flexibly. In the present paper, we describe the customization facility of FaultVM used to add new device models. We use SpecC, a system description language, to describe the behavior of devices; a simulator generated from the SpecC description is linked and integrated into FaultVM. This also makes the definition and injection of faults flexible, without modification of the original QEMU source code. This facility allows D-Cloud to be used to test distributed systems with customized devices.
- Published
- 2010
42. Keynote CSE 2010: Trends in Post-Petascale Computing
- Author
-
Mitsuhisa Sato
- Subjects
Petascale computing ,Computer science ,Data science ,Computational science - Published
- 2010
43. Power-aware, dependable, and high-performance communication link using PCI Express: PEARL
- Author
-
Kazutami Arimoto, Mitsuhisa Sato, Shin'ichi Miura, Toshihiro Hanawa, and Taisuke Boku
- Subjects
M.2 ,Computer science ,business.industry ,PCI configuration space ,Network interface ,computer.software_genre ,PEARL (programming language) ,Network interface controller ,Embedded system ,Conventional PCI ,Operating system ,business ,Host (network) ,computer ,PCI Express ,computer.programming_language - Abstract
We have proposed a power-aware, dependable, and high-performance communication link using PCI Express as a direct communication device, referred to as PEARL, for application in a wide range of parallel and distributed systems, from high-end embedded systems to small-scale high-performance clusters. The PEACH chip, a communicator chip for realizing PEARL, concentrates four ports of PCI Express Gen 2 with four lanes, and employs an M32R processor with four cores and several DMACs. The network interface card implementing the PEACH chip provides a power-aware, dependable, and high-performance communication link among host nodes. We also present the results of a preliminary evaluation using the PEACH board. For DMA transfer over PCI Express between PEACH boards, the minimum latency was less than 1 μs, and the maximum bandwidth was 1.1 Gbytes/s for a data size of 256 Kbytes.
- Published
- 2010
44. Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems
- Author
-
Jinpil Lee and Mitsuhisa Sato
- Subjects
Fortran ,Programming language ,Computer science ,Message passing ,Parallel computing ,Supercomputer ,computer.software_genre ,Parallel programming model ,Programming paradigm ,Benchmark (computing) ,Distributed memory ,Compiler ,computer ,computer.programming_language - Abstract
Although MPI is a de facto standard for parallel programming on distributed memory systems, writing MPI programs is often a time-consuming and complicated process. XcalableMP is a language extension of C and Fortran for parallel programming on distributed memory systems that helps users reduce this programming effort. XcalableMP provides two programming models. The first is the global-view model, which supports typical parallelization based on the data- and task-parallel paradigms, and enables parallelizing the original sequential code with minimal modification using simple, OpenMP-like directives. The other is the local-view model, which allows CAF-like expressions to describe inter-node communication. Users can even use MPI and OpenMP explicitly in our language to optimize performance. In this paper, we introduce XcalableMP, the implementation of its compiler, and performance evaluation results. For the performance evaluation, we parallelized the HPCC Benchmark in XcalableMP. The results show that users can describe parallelization for a distributed memory system with only a small modification to the original sequential code.
- Published
- 2010
45. Runtime Energy Adaptation with Low-Impact Instrumented Code in a Power-Scalable Cluster System
- Author
-
Mitsuhisa Sato, Takayuki Imada, and Hideaki Kimura
- Subjects
Computer science ,business.industry ,Embedded system ,Scalability ,Real-time computing ,Energy consumption ,Interrupt ,Frequency scaling ,business ,Supercomputer ,Energy (signal processing) ,Efficient energy use ,Dynamic voltage scaling - Abstract
Recently, improving the energy efficiency of high-performance PC clusters has become important. In order to reduce the energy consumption of the microprocessor, many high-performance microprocessors have a Dynamic Voltage and Frequency Scaling (DVFS) mechanism. This paper proposes a new DVFS method called the Code-Instrumented Runtime (CI-Runtime) DVFS method, in which a combination of voltage and frequency, called a P-State, is managed by instrumented code at runtime. The proposed CI-Runtime DVFS method achieves better energy savings than interrupt-based runtime DVFS methods, since it selects the appropriate P-State for each defined region based on the characteristics of program execution. Moreover, it is more practical than static DVFS methods, since it does not require exhaustive profiles for each P-State. The method consists of two parts. In the first part, instrumented code is inserted by defining regions that have almost uniform characteristics. The instrumented code must be inserted at appropriate points, because the performance of the application decreases greatly if the instrumented code is called too many times in a short period; a method for automatically defining regions is proposed in this paper. The second part is the energy adaptation algorithm used at runtime. Two types of DVFS control algorithms are compared: energy adaptation with estimated energy consumption, and energy adaptation with only performance information. The proposed CI-Runtime DVFS method was implemented on a power-scalable PC cluster. The results show that CI-Runtime with energy adaptation using estimated energy consumption could achieve an energy saving of 14.2%, close to the optimal value, without obtaining exhaustive profiles for every available P-State setting.
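The per-region P-State selection described above can be sketched with a toy performance and energy model (hypothetical numbers and names; the paper's actual runtime estimator is richer): for each candidate P-State, predict the region's execution time from how CPU-bound it is, then pick the lowest-energy state that stays within an allowed slowdown.

```python
def choose_pstate(region_profile, pstates, max_slowdown=1.10):
    """Pick the (frequency, power) P-State minimizing estimated energy
    (power * time) for a region, subject to a slowdown bound.
    region_profile = (time at top frequency, CPU-bound fraction).
    Memory-bound regions stretch sub-linearly with lower frequency,
    which the cpu-bound fraction captures in this simple model."""
    base_t, cpu_frac = region_profile
    f_max = max(f for f, _ in pstates)
    best = None
    for f, power in pstates:
        # only the CPU-bound part of the region slows down at lower f
        t = base_t * (cpu_frac * f_max / f + (1.0 - cpu_frac))
        if t <= base_t * max_slowdown:
            energy = power * t
            if best is None or energy < best[1]:
                best = ((f, power), energy)
    return best[0]

pstates = [(2.0, 50.0), (1.6, 35.0), (1.2, 25.0)]  # (GHz, watts), illustrative
slow = choose_pstate((1.0, 0.2), pstates)   # memory-bound region
fast = choose_pstate((1.0, 0.9), pstates)   # compute-bound region
```

A memory-bound region (low CPU-bound fraction) tolerates a lower frequency within the slowdown bound, while a compute-bound region stays at the top P-State, which is the behavior the region-characterization in CI-Runtime is designed to exploit.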
- Published
- 2010
46. D-Cloud: Design of a Software Testing Environment for Reliable Distributed Systems Using Cloud Computing Technology
- Author
-
Hitoshi Koizumi, Mitsuhisa Sato, Takayuki Imada, Takayuki Banzai, Toshihiro Hanawa, and Ryo Kanbayashi
- Subjects
Computer science ,business.industry ,Software fault tolerance ,Distributed computing ,Cloud testing ,Software performance testing ,Cloud computing ,Software system ,Software reliability testing ,Fault injection ,business ,System integration testing - Abstract
In this paper, we propose a software testing environment, called D-Cloud, that uses cloud computing technology and virtual machines with a fault injection facility. The importance of high dependability in software systems has recently increased, yet exhaustive testing of software systems is becoming expensive and time-consuming, and in many cases sufficient software testing is not possible. In particular, it is often difficult to test parallel and distributed systems in the real world after deployment, although reliable systems, such as high-availability servers, are parallel and distributed systems. D-Cloud is a cloud system that manages virtual machines with a fault injection facility. D-Cloud sets up a test environment on the cloud resources using a given system configuration file and executes several tests automatically according to a given scenario, in which D-Cloud enables fault tolerance testing by causing device faults in the virtual machines. We have designed the D-Cloud system using Eucalyptus software and a description language, written in XML, for the system configuration and the fault injection scenario. We found that D-Cloud allows a user to easily set up and test a distributed system on the cloud and effectively reduces the cost and time of testing.
- Published
- 2010
47. Power and QoS performance characteristics of virtualized servers
- Author
-
Takayuki Imada, Mitsuhisa Sato, and Hideaki Kimura
- Subjects
Web server ,Multi-core processor ,business.industry ,Computer science ,Quality of service ,Provisioning ,Energy consumption ,computer.software_genre ,Server farm ,Virtual machine ,Server ,Operating system ,business ,computer ,Computer network - Abstract
In this paper, we investigate the power and QoS (Quality of Service) performance characteristics of virtualized servers using virtual machine technology. Currently, one of the critical problems at data centers with many thousands of servers is increased power consumption. Virtual machines (VMs) are often used for Internet services for efficient server management and provisioning. While virtualized servers running multiple VMs are expected to help save power, new issues arise compared with conventional physical servers: migration of load between servers and processor-core assignment of a server's workload, from the viewpoints of QoS performance and energy consumption. Our experimental results show that server consolidation using VM migration contributes to power reduction with little or no QoS performance degradation, and that assigning VMs to multiple processor cores running at a lower frequency can achieve additional power reduction on a server node.
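Why consolidation saves power can be illustrated with a first-order model (the power figures, linear power curve, and first-fit packing below are assumptions for the example, not the paper's measured results): idle power dominates, so packing VMs onto fewer nodes and idling the rest reduces total consumption.

```python
# Sketch: idle power dominates, so consolidating lightly loaded VMs
# onto fewer servers saves power. Values are illustrative only.

IDLE_W, PEAK_W = 120.0, 200.0  # assumed per-server idle and full-load power

def server_power(util):
    """Linear power model: idle power plus a utilization-proportional part."""
    return IDLE_W + (PEAK_W - IDLE_W) * util

def total_power(vm_loads, capacity=1.0):
    """First-fit consolidation: pack VM loads onto as few servers as fit."""
    servers = []
    for load in sorted(vm_loads, reverse=True):
        for i, used in enumerate(servers):
            if used + load <= capacity:
                servers[i] += load
                break
        else:
            servers.append(load)
    return sum(server_power(u) for u in servers)

# Four lightly loaded VMs consolidated on one node vs. one VM per node:
print(total_power([0.2, 0.2, 0.2, 0.2]))  # consolidated onto one server
print(4 * server_power(0.2))              # spread over four servers
```

Under this model the consolidated case pays one server's idle power instead of four, which is the effect the paper measures with real VM migration.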
- Published
- 2009
48. Performance Evaluation of OpenMP and MPI Hybrid Programs on a Large Scale Multi-core Multi-socket Cluster, T2K Open Supercomputer
- Author
-
Mitsuhisa Sato and Miwako Tsuji
- Subjects
Multi-core processor ,Hardware_MEMORYSTRUCTURES ,ComputerSystemsOrganization_COMPUTERSYSTEMIMPLEMENTATION ,Computer science ,Message passing ,Message Passing Interface ,Parallel computing ,Thread (computing) ,Software_PROGRAMMINGTECHNIQUES ,Supercomputer ,Non-uniform memory access ,Programming paradigm ,Distributed memory ,SPMD - Abstract
Non-uniform memory access (NUMA) systems, in which each processor has its own memory, have become a popular platform in high-end computing. While some early studies reported that a flat-MPI programming model outperformed an OpenMP/MPI hybrid programming model on SMP clusters, the hybrid of shared-memory, thread-based programming and distributed-memory, message-passing programming is considered a promising programming model on multi-core multi-socket NUMA clusters. We explore the performance of the OpenMP/MPI hybrid programming model on a large-scale multi-core multi-socket cluster, the T2K Open Supercomputer. Both benchmark (NPB, NAS Parallel Benchmarks) and application (RSDFT, Real-Space Density Functional Theory) codes are considered, and the hybridization of the RSDFT code is also shown. Our experiments show that the multi-core multi-socket cluster can take advantage of the hybrid programming model when MPI is used across sockets and OpenMP within sockets.
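The "MPI across sockets, OpenMP within sockets" layout can be sketched as a rank-to-core mapping. The node geometry below (4 sockets of 4 cores) is an assumption chosen to resemble a T2K node; the mapping function itself is generic.

```python
# Sketch of the hybrid layout the paper recommends: one MPI rank per
# socket, one OpenMP thread per core within that socket. Geometry is
# an assumption for illustration.

def hybrid_layout(nodes, sockets_per_node, cores_per_socket):
    """Return rank -> (node, socket, [core ids]) bindings."""
    layout = {}
    rank = 0
    for node in range(nodes):
        for sock in range(sockets_per_node):
            first = sock * cores_per_socket
            layout[rank] = (node, sock, list(range(first, first + cores_per_socket)))
            rank += 1
    return layout

for rank, (node, sock, cores) in hybrid_layout(2, 4, 4).items():
    print(f"rank {rank}: node {node}, socket {sock}, cores {cores}")
```

Binding each rank's OpenMP threads to one socket keeps every thread's memory accesses local to that socket's memory controller, which is the NUMA advantage the experiments exploit.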
- Published
- 2009
49. RI2N/DRV: Multi-link ethernet for high-bandwidth and fault-tolerant network on PC clusters
- Author
-
Toshihiro Hanawa, Mitsuhisa Sato, Taiga Yonemoto, Shin'ichi Miura, and Taisuke Boku
- Subjects
Ethernet ,Interconnection ,Computer science ,Transmission Control Protocol ,Ethernet over PDH ,business.industry ,Network packet ,Retransmission ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Ethernet flow control ,Gigabit Ethernet ,Jumbo frame ,Channel bonding ,Networking hardware ,Metro Ethernet ,Link aggregation ,Synchronous Ethernet ,TCP offload engine ,Network interface controller ,Embedded system ,ATA over Ethernet ,business ,Carrier Ethernet ,Computer network - Abstract
Although recent high-end interconnection network devices and switches provide a high performance-to-cost ratio, most small to medium-sized PC clusters are still built on the commodity network, Ethernet. To enhance performance on commonly used Gigabit Ethernet networks, link aggregation or bonding technology is used. Currently, Linux kernels are equipped with software named Linux Channel Bonding (LCB), which is based on the IEEE 802.3ad Link Aggregation technology. However, standard LCB has the disadvantage of a mismatch with the TCP protocol; consequently, both large latency and bandwidth instability can occur. LCB supports a fault-tolerance feature, but its usability is not sufficient. We developed a new implementation similar to LCB, named Redundant Interconnection with Inexpensive Network with Driver (RI2N/DRV), for use on Gigabit Ethernet. RI2N/DRV has a complete software stack that is well suited to TCP, the upper-layer protocol. Our algorithm suppresses unnecessary ACK packets and packet retransmissions, even under imbalanced network traffic and link failures on multiple links, and provides both high-bandwidth and fault-tolerant communication on multi-link Gigabit Ethernet. We confirmed that this system improves the performance and reliability of the network, and that it can be applied to ordinary UNIX services such as the Network File System (NFS) without any modification of other modules.
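One ingredient of such a multi-link scheme is in-order delivery despite packets arriving out of order across links. The sketch below is purely illustrative, not the driver's actual code: packets are striped round-robin by sequence number, and the receiver buffers out-of-order arrivals instead of letting TCP see reordering (which would trigger duplicate ACKs and spurious retransmissions).

```python
# Illustrative sketch: round-robin striping plus a reorder buffer,
# so the upper layer (TCP) only ever sees an in-order byte stream.

def stripe(packets, n_links):
    """Assign each packet (by sequence number) to a link, round-robin."""
    return {seq: seq % n_links for seq in packets}

class ReorderBuffer:
    def __init__(self):
        self.next_seq = 0
        self.pending = {}

    def receive(self, seq, payload):
        """Buffer an arrival; return the run of in-order payloads released."""
        self.pending[seq] = payload
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released

buf = ReorderBuffer()
print(buf.receive(1, "b"))  # out of order: held, nothing released
print(buf.receive(0, "a"))  # fills the gap: releases "a" then "b"
```

On a link failure, the sender would simply stop striping onto the dead link and resend its unacknowledged packets over the surviving ones; the reorder buffer absorbs the resulting arrival skew.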
- Published
- 2009
50. Using a cluster as a memory resource: A fast and large virtual memory on MPI
- Author
-
Mitsuhisa Sato, Taisuke Boku, Kazuhiro Saito, and Hiroko Midorikawa
- Subjects
Software portability ,Memory address ,Memory management ,Computer science ,Computer cluster ,Message passing ,Virtual memory ,Benchmark (computing) ,Message Passing Interface ,Operating system ,computer.software_genre ,computer - Abstract
A 64-bit OS provides ample memory address space, which is beneficial for applications using a large amount of data. This paper proposes using a cluster as a memory resource for sequential applications requiring a large amount of memory. This system is an extension of our previously proposed socket-based Distributed Large Memory System (DLM), which offers large virtual memory by using remote memory distributed over the nodes of a cluster. The newly designed DLM is based on MPI (Message Passing Interface) for higher portability. The MPI-based DLM provides fast and large virtual memory on widely available open clusters managed with an MPI batch queuing system. To access this remote memory, we rely on swap protocols suited to the available MPI thread-support levels. In experiments, we confirmed that the system achieves 493 MB/s and 613 MB/s of remote memory bandwidth with the STREAM benchmark on 2.5 GB/s and 5 GB/s links (Myri-10G x2, x4), and high application performance with the NPB and Himeno benchmarks. Additionally, this system enables users unfamiliar with parallel programming to use a cluster.
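The core mechanism, paging between a small local memory and remote memory on other nodes, can be sketched as follows. This is a minimal model, not the DLM implementation: the "remote" store is a plain dict standing in for MPI-served memory on other nodes, and eviction is simple LRU.

```python
# Minimal sketch of the DLM idea: local memory acts as a page cache
# backed by remote memory. "remote" here stands in for memory served
# by MPI processes on other cluster nodes. Illustrative only.
from collections import OrderedDict

class DistributedLargeMemory:
    def __init__(self, local_pages):
        self.local = OrderedDict()   # page id -> data, in LRU order
        self.remote = {}             # stand-in for remote-node memory
        self.capacity = local_pages

    def access(self, page, data=None):
        if page in self.local:
            self.local.move_to_end(page)           # local hit
        else:
            if len(self.local) >= self.capacity:   # evict LRU page out
                victim, vdata = self.local.popitem(last=False)
                self.remote[victim] = vdata        # "swap out" over MPI
            # "swap in" from remote, or allocate a fresh page
            self.local[page] = self.remote.pop(page, data)
        if data is not None:
            self.local[page] = data                # write access
        return self.local[page]

mem = DistributedLargeMemory(local_pages=2)
mem.access("p0", "A"); mem.access("p1", "B")
mem.access("p2", "C")            # evicts p0 to remote memory
print(mem.access("p0"))          # transparently swapped back in
```

The sequential application sees only ordinary loads and stores; the swap traffic to other nodes is what the STREAM remote-bandwidth figures above measure.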
- Published
- 2009