Author: "MATSUOKA, SATOSHI" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"MATSUOKA, SATOSHI"' showing total 1,861 results

1. Exploiting Scratchpad Memory for Deep Temporal Blocking: A case study for 2D Jacobian 5-point iterative stencil kernel (j2d5pt)

Author: Zhang, Lingqi, Wahib, Mohamed, Chen, Peng, Meng, Jintao, Wang, Xiao, Endo, Toshio, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: General Purpose Graphics Processing Units (GPGPU) are used in most of the top systems in HPC. The total capacity of scratchpad memory has increased by more than 40 times in the last decade. However, existing optimizations for stencil computations using temporal blocking have not aggressively exploited the large capacity of scratchpad memory. This work uses the 2D Jacobian 5-point iterative stencil as a case study to investigate the use of large scratchpad memory. Unlike existing research that tiles the domain in a thread block fashion, we tile the domain so that each tile is large enough to utilize all available scratchpad memory on the GPU. Consequently, we process several time steps inside a single tile before offloading the result back to global memory. Our evaluation shows that our performance is comparable to state-of-the-art implementations, yet our implementation is much simpler and does not require auto-generation of code.
Comment: This short paper is published in the 15th workshop on general purpose processing using GPU (GPGPU 2023)
Published: 2023

2. Origin of the intermolecular forces that produce donor–acceptor stacks in π-conjugated charge-transfer complexes

Author: Tsuzuki, Seiji, Ono, Ryota, Inoue, Satoru, Matsuoka, Satoshi, and Hasegawa, Tatsuo
Published: 2024

3. A machine-learning-based prediction of non-home discharge among acute heart failure patients

Author: Okada, Akira, Kaneko, Hidehiro, Konishi, Masaaki, Kamiya, Kentaro, Sugimoto, Tadafumi, Matsuoka, Satoshi, Yokota, Isao, Suzuki, Yuta, Yamaguchi, Satoko, Itoh, Hidetaka, Fujiu, Katsuhito, Michihata, Nobuaki, Jo, Taisuke, Matsui, Hiroki, Fushimi, Kiyohide, Takeda, Norifumi, Morita, Hiroyuki, Yasunaga, Hideo, and Komuro, Issei
Published: 2024

4. Revisiting Temporal Blocking Stencil Optimizations

Author: Zhang, Lingqi, Wahib, Mohamed, Chen, Peng, Meng, Jintao, Wang, Xiao, Endo, Toshio, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data locality, temporal blocking is an optimization that combines a batch of time steps to process them together. Under the observation that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features in the recent Nvidia GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose a novel temporal blocking method, EBISU, which champions low device occupancy to drive aggressive deep temporal blocking on large tiles that are executed tile-by-tile. We compare EBISU with state-of-the-art temporal blocking libraries: STENCILGEN and AN5D. We also compare with state-of-the-art stencil auto-tuning tools that are equipped with temporal blocking optimizations: ARTEMIS and DRSTENCIL. Over a wide range of stencil benchmarks, EBISU achieves speedups up to 2.53x and a geometric mean speedup of 1.49x over the best state-of-the-art performance in each stencil benchmark.
Comment: This paper will be published in 2023 International Conference on Supercomputing (ICS23)
Published: 2023

5. Myths and Legends in High-Performance Computing

Author: Matsuoka, Satoshi, Domke, Jens, Wahib, Mohamed, Drozd, Aleksandr, and Hoefler, Torsten
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture, Computer Science - Computers and Society, Computer Science - Machine Learning, Computer Science - Social and Information Networks
Abstract: In this thought-provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We gathered these myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within and beyond our community. We believe they represent the zeitgeist of the current era of massive change, driven by the end of many scaling laws such as Dennard scaling and Moore's law. While some laws end, new directions are emerging, such as algorithmic scaling or novel architecture research. Nevertheless, these myths are rarely based on scientific facts, but rather on some evidence or argumentation. In fact, we believe that this is the very reason for the existence of many myths and why they cannot be answered clearly. While it feels like there should be clear answers for each, some may remain endless philosophical debates, such as whether Beethoven was better than Mozart. We would like to see our collection of myths as a discussion of possible new directions for research and industry investment.
Published: 2023

6. Preparing for the Future -- Rethinking Proxy Apps

Author: Matsuoka, Satoshi, Domke, Jens, Wahib, Mohamed, Drozd, Aleksandr, Bair, Ray, Chien, Andrew A., Vetter, Jeffrey S., and Shalf, John
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN's Supercomputer Fugaku, ANL's Aurora, or ORNL's Frontier. This process was necessary to standardize the procurement while avoiding duplicated effort at each HPC center to develop their own benchmarks. Unfortunately, proxy applications force HPC centers and providers (vendors) into an undesirable state of rigidity, in contrast to the fast-moving trends of current technology and future heterogeneity. To accommodate an extremely-heterogeneous future, we have to reconsider how to co-design supercomputers during the next decade, and avoid repeating the past mistakes. This position paper outlines the current state-of-the-art in system co-design, challenges encountered over the past years, and a proposed plan to move forward.
Published: 2022

7. PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications

Author: Zhang, Lingqi, Wahib, Mohamed, Chen, Peng, Meng, Jintao, Wang, Xiao, Endo, Toshio, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output in each time step in the unused registers and shared memory. PERKS can be generalized to any iterative solver: it is largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate the effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of 2.12x for 2D stencils and 1.24x for 3D stencils over state-of-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of 4.86x in smaller SpMV datasets from SuiteSparse and 1.43x in larger SpMV datasets over a state-of-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS.
Comment: This paper will be published in 2023 International Conference on Supercomputing (ICS23)
Published: 2022

8. At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads

Author: Domke, Jens, Vatai, Emil, Gerofi, Balazs, Kodama, Yuetsu, Wahib, Mohamed, Podobas, Artur, Mittal, Sparsh, Pericàs, Miquel, Zhang, Lingqi, Chen, Peng, Drozd, Aleksandr, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56x for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.
Published: 2022

9. The estimation of coronary artery calcium thickness by computed tomography angiography based on optical coherence tomography measurements

Author: Okutsu, Masaaki, Mitomo, Satoru, Onishi, Hirokazu, Nakajima, Akihiro, Yabushita, Hiroto, Matsuoka, Satoshi, Kawamoto, Hiroyoshi, Watanabe, Yusuke, Tanaka, Kentaro, Naganuma, Toru, Tahara, Satoko, Nakamura, Shotaro, Basavarajaiah, Sandeep, and Nakamura, Sunao
Published: 2023

10. MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

Author: Farrell, Steven, Emani, Murali, Balma, Jacob, Drescher, Lukas, Drozd, Aleksandr, Fink, Andreas, Fox, Geoffrey, Kanter, David, Kurth, Thorsten, Mattson, Peter, Mu, Dawei, Ruhela, Amit, Sato, Kento, Shirahata, Koichi, Tabaru, Tsuguchika, Tsaris, Aristeidis, Balewski, Jan, Cumming, Ben, Danjo, Takumi, Domke, Jens, Fukai, Takaaki, Fukumoto, Naoto, Fukushi, Tatsuya, Gerofi, Balazs, Honda, Takumi, Imamura, Toshiyuki, Kasagi, Akihiko, Kawakami, Kentaro, Kudo, Shuhei, Kuroda, Akiyoshi, Martinasso, Maxime, Matsuoka, Satoshi, Mendonça, Henrique, Minami, Kazuki, Ram, Prabhat, Sawada, Takashi, Shankar, Mallikarjun, John, Tom St., Tabuchi, Akihiro, Vishwanath, Venkatram, Wahib, Mohamed, Yamazaki, Masafumi, and Yin, Junqi
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall >10× (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds.
Published: 2021

11. Digital transformation of droplet/aerosol infection risk assessment realized on 'Fugaku' for the fight against COVID-19

Author: Ando, Kazuto, Bale, Rahul, Li, ChungGang, Matsuoka, Satoshi, Onishi, Keiji, and Tsubokura, Makoto
Subjects: Computer Science - Computational Engineering, Finance, and Science, Physics - Fluid Dynamics
Abstract: The fastest supercomputer in 2020, Fugaku, has not only achieved digital transformation of epidemiology in allowing end-to-end, detailed quantitative modeling of COVID-19 transmissions for the first time, but also transformed the behavior of the entire Japanese public through its detailed analysis of transmission risks in multitudes of societal situations entailing heavy risks. A novel aerosol simulation methodology was synthesized out of a combination of new CFD methods meeting industrial demands, CUBE, which not only allowed the simulations to scale massively with high resolution required for micrometer virus-containing aerosol particles, but also extremely rapid time-to-solution due to its ability to generate the digital twins representing multitudes of societal situations in minutes, not weeks, attaining true overall application high performance; such simulations have been running for the past 1.5 years on Fugaku, cumulatively consuming top supercomputer-class resources and the result communicated by the media as well as becoming official public policies.
Comment: 24 pages, 12 figures
Published: 2021

12. Efficacy of azilsartan on left ventricular diastolic dysfunction compared with candesartan: J-TASTE randomized controlled trial

Author: Ito, Shin, Takahama, Hiroyuki, Asakura, Masanori, Abe, Yukio, Ajioka, Masayoshi, Anzai, Toshihisa, Arikawa, Takuo, Hayashi, Takaharu, Higashino, Yorihiko, Hiramitsu, Shinya, Iwahashi, Noriaki, Izumi, Chisato, Kimura, Kazuo, Kinugawa, Koichiro, Kioka, Hidetaka, Lim, Young-Jae, Matsuoka, Ken, Matsuoka, Satoshi, Motoki, Hirohiko, Nakamura, Sunao, Nakayama, Takafumi, Nomura, Akihiro, Sasaoka, Taishi, Takiuchi, Shin, Toyoda, Shigeru, Ueda, Tomoya, Watanabe, Tetsuya, Yamada, Akira, Yamamoto, Masayoshi, Sozu, Takashi, and Kitakaze, Masafumi
Published: 2023

13. Performance Portable Back-projection Algorithms on CPUs: Agnostic Data Locality and Vectorization Optimizations

Author: Chen, Peng, Wahib, Mohamed, Wang, Xiao, Takizawa, Shinichiro, Hirofuchi, Takahiro, Ogawa, Hirotaka, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Computed Tomography (CT) is a key 3D imaging technology that fundamentally relies on the compute-intense back-projection operation to generate 3D volumes. GPUs are typically used for back-projection in production CT devices. However, with the rise of power-constrained micro-CT devices, and also the emergence of CPUs comparable in performance to GPUs, back-projection for CPUs could become favorable. Unlike GPUs, extracting parallelism for back-projection algorithms on CPUs is complex given that parallelism and locality are not explicitly defined and controlled by the programmer, as is the case when using CUDA for instance. We propose a collection of novel back-projection algorithms that reduce the arithmetic computation, robustly enable vectorization, enforce a regular memory access pattern, and maximize the data locality. We also implement the novel algorithms as efficient back-projection kernels that are performance portable over a wide range of CPUs. Performance evaluation using a variety of CPUs from different vendors and generations demonstrates that our back-projection implementation achieves on average 5.2x speedup over the multi-threaded implementation of the most widely used, and optimized, open library. With a state-of-the-art CPU, we reach performance that rivals top-performing GPUs.
Comment: ACM International Conference on Supercomputing 2021 (ICS'21)
Published: 2021

14. Co-design Center for Exascale Machine Learning Technologies (ExaLearn)

Author: Alexander, Francis J, Ang, James, Bilbrey, Jenna A, Balewski, Jan, Casey, Tiernan, Chard, Ryan, Choi, Jong, Choudhury, Sutanay, Debusschere, Bert, DeGennaro, Anthony M, Dryden, Nikoli, Ellis, J Austin, Foster, Ian, Cardona, Cristina Garcia, Ghosh, Sayan, Harrington, Peter, Huang, Yunzhi, Jha, Shantenu, Johnston, Travis, Kagawa, Ai, Kannan, Ramakrishnan, Kumar, Neeraj, Liu, Zhengchun, Maruyama, Naoya, Matsuoka, Satoshi, McCarthy, Erin, Mohd-Yusof, Jamaludin, Nugent, Peter, Oyama, Yosuke, Proffen, Thomas, Pugmire, David, Rajamanickam, Sivasankaran, Ramakrishniah, Vinay, Schram, Malachi, Seal, Sudip K, Sivaraman, Ganesh, Sweeney, Christine, Tan, Li, Thakur, Rajeev, Van Essen, Brian, Ward, Logan, Welch, Paul, Wolf, Michael, Xantheas, Sotiris S, Yager, Kevin G, Yoo, Shinjae, and Yoon, Byung-Jun
Subjects: Bioengineering, Affordable and Clean Energy, Machine learning, exascale computing, reinforcement learning, active learning, high-performance computing for machine learning, machine learning for high-performance computing, Distributed Computing
Abstract: Rapid growth in data, computational methods, and computing power is driving a remarkable revolution in what variously is termed machine learning (ML), statistical learning, computational learning, and artificial intelligence. In addition to highly visible successes in machine-based natural language translation, playing the game Go, and self-driving cars, these new technologies also have profound implications for computational and experimental science and engineering, as well as for the exascale computing systems that the Department of Energy (DOE) is developing to support those disciplines. Not only do these learning technologies open up exciting opportunities for scientific discovery on exascale systems, they also appear poised to have important implications for the design and use of exascale computers themselves, including high-performance computing (HPC) for ML and ML for HPC. The overarching goal of the ExaLearn co-design project is to provide exascale ML software for use by Exascale Computing Project (ECP) applications, other ECP co-design centers, and DOE experimental facilities and leadership class computing facilities.
Published: 2021

15. The association of BP with cardiovascular outcomes in patients with dipstick proteinuria and preserved kidney function

Author: Suzuki, Yuta, Kaneko, Hidehiro, Yano, Yuichiro, Okada, Akira, Fujiu, Katsuhito, Matsuoka, Satoshi, Michihata, Nobuaki, Jo, Taisuke, Takeda, Norifumi, Morita, Hiroyuki, Node, Koichi, Yasunaga, Hideo, Oparil, Suzanne, and Komuro, Issei
Published: 2023

16. Teddy: A Sketching Interface for 3D Freeform Design

Author: Igarashi, Takeo, Matsuoka, Satoshi, and Tanaka, Hidehiko
Published: 2023

17. Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?

Author: Domke, Jens, Vatai, Emil, Drozd, Aleksandr, Chen, Peng, Oyama, Yosuke, Zhang, Lingqi, Salaria, Shweta, Mukunoki, Daichi, Podobas, Artur, Wahib, Mohamed, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Matrix engines or units, in different forms and affinities, are becoming a reality in modern processors; CPUs and otherwise. The current and dominant algorithmic approach to Deep Learning merits the commercial investments in these units, and deduced from the No.1 benchmark in supercomputing, namely High Performance Linpack, one would expect an awakened enthusiasm by the HPC community, too. Hence, our goal is to identify the practical added benefits for HPC and machine learning applications by having access to matrix engines. For this purpose, we perform an in-depth survey of software stacks, proxy applications and benchmarks, and historical batch job records. We provide a cost-benefit analysis of matrix engines, both asymptotically and in conjunction with state-of-the-art processors. While our empirical data will temper the enthusiasm, we also outline opportunities to misuse these dense matrix-multiplication engines if they come for free.
Comment: IEEE International Parallel and Distributed Processing Symposium 2021 (IPDPS'21)
Published: 2020

18. Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

Author: Wahib, Mohamed, Zhang, Haoyu, Nguyen, Truong Thao, Drozd, Aleksandr, Domke, Jens, Zhang, Lingqi, Takano, Ryousei, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. We also introduce the first method to solve the challenging problem of out-of-core multi-node training by carefully pipelining gradient exchanges and performing the parameter updates on the host. Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
Comment: ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'20)
Published: 2020

19. The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism

Author: Oyama, Yosuke, Maruyama, Naoya, Dryden, Nikoli, McCarthy, Erin, Harrington, Peter, Balewski, Jan, Matsuoka, Satoshi, Nugent, Peter, and Van Essen, Brian
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: We present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks. Deep learning-based emerging scientific workflows often require model training with large, high-dimensional samples, which can make training much more costly and even infeasible due to excessive memory usage. We solve these challenges by extensively applying hybrid parallelism throughout the end-to-end training pipeline, including both computations and I/O. Our hybrid-parallel algorithm extends the standard data parallelism with spatial parallelism, which partitions a single sample in the spatial domain, realizing strong scaling beyond the mini-batch dimension with a larger aggregated memory capacity. We evaluate our proposed training algorithms with two challenging 3D CNNs, CosmoFlow and 3D U-Net. Our comprehensive performance studies show that good weak and strong scaling can be achieved for both networks using up to 2K GPUs. More importantly, we enable training of CosmoFlow with much larger samples than previously possible, realizing an order-of-magnitude improvement in prediction accuracy.
Comment: 12 pages, 10 figures
Published: 2020

20. A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

Author: Zhang, Lingqi, Wahib, Mohamed, Zhang, Haoyu, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronizations at different levels of granularity in a single GPU. Additionally, the emergence of dense GPU nodes also calls for multi-GPU synchronization. Nvidia's latest CUDA provides a variety of synchronization methods. Until now, there is no full understanding of the characteristics of those synchronization methods. This work explores important undocumented features and provides an in-depth analysis of the performance considerations and pitfalls of the state-of-art synchronization methods for Nvidia GPUs. The provided analysis would be useful when making design choices for applications, libraries, and frameworks running on single and/or multi-GPU environments. We provide a case study of the commonly used reduction operator to illustrate how the knowledge gained in our analysis can be useful. We also describe our micro-benchmarks and measurement methods.
Comment: IPDPS20
Published: 2020

21. A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective

Author: Podobas, Artur, Sano, Kentaro, and Matsuoka, Satoshi
Subjects: Computer Science - Hardware Architecture, A.1, B.0, C.1, C.3
Abstract: With the end of both Dennard's scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of computing in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with a particular focus on the premise behind the different CGRAs and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.
Published: 2020

22. High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

Author: Zohouri, Hamid Reza, Podobas, Artur, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, compute performance compared to first-order stencils. We use an OpenCL-based design that, apart from parameterizing performance knobs, also parameterizes the stencil radius. Furthermore, we show that our performance model exhibits the same accuracy as first-order stencils in predicting the performance of high-order ones. On an Intel Arria 10 GX 1150 device, for 2D and 3D star-shaped stencils, we achieve over 700 and 270 GFLOP/s of compute performance, respectively, up to a stencil radius of four. These results outperform the state-of-the-art YASK framework on a modern Xeon for 2D and 3D stencils, and outperform a modern Xeon Phi for 2D stencils, while achieving competitive performance in 3D. Furthermore, our FPGA design achieves better power efficiency in almost all cases.
Comment: Published at RAW'18: 25th Anniversary of Reconfigurable Architectures Workshop held in conjunction with IPDPS'18
Published: 2020

23. AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

Author: Matsumura, Kazuaki, Zohouri, Hamid Reza, Wahib, Mohamed, Endo, Toshio, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
Published: 2020

24. Association of Metabolic Dysfunction-Associated Fatty Liver Disease With Risk of HF and AF

Author: Ohno, Ryusei, Kaneko, Hidehiro, Suzuki, Yuta, Okada, Akira, Matsuoka, Satoshi, Ueno, Kensuke, Fujiu, Katsuhito, Michihata, Nobuaki, Jo, Taisuke, Takeda, Norifumi, Morita, Hiroyuki, Node, Koichi, Yasunaga, Hideo, and Komuro, Issei
Published: 2023

25. Association Between Early Initiation of Cardiac Rehabilitation and Short-Term Outcomes of Patients With Acute Heart Failure Admitted to the Intensive Care Unit

Author: Ishibashi, Takuma, Kaneko, Hidehiro, Ueno, Kensuke, Morita, Kojiro, Itoh, Hidetaka, Okada, Akira, Kamiya, Kentaro, Suzuki, Yuta, Matsuoka, Satoshi, Fujiu, Katsuhito, Michihata, Nobuaki, Jo, Taisuke, Takeda, Norifumi, Morita, Hiroyuki, Ako, Junya, Node, Koichi, Yasunaga, Hideo, and Komuro, Issei
Published: 2023

26. Impact of Stent Expansion Index on Stent Failure After Left Main Stenting

Author: Watanabe, Yusuke, Mitomo, Satoru, Naganuma, Toru, Nakajima, Akihiro, Matsuoka, Satoshi, Tahara, Satoko, Okutsu, Masaaki, Nakamura, Shotaro, and Nakamura, Sunao
Published: 2023

27. Association of four health behaviors in Life's Essential 8 with the incidence of hypertension and diabetes mellitus

Author: Ueno, Kensuke, Kaneko, Hidehiro, Okada, Akira, Suzuki, Yuta, Matsuoka, Satoshi, Fujiu, Katsuhito, Michihata, Nobuaki, Jo, Taisuke, Takeda, Norifumi, Morita, Hiroyuki, Kamiya, Kentaro, Ako, Junya, Node, Koichi, Yasunaga, Hideo, and Komuro, Issei
Published: 2023

28. The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs With Hybrid Parallelism

Author: Oyama, Yosuke, Maruyama, Naoya, Dryden, Nikoli, McCarthy, Erin, Harrington, Peter, Balewski, Jan, Matsuoka, Satoshi, Nugent, Peter, and Van Essen, Brian
Subjects: Bioengineering, Neurosciences, Computer Software, Distributed Computing, Communications Technologies
Published: 2021

29. Giant bulk piezophotovoltaic effect in 3R-MoS2

Author: Dong, Yu, Yang, Ming-Min, Yoshii, Mao, Matsuoka, Satoshi, Kitamura, Sota, Hasegawa, Tatsuo, Ogawa, Naoki, Morimoto, Takahiro, Ideue, Toshiya, and Iwasa, Yoshihiro
Published: 2023
Full Text: View/download PDF

30. The First Exascale Supercomputer Accelerating AI-for-Science and Beyond

Author: Matsuoka, Satoshi, primary, Sato, Kento, additional, Wahib, Mohamed, additional, and Drozd, Aleksandr, additional
Published: 2023
Full Text: View/download PDF

31. The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface

Author: Zohouri, Hamid Reza and Matsuoka, Satoshi
Subjects: Computer Science - Hardware Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in the literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we will analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer's ability to maximize memory performance in their design. For some of these cases, we will provide work-arounds to improve memory bandwidth efficiency; however, a general solution will require major changes in the memory controller itself.
Comment: Published at H2RC'19: Fifth International Workshop on Heterogeneous High-performance Reconfigurable Computing held in conjunction with SC'19
Published: 2019
Full Text: View/download PDF

32. iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction

Author: Chen, Peng, Wahib, Mohamed, Takizawa, Shinichiro, Takano, Ryousei, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Computed Tomography (CT) is a widely used technology that requires compute-intense algorithms for image reconstruction. We propose a novel back-projection algorithm that reduces the projection computation cost to 1/6 of the standard algorithm. We also propose an efficient implementation that takes advantage of the heterogeneity of GPU-accelerated systems by overlapping the filtering and back-projection stages on CPUs and GPUs, respectively. Finally, we propose a distributed framework for high-resolution image reconstruction on state-of-the-art GPU-accelerated supercomputers. The framework relies on an elaborate interleave of MPI collective communication steps to achieve scalable communication. Evaluation on a single Tesla V100 GPU demonstrates that our back-projection kernel performs up to 1.6x faster than the standard FDK implementation. We also demonstrate the scalability and instantaneous CT capability of the distributed framework by using up to 2,048 V100 GPUs to solve 4K and 8K problems within 30 seconds and 2 minutes, respectively (including I/O).
Comment: ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19)
Published: 2019
Full Text: View/download PDF

33. A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

Author: Chen, Peng, Wahib, Mohamed, Takizawa, Shinichiro, Takano, Ryousei, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.
Comment: ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19)
Published: 2019
Full Text: View/download PDF

34. Efficient checkpoint/Restart of CUDA applications

Author: Nukada, Akira, Suzuki, Taichiro, and Matsuoka, Satoshi
Published: 2023
Full Text: View/download PDF

35. Gait Speed and Cardiovascular Disease by Glycemic Status

Author: Ueno, Kensuke, Kaneko, Hidehiro, Kamiya, Kentaro, Okada, Akira, Suzuki, Yuta, Fujiu, Katsuhito, Matsuoka, Satoshi, Michihata, Nobuaki, Takeda, Norifumi, Jo, Taisuke, Morita, Hiroyuki, Ako, Junya, Node, Koichi, Yasunaga, Hideo, and Komuro, Issei
Published: 2023
Full Text: View/download PDF

36. Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks

Author: Nagasaka, Yusuke, Nukada, Akira, Kojima, Ryosuke, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Graph Convolutional Networks (GCNs) have recently been getting much attention in bioinformatics and chemoinformatics as a state-of-the-art machine learning approach with high accuracy. GCNs process convolutional operations along with graph structures, and GPUs are used to process enormous operations including sparse-dense matrix multiplication (SpMM) when the graph structure is expressed as an adjacency matrix with sparse matrix format. However, the SpMM operation on small graphs, where the number of nodes is tens or hundreds, hardly exploits the high parallelism or compute power of the GPU. Therefore, SpMM becomes a bottleneck of training and inference in GCN applications. In order to improve the performance of GCN applications, we propose a new SpMM algorithm especially for small sparse matrices and Batched SpMM, which exploits the high parallelism of the GPU by processing multiple SpMM operations with a single CUDA kernel. To the best of our knowledge, this is the first work on a batched approach for SpMM. We evaluated the performance of the GCN application on TSUBAME3.0 implementing NVIDIA Tesla P100 GPU, and our batched approach shows significant speedups of up to 1.59x and 1.37x in training and inference, respectively.
Comment: 10 pages, 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
Published: 2019

37. A New Linear Time Correctness Condition for Multiplicative Linear Logic

Author: Matsuoka, Satoshi
Subjects: Computer Science - Logic in Computer Science
Abstract: In this paper, we give a new linear time correctness condition for proof nets of Multiplicative Linear Logic without units. Our approach is based on a rewriting system over trees. We have only three rewrite rules. Compared with previous linear time correctness conditions, our system is surprisingly simple and intuitively appealing.
Comment: Found a bug in the proof of the linear time claim in the second version. Adapted the algorithm in order to guarantee the linear time termination. Added an additional example
Published: 2019

38. Adaptive Pattern Matching with Reinforcement Learning for Dynamic Graphs

Author: Kanezashi, Hiroki, Suzumura, Toyotaro, Garcia-Gasulla, Dario, Oh, Min-hwan, and Matsuoka, Satoshi
Subjects: Computer Science - Databases
Abstract: Graph pattern matching algorithms to handle million-scale dynamic graphs are widely used in many applications such as social network analytics and suspicious transaction detection in financial networks. On the other hand, the computational complexity of many graph pattern matching algorithms is high, and it is not affordable to extract patterns from million-scale graphs. Moreover, most real-world networks are time-evolving, updating their structures continuously, which makes it harder to update and output newly matched patterns in real time. Many incremental graph pattern matching algorithms which reduce the number of updates have been proposed to handle such dynamic graphs. However, it is still challenging to recompute vertices in the incremental graph pattern matching algorithms in a single process, and that prevents real-time analysis. We propose an incremental graph pattern matching algorithm to deal with time-evolving graph data and also propose an adaptive optimization system based on reinforcement learning to recompute vertices in the incremental process more efficiently. Then we discuss the qualitative efficiency of our system with several types of data graphs and pattern graphs. We evaluate the performance using million-scale attributed and time-evolving social graphs. Our incremental algorithm is up to 10.1 times faster than an existing graph pattern matching algorithm and 1.95 times faster with the adaptive systems in a computation node than naive incremental processing.
Comment: 10 pages and 11 figures
Published: 2018
Full Text: View/download PDF

39. Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors

Author: Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın
Subjects: Distributed Computing and Systems Software, Information and Computing Sciences, Sparse matrix, SpGEMM, Intel KNL, Distributed Computing, Cognitive Sciences, Distributed computing and systems software
Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. We build the performance model for hash-table and heap-based algorithms, which supports the recipe. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and achieving an overall performance boost for the whole HipMCL application by 2.6X.
Published: 2019

40. Biomarkers associated with coronary high-risk plaques

Author: Nakajima, Akihiro, Libby, Peter, Mitomo, Satoru, Yuki, Haruhito, Araki, Makoto, Seegers, Lena Marie, McNulty, Iris, Lee, Hang, Ishibashi, Midori, Kobayashi, Kazuna, Dijkstra, Jouke, Ouchi, Toru, Onishi, Hirokazu, Yabushita, Hiroto, Matsuoka, Satoshi, Kawamoto, Hiroyoshi, Watanabe, Yusuke, Tanaka, Kentaro, Chou, Shengpu, Sato, Tomohiko, Naganuma, Toru, Okutsu, Masaaki, Tahara, Satoko, Kurita, Naoyuki, Nakamura, Shotaro, Kuter, David J., Nakamura, Sunao, and Jang, Ik-Kyung
Published: 2022
Full Text: View/download PDF

41. Impact of directional coronary atherectomy followed by drug-coated balloon strategy to avoid the complex stenting for bifurcation lesions

Author: Okutsu, Masaaki, Mitomo, Satoru, Ouchi, Toru, Yuki, Hisahito, Ueno, Takahiro, Onish, Hirokazu, Yabushita, Hiroto, Matsuoka, Satoshi, Kawamoto, Hiroyoshi, Watanabe, Yusuke, Tanaka, Kentaro, Naganuma, Toru, Sato, Tomohiko, Tahara, Satoko, Kurita, Naoyuki, Nakamura, Shotaro, and Nakamura, Sunao
Published: 2022
Full Text: View/download PDF

42. Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

Author: Osawa, Kazuki, Tsuji, Yohei, Ueno, Yuichiro, Naruse, Akira, Yokota, Rio, and Matsuoka, Satoshi
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Large-scale distributed training of deep neural networks suffers from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size over epochs and layers, or some ad hoc modification of the batch normalization. We propose an alternative approach using a second-order optimization method that shows similar generalization capability to first-order methods, but converges faster and can handle larger mini-batches. To test our method on a benchmark where highly optimized first-order methods are available as references, we train ResNet-50 on ImageNet. We converged to 75% Top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and achieved 75% even with a mini-batch size of 131,072, which took only 978 iterations.
Comment: 10 pages, 7 figures. Accepted at CVPR 2019, Long Beach, CA
Published: 2018

43. Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?

Author: Domke, Jens, Matsumura, Kazuaki, Wahib, Mohamed, Zhang, Haoyu, Yashima, Keita, Tsuchikawa, Toshiki, Tsuji, Yohei, Podobas, Artur, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Among the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications' need for a large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this view. In this paper, we challenge that wisdom, and we do so by exhaustively comparing a large number of HPC proxy applications on two processors: Intel's Knights Landing (KNL) and Knights Mill (KNM). Although similar, the KNM and KNL architecturally deviate at one important point: the silicon area devoted to double-precision arithmetic. This fortunate discrepancy allows us to empirically quantify the performance impact of reducing the amount of hardware double-precision arithmetic. Our analysis shows that this common wisdom might not always be right. We find that the investigated HPC proxy applications do allow for a (significant) reduction in double-precision with little-to-no performance implications. With the advent of a failing of Moore's law, our results partially reinforce the view taken by modern industry (e.g. upcoming Fujitsu ARM64FX) to integrate hybrid-precision hardware units.
Comment: IEEE International Parallel and Distributed Processing Symposium 2019
Published: 2018

44. μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

Author: Oyama, Yosuke, Ben-Nun, Tal, Hoefler, Torsten, and Matsuoka, Satoshi
Subjects: Computer Science - Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Mathematical Software, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning, I.2.6
Abstract: NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory footprint may vary considerably, depending on the layer dimensions. When an algorithm is automatically selected by cuDNN, the decision is performed on a per-layer basis, and thus it often resorts to slower algorithms that fit the workspace size constraints. We present μ-cuDNN, a transparent wrapper library for cuDNN, which divides layers' mini-batch computation into several micro-batches. Based on Dynamic Programming and Integer Linear Programming, μ-cuDNN enables faster algorithms by decreasing the workspace requirements. At the same time, μ-cuDNN keeps the computational semantics unchanged, so that it decouples statistical efficiency from the hardware efficiency safely. We demonstrate the effectiveness of μ-cuDNN over two frameworks, Caffe and TensorFlow, achieving speedups of 1.63x for AlexNet and 1.21x for ResNet-18 on P100-SXM2 GPU. These results indicate that using micro-batches can seamlessly increase the performance of deep learning, while maintaining the same memory footprint.
Comment: 11 pages, 14 figures. Part of the content has been published in IPSJ SIG Technical Report, Vol. 2017-HPC-162, No. 22, pp. 1-9, 2017. (DOI: http://id.nii.ac.jp/1001/00184814)
Published: 2018

45. High-performance sparse matrix-matrix products on Intel KNL and multicore architectures

Author: Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, and Buluç, Aydın
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.
Comment: 12 pages (extended version of conference paper)
Published: 2018

46. Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Author: Zohouri, Hamid Reza, Podobas, Artur, and Matsuoka, Satoshi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture
Abstract: Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without having such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arising from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance, for 2D and 3D stencils, respectively, which rivals the performance of a highly-optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.
Comment: FPGA '18: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Published: 2018
Full Text: View/download PDF

47. Glycemic status and the association of change in blood pressure with incident cardiovascular disease

Author: Suzuki, Yuta, Kaneko, Hidehiro, Yano, Yuichiro, Okada, Akira, Itoh, Hidetaka, Matsuoka, Satoshi, Fujiu, Katsuhito, Michihata, Nobuaki, Jo, Taisuke, Takeda, Norifumi, Morita, Hiroyuki, Kamiya, Kentaro, Matsunaga, Atsuhiko, Ako, Junya, Node, Koichi, Yasunaga, Hideo, and Komuro, Issei
Published: 2022
Full Text: View/download PDF

48. Can the United States Maintain Its Leadership in High-Performance Computing? - A report from the ASCAC Subcommittee on American Competitiveness and Innovation to the ASCR Office

Author: Dongarra, Jack, primary, Deelman, Ewa, additional, Hey, Tony, additional, Matsuoka, Satoshi, additional, Sarakar, Vivek, additional, Bell, Greg, additional, Foster, Ian, additional, Keyes, David, additional, Kranzlmueller, Dieter, additional, Lucas, Bob, additional, Parker, Lynne, additional, Shalf, John, additional, Stanzione, Dan, additional, Stevens, Rick, additional, and Yelick, Katherine, additional
Published: 2023
Full Text: View/download PDF

49. Kidney outcomes in patients with diabetes mellitus did not differ between individual sodium-glucose cotransporter-2 inhibitors

Author: Suzuki, Yuta, Kaneko, Hidehiro, Okada, Akira, Matsuoka, Satoshi, Fujiu, Katsuhito, Michihata, Nobuaki, Jo, Taisuke, Takeda, Norifumi, Morita, Hiroyuki, Node, Koichi, Nangaku, Masaomi, Yasunaga, Hideo, and Komuro, Issei
Published: 2022
Full Text: View/download PDF

50. Observation of the drainage process of the residual lipoma after endoscopic unroofing technique during colonoscopic evaluation of post-procedural hematochezia

Author: Ko, Yi-Ling, Matsuoka, Hiroki, Nomaru, Ryohei, Imakiire, So, Sakisaka, Hideto, Matsuoka, Satoshi, Kuno, Nobuaki, Abe, Koichi, Funakoshi, Sadahiro, Ishida, Yusuke, Ishibashi, Hideki, Koga, Kaori, Saito, Tetsuhiro, Takeshita, Morishige, and Hirai, Fumihito
Published: 2022
Full Text: View/download PDF
