142 results for "Xeon Phi"
Search Results
2. A Case Study for Performance Portability Using OpenMP 4.5
- Author
-
Gayatri, Rahulkumar, Yang, Charlene, Kurth, Thorsten, Deslippe, Jack, Chandrasekaran, Sunita, editor, Juckeland, Guido, editor, and Wienke, Sandra, editor
- Published
- 2019
3. Toward a BLAS library truly portable across different accelerator types.
- Author
-
Rodriguez-Gutiez, Eduardo, Moreton-Fernandez, Ana, Gonzalez-Escribano, Arturo, and Llanos, Diego R.
- Subjects
-
COMPUTING platforms, LINEAR algebra, SUPPORT groups, COPROCESSORS, DATA structures, PORTABLE computers
- Abstract
Scientific applications are among the most computationally demanding pieces of software. Their core is usually a set of linear algebra operations, which may represent a significant part of the overall run-time of the application. BLAS libraries aim to solve this problem by exposing a set of highly optimized, reusable routines. There are several implementations specifically tuned for different types of computing platforms, including coprocessors. Examples include the implementation bundled with the Intel MKL library, which targets Intel CPUs and Xeon Phi coprocessors, and the cuBLAS library, which is specifically designed for NVIDIA GPUs. Nowadays, computing nodes in many supercomputing clusters include one or more coprocessor types. Fully exploiting these platforms may require programs that adapt at run-time to the chosen device type, hardwiring into the program the code needed to use a different library for each device type that can be selected. This also forces the programmer to deal with the interface particularities of each library and with its mechanisms for managing memory transfers of the data structures used as parameters. This paper presents a unified, performance-oriented, and portable interface for BLAS. This interface has been integrated into a heterogeneous programming model (Controllers) which supports groups of CPU cores, Xeon Phi accelerators, or NVIDIA GPUs in a transparent way. The contributions of this paper include: an abstraction layer to hide programming differences between diverse BLAS libraries; new types of kernel classes to support the context manipulation of different external BLAS libraries; a new kernel selection policy that considers both programmer kernels and different external libraries; and a complete new Controller library interface for the whole collection of BLAS routines. This proposal enables the creation of BLAS-based portable codes that can execute on top of different types of accelerators by changing a single initialization parameter.
Our software internally exploits different preexisting and widely known BLAS library implementations, such as cuBLAS, MAGMA, or the one found in Intel MKL. It transparently uses the most appropriate library for the selected device. Our experimental results show that our abstraction does not introduce significant performance penalties, while achieving the desired portability. [ABSTRACT FROM AUTHOR]
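The single-parameter device selection this abstract describes can be sketched in a few lines of Python; the names below (CpuBackend, select_backend, gemv) are hypothetical stand-ins for illustration, not the actual Controllers API:

```python
# Hypothetical sketch: a unified BLAS-like interface whose backend is
# chosen by one initialization parameter. A real implementation would
# register wrappers over cuBLAS, MAGMA, or MKL here, each hiding its
# own interface quirks and memory transfers.

class CpuBackend:
    """Reference backend: plain-Python matrix-vector product (gemv)."""
    def gemv(self, A, x):
        return [sum(a * b for a, b in zip(row, x)) for row in A]

_BACKENDS = {"cpu": CpuBackend}

def select_backend(device="cpu"):
    # Changing this single parameter switches every BLAS call underneath.
    return _BACKENDS[device]()

blas = select_backend("cpu")
y = blas.gemv([[1, 2], [3, 4]], [1, 1])   # -> [3, 7]
```

The point of the indirection is that calling code never names a vendor library, so the same source runs unchanged when a "gpu" or "phi" backend is registered.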
- Published
- 2019
4. Vessel Segmentation for Noisy CT Data with Quality Measure Based on Single-Point Contrast-to-Noise Ratio
- Author
-
Nikonorov, A., Kolsanov, A., Petrov, M., Yuzifovich, Y., Prilepin, E., Chaplygin, S., Zelter, P., Bychenkov, K., Obaidat, Mohammad S., editor, and Lorenz, Pascal, editor
- Published
- 2016
5. SIMD Monte-Carlo Numerical Simulations Accelerated on GPU and Xeon Phi.
- Author
-
Plazolles, Bastien, El Baz, Didier, Spel, Martin, Rivola, Vincent, and Gegout, Pascal
- Subjects
-
GRAPHICS processing units, SIMD (Computer architecture), COMPUTING platforms, MONTE Carlo method, LOOPS (Group theory)
- Abstract
The efficiency of a pleasingly parallel application is studied on several computing platforms. A real-world problem, Monte-Carlo numerical simulation of stratospheric balloon envelope drift descent, is considered. We detail the optimization of the SIMD parallel codes on the K40 and K80 GPUs as well as on the Intel Xeon Phi, emphasizing loop and task parallelism, multi-threading, and vectorization. The experiments show that the GPU and MIC decrease computing time by non-negligible factors compared to a parallel code implemented on a two-socket CPU (E5-2680-v2), which finally allows us to use these devices in operational conditions. [ABSTRACT FROM AUTHOR]
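The "pleasingly parallel" structure this abstract relies on can be shown with a stdlib-only sketch: every Monte-Carlo sample below is an independent trajectory, so the sample loop maps directly onto SIMD lanes or GPU threads. The toy descent physics is invented for illustration and is not the authors' balloon model:

```python
import random

def drift_descent(samples=10000, altitude=30.0, step=1.0, seed=42):
    """Toy Monte-Carlo drift-descent study: each sample is independent,
    which is exactly what makes the workload trivially parallel."""
    rng = random.Random(seed)
    landings = []
    for _ in range(samples):          # independent -> one SIMD lane each
        x, z = 0.0, altitude
        while z > 0.0:
            x += rng.gauss(0.0, 0.5)  # random horizontal drift per step
            z -= step                 # constant descent rate
        landings.append(x)
    return landings

xs = drift_descent()
mean_drift = sum(xs) / len(xs)        # near 0 by symmetry
```

Because no sample reads another sample's state, vectorizing amounts to running the outer loop in lockstep, which is the optimization the paper carries out on GPUs and the Xeon Phi.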
- Published
- 2018
6. Speeding-up Bioinformatics Algorithms with Heterogeneous Architectures: Highly Heterogeneous Smith-Waterman (HHeterSW).
- Author
-
GÁLVEZ, SERGIO, FERUSIC, ADIS, ESTEBAN, FRANCISCO J., HERNÁNDEZ, PILAR, CABALLERO, JUAN A., and DORADO, GABRIEL
- Subjects
-
BIOINFORMATICS, SEQUENCE alignment, HETEROGENEOUS computing, BIOLOGICAL databases, INTEL microprocessors
- Abstract
The Smith-Waterman algorithm has great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, there are implementations in the literature that exploit the different hardware architectures available in a standard PC, such as GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementations of the Smith-Waterman algorithm on different hardware architectures, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution, and up to a 2.58-fold performance gain, when compared with any other algorithm to search sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware. [ABSTRACT FROM AUTHOR]
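The splitting step described above can be sketched as a throughput-proportional partition of the database; the device names and relative throughput figures below are illustrative, not the paper's calibration:

```python
def split_by_throughput(database, throughputs):
    """Partition a sequence database proportionally to measured device
    throughputs, so non-overlapping devices finish at about the same
    time; the per-device results are merged afterwards."""
    total = sum(throughputs.values())
    parts, start = {}, 0
    items = list(throughputs.items())
    for i, (dev, t) in enumerate(items):
        # last device takes the remainder so nothing is dropped
        n = len(database) - start if i == len(items) - 1 \
            else round(len(database) * t / total)
        parts[dev] = database[start:start + n]
        start += n
    return parts

db = [f"seq{i}" for i in range(100)]
parts = split_by_throughput(db, {"gpu": 6.0, "phi": 3.0, "cpu": 1.0})
# gpu gets ~60 sequences, phi ~30, cpu ~10
```

In practice the weights would come from a calibration run of each Smith-Waterman implementation on its own hardware.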
- Published
- 2016
7. An Optimizing Multi-platform Source-to-source Compiler Framework for the NEURON MODeling Language
- Author
-
Liam Keegan, Pramod Kumbhar, Michael L. Hines, Felix Schürmann, Omar Awile, James G. King, and Jorge Blanco Alonso
- Subjects
Speedup, Modeling language, Computer science, Parallel computing, DSL, CUDA, Software, Code generation, NEURON, SIMD, SPMD, Massively parallel, HPC, Production (computer science), Compiler, Xeon Phi, Neuroscience
- Abstract
Domain-specific languages (DSLs) play an increasingly important role in the generation of high-performing software. They allow the user to exploit domain knowledge for the generation of more efficient code on target architectures. Here, we describe a new code generation framework (NMODL) for an existing DSL in the NEURON framework, a widely used software for massively parallel simulation of biophysically detailed brain tissue models. Existing NMODL DSL transpilers lack either essential features to generate optimized code or the capability to parse the diversity of existing models in the user community. Our NMODL framework has been tested against a large number of previously published user models and offers high-level domain-specific optimizations and symbolic algebraic simplifications before target code generation. NMODL implements multiple SIMD and SPMD targets optimized for modern hardware. When comparing NMODL-generated kernels with NEURON we observe a speedup of up to 20×, resulting in overall speedups of two different production simulations by ∼7×. When compared to SIMD-optimized kernels that relied heavily on auto-vectorization by the compiler, a speedup of up to ∼2× is still observed.
- Published
- 2020
8. VIENNACL--LINEAR ALGEBRA LIBRARY FOR MULTI- AND MANY-CORE ARCHITECTURES.
- Author
-
RUPP, KARL, TILLET, PHILIPPE, RUDOLF, FLORIAN, WEINBUB, JOSEF, MORHAMMER, ANDREAS, GRASSER, TIBOR, JÜNGEL, ANSGAR, and SELBERHERR, SIEGFRIED
- Subjects
-
LINEAR algebra, GRAPHICS processing units, COMPUTER programming, CENTRAL processing units, SCIENTIFIC computing
- Abstract
CUDA, OpenCL, and OpenMP are popular programming models for the multicore architectures of CPUs and many-core architectures of GPUs or Xeon Phis. At the same time, computational scientists face the question of which programming model to use to obtain their scientific results. We present the linear algebra library ViennaCL, which is built on top of all three programming models, thus enabling computational scientists to interface to a single library, yet obtain high performance for all three hardware types. Since the respective compute back end can be selected at runtime, one can seamlessly switch between different hardware types without the need for error-prone and time-consuming recompilation steps. We present new benchmark results for sparse linear algebra operations in ViennaCL, complementing results for the dense linear algebra operations in ViennaCL reported in earlier work. Comparisons with vendor libraries show that ViennaCL provides better overall performance for sparse matrix-vector and sparse matrix-matrix products. Additional benchmark results for pipelined iterative solvers with kernel fusion and preconditioners identify the respective sweet spots for CPUs, Xeon Phis, and GPUs. [ABSTRACT FROM AUTHOR]
- Published
- 2016
9. Manycore Algorithms for Batch Scalar and Block Tridiagonal Solvers.
- Author
-
László, Endre, Giles, Mike, and Appleyard, Jeremy
- Subjects
-
ALGORITHMS, PARALLEL programming, GRAPHICS processing units, DATA transmission systems, BANDWIDTHS, COMPUTATIONAL fluid dynamics
- Abstract
Engineering, scientific, and financial applications often require the simultaneous solution of a large number of independent tridiagonal systems of equations with varying coefficients. Since the number of systems is large enough to offer considerable parallelism on manycore systems, the choice between different tridiagonal solution algorithms, such as Thomas, Cyclic Reduction (CR), or Parallel Cyclic Reduction (PCR), needs to be reexamined. This work investigates the optimal choice of tridiagonal algorithm for CPU, Intel MIC, and NVIDIA GPU with a focus on minimizing the amount of data transfer to and from the main memory using novel algorithms and the register-blocking mechanism, and maximizing the achieved bandwidth. It also considers block tridiagonal solutions, which are sometimes required in Computational Fluid Dynamics (CFD) applications. A novel work-sharing and register-blocking-based Thomas solver is also presented. [ABSTRACT FROM AUTHOR]
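For reference, the scalar Thomas algorithm named above solves one tridiagonal system in O(n); the batch parallelism studied in the paper comes from running many such independent solves at once. This sketch shows only the single-system textbook recurrence, not the register-blocking or work-sharing scheme:

```python
def thomas(a, b, c, d):
    """Solve one tridiagonal system Ax = d, with sub-diagonal a,
    main diagonal b, and super-diagonal c (a[0] and c[-1] unused)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):               # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):      # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# system with known solution [1, 2, 3]
x = thomas([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0],
           [4.0, 8.0, 8.0])
```

Each system's two sweeps are inherently sequential, which is why the paper weighs Thomas against CR/PCR: the former minimizes work per system, the latter expose parallelism within a system.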
- Published
- 2016
10. Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi.
- Author
-
Tran, Tuan Tu, Liu, Yongchao, and Schmidt, Bertil
- Subjects
-
PARALLEL computers, APPROXIMATION theory, GRAPHICS processing units, DETERMINISTIC finite automata, ERROR analysis in mathematics
- Abstract
Approximate pattern matching (APM) aims to find the occurrences of a pattern inside a subject text while allowing a limited number of errors. It has been widely used in many application areas such as bioinformatics and information retrieval. Bit-parallel APM takes advantage of the intrinsic parallelism of bitwise operations inside a machine word. This approach typically encodes non-deterministic finite automaton (NFA) states or value differences between adjacent cells of a dynamic programming matrix in the form of bit arrays. Wu–Manber (WM) is a well-known bit-parallel APM algorithm, which simulates an NFA and gains parallel efficiency by performing multiple state updates within a machine word. An important parameter is the machine word size (e.g. 32 or 64 bits for CPUs). Due to increasing vector capabilities, efficient mapping of bit-parallel APM algorithms onto modern high performance computing architectures is an interesting research topic. Prominent examples are Xeon Phi coprocessors and CUDA-enabled GPUs, which provide words of size 512 bits (by means of vector registers) and 1024 bits (by means of warps), respectively. In this paper, we investigate mappings of the WM algorithm onto these two accelerator types. Both architectures are able to achieve around two orders-of-magnitude speedups compared to a single-threaded CPU implementation. Moreover, our tile-based implementation on a GeForce Titan graphics card runs up to 2.9× faster than our implementation on an Intel Xeon Phi 5110P. Source code is available at http://xbitpar.sourceforge.net. [ABSTRACT FROM AUTHOR]
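The WM update can be sketched with Python integers standing in for machine words (the paper widens the word to 512/1024 bits via vector registers and warps). This is the textbook shift-and formulation with k error levels, not the authors' tiled implementation:

```python
def wm_search(text, pattern, k):
    """Wu–Manber bit-parallel approximate matching: report end positions
    where pattern occurs in text with at most k errors (substitution,
    insertion, deletion). One bit vector R[d] per error level d."""
    m = len(pattern)
    S = {}                                   # per-character match masks
    for i, ch in enumerate(pattern):
        S[ch] = S.get(ch, 0) | (1 << i)
    R = [(1 << d) - 1 for d in range(k + 1)]
    hits = []
    for pos, ch in enumerate(text):
        mask = S.get(ch, 0)
        prev_old = R[0]
        R[0] = ((R[0] << 1) | 1) & mask      # exact-match level
        for d in range(1, k + 1):
            old = R[d]
            # match | insertion | substitution/deletion transitions
            R[d] = (((old << 1) | 1) & mask) \
                 | prev_old \
                 | (((prev_old | R[d - 1]) << 1) | 1)
            prev_old = old
        if R[k] & (1 << (m - 1)):
            hits.append(pos)
    return hits

hits = wm_search("axc", "abc", 1)   # "axc" matches with one substitution
```

All state for one pattern fits in k+1 words, so widening the word (as the Phi and GPU mappings do) directly lengthens the patterns or tiles handled per update.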
- Published
- 2016
11. Vectorizing unstructured mesh computations for many-core architectures.
- Author
-
Reguly, I Z., László, Endre, Mudalige, Gihan R., and Giles, Mike B.
- Subjects
COMPUTER architecture, PERFORMANCE evaluation, COMPUTER input-output equipment, SIMD (Computer architecture), GRID computing
- Abstract
Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon-Phi on a key class of irregular applications: unstructured mesh computations. Using single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library and explore how irregular memory accesses and race conditions can be organized on different hardware. We benchmark Intel Xeon CPUs and the Xeon-Phi, using a tsunami simulation and a representative CFD benchmark. Results are compared with previous work on CPUs and NVIDIA GPUs to provide a comparison of achievable performance on current many-core systems. We show that auto-vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques but results in efficient code and near-optimal performance, two times faster than non-vectorized code. We observe that the Xeon-Phi does not provide good performance for these applications but is still comparable with a pair of mid-range Xeon chips. Copyright © 2015 John Wiley & Sons, Ltd. [ABSTRACT FROM AUTHOR]
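One standard way to "organize race conditions" on unstructured meshes, as discussed above, is edge coloring: edges within a color share no node, so their indirect updates can run in parallel without conflicts. A greedy stdlib sketch of the idea (not the OP2 library's actual scheme):

```python
def color_edges(edges):
    """Greedily assign each edge (a pair of node ids) a color such that
    no two edges of the same color touch a common node; all edges in a
    color can then update node data concurrently."""
    colors = []              # list of (edge list, touched-node set)
    assignment = {}
    for e in edges:
        a, b = e
        for ci, (members, touched) in enumerate(colors):
            if a not in touched and b not in touched:
                members.append(e)
                touched.update(e)
                assignment[e] = ci
                break
        else:                # no conflict-free color: open a new one
            colors.append(([e], {a, b}))
            assignment[e] = len(colors) - 1
    return assignment

asg = color_edges([(0, 1), (1, 2), (2, 3), (3, 0)])
# a 4-cycle needs two colors: opposite edges can run concurrently
```

Within each color the loop is free of write conflicts, which is what makes SIMD vectorization (or SIMT execution) of the gather-compute-scatter pattern safe.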
- Published
- 2016
12. Exact diagonalization of quantum lattice models on coprocessors.
- Author
-
Siro, T. and Harju, A.
- Subjects
-
COPROCESSORS, LANCZOS method, GRAPHICS processing units, QUANTUM computers, MOTHERBOARDS
- Abstract
We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5. [ABSTRACT FROM AUTHOR]
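A single Lanczos step, whose execution time is what the paper measures on each device, is dominated by the matrix-vector product; a plain-Python sketch (no reorthogonalization, dense matvec standing in for the sparse Hamiltonian):

```python
def lanczos_step(matvec, v_prev, v_curr, beta_prev):
    """One Lanczos iteration for a symmetric operator: returns the new
    tridiagonal entries (alpha, beta) and the next basis vector. The
    matvec call is the cost that GPU/Phi/CPU implementations compete on."""
    w = matvec(v_curr)
    alpha = sum(a * b for a, b in zip(w, v_curr))
    w = [wi - alpha * vc - beta_prev * vp
         for wi, vc, vp in zip(w, v_curr, v_prev)]
    beta = sum(x * x for x in w) ** 0.5
    v_next = [x / beta for x in w]
    return alpha, beta, v_next

# tiny symmetric example: A = [[2, 1], [1, 2]], starting from e1
A = [[2.0, 1.0], [1.0, 2.0]]
mv = lambda v: [sum(a * b for a, b in zip(row, v)) for row in A]
alpha, beta, v_next = lanczos_step(mv, [0.0, 0.0], [1.0, 0.0], 0.0)
# alpha = 2.0, beta = 1.0, v_next = [0.0, 1.0]
```

Everything besides the matvec is a handful of BLAS-1 operations, which is why the paper's timings largely reflect how well each device handles the sparse matrix-vector product.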
- Published
- 2016
13. Multigrid for Matrix-Free High-Order Finite Element Computations on Graphics Processors
- Author
-
Karl Ljungkvist and Martin Kronbichler
- Subjects
Multi-core processor, Computer science, Double-precision floating-point format, Single-precision floating-point format, Intrinsics, Solver, Finite element method, Computational science, CUDA, Multigrid method, Polygon mesh, Software, Xeon Phi
- Abstract
This article presents matrix-free finite-element techniques for efficiently solving partial differential equations on modern many-core processors, such as graphics cards. We develop a GPU parallelization of a matrix-free geometric multigrid iterative solver targeting moderate and high polynomial degrees, with support for general curved and adaptively refined hexahedral meshes with hanging nodes. The central algorithmic component is the matrix-free operator evaluation with sum factorization. We compare the node-level performance of our implementation running on an Nvidia Pascal P100 GPU to a highly optimized multicore implementation running on comparable Intel Broadwell CPUs and an Intel Xeon Phi. Our experiments show that the GPU implementation is approximately 1.5 to 2 times faster across four different scenarios of the Poisson equation and a variety of element degrees in 2D and 3D. The lowest time to solution per degree of freedom is recorded for moderate polynomial degrees between 3 and 5. A detailed performance analysis highlights the capabilities of the GPU architecture and the chosen execution model with threading within the element, particularly with respect to the evaluation of the matrix-vector product. Atomic intrinsics are shown to provide a fast way for avoiding the possible race conditions in summing the elemental residuals into the global vector associated with shared vertices, edges, and surfaces. In addition, the solver infrastructure allows for using mixed-precision arithmetic that performs the multigrid V-cycle in single precision with an outer correction in double precision, increasing throughput by up to 83%.
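The mixed-precision idea in the last sentence (cheap single-precision inner solve, double-precision outer correction) can be illustrated with scalar iterative refinement, emulating float32 via a struct round-trip. This stands in for the V-cycle as the inner solver and is not the article's code:

```python
import struct

def to_f32(x):
    """Emulate single precision with a float32 round-trip."""
    return struct.unpack('f', struct.pack('f', x))[0]

def refine(a, b, sweeps=3):
    """Mixed-precision iterative refinement on a scalar system a*x = b:
    the residual and the correction update stay in double precision,
    while the (stand-in) inner solve runs in emulated single precision."""
    x = 0.0
    for _ in range(sweeps):
        r = b - a * x                  # residual in double precision
        x += to_f32(r) / to_f32(a)     # cheap single-precision "solve"
    return x

x = refine(3.0, 1.0)   # approaches 1/3 at double-precision accuracy
```

Each sweep shrinks the error by roughly the single-precision unit roundoff, so a few cheap inner solves recover full double-precision accuracy, which is where the reported throughput gain comes from.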
- Published
- 2019
14. A high-order cross-platform incompressible Navier–Stokes solver via artificial compressibility with application to a turbulent jet
- Author
-
Niki A. Loppi, Freddie D. Witherden, Antony Jameson, Peter E. Vincent, and Engineering & Physical Science Research Council (EPSRC)
- Subjects
Parallel algorithms, Incompressible flow, Artificial compressibility, Flux reconstruction, Spectral difference method, Multigrid method, Code generation, CUDA, Unstructured grids, Turbulence, Massively parallel, Solver, PyFR, Compressibility, Xeon Phi
- Abstract
Modern hardware architectures such as GPUs and manycore processors are characterised by an abundance of compute capability relative to memory bandwidth. This makes them well-suited to solving temporally explicit and spatially compact discretisations of hyperbolic conservation laws. However, classical pressure-projection-based incompressible Navier–Stokes formulations do not fall into this category. One attractive formulation for solving incompressible problems on modern hardware is the method of artificial compressibility. When combined with explicit dual time stepping and a high-order Flux Reconstruction discretisation, the majority of operations can be cast as compute-bound matrix–matrix multiplications that are well-suited for GPU acceleration and manycore processing. In this work, we develop a high-order cross-platform incompressible Navier–Stokes solver, via artificial compressibility and dual time stepping, in the PyFR framework. The solver runs on a range of computer architectures, from laptops to the largest supercomputers, via a platform-unified templating approach that can generate/compile CUDA, OpenCL and C/OpenMP code at runtime. The extensibility of the cross-platform templating framework defined within PyFR is clearly demonstrated, as is the utility of P-multigrid for convergence acceleration. The platform independence of the solver is verified on Nvidia Tesla P100 GPUs and Intel Xeon Phi 7210 KNL manycore processors with a 3D Taylor–Green vortex test case. Additionally, the solver is applied to a 3D turbulent jet test case at Re = 10,000, and strong scaling is reported up to 144 GPUs. The new software constitutes the first high-order accurate cross-platform implementation of an incompressible Navier–Stokes solver via artificial compressibility and P-multigrid accelerated dual time stepping to be published in the literature. The technology has applications in a range of sectors, including the maritime and automotive industries.
Moreover, due to its cross-platform nature, the technology is well placed to remain relevant in an era of rapidly evolving hardware architectures.
Program summary
Program Title: PyFR v1.7.5
Program Files doi: http://dx.doi.org/10.17632/65m665nt9c.1
Licensing provisions: BSD 3-clause
Programming language: Python, CUDA, OpenCL and C
Supplementary material: Configuration and mesh files for the Taylor–Green Vortex and Turbulent jet test cases
Journal reference of previous version: Comput. Phys. Commun. 185 (2014) 3028–3040
Does the new version supersede the previous version?: Yes
Reasons for the new version: Adding support for incompressible flows
Summary of revisions: Introducing a new high-order cross-platform incompressible flow solver via artificial compressibility and P-multigrid accelerated dual time stepping.
Nature of problem: Incompressible Euler and Navier–Stokes equations for solving unsteady turbulent flows.
Solution method: Artificial compressibility formulation discretised with a high-order Flux Reconstruction approach in space and P-multigrid accelerated dual time stepping in time.
Additional comments including restrictions and unusual features: The algorithm targets modern massively parallel hardware platforms. Cross-platform capability is achieved via runtime code generation.
- Published
- 2018
15. Performance and Portability of a Linear Solver Across Emerging Architectures
- Author
-
Eric J. Nielsen, Aaron Walden, and Mohammad Zubair
- Subjects
CUDA, Kernel (linear algebra), Xeon, Computer science, Programming paradigm, Benchmark (computing), Memory bandwidth, Parallel computing, Intrinsics, Xeon Phi
- Abstract
A linear solver algorithm used by a large-scale unstructured-grid computational fluid dynamics application is examined for a broad range of familiar and emerging architectures. Efficient implementation of a linear solver is challenging on recent CPUs offering vector architectures. Vector loads and stores are essential to effectively utilize available memory bandwidth on CPUs, and maintaining performance across different CPUs can be difficult in the face of varying vector lengths offered by each. A similar challenge occurs on GPU architectures, where it is essential to have coalesced memory accesses to utilize memory bandwidth effectively. In this work, we demonstrate that restructuring a computation, and possibly data layout, with regard to architecture is essential to achieve optimal performance by establishing a performance benchmark for each target architecture in a low level language such as vector intrinsics or CUDA. In doing so, we demonstrate how a linear solver kernel can be mapped to Intel® Xeon™ and Xeon Phi™, Marvell® ThunderX2®, NEC® SX-Aurora™ TSUBASA Vector Engine, and NVIDIA® and AMD® GPUs. We further demonstrate that the required code restructuring can be achieved in higher level programming environments such as OpenACC, OCCA, and Intel® OneAPI™/SYCL, and that each generally results in optimal performance on the target architecture. Relative performance metrics for all implementations are shown, and subjective ratings for ease of implementation and optimization are suggested.
- Published
- 2021
16. Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures
- Author
-
Peng Zhang, Canqun Yang, Zheng Wang, Jianbin Fang, Tao Tang, and Chun Huang
- Subjects
CUDA, Speedup, Computer science, Distributed computing, Task analysis, Task parallelism, Partition (database), Multiplexing, Xeon Phi
- Abstract
As many-core accelerators keep integrating more processing units, it becomes increasingly more difficult for a parallel application to make effective use of all available resources. An effective way of improving hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks – a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning hardware among tasks, and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because there is a large number of possible solutions and the optimal solution varies across programs and datasets. This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learned model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. 
These results translate to over 93 percent of the performance delivered by a theoretically perfect predictor.
- Published
- 2020
17. A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit
- Author
-
Jiří Filipovič, Filip Petrovič, Jaroslav Ol’ha, Jana Hozzová, Siegfried Benkner, Richard Trembecký, David Střelák, European Commission, Fundación 'la Caixa', and Comunidad de Madrid
- Subjects
Autotuning benchmark set, Xeon, OpenCL, Computer science, Performance tuning, Parallel computing, CUDA, Program optimization, Porting, Software portability, Dynamic autotuning, Hard coding, Software, Xeon Phi, Performance optimization
- Abstract
In recent years, the heterogeneity of both commodity and supercomputer hardware has increased sharply. Accelerators, such as GPUs or Intel Xeon Phi co-processors, are often key to improving the speed and energy efficiency of highly-parallel codes. However, due to the complexity of heterogeneous architectures, optimizing codes for a certain type of architecture, as well as porting codes across different architectures while maintaining a comparable level of performance, can be extremely challenging. Addressing the challenges associated with performance optimization and performance portability, autotuning has gained a lot of interest. Autotuning of performance-relevant source-code parameters makes it possible to tune applications automatically, without hard-coding optimizations, and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we also introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data change. Although it is generally believed that autotuning spaces tend to be too large to be searched during application runtime, we show that this is not necessarily the case when tuning spaces are designed rationally. Many of our kernels reach near-peak performance with moderately sized tuning spaces that can be searched at runtime with acceptable overhead.
Finally, we demonstrate how dynamic performance tuning can be integrated into a real-world application from the cryo-electron microscopy domain. The work was supported by the European Regional Development Fund project "CERIT Scientific Cloud" (No. CZ.02.1.01/0.0/0.0/16_013/0001802). The project that gave rise to these results received the support of a fellowship from "la Caixa" Foundation (ID 100010434); the fellowship code is LCF/BQ/DI18/11660021. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713673. Further support came from the Spanish Ministry of Economy and Competitiveness through Grant BIO2016-76400-R (AEI/FEDER, UE) and from the Comunidad Autónoma de Madrid through Grant S2017/BMD-3817.
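The tuning loop described in this abstract can be sketched as timing each candidate configuration and keeping the fastest; the toy kernel and tiny tuning space below are illustrative, not part of the Kernel Tuning Toolkit API:

```python
import itertools
import time

def autotune(kernel, space, args):
    """Pick the fastest configuration by timing each candidate once.
    Exhaustive search is only viable here because the (illustrative)
    tuning space is small, which is the paper's point about rationally
    designed spaces being searchable at runtime."""
    best_cfg, best_t = None, float('inf')
    for cfg in itertools.product(*space.values()):
        params = dict(zip(space.keys(), cfg))
        t0 = time.perf_counter()
        kernel(args, **params)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best_cfg, best_t = params, dt
    return best_cfg

# toy "kernel": block-wise sum whose speed depends on the block size
def block_sum(data, block):
    return sum(sum(data[i:i + block]) for i in range(0, len(data), block))

cfg = autotune(block_sum, {"block": [1, 64, 4096]}, list(range(10000)))
```

A dynamic tuner would re-run this search whenever the input changes enough that the cached best configuration may no longer be optimal.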
- Published
- 2020
18. Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores
- Author
-
Lilun Zhang, Yu Zhuang, Anthony T. Chronopoulos, Xinghua Cheng, Wei Liu, and Yongxian Wang
- Subjects
FOS: Computer and information sciences ,Computer Science - Performance ,General Computer Science ,Computer science ,020209 energy ,General Engineering ,Symmetric multiprocessor system ,02 engineering and technology ,Parallel computing ,Solver ,Grid ,Supercomputer ,01 natural sciences ,Computational science ,Performance (cs.PF) ,010101 applied mathematics ,CUDA ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,0101 mathematics ,Xeon Phi - Abstract
For computational fluid dynamics (CFD) applications with a large number of grid points/cells, parallel computing is a common and efficient strategy to reduce the computational time. How to achieve the best performance on a modern supercomputer, especially one with heterogeneous computing resources such as hybrid CPU+GPU or CPU + Intel Xeon Phi (MIC) coprocessors, is still a great challenge. An in-house parallel CFD code capable of simulating three-dimensional structured-grid applications is developed and tested in this study. Several methods of parallelization, performance optimization, and code tuning, both in the CPU-only homogeneous system and in the heterogeneous system, are proposed. They are based on identifying the potential parallelism of applications, balancing the workload among all kinds of computing devices, tuning the multi-threaded code toward better performance within a node containing hundreds of CPU/MIC cores, and optimizing the communication among nodes, among cores, and between CPUs and MICs. Benchmark cases from model and industrial CFD applications are tested on the Tianhe-1A and Tianhe-2 supercomputers to evaluate the performance. Among these CFD cases, the maximum number of grid cells reached 780 billion. The tuned solver successfully scales up to half of the entire Tianhe-2 system, with over 1.376 million heterogeneous cores. The test results and performance analysis are discussed in detail., Comment: 12 pages, 12 figures
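A central step the abstract describes is balancing the workload among heterogeneous devices. A minimal C++ sketch of proportional static partitioning, assuming per-device throughputs have already been measured (the function name and weights are hypothetical, not from the paper):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Split `total_cells` among devices in proportion to their measured
// relative throughputs. Rounding leftovers go to the fastest device,
// so the partition always sums exactly to total_cells.
inline std::vector<std::size_t> partition_cells(
        std::size_t total_cells, const std::vector<double>& throughput) {
    double sum = std::accumulate(throughput.begin(), throughput.end(), 0.0);
    std::vector<std::size_t> share(throughput.size());
    std::size_t assigned = 0;
    std::size_t fastest = 0;
    for (std::size_t i = 0; i < throughput.size(); ++i) {
        share[i] = static_cast<std::size_t>(total_cells * throughput[i] / sum);
        assigned += share[i];
        if (throughput[i] > throughput[fastest]) fastest = i;
    }
    share[fastest] += total_cells - assigned;  // hand leftover cells over
    return share;
}
```

For example, with a CPU-to-MIC throughput ratio of 1:3, a 1000-cell slab would be split 250/750; production codes refine such static splits with runtime measurements.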
- Published
- 2018
19. Language Constructs and Semantics for Runtime-independent Parallelism Expression on Heterogeneous Systems
- Author
-
Xiaoshe Dong, Weiduo Chen, Shusen Wu, and Yufei Wang
- Subjects
010302 applied physics ,020203 distributed computing ,Computer science ,Semantics (computer science) ,Programming language ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Operational semantics ,Runtime system ,Software portability ,CUDA ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Compiler ,computer ,Massively parallel ,Core language ,Language construct ,Xeon Phi - Abstract
Heterogeneous processors such as GPUs provide massively parallel computing power but also exacerbate the difficulties of parallel programming. Although low-level programming methods such as CUDA and OpenCL can yield good performance, programming productivity is poor and applications lack portability. In this paper, we present a core language, Ruler, which extends C with high-level parallel constructs. These constructs enable programmers to express parallelism in programs without concerning themselves with runtime details, and thus ease programming. We present the operational semantics of the language and show how these constructs preserve the parallel patterns and degree of parallelism of high-level applications. This information can guide the compiler to generate efficient code and maintain performance across different platforms. We have implemented a compiler and runtime system for Ruler on top of OpenCL. Multiple benchmarks were rebuilt with Ruler and evaluated on both an NVIDIA GPU and an Intel MIC platform to demonstrate the effectiveness of our techniques. The size of the Ruler code is only 13%-64% of that of the OpenCL code. The rebuilt benchmarks execute smoothly on both platforms after compilation, yielding performance competitive with that of the handcrafted OpenCL benchmark code.
- Published
- 2019
20. Accelerating supply chains with Ant Colony Optimization across range of hardware solutions
- Author
-
Tatiana Kalganova and Ivars Dzalbs
- Subjects
ant colony optimization ,transportation network optimization ,General Computer Science ,Computer science ,0211 other engineering and technologies ,Context (language use) ,02 engineering and technology ,Travelling salesman problem ,Article ,Parallel ACO on Xeon Phi/GPU ,parallel ACO on Xeon Phi/GPU ,CUDA ,0202 electrical engineering, electronic engineering, information engineering ,Metaheuristic ,021103 operations research ,business.industry ,Ant colony optimization algorithms ,General Engineering ,Ant colony ,Transportation network optimisation ,Ant Colony Optimization ,Benchmark (computing) ,020201 artificial intelligence & image processing ,business ,Computer hardware ,Xeon Phi - Abstract
Highlights • Standard TSP instances are not suitable for generalized conclusions. • Although good for TSPs, GPUs are not well suited for the explored supply chain problem. • A 25.4x speed-up was achieved on CPU compared to its sequential counterpart. • A 148x speed-up was achieved on Xeon Phi compared to its sequential counterpart., The Ant Colony algorithm has been applied to various optimisation problems; however, most of the previous work on scaling and parallelism focuses on Travelling Salesman Problems (TSPs). Although useful for benchmarks and comparison of new ideas, the algorithmic dynamics do not always transfer to complex real-life problems, where additional meta-data is required during solution construction. This paper explores how benchmark performance differs from real-world problems in the context of Ant Colony Optimization (ACO) and demonstrates that, in order to generalise the findings, algorithms have to be tested on both standard benchmarks and real-world applications. We examine ACO and its scaling dynamics with two parallel ACO architectures – Independent Ant Colonies (IAC) and Parallel Ants (PA). Results showed that PA was able to reach a higher solution quality in fewer iterations as the number of parallel instances increased. Furthermore, speed performance was measured across three different hardware solutions – a 16-core CPU, a 68-core Xeon Phi, and up to 4 GeForce GPUs. State-of-the-art ACO vectorisation techniques such as SS-Roulette were implemented using C++ and CUDA. Although excellent for routing simple TSPs, GPUs were found to be unsuitable for complex real-world supply-chain routing due to the meta-data access footprint required. Thus, our work demonstrates that the standard benchmarks are not suitable for generalised conclusions.
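The Parallel Ants (PA) architecture runs many ants concurrently and keeps the best solution per iteration. A toy C++ sketch with deterministic greedy nearest-neighbour ants, one per start city (the real algorithm adds pheromone trails and SS-Roulette vectorised selection):

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// One greedy "ant": a nearest-neighbour tour starting from `start`.
inline double tour_length(const std::vector<std::vector<double>>& d,
                          std::size_t start) {
    std::size_t n = d.size();
    std::vector<bool> seen(n, false);
    std::size_t cur = start;
    seen[cur] = true;
    double len = 0.0;
    for (std::size_t step = 1; step < n; ++step) {
        std::size_t best = n;
        for (std::size_t j = 0; j < n; ++j)
            if (!seen[j] && (best == n || d[cur][j] < d[cur][best])) best = j;
        len += d[cur][best];
        cur = best;
        seen[cur] = true;
    }
    return len + d[cur][start];  // close the tour
}

// Parallel Ants: each ant starts from a different city; keep the best
// tour. Each iteration is embarrassingly parallel (one ant per thread).
inline double best_of_parallel_ants(const std::vector<std::vector<double>>& d) {
    double best = std::numeric_limits<double>::max();
    #pragma omp parallel for reduction(min : best)
    for (long s = 0; s < static_cast<long>(d.size()); ++s) {
        double len = tour_length(d, static_cast<std::size_t>(s));
        if (len < best) best = len;
    }
    return best;
}
```

The per-ant work here touches only the distance matrix; in the supply-chain problem each construction step also consults per-edge meta-data, which is exactly the access pattern the paper found hostile to GPUs.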
- Published
- 2019
21. Parallelization and Performance of the NIM Weather Model on CPU, GPU, and MIC Processors
- Author
-
Tom Henderson, Ning Wang, Alexander E. MacDonald, Antonio Duarte, Jim Rosinski, Jacques Middlecoff, Paul Madden, Mark Govett, Julie Schramm, and Jin Lee
- Subjects
020203 distributed computing ,Atmospheric Science ,010504 meteorology & atmospheric sciences ,Computer science ,Fortran ,Graphics processing unit ,Multiprocessing ,02 engineering and technology ,Parallel computing ,01 natural sciences ,CUDA ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Code (cryptography) ,Central processing unit ,computer ,Xeon Phi ,0105 earth and related environmental sciences ,computer.programming_language - Abstract
The design and performance of the Non-Hydrostatic Icosahedral Model (NIM) global weather prediction model are described. NIM is a dynamical core designed to run on central processing unit (CPU), graphics processing unit (GPU), and Many Integrated Core (MIC) processors. It demonstrates efficient parallel performance and scalability to tens of thousands of compute nodes and has been an effective way to make comparisons between traditional CPUs and emerging fine-grain processors. The design of NIM also serves as a useful guide for the fine-grain parallelization of the finite volume cubed (FV3) model recently chosen by the National Weather Service (NWS) to become its next operational global weather prediction model. This paper describes the code structure and parallelization of NIM using standards-compliant open multiprocessing (OpenMP) and open accelerator (OpenACC) directives. NIM uses the directives to support a single, performance-portable code that runs on CPU, GPU, and MIC systems. Performance results are compared for five generations of computer chips, including the recently released Intel Knights Landing and NVIDIA Pascal chips. Single-node and multinode performance and scalability are also shown, along with a cost-benefit comparison based on vendor list prices.
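The single-source directive strategy can be illustrated on a simple loop that carries both OpenMP and OpenACC annotations: whichever directive set the compiler is built to honour takes effect, and the other is ignored. This is a generic C++ illustration, not NIM code (NIM itself is Fortran):

```cpp
#include <cstddef>
#include <vector>

// One performance-portable loop: OpenMP targets CPU/MIC threads,
// OpenACC targets the GPU. A compiler honours only the directive set
// it was asked to enable; the other is skipped like a comment, so a
// single source file serves all three processor types.
inline void saxpy(std::vector<float>& y, const std::vector<float>& x,
                  float a) {
    long n = static_cast<long>(y.size());
    #pragma acc parallel loop
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Keeping both annotations on the same loop nest is what lets one code base be tuned once and then built for CPU, GPU, or MIC systems.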
- Published
- 2017
22. Out-of-core implementation for accelerator kernels on heterogeneous clouds
- Author
-
Alexey Lastovetsky, Ziming Zhong, Hamidreza Khaleghzadeh, and Ravi Reddy
- Subjects
020203 distributed computing ,Multi-core processor ,Speedup ,Xeon ,Computer science ,Node (networking) ,020206 networking & telecommunications ,02 engineering and technology ,Parallel computing ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,computer.software_genre ,Theoretical Computer Science ,CUDA ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Operating system ,Out-of-core algorithm ,Field-programmable gate array ,computer ,Software ,Xeon Phi ,Information Systems - Abstract
Cloud environments today increasingly feature hybrid nodes containing multicore CPU processors and a diverse mix of accelerators, such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs), to facilitate easier migration of HPC workloads to them. While virtualization of accelerators in clouds is a leading research challenge, in this paper we address the programming challenges that arise when executing large instances of data-parallel applications on these accelerators. In a typical hybrid cloud node, the tight integration of accelerators with multicore CPUs via PCI-E communication links comes with inherent limitations, such as the limited main memory of the accelerators and the limited bandwidth of the PCI-E links. These limitations pose formidable programming challenges for the execution of large problem sizes on these accelerators. In this paper, we describe a library of interfaces (HCLOOC) that addresses these challenges. It employs optimal software pipelines to overlap data transfers between the host CPU and the accelerator with computations on the accelerator. It is designed using fundamental building blocks: OpenCL command queues for FPGAs, Intel offload streams for Intel Xeon Phis, and CUDA streams and events that allow concurrent utilization of the copy and execution engines provided in NVIDIA GPUs. We elucidate the key features of our library using an out-of-core implementation of matrix multiplication of large dense matrices on a hybrid node: an Intel Haswell multicore CPU server hosting three accelerators, an NVIDIA K40c GPU, an Intel Xeon Phi 3120P, and a Xilinx FPGA. Based on experiments with the GPU, we show that our out-of-core implementation achieves 82% of the peak double-precision floating-point performance of the GPU and a speedup of 2.7x over NVIDIA's out-of-core matrix multiplication implementation (CUBLAS-XT). We also demonstrate that our implementation exhibits a 0% drop in performance when the problem size exceeds the main memory of the GPU. We observe the same 0% drop for our implementations for the Intel Xeon Phi and the Xilinx FPGA.
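The out-of-core idea is to tile a problem so each tile fits in accelerator memory, then pipeline transfers against compute. A host-only C++ sketch of the tiling half (sequential here; HCLOOC overlaps the copy and compute stages via streams and command queues, and the names below are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Process a large host array in chunks of at most `device_capacity`
// elements. In the real pipeline, the transfer of chunk k+1 overlaps
// the kernel on chunk k via separate streams; here the two stages run
// back to back, which still shows the tiling that makes problems
// larger than device memory feasible.
inline double out_of_core_sum_of_squares(const std::vector<double>& host,
                                         std::size_t device_capacity) {
    std::vector<double> device_buf;  // stands in for accelerator memory
    double total = 0.0;
    for (std::size_t off = 0; off < host.size(); off += device_capacity) {
        std::size_t len = std::min(device_capacity, host.size() - off);
        device_buf.assign(host.begin() + off,
                          host.begin() + off + len);     // "H2D transfer"
        for (double v : device_buf) total += v * v;      // "device kernel"
    }
    return total;
}
```

Because every tile obeys the memory budget, the result is identical whether or not the whole input would have fit on the device, which is the property behind the reported 0% performance drop.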
- Published
- 2017
23. Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers
- Author
-
Leonid Oliker, William Tang, Kamesh Madduri, Samuel Williams, Khaled Z. Ibrahim, Stephane Ethier, and Bei Wang
- Subjects
Toroid ,Computer science ,010103 numerical & computational mathematics ,Plasma ,01 natural sciences ,010305 fluids & plasmas ,Theoretical Computer Science ,Computational science ,CUDA ,Hardware and Architecture ,0103 physical sciences ,Scalability ,Code (cryptography) ,Particle-in-cell ,0101 mathematics ,Poisson's equation ,Software ,Xeon Phi - Abstract
The gyrokinetic toroidal code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5-D Vlasov–Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P’s multiple levels of parallelism, including internode 2-D domain decomposition and particle decomposition, as well as intranode shared memory partition and vectorization, have enabled pushing the scalability of the PIC method to extreme computational scales. In this article, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) coprocessors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of ion–temperature–gradient driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects, and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.
- Published
- 2017
24. Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor
- Author
-
Bin Chen, Xinyu Dai, R. Kantowski, E. Baron, and P. Van der Mark
- Subjects
FOS: Computer and information sciences ,Coprocessor ,Speedup ,FOS: Physical sciences ,Parallel computing ,Software_PROGRAMMINGTECHNIQUES ,Gravitational microlensing ,01 natural sciences ,CUDA ,0103 physical sciences ,Graphics ,Instrumentation and Methods for Astrophysics (astro-ph.IM) ,010303 astronomy & astrophysics ,ComputingMethodologies_COMPUTERGRAPHICS ,High Energy Astrophysical Phenomena (astro-ph.HE) ,Physics ,010308 nuclear & particles physics ,Astronomy and Astrophysics ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Computer Science Applications ,Gravitational lens ,Computer Science - Distributed, Parallel, and Cluster Computing ,Space and Planetary Science ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Astrophysics - Instrumentation and Methods for Astrophysics ,Astrophysics - High Energy Astrophysical Phenomena ,Xeon Phi ,Fermi Gamma-ray Space Telescope - Abstract
Recently, Graphics Processing Units (GPUs) have been used to speed up very CPU-intensive gravitational microlensing simulations. In this work, we use the Xeon Phi coprocessor to accelerate such simulations and compare its performance on a microlensing code with that of NVIDIA's GPUs. For the selected set of parameters evaluated in our experiment, we find that the speedup from Intel's Knights Corner coprocessor is comparable to that from NVIDIA's Fermi family of GPUs with compute capability 2.0, but less significant than that from GPUs with higher compute capabilities such as the Kepler. However, the very recently released second-generation Xeon Phi, Knights Landing, is about 5.8 times faster than the Knights Corner, and about 2.9 times faster than the Kepler GPU used in our simulations. We conclude that the Xeon Phi is a very promising alternative to GPUs for modern high-performance microlensing simulations., Comment: 18 pages, 3 figures, accepted by Astronomy & Computing
- Published
- 2017
25. A lightweight approach to performance portability with targetDP
- Author
-
Alan Gray and Kevin Stratford
- Subjects
FOS: Computer and information sciences ,Source code ,Computer science ,media_common.quotation_subject ,FOS: Physical sciences ,Context (language use) ,02 engineering and technology ,Theoretical Computer Science ,Abstraction layer ,CUDA ,Software portability ,High Energy Physics - Lattice ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Code (cryptography) ,media_common ,business.industry ,High Energy Physics - Lattice (hep-lat) ,021001 nanoscience & nanotechnology ,Grid ,Computer Science - Distributed, Parallel, and Cluster Computing ,Hardware and Architecture ,Embedded system ,Distributed, Parallel, and Cluster Computing (cs.DC) ,0210 nano-technology ,business ,Software ,Xeon Phi - Abstract
Leading HPC systems achieve their status through use of highly parallel devices such as NVIDIA GPUs or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice QCD particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with MPI to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and GPU-accelerated large scale supercomputers., Comment: 11 pages, 5 figures, accepted to the International Journal of High Performance Computing Applications (IJHPCA), acceptance date 27th October 2016
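The abstraction-layer approach can be sketched with a loop macro that expands differently per target; the macro names below are invented for illustration and are not targetDP's actual API:

```cpp
#include <cstddef>

// Platform-agnostic loop annotation: the host build expands the macro
// to a threaded loop, while a hypothetical CUDA build would expand it
// to a grid-stride loop over blockIdx/threadIdx. The kernel body is
// written once and stays identical for every target.
#ifdef TARGET_CUDA
#define TARGET_LOOP(i, n)                                                 \
    for (std::size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \
         i += gridDim.x * blockDim.x)
#else
#define TARGET_LOOP(i, n)            \
    _Pragma("omp parallel for")      \
    for (std::size_t i = 0; i < (n); ++i)
#endif

// A grid-style kernel written against the abstraction layer.
inline void scale(double* data, std::size_t n, double factor) {
    TARGET_LOOP(i, n) { data[i] *= factor; }
}
```

Selecting the target at build time, rather than in the source, is what keeps a single code base portable across GPUs, Xeon Phi, and conventional CPUs.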
- Published
- 2016
26. A Cross-Platform SpMV Framework on Many-Core Architectures
- Author
-
Shigang Li, Huiyang Zhou, Yunquan Zhang, and Shengen Yan
- Subjects
Computer science ,Parallel algorithm ,020207 software engineering ,Double-precision floating-point format ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,01 natural sciences ,Single-precision floating-point format ,Bit field ,CUDA ,High memory ,Titan (supercomputer) ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,0101 mathematics ,Software ,Xeon Phi ,Information Systems - Abstract
Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory-bandwidth demand remain the critical performance bottlenecks. We present our novel solutions to these problems for both GPU and Intel MIC many-core architectures. First, we devise a new SpMV format, called Blocked Compressed Common Coordinate (BCCOO). BCCOO extends the blocked Common Coordinate (COO) format by using bit flags to store the row indices, alleviating the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices for better data locality. Then, to address the load imbalance problem, we propose a highly efficient matrix-based segmented sum/scan algorithm for SpMV, which eliminates global synchronization. Finally, we introduce an autotuning framework to choose optimization parameters. Experimental results show that our proposed framework has a significant advantage over the existing SpMV libraries. In single precision, our proposed scheme outperforms the clSpMV COCKTAIL format by 255% on average on an AMD FirePro W8000, and outperforms CUSPARSE V7.0 by 73.7% and CSR5 by 53.6% on average on a GeForce Titan X; in double precision, our proposed scheme outperforms CUSPARSE V7.0 by 34.0% and CSR5 by 16.2% on average on a Tesla K20, and has performance equivalent to CSR5 on Intel MIC.
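One plausible reading of the bit-flag compression at the heart of BCCOO, sketched serially in C++ with block size 1 and no vertical slicing (so only the row-index compression is shown), assuming no empty rows:

```cpp
#include <cstddef>
#include <vector>

// COO with the row index compressed to one bit per nonzero: the bit is
// set when the entry starts a new row. Entries are stored row-major.
// This replaces a full integer row index per nonzero with a single bit,
// which is the bandwidth saving BCCOO exploits.
struct BitFlagCOO {
    std::vector<bool> new_row;      // 1 bit of row information per nonzero
    std::vector<std::size_t> col;   // column index per nonzero
    std::vector<double> val;        // value per nonzero
};

// y = A * x, recovering the current row by counting new-row bits.
// Assumes the matrix has no empty rows.
inline std::vector<double> spmv(const BitFlagCOO& a,
                                const std::vector<double>& x,
                                std::size_t rows) {
    std::vector<double> y(rows, 0.0);
    std::size_t row = static_cast<std::size_t>(-1);  // wraps to 0 on first bit
    for (std::size_t k = 0; k < a.val.size(); ++k) {
        if (a.new_row[k]) ++row;
        y[row] += a.val[k] * x[a.col[k]];
    }
    return y;
}
```

On a parallel device the running "count of new-row bits" becomes exactly the segmented sum/scan the paper builds, which is why the two contributions fit together.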
- Published
- 2016
27. GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems
- Author
-
Jonas Thies, Martin Galgon, Faisal Shahzad, Georg Hager, Andreas Pieper, Moritz Kreutzer, Achim Basermann, Holger Fehske, Gerhard Wellein, and Melven Röhrig-Zöllner
- Subjects
FOS: Computer and information sciences ,Institut für Simulations- und Softwaretechnik ,Interface (Java) ,Data parallelism ,Computer science ,Task parallelism ,Symmetric multiprocessor system ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,task parallelism ,01 natural sciences ,Theoretical Computer Science ,CUDA ,Software ,0202 electrical engineering, electronic engineering, information engineering ,0101 mathematics ,large scale computing ,020203 distributed computing ,Multi-core processor ,business.industry ,sparse linear algebra ,heterogeneous computing ,software library ,Computer Science - Distributed, Parallel, and Cluster Computing ,Computer Science - Mathematical Software ,Distributed, Parallel, and Cluster Computing (cs.DC) ,business ,Mathematical Software (cs.MS) ,Xeon Phi ,Information Systems - Abstract
While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack., Comment: 32 pages, 11 figures
- Published
- 2016
28. Performance portable C++ programming with RAJA
- Author
-
Arturo Vargas, Richard D. Hornung, David Beckingsale, and Thomas R. W. Scogland
- Subjects
business.product_category ,Programming language ,Computer science ,Performance tuning ,computer.software_genre ,CUDA ,Memory management ,Laptop ,Code (cryptography) ,Programming paradigm ,business ,computer ,Xeon Phi ,Range (computer programming) - Abstract
With the rapid change of computing architectures and the variety of programming models, the ability to develop performance-portable applications has become of great importance. This is particularly true in large production codes, where developing and maintaining hardware-specific versions is untenable. To simplify the development of performance-portable code, we introduce RAJA, our C++ library that allows developers to write single-source applications that can target multiple hardware and programming-model back-ends. We provide a thorough introduction to all of RAJA's features and walk through hands-on examples that will allow attendees to understand how RAJA might benefit their own applications. Attendees should bring a laptop computer to participate in the hands-on exercises. This tutorial will introduce attendees to RAJA, a C++ library for developing performance-portable applications. Attendees will learn how to write performance-portable code that can execute on a range of programming models (OpenMP, CUDA, Intel TBB, and HCC) and hardware (CPU, GPU, Xeon Phi). Specifically, attendees will learn how to convert existing C++ applications to use RAJA, and how to use RAJA's programming abstractions to expose existing parallelism in their applications without complex algorithm rewrites. We will also cover specific guidelines for using RAJA in a large application, including some common "gotchas" and how to handle memory management. Finally, attendees will learn how to categorize loops to allow for simple and systematic performance tuning on any architecture.
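The single-source pattern RAJA provides can be imitated with a tiny `forall` that dispatches on an execution-policy tag; this is a simplified sketch of the idea, not RAJA's actual templates:

```cpp
#include <cstddef>

// Execution policies: the loop body stays the same; only the policy
// chosen at the call site changes how the iterations are run. RAJA
// additionally offers CUDA and TBB policies behind the same interface.
struct seq_exec {};
struct omp_exec {};

// Sequential back-end.
template <typename Body>
void forall(seq_exec, std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
}

// OpenMP back-end: same signature, threaded execution.
template <typename Body>
void forall(omp_exec, std::size_t n, Body body) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i)
        body(static_cast<std::size_t>(i));
}
```

Retargeting a loop then means changing only the policy at the call site, e.g. `forall(omp_exec{}, n, body)` instead of `forall(seq_exec{}, n, body)`, while the loop body is untouched; this is the "no complex algorithm rewrites" property the tutorial emphasizes.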
- Published
- 2019
29. On the Portability of CPU-Accelerated Applications via Automated Source-to-Source Translation
- Author
-
Paul Sathre, Mark K. Gardner, and Wu-chun Feng
- Subjects
TOP500 ,business.industry ,Computer science ,Software_PROGRAMMINGTECHNIQUES ,computer.software_genre ,CUDA ,Software portability ,Embedded system ,Code (cryptography) ,Compiler ,business ,Field-programmable gate array ,computer ,Xeon Phi - Abstract
Over the past decade, accelerator-based supercomputers have grown from 0% to 42% performance share on the TOP500. Ideally, GPU-accelerated code on such systems should be "write once, run anywhere," regardless of the GPU device (or for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due to the sheer volume of code implemented in non-portable languages. For example, the tremendous success of CUDA, as evidenced by the vast cornucopia of CUDA-accelerated applications, makes it infeasible to manually rewrite all these applications to achieve portability. Consequently, we achieve portability by using our automated CUDA-to-OpenCL source-to-source translator called CU2CL. To demonstrate the state of the practice, we use CU2CL to automatically translate three medium-to-large, CUDA-optimized codes to OpenCL, thus enabling the codes to run on other GPU-accelerated systems (as well as CPU- or FPGA-based systems). These automatically translated codes deliver performance portability, including as much as three-fold performance improvement, on a GPU device not supported by CUDA.
- Published
- 2019
30. A Case Study for Performance Portability Using OpenMP 4.5
- Author
-
Charlene Yang, Thorsten Kurth, Jack Deslippe, and Rahulkumar Gayatri
- Subjects
Software portability ,CUDA ,Xeon ,Computer science ,Parallel computing ,Compiler ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Software_PROGRAMMINGTECHNIQUES ,computer.software_genre ,computer ,Implementation ,Xeon Phi - Abstract
In recent years, the HPC landscape has shifted away from traditional multi-core CPU systems toward energy-efficient architectures, such as many-core CPUs and accelerators like GPUs, to achieve high performance. The goal of performance portability is to enable developers to rapidly produce applications that run efficiently on a variety of these architectures, with little to no architecture-specific code adaptation required. We implement a key kernel from a material-science application using OpenMP 3.0, OpenMP 4.5, OpenACC, and CUDA on Intel architectures, Xeon and Xeon Phi, and on NVIDIA GPUs, P100 and V100. We compare the performance of the OpenMP 4.5 implementation with that of the more architecture-specific implementations, examine the performance of the OpenMP 4.5 implementation on CPUs after back-porting, and share our experience optimizing large reduction loops, as well as discuss the latest compiler status for OpenMP 4.5 and OpenACC.
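The difference between the OpenMP 3.0 and OpenMP 4.5 implementations compared in this study can be sketched on a generic reduction loop (not the paper's material-science kernel): the 4.5 version adds target offload directives, and the runtime falls back to host execution when no device is present.

```cpp
#include <cstddef>

// OpenMP 3.0 style: threads on the host CPU (or Xeon Phi in native mode).
inline double dot_host(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(n); ++i) sum += a[i] * b[i];
    return sum;
}

// OpenMP 4.5 style: the same loop offloaded to an attached accelerator.
// The map clauses move the array sections to device memory; without a
// device, the target region simply executes on the host.
inline double dot_offload(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
        map(to : a[0:n], b[0:n]) reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(n); ++i) sum += a[i] * b[i];
    return sum;
}
```

Large reduction loops like this are exactly where the paper reports the most compiler-dependent behaviour, since the `teams`/`distribute` levels must map the reduction across the device hierarchy.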
- Published
- 2019
31. Performance Impact of Memory Channels on Sparse and Irregular Algorithms
- Author
-
Oded Green, David A. Bader, Jeffrey Young, James M. Fox, and Jun Shirako
- Subjects
FOS: Computer and information sciences ,Computer Science - Performance ,business.industry ,Computer science ,Locality ,Memory bandwidth ,Thread (computing) ,Parallel computing ,Performance (cs.PF) ,CUDA ,Computer Science - Distributed, Parallel, and Cluster Computing ,Analytics ,Hardware Architecture (cs.AR) ,Graph (abstract data type) ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Latency (engineering) ,Computer Science - Hardware Architecture ,business ,Xeon Phi - Abstract
Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph-processing performance. However, in this work we demonstrate that the key factor in the utilization of the memory system for graph algorithms is not necessarily the raw bandwidth, or even the latency, of memory requests. Instead, we show that performance is proportional to the number of memory channels available to handle small data transfers with limited spatial locality. Using several widely used graph frameworks, including Gunrock (on the GPU) and GAPBS & Ligra (for CPUs), we evaluate key graph-analytics kernels using two distinct memory hierarchies, DDR-based and HBM/MCDRAM. Our results show that the differences in the peak bandwidths of several Pascal-generation GPU memory subsystems are not reflected in the performance of various analytics. Furthermore, our experiments on CPU and Xeon Phi systems demonstrate that the number of memory channels utilized can be a decisive factor in performance across several different applications. For CPU systems with smaller thread counts, the memory channels can be underutilized, while systems with high thread counts can oversaturate the memory subsystem, which leads to limited performance. Finally, we model the potential performance improvements of adding more memory channels with narrower access widths than are found in current platforms, and we analyze performance trade-offs for the two most prominent types of memory accesses found in graph algorithms: streaming and random accesses.
- Published
- 2019
- Full Text
- View/download PDF
32. Compiling SIMT Programs on Multi- and Many-Core Processors with Wide Vector Units: A Case Study with CUDA
- Author
-
John Ravi, Michela Becchi, and Hancheng Wu
- Subjects
020203 distributed computing ,POSIX Threads ,Coprocessor ,Computer science ,020207 software engineering ,02 engineering and technology ,Parallel computing ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Intrinsics ,computer.software_genre ,Instruction set ,CUDA ,MIMD ,0202 electrical engineering, electronic engineering, information engineering ,x86 ,Compiler ,SIMD ,computer ,Xeon Phi - Abstract
Manycore processors and coprocessors with wide vector extensions, such as Intel Phi and Skylake devices, have become popular due to their high throughput capability. Performance optimization on these devices requires using both their x86-compatible cores and their vector units. While the x86-compatible cores can be programmed using traditional programming interfaces following the MIMD model, such as POSIX threads, MPI and OpenMP, the SIMD vector units are harder to program. The Intel software stack provides two approaches for code vectorization: automatic vectorization through the Intel compiler and manual vectorization through vector intrinsics. While the Intel compiler often fails to vectorize code with complex control flows and function calls, the manual approach is error-prone and leads to less portable code. Hence, there has been an increasing interest in SIMT programming tools allowing the simultaneous use of x86 cores and vector units while providing programmability and code portability. However, the effective implementation of the SIMT model on these hybrid architectures is not well understood. In this work, we target this problem. First, we propose a set of compiler techniques to transform programs written using a SIMT programming model (a subset of CUDA C) into code that leverages both the x86 cores and the vector units of a hybrid MIMD/SIMD architecture, thus providing programmability, high system utilization and performance. Second, we evaluate the proposed techniques on Xeon Phi and Skylake processors using micro-benchmarks and real-world applications. Third, we compare the resulting performance with that achieved by the same code on GPUs. Based on this analysis, we point out the main challenges in supporting the SIMT model on hybrid MIMD/SIMD architectures, while providing performance comparable to that of SIMT systems (e.g., GPUs).
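The essence of the proposed compiler transformation is to lower a SIMT kernel into a loop over logical thread IDs whose inner lane loop the host compiler can vectorise. A hand-written C++ sketch of such output for a hypothetical `out[tid] = in[tid] * 2` kernel (the logical SIMD width is an assumption, not a value from the paper):

```cpp
#include <cstddef>

// SIMT source (conceptually): each CUDA thread computes
//     out[tid] = in[tid] * 2.0f;
// After translation, the per-thread body becomes the body of a loop
// over all logical thread IDs. The inner lane loop has unit stride and
// no cross-lane dependence, so it maps one lane per vector slot and is
// a natural target for the host compiler's auto-vectoriser.
inline void translated_kernel(const float* in, float* out,
                              std::size_t num_threads) {
    constexpr std::size_t kWarp = 8;  // logical SIMD width (assumption)
    for (std::size_t base = 0; base < num_threads; base += kWarp) {
        #pragma omp simd
        for (std::size_t lane = 0;
             lane < kWarp && base + lane < num_threads; ++lane) {
            std::size_t tid = base + lane;  // reconstructed thread index
            out[tid] = in[tid] * 2.0f;
        }
    }
}
```

The outer loop over `base` is what the x86 cores share under MIMD threading, while the inner lane loop occupies the wide vector units; handling divergence and synchronization within this structure is where the paper's compiler techniques come in.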
- Published
- 2018
33. CosmoFlow: Using Deep Learning to Learn the Universe at Scale
- Author
-
Tuomas Kärnä, Lei Shao, Victor W. Lee, Amrita Mathuriya, Nalini Kumar, Simon J. Pennycook, Kristyn Maschhoff, Peter Mendygral, Lawrence Meadows, Jason Sewall, Diana Moise, Prabhat Prabhat, James Arnemann, Michael F. Ringenburg, Siyu He, Shirley Ho, and Deborah Bard
- Subjects
FOS: Computer and information sciences ,tensorflow ,Computer Science - Machine Learning ,Coprocessor ,Cosmology and Nongalactic Astrophysics (astro-ph.CO) ,Computer science ,cs.LG ,High Performance Computing ,FOS: Physical sciences ,02 engineering and technology ,01 natural sciences ,Machine Learning (cs.LG) ,Machine Learning ,CUDA ,Affordable and Clean Energy ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,010303 astronomy & astrophysics ,Instrumentation and Methods for Astrophysics (astro-ph.IM) ,020203 distributed computing ,Xeon ,business.industry ,Deep learning ,Computational Physics (physics.comp-ph) ,Supercomputer ,Cosmology ,Computer engineering ,physics.comp-ph ,Scalability ,astro-ph.CO ,Artificial intelligence ,maching learning ,business ,Astrophysics - Instrumentation and Methods for Astrophysics ,Physics - Computational Physics ,Xeon Phi ,Astrophysics - Cosmology and Nongalactic Astrophysics ,astro-ph.IM - Abstract
Deep learning is a promising tool to determine the physical model that describes our universe. To handle the considerable computational cost of this problem, we present CosmoFlow: a highly scalable deep learning application built on top of the TensorFlow framework. CosmoFlow uses efficient implementations of 3D convolution and pooling primitives, together with improvements in threading for many element-wise operations, to improve training performance on Intel(C) Xeon Phi(TM) processors. We also utilize the Cray PE Machine Learning Plugin for efficient scaling to multiple nodes. We demonstrate fully synchronous data-parallel training on 8192 nodes of Cori with 77% parallel efficiency, achieving 3.5 Pflop/s sustained performance. To our knowledge, this is the first large-scale science application of the TensorFlow framework at supercomputer scale with fully synchronous training. These enhancements enable us to process large 3D dark matter distributions and predict the cosmological parameters $\Omega_M$, $\sigma_8$ and $n_s$ with unprecedented accuracy., Comment: 11 pages, 6 pages, presented at SuperComputing 2018
- Published
- 2018
34. Delivering Performance-Portable Stencil Computations on CPUs and GPUs Using Bricks
- Author
-
Mary Hall, Hans Johansen, Samuel Williams, and Tuowen Zhao
- Subjects
CUDA ,Speedup ,Memory hierarchy ,Xeon ,Stencil code ,Computer science ,Code generation ,Parallel computing ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Software_PROGRAMMINGTECHNIQUES ,Stencil ,Xeon Phi ,ComputingMethodologies_COMPUTERGRAPHICS - Abstract
Achieving high performance on stencil computations poses a number of challenges on modern architectures. The optimization strategy varies significantly across architectures, types of stencils, and types of applications. The standard approach to adapting stencil computations to different architectures, used by both compilers and application programmers, is iteration space tiling, whereby the data footprint of the computation and its computation partitioning are adjusted to match the memory hierarchy and available parallelism of different platforms. In this paper, we explore an alternative performance portability strategy for stencils: a data layout library called bricks that adapts data footprint and parallelism through fine-grained data blocking. Bricks are designed to exploit the inherent multi-dimensional spatial locality of stencils, facilitating improved code generation that can adapt to CPUs or GPUs, and reducing pressure on the memory system. We demonstrate that bricks are performance-portable across CPU and GPU architectures and afford performance advantages over various tiling strategies, particularly for modern multi-stencil and high-order stencil computations. For a range of stencil computations, we achieve high performance on both the Intel Knights Landing (Xeon Phi) and Skylake (Xeon) CPUs as well as the NVIDIA P100 (Pascal) GPU, delivering up to a 5x speedup against tiled code.
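The fine-grained blocking idea can be illustrated with a small Python sketch (illustrative only; real bricks also store each small block contiguously in memory and generate vectorized code), showing that a brick-by-brick traversal of a 5-point stencil reproduces the naive result:

```python
# Illustrative-only contrast: a naive 2D 5-point stencil versus the same
# stencil traversed brick by brick (fixed fine-grained blocks). Real bricks
# also store each block contiguously; here we only show that the brick
# traversal reproduces the naive result.
B = 4  # brick edge length (illustrative choice)

def stencil_naive(u, n):
    out = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = u[i][j] - 0.25 * (u[i-1][j] + u[i+1][j]
                                          + u[i][j-1] + u[i][j+1])
    return out

def stencil_bricked(u, n):
    out = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, B):            # visit one brick at a time
        for bj in range(0, n, B):
            for i in range(max(bi, 1), min(bi + B, n - 1)):
                for j in range(max(bj, 1), min(bj + B, n - 1)):
                    out[i][j] = u[i][j] - 0.25 * (u[i-1][j] + u[i+1][j]
                                                  + u[i][j-1] + u[i][j+1])
    return out

n = 8
u = [[float(i * n + j) for j in range(n)] for i in range(n)]
```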
- Published
- 2018
35. A Technique for Large-Scale 2D Seismic Field Simulations on Supercomputers
- Author
-
Valery V. Kovalevsky and Dmitry A. Karavaev
- Subjects
Computer Science::Performance ,CUDA ,Coprocessor ,Scale (ratio) ,Computer science ,Computation ,Isotropy ,Computer Science::Mathematical Software ,Finite-difference time-domain method ,Xeon Phi ,Synthetic data ,ComputingMethodologies_COMPUTERGRAPHICS ,Computational science - Abstract
An algorithm for the simulation of elastic wave propagation in 2D isotropic inhomogeneous media with complex geometrical structure is presented. A parallel implementation of the FDTD method of fourth order in space to perform calculations on high-performance computing systems with different architectures (CPUs, GPUs, or Xeon Phi coprocessors) is discussed. The proposed technique of mathematical modeling with the use of MPI and CUDA is applied to design program codes for computations on a realistic long-distance model. Program codes for single-device and multi-device use are developed. A large-scale geophysical model of the Baikal rift zone is reconstructed. New synthetic data depicting the structure of the seismic field for the rift zone are presented.
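As a much-reduced illustration of the finite-difference time-domain idea (the paper uses a fourth-order scheme in 2D; this hypothetical sketch is second-order and 1D, with made-up parameters), a leapfrog update for the scalar wave equation looks like:

```python
import math

# Second-order, 1D leapfrog sketch of the FDTD idea (the paper's scheme is
# fourth-order and 2D; all parameters here are illustrative).
def fdtd_1d(n=101, steps=40, c=1.0, dx=1.0, dt=0.5):
    r2 = (c * dt / dx) ** 2                # squared Courant number (stable if <= 1)
    u = [math.exp(-0.1 * (i - n // 2) ** 2) for i in range(n)]
    u_prev = u[:]                          # zero initial velocity
    for _ in range(steps):
        u_next = [0.0] * n                 # fixed (zero) boundaries
        for i in range(1, n - 1):
            u_next[i] = (2.0 * u[i] - u_prev[i]
                         + r2 * (u[i+1] - 2.0 * u[i] + u[i-1]))
        u_prev, u = u, u_next
    return u
```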
- Published
- 2018
36. Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi
- Author
-
Bertil Schmidt, Yongchao Liu, and Tuan Tu Tran
- Subjects
020203 distributed computing ,Speedup ,Coprocessor ,Xeon ,Computer Networks and Communications ,Computer science ,02 engineering and technology ,Parallel computing ,Supercomputer ,Computer Graphics and Computer-Aided Design ,Theoretical Computer Science ,CUDA ,Artificial Intelligence ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,SIMD ,Bitwise operation ,Software ,Word (computer architecture) ,Xeon Phi - Abstract
Advanced SIMD features on GPUs and Xeon Phis promote efficient long-pattern search. A tiled approach to accelerating the Wu-Manber algorithm on GPUs has been proposed. Both the GPU and the Xeon Phi yield two orders-of-magnitude speedups over one CPU core. The GPU-based version with tiling runs up to 2.9× faster than the Xeon Phi version. Approximate pattern matching (APM) aims to find the occurrences of a pattern inside a subject text while allowing a limited number of errors. It has been widely used in many application areas such as bioinformatics and information retrieval. Bit-parallel APM takes advantage of the intrinsic parallelism of bitwise operations inside a machine word. This approach typically encodes non-deterministic finite automaton (NFA) states or value differences between adjacent cells of a dynamic programming matrix in the form of bit arrays. Wu-Manber (WM) is a well-known bit-parallel APM algorithm, which simulates an NFA and gains parallel efficiency by performing multiple state updates within a machine word. An important parameter is the machine word size (e.g. 32 or 64 bits for CPUs). Due to increasing vector capabilities, efficient mapping of bit-parallel APM algorithms onto modern high-performance computing architectures is an interesting research topic. Prominent examples are Xeon Phi coprocessors and CUDA-enabled GPUs, which provide words of size 512 bits (by means of vector registers) and 1024 bits (by means of warps), respectively. In this paper, we investigate mappings of the WM algorithm onto these two accelerator types. Both architectures are able to achieve around two orders-of-magnitude speedups compared to a single-threaded CPU implementation. Moreover, our tile-based implementation on a GeForce Titan graphics card runs up to 2.9× faster than our implementation on an Intel Xeon Phi 5110P. Source code is available at http://xbitpar.sourceforge.net.
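The bit-parallel NFA simulation that Wu-Manber performs can be sketched in Python, where arbitrary-precision integers stand in for machine words or vector registers (an illustrative re-implementation, not the paper's released code):

```python
# Illustrative re-implementation of Wu-Manber-style bit-parallel approximate
# matching (Levenshtein errors), using Python big ints as "machine words".
# wm_search returns the 0-based end positions of matches with <= k errors.
def wm_search(pattern, text, k):
    m = len(pattern)
    B = {}                                     # per-character pattern bit mask
    for i, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << i)
    R = [(1 << d) - 1 for d in range(k + 1)]   # R[d]: prefixes alive with <= d errors
    matches = []
    for pos, ch in enumerate(text):
        mask = B.get(ch, 0)
        old = R[0]
        R[0] = ((R[0] << 1) | 1) & mask        # exact-match row
        for d in range(1, k + 1):
            prev = R[d]
            R[d] = ((((R[d] << 1) | 1) & mask)  # match
                    | old                        # insertion in the text
                    | (old << 1)                 # substitution
                    | (R[d - 1] << 1)            # deletion of a pattern char
                    | ((1 << d) - 1))            # first d chars may be deleted
            old = prev
        if R[k] & (1 << (m - 1)):
            matches.append(pos)
    return matches
```

Each text character costs O(k) word operations regardless of pattern length, which is why wide words (512-bit vector registers, 1024-thread warps) are attractive for this algorithm.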
- Published
- 2016
37. HIPAcc: A Domain-Specific Language and Compiler for Image Processing
- Author
-
Jürgen Teich, Mario Körner, Wieland Eckert, Oliver Reiche, Frank Hannig, and Richard Membarth
- Subjects
Domain-specific language ,Programming language ,Computer science ,Memory bandwidth ,02 engineering and technology ,computer.software_genre ,RenderScript ,020202 computer hardware & architecture ,CUDA ,Computational Theory and Mathematics ,Computer architecture ,Hardware and Architecture ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Code generation ,Compiler ,Graphics ,computer ,Implementation ,Xeon Phi - Abstract
Domain-specific languages (DSLs) provide high-level and domain-specific abstractions that allow expressive and concise algorithm descriptions. Since the DSL description also hides the properties of the target hardware, DSLs are a promising path to targeting different parallel and heterogeneous hardware from the same algorithm description. In theory, the DSL description can capture all characteristics of the algorithm that are required to generate highly efficient parallel implementations. However, most frameworks do not make use of this knowledge and their performance cannot reach that of optimized library implementations. In this article, we present the HIPAcc framework, a DSL and source-to-source compiler for image processing. We show that domain knowledge can be captured in the language and that this knowledge enables us to generate tailored implementations for a given target architecture. Back ends for CUDA, OpenCL, and Renderscript allow us to target discrete graphics processing units (GPUs) as well as mobile, embedded GPUs. Exploiting the captured domain knowledge, we can generate specialized algorithm variants that reach the maximal performance achievable given the peak memory bandwidth. These implementations significantly outperform state-of-the-art domain-specific languages and libraries.
- Published
- 2016
38. ViennaCL---Linear Algebra Library for Multi- and Many-Core Architectures
- Author
-
Ansgar Jüngel, Philippe Tillet, Siegfried Selberherr, Karl Rupp, Tibor Grasser, Florian Rudolf, Josef Weinbub, and Andreas Morhammer
- Subjects
020203 distributed computing ,Multi-core processor ,Xeon ,Computer science ,Interface (Java) ,Applied Mathematics ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,01 natural sciences ,Computational science ,Computational Mathematics ,CUDA ,Linear algebra ,0202 electrical engineering, electronic engineering, information engineering ,Programming paradigm ,Benchmark (computing) ,0101 mathematics ,Xeon Phi - Abstract
CUDA, OpenCL, and OpenMP are popular programming models for the multicore architectures of CPUs and many-core architectures of GPUs or Xeon Phis. At the same time, computational scientists face the question of which programming model to use to obtain their scientific results. We present the linear algebra library ViennaCL, which is built on top of all three programming models, thus enabling computational scientists to interface to a single library, yet obtain high performance for all three hardware types. Since the respective compute back end can be selected at runtime, one can seamlessly switch between different hardware types without the need for error-prone and time-consuming recompilation steps. We present new benchmark results for sparse linear algebra operations in ViennaCL, complementing results for the dense linear algebra operations in ViennaCL reported in earlier work. Comparisons with vendor libraries show that ViennaCL provides better overall performance for sparse matrix-vector and sparse mat...
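The runtime back-end selection idea described above, one interface with several interchangeable compute back ends chosen without recompilation, can be sketched as follows (a hypothetical toy dispatch layer, not ViennaCL's actual API):

```python
# Hypothetical toy dispatch layer (not ViennaCL's API): one routine, several
# registered back ends, selected by a runtime parameter instead of recompiling.
class Backend:
    registry = {}
    def __init_subclass__(cls, name=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if name is not None:
            Backend.registry[name] = cls

class LoopBackend(Backend, name="cpu"):       # stands in for an OpenMP path
    @staticmethod
    def dot(x, y):
        acc = 0.0
        for a, b in zip(x, y):
            acc += a * b
        return acc

class FusedBackend(Backend, name="accel"):    # stands in for a CUDA/OpenCL path
    @staticmethod
    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

def dot(x, y, backend="cpu"):
    """Single user-facing interface; the back end is a runtime choice."""
    return Backend.registry[backend].dot(x, y)
```

The design point this mirrors is that user code calls one `dot` everywhere; switching hardware is a change of a single runtime parameter rather than a recompilation.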
- Published
- 2016
39. HSTREAM: A directive-based language extension for heterogeneous stream computing
- Author
-
Suejb Memeti and Sabri Pllana
- Subjects
FOS: Computer and information sciences ,020203 distributed computing ,Computer Science - Programming Languages ,Xeon ,Computer science ,Stream ,02 engineering and technology ,Parallel computing ,computer.software_genre ,Instruction set ,CUDA ,0202 electrical engineering, electronic engineering, information engineering ,Programming paradigm ,Benchmark (computing) ,020201 artificial intelligence & image processing ,Compiler ,computer ,Xeon Phi ,Programming Languages (cs.PL) - Abstract
Big data streaming applications require the utilization of heterogeneous parallel computing systems, which may comprise multiple multi-core CPUs and many-core accelerating devices such as NVIDIA GPUs and Intel Xeon Phis. Programming such systems requires advanced knowledge of several hardware architectures and device-specific programming models, including OpenMP and CUDA. In this paper, we present HSTREAM, a compiler directive-based language extension to support programming stream computing applications for heterogeneous parallel computing systems. The HSTREAM source-to-source compiler aims to increase programming productivity by enabling programmers to annotate the parallel regions for heterogeneous execution and generate target-specific code. The HSTREAM runtime automatically distributes the workload across CPUs and accelerating devices. We demonstrate the usefulness of the HSTREAM language extension with various applications from the STREAM benchmark. Experimental evaluation results show that HSTREAM can keep the same programming simplicity as OpenMP, and the generated code can deliver performance beyond what CPUs-only and GPUs-only executions can deliver., Preprint, 21st IEEE International Conference on Computational Science and Engineering (CSE 2018)
- Published
- 2018
40. SIMD Monte-Carlo Numerical Simulations Accelerated on GPU and Xeon Phi
- Author
-
Vincent Rivola, Bastien Plazolles, Martin Spel, Didier El Baz, and Pascal Gegout, Équipe Calcul Distribué et Asynchronisme (LAAS-CDA), Laboratoire d'analyse et d'architecture des systèmes (LAAS, CNRS, Université de Toulouse), and Géosciences Environnement Toulouse (GET, Observatoire Midi-Pyrénées, Université de Toulouse)
- Subjects
020301 aerospace & aeronautics ,Computer science ,Monte Carlo method ,Task parallelism ,02 engineering and technology ,Parallel computing ,Theoretical Computer Science ,Computer Science::Performance ,CUDA ,0203 mechanical engineering ,Vectorization (mathematics) ,0202 electrical engineering, electronic engineering, information engineering ,Code (cryptography) ,020201 artificial intelligence & image processing ,SIMD ,[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] ,Envelope (mathematics) ,Software ,Xeon Phi ,Information Systems - Abstract
International audience; The efficiency of a pleasingly parallel application is studied on several computing platforms. A real-world problem, i.e., Monte-Carlo numerical simulations of stratospheric balloon envelope drift descent, is considered. We detail the optimization of the SIMD parallel codes on the K40 and K80 GPUs as well as on the Intel Xeon Phi. We emphasize loop and task parallelism, multi-threading, and vectorization, respectively. The experiments show that the GPU and MIC permit one to decrease computing time by non-negligible factors compared to a parallel code implemented on a two-socket CPU (E5-2680 v2), which finally allows us to use these devices in operational conditions.
- Published
- 2018
41. Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs
- Author
-
Moritz Kreutzer, Dominik Ernst, Alan R. Bishop, Holger Fehske, Georg Hager, Kengo Nakajima, and Gerhard Wellein
- Subjects
Address space ,Computer science ,Pipeline (computing) ,010103 numerical & computational mathematics ,Parallel computing ,01 natural sciences ,Linear subspace ,Chebyshev filter ,CUDA ,Filter (video) ,0103 physical sciences ,0101 mathematics ,010306 general physics ,Subspace topology ,Eigenvalues and eigenvectors ,Xeon Phi ,Block (data storage) ,Sparse matrix - Abstract
Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics to compute bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address space supported by both architectures under consideration: Intel Xeon Phi “Knights Landing” and Nvidia “Pascal”/“Volta.” After a thorough performance analysis of the single-device implementations using the roofline model we propose two optimizations: (1) Subspace blocking is applied for improved performance and data access efficiency. We also show that it allows transparently handling problems much larger than the high-bandwidth memory without significant performance penalties. (2) Pipelining of communication and computation phases of successive subspaces is implemented to hide communication costs without extra memory traffic. As an application scenario we perform filter diagonalization studies for topological quantum matter. Performance numbers on up to 2048 nodes of the Oakforest-PACS and Piz Daint supercomputers are presented, achieving beyond 500 Tflop/s for computing \(10^2\) inner eigenvalues of sparse matrices of dimension \(4\cdot 10^9\).
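The heart of Chebyshev filtering is the three-term recurrence T_k(A)v = 2A T_{k-1}(A)v - T_{k-2}(A)v. A minimal Python sketch on a diagonal matrix (illustrative only; the matrices in the work above are large, sparse, and processed as block vectors) is:

```python
import math

# Minimal sketch of the Chebyshev three-term recurrence used by filter
# diagonalization: y = T_k(A) v with T_0(A)v = v, T_1(A)v = Av, and
# T_k(A)v = 2 A T_{k-1}(A)v - T_{k-2}(A)v. A diagonal A keeps the demo
# directly checkable against T_k(x) = cos(k arccos x) for |x| <= 1.
def matvec_diag(diag, v):
    return [d * x for d, x in zip(diag, v)]

def chebyshev_apply(diag, v, k):
    if k == 0:
        return v[:]
    t_prev, t_cur = v[:], matvec_diag(diag, v)
    for _ in range(2, k + 1):
        t_next = [2.0 * a - b for a, b in zip(matvec_diag(diag, t_cur), t_prev)]
        t_prev, t_cur = t_cur, t_next
    return t_cur

diag = [0.1, 0.5, -0.9]   # illustrative eigenvalues inside [-1, 1]
```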
- Published
- 2018
42. Abelian: A Compiler for Graph Analytics on Distributed, Heterogeneous Platforms
- Author
-
Roshan Dathathri, Andrew Lenharth, Loc Hoang, Keshav Pingali, and Gurbinder Gill
- Subjects
Computer science ,020207 software engineering ,Graph theory ,Symmetric multiprocessor system ,02 engineering and technology ,Parallel computing ,Supercomputer ,computer.software_genre ,CUDA ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Programming paradigm ,Graph (abstract data type) ,Compiler ,Abelian group ,computer ,Xeon Phi - Abstract
The trend towards processor heterogeneity and distributed-memory has significantly increased the complexity of parallel programming. In addition, the mix of applications that need to run on parallel platforms today is very diverse, and includes graph applications that typically have irregular memory accesses and unpredictable control-flow. To simplify the programming of graph applications on such platforms, we have implemented a compiler called Abelian that translates shared-memory descriptions of graph algorithms written in the Galois programming model into efficient code for distributed-memory platforms with heterogeneous processors. The compiler manages inter-device synchronization and communication while leveraging state-of-the-art compilers for generating device-specific code. The experimental results show that the novel communication optimizations in the Abelian compiler reduce the volume of communication by 23\(\times \), enabling the code produced by Abelian to match the performance of handwritten distributed CPU and GPU programs that use the same runtime. The programs produced by Abelian for distributed CPUs are roughly 2.4\(\times \) faster than those in the Gemini system, a third-party distributed CPU-only system, demonstrating that Abelian can manage heterogeneity and distributed-memory successfully while generating high-performance code.
- Published
- 2018
43. libtropicon: A Scalable Library for Computing Intersection Points of Generic Tropical Hyper-surfaces
- Author
-
Tianran Chen
- Subjects
Interface (Java) ,Computer science ,010103 numerical & computational mathematics ,02 engineering and technology ,Algebraic geometry ,Parallel computing ,01 natural sciences ,CUDA ,Intersection ,Computer cluster ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,0101 mathematics ,Xeon Phi ,Block (data storage) - Abstract
The computation of intersection points of generic tropical hyper-surfaces is a fundamental problem in computational algebraic geometry. An efficient algorithm for solving this problem will be a basic building block in many higher-level algorithms for studying tropical varieties, computing mixed volume, enumerating mixed cells, constructing polyhedral homotopies, etc. libtropicon is a library for computing intersection points of generic tropical hyper-surfaces that provides a unified framework where several conceptually opposite approaches coexist and complement one another. In particular, great efficiency is achieved by the data cross-feeding of the “pivoting” and “elimination” steps: data by-products generated by the pivoting step are selectively saved to bootstrap the elimination step, and vice versa. The core algorithm is designed to be naturally parallel and highly scalable, and the implementation directly supports multi-core architectures, computer clusters, and GPUs based on CUDA or ROCm/OpenCL technology. Many-core architectures such as Intel Xeon Phi are also partially supported. This library also includes interface layers that allow it to be tightly integrated into the existing ecosystem of software in computational algebraic geometry.
- Published
- 2018
44. Highly Heterogeneous Smith-Waterman (HHeterSW): Exploiting heterogeneous architectures to speed-up bioinformatics algorithms
- Author
-
Esteban, Francisco J., Ferusic, Adis, Hernández Molina, Pilar, Caballero, Juan Antonio, Gálvez, Sergio, Dorado, Gabriel, Ministerio de Economía y Competitividad (España), CSIC - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Junta de Andalucía, and Universidad de Córdoba (España)
- Subjects
Nucleic acids ,Sequence alignment ,Xeon Phi ,Cuda ,Peptides - Abstract
Paper presented at the VIII Jornadas de Divulgación de la Investigación en Biología Molecular, Celular, Genética y Biotecnología, held in Córdoba, 13-15 June 2018. Second-generation sequencing (SGS) and third-generation sequencing (TGS) are exponentially increasing the amount of data available for bioinformatics analyses. Several strategies have been devised to cope with this challenge. Firstly, microprocessors with several cores have been designed for parallel processing. Secondly, new bioinformatics algorithms have been developed to exploit such parallel-processing capabilities. Thirdly, heterogeneous architectures using both Central Processing Units (CPU) and Graphics Processing Units (GPU) can be further deployed to speed up bioinformatics algorithms. A Highly Heterogeneous Smith-Waterman (HHeterSW) algorithm is described as a proof of concept for the latter strategy. This way, a 2.58-fold speed increase was obtained when compared to non-hybrid implementations. Likewise, it was faster in 78% of tests when compared to the popular Basic Local-Alignment Search Tool (BLAST), effectively exploiting available hardware. Supported by "Ministerio de Economía y Competitividad" (MINECO grants AGL2010-17316 and BIO2011-15237-E) and "Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria" (MINECO and INIA RF2012-00002-C02-02); "Consejería de Agricultura y Pesca" (041/C/2007, 75/C/2009 and 56/C/2010), "Consejería de Economía, Innovación y Ciencia" (P11-AGR-7322 and P12-AGR-0482), and "Grupo PAI" (AGR-248) of "Junta de Andalucía"; and "Universidad de Córdoba" ("Ayuda a Grupos"), Spain.
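The Smith-Waterman recurrence that HHeterSW distributes across CPUs and GPUs can be written compactly in Python (a hedged sketch with illustrative scoring parameters, not the paper's implementation):

```python
# Plain-Python Smith-Waterman local alignment score. The scoring values
# (+2 match, -1 mismatch, -2 gap) are illustrative, not the paper's.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # local alignment: scores are clamped at zero
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # (mis)match
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best
```

The quadratic dependency structure of the matrix H (each cell needs its left, upper, and diagonal neighbors) is what makes anti-diagonal and multi-device parallelizations of this kernel worthwhile.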
- Published
- 2018
45. Optimization of Hierarchical Matrix Computation on GPU
- Author
-
Satoshi Ohshima, Rio Yokota, Akihiro Ida, and Ichitaro Yamazaki
- Subjects
Computer science ,Computation ,Hierarchical matrix ,010102 general mathematics ,010103 numerical & computational mathematics ,01 natural sciences ,Computational science ,Matrix (mathematics) ,Kernel (linear algebra) ,CUDA ,Kernel (image processing) ,Linear algebra ,Computer Science::Mathematical Software ,Multiplication ,0101 mathematics ,Xeon Phi ,Sparse matrix - Abstract
The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (\(\mathcal {H}\)-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix computations. However, the computation of \(\mathcal {H}\)-matrices is more complex than that of dense and sparse matrices; thus, accelerating \(\mathcal {H}\)-matrix computations is required. We focus on \(\mathcal {H}\)-matrix-vector multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement five GPU kernels and compare execution times among various processors (Broadwell-EP, Skylake-SP, and Knights Landing) using OpenMP. The results show that, although HMVM can be computed as many small GEMV kernels, merging such kernels into a single GPU kernel was the most effective implementation. Moreover, the performance of BATCHED BLAS in the MAGMA library was comparable to that of the manually tuned GPU kernel.
- Published
- 2018
46. Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
- Author
-
James Price, Hans Joachim Pflug, Christian Terboven, Matthias S. Müller, and Jonas Hahnfeld
- Subjects
020203 distributed computing ,Coprocessor ,Computer science ,02 engineering and technology ,Pascal (programming language) ,Parallel computing ,Software_PROGRAMMINGTECHNIQUES ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,010502 geochemistry & geophysics ,01 natural sciences ,Abstraction layer ,CUDA ,Asynchronous communication ,Conjugate gradient method ,0202 electrical engineering, electronic engineering, information engineering ,Programming paradigm ,computer ,Xeon Phi ,0105 earth and related environmental sciences ,computer.programming_language - Abstract
Accelerator devices are increasingly used to build large supercomputers, and current installations usually include more than one accelerator per system node. To keep all devices busy, kernels have to be executed concurrently, which can be achieved via asynchronous kernel launches. This work compares the performance of an implementation of the Conjugate Gradient method with CUDA, OpenCL, and OpenACC on NVIDIA Pascal GPUs. Furthermore, it takes a look at Intel Xeon Phi coprocessors when programmed with OpenCL and OpenMP. In doing so, it tries to answer the question of whether the higher abstraction level of directive-based models is inferior to lower-level paradigms in terms of performance.
- Published
- 2018
47. Numerical simulation of compressible flows on heterogeneous computational architecture
- Author
-
Pavel Vashchenkov, A. A. Shershnev, and Alexander V. Kashkovsky
- Subjects
Source code ,Computer simulation ,Computer science ,media_common.quotation_subject ,Parallel computing ,Software_PROGRAMMINGTECHNIQUES ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Computational architecture ,CUDA ,Compressibility ,Code (cryptography) ,Adaptation (computer science) ,Xeon Phi ,media_common - Abstract
A technique for adapting the HyCFS numerical code, originally developed for supercomputers with graphics processing units (GPUs), to other computational platforms, such as conventional CPU-based systems and new supercomputers based on Intel Xeon Phi co-processors, is presented. The main idea of the adaptation is to use OpenMP threads instead of CUDA threads. This approach makes it possible to use a unified source code for different platforms.
- Published
- 2018
48. Monte Carlo Methods for Massively Parallel Computers
- Author
-
Martin Weigel
- Subjects
Canonical ensemble ,education.field_of_study ,Markov chain ,Computer science ,Monte Carlo method ,Population ,Parallel computing ,01 natural sciences ,010305 fluids & plasmas ,Computational science ,CUDA ,0103 physical sciences ,010306 general physics ,Field-programmable gate array ,education ,Massively parallel ,Xeon Phi - Abstract
Applications that require substantial computational resources today cannot avoid the use of heavily parallel machines. Embracing the opportunities of parallel computing and especially the possibilities provided by a new generation of massively parallel accelerator devices such as GPUs, Intel's Xeon Phi or even FPGAs enables applications and studies that are inaccessible to serial programs. Here we outline the opportunities and challenges of massively parallel computing for Monte Carlo simulations in statistical physics, with a focus on the simulation of systems exhibiting phase transitions and critical phenomena. This covers a range of canonical ensemble Markov chain techniques as well as generalized ensembles such as multicanonical simulations and population annealing. While the examples discussed are for simulations of spin systems, many of the methods are more general and moderate modifications allow them to be applied to other lattice and off-lattice problems including polymers and particle systems. We discuss important algorithmic requirements for such highly parallel simulations, such as the challenges of random-number generation for such cases, and outline a number of general design principles for parallel Monte Carlo codes to perform well.
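A key algorithmic requirement noted above is sound random-number generation for many parallel workers. The sketch below gives each worker its own reproducibly seeded stream (workers run sequentially here for clarity; on a GPU or Xeon Phi each stream would be bound to one thread or block):

```python
import math
import random

# Parallel Monte Carlo sketch: each worker owns an independent,
# reproducibly seeded RNG stream, so results are deterministic and
# workers never share generator state.
def mc_pi_stream(seed, samples):
    rng = random.Random(seed)              # private per-worker stream
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:           # point falls inside the quarter circle
            hits += 1
    return hits

def mc_pi(streams=4, samples=25_000):
    total = sum(mc_pi_stream(seed, samples) for seed in range(streams))
    return 4.0 * total / (streams * samples)
```

Sequential seeds are adequate for a toy; production parallel simulations use counter-based or jump-ahead generators to guarantee non-overlapping streams.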
- Published
- 2017
49. Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on Next-Generation Architectures
- Author
-
Simon Garcia de Gonzalo, Wen-Mei Hwu, Simon D. Hammond, and Christian Robert Trott
- Subjects
Profiling (computer programming), Computer science, Sparse matrix-vector multiplication, Matrix (mathematics), Kernel (linear algebra), CUDA, Computer engineering, Xeon Phi, Sparse matrix - Abstract
Sparse-Matrix Vector products (SpMV) are highly irregular computational kernels that can be found in a diverse collection of high-performance science applications. Performance for this important kernel is often highly correlated with the associated matrix sparsity, which, in turn, governs the computational granularity and, therefore, the efficiency of the memory system. In this paper, we propose to extend the current set of Kokkos profiling tools with an autotuner that can iterate over possible choices for thread-team size and vector width, taking advantage of runtime information to choose the optimal parameters for a particular input. This approach allows an iterative application that calls the same kernel multiple times to continue to progress towards a solution while, at the same time, relieving the application programmer of the burden of knowing details of the underlying hardware and of accounting for variable inputs. We compare the autotuner approach against a fixed approach that attempts to use all the hardware resources all the time, and show that the optimal choice made by the autotuner differs significantly between the two latest classes of accelerator architectures. After 100 iterations we identify which subset of the matrices benefits from improved performance, while the others are near the break-even point, where the overhead of the tool has been completely hidden. We highlight the properties of sparse matrices that can help determine when autotuning will be of benefit. Finally, we connect the overhead of the autotuner to specific sparsity patterns and hardware resources.
- Published
- 2017
50. MILC Code Performance on High End CPU and GPU Supercomputer Clusters
- Author
-
Doug Toussaint, Steven Gottlieb, Ruizi Li, and Carleton DeTar
- Subjects
Memory hierarchy, Physics, Programming complexity, High Energy Physics - Lattice (hep-lat), Parallel computing, Computational Physics, Supercomputer, CUDA, Conjugate gradient method, Parallelism, Xeon Phi - Abstract
With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
- Published
- 2017