6 results for "Sameh S. Sharkawi"
Search Results
2. The high-speed networks of the Summit and Sierra supercomputers
- Author
Michael Kagan, Sameh S. Sharkawi, Bryan S. Rosenburg, Richard L. Graham, George Chochia, Craig B. Stunkel, and Gilad Shainer
- Subjects
Remote direct memory access, General Computer Science, Computer science, Quality of service, Message Passing Interface, InfiniBand, Network topology, Supercomputer, Scalability, Operating system, IBM
- Abstract
Oak Ridge National Laboratory's Summit supercomputer and Lawrence Livermore National Laboratory's Sierra supercomputer use an InfiniBand interconnect in a Fat-tree network topology, interconnecting all compute, storage, administration, and management nodes into one linearly scalable network. These networks are based on Mellanox 100-Gb/s EDR InfiniBand ConnectX-5 adapters and Switch-IB2 switches, with compute-rack packaging and cooling contributions from IBM. These devices support in-network computing acceleration engines such as the Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), graphics processing unit (GPU) Direct RDMA, advanced adaptive routing, Quality of Service, and other network and application acceleration features (see the sketch after this entry). The overall IBM Spectrum Message Passing Interface (MPI) messaging software stack is built on Open MPI and was a collaboration among IBM, Mellanox, and NVIDIA to optimize direct communication between endpoints, whether compute nodes (with IBM POWER CPUs, NVIDIA GPUs, and flash memory devices) or POWER-hosted storage nodes. The Fat-tree network can isolate traffic among the compute partitions and to/from the storage subsystem, providing more predictable application performance. In addition, the high level of redundancy of this network and its reconfiguration capability ensure reliable high performance even after network component failures. This article details the hardware and software architecture and performance of the networks and describes a number of the high-performance computing (HPC) enhancements engineered into this generation of InfiniBand.
- Published
- 2020
- Full Text
- View/download PDF
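The in-network reduction engine mentioned in the entry above (SHARP) targets MPI collectives such as MPI_Allreduce. As a rough, hypothetical illustration of the kind of operation such engines accelerate, the following C/MPI micro-benchmark times a small MPI_Allreduce; whether the reduction is actually offloaded to the switch fabric depends on the MPI library and fabric configuration, which this sketch does not control.

```c
/* Minimal MPI_Allreduce timing sketch (hypothetical micro-benchmark).
 * On a SHARP-enabled fabric, small allreduces like this one may be
 * offloaded to the switches; nothing in this code enforces that. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = (double)rank, out = 0.0;
    const int iters = 1000;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg MPI_Allreduce latency: %.3f us\n",
               1e6 * (t1 - t0) / iters);

    MPI_Finalize();
    return 0;
}
```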
3. Optimization of Message Passing Services on POWER8 InfiniBand Clusters
- Author
Amith R. Mamidala, K A Nysal Jan, Sameer Kumar, Robert S. Blackmore, Sameh S. Sharkawi, and T. J. Chris Ward
- Subjects
MPICH, Interface (Java), Computer science, Node (networking), Message passing, POWER8, InfiniBand, Throughput, Parallel computing, Scalability, Operating system
- Abstract
We present scalability and performance enhancements to MPI libraries on POWER8 InfiniBand clusters. We explore optimizations in the Parallel Active Messaging Interface (PAMI) libraries. We bypass InfiniBand verbs via low-level inline calls, resulting in low latencies and high message rates. MPI is enabled on POWER8 by extending both MPICH and Open MPI to call the PAMI libraries. The IBM POWER8 nodes have GPU accelerators to maximize the floating-point throughput of the node. We explore optimized algorithms for GPU-to-GPU communication with minimal processor involvement. We achieve a peak MPI message rate of 186 million messages per second (a message-rate sketch follows this entry). We also present scalable performance in the QBOX and AMG applications.
- Published
- 2016
- Full Text
- View/download PDF
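The peak figure of 186 million messages per second in the entry above is a message-rate measurement. The sketch below is a generic two-rank MPI message-rate micro-benchmark, not the PAMI inline-call path described in the paper; the window size, message size, and iteration count are arbitrary choices for illustration, and real peak-rate runs use many communicating pairs spread across nodes.

```c
/* Hypothetical two-rank message-rate sketch: rank 0 streams windows of
 * small non-blocking sends to rank 1, and rank 0 reports messages/s. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define WINDOW 64
#define ITERS  10000
#define MSGSZ  8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "need at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char buf[WINDOW][MSGSZ];
    MPI_Request req[WINDOW];
    memset(buf, 0, sizeof buf);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf[w], MSGSZ, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &req[w]);
            else if (rank == 1)
                MPI_Irecv(buf[w], MSGSZ, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &req[w]);
        }
        if (rank <= 1)
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("message rate: %.2f M msgs/s\n",
               (double)ITERS * WINDOW / (t1 - t0) / 1e6);

    MPI_Finalize();
    return 0;
}
```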
4. Optimization and Analysis of MPI Collective Communication on Fat-Tree Networks
- Author
K A Nysal Jan, Sameer Kumar, and Sameh S. Sharkawi
- Subjects
Binary tree, Computer science, Distributed computing, POWER8, InfiniBand, Throughput, Parallel computing, Tree traversal, x86, Fat tree, Cluster analysis
- Abstract
We explore new collective algorithms to optimize MPI_Bcast, MPI_Reduce, and MPI_Allreduce on InfiniBand clusters. Our algorithms are specifically designed for fat-tree networks. We present multi-color k-ary trees with a novel scheme for mapping the colors to fat-tree network nodes (see the k-ary tree sketch after this entry). Our multi-color tree algorithms result in better utilization of network links than traditional algorithms on fat-tree networks. We also present optimizations for clusters of SMP nodes, exploring both hybrid and multi-leader SMP techniques to achieve the best performance. We show the benefits of our algorithms with performance results from micro-benchmarks on POWER8 and x86 InfiniBand clusters. We also show performance improvements from our algorithms in the PARATEC and QBOX applications.
- Published
- 2016
- Full Text
- View/download PDF
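The multi-color k-ary tree algorithms in the entry above give each rank a parent and up to k children per color and then map the colors onto disjoint fat-tree links. That mapping scheme is the paper's contribution and is not reproduced here; the sketch below shows only the generic parent/child arithmetic for a single k-ary tree rooted at rank 0, which such algorithms build on.

```c
/* Generic k-ary broadcast-tree sketch (single color, root = 0).
 * The multi-color mapping onto fat-tree links is not reproduced here;
 * this only shows the basic parent/child rank arithmetic. */
#include <stdio.h>

/* Parent of `rank` in a k-ary tree rooted at 0; -1 for the root. */
static int kary_parent(int rank, int k)
{
    return rank == 0 ? -1 : (rank - 1) / k;
}

/* Fill `children` with the ranks of `rank`'s children that exist
 * among `size` ranks; return how many were written. */
static int kary_children(int rank, int k, int size, int *children)
{
    int n = 0;
    for (int c = 0; c < k; c++) {
        int child = rank * k + c + 1;
        if (child < size)
            children[n++] = child;
    }
    return n;
}

int main(void)
{
    const int k = 3, size = 13;
    int children[8];

    for (int rank = 0; rank < size; rank++) {
        int n = kary_children(rank, k, size, children);
        printf("rank %2d: parent %2d, children:", rank, kary_parent(rank, k));
        for (int i = 0; i < n; i++)
            printf(" %d", children[i]);
        printf("\n");
    }
    return 0;
}
```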
5. SWAPP: A Framework for Performance Projections of HPC Applications Using Benchmarks
- Author
Valerie Taylor, Stephen Stevens, Xingfu Wu, Don DeSota, Raj Panda, and Sameh S. Sharkawi
- Subjects
HPC Challenge Benchmark, Tree (data structure), POWER5, Computer science, POWER6, Benchmark (computing), Operating system, InfiniBand, Performance prediction, IBM, Supercomputer
- Abstract
Surrogate-based Workload Application Performance Projection (SWAPP) is a framework for performance projections of High Performance Computing (HPC) applications using benchmark data. Performance projections of HPC applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement. SWAPP assumes that one has access to a base system but only benchmark data for a target system; the target system is not available for running the HPC application. Projections are developed using the performance profiles of the benchmarks and the application on the base system together with the benchmark data for the target system. SWAPP projects the performance of the compute and communication components separately and then combines the two projections to obtain the full application projection (see the sketch after this entry). In this paper, SWAPP was used to project the performance of three NAS Multi-Zone benchmarks from a base system, an IBM POWER5+ 575 cluster, onto three target systems: an IBM POWER6 575 cluster and an IBM Intel Westmere x5670 cluster, both using an InfiniBand interconnect, and an IBM Blue Gene/P with 3D torus and collective tree interconnects. The projected performance of the three benchmarks was within an 11.44% average error magnitude with a standard deviation of 2.64% across the three systems.
- Published
- 2012
- Full Text
- View/download PDF
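SWAPP's central step in the entry above is to project the compute and communication components separately and then combine them. The toy function below illustrates only that combine step, under the simplifying (hypothetical) assumption that each component scales by the ratio of target to base benchmark times; SWAPP's actual surrogate models are not described in the abstract and are not reproduced here.

```c
/* Toy illustration of combining separate compute and communication
 * projections, in the spirit of SWAPP's final step. The linear-ratio
 * scaling used here is a simplifying assumption, not SWAPP's model. */
#include <stdio.h>

/* Project application time on the target system from its measured
 * compute and communication times on the base system and the
 * benchmark times observed on both systems. */
static double project_runtime(double base_compute, double base_comm,
                              double bench_compute_base,
                              double bench_compute_target,
                              double bench_comm_base,
                              double bench_comm_target)
{
    double compute_proj = base_compute * (bench_compute_target / bench_compute_base);
    double comm_proj    = base_comm    * (bench_comm_target    / bench_comm_base);
    return compute_proj + comm_proj;   /* combine the two projections */
}

int main(void)
{
    /* Illustrative numbers only (seconds). */
    double projected = project_runtime(120.0, 30.0,   /* app on base system */
                                       10.0,  7.5,    /* compute benchmark  */
                                       4.0,   3.0);   /* comm benchmark     */
    printf("projected target runtime: %.1f s\n", projected);
    return 0;
}
```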
6. Performance Analysis and Optimization of Parallel Scientific Applications on CMP Cluster Systems
- Author
Xingfu Wu, Valerie Taylor, Charles Lively, and Sameh S. Sharkawi
- Subjects
Loop unrolling, Parallel processing (DSP implementation), Computer science, Loop fusion, Loop nest optimization, Node (circuits), Parallel computing, Supercomputer, Cluster analysis
- Abstract
Chip multiprocessors (CMPs) are widely used for high-performance computing. Further, these CMPs are being configured in a hierarchical manner to compose a node in a cluster system. A major challenge to be addressed is the efficient use of such cluster systems for large-scale scientific applications. In this paper, we quantify the performance gap resulting from using different numbers of processors per node; this information provides a baseline for the amount of optimization needed when using all processors per node on CMP clusters. We conduct detailed performance analysis to identify how applications can be modified to efficiently utilize all processors per node on CMP clusters, focusing on two scientific applications: the gyrokinetic toroidal code (GTC), a 3D particle-in-cell magnetic fusion application, and a lattice Boltzmann method (LBM) code for simulating fluid dynamics. In terms of refinements, we use conventional techniques such as cache blocking, loop unrolling, and loop fusion, and we develop hybrid methods for optimizing MPI_Allreduce and MPI_Reduce (see the sketch after this entry). Using these optimizations, the application performance when utilizing all processors per node was improved by up to 18.97% for GTC and 15.77% for LBM on up to 2,048 total processors on the CMP clusters.
- Published
- 2008
- Full Text
- View/download PDF
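The hybrid MPI_Allreduce/MPI_Reduce methods mentioned in the entry above reduce data within each node before communicating across nodes, so that only one process per node touches the network. The abstract does not give the exact scheme; the sketch below uses MPI-3 shared-memory communicator splitting as one plausible way to structure such a hierarchical allreduce, not the paper's implementation.

```c
/* Hypothetical hierarchical (hybrid) allreduce sketch: reduce within
 * each node, allreduce the node leaders across nodes, then broadcast
 * the result inside each node. Not the paper's exact implementation. */
#include <mpi.h>
#include <stdio.h>

static void hybrid_allreduce_sum(double *val, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    /* Processes sharing a node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader (node_rank == 0) per node. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    double node_sum = 0.0;
    MPI_Reduce(val, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    if (node_rank == 0)
        MPI_Allreduce(MPI_IN_PLACE, &node_sum, 1, MPI_DOUBLE, MPI_SUM,
                      leader_comm);

    MPI_Bcast(&node_sum, 1, MPI_DOUBLE, 0, node_comm);
    *val = node_sum;

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double v = (double)rank;
    hybrid_allreduce_sum(&v, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f\n", v);

    MPI_Finalize();
    return 0;
}
```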