6 results for "Sameh S. Sharkawi"
Search Results
2. The high-speed networks of the Summit and Sierra supercomputers
- Author
Michael Kagan, Sameh S. Sharkawi, Bryan S. Rosenburg, Richard L. Graham, George Chochia, Craig B. Stunkel, and Gilad Shainer
- Subjects
Remote direct memory access, General Computer Science, Computer science, Quality of service, Message Passing Interface, InfiniBand, Network topology, Supercomputer, Scalability, Operating system, IBM
- Abstract
Oak Ridge National Laboratory's Summit supercomputer and Lawrence Livermore National Laboratory's Sierra supercomputer use an InfiniBand interconnect in a Fat-tree network topology, interconnecting all compute, storage, administration, and management nodes into one linearly scalable network. These networks are based on Mellanox 100-Gb/s EDR InfiniBand ConnectX-5 adapters and Switch-IB2 switches, with compute-rack packaging and cooling contributions from IBM. These devices support in-network computing acceleration engines such as the Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), graphics processing unit (GPU) Direct RDMA, advanced adaptive routing, Quality of Service, and other network and application acceleration features (see the sketch after this entry). The overall IBM Spectrum Message Passing Interface (MPI) messaging software stack is built on Open MPI and was a collaboration among IBM, Mellanox, and NVIDIA to optimize direct communication between endpoints, whether compute nodes (with IBM POWER CPUs, NVIDIA GPUs, and flash memory devices) or POWER-hosted storage nodes. The Fat-tree network can isolate traffic among the compute partitions and to/from the storage subsystem, providing more predictable application performance. In addition, the high level of redundancy of this network and its reconfiguration capability ensure reliable high performance even after network component failures. This article details the hardware and software architecture and performance of the networks and describes a number of the high-performance computing (HPC) enhancements engineered into this generation of InfiniBand.
- Published
- 2020
- Full Text
- View/download PDF
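The in-network reduction engine mentioned in the entry above (SHARP) targets MPI collectives such as MPI_Allreduce. As a rough, hypothetical illustration of the kind of operation such engines accelerate, the following C/MPI micro-benchmark times a small MPI_Allreduce; whether the reduction is actually offloaded to the switch fabric depends on the MPI library and fabric configuration, which this sketch does not control.

```c
/* Minimal MPI_Allreduce timing sketch (hypothetical micro-benchmark).
 * On a SHARP-enabled fabric, small allreduces like this one may be
 * offloaded to the switches; nothing in this code enforces that. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = (double)rank, out = 0.0;
    const int iters = 1000;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg MPI_Allreduce latency: %.3f us\n",
               1e6 * (t1 - t0) / iters);

    MPI_Finalize();
    return 0;
}
```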
3. Optimization of Message Passing Services on POWER8 InfiniBand Clusters
- Author
Amith R. Mamidala, K A Nysal Jan, Sameer Kumar, Robert S. Blackmore, Sameh S. Sharkawi, and T. J. Chris Ward
- Subjects
MPICH, Interface (Java), Computer science, Node (networking), Message passing, POWER8, InfiniBand, Throughput, Parallel computing, Scalability, Operating system
- Abstract
We present scalability and performance enhancements to MPI libraries on POWER8 InfiniBand clusters. We explore optimizations in the Parallel Active Messaging Interface (PAMI) libraries. We bypass InfiniBand verbs via low-level inline calls, resulting in low latencies and high message rates. MPI is enabled on POWER8 by extending both MPICH and Open MPI to call the PAMI libraries. The IBM POWER8 nodes have GPU accelerators to maximize the floating-point throughput of the node. We explore optimized algorithms for GPU-to-GPU communication with minimal processor involvement. We achieve a peak MPI message rate of 186 million messages per second (a message-rate sketch follows this entry). We also present scalable performance in the QBOX and AMG applications.
- Published
- 2016
- Full Text
- View/download PDF
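The peak figure of 186 million messages per second in the entry above is a message-rate measurement. The sketch below is a generic two-rank MPI message-rate micro-benchmark, not the PAMI inline-call path described in the paper; the window size, message size, and iteration count are arbitrary choices for illustration, and real peak-rate runs use many communicating pairs spread across nodes.

```c
/* Hypothetical two-rank message-rate sketch: rank 0 streams windows of
 * small non-blocking sends to rank 1, and rank 0 reports messages/s. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define WINDOW 64
#define ITERS  10000
#define MSGSZ  8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "need at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char buf[WINDOW][MSGSZ];
    MPI_Request req[WINDOW];
    memset(buf, 0, sizeof buf);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf[w], MSGSZ, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &req[w]);
            else if (rank == 1)
                MPI_Irecv(buf[w], MSGSZ, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &req[w]);
        }
        if (rank <= 1)
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("message rate: %.2f M msgs/s\n",
               (double)ITERS * WINDOW / (t1 - t0) / 1e6);

    MPI_Finalize();
    return 0;
}
```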
4. Optimization and Analysis of MPI Collective Communication on Fat-Tree Networks
- Author
K A Nysal Jan, Sameer Kumar, and Sameh S. Sharkawi
- Subjects
Binary tree, Computer science, Distributed computing, POWER8, InfiniBand, Throughput, Parallel computing, Tree traversal, x86, Fat tree, Cluster analysis
- Abstract
We explore new collective algorithms to optimize MPI_Bcast, MPI_Reduce, and MPI_Allreduce on InfiniBand clusters. Our algorithms are specifically designed for fat-tree networks. We present multi-color k-ary trees with a novel scheme for mapping the colors to fat-tree network nodes (see the k-ary tree sketch after this entry). Our multi-color tree algorithms result in better utilization of network links than traditional algorithms on fat-tree networks. We also present optimizations for clusters of SMP nodes, exploring both hybrid and multi-leader SMP techniques to achieve the best performance. We show the benefits of our algorithms with performance results from micro-benchmarks on POWER8 and x86 InfiniBand clusters. We also show performance improvements from our algorithms in the PARATEC and QBOX applications.
- Published
- 2016
- Full Text
- View/download PDF
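The multi-color k-ary tree algorithms in the entry above give each rank a parent and up to k children per color and then map the colors onto disjoint fat-tree links. That mapping scheme is the paper's contribution and is not reproduced here; the sketch below shows only the generic parent/child arithmetic for a single k-ary tree rooted at rank 0, which such algorithms build on.

```c
/* Generic k-ary broadcast-tree sketch (single color, root = 0).
 * The multi-color mapping onto fat-tree links is not reproduced here;
 * this only shows the basic parent/child rank arithmetic. */
#include <stdio.h>

/* Parent of `rank` in a k-ary tree rooted at 0; -1 for the root. */
static int kary_parent(int rank, int k)
{
    return rank == 0 ? -1 : (rank - 1) / k;
}

/* Fill `children` with the ranks of `rank`'s children that exist
 * among `size` ranks; return how many were written. */
static int kary_children(int rank, int k, int size, int *children)
{
    int n = 0;
    for (int c = 0; c < k; c++) {
        int child = rank * k + c + 1;
        if (child < size)
            children[n++] = child;
    }
    return n;
}

int main(void)
{
    const int k = 3, size = 13;
    int children[8];

    for (int rank = 0; rank < size; rank++) {
        int n = kary_children(rank, k, size, children);
        printf("rank %2d: parent %2d, children:", rank, kary_parent(rank, k));
        for (int i = 0; i < n; i++)
            printf(" %d", children[i]);
        printf("\n");
    }
    return 0;
}
```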
5. SWAPP: A Framework for Performance Projections of HPC Applications Using Benchmarks
- Author
Valerie Taylor, Stephen Stevens, Xingfu Wu, Don DeSota, Raj Panda, and Sameh S. Sharkawi
- Subjects
HPC Challenge Benchmark, Tree (data structure), POWER5, Computer science, POWER6, Benchmark (computing), Operating system, InfiniBand, Performance prediction, IBM, Supercomputer
- Abstract
Surrogate-based Workload Application Performance Projection (SWAPP) is a framework for performance projections of High Performance Computing (HPC) applications using benchmark data. Performance projections of HPC applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement. SWAPP assumes that one has access to a base system but only benchmark data for a target system; the target system is not available for running the HPC application. Projections are developed using the performance profiles of the benchmarks and the application on the base system together with the benchmark data for the target system. SWAPP projects the performance of the compute and communication components separately and then combines the two projections to obtain the full application projection (see the sketch after this entry). In this paper, SWAPP was used to project the performance of three NAS Multi-Zone benchmarks from a base system, an IBM POWER5+ 575 cluster, onto three target systems: an IBM POWER6 575 cluster and an IBM Intel Westmere x5670 cluster, both using an InfiniBand interconnect, and an IBM Blue Gene/P with 3D torus and collective tree interconnects. The projected performance of the three benchmarks was within an 11.44% average error magnitude with a standard deviation of 2.64% across the three systems.
- Published
- 2012
- Full Text
- View/download PDF
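SWAPP's central step in the entry above is to project the compute and communication components separately and then combine them. The toy function below illustrates only that combine step, under the simplifying (hypothetical) assumption that each component scales by the ratio of target to base benchmark times; SWAPP's actual surrogate models are not described in the abstract and are not reproduced here.

```c
/* Toy illustration of combining separate compute and communication
 * projections, in the spirit of SWAPP's final step. The linear-ratio
 * scaling used here is a simplifying assumption, not SWAPP's model. */
#include <stdio.h>

/* Project application time on the target system from its measured
 * compute and communication times on the base system and the
 * benchmark times observed on both systems. */
static double project_runtime(double base_compute, double base_comm,
                              double bench_compute_base,
                              double bench_compute_target,
                              double bench_comm_base,
                              double bench_comm_target)
{
    double compute_proj = base_compute * (bench_compute_target / bench_compute_base);
    double comm_proj    = base_comm    * (bench_comm_target    / bench_comm_base);
    return compute_proj + comm_proj;   /* combine the two projections */
}

int main(void)
{
    /* Illustrative numbers only (seconds). */
    double projected = project_runtime(120.0, 30.0,   /* app on base system */
                                       10.0,  7.5,    /* compute benchmark  */
                                       4.0,   3.0);   /* comm benchmark     */
    printf("projected target runtime: %.1f s\n", projected);
    return 0;
}
```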
6. Performance Analysis and Optimization of Parallel Scientific Applications on CMP Cluster Systems
- Author
Xingfu Wu, Valerie Taylor, Charles Lively, and Sameh S. Sharkawi
- Subjects
Loop unrolling, Parallel processing (DSP implementation), Computer science, Loop fusion, Loop nest optimization, Node (circuits), Parallel computing, Supercomputer, Cluster analysis
- Abstract
Chip multiprocessors (CMPs) are widely used for high-performance computing. Further, these CMPs are being configured in a hierarchical manner to compose a node in a cluster system. A major challenge to be addressed is the efficient use of such cluster systems for large-scale scientific applications. In this paper, we quantify the performance gap resulting from using different numbers of processors per node; this information provides a baseline for the amount of optimization needed when using all processors per node on CMP clusters. We conduct detailed performance analysis to identify how applications can be modified to efficiently utilize all processors per node on CMP clusters, focusing on two scientific applications: the gyrokinetic toroidal code (GTC), a 3D particle-in-cell magnetic fusion application, and a lattice Boltzmann method (LBM) code for simulating fluid dynamics. In terms of refinements, we use conventional techniques such as cache blocking, loop unrolling, and loop fusion, and we develop hybrid methods for optimizing MPI_Allreduce and MPI_Reduce (see the sketch after this entry). Using these optimizations, the application performance when utilizing all processors per node was improved by up to 18.97% for GTC and 15.77% for LBM on up to 2,048 total processors on the CMP clusters.
- Published
- 2008
- Full Text
- View/download PDF
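The hybrid MPI_Allreduce/MPI_Reduce methods mentioned in the entry above reduce data within each node before communicating across nodes, so that only one process per node touches the network. The abstract does not give the exact scheme; the sketch below uses MPI-3 shared-memory communicator splitting as one plausible way to structure such a hierarchical allreduce, not the paper's implementation.

```c
/* Hypothetical hierarchical (hybrid) allreduce sketch: reduce within
 * each node, allreduce the node leaders across nodes, then broadcast
 * the result inside each node. Not the paper's exact implementation. */
#include <mpi.h>
#include <stdio.h>

static void hybrid_allreduce_sum(double *val, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    /* Processes sharing a node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader (node_rank == 0) per node. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    double node_sum = 0.0;
    MPI_Reduce(val, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    if (node_rank == 0)
        MPI_Allreduce(MPI_IN_PLACE, &node_sum, 1, MPI_DOUBLE, MPI_SUM,
                      leader_comm);

    MPI_Bcast(&node_sum, 1, MPI_DOUBLE, 0, node_comm);
    *val = node_sum;

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double v = (double)rank;
    hybrid_allreduce_sum(&v, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f\n", v);

    MPI_Finalize();
    return 0;
}
```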