The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems

Authors :
Bill Hanson
Chris Marroquin
Martin Ohmacht
Sarp Oral
Tom Gooding
Feiyi Wang
Adam Moody
Mallikarjun Shankar
Junqi Yin
Ben Casses
Gene Davison
Sudharshan S. Vazhkudai
David Appelhans
Arthur S. Bland
Ian Karlin
Verónica G. Vergara Larrea
Al Geist
James H. Rogers
Py C. Watson
Chris Chambreau
Bronis R. de Supinski
Robert S. Blackmore
Fernando Pizzano
Matthew L. Leininger
Elsa Gonsiorowski
J. Kahle
Lance D. Weems
Drew Schmidt
Bryan S. Rosenburg
Leopold Grinberg
Scott Atchley
Bob Walkup
Ramesh Pankajakshan
Wayne Joubert
Don Maxwell
James C. Sexton
Dustin Leverman
Adam Bertsch
Matthew A. Ezell
Bill Hartner
Christopher Zimmer
George Chochia
Robin Goldstone
Source :
SC
Publication Year :
2018
Publisher :
IEEE, 2018.

Abstract

CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the systems. Our evaluation of the systems highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; however, for some applications, the CPU-GPU bandwidth is more important than the number of GPUs. The node-local burst buffer scales linearly and can achieve a 4X improvement over the parallel file system (PFS) for large jobs; smaller jobs, however, may benefit from writing directly to the PFS. Finally, several CPU-, network-, and memory-bound analytics codes and GPU-bound deep learning codes achieve up to an 11X and 79X speedup per node, respectively, over Titan.

Details

Database :
OpenAIRE
Journal :
SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
Accession number :
edsair.doi...........fbbd635a6c9da8b8582568aa55304b22
Full Text :
https://doi.org/10.1109/sc.2018.00055