1. Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers
- Author
- David Brayford and Sofia Vallecorsa
- Subjects
Scale (ratio), Software deployment, Computer science, Distributed computing, Production (economics), Supercomputer, Energy (signal processing), Domain (software engineering), Production system
- Abstract
There is an ever-increasing need for computational power to train complex artificial intelligence (AI) and machine learning (ML) models to tackle large scientific problems. High performance computing (HPC) resources are required to efficiently compute and scale complex models across tens of thousands of compute nodes. In this paper, we discuss the issues associated with the deployment of machine learning frameworks on large scale secure HPC systems and how we successfully deployed a standard machine learning framework on a secure large scale HPC production system, to train a complex three-dimensional convolutional GAN (3DGAN), with petaflop performance. 3DGAN is an example from the high energy physics domain, designed to simulate the energy pattern produced by showers of secondary particles inside a particle detector on various HPC systems.
- Published
- 2020