1. Computationally Efficient Neural Rendering for Generator Adversarial Networks Using a Multi-GPU Cluster in a Cloud Environment
- Authors
Aswathy Ravikumar and Harini Sriraman
- Subjects
All reduce, bottleneck, data parallel, fault tolerance, generative adversarial network, GPU, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Generative Adversarial Networks (GANs) have recently become a viable option for image reconstruction because of the impressive quality of the images they produce. The main obstacle to employing GANs is their computational cost. Researchers have developed techniques for distributing GANs across multiple nodes, but these techniques typically do not scale: they frequently separate the components (discriminator and generator), leading to high communication overhead, or they run into distribution problems unique to GAN training. In this study, the GAN training procedure is parallelized and carried out over multiple Graphics Processing Units (GPUs). TensorFlow’s built-in distribution logic and a custom training loop were adapted for finer control over the resources allotted to each GPU worker. The study combines GPU image-processing improvements with multi-GPU learning. The GAN model is accelerated using distributed TensorFlow with synchronous data-parallel training on a single system with several GPUs. Acceleration was carried out on the Genesis Cloud Platform with NVIDIA® GeForce™ GTX 1080 GPU accelerators. Multi-GPU acceleration achieved speed-ups of 1.322 with two GPUs, 1.688 with three GPUs, and 1.7792 with four GPUs. The data-initialization and image-production bottlenecks of the parameter-server model are removed, but the resulting speed-up is not linear; increasing the number of GPUs and removing the connectivity constraint would accelerate training further. The bottlenecks are detected using additional network links and resources, and solutions are suggested. Recomputation and quantization are the two techniques used to reduce the memory footprint of GPU-accelerated training. Deployment and versioning are essential for successfully operating multi-node GAN models in MLflow: done properly, they improve scalability, reproducibility, and collaboration across teams working on the same model. MLflow provides built-in tools for versioning and tracking model performance, making it easier to manage multiple versions of a model and reproduce it in different environments.
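The paper itself does not include code, but the synchronous data-parallel setup the abstract describes maps naturally onto TensorFlow's tf.distribute.MirroredStrategy with a custom training loop. The sketch below is a minimal illustration under that assumption: the tiny dense generator and discriminator, the latent size of 128, and the per-GPU batch size of 64 are placeholders, not the paper's actual models or hyperparameters.

```python
import tensorflow as tf

# Minimal sketch of synchronous data-parallel GAN training on one machine.
# The dense generator/discriminator are placeholders, not the paper's models.
strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
GLOBAL_BATCH = 64 * strategy.num_replicas_in_sync  # assumed 64 images per GPU
NOISE_DIM = 128  # assumed latent size

with strategy.scope():
    # Variables created inside scope() are mirrored across all replicas.
    generator = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(NOISE_DIM,)),
        tf.keras.layers.Dense(28 * 28, activation="tanh"),
    ])
    discriminator = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(28 * 28,)),
        tf.keras.layers.Dense(1),
    ])
    g_opt = tf.keras.optimizers.Adam(1e-4)
    d_opt = tf.keras.optimizers.Adam(1e-4)
    bce = tf.keras.losses.BinaryCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

def train_step(real_images):
    noise = tf.random.normal([tf.shape(real_images)[0], NOISE_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # Per-example losses scaled by the *global* batch size, so the
        # gradient all-reduce across replicas sums to the true mean.
        d_loss = tf.nn.compute_average_loss(
            bce(tf.ones_like(real_logits), real_logits)
            + bce(tf.zeros_like(fake_logits), fake_logits),
            global_batch_size=GLOBAL_BATCH)
        g_loss = tf.nn.compute_average_loss(
            bce(tf.ones_like(fake_logits), fake_logits),
            global_batch_size=GLOBAL_BATCH)
    d_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
    g_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
    return g_loss, d_loss

@tf.function
def distributed_step(batch):
    # strategy.run executes train_step once per replica; gradients are
    # all-reduced automatically when apply_gradients is called.
    g_loss, d_loss = strategy.run(train_step, args=(batch,))
    return (strategy.reduce(tf.distribute.ReduceOp.SUM, g_loss, axis=None),
            strategy.reduce(tf.distribute.ReduceOp.SUM, d_loss, axis=None))
```

MirroredStrategy performs an all-reduce of the gradients on every step, which is what keeps the replicas synchronous without a separate parameter server, consistent with the abstract's move away from the parameter-server model.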
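The MLflow tracking and versioning workflow mentioned at the end of the abstract can likewise be sketched. The snippet below continues the example above (reusing strategy, GLOBAL_BATCH, generator, and distributed_step); the experiment name, registry name, random stand-in dataset, and epoch count are illustrative assumptions, and MLflow 2.x is assumed for mlflow.tensorflow.log_model.

```python
import mlflow
import mlflow.tensorflow
import tensorflow as tf

# Continues the sketch above. Names and the random stand-in dataset are
# illustrative assumptions, not taken from the paper.
images = tf.random.normal([1024, 28 * 28])  # stand-in for real training data
dist_dataset = strategy.experimental_distribute_dataset(
    tf.data.Dataset.from_tensor_slices(images)
    .batch(GLOBAL_BATCH, drop_remainder=True))

mlflow.set_experiment("multi-gpu-gan")
with mlflow.start_run():
    mlflow.log_param("num_gpus", strategy.num_replicas_in_sync)
    mlflow.log_param("global_batch_size", GLOBAL_BATCH)
    for epoch in range(5):
        for batch in dist_dataset:
            g_loss, d_loss = distributed_step(batch)
        mlflow.log_metric("g_loss", float(g_loss), step=epoch)
        mlflow.log_metric("d_loss", float(d_loss), step=epoch)
    # Registering the model on each run creates a new version in the MLflow
    # Model Registry, which is what enables the reproducibility and
    # versioning workflow the abstract describes.
    mlflow.tensorflow.log_model(generator, artifact_path="generator",
                                registered_model_name="gan-generator")
```

Logging the GPU count and global batch size as run parameters makes it possible to compare the reported two-, three-, and four-GPU speed-ups across tracked runs, and the registry versions allow any of those runs to be reproduced in a different environment.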
- Published
- 2023