235 results for '"MPI"'
Search Results
2. Exploring Hierarchical MPI Reduction Collective Algorithms Targeted to Multicore Node Clusters
- Author
-
Utrera, Gladys, Gil, Marisa, Martorell, Xavier, Spataro, William, Giordano, Andrea, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Sergeyev, Yaroslav D., editor, Kvasov, Dmitri E., editor, and Astorino, Annabella, editor
- Published
- 2025
- Full Text
- View/download PDF
3. Radical-Cylon: A Heterogeneous Data Pipeline for Scientific Computing
- Author
-
Sarker, Arup Kumar, Alsaadi, Aymen, Perera, Niranda, Staylor, Mills, von Laszewski, Gregor, Turilli, Matteo, Kilic, Ozgur Ozan, Titov, Mikhail, Merzky, Andre, Jha, Shantenu, Fox, Geoffrey, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Klusáček, Dalibor, editor, Corbalán, Julita, editor, and Rodrigo, Gonzalo P., editor
- Published
- 2025
- Full Text
- View/download PDF
4. ExaNBody: A HPC Framework for N-Body Applications
- Author
-
Carrard, Thierry, Prat, Raphaël, Latu, Guillaume, Babilotte, Killian, Lafourcade, Paul, Amarsid, Lhassan, Soulard, Laurent, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Zeinalipour, Demetris, editor, Blanco Heras, Dora, editor, Pallis, George, editor, Herodotou, Herodotos, editor, Trihinas, Demetris, editor, Balouek, Daniel, editor, Diehl, Patrick, editor, Cojean, Terry, editor, Fürlinger, Karl, editor, Kirkeby, Maja Hanne, editor, Nardelli, Matteo, editor, and Di Sanzo, Pierangelo, editor
- Published
- 2024
- Full Text
- View/download PDF
5. PeriLab — Peridynamic Laboratory
- Author
-
Christian Willberg, Jan-Timo Hesse, and Anna Pernatii
- Subjects
Peridynamics, Fracture, HPC, Computational science, MPI, Julia, Computer software, QA76.75-76.765
- Abstract
This paper introduces PeriLab, a modern Peridynamics solver developed in the Julia programming language. Emphasizing easy installation, usability, and the implementation of new features, the code’s structure is detailed, accompanied by examples illustrating some of its core functionality. The fully Message Passing Interface (MPI) parallelized code is evaluated on a separate benchmark problem with two million degrees of freedom, demonstrating large-scale capability and analyzing the communication cost incurred in such an analysis. The paper highlights key considerations for the adoption of Peridynamics, including the need for a straightforward installation process, user-friendly interfaces, efficient research code development, and well-documented as well as tested functionality. Overcoming these challenges is crucial for Peridynamics’ widespread acceptance in engineering applications, and PeriLab serves as a valuable contribution to addressing these issues.
- Published
- 2024
- Full Text
- View/download PDF
6. Hyper Burst Buffer: A Lightweight Burst Buffer I/O Library for High Performance Computing Applications
- Author
-
Fredj, Erick, Laufer, Michael, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, and Arai, Kohei, editor
- Published
- 2023
- Full Text
- View/download PDF
7. Two different parallel approaches for a hybrid fractional order Coronavirus model
- Author
-
N.H. Sweilam, S. Ahmed, and Monika Heiner
- Subjects
HPC, Parallel computing, MPI, GPU CUDA, Julia, Coronavirus mathematical model, Electronic computers. Computer science, QA75.5-76.95
- Abstract
In this paper, two different parallel approaches for a hybrid fractional order Coronavirus (2019-nCov) mathematical model are presented. Both approaches are implemented in the Julia high-level language. Parallel implementations are developed for an HPC cluster using Message Passing Interface (MPI) technology and for general-purpose computing on GPUs (GPGPU) using the Compute Unified Device Architecture (CUDA), depending on the hardware environment. The implementations are used to solve the real-world problem of the hybrid fractional order Coronavirus (2019-nCov) mathematical model and to study parallel efficiency. The introduced hybrid fractional order derivative is defined as a linear combination of the Riemann-Liouville integral and the Caputo fractional order derivative. A parallel algorithm is designed based on the predictor-corrector method with the discretization of the Caputo proportional constant fractional hybrid operator for solving the model problem numerically. Simulation results show that both new parallel approaches achieve significant efficiency. (A brief code sketch follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
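The entry above describes MPI and CUDA parallelizations of a fractional-order predictor-corrector scheme. As a rough, generic illustration only (not the paper's algorithm, and in C rather than the paper's Julia), the sketch below partitions the history (memory) sum that fractional-order schemes accumulate at each time step across MPI ranks and combines the partial sums with MPI_Allreduce; the weight and state arrays are placeholders.

```c
/* Minimal sketch (not the paper's algorithm): distributing the history sum of a
 * fractional-order predictor-corrector step across MPI ranks.
 * The weight and state arrays are placeholders for the scheme's actual terms. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 100000;                    /* number of past time steps in the memory term */
    double *w = malloc(n * sizeof(double));  /* convolution weights (placeholder values) */
    double *y = malloc(n * sizeof(double));  /* stored past states (placeholder values)  */
    for (int j = 0; j < n; ++j) { w[j] = 1.0 / (j + 1.0); y[j] = 0.5; }

    /* Each rank sums a contiguous block of the history term. */
    int chunk = (n + size - 1) / size;
    int lo = rank * chunk;
    int hi = (lo + chunk < n) ? lo + chunk : n;

    double local = 0.0;
    for (int j = lo; j < hi; ++j)
        local += w[j] * y[j];

    double memory_term = 0.0;                /* full history sum, available on every rank */
    MPI_Allreduce(&local, &memory_term, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("memory term = %f\n", memory_term);

    free(w); free(y);
    MPI_Finalize();
    return 0;
}
```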
8. An introspection monitoring library to improve MPI communication time.
- Author
-
Jeannot, Emmanuel and Sartori, Richard
- Subjects
INTROSPECTION, LIBRARIES
- Abstract
In this paper, we describe how to improve the communication time of MPI parallel applications by using a library that monitors MPI applications and allows for introspection (the program itself can query the state of the monitoring system). Based on previous work, this library is able to see how collective communications are decomposed into point-to-point messages. It also features monitoring sessions that allow suspending and restarting the monitoring, limiting it to specific portions of the code. Experiments show that the monitoring overhead is very small and that the proposed features allow for dynamic and efficient rank reordering, enabling up to a 2x reduction in the communication time of some programs. (A brief code sketch follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
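The library described above builds on MPI's standard profiling (PMPI) layer. The sketch below is not the authors' code; it only illustrates the general interception idea with a hypothetical per-peer byte counter (bytes_to_peer and monitoring_bytes_sent_to are made-up names) that an application could query introspectively at run time. Such a wrapper is typically compiled into a library and linked (or preloaded) ahead of the MPI library.

```c
/* Illustrative PMPI-style interception (not the library described above):
 * count bytes sent to each peer so the application can query them at run time. */
#include <mpi.h>

#define MAX_PEERS 4096
static long long bytes_to_peer[MAX_PEERS];   /* hypothetical per-peer counters */

/* Intercept MPI_Send: record the message size, then call the real implementation. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    PMPI_Type_size(datatype, &type_size);
    if (dest >= 0 && dest < MAX_PEERS)
        bytes_to_peer[dest] += (long long)count * type_size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* Introspection hook the application can call to inspect the monitoring state. */
long long monitoring_bytes_sent_to(int peer)
{
    return (peer >= 0 && peer < MAX_PEERS) ? bytes_to_peer[peer] : 0;
}
```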
9. Root Causing MPI Workloads Imbalance Issues via Scalable MPI Critical Path Analysis
- Author
-
Shatalin, Artem, Slobodskoy, Vitaly, Fatin, Maksim, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Voevodin, Vladimir, editor, Sobolev, Sergey, editor, Yakobovskiy, Mikhail, editor, and Shagaliev, Rashit, editor
- Published
- 2022
- Full Text
- View/download PDF
10. Interferences Between Communications and Computations in Distributed HPC Systems
- Author
-
Swartvagher, Philippe, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Chaves, Ricardo, editor, B. Heras, Dora, editor, Ilic, Aleksandar, editor, Unat, Didem, editor, Badia, Rosa M., editor, Bracciali, Andrea, editor, Diehl, Patrick, editor, Dubey, Anshu, editor, Sangyoon, Oh, editor, L. Scott, Stephen, editor, and Ricci, Laura, editor
- Published
- 2022
- Full Text
- View/download PDF
11. A Study in SHMEM: Parallel Graph Algorithm Acceleration with Distributed Symmetric Memory
- Author
-
Ing, Michael, George, Alan D., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Poole, Stephen, editor, Hernandez, Oscar, editor, Baker, Matthew, editor, and Curtis, Tony, editor
- Published
- 2022
- Full Text
- View/download PDF
12. Serverless High-Performance Computing over Cloud
- Author
-
Petrosyan Davit and Astsatryan Hrachya
- Subjects
hpc, mpi, kubernetes, containerization, cloud, Cybernetics, Q300-390
- Abstract
HPC clouds may provide fast access to fully configurable and dynamically scalable virtualized HPC clusters to address complex and challenging computation- and storage-intensive requirements. The complex environmental, software, and hardware requirements and dependencies of such systems make it challenging to carry out large-scale simulations, prediction systems, and other data- and compute-intensive workloads over the cloud. The article presents an architecture (Shoc) that enables HPC workloads to run serverless over the cloud, one of the most critical cloud capabilities for HPC workloads. On the one hand, Shoc utilizes the abstraction power of container technologies like Singularity and Docker, combined with the scheduling and resource management capabilities of Kubernetes. On the other hand, Shoc allows running any CPU-intensive and data-intensive workload in the cloud without needing to manage HPC infrastructure, complex software, and hardware environment deployments.
- Published
- 2022
- Full Text
- View/download PDF
13. Accelerating Phase-Field Simulations for HPC-Systems
- Author
-
Seiz, M., Hötzer, J., Hierl, H., Reiter, A., Schratz, K., Nestler, B., Nagel, Wolfgang E., editor, Kröner, Dietmar H., editor, and Resch, Michael M., editor
- Published
- 2021
- Full Text
- View/download PDF
14. ABCpy: A High-Performance Computing Perspective to Approximate Bayesian Computation
- Author
-
Ritabrata Dutta, Marcel Schoengens, Lorenzo Pacchiardi, Avinash Ummadisingu, Nicole Widmer, Pierre Künzli, Jukka-Pekka Onnela, and Antonietta Mira
- Subjects
abc, hpc, spark, mpi, parallel, imbalance, python library, Statistics, HA1-4737
- Abstract
ABCpy is a highly modular scientific library for approximate Bayesian computation (ABC) written in Python. The main contribution of this paper is to document a software engineering effort that enables domain scientists to easily apply ABC to their research without being ABC experts; using ABCpy they can easily run large parallel simulations without much knowledge about parallelization. Further, ABCpy enables ABC experts to easily develop new inference schemes and evaluate them in a standardized environment and to extend the library with new algorithms. These benefits come mainly from the modularity of ABCpy. We give an overview of the design of ABCpy and provide a performance evaluation concentrating on parallelization. This points us towards the inherent imbalance in some of the ABC algorithms. We develop a dynamic scheduling MPI implementation to mitigate this issue and evaluate the various ABC algorithms according to their adaptability towards high-performance computing. (A brief code sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
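ABCpy itself is a Python library, but the dynamic scheduling mentioned in the abstract above follows the classic MPI master/worker pattern. Below is a generic C sketch (not ABCpy code) in which rank 0 hands out work items on demand, so faster workers automatically receive more items and load imbalance is absorbed.

```c
/* Generic MPI master/worker dynamic scheduling sketch (not ABCpy itself):
 * rank 0 hands out work items on demand, so load imbalance is absorbed. */
#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1
#define TAG_STOP 2

static double run_simulation(int item) { return item * 0.5; } /* placeholder workload */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_items = 1000;

    if (rank == 0) {                       /* master: distribute items dynamically */
        int next = 0, active = size - 1;
        while (active > 0) {
            double result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < n_items) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {
                int dummy = -1;
                MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                --active;
            }
        }
    } else {                               /* worker: request, compute, repeat */
        double result = 0.0;               /* the first send acts as a work request */
        for (;;) {
            MPI_Status st;
            int item;
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&item, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = run_simulation(item);
        }
    }
    MPI_Finalize();
    return 0;
}
```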
15. Supercomputing Frontiers
- Author
-
Panda, Dhabaleswar K. and Sullivan, Michael
- Subjects
cloud computing, computer networks, computer programming, computer systems, CUDA, distributed computer systems, gpu, gpus, hpc, microprocessor chips, mpi, parallel algorithms, parallel architectures, parallel processing systems, parallel programming, programming languages, signal processing, telecommunication systems, algorithms, high performance computing, bic Book Industry Communication::U Computing & information technology::UT Computer networking & communications, bic Book Industry Communication::U Computing & information technology::UM Computer programming / software development::UMZ Software Engineering, bic Book Industry Communication::U Computing & information technology::UL Operating systems, bic Book Industry Communication::U Computing & information technology::UY Computer science::UYF Computer architecture & logic design, bic Book Industry Communication::U Computing & information technology::UK Computer hardware::UKN Network hardware
- Abstract
This open access book constitutes the refereed proceedings of the 7th Asian Supercomputing Conference, SCFA 2022, which took place in Singapore in March 2022. The 8 full papers presented in this book were carefully reviewed and selected from 21 submissions. They cover a range of topics including file systems, memory hierarchy, HPC cloud platforms, container image configuration workflows, large-scale applications, and scheduling.
- Published
- 2022
- Full Text
- View/download PDF
16. GPU Computing for Compute-Intensive Scientific Calculation
- Author
-
Dubey, Sandhya Parasnath, Kumar, M. Sathish, Balaji, S., Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Das, Kedar Nath, editor, Bansal, Jagdish Chand, editor, Deep, Kusum, editor, Nagar, Atulya K., editor, Pathipooranam, Ponnambalam, editor, and Naidu, Rani Chinnappa, editor
- Published
- 2020
- Full Text
- View/download PDF
17. Investigation into MPI All-Reduce Performance in a Distributed Cluster with Consideration of Imbalanced Process Arrival Patterns
- Author
-
Proficz, Jerzy, Sumionka, Piotr, Skomiał, Jarosław, Semeniuk, Marcin, Niedzielewski, Karol, Walczak, Maciej, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Barolli, Leonard, editor, Amato, Flora, editor, Moscato, Francesco, editor, Enokido, Tomoya, editor, and Takizawa, Makoto, editor
- Published
- 2020
- Full Text
- View/download PDF
18. Checkpointing Kernel Executions of MPI+CUDA Applications
- Author
-
Baird, Max, Scholz, Sven-Bodo, Šinkarovs, Artjoms, Bautista-Gomez, Leonardo, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Schwardmann, Ulrich, editor, Boehme, Christian, editor, B. Heras, Dora, editor, Cardellini, Valeria, editor, Jeannot, Emmanuel, editor, Salis, Antonio, editor, Schifanella, Claudio, editor, Manumachu, Ravi Reddy, editor, Schwamborn, Dieter, editor, Ricci, Laura, editor, Sangyoon, Oh, editor, Gruber, Thomas, editor, Antonelli, Laura, editor, and Scott, Stephen L., editor
- Published
- 2020
- Full Text
- View/download PDF
19. Evaluating the Advantage of Reactive MPI-aware Power Control Policies
- Author
-
Cesarini, Daniele, Cavazzoni, Carlo, Bartolini, Andrea, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Wyrzykowski, Roman, editor, Deelman, Ewa, editor, Dongarra, Jack, editor, and Karczewski, Konrad, editor
- Published
- 2020
- Full Text
- View/download PDF
20. A Methodology Approach to Compare Performance of Parallel Programming Models for Shared-Memory Architectures
- Author
-
Utrera, Gladys, Gil, Marisa, Martorell, Xavier, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Sergeyev, Yaroslav D., editor, and Kvasov, Dmitri E., editor
- Published
- 2020
- Full Text
- View/download PDF
21. Parametric Optimization on HPC Clusters with Geneva
- Author
-
Weßner, Jonas, Berlich, Rüdiger, Schwarz, Kilian, and Lutz, Matthias F. M.
- Published
- 2023
- Full Text
- View/download PDF
22. Software Defined Data Center for High Performance Computing Applications
- Author
-
Lozano-Rizk, J. E., Nieto-Hipolito, J. I., Rivera-Rodriguez, R., Cosio-Leon, M. A., Vazquez-Briseno, M., Chimal-Eguia, J. C., Rico-Rodriguez, V., Martinez-Martinez, E., Barbosa, Simone Diniz Junqueira, Editorial Board Member, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Kotenko, Igor, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Torres, Moisés, editor, and Klapp, Jaime, editor
- Published
- 2019
- Full Text
- View/download PDF
23. The Multi-level Adaptive Approach for Efficient Execution of Multi-scale Distributed Applications with Dynamic Workload
- Author
-
Nasonov, Denis, Butakov, Nikolay, Melnik, Michael, Visheratin, Alexandr, Linev, Alexey, Shvets, Pavel, Sobolev, Sergey, Mukhina, Ksenia, Barbosa, Simone Diniz Junqueira, Series Editor, Filipe, Joaquim, Series Editor, Kotenko, Igor, Series Editor, Sivalingam, Krishna M., Series Editor, Washio, Takashi, Series Editor, Yuan, Junsong, Series Editor, Zhou, Lizhu, Series Editor, Ghosh, Ashish, Series Editor, Voevodin, Vladimir, editor, and Sobolev, Sergey, editor
- Published
- 2019
- Full Text
- View/download PDF
24. Modeling and Evaluation of Application-Aware Dynamic Thermal Control in HPC Nodes
- Author
-
Cesarini, Daniele, Bartolini, Andrea, Benini, Luca, Rannenberg, Kai, Editor-in-Chief, Sakarovitch, Jacques, Editorial Board Member, Goedicke, Michael, Editorial Board Member, Tatnall, Arthur, Editorial Board Member, Neuhold, Erich J., Editorial Board Member, Pras, Aiko, Editorial Board Member, Tröltzsch, Fredi, Editorial Board Member, Pries-Heje, Jan, Editorial Board Member, Kreps, David, Editorial Board Member, Reis, Ricardo, Editorial Board Member, Furnell, Steven, Editorial Board Member, Furbach, Ulrich, Editorial Board Member, Winckler, Marco, Editorial Board Member, Malaka, Rainer, Editorial Board Member, Maniatakos, Michail, editor, Elfadel, Ibrahim (Abe) M., editor, Sonza Reorda, Matteo, editor, Ugurdag, H. Fatih, editor, and Monteiro, José, editor
- Published
- 2019
- Full Text
- View/download PDF
25. Legio: fault resiliency for embarrassingly parallel MPI applications.
- Author
-
Rocco, Roberto, Gadioli, Davide, and Palermo, Gianluca
- Subjects
FAULT-tolerant computing, SCALABILITY
- Abstract
Due to the increasing size of HPC machines, dealing with faults is becoming mandatory because of their high frequency. Natively, MPI cannot handle faults: it stops the execution prematurely when it encounters one. With the introduction of ULFM it is possible to continue the execution, but this requires complex integration with the application. In this paper we propose Legio, a framework that introduces fault resiliency in embarrassingly parallel MPI applications. Legio exposes its features to the application transparently, removing any integration difficulty. After a fault, the execution continues only with the non-failed processes. We also propose a hierarchical alternative, which features lower repair costs on large communicators. We evaluated our solutions on the Marconi100 cluster at CINECA with benchmarks and real-world applications, showing that the overhead introduced by the library is negligible and does not limit the scalability properties of MPI. (A brief code sketch follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
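Legio builds on ULFM (User Level Failure Mitigation), an MPI extension that lets surviving processes repair a communicator after a failure. The sketch below only shows the bare pattern, switching to MPI_ERRORS_RETURN and, assuming an ULFM-enabled MPI, shrinking the communicator with MPIX_Comm_shrink when an operation reports a failed peer; it is not Legio's implementation, which hides the full agreement and revocation protocol from the application.

```c
/* Basic ULFM-style recovery sketch (assumes an ULFM-enabled MPI such as recent
 * Open MPI; this is not Legio's implementation). */
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_shrink, MPIX_ERR_PROC_FAILED */
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm = MPI_COMM_WORLD;
    /* Without this, any failure aborts the whole job before we can react. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    double x = 1.0, sum = 0.0;
    int err = MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);

    int eclass = MPI_SUCCESS;
    if (err != MPI_SUCCESS)
        MPI_Error_class(err, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
        /* Real recovery must first make the survivors agree on the failure
         * (MPIX_Comm_revoke / MPIX_Comm_agree) -- exactly the complexity a
         * framework like Legio hides from the application. */
        MPI_Comm survivors;
        MPIX_Comm_shrink(comm, &survivors);
        comm = survivors;
        MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);   /* retry */
    }

    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        printf("sum over surviving ranks = %f\n", sum);

    MPI_Finalize();
    return 0;
}
```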
26. MPI parameter optimization during debugging phase of HPC system.
- Author
-
Du, Qi and Huang, Hui
- Subjects
DEBUGGING, SYSTEMS engineering, ENGINEERING systems
- Abstract
Before an HPC system is delivered to the user, system debugging engineers need to optimize the configuration of all system parameters, including the MPI runtime parameters. This process usually follows a trial-and-error approach, takes time, and requires expert insight into the subtle interactions between the software and the underlying hardware. With the expansion of system and application scale, this work becomes more and more challenging. This paper presents a method to select MPI runtime parameters, which can be used to find the optimal MPI runtime parameters for most applications in a relatively short time. We test our approach on SPEC MPI2007. Experimental results show that our approach achieves up to 11.93% improvement over the default settings.
- Published
- 2022
- Full Text
- View/download PDF
27. Deployment and Analysis of a Hybrid Shared/Distributed-Memory Parallel Visualization Tool for 3-D Oil Reservoir Grid on OpenStack Cloud Computing
- Author
-
Ali A. El-Moursy, Fadi N. Sibai, Hanan Khaled, Salwa M. Nassar, and Mohamed Taher
- Subjects
Cloud computing, data visualization tool, HPC, hybrid (distributed/shared)-memory parallel programming, MPI, multi-threading, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
The main goal of oil reservoir management is to provide more efficient, cost-effective and environmentally safer oil production. Oil production management includes an accurate characterization of the reservoir and strategies that involve interactions between reservoir data and human assessment. Hence, it is important to graphically visualize and handle massive data sets of oil and gas pressure / saturation levels to help decision makers in statistical analysis, history matching and recovery of hydrocarbons of the reservoir. In this article, we experimentally study the parallelization of intensive computation for a 3-D (three-dimensional) oil reservoir data visualization tool. For this tool, we develop and implement a transformation and lighting model to visualize and interact with the grid. Herein, we propose a hybrid (shared-memory and distributed-memory) parallelization technique to adapt to data processing scalability. We tested these implementations on an OpenStack cloud virtual cluster. Our results indicate that although the virtual platform adds overhead for running parallel implementations, utilizing knowledge of the VM location on the compute host and of the network traffic among VMs to deploy the virtual environment can achieve significant performance enhancements. The hybrid parallel implementation using a large data size can achieve a 70x speedup over serial execution without owning a costly HPC infrastructure, as the conventional parallel processing deployment model would require.
- Published
- 2020
- Full Text
- View/download PDF
28. Multiprocessor Scheduling Based on Evolutionary Technique for Solving Permutation Flow Shop Problem
- Author
-
Annu Priya and Sudip Kumar Sahana
- Subjects
Average relative percentage deviation (RPD), computational time (CT), flowshop, GA_PO_MPI, HPC, MPI, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Multiprocessor scheduling is one of the thrust areas in the field of computational science. Various traditional scheduling techniques exist for the allocation and processing of jobs, but their performance degrades in terms of makespan and waiting time when a large number of jobs are allocated to multiprocessors. In this paper, a new stochastic evolutionary technique is proposed based on the Genetic Algorithm and Pareto optimality. The new technique is implemented in a high-performance computing (HPC) environment using the Message Passing Interface (MPI) to solve the permutation flow shop scheduling problem. The Pareto optimality technique is used for sample distribution and as the basis of the decision to select the lower bound of the makespan, instead of selecting the makespan directly for the best solution. The performance and quality of the proposed techniques (GA_PO_MPI, GA) are compared with traditional techniques (FCFS, FCFS_MPI, TSAB, TSGP, TSGW) on the basis of Relative Percentage Deviation (RPD), Computational Time (CT), and Average Waiting Time, and found to be satisfactory.
- Published
- 2020
- Full Text
- View/download PDF
29. Accelerated Purge Processes of Parallel File System on HPC by Using MPI Programming
- Author
-
Kwon, Min-Woo, Yoon, JunWeon, Hong, TaeYoung, Park, ChanYeol, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Ruediger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Hirche, Sandra, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Liang, Qilian, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Möller, Sebastian, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zhang, Junjie James, Series Editor, Park, James J., editor, Loia, Vincenzo, editor, Yi, Gangman, editor, and Sung, Yunsick, editor
- Published
- 2018
- Full Text
- View/download PDF
30. Betweenness Propagation
- Author
-
Hanzelka, Jiří, Běloch, Michal, Křenek, Jan, Martinovič, Jan, Slaninová, Kateřina, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Weikum, Gerhard, Series Editor, Saeed, Khalid, editor, and Homenda, Władysław, editor
- Published
- 2018
- Full Text
- View/download PDF
31. Parallel Version of the Framework for Clustering Error Messages.
- Author
-
Vorobyov, M., Zhukov, K., Grigorieva, M., and Korobkov, S.
- Abstract
Distributed computing environments execute a great number of computing jobs that can fail or break for various reasons. The analysis of the error messages describing the reasons for failures has become one of the most crucial parts of existing monitoring systems. This analysis is complicated by the presence of a large number of messages, especially in the case of retrospective analysis. The ClusterLogs framework was developed as a modular and flexible tool for clustering the error messages of computing jobs in distributed computing infrastructures. The general purpose of this tool is to simplify error message analysis by grouping together messages that share similar failure reasons and textual patterns. The proposed clustering method includes a set of sequential data processing stages and provides various clustering options: deterministic similarity-based clustering and unsupervised machine learning methods with preliminary vectorization of error messages using a word embedding technique. Performance tests revealed the most time-consuming stages. In this paper we describe the parallelization method for these stages and demonstrate how it has increased the performance of the whole clustering pipeline. The performance tests were executed on the HPC system Polus.
- Published
- 2021
- Full Text
- View/download PDF
32. McMPI : a managed-code message passing interface library for high performance communication in C#
- Author
-
Holmes, Daniel John, Booth, Stephen, Hardy, Judy, and Trew, Arthur
- Subjects
005.7 ,MPI ,Message-Passing Interface ,HPC ,high-performance computing ,.net ,dotNET ,multi-threaded - Abstract
This work endeavours to achieve technology transfer between established best-practice in academic high-performance computing and current techniques in commercial high-productivity computing. It shows that a credible high-performance message-passing communication library, with semantics and syntax following the Message-Passing Interface (MPI) Standard, can be built in pure C# (one of the .Net suite of computer languages). Message-passing has been the dominant paradigm in high-performance parallel programming of distributed-memory computer architectures for three decades. The MPI Standard originally distilled architecture-independent and language-agnostic ideas from existing specialised communication libraries and has since been enhanced and extended. Object-oriented languages can increase programmer productivity, for example by allowing complexity to be managed through encapsulation. Both the C# computer language and the .Net common language runtime (CLR) were originally developed by Microsoft Corporation but have since been standardised by the European Computer Manufacturers Association (ECMA) and the International Standards Organisation (ISO), which facilitates portability of source-code and compiled binary programs to a variety of operating systems and hardware. Combining these two open and mature technologies enables mainstream programmers to write tightly-coupled parallel programs in a popular standardised object-oriented language that is portable to most modern operating systems and hardware architectures. This work also establishes that a thread-to-thread delivery option increases shared-memory communication performance between MPI ranks on the same node. This suggests that the thread-as-rank threading model should be explicitly specified in future versions of the MPI Standard and then added to existing MPI libraries for use by thread-safe parallel codes. This work also ascertains that the C# socket object suffers from undesirable characteristics that are critical to communication performance and proposes ways of improving the implementation of this object.
- Published
- 2012
33. COUNTDOWN: A Run-Time Library for Performance-Neutral Energy Saving in MPI Applications.
- Author
-
Cesarini, Daniele, Bartolini, Andrea, Bonfa, Pietro, Cavazzoni, Carlo, and Benini, Luca
- Subjects
ENERGY consumption, SYNCHRONIZATION, ELECTRIC power conservation, ESPRESSO
- Abstract
Power and energy consumption are becoming key challenges in the supercomputers' exascale race. HPC systems' processors waste active power during communication and synchronization among the MPI processes of large-scale HPC applications. However, due to the time scale at which communication happens, transitioning into low-power states while waiting for the completion of each communication may introduce unacceptable overhead. In this article, we present COUNTDOWN, a run-time library for identifying and automatically reducing the power consumption of the CPUs during communication and synchronization. COUNTDOWN saves energy without penalizing the time-to-completion by lowering CPU power consumption only during idle times for which the power state transition overhead is negligible. This is done transparently to the user, without requiring labor-intensive and error-prone application code modifications, nor recompilation of the application. We test our methodology on a production Tier-1 system. For the NAS benchmarks, COUNTDOWN saves between 6 and 50 percent energy, with a time-to-solution penalty lower than 5 percent. In a complete production application, Quantum ESPRESSO, on a 3.5K-core run, COUNTDOWN saves 22.36 percent energy, with a performance penalty below 3 percent. Energy saving increases to 37 percent with a performance penalty of 6.38 percent if the application is executed without communication tuning. (A brief code sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
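COUNTDOWN intercepts MPI calls transparently through the profiling layer. The sketch below is a heavily reduced illustration of that idea, not the library's actual logic: a blocking call is wrapped so the core can drop into a lower power state while it waits. enter_low_power() and exit_low_power() are hypothetical hooks standing in for real DVFS control, and the real library arms them only after a short timeout so fast communications never pay the switching overhead.

```c
/* Reduced sketch of PMPI-based power saving around a blocking call
 * (not COUNTDOWN's actual logic). enter_low_power()/exit_low_power() are
 * hypothetical hooks standing in for real DVFS / C-state control. */
#include <mpi.h>

static void enter_low_power(void) { /* e.g. request a lower P-state via a power library */ }
static void exit_low_power(void)  { /* restore the nominal frequency */ }

/* A timer would normally arm the low-power transition only after a short delay,
 * so that short waits never pay the frequency-switch overhead. */
int MPI_Barrier(MPI_Comm comm)
{
    enter_low_power();               /* the CPU is about to idle-wait in the barrier */
    int err = PMPI_Barrier(comm);    /* forward to the real implementation */
    exit_low_power();                /* resume full speed for computation */
    return err;
}
```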
34. Parallel Programming in Biological Sciences, Taking Advantage of Supercomputing in Genomics
- Author
-
Orozco-Arias, Simon, Tabares-Soto, Reinel, Ceballos, Diego, Guyot, Romain, Diniz Junqueira Barbosa, Simone, Series editor, Chen, Phoebe, Series editor, Du, Xiaoyong, Series editor, Filipe, Joaquim, Series editor, Kotenko, Igor, Series editor, Liu, Ting, Series editor, Sivalingam, Krishna M., Series editor, Washio, Takashi, Series editor, Solano, Andrés, editor, and Ordoñez, Hugo, editor
- Published
- 2017
- Full Text
- View/download PDF
35. Performance Evaluation of Multiple Cloud Data Centers Allocations for HPC
- Author
-
Roloff, Eduardo, Diaz Carreño, Emmanuell, Valverde-Sánchez, Jimmy K. M., Diener, Matthias, da Silva Serpa, Matheus, Houzeaux, Guillaume, Schnorr, Lucas M., Maillard, Nicolas, Gaspary, Luciano Paschoal, Navaux, Philippe, Diniz Junqueira Barbosa, Simone, Series editor, Chen, Phoebe, Series editor, Du, Xiaoyong, Series editor, Filipe, Joaquim, Series editor, Kara, Orhun, Series editor, Kotenko, Igor, Series editor, Liu, Ting, Series editor, Sivalingam, Krishna M., Series editor, Washio, Takashi, Series editor, Barrios Hernández, Carlos Jaime, editor, Gitler, Isidoro, editor, and Klapp, Jaime, editor
- Published
- 2017
- Full Text
- View/download PDF
36. Flexible Neural Trees—Parallel Learning on HPC
- Author
-
Hanzelka, Jiří, Dvorský, Jiří, Kacprzyk, Janusz, Series editor, Pal, Nikhil R., Advisory editor, Bello Perez, Rafael, Advisory editor, Corchado, Emilio S., Advisory editor, Hagras, Hani, Advisory editor, Kóczy, László T., Advisory editor, Kreinovich, Vladik, Advisory editor, Lin, Chin-Teng, Advisory editor, Lu, Jie, Advisory editor, Melin, Patricia, Advisory editor, Nedjah, Nadia, Advisory editor, Nguyen, Ngoc Thanh, Advisory editor, Wang, Jun, Advisory editor, Chaki, Rituparna, editor, Saeed, Khalid, editor, Cortesi, Agostino, editor, and Chaki, Nabendu, editor
- Published
- 2017
- Full Text
- View/download PDF
37. An Application-Level Solution for the Dynamic Reconfiguration of MPI Applications
- Author
-
Cores, Iván, González, Patricia, Jeannot, Emmanuel, Martín, María J., Rodríguez, Gabriel, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Dutra, Inês, editor, Camacho, Rui, editor, Barbosa, Jorge, editor, and Marques, Osni, editor
- Published
- 2017
- Full Text
- View/download PDF
38. Modelling the Earth's geomagnetic environment on Cray machines using PETSc and SLEPc.
- Author
-
Brown, Nick, Bainbridge, Brian, Beggan, Ciarán, Brown, William, Hamilton, Brian, and Macmillan, Susan
- Subjects
GEOMAGNETISM, SYMMETRIC matrices, GEOLOGICAL surveys, DATA structures
- Abstract
The British Geological Survey's global geomagnetic model, the Model of the Earth's Magnetic Environment (MEME), is an important tool for calculating the strength and direction of the Earth's magnetic field, which is continually in flux. While the ability to collect data from ground‐based observation sites and satellites has grown rapidly, the memory-bound nature of the original code has proved a significant limitation on the size of the modelling problem. In this paper, we describe work done replacing the bespoke, sequential eigensolver with the PETSc/SLEPc package for solving the system of normal equations. Adopting PETSc/SLEPc also required fundamental changes in how we built and distributed the data structures, and as such, we describe an approach for building symmetric matrices that provides good load balance and avoids the need for close coordination between the processes or replication of work. We also study the memory-bound nature of the code from an irregular-memory-access perspective and combine detailed profiling with software cache prefetching to significantly optimise it. Performance and scaling characteristics are explored on ARCHER, a Cray XC30, where we achieved a 294x speed-up of the solver by replacing the model's bespoke approach with SLEPc. (A brief code sketch follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
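The abstract above mentions combining profiling with software cache prefetching for irregular memory accesses. As a generic illustration unrelated to the MEME code itself, the C sketch below uses the GCC/Clang __builtin_prefetch intrinsic to request data a fixed distance ahead in an indirectly indexed loop.

```c
/* Generic software-prefetching sketch for an irregular (indirectly indexed) loop,
 * in the spirit of the optimisation mentioned above (not the MEME code itself). */
#include <stddef.h>

void gather_scale(double *restrict out, const double *restrict values,
                  const size_t *restrict idx, size_t n, double alpha)
{
    const size_t dist = 16;                       /* prefetch distance, tuned empirically */
    for (size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            __builtin_prefetch(&values[idx[i + dist]], 0 /* read */, 1 /* low locality */);
        out[i] = alpha * values[idx[i]];          /* irregular access through idx[] */
    }
}
```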
39. Parallel performance of molecular dynamics trajectory analysis.
- Author
-
Khoshlessan, Mahzad, Paraskevakos, Ioannis, Fox, Geoffrey C., Jha, Shantenu, and Beckstein, Oliver
- Subjects
MESSAGE passing (Computer science), FILES (Records), MAGNITUDE (Mathematics), PYTHON programming language, PERFORMANCES, MOLECULAR dynamics, HIGH performance computing, ITERATIVE learning control
- Abstract
The performance of biomolecular molecular dynamics simulations has steadily increased on modern high‐performance computing resources, but acceleration of the analysis of the output trajectories has lagged behind, so that analyzing simulations is becoming a bottleneck. To close this gap, we studied the performance of trajectory analysis with message passing interface (MPI) parallelization and the Python MDAnalysis library on three different Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers where trajectories were read from a Lustre parallel file system. Strong scaling performance was impeded by stragglers, MPI processes that were slower than the typical process. Stragglers were less prevalent for compute‐bound workloads, thus pointing to file reading as a bottleneck for scaling. However, a more complicated picture emerged in which both the computation and the data ingestion exhibited close to ideal strong scaling behavior, whereas stragglers were primarily caused by either large MPI communication costs or long times to open the single shared trajectory file. We improved overall strong scaling performance by either subfiling (splitting the trajectory into separate files) or MPI‐IO with parallel HDF5 trajectory files. The parallel HDF5 approach resulted in near ideal strong scaling on up to 384 cores (16 nodes), thus reducing trajectory analysis times by two orders of magnitude compared with the serial approach. (A brief code sketch follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
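The study above identifies stragglers by comparing per-rank timings. A language-agnostic way to collect such timings, sketched here in C rather than the paper's Python/MDAnalysis stack, is to time each rank's block of frames with MPI_Wtime and gather the results on rank 0.

```c
/* Sketch of per-rank timing to spot stragglers (a C stand-in for the Python/MDAnalysis
 * workflow in the paper): each rank analyses its block of frames, timings are gathered. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void analyse_frame(int frame) { (void)frame; /* placeholder analysis kernel */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_frames = 100000;
    int chunk = (n_frames + size - 1) / size;
    int lo = rank * chunk;
    int hi = (lo + chunk < n_frames) ? lo + chunk : n_frames;

    double t0 = MPI_Wtime();
    for (int f = lo; f < hi; ++f)
        analyse_frame(f);                     /* I/O + computation per frame */
    double my_time = MPI_Wtime() - t0;

    double *times = (rank == 0) ? malloc(size * sizeof(double)) : NULL;
    MPI_Gather(&my_time, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {                          /* ranks far above the mean are stragglers */
        double max = 0.0, sum = 0.0;
        for (int r = 0; r < size; ++r) { sum += times[r]; if (times[r] > max) max = times[r]; }
        printf("mean %.3f s, max %.3f s (imbalance %.2fx)\n",
               sum / size, max, max / (sum / size));
        free(times);
    }
    MPI_Finalize();
    return 0;
}
```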
40. FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures.
- Author
-
Hashmi, Jahanzeb Maqbool, Chu, Ching-Hsiang, Chakraborty, Sourav, Bayatpour, Mohammadreza, Subramoni, Hari, and Panda, Dhabaleswar K.
- Subjects
GRAPHICS processing units, BANDWIDTHS, STENCIL work, TRANSLATIONS
- Abstract
This paper addresses the challenges of MPI derived datatype processing and proposes FALCON-X, a Fast and Low-overhead Communication framework for optimized zero-copy intra-node derived datatype communication on emerging CPU/GPU architectures. We quantify various performance bottlenecks, such as memory layout translation and copy overheads, for highly fragmented MPI datatypes and propose novel pipelining and memoization-based designs to achieve efficient derived datatype communication. In addition, we also propose enhancements to the MPI standard to address its semantic limitations. The experimental evaluations show that our proposed designs significantly improve intra-node communication latency and bandwidth over state-of-the-art MPI libraries on modern CPU and GPU systems. Using representative application kernels such as MILC, WRF, NAS_MG, Specfem3D, and stencils on three different CPU architectures and two different GPU systems including DGX-2, we demonstrate up to 5.5x improvement on multi-core CPUs and 120x benefits on the DGX-2 GPU system over state-of-the-art designs in other MPI libraries. Highlights: identified the challenges involved in zero-copy MPI derived datatype processing; proposed novel designs to overcome semantic limitations and performance overheads; demonstrated the efficacy of the proposed designs on state-of-the-art CPU and GPU systems; evaluated performance on diverse CPU and GPU architectures (e.g., DGX-2); achieved significant improvement over other MPI libraries on modern HPC hardware. (A brief code sketch follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
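The paper targets highly fragmented MPI derived datatypes. For readers unfamiliar with the feature, the minimal standard-MPI example below (not FALCON-X code) builds a vector datatype describing a strided column of a row-major matrix and sends it in a single call.

```c
/* Minimal standard-MPI derived-datatype example (not FALCON-X itself):
 * describe a strided column of a row-major NxN matrix and send it in one call. */
#include <mpi.h>
#include <stdio.h>

#define N 8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }   /* needs at least two ranks */

    double a[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = (rank == 0) ? i * N + j : 0.0;

    /* N blocks of 1 double, separated by a stride of N doubles: one matrix column. */
    MPI_Datatype column;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);        /* column 2 */
    else if (rank == 1) {
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received a[3][2] = %g\n", a[3][2]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```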
41. Massively parallel numerical simulation using up to 36,000 CPU cores of an industrial-scale polydispersed reactive pressurized fluidized bed with a mesh of one billion cells.
- Author
-
Neau, Hervé, Pigou, Maxime, Fede, Pascal, Ansart, Renaud, Baudry, Cyril, Mérigoux, Nicolas, Laviéville, Jérome, Fournier, Yvan, Renon, Nicolas, and Simonin, Olivier
- Subjects
COMPUTER simulation, FLUIDIZED bed reactors, COMPUTATIONAL fluid dynamics, POLYMERIZATION reactors, HIGH performance computing, MULTICORE processors, SUPERCOMPUTERS
- Abstract
For the last 30 years, experimental and modeling studies have been carried out on fluidized bed reactors from laboratory up to industrial scales. The application of the developed models for predictive simulations has however been strongly limited by the available computational power and the capability of computational fluid dynamics software to handle large enough simulations. In recent years, both aspects have made significant advances, and we now demonstrate the feasibility of a massively parallel simulation, on whole supercomputers using NEPTUNE_CFD, of an industrial-scale polydispersed fluidized-bed reactor. This simulation of an olefin polymerization reactor makes use of an unsteady Eulerian multi-fluid approach and relies on a one-billion-cell mesh. This is a worldwide first, as the obtained accuracy is yet unmatched for such a large-scale system. The interest of this work is two-fold. In terms of High Performance Computing (HPC), all steps of setting up the simulation, running it with NEPTUNE_CFD, and post-processing the results pose multiple challenges due to the scale of the simulation. The simulation ran on 1260 to 36,000 cores on supercomputers, used 15 million CPU hours, and generated 200 TB of raw data for a simulated physical time of 25 s. This article details the methodology applied to handle this simulation and also focuses on computational performance in terms of profiling, code efficiency, and partitioning method suitability. While interesting in itself, the HPC challenge is not the only goal of this work, as this highly resolved simulation will benefit the chemical engineering and CFD communities. Indeed, this computation enables the possibility to account, in a realistic way, for complex flows in an industrial-scale reactor. The predicted behavior is described, and results are post-processed to develop sub-grid models. These will allow for lower-cost simulations with coarser meshes while still encompassing local phenomena. Highlights: an industrial-scale reactive polydispersed gas-solid fluidized bed is simulated; the mesh reaches a size of a billion cells and the simulation ran on 36,000 CPU cores; focus is placed on HPC challenges at all steps (simulation setup, run, post-processing); massively parallel performance of NEPTUNE_CFD is evaluated on two supercomputers; sub-grid models are developed from the highly resolved simulation.
- Published
- 2020
- Full Text
- View/download PDF
42. To improve scalability with Boolean matrix using efficient gossip failure detection and consensus algorithm for PeerSim simulator in IoT environment
- Author
-
Kumar, Surendra, Samriya, Jitendra Kumar, Yadav, Arun Singh, and Kumar, Mohit
- Published
- 2022
- Full Text
- View/download PDF
43. Experimenting Large Prime Numbers Generation in MPI Cluster
- Author
-
Nilesh Maltare, Chetan Chudasama, Kacprzyk, Janusz, Series editor, Satapathy, Suresh Chandra, editor, Bhatt, Yogesh Chandra, editor, Joshi, Amit, editor, and Mishra, Durgesh Kumar, editor
- Published
- 2016
- Full Text
- View/download PDF
44. Porting the MPI Parallelized LES Model PALM to Multi-GPU Systems – An Experience Report
- Author
-
Knoop, Helge, Gronemeier, Tobias, Knigge, Christoph, Steinbach, Peter, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Taufer, Michela, editor, Mohr, Bernd, editor, and Kunkel, Julian M., editor
- Published
- 2016
- Full Text
- View/download PDF
45. SONAR: Automated Communication Characterization for HPC Applications
- Author
-
Lammel, Steffen, Zahn, Felix, Fröning, Holger, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Taufer, Michela, editor, Mohr, Bernd, editor, and Kunkel, Julian M., editor
- Published
- 2016
- Full Text
- View/download PDF
46. Scaling bioinformatics applications on HPC
- Author
-
Mike Mikailov, Fu-Jyh Luo, Stuart Barkley, Lohit Valleru, Stephen Whitney, Zhichao Liu, Shraddha Thakkar, Weida Tong, and Nicholas Petrick
- Subjects
HPC, Blast, Parallelization, MPI, Multi-threading, Bioinformatics, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
- Abstract
Background: Recent breakthroughs in molecular biology and next-generation sequencing technologies have led to the exponential growth of sequence databases. Researchers use BLAST for processing these sequences. However, traditional software parallelization techniques (threads, Message Passing Interface) applied in newer versions of BLAST are not adequate for processing these sequences in a timely manner. Methods: A new method for array job parallelization has been developed which offers O(T) theoretical speed-up in comparison to multi-threading and MPI techniques, where T is the number of array job tasks. (The number of CPUs that will be used to complete the job equals T multiplied by the number of CPUs used by a single task.) The approach is based on segmentation of both input datasets to the BLAST process, combining partial solutions published earlier (Dhanker and Gupta, Int J Comput Sci Inf Technol 5:4818-4820, 2014), (Grant et al., Bioinformatics 18:765-766, 2002), (Mathog, Bioinformatics 19:1865-1866, 2003). It is accordingly referred to as a "dual segmentation" method. In order to implement the new method, the BLAST source code was modified to allow the researcher to pass to the program the number of records (effective number of sequences) in the original database. The team also developed methods to manage and consolidate the large number of partial results that get produced. Dual segmentation allows for massive parallelization, which lifts the scaling ceiling in exciting ways. Results: BLAST jobs that hitherto failed or slogged inefficiently to completion now finish with speeds that characteristically reduce wall-clock time from 27 days on 40 CPUs to a single day using 4104 tasks, each task utilizing eight CPUs and taking less than 7 minutes to complete. Conclusions: The massive increase in the number of tasks when running an analysis job with dual segmentation reduces the size, scope, and execution time of each task. Besides significant speed of completion, additional benefits include fine-grained checkpointing and increased flexibility of job submission. "Trickling in" a swarm of individual small tasks tempers competition for CPU time in the shared HPC environment, and jobs submitted during quiet periods can complete in extraordinarily short time frames. The smaller task size also allows the use of older and less powerful hardware: the CDRH workhorse cluster was commissioned in 2010, yet its eight-core CPUs with only 24 GB RAM work well in 2017 for these dual segmentation jobs. Finally, these techniques are friendly to budget-conscious scientific research organizations where probabilistic algorithms such as BLAST might discourage attempts at greater certainty because single runs represent a major resource drain. If a job that used to take 24 days can now be completed in less than an hour or on a space-available basis (which is the case at CDRH), repeated runs for more exhaustive analyses can be usefully contemplated. (A brief code sketch follows this entry.)
- Published
- 2017
- Full Text
- View/download PDF
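The "dual segmentation" idea is to split both the query set and the database, so that an array job of T = Q x D tasks runs BLAST on one (query chunk, database chunk) pair per task. The C sketch below shows only the index arithmetic a task might use to locate its pair from the scheduler-assigned array-task ID; the chunk counts, environment variable, and file names are illustrative assumptions, not the authors' tooling (72 x 57 is merely one factorisation that yields the 4104 tasks mentioned above).

```c
/* Sketch of the "dual segmentation" index arithmetic (not the authors' tooling):
 * an array task maps its scheduler-assigned task ID to one (query chunk, db chunk)
 * pair, then would run BLAST on exactly that pair of files. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n_query_chunks = 72;     /* query file split into Q pieces (assumed) */
    const int n_db_chunks    = 57;     /* database split into D pieces (assumed)   */
    /* total array size T = Q * D = 4104 tasks, matching the example above         */
    (void)n_query_chunks;

    const char *id = getenv("SLURM_ARRAY_TASK_ID");   /* scheduler-dependent variable */
    int task = id ? atoi(id) : 0;                     /* assume 0-based IDs here */

    int q = task / n_db_chunks;        /* which query chunk this task searches   */
    int d = task % n_db_chunks;        /* against which database chunk           */

    printf("task %d -> query_%03d.fa vs db_%03d (hypothetical file names)\n",
           task, q, d);
    /* The real job would now exec blastn/blastp on those two files and write
     * a partial result to be merged later. */
    return 0;
}
```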
47. Investigating Dependency Graph Discovery Impact on Task-based MPI+OpenMP Applications Performances
- Author
-
Pereira, Romain, Roussel, Adrien, Carribault, Patrick, Gautier, Thierry, DAM Île-de-France (DAM/DIF), Direction des Applications Militaires (DAM), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA), Algorithms and Software Architectures for Distributed and HPC Platforms (AVALON), Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure de Lyon (ENS de Lyon)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Lyon, Institut National de Recherche en Informatique et en Automatique (Inria), Laboratoire en Informatique Haute Performance pour le Calcul et la simulation (LIHPC), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Direction des Applications Militaires (DAM), and Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay
- Subjects
HPC, Task, Dependency, OpenMP, MPI, [INFO]Computer Science [cs], [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Graph
- Abstract
The architecture of supercomputers is evolving to expose massive parallelism. MPI and OpenMP are widely used in application codes on the largest supercomputers in the world. The community primarily focused on composing MPI with OpenMP before version 3.0 of OpenMP introduced task-based programming. Recent advances in the OpenMP task model and its interoperability with MPI have enabled fine model composition and seamless support for asynchrony. Yet, OpenMP tasking overheads limit the gain of task-based applications over their historical loop parallelization (the parallel for construct). This paper identifies the speed of OpenMP task dependency graph discovery as a limiting factor in the performance of task-based applications. We study its impact on intra- and inter-node performance over two benchmarks (Cholesky, HPCG) and a proxy-application (LULESH). We evaluate the performance impact of several discovery optimizations and introduce a persistent task dependency graph that reduces overheads by a factor of up to 15 at run time. We measure a 2x speedup over parallel-for versions weak-scaled to 16K cores, due to improved cache memory use and communication overlap enabled by task refinement and depth-first scheduling. (A brief code sketch follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
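For readers unfamiliar with the OpenMP construct whose discovery cost the paper studies, here is a small self-contained C example (not the paper's benchmarks): the runtime rebuilds the task dependency graph from the depend clauses every time the region executes, which is the rediscovery overhead a persistent task graph avoids.

```c
/* Small OpenMP task-dependency example (not the paper's benchmarks): the runtime
 * discovers the dependency graph from the depend clauses each time this region runs. */
#include <stdio.h>

#define NB 8

int main(void)
{
    double block[NB];
    for (int i = 0; i < NB; ++i) block[i] = i;

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 1; i < NB; ++i) {
            /* Each task reads its left neighbour and updates its own element:
             * the depend clauses express a chain of dependencies the runtime
             * must discover before it can schedule anything. */
            #pragma omp task depend(in: block[i-1]) depend(inout: block[i])
            block[i] += 0.5 * block[i - 1];
        }
    }   /* the barrier at the end of single/parallel waits for all tasks */

    printf("block[NB-1] = %f\n", block[NB - 1]);
    return 0;
}
```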
48. An introspection monitoring library to improve MPI communication time
- Author
-
Emmanuel Jeannot, Richard Sartori, Topology-Aware System-Scale Data Management for High-Performance Computing (TADAAM), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), and Bull atos technologies
- Subjects
communication optimization, monitoring, Hardware and Architecture, HPC, MPI, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Software, Information Systems, Theoretical Computer Science
- Abstract
In this paper we describe how to improve the communication time of MPI parallel applications by using a library that monitors MPI applications and allows for introspection (the program itself can query the state of the monitoring system). Based on previous work, this library is able to see how collective communications are decomposed into point-to-point messages. It also features monitoring sessions that allow suspending and restarting the monitoring, limiting it to specific portions of the code. Experiments show that the monitoring overhead is very small and that the proposed features allow for dynamic and efficient rank reordering, enabling up to a 2x reduction in the communication time of some programs.
- Published
- 2023
- Full Text
- View/download PDF
49. Impact of Cache Coherence on the Performance of Shared-Memory based MPI Primitives: A Case Study for Broadcast on Intel Xeon Scalable Processors - Computational Artifacts
- Author
-
Katevenis, George, Ploumidis, Manolis, and Marazakis, Manolis
- Subjects
shared-memory, multi-core, cache coherency, HPC, MPI, collectives, Intel Xeon Scalable, broadcast, intra-node
- Abstract
Collection of computational artifacts (source code, scripts, datasets, instructions) for reproducibility of the experiments featured in the associated paper: Impact of Cache Coherence on the Performance of Shared-Memory based MPI Primitives: A Case Study for Broadcast on Intel Xeon Scalable Processors. George Katevenis, Manolis Ploumidis, and Manolis Marazakis, ICPP 2023, Salt Lake City, Utah, USA. (A brief code sketch follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
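The case study concerns broadcast implemented over shared memory within a node. A bare-bones version of that pattern, using only standard MPI-3 shared-memory windows and not the artifact's code, is sketched below: the node's ranks map a common window, the root writes the payload, and the other ranks read it after synchronisation.

```c
/* Bare-bones intra-node broadcast over an MPI-3 shared-memory window
 * (illustration only, not the artifact's code). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator containing only the ranks on this node. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node);
    int nrank;
    MPI_Comm_rank(node, &nrank);

    /* Rank 0 of the node owns the buffer; everyone else allocates 0 bytes. */
    const MPI_Aint nbytes = 1 << 20;
    char *base;
    MPI_Win win;
    MPI_Win_allocate_shared(nrank == 0 ? nbytes : 0, 1, MPI_INFO_NULL, node,
                            &base, &win);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);   /* passive-target epoch for the window */

    /* Non-root ranks query the address of rank 0's segment in their own mapping. */
    MPI_Aint qsize; int qdisp; char *buf = base;
    if (nrank != 0)
        MPI_Win_shared_query(win, 0, &qsize, &qdisp, &buf);

    if (nrank == 0)
        memset(buf, 42, nbytes);          /* root writes the broadcast payload */

    MPI_Win_sync(win);                    /* memory barrier on the window */
    MPI_Barrier(node);                    /* readers wait for the writer */
    MPI_Win_sync(win);

    printf("node rank %d sees buf[0] = %d\n", nrank, buf[0]);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```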
50. OpenCS: a framework for parallelisation of equation-based simulation programs
- Author
-
Nikolić, Dragan D.
- Subjects
modelling, ODE, OpenCL, parallel computing, HPC, MPI, OpenMP, DAE, simulation, heterogeneous computing
- Abstract
In this work, the main ideas, the key concepts and the implementation details of the Open Compute Stack (OpenCS) framework are presented. The OpenCS framework is a common platform for modelling of problems described by large-scale systems of differential and algebraic equations, parallel evaluation of model equations on diverse types of computing devices (including heterogeneous setups), parallel simulation on shared and distributed memory systems, and model exchange. The main components and the methodology of OpenCS are described: (1) model specification data structures for a description of general systems of differential and algebraic equations, (2) a method to describe, store in computer memory and evaluate model equations on general purpose and streaming processors, (3) algorithms for partitioning of general systems of equations in the presence of multiple load balancing constraints and for inter-process data exchange, (4) an Application Programming Interface (API), and (5) a cross-platform generic simulation software. The benefits provided by the framework are discussed in detail. The model specification data structures provide a simple platform-independent binary interface for model exchange and allow the same model representation to be used on different high-performance computing systems and architectures. Model equations are stored as an array of binary data (bytecode instructions) which can be directly evaluated on virtually all computing devices with no additional processing. The capabilities of the framework are illustrated using two large-scale problems. Simulations are performed sequentially on a single processor and in parallel using the MPI interface. Multi-core CPU, discrete GPU and heterogeneous CPU/GPU setups are used for the evaluation of model equations utilising the OpenMP API and the OpenCL framework for parallelism. The overall performance and the performance of the four main phases and four sub-phases of the numerical solution are analysed and compared to the theoretical maximum.
- Published
- 2023
- Full Text
- View/download PDF