12 results for "Massively parallel and high-performance simulations"
Search Results
2. Compressed Neighbour Lists for SPH.
- Author
-
Band, Stefan, Gissler, Christoph, and Teschner, Matthias
- Subjects
-
MESSAGE passing (Computer science), NEIGHBORS, HYDRODYNAMICS, DATA compression, DATA mapping, IMAGE compression
- Abstract
We propose a novel compression scheme to store neighbour lists for iterative solvers that employ Smoothed Particle Hydrodynamics (SPH). The compression scheme is inspired by Stream VByte, but uses a non‐linear mapping from data to data bytes, yielding memory savings of up to 87%. It is part of a novel variant of the Cell‐Linked‐List (CLL) concept that is inspired by compact hashing with an improved processing of the cell‐particle relations. We show that the resulting neighbour search outperforms compact hashing in terms of speed and memory consumption. Divergence‐Free SPH (DFSPH) scenarios with up to 1.3 billion SPH particles can be processed on a 24‐core PC using 172 GB of memory. Scenes with more than 7 billion SPH particles can be processed in a Message Passing Interface (MPI) environment with 112 cores and 880 GB of RAM. The neighbour search is also useful for interactive applications. A DFSPH simulation step for up to 0.2 million particles can be computed in less than 40 ms on a 12‐core PC. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
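The compression idea in this record can be illustrated with a plain variable-byte codec over delta-encoded neighbour indices: small deltas between sorted neighbour IDs need few bytes, and per-value length codes are packed separately, Stream-VByte-style. This is a minimal sketch of the general technique, not the paper's non-linear byte mapping; all function names are hypothetical.

```python
def encode_neighbours(indices):
    """Delta-encode a sorted neighbour list, then store each delta in the
    fewest whole bytes (1-4). Two-bit length codes are packed four per
    control byte, Stream-VByte-style."""
    deltas = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]
    codes, data = [], bytearray()
    for d in deltas:
        n = max(1, (d.bit_length() + 7) // 8)  # bytes needed, 1..4
        codes.append(n - 1)                    # 2-bit length code
        data += d.to_bytes(n, "little")
    ctrl = bytearray()
    for i in range(0, len(codes), 4):          # pack 4 codes per byte
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= c << (2 * j)
        ctrl.append(byte)
    return bytes(ctrl), bytes(data), len(deltas)

def decode_neighbours(ctrl, data, count):
    """Invert encode_neighbours: read length codes, then prefix-sum deltas."""
    out, pos, total = [], 0, 0
    for i in range(count):
        n = ((ctrl[i // 4] >> (2 * (i % 4))) & 0b11) + 1
        total += int.from_bytes(data[pos:pos + n], "little")
        pos += n
        out.append(total)
    return out
```

Because neighbour IDs within a cell are close together, most deltas fit in a single byte, which is where the memory savings come from.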
3. Fast Fluid Simulations with Sparse Volumes on the GPU.
- Author
-
Wu, Kui, Truong, Nghia, Yuksel, Cem, and Hoetzlein, Rama
- Subjects
-
COMPUTER simulation of fluid dynamics, GRAPHICS processing units, SPARSE matrix software, COMPUTATIONAL hydrodynamics software, GRID computing
- Abstract
We introduce efficient, large scale fluid simulation on GPU hardware using the fluid‐implicit particle (FLIP) method over a sparse hierarchy of grids represented in NVIDIA® GVDB Voxels. Our approach handles tens of millions of particles within a virtually unbounded simulation domain. We describe novel techniques for parallel sparse grid hierarchy construction and fast incremental updates on the GPU for moving particles. In addition, our FLIP technique introduces sparse, work efficient parallel data gathering from particle to voxel, and a matrix‐free GPU‐based conjugate gradient solver optimized for sparse grids. Our results show that our method can achieve up to an order of magnitude faster simulations on the GPU as compared to FLIP simulations running on the CPU. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
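The matrix-free conjugate gradient mentioned in this abstract applies the system matrix only through a function, never assembling or storing it. A minimal CPU sketch on a 1D Poisson problem (illustrative only; the paper's solver runs over sparse GPU grids):

```python
def cg(apply_A, b, tol=1e-12, max_iter=1000):
    """Conjugate gradient where the matrix is available only as a function
    apply_A(x) -> A @ x, i.e. matrix-free: nothing but vectors is stored."""
    x = [0.0] * len(b)
    r = b[:]                       # residual b - A x, with x = 0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Ap = apply_A(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def laplace_1d(x):
    """Apply the 1D Poisson matrix (Dirichlet boundaries) without storing it."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]
```

On a GPU, the same structure holds: `apply_A` becomes a stencil kernel over the sparse grid, which is why no sparse matrix ever needs to be built.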
4. Massively Parallel Large Scale Inundation Modelling
- Author
-
Rak, Arne-Tobias, Guthe, Stefan, and Mewis, Peter
- Subjects
Scientific visualization, Real-time simulation, Massively parallel algorithms, Massively parallel and high-performance simulations, Human-centered computing, Geographic visualization, Computing methodologies
- Abstract
Over the last 20 years, flooding has been the most common natural disaster, accounting for 44.7% of all disasters, affecting about 1.65 billion people worldwide and causing roughly 105 thousand deaths. In contrast to other natural disasters, the impact of floods is preventable through affordable structures such as dams, dykes and drainage systems. To be most effective, however, these structures have to be planned and evaluated using the highest-precision data on the underlying terrain and current weather conditions. Modern laser scanning techniques provide very detailed and reliable terrain information that may be used for flood inundation modelling in planning and hazard warning systems. These warning systems have become more important as flood hazards have increased in recent years due to ongoing climate change. In contrast to simulations used in planning, simulations in hazard warning systems are time-critical due to potentially fast-changing weather conditions and limited accuracy in forecasts. In this paper we present a highly optimized CUDA implementation of a numerical solver for the hydraulic equations. Our implementation maximizes the GPU's memory throughput, achieving up to 80% utilization. A speedup of a factor of three is observed in comparison to previous work. Furthermore, we present a low-overhead, in-situ visualization of the simulated data running entirely on the GPU. With this, an area of 15 km² at a resolution of 1 m can be visualized hundreds of times faster than real time on consumer-grade hardware. Furthermore, the flow settings can be changed interactively during computation.
- Published
- 2022
- Full Text
- View/download PDF
5. Fixed-radius Near Neighbors Searching for 2D Simulations on the GPU using Delaunay Triangulations
- Author
-
Porro, Heinich, Crespin, Benoît, Hitschfeld-Kahler, Nancy, and Navarro, Cristobal
- Subjects
Physical simulation, Massively parallel and high-performance simulations, Computing methodologies, Computational geometry, Discrete mathematics, Computer graphics
- Abstract
We propose to explore a GPU solution to the fixed-radius nearest-neighbor problem in 2D based on Delaunay triangulations. This problem is crucial for many particle-based simulation techniques for collision detection or momentum exchange between particles. Our method computes the neighborhood of each particle at each iteration without neighbor lists or grids, using a Delaunay triangulation whose consistency is preserved by edge flipping. We study how this approach compares to a grid-based implementation on a flocking simulation with variable parameters.
- Published
- 2022
- Full Text
- View/download PDF
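For context, the grid-based baseline that this poster compares against can be sketched as follows: with cell size equal to the search radius, only the 3x3 block of cells around each query point needs to be tested. This is the standard technique, not the authors' Delaunay method; names are hypothetical.

```python
from collections import defaultdict

def fixed_radius_neighbours(points, r):
    """Grid-based fixed-radius search: hash each 2D point into an r-sized
    cell, then test only the 3x3 cell neighbourhood of each query point."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // r), int(y // r))].append(idx)
    out = {}
    r2 = r * r
    for i, (x, y) in enumerate(points):
        cx, cy = int(x // r), int(y // r)
        nbrs = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), []):
                    if j != i:
                        px, py = points[j]
                        if (px - x) ** 2 + (py - y) ** 2 <= r2:
                            nbrs.append(j)
        out[i] = sorted(nbrs)
    return out
```

The Delaunay-based approach replaces the grid with triangulation adjacency, which avoids rebuilding any spatial structure as particles move.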
6. Accelerator Programming Using Directives 7th International Workshop, WACCPD 2020, Virtual Event, November 20, 2020, Proceedings
- Author
-
(0000-0003-1084-5683) Bhalachandra, S., (0000-0002-5794-3662) Wienke, S., (0000-0002-3560-9428) Chandrasekaran, S., and (0000-0002-9935-4428) Juckeland, G.
- Abstract
This book constitutes the proceedings of the 7th International Workshop on Accelerator Programming Using Directives, WACCPD 2020, which took place on November 20, 2020. The workshop was initially planned to take place in Atlanta, GA, USA, and changed to an online format due to the COVID-19 pandemic. WACCPD is one of the major forums for bringing together users, developers, and the software and tools community to share knowledge and experiences when programming emerging complex parallel computing systems. The 5 papers presented in this volume were carefully reviewed and selected from 7 submissions. They were organized in topical sections named: OpenMP; OpenACC; and Domain-specific Solvers.
- Published
- 2021
7. Accelerator Programming Using Directives 7th International Workshop, WACCPD 2020, Virtual Event, November 20, 2020, Proceedings
- Author
-
Bhalachandra, S., Wienke, S., Chandrasekaran, S., and Juckeland, G.
- Subjects
Heterogeneous (hybrid) systems, Massively parallel and high-performance simulations, Graphics Processing Unit (GPU), Compilers, Massively parallel algorithms, CUDA, Embedded systems, Computer networks, Hardware accelerators, Distributed computer systems
- Abstract
This book constitutes the proceedings of the 7th International Workshop on Accelerator Programming Using Directives, WACCPD 2020, which took place on November 20, 2020. The workshop was initially planned to take place in Atlanta, GA, USA, and changed to an online format due to the COVID-19 pandemic. WACCPD is one of the major forums for bringing together users, developers, and the software and tools community to share knowledge and experiences when programming emerging complex parallel computing systems. The 5 papers presented in this volume were carefully reviewed and selected from 7 submissions. They were organized in topical sections named: OpenMP; OpenACC; and Domain-specific Solvers.
- Published
- 2021
8. Optimizing the Data Movement in Quantum Transport Simulations via Data-Centric Parallel Programming
- Author
-
Torsten Hoefler, Mathieu Luisier, Guillermo Indalecio Fernández, Alexandros Nikolaos Ziogas, Timo Schneider, and Tal Ben-Nun
- Subjects
Quantum transport, Quantum mechanic simulation, Phonon, Transistor, Integrated circuit, Solver, Parallel computing, Parallel computing methodologies, Massively parallel and high-performance simulations, Database-centric architecture, Transport phenomena, Computational Engineering, Finance, and Science (cs.CE), Distributed, Parallel, and Cluster Computing (cs.DC)
- Abstract
Designing efficient cooling systems for integrated circuits (ICs) relies on a deep understanding of the electro-thermal properties of transistors. To shed light on this issue in currently fabricated FinFETs, a quantum mechanical solver capable of revealing atomically-resolved electron and phonon transport phenomena from first-principles is required. In this paper, we consider a global, data-centric view of a state-of-the-art quantum transport simulator to optimize its execution on supercomputers. The approach yields coarse- and fine-grained data-movement characteristics, which are used for performance and communication modeling, communication-avoidance, and data-layout transformations. The transformations are tuned for the Piz Daint and Summit supercomputers, where each platform requires different caching and fusion strategies to perform optimally. The presented results make ab initio device simulation enter a new era, where nanostructures composed of over 10,000 atoms can be investigated at an unprecedented level of accuracy, paving the way for better heat management in next-generation ICs., Comment: 12 pages, 18 figures, SC19
- Published
- 2019
- Full Text
- View/download PDF
9. A Data-Centric Approach to Extreme-Scale Ab initio Dissipative Quantum Transport Simulations
- Author
-
Torsten Hoefler, Timo Schneider, Guillermo Indalecio Fernández, Mathieu Luisier, Alexandros Nikolaos Ziogas, and Tal Ben-Nun
- Subjects
Dataflow, Computer science, Ab initio, Double-precision floating-point format, Solver, Database-centric architecture, Computational science, Dissipative system, Order of magnitude, Quantum mechanic simulation, Parallel computing methodologies, Massively parallel and high-performance simulations, Computational Engineering, Finance, and Science (cs.CE), Distributed, Parallel, and Cluster Computing (cs.DC)
- Abstract
The computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver, capable of revealing the coupled electro-thermal properties of atomically-resolved nano-transistors, has been improved by up to two orders of magnitude through a data-centric reorganization of the application. The approach yields coarse- and fine-grained data-movement characteristics that can be used for performance and communication modeling, communication-avoidance, and dataflow transformations. The resulting code has been tuned for two top-6 hybrid supercomputers, reaching a sustained performance of 85.45 Pflop/s on 4,560 nodes of Summit (42.55% of the peak) in double precision, and 90.89 Pflop/s in mixed precision. These computational achievements enable the restructured QT simulator to treat realistic nanoelectronic devices made of more than 10,000 atoms within a 14× shorter duration than the original code needs to handle a system with 1,000 atoms, on the same number of CPUs/GPUs and with the same physical accuracy., Comment: 13 pages, 13 figures, SC19
- Published
- 2019
10. Main memory latency simulation: the missing link
- Author
-
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Sánchez Verdejo, Rommel, Asifuzzaman, Kazi, Radulović, Milan, Radojković, Petar, Ayguadé Parra, Eduard, and Jacob, Bruce
- Abstract
The community has accepted the need for detailed simulation of main memory. Currently, CPU simulators are usually coupled with cycle-accurate main memory simulators. However, coupling CPU and memory simulators is not a straightforward task, because some pieces of the circuitry between the last-level cache and the memory DIMMs can easily be overlooked and therefore not accounted for. In this paper, we take an approach to quantify the missing cycles in main memory simulation. To that end, we execute a memory-intensive microbenchmark to validate a simulation infrastructure based on ZSim and DRAMsim2 modeling an Intel Sandy Bridge E5-2670 system. We execute the same microbenchmark on a real Sandy Bridge E5-2670 machine, identifying a missing 20 ns in the simulator measurements. This is a huge difference that, in the system under study, corresponds to one-third of the overall main memory latency. We propose multiple schemes to add an extra delay to the simulation model to account for the missing cycles. Furthermore, we validate the proposals using the SPEC CPU2006 benchmarks. Finally, we repeat the main memory latency measurements on seven mainstream and emerging computing platforms. Our results show that latency between the Last Level Cache (LLC) and main memory ranges between tens and hundreds of nanoseconds, so we emphasize the need to properly adjust and validate these parameters in system simulators before any measurements are performed. Overall, we believe this study will improve main memory simulation, leading to better overall system analysis and exploration in the computer architecture community., This work was supported by the Collaboration Agreement between Samsung Electronics Co. Ltd. and BSC, the Spanish Ministry of Science and Technology (project TIN2015-65316-P), the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and the Severo Ochoa Programme (SEV-2015-0493) of the Spanish Government.
- Published
- 2018
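The headline numbers of this record allow a small back-of-the-envelope check: a 20 ns gap that is one-third of total latency implies roughly 60 ns of LLC-to-DRAM latency, and converting the gap into core cycles needs only the clock frequency. This is an illustrative sketch, not the paper's correction schemes; 2.6 GHz is the nominal E5-2670 frequency and is treated here as an assumption.

```python
def missing_cycles(measured_ns, simulated_ns, core_ghz):
    """Extra delay, in core cycles, to add to the simulated memory path so
    the modelled LLC-to-DRAM latency matches the hardware measurement."""
    return round((measured_ns - simulated_ns) * core_ghz)

def implied_total_latency(gap_ns, gap_fraction):
    """If the gap is a known fraction of total main-memory latency,
    recover the total (here: a 20 ns gap is about one-third)."""
    return gap_ns / gap_fraction
```

At an assumed 2.6 GHz, a 20 ns gap corresponds to about 52 core cycles of extra delay in the simulation model.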
11. Main memory latency simulation: the missing link
- Author
-
Milan Radulovic, Kazi Asifuzzaman, Rommel Sánchez Verdejo, Petar Radojković, Eduard Ayguadé, Bruce Jacob, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Computer science, Computer architecture, Processors and memory, Memory management (Computer science), CAS latency, Latency (engineering), DIMM, DRAM, Cache, Central processing unit, Supercomputer, Embedded system, Massively parallel and high-performance simulations
- Abstract
The community has accepted the need for detailed simulation of main memory. Currently, CPU simulators are usually coupled with cycle-accurate main memory simulators. However, coupling CPU and memory simulators is not a straightforward task, because some pieces of the circuitry between the last-level cache and the memory DIMMs can easily be overlooked and therefore not accounted for. In this paper, we take an approach to quantify the missing cycles in main memory simulation. To that end, we execute a memory-intensive microbenchmark to validate a simulation infrastructure based on ZSim and DRAMsim2 modeling an Intel Sandy Bridge E5-2670 system. We execute the same microbenchmark on a real Sandy Bridge E5-2670 machine, identifying a missing 20 ns in the simulator measurements. This is a huge difference that, in the system under study, corresponds to one-third of the overall main memory latency. We propose multiple schemes to add an extra delay to the simulation model to account for the missing cycles. Furthermore, we validate the proposals using the SPEC CPU2006 benchmarks. Finally, we repeat the main memory latency measurements on seven mainstream and emerging computing platforms. Our results show that latency between the Last Level Cache (LLC) and main memory ranges between tens and hundreds of nanoseconds, so we emphasize the need to properly adjust and validate these parameters in system simulators before any measurements are performed. Overall, we believe this study will improve main memory simulation, leading to better overall system analysis and exploration in the computer architecture community. This work was supported by the Collaboration Agreement between Samsung Electronics Co. Ltd. and BSC, the Spanish Ministry of Science and Technology (project TIN2015-65316-P), the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and the Severo Ochoa Programme (SEV-2015-0493) of the Spanish Government.
- Published
- 2018
12. Performance impact of a slower main memory: a case study of STT-MRAM in HPC
- Author
-
Barcelona Supercomputing Center, Asifuzzaman, Kazi, Pavlovic, Milan, Radulović, Milan, Zaragoza, David, Kwon, Ohseong, Ryoo, Kyung-Chang, and Radojković, Petar
- Abstract
In high-performance computing (HPC), significant effort is invested in research and development of novel memory technologies. One of them is Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM): byte-addressable, high-endurance non-volatile memory with slightly higher access time than DRAM. In this study, we conduct a preliminary assessment of the HPC system performance impact of STT-MRAM main memory, using recent industry estimates. Reliable timing parameters of STT-MRAM devices are unavailable, so we also perform a sensitivity analysis that correlates the overall system slowdown trend with average device latency. Our results demonstrate that the overall performance of large HPC clusters is not particularly sensitive to main-memory latency. Therefore, STT-MRAM, as well as any other emerging non-volatile memory with comparable density and access time, can be a viable option for future HPC memory system design., This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), the Spanish Ministry of Science and Technology through the TIN2015-65316-P project and the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union's Horizon 2020 research and innovation programme under the ExaNoDe project (grant agreement No 671578).
- Published
- 2016
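The latency-sensitivity finding of this record can be mirrored by a first-order model in which only the memory-stall fraction of runtime scales with device latency, while compute time is unchanged. This is an illustrative sketch under that stated assumption, not the paper's simulation methodology.

```python
def slowdown(mem_stall_fraction, dram_lat_ns, new_lat_ns):
    """First-order sensitivity model: runtime = compute + memory stalls,
    and only the stall part scales with the device latency ratio.
    Returns the predicted runtime multiplier relative to DRAM."""
    return (1.0 - mem_stall_fraction) + mem_stall_fraction * (new_lat_ns / dram_lat_ns)
```

Under this model, even doubling device latency slows a workload that stalls on memory 10% of the time by only about 10%, which is consistent in spirit with the paper's conclusion that large HPC clusters tolerate slower main memory.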