6 results on '"Marazakis, Manolis"'
Search Results
2. The ExaNeSt Prototype: Evaluation of Efficient HPC Communication Hardware in an ARM-based Multi-FPGA Rack
- Author
- Ploumidis, Manolis, Chaix, Fabien, Chrysos, Nikolaos, Assiminakis, Marios, Flouris, Vassilis, Kallimanis, Nikolaos, Kossifidis, Nikolaos, Nikoloudakis, Michael, Petrakis, Polydoros, Dimou, Nikolaos, Gianioudis, Michael, Ieronymakis, George, Ioannou, Aggelos, Kalokerinos, George, Xirouchakis, Pantelis, Ailamakis, George, Damianakis, Astrinos, Ligerakis, Michael, Makris, Ioannis, Vavouris, Theocharis, Katevenis, Manolis, Papaefstathiou, Vassilis, Marazakis, Manolis, and Mavroidis, Iakovos
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
We present and evaluate the ExaNeSt Prototype, a liquid-cooled rack prototype consisting of 256 Xilinx ZU9EG MPSoCs, 4 TBytes of DRAM, 16 TBytes of SSD, and configurable 10-Gbps interconnection hardware. We developed this testbed in 2016-2019 to validate the flexibility of FPGAs for experimenting with efficient hardware support for HPC communication among tens of thousands of processors and accelerators in the quest towards Exascale systems and beyond. We present our key design choices regarding overall system architecture, PCBs, and runtime software, and summarize insights resulting from measurement and analysis. Of particular note, our custom interconnect includes a low-cost, low-latency network interface, offering user-level zero-copy RDMA, which we have tightly coupled with the ARMv8 processors in the MPSoCs. We have developed a system software runtime on top of these features, and have been able to run MPI. We have evaluated our testbed through MPI microbenchmarks, mini-applications, and full MPI applications. Single-hop, one-way latency is 1.3 μs; approximately 0.47 μs of this is attributed to the network interface and the user-space library that exposes its functionality to the runtime. Latency over longer paths increases as expected, reaching 2.55 μs for a five-hop path. Bandwidth tests show that, for a single hop, link utilization reaches 82% of the theoretical capacity. Microbenchmarks based on MPI collectives reveal that broadcast latency scales as expected when the number of participating ranks increases. We also implemented a custom Allreduce accelerator in the network interface, which reduces the latency of such collectives by up to 88%. We assess performance scaling through weak and strong scaling tests for HPCG, LAMMPS, and the miniFE mini-application; for all these tests, parallelization efficiency is at least 69%.
Comment: 45 pages, 23 figures
- Published
- 2023
3. Frisbee: automated testing of Cloud-native applications in Kubernetes
- Author
- Nikolaidis, Fotis, Chazapis, Antony, Marazakis, Manolis, and Bilas, Angelos
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
As more and more companies migrate (or plan to migrate) from on-premises infrastructure to the Cloud, their focus is on finding anomalies and deficits as early as possible in the development life cycle. We propose Frisbee, a declarative language and associated runtime components for testing cloud-native applications on top of Kubernetes. Given a template describing the system under test and a workflow describing the experiment, Frisbee automatically interfaces with Kubernetes to deploy the necessary software in containers, launch needed sidecars, execute the workflow steps, and perform automated checks for deviation from expected behavior. We evaluate Frisbee through a series of tests to demonstrate its role in designing and evaluating cloud-native applications; Frisbee helps in testing uncertainties at the level of application (e.g., dynamically changing request patterns), infrastructure (e.g., crashes, network partitions), and deployment (e.g., saturation points). Our findings have strong implications for the design, deployment, and evaluation of cloud applications. The most prominent are that erroneous benchmark outputs can cause an apparent performance improvement, that automated failover mechanisms may require interoperability with clients, and that a proper placement policy should also account for the clock frequency, not only the number of cores.
- Published
- 2021
4. Improving the Performance and Resilience of MPI Parallel Jobs with Topology and Fault-Aware Process Placement
- Author
- Vardas, Ioannis, Ploumidis, Manolis, and Marazakis, Manolis
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing, C.4
- Abstract
HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. Apart from increased performance, large-scale systems face two challenges that hinder further growth: energy efficiency and resiliency. At the same time, applications seeking increased performance rely on advanced parallelism for exploiting system resources, which leads to increased pressure on system interconnects. At large system scales, increased communication locality can be beneficial both in terms of application performance and energy consumption. In this direction, several studies focus on deriving a mapping of an application's processes to system nodes in a way that reduces communication cost. A common approach is to express both the application's communication patterns and the system architecture as graphs and then solve the corresponding mapping problem. Apart from communication cost, the completion time of a job can also be affected by node failures. Node failures may result in job abortions, requiring job restarts. In this paper, we address the problem of assigning processes to system resources with the goal of reducing communication cost while also taking into account node failures. The proposed approach is integrated into the Slurm resource manager. Evaluation results show that, in scenarios where few nodes have a low outage probability, the proposed process placement approach achieves a notable decrease in the completion time of batches of MPI jobs. Compared to the default process placement approach in Slurm, the reduction is 18.9% and 31%, respectively, for two different MPI applications.
Comment: 21 pages, 8 figures, added Acknowledgements section
- Published
- 2020
5. Power and Performance Analysis of Persistent Key-Value Stores
- Author
- Mikrou, Stella, Papagiannis, Anastasios, Saloustros, Giorgos, Marazakis, Manolis, and Bilas, Angelos
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
- Abstract
With the current rate of data growth, processing needs are becoming difficult to fulfill due to CPU power and energy limitations. Data serving systems, and especially persistent key-value stores, have become a substantial part of data processing stacks in the data center, providing access to massive amounts of data for applications and services. Key-value stores exhibit high CPU and I/O overheads because of their constant need to reorganize data on the devices. In this paper, we examine the efficiency of two key-value stores on four servers of different generations and with different CPU architectures. We use RocksDB, a key-value store that is widely deployed, e.g., at Facebook, and Kreon, a research key-value store that has been designed to reduce CPU overhead. We evaluate their behavior and overheads on an ARM-based microserver and three different generations of x86 servers. Our findings show that microservers have better power efficiency, in the range of 0.68-3.6x, with a comparable tail latency.
- Published
- 2020
6. Shall numerical astrophysics step into the era of Exascale computing?
- Author
- Taffoni, Giuliano, Murante, Giuseppe, Tornatore, Luca, Goz, David, Borgani, Stefano, Katevenis, Manolis, Chrysos, Nikolaos, and Marazakis, Manolis
- Subjects
Astrophysics - Instrumentation and Methods for Astrophysics, Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
High performance computing numerical simulations are today one of the more effective instruments to implement and study new theoretical models, and they are mandatory during the preparatory phase and operational phase of any scientific experiment. New challenges in Cosmology and Astrophysics will require a large number of new, extremely computationally intensive simulations to investigate physical processes at different scales. Moreover, the size and complexity of the new generation of observational facilities also imply a new generation of high performance data reduction and analysis tools, pushing toward the use of Exascale computing capabilities. Exascale supercomputers cannot be produced today. We discuss the major technological challenges in the design, development, and use of such computing capabilities, and we report on the progress that has been made in recent years in Europe, in particular in the framework of the ExaNeSt European funded project. We also discuss the impact of these new computing resources on the numerical codes in Astronomy and Astrophysics.
Comment: 3 figures, invited talk for proceedings of ADASS XXVI, accepted by ASP Conference Series
- Published
- 2019