21 results on '"task-based programming model"'
Search Results
2. Post-cloud Computing: Addressing Resource Management in the Resource Continuum
- Author
- Zanella, Michele and Riva, Carlo G., editor
- Published
- 2023
3. Runtime-Assisted Shared Cache Insertion Policies Based on Re-reference Intervals
- Author
- Dimić, Vladimir, Moretó, Miquel, Casas, Marc, Valero, Mateo, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Rivera, Francisco F., editor, Pena, Tomás F., editor, and Cabaleiro, José C., editor
- Published
- 2017
4. Abstraction Layer For Standardizing APIs of Task-Based Engines.
- Author
- Alomairy, Rabab, Ltaief, Hatem, Abduljabbar, Mustafa, and Keyes, David
- Subjects
- DYNAMICAL systems, ENGINES, APPLICATION program interfaces, TASK analysis, COMPILERS (Computer programs)
- Abstract
We introduce AL4SAN, a lightweight library for abstracting the APIs of task-based runtime engines. AL4SAN unifies the expression of tasks and their data dependencies. It supports various dynamic runtime systems relying on compiler technology and user-defined APIs. It enables a single application to employ different runtimes and their respective scheduling components, while providing user-obliviousness to the underlying hardware configurations. AL4SAN exposes common front-end APIs and connects to different back-end runtimes. Experiments on performance and overhead assessments are reported on various shared- and distributed-memory systems, possibly equipped with hardware accelerators. A range of workloads, from compute-bound to memory-bound regimes, are employed as proxies for current scientific applications. The low overhead (less than 10 percent) achieved using a variety of workloads enables AL4SAN to be deployed for fast development of task-based numerical algorithms. More interestingly, AL4SAN enables runtime interoperability by switching runtimes at runtime. Blending runtime systems permits to achieve a twofold speedup on a task-based generalized symmetric eigenvalue solver, relative to state-of-the-art implementations. The ultimate goal of AL4SAN is not to create a new runtime, but to strengthen co-design of existing runtimes/applications, while facilitating user productivity and code portability. The code of AL4SAN is freely available at https://github.com/ecrc/al4san , with extensions in progress. [ABSTRACT FROM AUTHOR]
- Published
- 2020
5. Enabling Model-Centric Debugging for Task-Based Programming Models—A Tasking Control Interface
- Author
- Nachtmann, Mathias, Gracia, José, Knüpfer, Andreas, editor, Hilbrich, Tobias, editor, Niethammer, Christoph, editor, Gracia, José, editor, Nagel, Wolfgang E., editor, and Resch, Michael M., editor
- Published
- 2016
6. Unified fault-tolerance framework for hybrid task-parallel message-passing applications.
- Author
- Subasi, Omer, Martsinkevich, Tatiana, Zyulkyarov, Ferad, Unsal, Osman, Labarta, Jesus, and Cappello, Franck
- Subjects
- APPLICATION software, FAULT-tolerant computing, MESSAGE passing (Computer science), COMPUTER network protocols, PARALLEL programs (Computer programs)
- Abstract
We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme. [ABSTRACT FROM AUTHOR]
- Published
- 2018
7. Task-level checkpointing system for task-based parallel workflows
- Author
- Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Vergés Boncompte, Pere, Lordan Gomis, Francesc, Ejarque Artigas, Jorge, and Badia Sala, Rosa Maria
- Abstract
Scientific applications are large and complex; task-based programming models are a popular approach to developing these applications due to their ease of programming and ability to handle complex workflows and distribute their workload across large infrastructures. In these environments, either the hardware or the software may lead to failures from a myriad of origins: application logic, system software, memory, network, or disk. Re-executing a failed application can take hours, days, or even weeks, thus dragging out the research. This article proposes a recovery system for dynamic task-based models to reduce the re-execution time of failed runs. The design encapsulates in a checkpointing manager the automatic checkpointing of the execution, leveraging different mechanisms that can be arbitrarily defined and tuned to fit the needs of each performance. Additionally, it offers an API call to establish snapshots of the execution from the application code. The experiments executed on a prototype implementation have reached a speedup of 1.9× after re-execution and shown no overhead on the execution time on successful first runs of specific applications. This work has been supported by the Spanish Government (PID2019-107255GB), by Generalitat de Catalunya (contract 2017-SGR-01414), and by the European Commission through the Horizon 2020 Research and Innovation program under Grant Agreement No. 955558 (eFlows4HPC project). This work has partially been co-funded with 50% by the European Regional Development Fund under the framework of the ERFD Operative Programme for Catalunya 2014-2020. Peer Reviewed. Postprint (author's final draft)
- Published
- 2022
8. Performance testing of ML and HDC : parallelized applications on top of RISC-V architecture
- Author
- Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Badia Sala, Rosa Maria, Nicolau, Alexandru, Veidenbaum, Alex, and Vergés Boncompte, Pere
- Abstract
The economic impact that proprietary ISA has on the market increased the interest in using Open Source ISA. More specifically, RISC-V has been getting a lot of traction in the research community. The Open Source environment allowed for the development of software and hardware stack for Exascale computations. To take advantage of these resources and allow for executions of large and complex applications, task-based programming models have become more popular, thanks to their ease when handling composite workflows that require a large amount of data and computation time. Moreover, most of the applications being developed nowadays are related to Machine Learning in general, and in the context of RISC-V, there is a lot of interest in developing applications for Embedded Systems, where the framework of Hyperdimensional Computing is becoming more popular. For these reasons, we present this study in the scope of the MareNostrum Experimental Exascale Platform (MEEP), which is a flexible FPGA-based emulation platform designed for future RISC-V supercomputers. This study evaluates Machine Learning algorithms, classical Linear Algebra algorithms used for ML, and Hyperdimensional Computing Algorithms using COMPSs, a task-based programming model for the development of applications for distributed infrastructures, in different RISC-V boards being developed in the MEEP project and different mathematical libraries.
- Published
- 2022
9. Task-based Runtime Optimizations Towards High Performance Computing Applications
- Author
- Cao, Qinglei
- Subjects
- Low-rank approximations, Mixed-precision, Cholesky factorization, Data redistribution, Task-based programming model, Dynamic runtime system, Numerical Analysis and Scientific Computing, Programming Languages and Compilers, Software Engineering
- Abstract
The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream advances on the hardware with many-core systems, deep hierarchical memory subsystem, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, which fosters developers’ productivity at extreme scale by abstracting the underlying hardware complexity. In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications, i.e., data redistribution, geospatial modeling and 3D unstructured mesh deformation here. Data redistribution aims to reshuffle data to optimize some objective for an algorithm, whose objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore reducing the time-to-solution for the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that can cause huge computational challenges in fluid-structure interaction (FSI) applications.
Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to efficiently tackle these HPC applications respectively at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-based, Intel-based, Arm-based CPU systems and IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system for servicing the next-generation scientific applications.
- Published
- 2022
10. RICH: implementing reductions in the cache hierarchy
- Author
- Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Dimic, Vladimir, Moretó Planas, Miquel, Casas, Marc, Ciesko, Jan, and Valero Cortés, Mateo
- Abstract
Reductions constitute a frequent algorithmic pattern in high-performance and scientific computing. Sophisticated techniques are needed to ensure their correct and scalable concurrent execution on modern processors. Reductions on large arrays represent the most demanding case where traditional approaches are not always applicable due to low performance scalability. To address these challenges, we propose RICH, a runtime-assisted solution that relies on architectural and parallel programming model extensions. RICH updates the reduction variable directly in the cache hierarchy with the help of added in-cache functional units. Our programming model extensions fit with the most relevant parallel programming solutions for shared memory environments like OpenMP. RICH does not modify the ISA, which allows the use of algorithms with reductions from pre-compiled external libraries. Experiments show that our solution achieves the performance improvements of 11.2% on average, compared to the state-of-the-art hardware-based approaches, while it introduces 2.4% area and 3.8% power overhead., This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by Generalitat de Catalunya (contracts 2017- SGR-1414 and 2017-SGR-1328). V. Dimić has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under Ajuts per a la contractació de personal investigador novell fellowship number 2017 FI_B 00855. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104. M. Casas has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2017-23269. 
This manuscript has been co-authored by National Technology & Engineering Solutions of Sandia, LLC. under Contract No. DENA0003525 with the U.S. Department of Energy/National Nuclear Security Administration, Peer Reviewed, Postprint (author's final draft)
- Published
- 2020
11. RICH: implementing reductions in the cache hierarchy
- Author
- Jan Ciesko, Mateo Valero, Marc Casas, Vladimir Dimić, Miquel Moreto, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
- Computer science, Parallel programming (Computer science), 02 engineering and technology, Parallel computing, Programació en paral·lel (Informàtica), 01 natural sciences, Reduction (complexity), Shared memory, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Task-based programming model, Cache hierarchy, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], 010302 applied physics, Gestió de memòria (Informàtica), Caches, 020202 computer hardware & architecture, Variable (computer science), Memory management (Computer science), Parallel programming model, Scalability, Programming paradigm, Superordinadors, High performance computing, Reductions
- Abstract
Reductions constitute a frequent algorithmic pattern in high-performance and scientific computing. Sophisticated techniques are needed to ensure their correct and scalable concurrent execution on modern processors. Reductions on large arrays represent the most demanding case where traditional approaches are not always applicable due to low performance scalability. To address these challenges, we propose RICH, a runtime-assisted solution that relies on architectural and parallel programming model extensions. RICH updates the reduction variable directly in the cache hierarchy with the help of added in-cache functional units. Our programming model extensions fit with the most relevant parallel programming solutions for shared memory environments like OpenMP. RICH does not modify the ISA, which allows the use of algorithms with reductions from pre-compiled external libraries. Experiments show that our solution achieves the performance improvements of 11.2% on average, compared to the state-of-the-art hardware-based approaches, while it introduces 2.4% area and 3.8% power overhead. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by Generalitat de Catalunya (contracts 2017- SGR-1414 and 2017-SGR-1328). V. Dimić has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under Ajuts per a la contractació de personal investigador novell fellowship number 2017 FI_B 00855. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104. M. Casas has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2017-23269. 
This manuscript has been co-authored by National Technology & Engineering Solutions of Sandia, LLC. under Contract No. DENA0003525 with the U.S. Department of Energy/National Nuclear Security Administration
- Published
- 2020
12. Asynchronous Task-Based Execution of the Reverse Time Migration for the Oil and Gas Industry
- Author
- I. Said, Samuel Thibault, Amani AlOnazi, David E. Keyes, Hatem Ltaief, King Abdullah University of Science and Technology (KAUST), NVIDIA (NVIDIA), STatic Optimizations, Runtime Methods (STORM), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725., and Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest
- Subjects
- Instruction prefetch, Out-Of-Core Algorithms, Overlapping I/O with Computation, Memory hierarchy, business.industry, Computer science, Asynchronous Executions, 010103 numerical & computational mathematics, Parallel computing, Task-Based Programming Model, 010502 geochemistry & geophysics, 01 natural sciences, Reverse Time Migration, STARPU OOC, Runtime system, High memory, Asynchronous communication, Computer data storage, Scalability, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], 0101 mathematics, business, Massively parallel, 0105 earth and related environmental sciences
- Abstract
We propose a new framework for deploying Reverse Time Migration (RTM) simulations on distributed-memory systems equipped with multiple GPUs. Our software infrastructure engine, TB-RTM, relies on the STARPU dynamic runtime system to orchestrate the asynchronous scheduling of RTM computational tasks on the underlying resources. Besides dealing with the challenging hardware heterogeneity, TB-RTM supports tasks with different workload characteristics, which stress disparate components of the hardware system. RTM is challenging in that it operates intensively at both ends of the memory hierarchy, with compute kernels running at the highest level of the memory system, possibly in GPU main memory, while I/O kernels are saving solution data to fast storage. We consider how to span the wide performance gap between the two extreme ends of the memory system, i.e., GPU memory and fast storage, on which large-scale RTM simulations routinely execute. To maximize hardware occupancy while maintaining high memory bandwidth throughout the memory subsystem, our framework presents the new out-of-core (OOC) feature from STARPU to prefetch data solutions in and out not only from/to the GPU/CPU main memory but also from/to the fast storage system. The OOC technique may trigger opportunities for overlapping expensive data movement with computations. The TB-RTM framework addresses this challenging problem of heterogeneity with a systematic approach that is oblivious to the targeted hardware architectures. Our resulting RTM framework can effectively be deployed on massively parallel GPU-based systems, while delivering performance scalability up to 500 GPUs.
- Published
- 2019
13. Optimizing computation-communication overlap in asynchronous task-based programs
- Author
- Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Castillo, Emilio, Jain, Nikhil, Casas, Marc, Moretó Planas, Miquel, Schulz, Martin, Beivide Palacio, Julio Ramon, Valero Cortés, Mateo, and Bhatele, Abhinav
- Abstract
Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges in high performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, we find that inefficient interactions between these programming models and the underlying messaging layer (in most cases, MPI) limit the achievable computation-communication overlap and negatively impact the performance of parallel programs. We address this challenge by exposing and exploiting information about MPI internals in a task-based runtime system to make better task-creation and scheduling decisions. In particular, we present two mechanisms for exchanging information between MPI and a task-based runtime, and analyze their trade-offs. Further, we present a detailed evaluation of the proposed mechanisms implemented in MPI and a task-based runtime. We show performance improvements of up to 16.3% and 34.5% for proxy applications with point-to-point and collective communication, respectively. Peer Reviewed. Postprint (author's final draft)
- Published
- 2019
14. PureMEM: A Structured Programming Model for Transiently Powered Computers
- Author
- Geylani Kardas, Kasim Yildirim, Caglar Durmaz, and Ege Üniversitesi
- Subjects
- Data consistency, Computer science, Distributed computing, Control (management), 020207 software engineering, Structured Programming Model, 02 engineering and technology, Transiently Powered Computers, Structured programming, Task-Based Programming Model, Task (project management), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Embedded Systems and Software
- Abstract
Advances in energy harvesting circuits and energy-efficient processor architectures create the potential for batteryless computing and sensing systems called transiently powered computers. These computers can only operate intermittently due to the fluctuating nature of ambient energy. Intermittent operation requires a new programming model that preserves forward progress and maintains data consistency, both of which are challenging. We propose a structured task-based programming model, namely PureMEM, to cope with these challenges. We discuss how PureMEM prevents interdependencies caused by the unstructured control encountered in intermittent operation, enables re-usability of the tasks, provides dynamic memory management and supports error handling. We also present intermittent programs to exemplify the features of PureMEM. Assoc Comp Machinery Special Interest Grp Appl Comp
- Published
- 2019
15. Optimizing computation-communication overlap in asynchronous task-based programs
- Author
- Ramon Beivide, Marc Casas, Emilio Castillo, Miquel Moreto, Abhinav Bhatele, Nikhil Jain, Mateo Valero, Martin Schulz, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, and Barcelona Supercomputing Center
- Subjects
- 020203 distributed computing, Exploit, Parallel processing (Electronic computers), Computer science, Computation-communication overlap, Distributed computing, Computation, Processament en paral·lel (Ordinadors), Parallel programming (Computer science), 02 engineering and technology, Programació en paral·lel (Informàtica), Supercomputer, Popularity, 020202 computer hardware & architecture, Scheduling (computing), Runtime system, Asynchronous communication, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Mpi, High performance computing, Task-based programming model, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Càlcul intensiu (Informàtica)
- Abstract
Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges in high performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, we find that inefficient interactions between these programming models and the underlying messaging layer (in most cases, MPI) limit the achievable computation-communication overlap and negatively impact the performance of parallel programs. We address this challenge by exposing and exploiting information about MPI internals in a task-based runtime system to make better task-creation and scheduling decisions. In particular, we present two mechanisms for exchanging information between MPI and a task-based runtime, and analyze their trade-offs. Further, we present a detailed evaluation of the proposed mechanisms implemented in MPI and a task-based runtime. We show performance improvements of up to 16.3% and 34.5% for proxy applications with point-to-point and collective communication, respectively.
- Published
- 2019
16. Graph partitioning applied to DAG scheduling to reduce NUMA effects
- Author
- Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Sánchez Barrera, Isaac, Casas, Marc, Moretó Planas, Miquel, Ayguadé Parra, Eduard, Labarta Mancho, Jesús José, and Valero Cortés, Mateo
- Abstract
The complexity of shared memory systems is becoming more relevant as the number of memory domains increases, with different access latencies and bandwidth rates depending on the proximity between the cores and the devices containing the data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages or both and are typically applied by the system software. We propose techniques at the runtime system level to reduce NUMA effects on parallel applications. We leverage runtime system metadata in terms of a task dependency graph. Our approach, based on graph partitioning methods, is able to provide parallel performance improvements of 1.12X on average with respect to the state-of-the-art. This work has been partially supported by the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Spanish Government (contract TIN2015-65316-P). I. Sánchez Barrera has been supported by the Spanish Government under Formación del Profesorado Universitario fellowship number FPU15/03612. Peer Reviewed. Postprint (published version)
- Published
- 2018
17. Graph partitioning applied to DAG scheduling to reduce NUMA effects
- Author
- Marc Casas, Jesús Labarta, Miquel Moreto, Eduard Ayguadé, Mateo Valero, Isaac Sánchez Barrera, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
- graph partitioning, Computer science, Parallel computing, Thread (computing), 02 engineering and technology, 01 natural sciences, Scheduling (computing), Runtime system, NUMA, Shared memory, 020204 information systems, 0103 physical sciences, Informàtica::Sistemes d'informació::Emmagatzematge i recuperació de la informació [Àrees temàtiques de la UPC], 0202 electrical engineering, electronic engineering, information engineering, Task-based programming model, 010302 applied physics, Scheduling, Graph partition, 020207 software engineering, Gestió de memòria (Informàtica), Computer Graphics and Computer-Aided Design, Metadata, Memory management (Computer science), Graph (abstract data type), 020201 artificial intelligence & image processing, Software, System software
- Abstract
The complexity of shared memory systems is becoming more relevant as the number of memory domains increases, with different access latencies and bandwidth rates depending on the proximity between the cores and the devices containing the data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages or both and are typically applied by the system software. We propose techniques at the runtime system level to reduce NUMA effects on parallel applications. We leverage runtime system metadata in terms of a task dependency graph. Our approach, based on graph partitioning methods, is able to provide parallel performance improvements of 1.12X on average with respect to the state-of-the-art. This work has been partially supported by the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Spanish Government (contract TIN2015-65316-P). I. Sánchez Barrera has been supported by the Spanish Government under Formación del Profesorado Universitario fellowship number FPU15/03612.
- Published
- 2018
18. Runtime-assisted shared cache insertion policies based on re-reference intervals
- Author
- Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Dimic, Vladimir, Moretó Planas, Miquel, Casas, Marc, and Valero Cortés, Mateo
- Abstract
Processor speed is improving at a faster rate than the speed of main memory, which makes memory accesses increasingly expensive. One way to address this problem is to reduce the miss ratio of the processor's last-level cache by improving its replacement policy. We approach the problem by co-designing the runtime system and hardware and by exploiting the semantics of applications written in data-flow task-based programming models to provide the hardware with information about task types and task data dependencies. We propose the Task-Type aware Insertion Policy (TTIP), which uses the runtime system to dynamically determine the best probability per task type for bimodal insertion in the recency stack, and the static Dependency-Type aware Insertion Policy (DTIP), which inserts cache lines in the optimal position taking into account the dependency types of the current task. TTIP and DTIP perform similarly to or better than state-of-the-art replacement policies while requiring less hardware. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). V. Dimic has been partially supported by AGAUR of the Government of Catalonia (contract 2017 FI B 00855). M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas has been supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (contract 2013 BP B 00243). Peer Reviewed, Postprint (author's final draft)
- Published
- 2017
19. Unified fault-tolerance framework for hybrid task-parallel message-passing applications
- Author
-
Barcelona Supercomputing Center, Subasi, Omer, Martsinkevich, Tatiana, Zyulkyarov, Ferad, Unsal, Osman Sabri, Labarta Mancho, Jesús José, and Cappello, Franck
- Abstract
We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any Message Passing Interface (MPI) calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Second, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed-form expressions for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme. The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FI-DGR 2013 scholarship and the European Community's Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and TIN2015-65316-P. Peer Reviewed, Postprint (author's final draft)
- Published
- 2016
20. Runtime assisted cache memory optimizations
- Author
-
Dimic, Vladimir, Moreto Planas, Miquel, Valero Cortés, Mateo, and Casas Guix, Marc
- Subjects
Memory hierarchy (Computer science) ,runtime ,processor design ,operating system ,cache ,processor cache ,computer architecture ,task-based programming model ,Computer science::Computer architecture [UPC subject areas]
- Published
- 2015
21. Runtime assisted cache memory optimizations
- Author
-
Moretó Planas, Miquel, Valero Cortés, Mateo, Casas, Marc, and Dimic, Vladimir
- Published
- 2015