Author: "Raymond Namyst" / Topic: multi-core processor - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Raymond Namyst"' showing total 27 results

1. Resource-Management Study in HPC Runtime-Stacking Context

Author: Raymond Namyst, Marc Pérache, Arthur Loussert, Julien Jaeger, Patrick Carribault, Benoit Welterlen, DAM Île-de-France (DAM/DIF), Direction des Applications Militaires (DAM), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Bull atos technologies, and Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)
Subjects: Multi-core processor, Side effect (computer science), Resource (project management), Exploit, Computer science, Distributed computing, Memory footprint, Overhead (computing), Context (language use), Parallel computing, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Resource management (computing)
Abstract: With the advent of multicore and manycore processors as building blocks of HPC supercomputers, many applications shift from relying solely on a distributed programming model (e.g., MPI) to mixing distributed and shared-memory models (e.g., MPI+OpenMP), to better exploit shared-memory communications and reduce the overall memory footprint. One side effect of this programming approach is runtime stacking: mixing multiple models requires several runtime libraries to be alive at the same time and to share the underlying computing resources. This paper explores different configurations where this stacking may appear and introduces algorithms to detect the misuse of compute resources when running a hybrid parallel application. We have implemented our algorithms inside a dynamic tool that monitors applications and outputs resource usage to the user. We validated this tool on applications from the CORAL benchmarks. This yields relevant information that can be used to improve runtime placement, at an average overhead lower than 1% of total execution time.
Published: 2017
Full Text: View/download PDF

2. Peppher: Performance Portability and Programmability for Heterogeneous Many-Core Architectures

Author: Christoph Kessler, Raymond Namyst, Jesper Larsson Träff, Herbert Cornelius, Cédric Augonnet, George Russell, Philippas Tsigas, Peter Sanders, Siegfried Benkner, Sabri Pllana, David Moloney, Andrew Richards, and Samuel Thibault
Subjects: Software portability, Multi-core processor, Many core, Computer architecture, Computer science, Code (cryptography), Data structure, Extensibility
Abstract: © 2017 by John Wiley & Sons, Inc. All rights reserved. PEPPHER takes a pluralistic and parallelization agnostic approach to programmability and performance portability for heterogeneous many-core architectures. The PEPPHER framework is in principle language independent but focuses on supporting C++ code with PEPPHER-specific annotations as pragmas or external annotations. The framework is open and extensible; the PEPPHER methodology details how new architectures are incorporated. The PEPPHER methodology consists of rules for how to extend the framework for new architectures. This mainly concerns adaptivity and autotuning for algorithm libraries, the necessary hooks and extensions for the run-time system and any supporting algorithms and data structures that this relies on. Offloading is a specific technique for programming heterogeneous platforms that can sometimes be applied with high efficiency. Offload as developed by the PEPPHER partner Codeplay is a particular, nonintrusive C++ extension allowing portable C++ code to support diverse heterogeneous multicore architectures in a single code base.
Published: 2017
Full Text: View/download PDF

3. Resource Aggregation for Task-Based Cholesky Factorization on Top of Heterogeneous Machines

Author: Andra Hugo, Abdou Guermouche, Raymond Namyst, Pierre-André Wacrenier, Terry Cojean, STatic Optimizations, Runtime Methods (STORM), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Université de Bordeaux (UB), High-End Parallel Algorithms for Challenging Numerical Simulations (HiePACS), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Uppsala University, PLAFRIM, ANR-13-MONU-0007,SOLHAR,Solveurs pour architectures hétérogènes utilisant des supports d'exécution(2013), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Guermouche, Abdou, and Modèles Numériques - Solveurs pour architectures hétérogènes utilisant des supports d'exécution - - SOLHAR2013 - ANR-13-MONU-0007 - MN - VALID
Subjects: accelerator, Computer science, Distributed computing, Computation, GPU, Symmetric multiprocessor system, 010103 numerical & computational mathematics, 02 engineering and technology, Parallel computing, [INFO] Computer Science [cs], 01 natural sciences, Runtime system, [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], 0202 electrical engineering, electronic engineering, information engineering, [INFO]Computer Science [cs], 0101 mathematics, Implementation, dense linear algebra, 020203 distributed computing, Multi-core processor, runtime system, heterogeneous computing, Multicore, Cholesky, Graph (abstract data type), [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], task DAG, Cholesky decomposition
Abstract: Hybrid computing platforms are now commonplace, featuring a large number of CPU cores and accelerators. This trend makes balancing computations between these heterogeneous resources performance critical. In this paper we propose aggregating several CPU cores in order to execute larger parallel tasks and thus improve the load balance between CPUs and accelerators. Additionally, we present our approach to exploit internal parallelism within tasks. This is done by combining two runtime systems: one runtime system to handle the task graph and another one to manage the internal parallelism. We demonstrate the relevance of our approach in the context of the dense Cholesky factorization kernel implemented on top of the StarPU task-based runtime system. We present experimental results showing that our solution outperforms state-of-the-art implementations.
Published: 2017
Full Text: View/download PDF

4. Resource aggregation for task-based Cholesky Factorization on top of modern architectures

Author: Andra Hugo, Pierre-André Wacrenier, Terry Cojean, Raymond Namyst, Abdou Guermouche, STatic Optimizations, Runtime Methods (STORM), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Université de Bordeaux (UB), High-End Parallel Algorithms for Challenging Numerical Simulations (HiePACS), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Uppsala University, ANR Solhar, INRIA, PLAFRIM, ANR-13-MONU-0007,SOLHAR,Solveurs pour architectures hétérogènes utilisant des supports d'exécution(2013), Wacrenier, Pierre André, Modèles Numériques - Solveurs pour architectures hétérogènes utilisant des supports d'exécution - - SOLHAR2013 - ANR-13-MONU-0007 - MN - VALID, Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Uppsala Universitet [Uppsala], and Cojean, Terry
Subjects: Intel Xeon-Phi KNL, accelerator, Computer Networks and Communications, Computer science, GPU, Task parallelism, Symmetric multiprocessor system, 010103 numerical & computational mathematics, Parallel computing, [INFO] Computer Science [cs], 01 natural sciences, Theoretical Computer Science, Runtime system, Artificial Intelligence, [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], [INFO]Computer Science [cs], 0101 mathematics, dense linear algebra, Multi-core processor, Load balancing (computing), runtime system, Computer Graphics and Computer-Aided Design, heterogeneous computing, 010101 applied mathematics, Hardware and Architecture, Multicore, Graph (abstract data type), [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Software, Xeon Phi, task DAG, Cholesky decomposition, Cholesky factorization
Abstract: This paper is submitted for review to the Parallel Computing special issue for HCW and HeteroPar 16 workshops. Hybrid computing platforms are now commonplace, featuring a large number of CPU cores and accelerators. This trend makes balancing computations between these heterogeneous resources performance critical. In this paper we propose aggregating several CPU cores in order to execute larger parallel tasks and improve load balancing between CPUs and accelerators. Additionally, we present our approach to exploit internal parallelism within tasks, by combining two runtime system schedulers: a global runtime system to schedule the main task graph and a local one to cope with internal task parallelism. We demonstrate the relevance of our approach in the context of the dense Cholesky factorization kernel implemented on top of the StarPU task-based runtime system. We present experimental results showing that our solution outperforms state-of-the-art implementations on two architectures: a modern heterogeneous machine and the Intel Xeon Phi Knights Landing.
Published: 2016

5. A runtime approach to dynamic resource allocation for sparse direct solvers

Author: Pierre-André Wacrenier, Abdou Guermouche, Raymond Namyst, Andra Hugo, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), High-End Parallel Algorithms for Challenging Numerical Simulations (HiePACS), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest
Subjects: 020203 distributed computing, Multi-core processor, Computer science, Distributed computing, Context management, Memory bus, 010103 numerical & computational mathematics, 02 engineering and technology, Parallel computing, Solver, 01 natural sciences, Scheduling (computing), Runtime system, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Cache, 0101 mathematics, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]
Abstract: To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG-of-tasks parallelism regained popularity in the high performance, scientific computing community. In this context, enabling HPC applications to perform efficiently when dealing with graphs of parallel tasks that could potentially run simultaneously is a great challenge. Even if a uniform runtime system is used underneath, scheduling multiple parallel tasks over the same set of hardware resources introduces many issues, such as undesirable cache flushes or memory bus contention. In this paper, we show how runtime system-based scheduling contexts can be used to dynamically enforce locality of parallel tasks on multicore machines. We extend an existing generic sparse direct solver to use our mechanism and introduce a new decomposition method based on proportional mapping that is used to build the scheduling contexts. We propose a runtime-level dynamic context management policy to cope with the very irregular behavior of the application. A detailed performance analysis shows significant performance improvements of the solver over various multicore hardware.
Published: 2014
Full Text: View/download PDF

6. Adaptive Task Size Control on High Level Programming for GPU/CPU Work Sharing

Author: Raymond Namyst, Yuetsu Kodama, Taisuke Boku, Toshihiro Hanawa, Mitsuhisa Sato, Olivier Aumage, Tetsuya Odajima, Samuel Thibault, Graduate School of Systems and Information Engineering [Tsukuba], Université de Tsukuba = University of Tsukuba, Center for Computational Sciences [Tsukuba] (CCS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: 020203 distributed computing, Multi-core processor, Computer science, Distributed computing, Dynamic load balancing, 020206 networking & telecommunications, Symmetric multiprocessor system, 02 engineering and technology, Parallel computing, Load balancing (computing), Runtime system, High-level programming language, 0202 electrical engineering, electronic engineering, information engineering, Partitioned global address space, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]
Abstract: When sharing work among GPUs and CPU cores on GPU-equipped clusters, keeping the load balanced among these heterogeneous computing resources is a critical issue. We have been developing a runtime system for this problem on a PGAS language, named XcalableMP-dev/StarPU [1]. Through this development, we found that adaptive load balancing is necessary for GPU/CPU work sharing to achieve the best performance for various application codes. In this paper, we enhance our language system XcalableMP-dev/StarPU with a new feature that dynamically controls the size of the tasks assigned to these heterogeneous resources during application execution. A performance evaluation on several benchmarks confirmed that the proposed feature works correctly and that heterogeneous work sharing provides up to about 40% higher performance than GPU-only utilization, even for relatively small problem sizes.
Published: 2013
Full Text: View/download PDF

7. Composing multiple StarPU applications over heterogeneous machines: a supervised approach

Author: Abdou Guermouche, Raymond Namyst, Andra Hugo, Pierre-André Wacrenier, Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), High-End Parallel Algorithms for Challenging Numerical Simulations (HiePACS), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, and Hugo, Andra-Ecaterina
Subjects: Computer science, Distributed computing, resource allocation, Memory bus, Thread (computing), 02 engineering and technology, Parallel computing, 01 natural sciences, Theoretical Computer Science, Scheduling (computing), Runtime system, CUDA, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], scheduling, Parallel composition, 010302 applied physics, heterogeneous architectures, 020203 distributed computing, Multi-core processor, Hypervisor, Partition (database), Hardware and Architecture, Linear algebra, runtime optimisation, Cache, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Software
Abstract: Enabling HPC applications to perform efficiently when invoking multiple parallel libraries simultaneously is a great challenge. Even if a uniform runtime system is used underneath, scheduling tasks or threads coming from different libraries over the same set of hardware resources introduces many issues, such as resource oversubscription, undesirable cache flushes and memory bus contention. This paper presents an extension of StarPU, a runtime system specifically designed for heterogeneous architectures, that allows multiple parallel codes to run concurrently with minimal interference. Such parallel codes run within scheduling contexts that provide confined execution environments which can be used to partition computing resources. Scheduling contexts can be dynamically resized to optimize the allocation of computing resources among concurrently running libraries. We introduce a hypervisor that automatically expands or shrinks contexts using feedback from the runtime system (e.g. resource utilization). We demonstrate the relevance of our approach using benchmarks invoking multiple high performance linear algebra kernels simultaneously on top of heterogeneous multicore machines. We show that our mechanism can dramatically improve the overall application run time (−34%), most notably by reducing the average cache miss ratio (−50%).
Published: 2013
Full Text: View/download PDF

8. Poster: Matrices over Runtime Systems at Exascale

Author: Raymond Namyst, Jack Dongarra, Cedric Castagnede, Nathalie Furmento, Bérenger Bramas, George Bosilca, Stanimire Tomov, Olivier Coulaud, Eric Darve, Julien Langou, Xavier Lacoste, Hatem Ltaief, Ichitaro Yamazaki, Emmanuel Agullo, Matthias Messner, Pierre Ramet, Luc Giraud, Mathieu Faverge, Samuel Thibault, and Toru Takahashik
Subjects: Multi-core processor, Computer science, Parallel computing, Magma (computer algebra system), Power (physics), Runtime system, Matrix (mathematics), Direct methods, Linear algebra, Software design, Multipole expansion, computer, computer.programming_language, Abstraction (linguistics)
Abstract: The goal of the Matrices Over Runtime Systems at Exascale (MORSE) project is to design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale multicore systems with GPU accelerators, using all the processing power that future high-end systems can make available. In this poster, we propose a framework for describing linear algebra algorithms at a high level of abstraction and delegating the actual execution to a runtime system in order to design software whose performance is portable across architectures. We illustrate our methodology on three classes of problems: dense linear algebra, sparse direct methods and fast multipole methods. The resulting codes have been incorporated into the Magma, Pastix and ScalFMM solvers, respectively.
Published: 2012
Full Text: View/download PDF

9. Abstract: Leveraging PEPPHER Technology for Performance Portable Supercomputing

Author: Usman Dastgeer, Siegfried Benkner, Martin Wimmer, Samuel Thibault, Mudassar Majeed, Nathalie Furmento, Christoph Kessler, Jesper Larsson Träff, Raymond Namyst, and Sabri Pllana
Subjects: Multi-core processor, Computer science, 010103 numerical & computational mathematics, 02 engineering and technology, GPU cluster, computer.software_genre, Supercomputer, 01 natural sciences, 020202 computer hardware & architecture, Software portability, Computer architecture, 0202 electrical engineering, electronic engineering, information engineering, Operating system, Code (cryptography), 0101 mathematics, computer
Abstract: PEPPHER is a 3-year EU FP7 project that develops a novel approach and framework to enhance performance portability and programmability of heterogeneous multi-core systems. Its primary target is single-node heterogeneous systems, where several CPU cores are supported by accelerators such as GPUs. This poster briefly surveys the PEPPHER framework for single-node systems, and elaborates on the prospects for leveraging the PEPPHER approach to generate performance-portable code for heterogeneous multi-node systems.
Published: 2012
Full Text: View/download PDF

10. A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs

Author: Stanimire Tomov, Cédric Augonnet, Jack Dongarra, Raymond Namyst, Hatem Ltaief, Emmanuel Agullo, and Samuel Thibault
Subjects: Runtime system, Multi-core processor, Software, Computer science, business.industry, Hybrid system, Linear algebra, Code (cryptography), Parallel computing, Graphics, business, Expression (mathematics)
Abstract: Publisher Summary: This chapter presents a hybridization methodology for the development of high-performance linear algebra software for graphics processing units (GPUs). The methodology has been successfully used in MAGMA, a new generation of linear algebra libraries similar in functionality to LAPACK but extended for hybrid, GPU-based systems. Algorithms of interest are split into computational tasks. The tasks' execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators using StarPU, a runtime system for accelerator-based multicore architectures. StarPU enables the expression of parallelism through sequential-like code and schedules the different tasks over the hybrid processing units. Using the StarPU framework, development is faster and cheaper than the development of algorithms exclusively for GPUs. Moreover, this framework allows the exploration of the unique strengths of the various hardware components in a hybrid system, resulting in hybrid algorithms that are better performance-wise than corresponding homogeneous algorithms designed exclusively for either GPUs or multicore CPUs.
Published: 2012
Full Text: View/download PDF

11. A sampling-based approach for communication libraries auto-tuning

Author: Raymond Namyst, Alexandre Denis, Elisabeth Brunet, Francois Trahay, Département Informatique (INF), Institut Mines-Télécom [Paris] (IMT)-Télécom SudParis (TSP), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Grid'5000, ANR-09-COSI-0001,COOP,Gestion de ressources coopérative multi niveaux(2009), Département Informatique (TSP - INF), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Denis, Alexandre, and Gestion de ressources coopérative multi niveaux - - COOP2009 - ANR-09-COSI-0001 - COSINUS - VALID
Subjects: Multi-core processor, [INFO.INFO-NI] Computer Science [cs]/Networking and Internet Architecture [cs.NI], business.industry, Computer science, Network packet, Distributed computing, Aggregate (data warehouse), Sampling (statistics), 020206 networking & telecommunications, 02 engineering and technology, Multiplexing, [INFO.INFO-NI]Computer Science [cs]/Networking and Internet Architecture [cs.NI], 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, MPI, NewMadeleine, business, Host (network), MadMPI, Computer network
Abstract: Communication performance is a critical issue in HPC applications, and many solutions have been proposed in the literature (algorithms, protocols, etc.). In the meantime, computing nodes have become massively multicore, leading to a real imbalance between the number of communication sources and the number of physical communication resources. Thus it is now mandatory to share network boards between computation flows, and to take this sharing into account while performing communication optimizations. In previous papers, we have proposed a model and a framework for on-the-fly optimizations of multiplexed concurrent communication flows, and implemented this model in the NewMadeleine communication library. This library features optimization strategies able, for example, to aggregate several messages to reduce the number of packets emitted on the network, or to split messages to use several NICs at the same time. In this paper, we study the tuning of these dynamic optimization strategies. We show that some parameters and thresholds (rendezvous threshold, aggregation packet size) depend on the actual hardware, both host and NICs. We propose and implement a method based on sampling of the actual hardware to auto-tune our strategies. Moreover, we show that multi-rail can greatly benefit from performance predictions. We propose an approach for multi-rail that dynamically balances the data between NICs using predictions based on sampling.
Published: 2011

12. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

Author: Raymond Namyst, Pierre-André Wacrenier, Cédric Augonnet, Samuel Thibault, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), and Augonnet, Cédric
Subjects: Coprocessor, Computer Networks and Communications, Computer science, Distributed computing, Data management, Multiprocessing, 02 engineering and technology, Parallel computing, Field (computer science), Theoretical Computer Science, Scheduling (computing), CUDA, Runtime system, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Execution model, 020203 distributed computing, Multi-core processor, business.industry, 020207 software engineering, Computer Science Applications, Task (computing), Computational Theory and Mathematics, [INFO.INFO-OS] Computer Science [cs]/Operating Systems [cs.OS], 020201 artificial intelligence & image processing, [INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS], [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], business, Software
Abstract: In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offloading parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a major challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and to easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way. Copyright © 2010 John Wiley & Sons, Ltd.
Published: 2011
Full Text: View/download PDF

13. Data-Aware Task Scheduling on Multi-Accelerator based Platforms

Author: Cédric Augonnet, Jérôme Clet-Ortega, Raymond Namyst, Samuel Thibault, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), ANR-08-COSI-0013,PROHMPT,Programmation des technologies multicoeurs hétérogènes(2008), European Project: 248481,EC:FP7:ICT,FP7-ICT-2009-4,PEPPHER(2010), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Augonnet, Cédric, Programmation des technologies multicoeurs hétérogènes - - PROHMPT2008 - ANR-08-COSI-0013 - COSINUS - VALID, and Performance Portability and Programmability for Heterogeneous Many-core Architectures - PEPPHER - - EC:FP7:ICT2010-01-01 - 2012-12-31 - 248481 - VALID
Subjects: Instruction prefetch, 020203 distributed computing, Multi-core processor, Coprocessor, Computer science, Distributed computing, Parallel algorithm, Processor scheduling, 010103 numerical & computational mathematics, 02 engineering and technology, Dynamic priority scheduling, 01 natural sciences, Scheduling (computing), Runtime system, CUDA, [INFO.INFO-OS] Computer Science [cs]/Operating Systems [cs.OS], 0202 electrical engineering, electronic engineering, information engineering, [INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS], 0101 mathematics
Abstract: International audience; To fully tap into the potential of heterogeneous machines composed of multicore processors and multiple accelerators, simple offloading approaches in which the main trunk of the application runs on regular cores while only specific parts are offloaded on accelerators are not sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. To face this challenge, we previously proposed StarPU, a runtime system capable of scheduling tasks over multicore machines equipped with GPU accelerators. StarPU uses a software virtual shared memory (VSM) that provides a high-level programming interface and automates data transfers between processing units so as to enable a dynamic scheduling of tasks. We now present how we have extended StarPU to minimize the cost of transfers between processing units in order to efficiently cope with multi-GPU hardware configurations. To this end, our runtime system implements data prefetching based on asynchronous data transfers, and uses data transfer cost prediction to influence the decisions taken by the task scheduler. We demonstrate the relevance of our approach by benchmarking two parallel numerical algorithms using our runtime system. We obtain significant speedups and high efficiency over multicore machines equipped with multiple accelerators. We also evaluate the behaviour of these applications over clusters featuring multiple GPUs per node, showing how our runtime system can combine with MPI.
Published: 2010

14. Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access

Author: Brice Goglin, Stéphanie Moreaud, Raymond Namyst, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), Springer, Grid'5000, and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Input/output, Multi-core processor, Computer science, Distributed computing, InfiniBand, 020206 networking & telecommunications, 02 engineering and technology, Network interface, Parallel computing, 0202 electrical engineering, electronic engineering, information engineering, Cluster (physics), 020201 artificial intelligence & image processing, Relevance (information retrieval), [INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS], Output device
Abstract: International audience; Multicore processors have not only reintroduced Non-Uniform Memory Access (NUMA) architectures in today's parallel computers, but they are also responsible for non-uniform access times with respect to Input/Output devices (NUIOA). In clusters of multicore machines equipped with several Network Interfaces, the performance of communication between processes thus depends on which cores these processes are scheduled on, and on their distance to the Network Interface Cards involved. We propose a technique allowing multirail communication between processes to carefully distribute data among the network interfaces so as to counterbalance NUIOA effects. We demonstrate the relevance of our approach by evaluating its implementation within OpenMPI on a Myri-10G + InfiniBand cluster.
Published: 2010
Full Text: View/download PDF

15. Optimizing MPI Communication within large Multicore nodes with Kernel assistance

Author: Stéphanie Moreaud, David Goodell, Raymond Namyst, Brice Goglin, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), Mathematics and Computer Science Division [ANL] (MCS), Argonne National Laboratory [Lemont] (ANL), IEEE, and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: 020203 distributed computing, Multi-core processor, Kernel (image processing), Computer science, Distributed computing, Message passing, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 02 engineering and technology, Parallel computing, [INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS], Execution time
Abstract: International audience; As the number of cores per node increases in modern clusters, intra-node communication efficiency becomes critical to application performance. We present a study of the traditional double-copy model in MPICH2 and a kernel-assisted single-copy strategy with KNEM on different shared-memory hosts with up to 96 cores. We show that KNEM suffers less from process placement on these complex architectures. It improves throughput up to a factor of 2 for large messages for both point-to-point and collective operations, and significantly improves NPB execution time. We detail when to switch from one strategy to the other depending on the communication pattern and we show that I/OAT copy offload only appears to be an interesting solution for older architectures.
Published: 2010
Full Text: View/download PDF

16. Structuring the execution of OpenMP applications for multicore architectures

Author: Olivier Aumage, Brice Goglin, Raymond Namyst, François Broquedis, Samuel Thibault, Pierre-André Wacrenier, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), IEEE, and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Multi-core processor, Computer science, Memory bandwidth, 02 engineering and technology, Thread (computing), Parallel computing, Scheduling (computing), Runtime system, Memory bank, Computer architecture, Shared memory, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Concurrent computing, 020201 artificial intelligence & image processing, Cache, [INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS]
Abstract: International audience; The now commonplace multi-core chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers as a tradeoff between the user friendliness of shared memory on the one side, and memory access scalability and efficiency on the other side. However, getting high performance out of such machines requires a dynamic mapping of application tasks and data onto the underlying architecture. Moreover, depending on the application behavior, this mapping should favor cache affinity, memory bandwidth, computation synchrony, or a combination of these. The great challenge is then to perform this hardware-dependent mapping in a portable, abstract way. To meet this need, we propose a new, hierarchical approach to the execution of OpenMP threads on multicore machines. Our ForestGOMP runtime system dynamically generates structured trees out of OpenMP programs. It collects relationship information about threads and data as well. This information is used together with scheduling hints and hardware counter feedback by the scheduler to select the most appropriate thread and data distribution. ForestGOMP features a high-level platform for developing and tuning portable thread schedulers. We present several applications for which we developed specific scheduling policies that achieve excellent speedups on 16-core machines.
Published: 2010
Full Text: View/download PDF

17. hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications

Author: Samuel Thibault, Stéphanie Moreaud, François Broquedis, Brice Goglin, Guillaume Mercier, Nathalie Furmento, Jérôme Clet-Ortega, Raymond Namyst, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), IEEE, and Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)
Subjects: Hardware architecture, 020203 distributed computing, Multi-core processor, business.industry, Computer science, Distributed computing, 02 engineering and technology, Application software, computer.software_genre, Supercomputer, Runtime system, Software, Computer architecture, Multithreading, 0202 electrical engineering, electronic engineering, information engineering, Hardware compatibility list, 020201 artificial intelligence & image processing, [INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS], business, computer, Computer hardware
Abstract: International audience; The increasing number of cores, shared caches and memory nodes within machines introduces a complex hardware topology. High-performance computing applications now have to carefully adapt their placement and behavior according to the underlying hierarchy of hardware resources and their software affinities. We introduce the Hardware Locality (hwloc) software which gathers hardware information about processors, caches, memory nodes and more, and exposes it to applications and runtime systems in an abstracted and portable hierarchical manner. hwloc may significantly help performance by having runtime systems place their tasks or adapt their communication strategies depending on hardware affinities. We show that hwloc can already be used by popular high-performance OpenMP or MPI software. Indeed, scheduling OpenMP threads according to their affinities or placing MPI processes according to their communication patterns shows interesting performance improvements thanks to hwloc. An optimized MPI communication strategy may also be dynamically chosen according to the location of the communicating processes in the machine and its hardware characteristics.
Published: 2010
Full Text: View/download PDF

18. ForestGOMP: an efficient OpenMP environment for NUMA architectures

Author: Pierre-André Wacrenier, Raymond Namyst, Brice Goglin, François Broquedis, Nathalie Furmento, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), forestgomp, and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: 010302 applied physics, Multi-core processor, Computer science, Multiprocessing, 02 engineering and technology, Thread (computing), Parallel computing, 01 natural sciences, 020202 computer hardware & architecture, Theoretical Computer Science, Scheduling (computing), Runtime system, Software portability, 0103 physical sciences, Theory of computation, 0202 electrical engineering, electronic engineering, information engineering, [INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS], Architecture, Software, Information Systems
Abstract: International audience; Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into Scheduling Hints related to thread-memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing based balancing strategies and Next-Touch-based data distribution policies. These techniques provide insights about additional optimizations.
Published: 2010
Full Text: View/download PDF

19. Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures

Author: Cédric Augonnet, Samuel Thibault, Raymond Namyst, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), ANR-08-COSI-0013,PROHMPT,Programmation des technologies multicoeurs hétérogènes(2008), Augonnet, Cédric, Programmation des technologies multicoeurs hétérogènes - - PROHMPT2008 - ANR-08-COSI-0013 - COSINUS - VALID, and Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)
Subjects: 020203 distributed computing, Multi-core processor, Computer science, Distributed computing, Symmetric multiprocessor system, 02 engineering and technology, Supercomputer, Scheduling (computing), Runtime system, [INFO.INFO-OS] Computer Science [cs]/Operating Systems [cs.OS], 0202 electrical engineering, electronic engineering, information engineering, Performance prediction, 020201 artificial intelligence & image processing, [INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS], Programmer
Abstract: International audience; Multicore architectures featuring specialized accelerators are getting an increasing amount of attention, and this success will probably influence the design of future High Performance Computing hardware. Unfortunately, programmers are actually having a hard time trying to exploit all these heterogeneous computing units efficiently, and most existing efforts simply focus on providing tools to offload some computations to available accelerators. Recently, some runtime systems have been designed that exploit the idea of scheduling -- as opposed to offloading -- parallel tasks over the whole set of heterogeneous computing units. Scheduling tasks over heterogeneous platforms makes it necessary to use accurate prediction models in order to assign each task to its most adequate computing unit. A deep knowledge of the application is usually required to build per-task performance models, based on the algorithmic complexity of the underlying numeric kernel. We present an alternate, auto-tuning performance prediction approach based on performance history tables dynamically built during the application run. This approach does not require the programmer to provide any specific information. We show that, thanks to the use of a carefully chosen hash function, our approach quickly achieves accurate performance estimations automatically. Our approach even outperforms regular algorithmic performance models with several linear algebra numerical kernels.
Published: 2009

20. A Unified Runtime System for Heterogeneous Multi-core Architectures

Author: Cédric Augonnet and Raymond Namyst
Subjects: Multi-core processor, Coprocessor, Speedup, Computer science, Distributed computing, Parallel computing, Execution time, Instruction set, Runtime system, Memory address, CUDA, Memory management, Memory ordering, Programming paradigm, Execution model
Abstract: Approaching the theoretical performance of heterogeneous multicore architectures, equipped with specialized accelerators, is a challenging issue. Unlike regular CPUs that can transparently access the whole global memory address range, accelerators usually embed local memory on which they perform all their computations using a specific instruction set. While many research efforts have been devoted to offloading parts of a program over such coprocessors, the real challenge is to find a programming model providing a unified view of all available computing units. In this paper, we present an original runtime system providing a high-level, unified execution model allowing seamless execution of tasks over the underlying heterogeneous hardware. The runtime is based on a hierarchical memory management facility and on a codelet scheduler. We demonstrate the efficiency of our solution with an LU decomposition for both homogeneous (3.8 speedup on 4 cores) and heterogeneous machines (95% efficiency). We also show that a "granularity aware" scheduling can improve execution time by 35%.
Published: 2009
Full Text: View/download PDF

21. Exploiting the Cell/BE architecture with the StarPU unified runtime system

Author: Raymond Namyst, Samuel Thibault, Cédric Augonnet, Maik Nijhuis, Augonnet, Cédric, Springer Verlag, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), Department of Computer Science [Amsterdam], Vrije Universiteit Amsterdam [Amsterdam] (VU), and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Multi-core processor, Exploit, business.industry, Computer science, Data management, 02 engineering and technology, Parallel computing, computer.software_genre, Runtime system, Software portability, Computer architecture, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Overhead (computing), 020201 artificial intelligence & image processing, Compiler, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], business, computer, Execution model
Abstract: International audience; Core specialization is currently one of the most promising ways for designing power-efficient multicore chips. However, approaching the theoretical peak performance of such heterogeneous multicore architectures with specialized accelerators is a complex issue. While substantial effort has been devoted to efficiently offloading parts of the computation, designing an execution model that unifies all computing units is the main challenge. We therefore designed the StarPU runtime system for providing portable support for heterogeneous multicore processors to high performance applications and compiler environments. StarPU provides a high-level, unified execution model which is tightly coupled to an expressive data management library. In addition to our previous results on using multicore processors alongside graphics processors, we show that StarPU is flexible enough to efficiently exploit the heterogeneous resources in the Cell processor. We present a scalable design supporting multiple different accelerators while minimizing the overhead on the overall system. Using experiments with classical linear algebra algorithms, we show that StarPU improves programmability and provides performance portability.
Published: 2009

22. MPC: A Unified Parallel Runtime for Clusters of NUMA Machines

Author: Hervé Jourdren, Raymond Namyst, Marc Pérache, DAM Île-de-France (DAM/DIF), Direction des Applications Militaires (DAM), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Springer, and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: 020203 distributed computing, Multi-core processor, POSIX Threads, Distributed shared memory, Computer science, Distributed computing, Message passing, Message Passing Interface, Cache-only memory architecture, Uniform memory access, Multiprocessing, 02 engineering and technology, Parallel computing, Thread (computing), Runtime system, Shared memory, 0202 electrical engineering, electronic engineering, information engineering, Interleaved memory, 020201 artificial intelligence & image processing, Distributed memory, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], SPMD, ComputingMilieux_MISCELLANEOUS
Abstract: Over the last decade, Message Passing Interface (MPI) has become a very successful parallel programming environment for distributed memory architectures such as clusters. However, the architecture of cluster nodes is currently evolving from small symmetric shared memory multiprocessors towards massively multicore, Non-Uniform Memory Access (NUMA) hardware. Although regular MPI implementations are using numerous optimizations to realize zero-copy cache-oblivious data transfers within shared-memory nodes, they might prevent applications from achieving most of the hardware's performance simply because the scheduling of heavyweight processes is not flexible enough to dynamically fit the underlying hardware topology. This explains why several research efforts have investigated hybrid approaches mixing message passing between nodes and memory sharing inside nodes, such as MPI+OpenMP solutions [1,2]. However, these approaches require a lot of programming effort in order to adapt/rewrite existing MPI applications. In this paper, we present the MultiProcessor Communications environment (MPC), which aims at providing programmers with an efficient runtime system for their existing MPI, POSIX Thread or hybrid MPI+Thread applications. The key idea is to use user-level threads instead of processes over multiprocessor cluster nodes to increase scheduling flexibility, to better control memory allocations and optimize scheduling of the communication flows with other nodes. Most existing MPI applications can run over MPC with no modification. We obtained substantial gains (up to 20%) by using MPC instead of a regular MPI runtime on several scientific applications.
Published: 2008
Full Text: View/download PDF

23. Scheduling Dynamic OpenMP Applications over Multicore Architectures

Author: Pierre-André Wacrenier, Olivier Aumage, Raymond Namyst, François Broquedis, François Diakhaté, Samuel Thibault, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), DAM Île-de-France (DAM/DIF), Direction des Applications Militaires (DAM), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA), ANR-05-CIGC-0001,PARA,Parallélisme et Amélioration du Rendement des Applications(2005), and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Multi-core processor, Speedup, Computer science, Distributed computing, 020207 software engineering, SMP, OpenMP, 02 engineering and technology, Parallel computing, Load balancing (computing), ComputerSystemsOrganization_PROCESSORARCHITECTURES, Software_PROGRAMMINGTECHNIQUES, Multi-Core, Scheduling (computing), Thread scheduling, Runtime system, NUMA, Nested Parallelism, Work stealing, Hierarchical Thread Scheduling, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Bubbles, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Massively parallel
Abstract: International audience; Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel section, and uses a NUMA-aware work stealing strategy when load balancing is needed. It has been developed as a plug-in to the ForestGOMP OpenMP platform. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization.
Published: 2008
Full Text: View/download PDF

24. A multithreaded communication engine for multicore architectures

Author: Alexandre Denis, Francois Trahay, Raymond Namyst, Elisabeth Brunet, Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Efficient runtime systems for parallel architectures (RUNTIME), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS), ANR CICG05-11 (projet LEGO), and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Overlap, Multi-core processor, Software suite, Speedup, Computer science, Communication, Message passing, 020206 networking & telecommunications, Software performance testing, 02 engineering and technology, Parallel computing, Thread (computing), Thread, Idle, Multithreading, 0202 electrical engineering, electronic engineering, information engineering, NewMadeleine, 020201 artificial intelligence & image processing, Pioman, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Multicore architecture, Critical path method
Abstract: International audience; The current trend in clusters leads towards an increase of the number of cores per node. As a result, an increasing number of parallel applications is mixing message passing and multithreading as an attempt to better match the underlying architecture's structure. This naturally raises the problem of designing efficient, multithreaded implementations of MPI. In this paper, we present the design of a multithreaded communication engine able to exploit idle cores to speed up communications in two ways: it can move CPU-intensive operations out of the critical path (e.g. PIO transfers offload), and is able to let rendezvous transfers progress asynchronously. We have implemented these methods in the PM2 software suite, evaluated their behavior in typical cases, and we have observed good performance results in overlapping communication and computation.
Published: 2008
Full Text: View/download PDF

25. An Efficient OpenMP Runtime System for Hierarchical Architectures

Author: Pierre-André Wacrenier, François Broquedis, Samuel Thibault, Brice Goglin, Raymond Namyst, Efficient runtime systems for parallel architectures (RUNTIME), INRIA Futurs, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Sciences et Technologies - Bordeaux 1-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Sciences et Technologies - Bordeaux 1 (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: [INFO.INFO-AR]Computer Science [cs]/Hardware Architecture [cs.AR], Speedup, Computer science, Multiprocessing, 02 engineering and technology, Thread (computing), Parallel computing, Software_PROGRAMMINGTECHNIQUES, computer.software_genre, Multi-Core, Scheduling (computing), Runtime system, NUMA, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Multi-core processor, SMP, OpenMP, ComputerSystemsOrganization_PROCESSORARCHITECTURES, Nested Parallelism, Hierarchical Thread Scheduling, 020201 artificial intelligence & image processing, Bubbles, Compiler, Cache, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], computer
Abstract: International audience; Exploiting the full computational power of ever more deeply hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture. The emergence of multi-core chips and NUMA machines makes it important to minimize the number of remote memory accesses, to favor cache affinities, and to guarantee fast completion of synchronization steps. By using the BubbleSched platform as a threading backend for the GOMP OpenMP compiler, we are able to easily transpose affinities of thread teams into scheduling hints using abstractions called bubbles. We then propose a scheduling strategy suited to nested OpenMP parallelism. The resulting preliminary performance evaluations show a significant improvement in speedup on a typical NAS OpenMP benchmark application.
Published: 2007
Full Text: View/download PDF

26. Building Portable Thread Schedulers for Hierarchical Multiprocessors: The BubbleSched Framework

Author: Samuel Thibault, Raymond Namyst, Pierre-André Wacrenier, Efficient runtime systems for parallel architectures (RUNTIME), INRIA Futurs, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Sciences et Technologies - Bordeaux 1-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Sciences et Technologies - Bordeaux 1 (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS), and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: FOS: Computer and information sciences, Computer science, Distributed computing, media_common.quotation_subject, Multiprocessing, 02 engineering and technology, Thread (computing), Multi-Core, Fair-share scheduling, Scheduling (computing), NUMA, Fixed-priority pre-emptive scheduling, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, media_common, 020203 distributed computing, Multi-core processor, Scheduling, SMP, Threads, Round-robin scheduling, Computer Science - Distributed, Parallel, and Cluster Computing, Debugging, SMT, Bubbles, Distributed, Parallel, and Cluster Computing (cs.DC), [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]
Abstract: International audience; Exploiting the full computational power of today's increasingly hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture. Unfortunately, most operating systems only provide a poor scheduling API that does not allow applications to transmit valuable scheduling hints to the system. In a previous paper, we showed that using a bubble-based thread scheduler can significantly improve applications' performance in a portable way. However, since multithreaded applications have various scheduling requirements, there is no universal scheduler that could meet all these needs. In this paper, we present a framework that allows scheduling experts to implement and experiment with customized thread schedulers. It provides a powerful API for dynamically distributing bubbles across the machine in a high-level, portable, and efficient way. Several examples show how experts can then develop, debug and tune their own portable bubble schedulers.
Published: 2007
Full Text: View/download PDF

27. On Parallelizing On-Line Statistics for Stochastic Biological Simulations

Author: Massimo Torquati, Ferruccio Damiani, Maurizio Drocco, Eva Sciacca, Marco Aldinucci, Salvatore Spinella, Angelo Troina, Mario Coppo, Michael Alexander, Pasqua D'Ambra, Adam Belloum, George Bosilca, Mario Cannataro, Marco Danelutto, Beniamino Di Martino, Michael Gerndt, Emmanuel Jeannot, Raymond Namyst, Jean Roman, Stephen L. Scott, Jesper Larsson Träff, Geoffroy Vallée, and Josef Weidendorfer
Subjects: parallel simulation, Multi-core processor, multi-core, stochastic simulation, on-line clustering, Computer science, Distributed computing, Monte Carlo method, Stochastic simulation, Line (geometry), Computational science
Abstract: This work concerns a general technique to enrich parallel versions of stochastic simulators for biological systems with tools for on-line statistical analysis of the results. In particular, within the FastFlow parallel programming framework, we describe the methodology and the implementation of a parallel Monte Carlo simulation infrastructure extended with user-defined on-line data filtering and mining functions. The simulator and the on-line analysis were validated on large multi-core platforms and representative proof-of-concept biological systems.
Published: 2012
Full Text: View/download PDF
