43 results for "Montse Farreras"
Search Results
2. Task Packing: Getting the Best from MPI Unbalanced Applications.
- Author: Gladys Utrera, Montse Farreras, and Jordi Fornes
- Published: 2017
- Full Text: View/download PDF
3. Performance evaluation of Optical Packet Switches on high performance applications.
- Author: Hugo Meyer, José Carlos Sancho, Wang Miao, Harm J. S. Dorren, Nicola Calabretta, and Montse Farreras
- Published: 2015
- Full Text: View/download PDF
4. Combining Static and Dynamic Data Coalescing in Unified Parallel C.
- Author: Michail Alvanos, Montse Farreras, Ettore Tiotto, José Nelson Amaral, and Xavier Martorell
- Published: 2016
- Full Text: View/download PDF
5. Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs.
- Author: Michail Alvanos, Ettore Tiotto, José Nelson Amaral, Montse Farreras, and Xavier Martorell
- Published: 2016
- Full Text: View/download PDF
6. Reducing Compiler-Inserted Instrumentation in Unified-Parallel-C Code Generation.
- Author: Michail Alvanos, José Nelson Amaral, Ettore Tiotto, Montse Farreras, and Xavier Martorell
- Published: 2014
- Full Text: View/download PDF
7. A novel SDN enabled hybrid optical packet/circuit switched data centre network: The LIGHTNESS approach.
- Author: Shuping Peng, Dimitra Simeonidou, George Zervas, Reza Nejabati, Yan Yan, Yi Shu, Salvatore Spadaro, Jordi Perelló, Fernando Agraz, Davide Careglio, Harm J. S. Dorren, Wang Miao, Nicola Calabretta, Giacomo Bernini, Nicola Ciulli, José Carlos Sancho, Steluta Iordache, Yolanda Becerra, Montse Farreras, Matteo Biancani, Alessandro Predieri, Roberto Proietti, Zheng Cao, Lei Liu, and S. J. Ben Yoo
- Published: 2014
- Full Text: View/download PDF
8. Efficient parallel construction of suffix trees for genomes larger than main memory.
- Author: Matteo Comin and Montse Farreras
- Published: 2013
- Full Text: View/download PDF
9. Improving communication in PGAS environments: static and dynamic coalescing in UPC.
- Author: Michail Alvanos, Montse Farreras, Ettore Tiotto, José Nelson Amaral, and Xavier Martorell
- Published: 2013
- Full Text: View/download PDF
10. Automatic communication coalescing for irregular computations in UPC language.
- Author: Michail Alvanos, Montse Farreras, Ettore Tiotto, and Xavier Martorell
- Published: 2012
11. Productive Cluster Programming with OmpSs.
- Author: Javier Bueno, Luis Martinell, Alejandro Duran, Montse Farreras, Xavier Martorell, Rosa M. Badia, Eduard Ayguadé, and Jesús Labarta
- Published: 2011
- Full Text: View/download PDF
12. Scalable RDMA performance in PGAS languages.
- Author: Montse Farreras, George Almási, Calin Cascaval, and Toni Cortes
- Published: 2009
- Full Text: View/download PDF
13. Multidimensional Blocking in UPC.
- Author: Christopher Barton, Calin Cascaval, George Almási, Rahul Garg, José Nelson Amaral, and Montse Farreras
- Published: 2007
- Full Text: View/download PDF
14. Shared memory programming for large scale machines.
- Author: Christopher Barton, Calin Cascaval, George Almási, Yili Zheng, Montse Farreras, Siddhartha Chatterjee, and José Nelson Amaral
- Published: 2006
- Full Text: View/download PDF
15. Scaling MPI to short-memory MPPs such as BG/L.
- Author: Montse Farreras, Toni Cortes, Jesús Labarta, and George Almási
- Published: 2006
- Full Text: View/download PDF
16. Predicting MPI Buffer Addresses.
- Author: Felix Freitag, Montse Farreras, Toni Cortes, and Jesús Labarta
- Published: 2004
- Full Text: View/download PDF
17. Parallel Continuous Flow: A Parallel Suffix Tree Construction Tool for Whole Genomes.
- Author: Matteo Comin and Montse Farreras
- Published: 2014
- Full Text: View/download PDF
18. All-optical packet/circuit switching-based data center network for enhanced scalability, latency, and throughput.
- Author: Jordi Perelló, Salvatore Spadaro, Sergio Ricciardi, Davide Careglio, Shuping Peng, Reza Nejabati, Georgios Zervas, Dimitra Simeonidou, Alessandro Predieri, Matteo Biancani, Harm J. S. Dorren, Stefano Di Lucente, Jun Luo, Nicola Calabretta, Giacomo Bernini, Nicola Ciulli, José Carlos Sancho, Steluta Iordache, Montse Farreras, Yolanda Becerra, Chris Liou, Iftekhar Hussain, Yawei Yin, Lei Liu, and Roberto Proietti
- Published: 2013
- Full Text: View/download PDF
19. Game-based Learning vs Gamification: A Hands-On
- Author: Montse Farreras, Jesús Armengol, Pau Bofill, and Angels Hernández
- Subjects: Project Approaches, Game-based Learning, Active Learning, Engineering Education, Gamification
- Abstract
An example of gamification is a contest in which students earn points for solving the usual exercises of the subject matter. An example of game-based learning is an escape room in which students become involved in studying and solving subject-matter problems to obtain the hints required to continue the game. In this sense, game-based learning is an instance of problem-based learning. We propose a hands-on session in which participants first engage in a gamification activity and later in a game-based learning (GBL) activity. They will be encouraged to notice the differences and to distinguish between the two approaches. Afterwards, participants will be asked to design a simple escape-room scenario involving problems from their own courses.
- Published: 2022
- Full Text: View/download PDF
20. An Escape Room For Learning Computer Programming
- Author: Pau Bofill, Montse Farreras, Jesús Armengol, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Òptica i Optometria, and Universitat Politècnica de Catalunya. GOAPI - Grup d'Òptica Aplicada i Processament d'Imatge
- Subjects: Learning Computer Programming, Aprenentatge actiu, Game-based Learning, Active Learning, Gamification, Learning Phases, Ludificació, Ensenyament i aprenentatge::Metodologies docents [Àrees temàtiques de la UPC], Escape Room
- Abstract
Game-based learning is a strategy in which games are used to challenge students to learn and apply the contents of a subject matter. In this sense, game-based learning is an instance of problem-based learning. In this paper we discuss how game-based strategies can be used to motivate students to perform the actions required by each of the learning phases, namely: motivation, information, understanding, application, and validation (feedback). We then present the application of those strategies to the design of an escape room in which computer programs are required to solve the puzzles of the game. The designed escape room is then used as a game-based strategy in an introductory seminar on the Python programming language.
- Published: 2022
- Full Text: View/download PDF
21. A high-productivity task-based programming model for clusters.
- Author: Enric Tejedor, Montse Farreras, David Grove, Rosa M. Badia, Gheorghe Almási, and Jesús Labarta
- Published: 2012
- Full Text: View/download PDF
22. Improving performance of all-to-all communication through loop scheduling in PGAS environments.
- Author: Michail Alvanos, Gabriel Tanase, Montse Farreras, Ettore Tiotto, José Nelson Amaral, and Xavier Martorell
- Published: 2013
- Full Text: View/download PDF
23. ClusterSs: a task-based programming model for clusters.
- Author: Enric Tejedor, Montse Farreras, David Grove, Rosa M. Badia, Gheorghe Almási, and Jesús Labarta
- Published: 2011
- Full Text: View/download PDF
24. Asynchronous PGAS runtime for Myrinet networks.
- Author: Montse Farreras and George Almási
- Published: 2010
- Full Text: View/download PDF
25. Exploring the Predictability of MPI Messages.
- Author: Felix Freitag, Jordi Caubet, Montse Farreras, Toni Cortes, and Jesús Labarta
- Published: 2003
- Full Text: View/download PDF
26. Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs
- Author: José Nelson Amaral, Ettore Tiotto, Xavier Martorell, Michail Alvanos, Montse Farreras, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects: Computer Networks and Communications, Computer science, Parallel programming (Computer science), Optimizing compiler, Parallel computing, Programació en paral·lel (Informàtica), Theoretical Computer Science, Artificial Intelligence, Unified Parallel C, Compiler optimization, Instrumentation (computer programming), Partitioned global address space, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Address space, Communication, Locality, Computer Graphics and Computer-Aided Design, Hardware and Architecture, Programming paradigm, Software
- Abstract
Highlights: we improve the performance of fine-grained UPC applications by orders of magnitude; we introduce a novel shared-data localization transformation; we present a thorough performance analysis and evaluation; we show that reducing runtime calls is crucial for performance; we achieve performance comparable to C and MPI using the UPC programming model. Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance. This paper addresses this issue and introduces several techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs), the inlining of data locality checks, and the use of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body. A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs up to 1.8× their hand-optimized UPC counterparts for applications with regular accesses, and up to 6.3× for applications with irregular accesses. (A code sketch of the inspector-executor idea follows this entry.)
- Published: 2016
- Full Text: View/download PDF
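The inspector-executor transformation summarized above splits a fine-grained loop into an inspection pass that records which remote elements are needed, a single coalesced transfer, and an execution pass over the prefetched values. Below is a minimal Python sketch of that structure; the helpers owner_of and fetch_bulk are illustrative assumptions standing in for the UPC runtime, not its actual API.

```python
# Hedged sketch of the inspector-executor split. owner_of and
# fetch_bulk are stand-ins for runtime services, assumed here
# for illustration only.

def inspector_executor(indices, local_data, my_id, owner_of, fetch_bulk):
    # Inspector: classify accesses without touching the network.
    remote = sorted({i for i in indices if owner_of(i) != my_id})

    # Single coalesced communication step: one message, not len(remote).
    cache = dict(zip(remote, fetch_bulk(remote)))

    # Executor: the original loop body, reading locally or from cache.
    total = 0
    for i in indices:
        total += local_data[i] if owner_of(i) == my_id else cache[i]
    return total

# Toy setup: even indices are local to thread 0, odd ones are remote.
local = {i: i for i in range(0, 20, 2)}
result = inspector_executor(
    indices=[1, 2, 3, 4, 7, 2],
    local_data=local,
    my_id=0,
    owner_of=lambda i: i % 2,
    fetch_bulk=lambda idxs: [10 * i for i in idxs],  # fake remote values
)
print(result)  # 2 + 4 + 2 local, plus 10 + 30 + 70 remote = 118
```

The point of the split is that the executor's loop body never touches the network: all remote data arrives in one aggregated message, which is what the index-vector aggregation mentioned in the abstract achieves.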
27. Hybrid/Heterogeneous Programming with OMPSS and Its Software/Hardware Implications
- Author: Alex Ramirez, Yoav Etsion, Josep M. Perez, Montse Farreras, Javier Bueno, Mateo Valero, Eduard Ayguadé, Alejandro Duran, Rosa M. Badia, Judit Planas, Ioanna Tsalouchidou, Pieter Bellens, Jesús Labarta, Vladimir Marjanovic, Roger Ferrer, Xavier Martorell, Xavier Teruel, and Lluis Martinell
- Subjects: Software, Heterogeneous programming, Computer architecture, Computer science
- Published: 2017
- Full Text: View/download PDF
28. Task Packing: Getting the Best from MPI Unbalanced Applications
- Author: Montse Farreras, Jordi Fornes, and Gladys Utrera
- Subjects: Computer science, Computation, Distributed computing, Task mapping, Parallel computing, Load balancing (computing), Idle, Knapsack problem, Scalability, Subset sum problem
- Abstract
In this work we propose a Task Packing mechanism that concentrates the idle cycles of unbalanced applications in such a way that one or more cores are freed from execution. To achieve this, we stress the cores with only the useful work of the parallel application tasks, provided performance is not degraded. Tasks are "packed" onto a minimum number of cores using oversubscription. To compute the task mapping to cores and the minimum number of cores, we apply the Subset Sum algorithm, which is a particular case of the Knapsack problem. Our experiments demonstrate that task packing using oversubscription is possible without performance degradation. The mechanism makes accurate allocation decisions, leaving room for executing other applications or simply keeping the other cores idle. Our proposal is scalable, as the task allocation decisions are based only on local information and task migrations are performed only within each node. (A sketch of subset-sum packing follows this entry.)
- Published: 2017
- Full Text: View/download PDF
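As a companion to the abstract above, here is a hedged Python sketch of the Subset Sum idea applied to task packing: given each task's busy fraction of a period, fill one core at a time with the subset of tasks whose summed load best approaches full utilization. The exhaustive subset search and the capacity threshold are illustrative simplifications, not the paper's implementation.

```python
# Hedged sketch: pack per-task busy fractions onto as few cores as
# possible. Exhaustive subset search keeps the sketch short; the real
# mechanism solves this with a Subset Sum / Knapsack formulation.

from itertools import combinations

def pack_tasks(busy_fractions, capacity=1.0):
    remaining = list(enumerate(busy_fractions))
    cores = []
    while remaining:
        best, best_load = (), 0.0
        for r in range(1, len(remaining) + 1):
            for subset in combinations(remaining, r):
                load = sum(f for _, f in subset)
                if best_load < load <= capacity:
                    best, best_load = subset, load
        if not best:  # a lone task exceeds capacity: give it its own core
            best = (max(remaining, key=lambda t: t[1]),)
        cores.append([task_id for task_id, _ in best])
        remaining = [t for t in remaining if t not in best]
    return cores

# 0.9 + 0.1 and 0.6 + 0.4 each fill one core exactly, so four tasks
# fit on two cores and the other two cores are freed.
print(pack_tasks([0.9, 0.6, 0.4, 0.1]))  # [[0, 3], [1, 2]]
```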
29. Reducing Compiler-Inserted Instrumentation in Unified-Parallel-C Code Generation
- Author: Xavier Martorell, Ettore Tiotto, Montse Farreras, José Nelson Amaral, Michail Alvanos, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects: Computer science, Data localization, Program compilers, Llenguatges de programació, Programming languages (Electronic computers), Parallel computing, Runtime system, Linear transformations, Unified Parallel C, Synchronization (computer science), Code generation, Computer architecture, Instrumentation (computer programming), Partitioned global address space, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Metadata, Performance degradation, Read/write operations, Parallel processing (Electronic computers), Address space, Processament en paral·lel (Ordinadors), Supercomputers, Communication mechanisms, Informàtica::Llenguatges de programació [Àrees temàtiques de la UPC], Prototype implementations, Operating system, Mathematical transformations, Compiler, Synchronization primitive
- Abstract
Programs written in Partitioned Global Address Space (PGAS) languages can access any location of the entire address space via standard read/write operations. The compiler has to create the communication mechanisms, and the runtime system has to use synchronization primitives, to ensure the correct execution of the programs. However, PGAS programs may have fine-grained shared accesses that lead to performance degradation. One solution is to use the inspector-executor technique to determine which accesses are indeed remote and which may be coalesced into larger remote access operations. A straightforward implementation of the inspector-executor in a PGAS system may result in excessive instrumentation that hinders performance. This paper introduces a shared-data localization transformation based on linear memory descriptors (LMADs) that reduces the amount of instrumentation introduced by the compiler into programs written in the UPC language, and it describes a prototype implementation of the proposed transformation. A performance evaluation, using up to 2048 cores of a POWER 775 supercomputer, supports the prediction that applications with regular accesses can achieve up to 180% of the performance of hand-optimized versions, while applications with irregular accesses yield speedups from 1.12× up to 6.3×. (A sketch of a stride-descriptor locality check follows this entry.)
- Published: 2014
- Full Text: View/download PDF
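The LMAD-based localization above can be pictured as follows: when a loop's shared accesses follow a constant stride, one descriptor comparison at loop entry can prove every access local, replacing per-access instrumentation. A minimal sketch, with field names that are assumptions rather than the compiler's actual representation:

```python
# Hedged sketch of a constant-stride access descriptor and the single
# locality check that replaces per-access instrumentation.

from dataclasses import dataclass

@dataclass
class StrideDescriptor:
    base: int    # first element index touched by the loop
    stride: int  # constant distance between consecutive accesses
    count: int   # number of accesses

def all_local(desc, my_lo, my_hi):
    """True iff every index base + i*stride (0 <= i < count) falls
    inside this thread's partition [my_lo, my_hi)."""
    last = desc.base + (desc.count - 1) * desc.stride
    lo, hi = min(desc.base, last), max(desc.base, last)
    return my_lo <= lo and hi < my_hi

# One check at loop entry instead of `count` per-access checks.
desc = StrideDescriptor(base=1024, stride=4, count=200)
if all_local(desc, my_lo=1024, my_hi=2048):
    print("fast path: direct local loads, no instrumentation")
else:
    print("slow path: fall back to the instrumented inspector-executor")
```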
30. Parallel continuous flow: a parallel suffix tree construction tool for whole genomes
- Author: Montse Farreras and Matteo Comin
- Subjects: Theoretical computer science, Computer science, Suffix tree, Genome, Software, Genetics, Humans, Computer Simulation, Molecular Biology, Research Articles, Sequence, Models, Genetic, Continuous flow, Genome, Human, Sequence Analysis, DNA, Computational Mathematics, Computational Theory and Mathematics, Modeling and Simulation, Human genome, Suffix, Algorithms
- Abstract
The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatics domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. The methodologies required to analyze these data have also become more complex every day, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method on the suffix tree construction of the entire human genome, about 3 GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes. (A sketch of prefix-partitioned construction follows this entry.)
- Published: 2014
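The abstract does not spell out PCF's internals, so the following Python sketch illustrates only the general principle behind parallel suffix tree construction: suffixes are grouped by leading character, and each group's subtree can be built independently by a separate worker.

```python
# Hedged sketch: partition suffixes by their first character so each
# group's subtree can be built independently, one per worker. A real
# builder would construct the subtree (e.g. with Ukkonen's algorithm)
# instead of just listing suffix start positions.

from multiprocessing import Pool

TEXT = "mississippi$"

def build_subtree(prefix):
    starts = [i for i in range(len(TEXT)) if TEXT.startswith(prefix, i)]
    return prefix, starts

if __name__ == "__main__":
    prefixes = sorted(set(TEXT))   # one independent task per character
    with Pool(4) as pool:
        for prefix, starts in pool.map(build_subtree, prefixes):
            print(prefix, starts)
```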
31. A novel SDN enabled hybrid optical packet/circuit switched data centre network - The LIGHTNESS approach
- Author: Davide Careglio, Y. Shu, Lei Liu, Steluta Iordache, Nicola Calabretta, Zheng Cao, Jordi Perello, Montse Farreras, Fernando Agraz, Reza Nejabati, Giacomo Bernini, Roberto Proietti, Yan Yan, Nicola Ciulli, Yolanda Becerra, Jose Carlos Sancho, Dimitra Simeonidou, Matteo Biancani, Alessandro Predieri, Wang Miao, Salvatore Spadaro, Harm J. S. Dorren, George Zervas, S. J. B. Yoo, Shuping Peng, Electro-Optical Communication, and Low Latency Interconnect Networks
- Subjects: Circuit switching, Dynamic network analysis, Burst switching, Computer science, Circuit Switched Data, Fast packet switching, Optical burst switching, Software-defined networking, Heterogeneous network, Computer network
- Abstract
Current over-provisioned and multi-tier data centre networks (DCN) deploy rigid control and management platforms, which are not able to accommodate the ever-growing workload driven by the increasing demand of high-performance data centre (DC) and cloud applications. In response to this, the EC FP7 project LIGHTNESS (Low Latency and High Throughput Dynamic Network Infrastructures for High Performance Datacentre Interconnects) is proposing a new flattened optical DCN architecture capable of providing dynamic, programmable, and highly available DCN connectivity services while meeting the requirements of new and emerging DC and cloud applications. The LIGHTNESS DCN comprises all-optical switching technologies (Optical Packet Switching (OPS) and Optical Circuit Switching (OCS)) and hybrid Top-of-the-Rack (ToR) switches, controlled and operated by a Software Defined Networking (SDN) based control plane for enhanced programmability of heterogeneous network functions and protocols. Harnessing the power of optics enables DCs to cope effectively with the demands of high-performance applications. The programmability and flexibility provided by the SDN-based control plane make it possible to fully exploit the benefits of the LIGHTNESS multi-technology optical DCN, while provisioning on-demand, dynamic, flexible, and highly resilient network services inside DCs.
- Published: 2014
32. Improving performance of all-to-all communication through loop scheduling in PGAS environments
- Author: Xavier Martorell, Montse Farreras, Michail Alvanos, José Nelson Amaral, Gabriel Tanase, Ettore Tiotto, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects: Computer science, Informàtica::Enginyeria del software [Àrees temàtiques de la UPC], Parallel computing, Unified Parallel C, Partitioned global address space, IBM, Software engineering, Supercomputer, All-to-all communication, Performance evaluation, Loop scheduling, Operating system, One-sided communication, Enginyeria de programari
- Published: 2013
- Full Text: View/download PDF
33. Improving communication in PGAS environments: Static and dynamic coalescing in UPC
- Author: José Nelson Amaral, Michail Alvanos, Ettore Tiotto, Montse Farreras, Xavier Martorell, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects: Software engineering, Computer science, Informàtica::Enginyeria del software [Àrees temàtiques de la UPC], One-sided communication, Optimizing compiler, Parallel computing, Data mapping, Unified Parallel C, Performance evaluation, Overhead (computing), Partitioned global address space, Enginyeria de programari, Compile time
- Abstract
The goal of Partitioned Global Address Space (PGAS) languages is to improve programmer productivity on large-scale parallel machines. However, PGAS programs may have many fine-grained shared accesses that lead to performance degradation. Manual code transformations or compiler optimizations are required to improve the performance of programs with fine-grained accesses. The downside of manual code transformations is the increased program complexity, which hinders programmer productivity. On the other hand, most compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This paper presents an optimization for the Unified Parallel C language that combines compile-time (static) and runtime (dynamic) coalescing of shared data, without knowledge of the physical data mapping. Larger messages increase the network efficiency, and static coalescing decreases the overhead of library calls. The performance evaluation uses two microbenchmarks and three benchmarks to obtain scaling and absolute performance numbers on up to 32,768 cores of a Power 775 machine. Our results show that the compiler transformation results in speedups from 1.15× up to 21× compared with the baseline versions, and that they achieve up to 63% of the performance of the MPI versions. (A sketch of owner-based coalescing follows this entry.)
- Published: 2013
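To make the coalescing idea above concrete, here is a hedged Python sketch of dynamic coalescing: indices destined for the same owner thread are batched into a single bulk transfer instead of one message per element. The helpers owner_of and get_bulk are illustrative assumptions, not the UPC runtime API.

```python
# Hedged sketch of dynamic coalescing: one bulk message per owner
# instead of one small message per element. owner_of models a blocked
# cyclic layout; get_bulk stands in for a runtime bulk-get call.

from collections import defaultdict

def owner_of(index, block, nthreads):
    return (index // block) % nthreads

def coalesced_gather(indices, block, nthreads, get_bulk):
    per_owner = defaultdict(list)
    for i in indices:
        per_owner[owner_of(i, block, nthreads)].append(i)
    values = {}
    for owner, idxs in per_owner.items():
        # One network round-trip per owner.
        for i, v in zip(idxs, get_bulk(owner, idxs)):
            values[i] = v
    return values

# Toy 'remote' fetch: the value of element i is i squared.
print(coalesced_gather(
    indices=[3, 17, 4, 257, 12],
    block=16, nthreads=4,
    get_bulk=lambda owner, idxs: [i * i for i in idxs],
))  # {3: 9, 17: 289, 4: 16, 257: 66049, 12: 144}
```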
34. All-optical packet/circuit switching-based data center network for enhanced scalability, latency and throughput
- Author: Iftekhar Hussain, Roberto Proietti, Steluta Iordache, Reza Nejabati, George Zervas, Montse Farreras, Sergio Ricciardi, Salvatore Spadaro, Lei Liu, Jun Luo, Shuping Peng, Giacomo Bernini, Stefano Di Lucente, Davide Careglio, Dimitra Simeonidou, Jose Carlos Sancho, Harm J. S. Dorren, Jordi Perello, Yolanda Becerra, Matteo Biancani, Alessandro Predieri, Nicola Calabretta, Yawei Yin, Nicola Ciulli, Chris Liou, Electro-Optical Communication, and Low Latency Interconnect Networks
- Subjects: Circuit switching, Computer Networks and Communications, Network packet, Computer science, Optical burst switching, LAN switching, Packet switching, Burst switching, Hardware and Architecture, Forwarding plane, Fast packet switching, Software, Information Systems, Computer network
- Abstract
Applications running inside data centers are enabled through the cooperation of thousands of servers arranged in racks and interconnected through the data center network (DCN). Current DCN architectures based on electronic devices are neither scalable enough to face the massive growth of DCs, nor flexible enough to efficiently and cost-effectively support highly dynamic application traffic profiles. The FP7 European project LIGHTNESS foresees extending the capabilities of today's electrical DCNs through the introduction of optical packet switching (OPS) and optical circuit switching (OCS) paradigms, which together realize an advanced and highly scalable DCN architecture for ultra-high-bandwidth and low-latency server-to-server interconnection. This article reviews the current DC and high-performance computing (HPC) outlooks, followed by an analysis of the main requirements for future DCs and HPC platforms. As the key contribution of the article, the LIGHTNESS DCN solution is presented, elaborating on the envisioned DCN data plane technologies as well as on the unified SDN-enabled control plane architecture that will empower OPS and OCS transmission technologies with superior flexibility, manageability, and customizability.
- Published: 2013
35. Efficient parallel construction of suffix trees for genomes larger than main memory (Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI '13)
- Author: Matteo Comin and Montse Farreras
- Published: 2013
36. Productive cluster programming with OmpSs
- Author: Alejandro Duran, Xavier Martorell, Javier Bueno, Jesús Labarta, Rosa M. Badia, Montse Farreras, Eduard Ayguadé, Luis Martinell, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects: Data parallelism, Computer science, Fortran, Task parallelism, Parallel computing, Remote node, Runtime system, Master node, Distributed shared memory, Programming paradigm, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Debugging, Address space, Compiler, Instruction-level parallelism
- Abstract
Clusters of SMPs are ubiquitous. They have traditionally been programmed using MPI, but the productivity of MPI programmers is low because of the complexity of expressing parallelism and communication, and the difficulty of debugging. To ease the burden on the programmer, new programming models have tried to give the illusion of a global shared address space (e.g., UPC, Co-array Fortran). Unfortunately, these models do not support the increasingly common irregular forms of parallelism that require asynchronous task parallelism. Other models, such as X10 or Chapel, provide this asynchronous parallelism, but the programmer is required to rewrite the application entirely. We present the implementation of OmpSs for clusters, a variant of OpenMP extended to support asynchrony, heterogeneity, and data movement for task parallelism. Like OpenMP, it is based on decorating an existing serial version with compiler directives that are translated into calls to a runtime system that manages parallelism extraction as well as data coherence and movement. Thus, the same program written in OmpSs can run on a regular SMP machine or on clusters of SMPs, and the serial version can even be used for debugging. The runtime uses the information provided by the programmer to distribute work across the cluster while optimizing communications, using affinity scheduling and data caching. We have evaluated our proposal with a set of kernels, and the OmpSs versions obtain performance comparable, or even superior, to that obtained by the same versions written in MPI. (A sketch of the task-dependency model follows this entry.)
- Published: 2011
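The OmpSs model described above annotates tasks with their inputs and outputs and lets a runtime extract the parallelism. The toy Python scheduler below mimics only the dependency semantics (a task runs once all of its inputs have been produced); real OmpSs uses #pragma directives on C/Fortran code and a far more capable runtime.

```python
# Hedged sketch of task-dependency semantics: each task declares the
# data names it reads (ins) and writes (outs); a task becomes ready
# once all of its inputs have been produced. Assumes an acyclic graph.

def run_task_graph(tasks):
    produced = set()                 # data names already written
    done = [False] * len(tasks)
    while not all(done):
        for i, (fn, ins, outs) in enumerate(tasks):
            if not done[i] and all(name in produced for name in ins):
                fn()                 # run the task body
                produced.update(outs)
                done[i] = True

# Diamond dependence: B and C both wait on A; D waits on B and C.
run_task_graph([
    (lambda: print("produce A"),      [],         ["A"]),
    (lambda: print("B from A"),       ["A"],      ["B"]),
    (lambda: print("C from A"),       ["A"],      ["C"]),
    (lambda: print("D from B and C"), ["B", "C"], ["D"]),
])
```

In a real runtime, B and C would execute concurrently on different nodes; the sketch serializes them but respects the same ordering constraints.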
37. Scalable RDMA performance in PGAS languages
- Author: Toni Cortes, Calin Cascaval, George Almási, and Montse Farreras
- Subjects: Memory management, Remote direct memory access, Shared memory, Computer architecture, Computer science, Scalability, Programming paradigm, Multiprocessing, Distributed memory, Parallel computing, Partitioned global address space
- Abstract
Partitioned Global Address Space (PGAS) languages provide a unique programming model that can span shared-memory multiprocessor (SMP) architectures, distributed memory machines, or clusters of SMPs. Users can program large scale machines with easy-to-use, shared memory paradigms.
- Published: 2009
- Full Text: View/download PDF
38. Multidimensional Blocking in UPC
- Author: José Nelson Amaral, Calin Cascaval, Christopher Barton, Montse Farreras, George Almási, and Rahul Garg
- Subjects: Computer science, Distributed computing, Locality, Parallel computing, Data structure, Unified Parallel C, Programming paradigm, Distributed memory, Partitioned global address space, Compiler, Programmer, Direct memory access, Block (data storage)
- Abstract
Partitioned Global Address Space (PGAS) languages offer an attractive, high-productivity programming model for programming large-scale parallel machines. PGAS languages, such as Unified Parallel C (UPC), combine the simplicity of shared-memory programming with the efficiency of the message-passing paradigm by allowing users control over the data layout. PGAS languages distinguish between private, shared-local, and shared-remote memory, with shared-remote accesses typically much more expensive than shared-local and private accesses, especially on distributed memory machines where a shared-remote access implies communication over a network. In this paper we present a simple extension to the UPC language that allows the programmer to block shared arrays in multiple dimensions. We claim that this extension allows for better control of locality, and therefore performance, in the language. We describe an analysis that allows the compiler to distinguish between local shared array accesses and remote shared array accesses. Local shared array accesses are then transformed into direct memory accesses by the compiler, saving the overhead of a locality check at runtime. We present results showing that locality analysis is able to significantly reduce the number of shared accesses. (A sketch of a blocked layout and its locality check follows this entry.)
- Published: 2008
- Full Text: View/download PDF
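A minimal sketch of the multidimensional blocking idea above: a 2-D shared array is tiled into blocks, blocks are dealt out cyclically to threads, and a cheap arithmetic check classifies an access as local or remote. The layout rule used here is an illustrative assumption, not necessarily the one the UPC extension defines.

```python
# Hedged sketch: a 2-D array tiled into BX-by-BY blocks, with blocks
# assigned round-robin to threads. Ownership rule assumed for
# illustration only.

BX, BY = 4, 4        # block dimensions
NTHREADS = 8

def owner(i, j, ncols):
    blocks_per_row = ncols // BY
    block_id = (i // BX) * blocks_per_row + (j // BY)
    return block_id % NTHREADS

def is_local(i, j, ncols, me):
    # The compile-time analysis in the paper aims to answer this
    # question statically, removing the runtime check entirely.
    return owner(i, j, ncols) == me

# A 16x16 array has 16 blocks; thread 0 owns blocks 0 and 8,
# i.e. 2 blocks x 16 elements = 32 local elements.
ncols = 16
mine = [(i, j) for i in range(16) for j in range(16)
        if is_local(i, j, ncols, me=0)]
print(len(mine))  # 32
```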
39. Scaling MPI to short-memory MPPs such as BG/L
- Author: Toni Cortes, Montse Farreras, G. Almasi, and Jesús Labarta
- Subjects: Robustness (computer science), Computer science, Scalability, Parallel computing, Scaling, Execution time, Implementations, Memory problems
- Abstract
Scalability to large numbers of processes is one of the weaknesses of current MPI implementations. Standard implementations are able to scale to hundreds of nodes, but not beyond. The main problem in these implementations is that they assume some resources (for both data and control-data) will always be available to receive/process unexpected messages. As we show, this is not always true, especially on short-memory machines like the BG/L, which has 64K nodes but only 512 MB of memory per node. The objective of this paper is to present an algorithm that improves the robustness of MPI implementations for short-memory MPPs: by taking care of data and control-data reception, the system can scale up to any number of nodes. The proposed solution achieves this goal without any observable overhead when there are no memory problems. Furthermore, in the worst case, when memory resources are extremely scarce, the overhead never doubles the execution time (and in this extreme situation, traditional MPI implementations would fail to execute at all). (A sketch of the underlying flow-control idea follows this entry.)
- Published: 2006
- Full Text: View/download PDF
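The sketch below illustrates the failure mode the paper addresses, not its exact algorithm: an MPI-like receiver that buffers unexpected messages. Standard implementations effectively assume this buffer is unbounded, which exhausts memory on short-memory nodes; bounding it and signaling senders to retry is one simple form of the flow control the paper argues for.

```python
# Hedged sketch: a receiver with a bounded unexpected-message queue.
# Refusing a message (back-pressure) replaces the unbounded buffering
# that standard MPI implementations implicitly rely on.

from collections import deque

class Receiver:
    def __init__(self, max_unexpected):
        self.unexpected = deque()
        self.max_unexpected = max_unexpected

    def on_message(self, tag, payload):
        """Return True if buffered, False to tell the sender 'retry'."""
        if len(self.unexpected) >= self.max_unexpected:
            return False
        self.unexpected.append((tag, payload))
        return True

    def recv(self, tag):
        """Match a posted receive against buffered unexpected messages."""
        for msg in list(self.unexpected):
            if msg[0] == tag:
                self.unexpected.remove(msg)
                return msg[1]
        return None  # a real MPI would block or poll here

r = Receiver(max_unexpected=2)
print(r.on_message(1, "a"), r.on_message(2, "b"), r.on_message(3, "c"))
# True True False -> the third sender must retry once a slot frees up
print(r.recv(2))  # 'b'
```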
40. Efficient parallel construction of suffix trees for genomes larger than main memory
- Author: Matteo Comin, Montse Farreras, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects: Compressed suffix array, Sequence, Theoretical computer science, Whole genome indexing, Computer science, Parallel algorithms, Bioinformatics, Suffix tree, String (computer science), Generalized suffix tree, Parallel algorithm, Genome, Informàtica::Informàtica teòrica::Algorísmica i teoria de la complexitat [Àrees temàtiques de la UPC], Algorismes paral·lels, Bioinformàtica, Informàtica [Àrees temàtiques de la UPC], Suffix
- Abstract
The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatics domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. The methodologies required to analyze these data have also become more complex every day, requiring fast queries to multiple genomes. In this paper we present Parallel Continuous Flow (PCF), a parallel suffix tree construction method that is suitable for very long strings. We tested our method on the construction of the suffix tree of the entire human genome, about 3 GB. We showed that PCF can scale gracefully as the size of the input string grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 nodes.
41. Predicting MPI buffer addresses
- Author: Jesús Labarta, Felix Freitag, Montse Farreras, Toni Cortes, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. DSG - Distributed Systems Group
- Subjects: CATNETS, Informàtica::Arquitectura de computadors::Arquitectures distribuïdes [Àrees temàtiques de la UPC], MPI messages, Workstation, Computer science, Message passing, Message passing applications, Workstation clusters, Parallel computing, Distributed systems, Communication latency, Performance limiting factors, Graph based prediction, Parallel processing (DSP implementation), Buffer storage, Message-copying operations, Parallel processing, Sistemes distribuïts, Message size, Multiprocessor cluster, Periodicity detection
- Abstract
Communication latencies have been identified as one of the performance-limiting factors of message passing applications in clusters of workstations/multiprocessors. On the receiver side, message-copying operations contribute to these communication latencies. Recently, prediction of MPI messages has been proposed as part of the design of a zero message-copying mechanism. Until now, prediction was only evaluated for the next message. Predicting only the next message, however, may not be enough for real implementations, since messages do not arrive in the same order as they are requested. In this paper, we explore long-term prediction of MPI messages for the design of a zero message-copying mechanism. To achieve long-term prediction we evaluate two prediction schemes, the first based on graphs and the second based on periodicity detection. Our experiments indicate that with both prediction schemes the buffer addresses and message sizes of several future MPI messages (up to +10) can be predicted successfully. (A sketch of periodicity-based prediction follows this entry.)
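To illustrate the periodicity-detection scheme mentioned above (the graph-based scheme is omitted), here is a hedged Python sketch: if the stream of (buffer address, size) pairs repeats with period p, the next several messages can be read off the previous cycle. The detector below is deliberately naive; the paper's actual detector and parameters are not given in the abstract.

```python
# Hedged sketch of periodicity-based prediction over a stream of
# (buffer address, message size) pairs.

def detect_period(history):
    """Smallest p such that the whole history repeats with period p."""
    n = len(history)
    for p in range(1, n // 2 + 1):
        if all(history[i] == history[i - p] for i in range(p, n)):
            return p
    return None

def predict_next(history, k):
    """Predict the next k pairs by continuing the detected cycle."""
    p = detect_period(history)
    if p is None:
        return []
    return [history[-p + (i % p)] for i in range(k)]

msgs = [(0x1000, 4096), (0x2000, 512), (0x3000, 4096)] * 3
print(predict_next(msgs, 4))
# The next four messages continue the length-3 cycle, starting
# again at (0x1000, 4096).
```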
42. Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization
- Author: Gladys Utrera, Jordi Fornes, Montse Farreras, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects: Combinatorial optimization, Computer Networks and Communications, Computer science, Computation, CPU time, Parallel computing, Optimització combinatòria, Theoretical Computer Science, Scheduling (computing), Idle, Artificial Intelligence, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Knapsack algorithm, Data structure, Hardware and Architecture, Knapsack problem, HPC, Oversubscription, MPI, High performance computing, Load balancing, Càlcul intensiu (Informàtica), Software
- Abstract
Load imbalance in parallel systems can be generated by factors external to the running applications, such as operating system noise, or by the underlying hardware, such as a heterogeneous cluster. HPC applications working on irregular data structures can also have difficulty balancing their computations across the parallel tasks. In this article we extend, improve, and evaluate more deeply the Task Packing mechanism proposed in a previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, CPUs are stressed with only the useful work of the parallel application tasks, provided performance is not degraded. The packing is solved by an algorithm based on the Knapsack problem, using a minimum number of CPUs and oversubscription. We design and implement a more efficient version of this mechanism that performs the Task Packing "in place", taking advantage of idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using the FT and miniFE benchmarks. Results show that our proposal generates low overhead, and that the number of freed CPUs is related to a load imbalance metric that can serve as a predictor for it. (A sketch of such a metric follows this entry.)
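The abstract relates the number of freed CPUs to a load imbalance metric without defining it; a standard formulation is max over mean of per-task busy time. The sketch below computes that metric and the corresponding upper bound on freeable CPUs, as an illustrative assumption rather than the paper's definition.

```python
# Hedged sketch of a load imbalance metric (max/mean of per-CPU busy
# time) and the implied upper bound on how many CPUs packing could
# free. This formulation is assumed, not taken from the paper.

import math

def imbalance(busy_times):
    """1.0 means perfectly balanced; larger means more idle cycles."""
    return max(busy_times) / (sum(busy_times) / len(busy_times))

def max_freeable_cpus(busy_times, period):
    """Total idle time divided by the period, rounded down."""
    idle = sum(period - b for b in busy_times)
    return math.floor(idle / period)

busy = [9.0, 6.0, 4.0, 1.0]   # seconds of useful work per CPU per period
print(imbalance(busy))                        # 1.8
print(max_freeable_cpus(busy, period=10.0))   # 2
```

Note how this matches the packing example under entry 28: the same four loads collapse onto two cores, exactly the bound the idle-time computation predicts.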
43. Combining static and dynamic data coalescing in unified parallel C
- Author: Michail Alvanos, Ettore Tiotto, Xavier Martorell, José Nelson Amaral, Montse Farreras, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects: Computer science, One-sided communication, Parallel computing, C (Llenguatge de programació), Unified Parallel C, Code generation, Partitioned global address space, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Informàtica::Arquitectura de computadors::Arquitectures distribuïdes [Àrees temàtiques de la UPC], Distributed database, Parallel processing (Electronic computers), Dynamic data, Processament en paral·lel (Ordinadors), Supercomputer, Computational Theory and Mathematics, Hardware and Architecture, C (Computer program language), Signal Processing, Performance evaluation
- Abstract
Significant progress has been made in the development of programming languages and tools that are suitable for hybrid computer architectures that group several shared-memory multicores interconnected through a network. This paper addresses important limitations in code generation for partitioned global address space (PGAS) languages. These languages allow fine-grained communication and lead to programs that perform many fine-grained accesses to data. When the data is distributed to remote computing nodes, code transformations are required to prevent performance degradation. Until now, code transformations for PGAS programs have been restricted to the cases where either the physical mapping of the data or the number of processing nodes is known at compilation time. In this paper, a novel application of the inspector-executor model overcomes these limitations and allows profitable code transformations, which result in fewer and larger messages sent through the network, when neither the data mapping nor the number of processing nodes is known at compilation time. A performance evaluation reports both scaling and absolute performance numbers on up to 32,768 cores of a Power 775 supercomputer. This evaluation indicates that the compiler transformation results in speedups between 1.15× and 21× over a baseline, and that these automated transformations achieve up to 63 percent of the performance of the MPI versions.