"Montse Farreras" / Topic: 02 engineering and technology - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Montse Farreras"' showing total 8 results



1. Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs

Author: José Nelson Amaral, Ettore Tiotto, Xavier Martorell, Michail Alvanos, Montse Farreras, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Computer Networks and Communications, Computer science, Parallel programming (Computer science), Optimizing compiler, 02 engineering and technology, Parallel computing, Programació en paral·lel (Informàtica), Theoretical Computer Science, Artificial Intelligence, Unified Parallel C, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), Compiler optimization, Instrumentation (computer programming), Partitioned global address space, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], computer.programming_language, Address space, Communication, Locality, Unified parallel C, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Hardware and Architecture, Programming paradigm, 020201 artificial intelligence & image processing, computer, Software
Abstract: We improve performance of fine-grain UPC applications by orders of magnitude. We introduce a novel shared-data localization transformation. We present a thorough performance analysis and evaluation. We show that reducing run-time calls is crucial for performance. We achieve performance comparable to C and MPI using the UPC programming model. Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance. This paper addresses this issue and introduces various techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs) [S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, 2003], the inlining of data locality checks, and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body. A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase performance of UPC programs up to 1.8× their UPC hand-optimized counterpart for applications with regular accesses and up to 6.3× for applications with irregular accesses.
Published: 2016
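The inspector-executor idea mentioned in the abstract of record 1 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's compiler transformation: the `SharedArray` class, its cyclic data layout, and the message counter are hypothetical stand-ins for a UPC shared array and its runtime. The inspector pass records which remote elements a loop will touch, one bulk transfer per owning thread brings them over, and the executor pass then computes from local copies only.

```python
from collections import defaultdict


class SharedArray:
    """Toy distributed array (hypothetical): counts simulated remote messages."""

    def __init__(self, data, num_threads):
        self.data = list(data)
        self.num_threads = num_threads
        self.messages = 0

    def owner(self, i):
        # Assumed cyclic layout: element i lives on thread i % num_threads.
        return i % self.num_threads

    def get(self, i):
        # One fine-grained remote read costs one message.
        self.messages += 1
        return self.data[i]

    def bulk_get(self, indices):
        # One coalesced read costs one message, however many elements it carries.
        self.messages += 1
        return {i: self.data[i] for i in indices}


def fine_grained_sum(arr, indices, me):
    """Baseline: every remote element is fetched with its own message."""
    total = 0
    for i in indices:
        total += arr.data[i] if arr.owner(i) == me else arr.get(i)
    return total


def inspector_executor_sum(arr, indices, me):
    """Inspector collects remote indices, then the executor uses local copies."""
    # Inspector: record which remote elements the loop will need, per owner.
    remote = defaultdict(set)
    for i in indices:
        if arr.owner(i) != me:
            remote[arr.owner(i)].add(i)
    # Coalescing: a single bulk transfer per owning thread.
    local_copy = {}
    for wanted in remote.values():
        local_copy.update(arr.bulk_get(sorted(wanted)))
    # Executor: the original loop body, now reading only local data.
    total = 0
    for i in indices:
        total += arr.data[i] if arr.owner(i) == me else local_copy[i]
    return total
```

In this toy model the extra inspector loop is the instrumentation overhead that the abstract says must be kept small; the benefit is that the message count depends on the number of owning threads rather than on the number of accessed elements.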

2. Task Packing: Getting the Best from MPI Unbalanced Applications

Author: Montse Farreras, Jordi Fornes, and Gladys Utrera
Subjects: 020203 distributed computing, Computer science, 020209 energy, Computation, Distributed computing, Task mapping, 02 engineering and technology, Parallel computing, Load balancing (computing), Idle, Knapsack problem, Scalability, 0202 electrical engineering, electronic engineering, information engineering, Subset sum problem
Abstract: In this work we propose a Task Packing mechanism that concentrates the idle cycles of unbalanced applications in such a way that one or more cores are freed from execution. To achieve that, we load the cores with only the useful work of the parallel application tasks, provided performance is not degraded. Tasks are "packed" into a minimum number of cores using oversubscription. To map tasks to cores and compute the minimum number of cores, we apply the Subset Sum algorithm, which is a particular case of the Knapsack problem. Our experiments demonstrate that task packing using oversubscription is possible without performance degradation. In this sense, the mechanism is able to make accurate allocation decisions, leaving room for executing other applications or just keeping other cores idle. Our proposal is scalable, as the task allocation decisions are based only on local information and task migrations are performed only within each node.
Published: 2017
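The packing step described in the abstract of record 2 can be sketched as a repeated Subset Sum search. The code below is an illustration under assumptions, not the authors' implementation: task loads are modeled as positive integer percentages of useful work per core, `capacity=100` stands for one fully used core, and the function names `best_subset` and `pack_tasks` are hypothetical. Each round picks the subset of remaining tasks whose total load comes closest to the capacity without exceeding it, and assigns that subset to one core.

```python
def best_subset(loads, capacity):
    """Return indices of a subset whose total is the largest value <= capacity.

    Classic 0/1 subset-sum dynamic programming over reachable sums.
    Loads are assumed to be positive integers.
    """
    reachable = {0: []}  # reachable sum -> indices that produce it
    for idx, load in enumerate(loads):
        # Iterate over a snapshot so each task is used at most once per subset.
        for total, chosen in list(reachable.items()):
            new_total = total + load
            if new_total <= capacity and new_total not in reachable:
                reachable[new_total] = chosen + [idx]
    return reachable[max(reachable)]


def pack_tasks(loads, capacity=100):
    """Greedily pack tasks onto cores, filling one core per round."""
    remaining = list(range(len(loads)))
    cores = []
    while remaining:
        chosen = best_subset([loads[i] for i in remaining], capacity)
        if not chosen:
            raise ValueError("a remaining task exceeds the core capacity")
        picked = [remaining[j] for j in chosen]
        cores.append(picked)
        picked_set = set(picked)
        remaining = [i for i in remaining if i not in picked_set]
    return cores
```

For example, `pack_tasks([60, 40, 30, 70, 50, 50])` places six tasks, each of which would otherwise hold a core of its own, onto three fully loaded cores. Filling one core at a time is a heuristic and does not guarantee the minimum number of cores in every case.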

3. Improving performance of all-to-all communication through loop scheduling in PGAS environments

Author: Xavier Martorell, Montse Farreras, Michail Alvanos, José Nelson Amaral, Gabriel Tanase, Ettore Tiotto, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Computer science, education, Informàtica::Enginyeria del software [Àrees temàtiques de la UPC], 02 engineering and technology, Parallel computing, computer.software_genre, 020204 information systems, 0502 economics and business, Unified Parallel C, 0202 electrical engineering, electronic engineering, information engineering, Partitioned global address space, IBM, computer.programming_language, Software engineering, 05 social sciences, unified parallel c, Supercomputer, All-to-all communication, performance evaluation, Loop scheduling, partitioned global address space, Operating system, 050211 marketing, one-sided communication, Enginyeria de programari, computer, Research center
Abstract: Michail Alvanos, Programming Models, Barcelona Supercomputing Center, malvanos@bsc.es; Gabriel Tanase, IBM TJ Watson Research Center, Yorktown Heights, NY, US, igtanase@us.ibm.com; Montse Farreras, Dep. of Computer Architecture, Universitat Politecnica de Catalunya, mfarrera@ac.upc.edu; Ettore Tiotto, Static Compilation Technology, IBM Toronto Laboratory, etiotto@ca.ibm.com; Jose Nelson Amaral, Dep. of Computing Science, University of Alberta, jamaral@ualberta.ca; Xavier Martorell, Dep. of Computer Architecture, Universitat Politecnica de Catalunya, xavim@ac.upc.edu
Published: 2013

4. Improving communication in PGAS environments: Static and dynamic coalescing in UPC

Author: José Nelson Amaral, Michail Alvanos, Ettore Tiotto, Montse Farreras, Xavier Martorell, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: 020203 distributed computing, Software engineering, Computer science, Informàtica::Enginyeria del software [Àrees temàtiques de la UPC], One-sided communication, Optimizing compiler, 02 engineering and technology, Parallel computing, Unified parallel c, 020202 computer hardware & architecture, Data mapping, Unified Parallel C, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), Performance evaluation, Overhead (computing), Partitioned global address space, Enginyeria de programari, Programmer, computer, computer.programming_language, Compile time
Abstract: The goal of Partitioned Global Address Space (PGAS) languages is to improve programmer productivity on large-scale parallel machines. However, PGAS programs may have many fine-grained shared accesses that lead to performance degradation. Manual code transformations or compiler optimizations are required to improve the performance of programs with fine-grained accesses. The downside of manual code transformations is the increased program complexity that hinders programmer productivity. On the other hand, most compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This paper presents an optimization for the Unified Parallel C language that combines compile-time (static) and runtime (dynamic) coalescing of shared data, without knowledge of the physical data mapping. Larger messages increase network efficiency, and static coalescing decreases the overhead of library calls. The performance evaluation uses two microbenchmarks and three benchmarks to obtain scaling and absolute performance numbers on up to 32768 cores of a Power 775 machine. Our results show that the compiler transformation results in speedups from 1.15× up to 21× compared with the baseline versions and that the transformed programs achieve up to 63% of the performance of the MPI versions.
Published: 2013

5. All-optical packet/circuit switching-based data center network for enhanced scalability, latency and throughput

Author: Iftekhar Hussain, Roberto Proietti, Steluta Iordache, Reza Nejabati, George Zervas, Montse Farreras, Sergio Ricciardi, Salvatore Spadaro, Lei Liu, Jun Luo, Shuping Peng, Giacomo Bernini, Stefano Di Lucente, Davide Careglio, Dimitra Simeonidou, Jose Carlos Sancho, Harm J. S. Dorren, Jordi Perello, Yolanda Becerra, Matteo Biancani, Alessandro Predieri, Nicola Calabretta, Yawei Yin, Nicola Ciulli, Chris Liou, Electro-Optical Communication, and Low Latency Interconnect Networks
Subjects: Circuit switching, Computer Networks and Communications, business.industry, Network packet, Computer science, 02 engineering and technology, Optical burst switching, 01 natural sciences, 010309 optics, LAN switching, 020210 optoelectronics & photonics, Packet switching, Burst switching, Hardware and Architecture, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Forwarding plane, Fast packet switching, business, Software, Information Systems, Computer network
Abstract: Applications running inside data centers are enabled through the cooperation of thousands of servers arranged in racks and interconnected through the data center network (DCN). Current DCN architectures based on electronic devices are neither scalable enough to face the massive growth of data centers (DCs), nor flexible enough to efficiently and cost-effectively support highly dynamic application traffic profiles. The FP7 European Project LIGHTNESS foresees extending the capabilities of today's electrical DCNs through the introduction of optical packet switching (OPS) and optical circuit switching (OCS) paradigms, together realizing an advanced and highly scalable DCN architecture for ultra-high-bandwidth and low-latency server-to-server interconnection. This article reviews the current DC and high-performance computing (HPC) outlooks, followed by an analysis of the main requirements for future DCs and HPC platforms. As the key contribution of the article, the LIGHTNESS DCN solution is presented, elaborating in depth on the envisioned DCN data plane technologies, as well as on the unified SDN-enabled control plane architectural solution that will empower OPS and OCS transmission technologies with superior flexibility, manageability, and customizability.
Published: 2013

6. Productive cluster programming with OmpSs

Author: Alejandro Duran, Xavier Martorell, Javier Bueno, Jesús Labarta, Rosa M. Badia, Montse Farreras, Eduard Ayguadé, Luis Martinell, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Data parallelism, Computer science, Fortran, media_common.quotation_subject, Task parallelism, 02 engineering and technology, Parallel computing, Remote node, computer.software_genre, Runtime system, 0202 electrical engineering, electronic engineering, information engineering, Master node, Distribute shared memory, Programmer, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], computer.programming_language, media_common, 020203 distributed computing, Parallel processing (Electronic computers), Programming language, Processament en paral·lel (Ordinadors), Debugging, Programming paradigm, Address space, 020201 artificial intelligence & image processing, Compiler, Instruction-level parallelism, computer
Abstract: Clusters of SMPs are ubiquitous. They have traditionally been programmed using MPI, but the productivity of MPI programmers is low because of the complexity of expressing parallelism and communication and the difficulty of debugging. To ease the burden on the programmer, new programming models have tried to give the illusion of a global shared address space (e.g., UPC, Co-array Fortran). Unfortunately, these models do not support the increasingly common irregular forms of parallelism that require asynchronous task parallelism. Other models, such as X10 or Chapel, provide this asynchronous parallelism but require the programmer to rewrite the application entirely. We present the implementation of OmpSs for clusters, a variant of OpenMP extended to support asynchrony, heterogeneity and data movement for task parallelism. Like OpenMP, it is based on decorating an existing serial version with compiler directives that are translated into calls to a runtime system that manages parallelism extraction, data coherence and data movement. Thus, the same program written in OmpSs can run on a regular SMP machine or on clusters of SMPs, or can even be used for debugging with the serial version. The runtime uses the information provided by the programmer to distribute the work across the cluster while optimizing communication through affinity scheduling and caching of data. We have evaluated our proposal with a set of kernels, and the OmpSs versions obtain performance comparable, or even superior, to that obtained by the same version of MPI.
Published: 2011

7. Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization

Author: Gladys Utrera, Jordi Fornes, Montse Farreras, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Combinatorial optimization, Computer Networks and Communications, Computer science, Computation, CPU time, 02 engineering and technology, Parallel computing, Optimització combinatòria, Theoretical Computer Science, Scheduling (computing), Idle, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Knapsack algorithm, 020206 networking & telecommunications, Data structure, Hardware and Architecture, Knapsack problem, HPC, Oversubscription, 020201 artificial intelligence & image processing, MPI, High performance computing, Load balancing, Càlcul intensiu (Informàtica), Software
Abstract: Load imbalance in parallel systems can be caused by factors external to the currently running applications, such as operating system noise, or by the underlying hardware, such as a heterogeneous cluster. HPC applications working on irregular data structures can also have difficulty balancing their computations across the parallel tasks. In this article we extend, improve and evaluate in more depth the Task Packing mechanism proposed in a previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, CPUs are loaded with only the useful work of the parallel application tasks, provided performance is not degraded. The packing, into a minimum number of CPUs and using oversubscription, is solved by an algorithm based on the Knapsack problem. We design and implement a more efficient version of this mechanism. To that end, we perform the Task Packing "in place", taking advantage of idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using the FT and miniFE benchmarks. Results show that our proposal generates low overhead. In addition, the number of freed CPUs is related to a load imbalance metric, which can be used to predict it.

8. Combining static and dynamic data coalescing in unified parallel C

Author: Michail Alvanos, Ettore Tiotto, Xavier Martorell, José Nelson Amaral, Montse Farreras, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Computer science, One-sided communication, 02 engineering and technology, Parallel computing, C (Llenguatge de programació), Unified Parallel C, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), Code generation, Partitioned global address space, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], computer.programming_language, Informàtica::Arquitectura de computadors::Arquitectures distribuïdes [Àrees temàtiques de la UPC], Distributed database, Parallel processing (Electronic computers), Dynamic data, Processament en paral·lel (Ordinadors), Unified parallel C, Supercomputer, 020202 computer hardware & architecture, Computational Theory and Mathematics, Hardware and Architecture, C (Computer program language), Signal Processing, Performance evaluation, 020201 artificial intelligence & image processing, computer
Abstract: Significant progress has been made in the development of programming languages and tools that are suitable for hybrid computer architectures that group several shared-memory multicores interconnected through a network. This paper addresses important limitations in the code generation for partitioned global address space (PGAS) languages. These languages allow fine-grained communication and lead to programs that perform many fine-grained accesses to data. When the data is distributed to remote computing nodes, code transformations are required to prevent performance degradation. Until now, code transformations to PGAS programs have been restricted to the cases where both the physical mapping of the data and the number of processing nodes are known at compilation time. In this paper, a novel application of the inspector-executor model overcomes these limitations and allows profitable code transformations, which result in fewer and larger messages sent through the network, when neither the data mapping nor the number of processing nodes is known at compilation time. A performance evaluation reports both scaling and absolute performance numbers on up to 32,768 cores of a Power 775 supercomputer. This evaluation indicates that the compiler transformation results in speedups between 1.15× and 21× over a baseline and that these automated transformations achieve up to 63 percent of the performance of the MPI versions.
