Descriptor: "Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC]" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC]"' showing total 304 results


1. Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

Author: Catalán Pallarés, Sandra, Igual Peña, Francisco D., Herrero Zaragoza, José Ramón, Rodríguez Sánchez, Rafael, Quintana Ortí, Enrique Salvador, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. PM - Programming Models
Subjects: NUMA architectures, Computer Networks and Communications, Parallel programming (Computer science), Gestió de memòria (Informàtica), Dense linear algebra, Programació en paral·lel (Informàtica), Theoretical Computer Science, Memory management (Computer science), Artificial Intelligence, Hardware and Architecture, Portability, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Chiplets, Shared memory programming, Software
Abstract: We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming. This research was sponsored by project PID2019-107255GB of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445. This project has also received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. 
The JU receives support from the European Union’s Horizon 2020 research and innovation programme, and Spain, Germany, France, Italy, Poland, Switzerland, Norway. The work is also supported by grants PID2020-113656RB-C22 and PID2021-126576NB-I00 of MCIN/AEI/10.13039/501100011033 and by ERDF A way of making Europe.
Published: 2023
Full Text: View/download PDF

2. Mitigating the NUMA effect on task-based runtime systems

Author: Maroñas Bravo, Marcos, Navarro Muñoz, Antoni, Ayguadé Parra, Eduard, Beltran Querol, Vicenç, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. PM - Programming Models
Subjects: Application program interfaces (Computer software), Task-aware, Parallel processing (Electronic computers), Scheduling, Parallel programming model, Processament en paral·lel (Ordinadors), Parallel programming (Computer science), Interfícies de programació d'aplicacions (Programari), Programació en paral·lel (Informàtica), Theoretical Computer Science, Hardware and Architecture, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], NUMA-awareness, OmpSs-2, Software, Information Systems
Abstract: Processors with multiple sockets or chiplets are becoming more conventional. These kinds of processors usually expose a single shared address space. However, due to hardware restrictions, they adopt a NUMA approach, where each processor accesses local memory faster than remote memories. Reducing data motion is crucial to improve the overall performance. Thus, computations must run as close as possible to where the data resides. We propose a new approach that mitigates the NUMA effect on NUMA systems. Our solution is based on the OmpSs-2 programming model, a task-based parallel programming model, similar to OpenMP. We first provide a simple API to allocate memory in NUMA systems using different policies. Then, combining user-given information that specifies dependences between tasks, and information collected in a global directory when allocating data, we extend our runtime library to perform NUMA-aware work scheduling. Our heuristic considers data location, distance between NUMA nodes, and the load of each NUMA node to seamlessly minimize data motion costs and load imbalance. Our evaluation shows that our NUMA support can significantly mitigate the NUMA effect by reducing the amount of remote accesses, and so improving performance on most benchmarks, reaching up to 2x speedup in a 2-NUMA machine, and up to 7.1x in an 8-NUMA machine. This research has received funding from the European Union’s Horizon 2020/EuroHPC research and innovation programme under grant agreement No 955606 (DEEP-SEA), project PCI2021121958 financed by the Spanish State Research Agency - Ministry of Science and Innovation, Generalitat de Catalunya (contract 2021-SGR-01007), the Spanish Ministry of Science and Technology (contract PID2019-107255GB), and Severo Ochoa (CEX2021-001148-S / MCIN/AEI/10.13039/501100011033).
Published: 2023
Full Text: View/download PDF

3. Accelerating Edit-Distance Sequence Alignment on GPU Using the Wavefront Algorithm

Author: Quim Aguado-Puig, Santiago Marco-Sola, Juan Carlos Moure, David Castells-Rufas, Lluc Alvarez, Antonio Espinosa, Miquel Moreto, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Informàtica::Aplicacions de la informàtica::Bioinformàtica [Àrees temàtiques de la UPC], Approximate string matching, General Computer Science, Levenshtein distance, General Engineering, Genomics, Edit-distance, Pairwise sequence alignment, Unitats de processament gràfic, Genòmica, Compute unified device architecture (CUDA), Wavefront alignment algorithm (WFA), General Materials Science, High performance computing, Electrical and Electronic Engineering, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Graphics processing units, Càlcul intensiu (Informàtica)
Abstract: Sequence alignment remains a fundamental problem with practical applications ranging from pattern recognition to computational biology. Traditional algorithms based on dynamic programming are hard to parallelize, require significant amounts of memory, and fail to scale for large inputs. This work presents eWFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute the exact edit-distance sequence alignment based on the wavefront alignment algorithm (WFA). This approach exploits the similarities between the input sequences to accelerate the alignment process while requiring less memory than other algorithms. Our implementation takes full advantage of the massive parallel capabilities of modern GPUs to accelerate the alignment process. In addition, we propose a succinct representation of the alignment data that successfully reduces the overall amount of memory required, allowing the exploitation of the fast shared memory of a GPU. Our results show that our GPU implementation outperforms by 3-9× the baseline edit-distance WFA implementation running on a 20-core machine. As a result, eWFA-GPU is up to 265 times faster than state-of-the-art CPU implementations, and up to 56 times faster than state-of-the-art GPU implementations.
This work was supported in part by the European Union’s Horizon 2020 Framework Program through the DeepHealth Project under Grant 825111; in part by the European Union Regional Development Fund within the Framework of the European Regional Development Fund (ERDF) Operational Program of Catalonia 2014–2020 with a Grant of 50% of Total Cost Eligible through the Designing RISC-V-based Accelerators for next-generation Computers Project under Grant 001-P-001723; in part by the Ministerio de Ciencia e Innovación (MCIN) Agencia Estatal de Investigación (AEI)/10.13039/501100011033 under Contract PID2020-113614RB-C21 and Contract TIN2015-65316-P; and in part by the Generalitat de Catalunya (GenCat)-Departament de Recerca i Universitats (DIUiE) (GRR) under Contract 2017-SGR-313, Contract 2017-SGR-1328, and Contract 2017-SGR-1414. The work of Miquel Moreto was supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal Fellowship under Grant RYC-2016-21104.
Published: 2022
Full Text: View/download PDF

4. DynAMO: Improving parallelism through dynamic placement of atomic memory operations

Author: Soria Pardos, Víctor, Armejach Sanosa, Adrià, Mück, Tiago, Suárez Gracía, Dario, Joao, Jose A., Rico, Alejandro, Moreto Planas, Miquel, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Atomic memory operations, Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Sistemes monoxip, Systems on a chip, Multi-core architectures, Data placement, Microarchitecture, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC]
Abstract: With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far). This paper evaluates currently available static AMO execution policies implemented in multi-core Systems-on-Chip (SoC) designs, which select AMOs' execution placement (near or far) based on the cache block coherence state. We propose three static policies and show that the performance of static policies is application dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations. Furthermore, we propose DynAMO, a predictor that selects the best location to execute the AMOs. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09× across all workloads and 1.31× on AMO-intensive applications with respect to executing all AMOs near. This research was supported by the Spanish Ministry of Science and Innovation (MCIN) through contracts [PID2019-107255GB-C21], [TED2021-132634A-I00], and [PID2019-105660RB-C21]; the Generalitat of Catalunya through contract [2021-SGR-00763]; the Government of Aragon [T5820R]; the Arm-BSC Center of Excellence, and the European Processor Initiative (EPI) which is part of the European Union’s Horizon 2020 research and innovation program under grant agreement No. 826647. V. Soria-Pardos has been supported through an FPU fellowship [FPU20-02132]; A. Armejach is a Serra Hunter Fellow and has been partially supported by the Grant [IJCI-2017-33945] funded by MCIN/AEI/10.13039/501100011033; M. Moreto through a Ramón y Cajal fellowship [RYC-2016-21104].
Published: 2023

5. OmpSs-2 and OpenACC interoperation

Author: Orestis Korakitis, Simon Garcia de Gonzalo, Nicolas Guidotti, Joao Barreto, Jose Monteiro, Antonio J. Pena, and Barcelona Supercomputing Center
Subjects: Parallel processing (Electronic computers), Task based, Data Flow Paradigm, GPU, Multiprocessors, Parallelism, Runtime Scheduling, Programming Productivity, Code Transformation, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Graphics processing units
Abstract: We propose an interoperation mechanism to enable novel composability across pragma-based programming models. We study and propose a clear separation of duties and implement our approach by augmenting the OmpSs-2 programming model, compiler and runtime system to support OmpSs-2 + OpenACC programming. To validate our proposal we port ZPIC, a kinetic plasma simulator, to leverage our hybrid OmpSs-2 + OpenACC implementation. We compare our approach against OpenACC versions of ZPIC on a multi-GPU HPC system. We show that our approach manages to provide automatic asynchronous and multi-GPU execution, removing significant burden from the application’s developer, while also being able to outperform manually programmed versions, thanks to a better utilization of the hardware. This work has been part of the EPEEC project. The EPEEC project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 801051. This paper was also partially funded by the Ministerio de Ciencia e Innovación Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033). We gratefully acknowledge the support of NVIDIA AI Technology Center (NVAITC) Europe who provided us the remote access to NVIDIA DGX-1
Published: 2023

6. GPU acceleration of Levenshtein distance computation between long strings

Author: David Castells-Rufas and Barcelona Supercomputing Center
Subjects: Computer Networks and Communications, Edit distance, Levenshtein distance, GPU, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, WFA algorithm, Artificial Intelligence, Hardware and Architecture, Parallel processing, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Microprocessors, Graphics processing units, Software
Abstract: Computing edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the elements to be computed, providing additional performance gains. The implementation makes it possible to compute the edit distance between strings having hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than a million characters, the performance is the best ever reported, which is above TCUPS for strings with similarities greater than 70% and above one hundred TCUPS for 99.9% similarity. This research was supported by the European Union Regional Development Fund (ERDF) within the framework of the ERDF Operational Program of Catalonia 2014–2020 with a grant of 50% of the total cost eligible under the Designing RISC-V based Accelerators for next generation computers project (DRAC) [001-P-001723], in part by the Catalan Government under grant 2017-SGR-1624, and in part by the Spanish Ministry of Science, Innovation and Universities under grant RTI2018-095209-B-C22.
Published: 2023

7. Improving the performance of classical linear algebra iterative methods via hybrid parallelism

Author: Pedro J. Martinez-Ferrer, Tufan Arslan, Vicenç Beltran, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. PM - Programming Models
Subjects: FOS: Computer and information sciences, Algebras, Linear, Computer Networks and Communications, G.1.3, Distributed-memory, Theoretical Computer Science, Shared-memory, Artificial Intelligence, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), Linear algebra, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], 15-04, Computer Science - Performance, Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), I.6.3, Hybrid parallelism, Informàtica::Informàtica teòrica::Algorísmica i teoria de la complexitat [Àrees temàtiques de la UPC], Performance (cs.PF), Computer Science - Distributed, Parallel, and Cluster Computing, Hardware and Architecture, MPI, Distributed, Parallel, and Cluster Computing (cs.DC), Àlgebra lineal, Software
Abstract: We propose fork-join and task-based hybrid implementations of four classical linear algebra iterative methods (Jacobi, Gauss-Seidel, conjugate gradient and biconjugate gradient stabilised) as well as variations of them. Algorithms are duly documented and the corresponding source code is made publicly available for reproducibility. Both weak and strong scalability benchmarks are conducted to statistically analyse their relative efficiencies. The weak scalability results assert the superiority of a task-based hybrid parallelisation over MPI-only and fork-join hybrid implementations. Indeed, the task-based model is able to achieve speedups of up to 25% larger than its MPI-only counterpart depending on the numerical method and the computational resources used. For strong scalability scenarios, hybrid methods based on tasks remain more efficient with moderate computational resources where data locality does not play an important role. Fork-join hybridisation often yields mixed results and hence does not present a competitive advantage over a much simpler MPI approach. Comment: 33 pages, 6 figures; accepted manuscript in Journal of Parallel and Distributed Computing.
Published: 2023
Full Text: View/download PDF

8. Dynamic spawning of MPI processes applied to malleability

Author: Iker Martín-Álvarez, José I Aliaga, Maribel Castillo, Sergio Iserte, Rafael Mayo, and Barcelona Supercomputing Center
Subjects: Hardware and Architecture, Application reconfiguration, Threading, MPI, Malleability, High performance computing, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Process spawning, Software, Theoretical Computer Science
Abstract: Malleability allows computing facilities to adapt their workloads through resource management systems to maximize the throughput of the facility and the efficiency of the executed jobs. This technique is based on reconfiguring a job to a different resource amount during execution and then continuing with it. One of the stages of malleability is the dynamic spawning of processes in execution time, where different decisions in this stage will affect how the next stage of data redistribution is performed, which is the most time-consuming stage. This paper describes different methods and strategies, defining eight different alternatives to spawn processes dynamically and indicates which one should be used depending on whether a strong or weak scaling application is being used. In addition, it is described for both types of applications which strategies benefit most the application performance or the system productivity. The results show that reducing the number of spawning processes by reusing the older ones can reduce reconfiguration time compared to the classical method by up to 2.6 times for expanding and up to 36 times for shrinking. Furthermore, the asynchronous strategy requires analysing the impact of oversubscription on application performance. This work has been funded by the following projects: project PID2020-113656RB-C21 supported by MCIN/AEI/10.13039/501100011033 and project UJI-B2019-36 supported by Universitat Jaume I. Researcher S. Iserte was supported by the postdoctoral fellowship APOSTD/2020/026, and researcher I. Martín-Álvarez was supported by the predoctoral fellowship ACIF/2021/260, both from Valencian Region Government and European Social Funds.
Published: 2023
Full Text: View/download PDF

9. Seamless optimization of the GEMM kernel for task-based programming models

Author: Lorenzon, Arthur F., Marques, Sandro M. V. N., Navarro Muñoz, Antoni, Beltran Querol, Vicenç, and Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors
Subjects: Parallel computing, Energy-efficiency, Parallel processing (Electronic computers), GEMM, Microprocessadors -- Consum d'energia, Processament en paral·lel (Ordinadors), Microprocessors -- Energy consumption, Malleability, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC]
Abstract: The general matrix-matrix multiplication (GEMM) kernel is a fundamental building block of many scientific applications. Many libraries such as Intel MKL and BLIS provide highly optimized sequential and parallel versions of this kernel. The parallel implementations of the GEMM kernel rely on the well-known fork-join execution model to exploit multi-core systems efficiently. However, these implementations are not well suited for task-based applications as they break the data-flow execution model. In this paper, we present a task-based implementation of the GEMM kernel that can be seamlessly leveraged by task-based applications while providing better performance than the fork-join version. Our implementation leverages several advanced features of the OmpSs-2 programming model and a new heuristic to select the best parallelization strategy and blocking parameters based on the matrix and hardware characteristics. When evaluating the performance and energy consumption on two modern multi-core systems, we show that our implementations provide significant performance improvements over an optimized OpenMP fork-join implementation, and can beat vendor implementations of the GEMM (e.g., Intel MKL and AMD AOCL). We also demonstrate that a real application can leverage our optimized task-based implementation to enhance performance.
Published: 2022
Full Text: View/download PDF

10. XFeatur: Hardware Feature Extraction for DNN Auto-tuning

Author: Sierra Acosta, Jorge, Diavastos, Andreas, González Colás, Antonio María, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
Subjects: Artificial intelligence, Memory management (Computer science), Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Machine learning, TVM, Aprenentatge automàtic, Autotuning, Gestió de memòria (Informàtica), Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC]
Abstract: In this work, we extend the auto-tuning process of the state-of-the-art TVM framework with XFeatur, a tool that extracts new meaningful hardware-related features that improve the quality of the representation of the search space and consequently improve the accuracy of its prediction algorithm. These new features provide information about the amount of thread-level parallelism, shared memory usage, register usage, dynamic instruction count and memory access dependencies. Optimizing ResNet-18 with the proposed features improves the quality of the search space representation by 63% on average and a maximum of 2× for certain tasks, while it reduces the tuning time by 9% (approximately 1.1 hours) and produces configurations that have equal or better performance (up to 92.7%) than the baseline. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program and the FPU grant 2019-FPU-998758.
Published: 2022
Full Text: View/download PDF

11. A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU

Author: Mauricio Hanzich, Albert Farrés, Fumihiko Ino, Jingcheng Shen, and Barcelona Supercomputing Center
Subjects: Computer science, Stencil code, Stencil computation, GPU, OpenMP (Interfície de programació d'aplicacions), Parallel computing, Directive, Unitats de processament gràfic, Database-centric architecture, Data-centric optimizations, Artificial Intelligence, Hardware and Architecture, Data transmission systems, Out-of-core algorithm, Out-of-core computation, Computer Vision and Pattern Recognition, pipelined accelerator, Electrical and Electronic Engineering, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Graphics processing units, Software
Abstract: Special Section on Parallel, Distributed, and Reconfigurable Computing, and Networking. Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation. This study was supported in part by the Japan Society for the Promotion of Science KAKENHI Grant Numbers JP15H01687, JP16H02801, and JP20K21794.
Published: 2020
Full Text: View/download PDF

12. Fine‐grain task‐parallel algorithms for matrix factorizations and inversion on many‐threaded CPUs

Author: Sandra Catalán, José R. Herrero, Francisco D. Igual, Enrique S. Quintana‐Ortí, Rafael Rodríguez‐Sánchez, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Informática, Sistemas expertos, Matrix factorizations, Algebras, Linear, Parallel processing (Electronic computers), Computer Networks and Communications, CPUs, Matrix inversion, Processament en paral·lel (Ordinadors), Task parallelism, OpenMP, Computer Science Applications, Theoretical Computer Science, High performance, Computational Theory and Mathematics, High performance computing, Àlgebra lineal, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Càlcul intensiu (Informàtica), Software
Abstract: We extend a two-level task partitioning previously applied to the inversion of dense matrices via Gauss–Jordan elimination to the more challenging QR factorization as well as the initial orthogonal reduction to band form found in the singular value decomposition. Our new task-parallel algorithms leverage the tasking mechanism currently available in OpenMP to exploit “nested” task parallelism, with a first outer level that operates on matrix panels and a second inner level that processes the matrix either by µ-panels or by tiles, in order to expose a large number of independent tasks. We present a detailed performance analysis, including execution traces, which shows that the two-level refinement into fine grain tasks allows for an improved load balancing and delivers high performance on current general-purpose many-core processors (CPUs) from Intel and AMD. This research was sponsored by projects RTI2018-093684-B-I00, PID2019-107255GB and TIN2017-82972-R of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445.
Published: 2022

13. Towards OmpSs-2 and OpenACC interoperation

Author: Orestis Korakitis, Simon Garcia De Gonzalo, Nicolas Guidotti, João Pedro Barreto, José C. Monteiro, Antonio J. Peña, and Barcelona Supercomputing Center
Subjects: Code transformation, Supercomputadors, Data-flow paradigm, GPU, Parallel programming (Computer science), Parallelism, Programming productivity, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Data flow computing, Runtime scheduling
Abstract: The increasing demand in HPC to utilize accelerators has motivated the development of pragma-based directives to target these devices. OmpSs-2 and OpenACC are both directive-based solutions that allow application programmers to utilize accelerators. The two leverage distinct types of parallelism: task parallelism and data parallelism, respectively. Non-trivial scientific applications can benefit from both types of available parallelism. However, the combination of pragma-based models is difficult to coordinate, as both assume full control and are unaware of each other at runtime. We propose an interoperation mechanism to enable novel composability across pragma-based programming models. We study and propose a clear separation of duties and implement our approach by augmenting the OmpSs-2 programming model, compiler and runtime to support OmpSs-2 + OpenACC programming
Published: 2022
Full Text: View/download PDF

14. Acceleration strategies for large-scale sequential simulations using parallel neighbour search: Non-LVA and LVA scenarios

Author: Oscar Peredo, José R. Herrero, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Parallel computing, Algebras, Linear, Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Anisotropy, Anisotropia, Geostatistics, Computers in Earth Sciences, Àlgebra lineal, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Algorithms, Information Systems
Abstract: This paper describes the application of acceleration techniques to existing implementations of Sequential Gaussian Simulation and Sequential Indicator Simulation. These implementations might incorporate Locally Varying Anisotropy (LVA) to capture non-linear features of the underlying physical phenomena. The implementation focuses on a novel parallel neighbour search algorithm, which can be used on both non-LVA and LVA codes. Additionally, parallel shortest path executions and optimized linear algebra libraries are applied with focus on LVA codes. Execution time, speedup and accuracy results are presented. Non-LVA codes are benchmarked using two scenarios with approximately 50 million domain points each. Speedup results of 2× and 4× were obtained on SGS and SISIM respectively, where each scenario is compared against a baseline code published in Peredo et al. (2018). The aggregated contribution to speedup of both works results in 12× and 50× respectively. LVA codes are benchmarked using two scenarios with approximately 1.7 million domain points each. Speedup results of 56× and 1822× were obtained on SGS and SISIM respectively, where each scenario is compared against the original baseline sequential codes. The authors acknowledge the donated resources from project PID2019-107255GB of the Spanish Ministerio de Economía y Competitividad, and project 2017-SGR-1414 from the Generalitat de Catalunya, Spain.
Published: 2022

15. A Novel Set of Directives for Multi-device Programming with OpenMP

Author: Torres, Raul, Ferrer, Roger, Teruel, Xavier, and Barcelona Supercomputing Center
Subjects: Offloading, Multiprocessors, LLVM, Processament paral·lel (Ordinadors), Multi-device support, Heterogeneous architectures, OpenMP, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Graphics processing units, Multi-GPU, Language extension, Accelerators
Abstract: This work was supported by the MEEP project, which has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 946002. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Spain, Croatia, Turkey.
Published: 2022

16. Task-based acceleration of bidirectional recurrent neural networks on multi-core architectures

Author: Robin Kumar Sharma, Marc Casas, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Neural networks (Computer science), Parallel processing (Electronic computers), Bidirectional recurrent neural networks (BRNNs), Long-short term memory (LSTM), Gated recurrent units (GRU), Processament en paral·lel (Ordinadors), Task parallelism, Xarxes neuronals (Informàtica), Deep learning, Deep neural network (DNN), Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Aprenentatge profund
Abstract: This paper proposes a novel parallel execution model for Bidirectional Recurrent Neural Networks (BRNNs), B-Par (Bidirectional-Parallelization), which exploits data and control dependencies for forward and reverse input computations. B-Par divides BRNN workloads across different parallel tasks by defining input and output dependencies for each RNN cell in both forward and reverse orders. B-Par does not require per-layer barriers to synchronize the parallel execution of BRNNs. We evaluate B-Par on the TIDIGITS speech database and the Wikipedia dataset. Our experiments indicate that B-Par outperforms the state-of-the-art deep learning frameworks TensorFlow-Keras and PyTorch, achieving speed-ups of up to 2.34× and 9.16×, respectively, on modern multi-core CPU architectures while preserving accuracy. Moreover, we analyze in detail aspects such as task granularity, locality, and parallel efficiency to illustrate the benefits of B-Par. This work is partially supported by the Generalitat de Catalunya (contract 2017-SGR-1414) and the Spanish Ministry of Science and Technology through the PID2019-107255GB project. Marc Casas has been supported by the Spanish Ministry of Economy, Industry and Competitiveness under the Ramon y Cajal fellowship No. RYC-2017-23269.
Published: 2022

17. TD-NUCA: runtime driven management of NUCA caches in task dataflow programming models

Author: Paul Caheny, Lluc Alvarez, Marc Casas, Miquel Moreto, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Energy consumption, Parallel architectures, Memory management (Computer science), Parallel processing (Electronic computers), Cache memory, Processament en paral·lel (Ordinadors), Energia -- Consum, Gestió de memòria (Informàtica), Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Data flow computing
Abstract: In high performance processors, the design of on-chip memory hierarchies is crucial for performance and energy efficiency. Current processors rely on large shared Non-Uniform Cache Architectures (NUCA) to improve performance and reduce data movement. Multiple solutions exploit information available at the microarchitecture level or in the operating system to optimize NUCA performance. However, existing methods have not taken advantage of the information captured by task dataflow programming models to guide the management of NUCA caches. In this paper we propose TD-NUCA, a hardware/software co-designed approach that leverages information present in the runtime system of task dataflow programming models to efficiently manage NUCA caches. TD-NUCA identifies the data access and reuse patterns of parallel applications in the runtime system and guides the operation of the NUCA caches in the hardware. As a result, TD-NUCA achieves a 1.18x average speedup over the baseline S-NUCA while requiring only 0.62x the data movement. This work has been supported by the Spanish Ministry of Science and Technology (contract PID2019-107255GB-C21) and the Generalitat de Catalunya (contract 2017-SGR-1414). M. Casas has been partially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF ‘Investing in your future’. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship No. RYC-2016-21104.
Published: 2022

18. OmpSs@cloudFPGA: An FPGA task-based programming model with message passing

Author: Juan Miguel de Haro, Ruben Cano, Carlos Alvarez, Daniel Jimenez-Gonzalez, Xavier Martorell, Eduard Ayguade, Jesus Labarta, Francois Abel, Burkhard Ringlein, Beat Weiss, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Application program interfaces (Computer software), Parallel processing (Electronic computers), Heterogeneous programming, Processament en paral·lel (Ordinadors), Programming models, OpenMP, Interfícies de programació d'aplicacions (Programari), Network-attached FPGA, Stand-alone FPGA, High-level synthesis, Supercomputers -- Energy consumption, MPI, Supercomputadors -- Consum d'energia, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], High-performance computing, FPGA
Abstract: Nowadays, a new parallel paradigm for energy-efficient heterogeneous hardware infrastructures is required to achieve better performance at a reasonable cost on high-performance computing applications. Under this new paradigm, some application parts are offloaded to specialized accelerators that run faster or are more energy-efficient than CPUs. Field-Programmable Gate Arrays (FPGAs) are one type of accelerator that is becoming widely available in data centers. This paper proposes OmpSs@cloudFPGA, which includes novel extensions to parallel task-based programming models that enable easy and efficient programming of heterogeneous clusters with FPGAs. The programmer only needs to annotate, with OpenMP-like pragmas, the tasks of the application that should be accelerated in the cluster of FPGAs. Next, the proposed programming model framework automatically extracts the parts annotated with High-Level Synthesis (HLS) pragmas and synthesizes them into hardware accelerator cores for FPGAs. Additionally, our extensions support two novel features: 1) FPGA-to-FPGA direct communication using an Application Programming Interface (API) similar to the Message Passing Interface (MPI), with one-to-one and collective communications, to alleviate the host communication channel bottleneck, and 2) creating and spawning work from inside the FPGAs to their own accelerator cores based on an MPI rank-like identification. These features break the classical host-accelerator model, where the host (typically the CPU) generates all the work and distributes it to each accelerator. We also present an evaluation of OmpSs@cloudFPGA for different parallel strategies of the N-Body application on the IBM cloudFPGA research platform. Results show that for cluster sizes up to 56 FPGAs, the performance scales linearly. To the best of our knowledge, this is the best performance obtained for N-body over FPGA platforms, reaching 344 Gpairs/s with 56 FPGAs.
Finally, we compare the performance and power consumption of the proposed approach with the ones obtained by a classical execution on the MareNostrum 4 supercomputer, demonstrating that our FPGA approach reduces power consumption by an order of magnitude. This work has been done in the context of the IBM/BSC Deep Learning Center initiative. This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 754337 (EuroEXA), from Spanish Government (PID2019-107255GBC21/AEI/10.13039/501100011033), and from Generalitat de Catalunya (2017-SGR-1414 and 2017-SGR-1328).
Published: 2022

19. Sargantana: A 1 GHz+ in-order RISC-V processor with SIMD vector extensions in 22nm FD-SOI

Author: Victor Soria-Pardos, Max Doblas, Guillem Lopez-Paradis, Gerard Candon, Narcis Rodas, Xavier Carril, Pau Fontova-Muste, Neiel Leyva, Santiago Marco-Sola, Miquel Moreto, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), RISC-V, Vector instructions, High performance computing, Computer architecture, Domain-specific accelerators, Microprocessadors -- Disseny i construcció, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Microprocessors -- Design and construction, Càlcul intensiu (Informàtica)
Abstract: The RISC-V open Instruction Set Architecture (ISA) has proven to be a solid alternative to licensed ISAs. In the past 5 years, a plethora of industrial and academic cores and accelerators have been developed implementing this open ISA. In this paper, we present Sargantana, a 64-bit processor based on RISC-V that implements the RV64G ISA, a subset of the vector instructions extension (RVV 0.7.1), and custom application-specific instructions. Sargantana features a highly optimized 7-stage pipeline implementing out-of-order write-back, register renaming, and a non-blocking memory pipeline. Moreover, Sargantana features a Single Instruction Multiple Data (SIMD) unit that accelerates domain-specific applications. Sargantana achieves a 1.26 GHz frequency in the typical corner, and up to 1.69 GHz in the fast corner using 22nm FD-SOI commercial technology. As a result, Sargantana delivers a 1.77× higher Instructions Per Cycle (IPC) than our previous 5-stage in-order DVINO core, reaching 2.44 CoreMark/MHz. Our core design delivers comparable or even higher performance than other state-of-the-art academic cores under the Autobench EEMBC benchmark suite. This way, Sargantana lays the foundations for future RISC-V based core designs able to meet industrial-class performance requirements for scientific, real-time, and high-performance computing applications. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contract PID2019-107255GB-C21), by the Generalitat de Catalunya (contract 2017-SGR-1328), by the European Union within the framework of the ERDF of Catalonia 2014-2020 under the DRAC project [001-P-001723], and by Lenovo-BSC Contract-Framework (2020). The Spanish Ministry of Economy, Industry and Competitiveness has partially supported M. Doblas and V. Soria-Pardos through FPU fellowships no. FPU20-04076 and FPU20-02132 respectively. G. Lopez-Paradis has been supported by the Generalitat de Catalunya through a FI fellowship 2021FI-B00994. S. Marco-Sola was supported by Juan de la Cierva fellowship grant IJC2020-045916-I funded by MCIN/AEI/10.13039/501100011033 and by “European Union NextGenerationEU/PRTR”, and M. Moretó through a Ramon y Cajal fellowship no. RYC-2016-21104.
Published: 2022

20. A model of checkpoint behavior for applications that have I/O

Author: Dolores Rexachs, Betzabeth León, Daniel Franco, Emilio Luque, Sandra Mendez, and Barcelona Supercomputing Center
Subjects: Checkpoint, Protocols, Computer network, I/O applications, HPC (Computer science), Storage, Fault tolerance, Fault-tolerant computing, Theoretical Computer Science, Supercomputadors, Hardware and Architecture, HPC, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Software, Information Systems
Abstract: Due to the growth and increasing complexity of computer systems, reducing the overhead of fault tolerance techniques has become important in recent years. One fault tolerance technique is checkpointing, which saves a snapshot of the information computed up to a specific moment, suspending the execution of the application and consuming I/O resources and network bandwidth. Characterizing the files that are generated when checkpointing a parallel application is useful to determine the resources consumed and their impact on the I/O system. It is also important to characterize the application that performs checkpoints, and one of these characteristics is whether the application does I/O. In this paper, we present a model of checkpoint behavior for parallel applications that perform I/O; this behavior depends on the application and on other factors such as the number of processes, the mapping of processes and the type of I/O used. These characteristics also influence scalability, the resources consumed and their impact on the I/O system. Our model describes the behavior of the checkpoint size based on the characteristics of the system and the type (or model) of I/O used, such as the number of I/O aggregator processes, the buffer size used by the two-phase I/O optimization technique and the components of collective file I/O operations. The BT benchmark and FLASH I/O are analyzed under different configurations of aggregator processes and buffer size to explain our approach. The model can be useful when selecting which type of checkpoint configuration is most appropriate according to the application's characteristics and the resources available. Thus, the user will be able to know how much storage space the checkpoint consumes and how much the application consumes, in order to establish policies that help improve the distribution of resources.
This publication is supported under contract PID2020-112496GB-I00, funded by the Agencia Estatal de Investigación (AEI), Spain and the Fondo Europeo de Desarrollo Regional (FEDER) UE and partially funded by a research collaboration agreement with the Fundación Escuelas Universitarias Gimbernat (EUG).
Published: 2022

21. Resiliency in numerical algorithm design for extreme scale simulations

Author: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N Gansterer, Luc Giraud, Dominik Göddeke, Marco Heisig, Fabienne Jézéquel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S Quintana-Ortí, Francesco Rizzi, Ulrich Rüde, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thönnes, Andreas Wagner, Barbara Wohlmuth, High-End Parallel Algorithms for Challenging Numerical Simulations (HiePACS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Universität Stuttgart [Stuttgart], Karlsruher Institut für Technologie (KIT), Barcelona Supercomputing Center - Centro Nacional de Supercomputacion (BSC - CNS), Politecnico di Milano [Milan] (POLIMI), Technische Universität Munchen - Université Technique de Munich [Munich, Allemagne] (TUM), NVIDIA Corporation [Bangalore], NVIDIA Research [Austin], University Hospital Basel [Basel], Los Alamos National Laboratory (LANL), Friedrich-Alexander Universität Erlangen-Nürnberg (FAU), Oak Ridge National Laboratory [Oak Ridge] (ORNL), UT-Battelle, LLC, University of Vienna [Vienna], Performance et Qualité des Algorithmes Numériques (PEQUAN), LIP6, Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Université Panthéon-Assas (UP2), Lawrence Berkeley National Laboratory [Berkeley] (LBNL), Université de Bordeaux (UB), CERFACS, Universitat Politècnica de València (UPV), NexGen Analytics (NGA), Australian National University (ANU), Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich GmbH | Centre de recherche de Juliers, Helmholtz-Gemeinschaft = Helmholtz Association-Helmholtz-Gemeinschaft = Helmholtz Association, Sandia National Laboratories - Corporation, Technische Universität München [München] (TUM), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, STatic Optimizations, Runtime Methods (STORM), Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique (CERFACS), and Barcelona Supercomputing Center
Subjects: G.4, FOS: Computer and information sciences, Numerical algorithms, Large scale systems, G.1, 010103 numerical & computational mathematics, 02 engineering and technology, 01 natural sciences, Theoretical Computer Science, Simulació per ordinador, 0202 electrical engineering, electronic engineering, information engineering, parallel computer architecture, 0101 mathematics, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], resilience, D.4.5, D.4.4, 020203 distributed computing, Parallel computer architecture, Resilience, Fault tolerance, Computer Science - Distributed, Parallel, and Cluster Computing, Hardware and Architecture, Fault tolerance (Engineering), Distributed, Parallel, and Cluster Computing (cs.DC), fault tolerance, ddc:004, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Software, [MATH.MATH-NA]Mathematics [math]/Numerical Analysis [math.NA]
Abstract: This work is based on the seminar titled “Resiliency in Numerical Algorithm Design for Extreme Scale Simulations”, held March 1-6, 2020 at Schloss Dagstuhl and attended by all the authors. Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults.
In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. Comment: 45 pages, 3 figures, submitted to The International Journal of High Performance Computing Applications
Published: 2021
Full Text: View/download PDF

22. Aging-aware parallel execution

Author: Antoni Navarro, Marcelo Caggiani Luizelli, Michael Hübner, Marcelo Brandalero, Gustavo Berned, Fábio Diniz Rossi, Antonio Carlos Schneider Beck, Thiarles S. Medeiros, Arthur Francisco Lorenzon, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Parallel computing, Multi-core processor, Aging, HCI, Negative-bias temperature instability, General Computer Science, Exploit, Edge device, business.industry, Computer science, NBTI, Computation, Distributed computing, 020206 networking & telecommunications, Cloud computing, Microprocessors -- Energy consumption, 02 engineering and technology, Thermal management of electronic devices and systems, Control and Systems Engineering, Microprocessadors -- Consum d'energia, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Latency (engineering), business, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC]
Abstract: Computation has been pushed to the edge to decrease latency and alleviate the computational burden of IoT applications in the cloud. However, the increasing processing demands of edge applications make it necessary to employ platforms that exploit thread-level parallelism (TLP). Yet, power and heat dissipation rise as TLP inadvertently increases or when parallelism is not cleverly exploited, which may be the result of the non-ideal use of a given PPI (Parallel Program Interface). Besides the common issues, such as the need for more robust power sources and better cooling, heat also adversely affects aging, accelerating phenomena such as negative bias temperature instability (NBTI) and hot-carrier injection (HCI), which further reduce processor lifetime. Hence, considering that increasing the lifespan of an edge device is key to maximizing the number of times the application set may execute before the device's end-of-life, we propose BALDER. It is a learning framework capable of automatically choosing optimal execution configurations (PPI and number of threads) according to the parallel application at hand, aiming to optimize the trade-off between aging and performance. When executing ten well-known applications on two multicore embedded architectures, we show that BALDER can find a nearly optimal configuration for all our experiments.
Published: 2021

23. Implementation of a parallel tridiagonal solver for linear system of equations arising in Physicell-BioFVM

Author: Kulkarni, Shardool, Universitat Politècnica de Catalunya. Departament d'Enginyeria Civil i Ambiental, Rossi, Riccardo, Saxena, Gaurav, and Ponce de Leon, Miguel
Subjects: Anàlisi numèrica, Enginyeria civil [Àrees temàtiques de la UPC], Matrices, Tridiagonal Matrices, Direct Solvers, Linear Equations, Cyclic Reduction, OpenMP, MPI, Thomas Algorithm, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Matrius (Matemàtica), Numerical analysis
Abstract: PhysiCell/BioFVM is an open-source agent-based software package supporting 2D/3D simulations of multi-cellular biological systems. It is written entirely in C++ and offers shared-memory parallelization through OpenMP. An attempt is being made to restructure the code base to add support for distributed parallelization through MPI. The biggest bottleneck in the current version of the simulation package is the serial Thomas solver that is used to solve the tridiagonal system of linear equations resulting from the Finite Volume Method (FVM) discretization of the reaction-diffusion equations modelling the secretion and ingestion of substrates by cells/agents. Our aim in this project is to replace the serial solver with an efficient distributed-parallel solver. For this purpose we experiment with the Cyclic Reduction (CR) algorithm on a shared-memory system, but after understanding its limitations with regard to the problem size, we settle on a modified version of the Thomas solver as our preferred choice in distributed-parallel settings. Our experiments show that the Cyclic Reduction algorithm implemented using OpenMP is able to outperform the serial Thomas solver at a certain thread count on a single node. However, we do not extend this algorithm to support distributed parallelism due to the aforementioned limitations. Further, we implement an MPI-only version of the modified Thomas algorithm that promises good scalability on multiple nodes of our HPC cluster, the MareNostrum 4 (MN4) supercomputer at the Barcelona Supercomputing Center (BSC). We project and optimistically conclude that for large problem sizes and a high core count, the parallel modified Thomas algorithm can offer a significant reduction in the time to solution for complex 3D simulations in PhysiCell/BioFVM.
Published: 2021

24. Arbitration Policies for On-Demand User-Level I/O Forwarding on HPC Platforms

Author: Ramon Nou, Jean Luca Bez, Alberto Miranda, Toni Cortes, Philippe O. A. Navaux, Francieli Zanon Boito, Universidade Federal do Rio Grande do Sul [Porto Alegre] (UFRGS), Barcelona Supercomputing Center - Centro Nacional de Supercomputacion (BSC - CNS), Topology-Aware System-Scale Data Management for High-Performance Computing (TADAAM), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Universitat Politècnica de Catalunya [Barcelona] (UPC), GRID'5000, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, and Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)
Subjects: File system, I/O forwarding, Service (systems architecture), Parallel processing (Electronic computers), Computer science, business.industry, Processament en paral·lel (Ordinadors), Telecommunication -- Traffic, Dynamic priority scheduling, computer.software_genre, Set (abstract data type), MCKP, Assignació de recursos, Server, [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Arbitration, Bandwidth (computing), Telecomunicació -- Tràfic, Allocation policy, [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], business, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Resource allocation, computer, Computer network
Abstract: I/O forwarding is a well-established and widely adopted technique in HPC to reduce contention in the access to storage servers and transparently improve I/O performance. Rather than having applications directly access the shared parallel file system, the forwarding technique defines a set of I/O nodes responsible for receiving application requests and forwarding them to the file system, thus reshaping the flow of requests. The typical approach is to statically assign I/O nodes to applications depending on the number of compute nodes they use, which is not necessarily related to their I/O requirements. Thus, this approach leads to inefficient usage of these resources. This paper investigates arbitration policies based on the applications' I/O demands, represented by their access patterns. We propose a policy based on the Multiple-Choice Knapsack problem that seeks to maximize global bandwidth by giving more I/O nodes to the applications that will benefit the most. Furthermore, we propose a user-level I/O forwarding solution as an on-demand service capable of applying different allocation policies at runtime for machines where this layer is not present. We demonstrate our approach's applicability through extensive experimentation and show it can transparently improve global I/O bandwidth by up to 85% in a live setup compared to the default static policy. This study was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. It has also received support from the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil. It is also partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grants PID2019-107255GB; and the Generalitat de Catalunya under contract 2014-SGR-1051. The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the Barcelona Supercomputing Center.
Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).
Published: 2021

25. Parallelware Tools: An Experimental Evaluation on POWER Systems

Author: Xavier Martorell, Manuel Arenaz, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: FOS: Computer and information sciences, Exploit, Computer science, Concurrency, Informàtica::Enginyeria del software [Àrees temàtiques de la UPC], Parallel programming (Computer science), Static code analysis, POWER systems, Static program analysis, Computer software -- Quality control, Programari -- Control de qualitat, 010103 numerical & computational mathematics, Programació en paral·lel (Informàtica), 01 natural sciences, Concurrency and parallelism, Software development process, Software, 0101 mathematics, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Tasking, business.industry, Software architecture, Parallelware tools, Detection of software defects, OpenMP, 010101 applied mathematics, Computer Science - Distributed, Parallel, and Cluster Computing, Systems development life cycle, Programari -- Disseny, Distributed, Parallel, and Cluster Computing (cs.DC), Quality assurance and testing, Software engineering, business
Abstract: Static code analysis tools are designed to help software developers build better-quality software in less time, by detecting defects early in the software development life cycle. Even the most experienced developer regularly introduces coding defects. Identifying, mitigating and resolving defects is an essential part of the software development process, but frequently defects can go undetected. One defect can lead to a minor malfunction or cause serious security and safety issues. This is magnified in the development of the complex parallel software required to exploit modern heterogeneous multicore hardware. Thus, there is an urgent need for new static code analysis tools to help in building better concurrent and parallel software. The paper reports preliminary results about the use of Appentra’s Parallelware technology to address this problem from the following three perspectives: finding concurrency issues in the code, discovering new opportunities for parallelization in the code, and generating parallel-equivalent codes that enable tasks to run faster. The paper also presents experimental results using well-known scientific codes and POWER systems. This work has been partly funded from the Spanish Ministry of Science and Technology (TIN2015-65316-P), the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya (MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels, 2014-SGR-1051), and the European Union’s Horizon 2020 research and innovation program through grant agreements MAESTRO (801101) and EPEEC (801051).
Published: 2021

26. An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks

Author: Truong Thao Nguyen, Albert Njoroge Kahira, Rosa M. Badia, Mohamed Wahib, Ryousei Takano, Leonardo Bautista Gomez, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Data parallelism, Computer science, 02 engineering and technology, Machine learning, computer.software_genre, Convolutional neural network, Oracle, Model parallelism, Performance modeling, Machine Learning (cs.LG), Neural networks (Computer science), 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), Xarxes neuronals (Informàtica), Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], 020203 distributed computing, Artificial neural network, Parallel processing (Electronic computers), business.industry, Deep learning, Processament en paral·lel (Ordinadors), Computer Science - Distributed, Parallel, and Cluster Computing, Scalability, Parallelism (grammar), 020201 artificial intelligence & image processing, Artificial intelligence, Distributed, Parallel, and Cluster Computing (cs.DC), business, computer, Aprenentatge profund
Abstract: Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and alleviate memory capacity limitations when training large models and/or using high dimension inputs. With the steady increase in datasets and model sizes, model/hybrid parallelism is deemed to have an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches on performance and scalability. We leverage our model-driven analysis to be the basis for an oracle utility which can help in detecting the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for data parallelism. Comment: The International ACM Symposium on High-Performance Parallel and Distributed Computing 2021 (HPDC'21)
Published: 2021
Full Text: View/download PDF

27. Multi-GPU parallelization of the NAS multi-zone parallel benchmarks

Author: Marc Gonzalez, Enric Morancho, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Computer science, Dynamic, Parallel programming (Computer science), Workload, Dynamic priority scheduling, Parallel computing, Load balancing (computing), Programació en paral·lel (Informàtica), Unitats de processament gràfic, Guided schedulings, Load management, Computational Theory and Mathematics, Multi-GPU parallelization, Hardware and Architecture, Signal Processing, Multi gpu, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Graphics processing units, Load balancing, Static
Abstract: GPU-based computing systems have become a widely accepted solution for the high-performance-computing (HPC) domain. GPUs have shown highly competitive performance-per-watt ratios and can exploit an astonishing level of parallelism. However, exploiting the peak performance of such devices is a challenge, mainly due to the combination of two essential aspects of multi-GPU execution. On one hand, the workload should be distributed evenly among the GPUs. On the other hand, communications between GPU devices are costly and should be minimized. Therefore, a trade-off between work-distribution schemes and communication overheads will condition the overall performance of parallel applications run on multi-GPU systems. In this article we present a multi-GPU implementation of the NAS Multi-Zone Parallel Benchmarks (whose execution alternates communication and computational phases). We propose several work-distribution strategies that try to evenly distribute the workload among the GPUs. Our evaluations show that performance is highly sensitive to this distribution strategy, as the communication phases of the applications are heavily affected by the work-distribution schemes applied in computational phases. In particular, we consider Static, Dynamic, and Guided schedulers to find a trade-off between both phases to maximize the overall performance. In addition, we compare those schedulers with an optimal scheduler computed offline using IBM CPLEX. On an evaluation environment composed of 2 x IBM Power9 8335-GTH and 4 x GPU NVIDIA V100 (Volta), our multi-GPU parallelization outperforms single-GPU execution from 1.48x to 1.86x (2 GPUs) and from 1.75x to 3.54x (4 GPUs). This article analyses these improvements in terms of the relationship between the computational and communication phases of the applications as the number of GPUs is increased. We show that Guided schedulers perform at a level similar to optimal schedulers.
This work was supported by the Spanish Ministry of Science and Technology (TIN2015-65316-P) and by the Generalitat de Catalunya (2014-SGR-1051).
Published: 2021
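
Illustrative note: the abstract above compares Static, Dynamic, and Guided schedulers for distributing work among GPUs. The sketch below shows how the three policies typically split a pool of work items into chunks, assuming OpenMP-style semantics; the paper's own schedulers may differ in detail. The function names and parameters are hypothetical.

```python
from typing import List


def static_chunks(n_items: int, n_workers: int) -> List[int]:
    """One chunk per worker, sizes differing by at most one item."""
    base, extra = divmod(n_items, n_workers)
    return [base + (1 if w < extra else 0) for w in range(n_workers)]


def dynamic_chunks(n_items: int, chunk: int) -> List[int]:
    """Fixed-size chunks handed out on demand; the last one may be smaller."""
    full, rest = divmod(n_items, chunk)
    return [chunk] * full + ([rest] if rest else [])


def guided_chunks(n_items: int, n_workers: int, min_chunk: int = 1) -> List[int]:
    """Chunks proportional to the remaining work, shrinking down to min_chunk."""
    chunks: List[int] = []
    remaining = n_items
    while remaining > 0:
        size = max(min_chunk, -(-remaining // n_workers))  # ceiling division
        size = min(size, remaining)
        chunks.append(size)
        remaining -= size
    return chunks
```

Guided scheduling starts with large chunks, which limits hand-out overhead, and finishes with small ones, which evens out the load at the end of the phase.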

28. A new generation of task-parallel algorithms for matrix inversion in many-threaded CPUs

Author: Enrique S. Quintana-Ortí, José R. Herrero, Francisco D. Igual, Sandra Catalán, Rafael Rodríguez-Sánchez, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Xeon, Parallel processing (Electronic computers), Computer science, Parallel algorithms, Matrix inversion, Processament en paral·lel (Ordinadors), Task parallelism, Parallel algorithm, OpenMP, Parallel computing, Operand, Matrix (mathematics), Task (computing), Kernel (linear algebra), Algorismes paral·lels, High performance, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Pivot element
Abstract: We take advantage of the new tasking features in OpenMP to propose advanced task-parallel algorithms for the inversion of dense matrices via Gauss-Jordan elimination. Our algorithms perform a partitioning of the matrix operand into two levels of tasks: The matrix is first divided vertically, by column blocks (or panels), in order to accommodate the standard partial pivoting scheme that ensures the numerical stability of the method. In addition, depending on the particular kernel to be applied, each panel is partitioned either horizontally by row blocks (tiles) or vertically by µ-panels (of columns), in order to extract sufficient task parallelism to feed a many-threaded general purpose processor (CPU). The results of the experimental evaluation show the performance benefits of the advanced tasking algorithms on an Intel Xeon Gold processor with 20 cores. This research was sponsored by projects RTI2018-093684-B-I00 and TIN2017-82972-R of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; and project PR65/19-22445 of Universidad Complutense de Madrid.
Published: 2021
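
Illustrative note: the abstract above builds on Gauss-Jordan elimination with partial pivoting. The following is a minimal, unblocked, sequential sketch of that base method only, assuming a small dense matrix stored as nested lists; it does not reproduce the panel, tile, or µ-panel task decomposition the authors describe. The function name `gauss_jordan_inverse` is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def gauss_jordan_inverse(a: Matrix) -> Matrix:
    """Invert a square matrix via Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Work on the augmented matrix [A | I]; the right half becomes A^-1.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(a)
    ]
    for col in range(n):
        # Partial pivoting: pick the row with the largest magnitude in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot element becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate the column from every other row (above and below).
        for r in range(n):
            if r != col:
                f = aug[r][col]
                if f != 0.0:
                    aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]
```

Because each column step updates all other rows independently, the row updates are a natural source of the task parallelism that a blocked variant can expose.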

29. gem5 + rtl: A framework to enable RTL models inside a full-system simulator

Author: Adria Armejach, Guillem Lopez-Paradis, Miquel Moreto, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: GHDL, Computer science, media_common.quotation_subject, RTL, Symmetric multiprocessor system, Verilator, 02 engineering and technology, 01 natural sciences, gem5, Sistemes monoxip, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Use case, Architecture, Software simulator, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], media_common, 010302 applied physics, Parallel processing (Electronic computers), business.industry, Deep learning, System-on-chip (SoC), Processament en paral·lel (Ordinadors), Càlcul intensiu (Informàtica) -- Consum d'energia, High performance computing -- Energy consumption, 020202 computer hardware & architecture, Debugging, Embedded system, Systems on a chip, Performance monitoring, Artificial intelligence, Heterogeneous computing, business, Accelerators, Simulation
Abstract: In recent years there has been a surge of interest in designing custom accelerators for power-efficient high-performance computing. However, available tools to simulate low-level RTL designs often neglect the target system in which the design will operate. This hinders proper testing and debugging of functionalities, and does not allow co-designing the accelerator to obtain a balanced and efficient architecture. In this paper, we introduce gem5 + rtl, a flexible framework that enables simulation of RTL models inside a full-system software simulator. We present the framework’s functionality that allows easy integration of RTL models on a simulated system-on-chip (SoC) that is able to boot Linux and run complex multi-threaded and multi-programmed workloads. We demonstrate the framework with two relevant use cases that integrate a multi-core SoC with a Performance Monitoring Unit (PMU) and the NVIDIA Deep Learning Accelerator (NVDLA), showcasing how the framework enables testing RTL model features and how it can enable co-design taking into account the entire SoC. This research was supported by the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of total cost eligible under the DRAC project [001-P-001723], by the Spanish government (grant RTI2018-095094-B-C21 CONSENT), by the Spanish Ministry of Science and Innovation (contracts PID2019-107255GB-C21) and by the Catalan Government (contracts 2017-SGR-1414, 2017-SGR705). This work has also been supported by the European Community’s Horizon 2020 Framework Programme under the Mont-Blanc 2020 and EPI projects (grant agreements n. 779877 and n. 826647); and by the Arm-BSC Center of Excellence. G. López-Paradís has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under Ajuts per a la contractació de personal investigador novell fellowship No. 2021FI B00994. A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number IJCI-2017-33945. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship No. RYC-2016-21104.
Published: 2021

30. Multi-GPU design and performance evaluation of homomorphic encryption on GPU clusters

Author: Nan Xiao, Jie Lin, Bharadwaj Veeravalli, Ahmad Al Badawi, Matsumura Kazuaki, Aung Khin Mi Mi, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Speedup, Multi-GPU clusters, Data parallelism, Computer science, Parallel algorithms, 02 engineering and technology, Parallel computing, Encryption, 0202 electrical engineering, electronic engineering, information engineering, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], 020203 distributed computing, business.industry, Homomorphic encryption, Criptografia, Gestió de memòria (Informàtica), Load balancing (computing), Unitats de processament gràfic, Computational Theory and Mathematics, Memory management (Computer science), Hardware and Architecture, Signal Processing, Scalability, Cryptography, Performance evaluation, business, Graphics processing units
Abstract: We present a multi-GPU design, implementation and performance evaluation of the Halevi-Polyakov-Shoup (HPS) variant of the Fan-Vercauteren (FV) levelled Fully Homomorphic Encryption (FHE) scheme. Our design follows a data parallelism approach and uses partitioning methods to distribute the workload in FV primitives evenly across available GPUs. The design addresses the space and runtime requirements of FHE computations. It is also suitable for distributed-memory architectures, and includes efficient GPU-to-GPU data exchange protocols. Moreover, it is user-friendly as user intervention is not required for task decomposition, scheduling or load balancing. We implement and evaluate the performance of our design on two homogeneous and heterogeneous NVIDIA GPU clusters: K80, and a customized P100. We also provide a comparison with a recent shared-memory-based multi-core CPU implementation using two homomorphic circuits as workloads: vector addition and multiplication. Moreover, we use our multi-GPU Levelled-FHE to implement the inference circuit of two Convolutional Neural Networks (CNNs) to perform image classification homomorphically on encrypted images from the MNIST and CIFAR-10 datasets. Our implementation provides 1 to 3 orders of magnitude speedup compared with the CPU implementation on vector operations. In terms of scalability, our design shows reasonable scalability curves when the GPUs are fully connected. This work is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Programmatic Programme (Award A19E3b0099).
Published: 2021
Full Text: View/download PDF

31. Efficiently running SpMV on long vector architectures

Author: Marc Casas, Constantino Gómez, Erich Focht, Filippo Mantovani, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: 020203 distributed computing, Long-vector architectures, Parallel processing (Electronic computers), Computer science, Processament en paral·lel (Ordinadors), 020207 software engineering, Context (language use), 02 engineering and technology, Parallel computing, NEC vector engine, Set (abstract data type), Kernel (linear algebra), Vectorization (mathematics), SpMV, 0202 electrical engineering, electronic engineering, information engineering, Parallelism (grammar), Multiplication, SIMD, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Performance optimization, Sparse matrix
Abstract: Sparse Matrix-Vector multiplication (SpMV) is an essential kernel for parallel numerical applications. SpMV displays sparse and irregular data accesses, which complicate its vectorization. Such difficulties frequently cause SpMV to deliver non-optimal performance when run on long vector ISAs exploiting SIMD parallelism. In this context, the development of new optimizations becomes fundamental to enable high performance SpMV executions on emerging long vector architectures. In this paper, we improve the state-of-the-art SELL-C-σ sparse matrix format by proposing several new optimizations for SpMV. We target aggressive long vector architectures like the NEC Vector Engine. By combining several optimizations, we obtain an average 12% improvement over SELL-C-σ considering a heterogeneous set of 24 matrices. Our optimizations boost performance in long vector architectures since they expose a high degree of SIMD parallelism. The authors would like to acknowledge the support of NEC Corporation. This work is partially supported by the Spanish Ministry of Science and Technology through PID2019-107255GB project and by the Generalitat de Catalunya (contract 2017-SGR-1414). Marc Casas has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2017-23269.
Published: 2021

32. A low overhead tasking model for OpenMP

Author: Chenle Yu, Sara Royuela, Eduardo Quiñones, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), GPU, OpenMP, CUDA, Software_PROGRAMMINGTECHNIQUES, Ordinadors immersos, Sistemes d', Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Embedded computer systems, Fine-grain parallelism
Abstract: OpenMP is a parallel programming model widely used on shared-memory systems. Over the years, the OpenMP community has been extending the OpenMP Specification to adapt it to modern architectures and expand its usage to other domains such as Embedded Systems. Our work focuses on improving the OpenMP tasking model by reducing the task runtime overhead. To do so, we propose a new OpenMP framework, namely, taskgraph, based on the concept of task dependency graph, where nodes are OpenMP tasks and edges describe the dependencies among them. The new framework is shown to be particularly suitable for fine-grain parallelism. It can be extended to other programming models with ease, improving the interoperability of OpenMP with different programming models, such as CUDA. This work has been supported by the EU H2020 project AMPERE under the grant agreement no. 871669.
Published: 2021
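
Illustrative note: the abstract above describes a task dependency graph whose nodes are tasks and whose edges are the dependencies among them. The sketch below builds such a graph from declared inputs and outputs and derives one valid execution order. It is a language-neutral illustration in Python, not the OpenMP taskgraph framework itself; the `Task` tuple layout, the function `build_tdg`, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# (task name, data items read, data items written)
Task = Tuple[str, List[str], List[str]]


def build_tdg(tasks: List[Task]) -> Dict[str, Set[str]]:
    """Map each task to the set of earlier tasks it must wait for.

    Simplification: only read-after-write and write-after-write
    dependencies are tracked; write-after-read ordering is ignored.
    """
    last_writer: Dict[str, str] = {}
    graph: Dict[str, Set[str]] = {}
    for name, inputs, outputs in tasks:
        graph[name] = {last_writer[d] for d in inputs + outputs if d in last_writer}
        for d in outputs:
            last_writer[d] = name
    return graph


demo_tasks: List[Task] = [
    ("init", [], ["a"]),
    ("left", ["a"], ["b"]),
    ("right", ["a"], ["c"]),
    ("join", ["b", "c"], ["d"]),
]
demo_graph = build_tdg(demo_tasks)
# One valid order; "left" and "right" are independent and could run in parallel.
demo_order = list(TopologicalSorter(demo_graph).static_order())
```

Once the graph has been recorded, it can be replayed on later iterations without resolving dependencies again, which is the kind of overhead reduction the abstract targets.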

33. TALP - A Lightweight Tool to Unveil Parallel Efficiency of Large-scale Executions

Author: Marta Garcia-Gasulla, Victor Lopez, Guillem Ramirez Miranda, and Barcelona Supercomputing Center
Subjects: 020203 distributed computing, Computer science, Performance and optimization, Scale (chemistry), Distributed computing, 02 engineering and technology, Supercomputers, Performance Monitoring, Extensibility, Set (abstract data type), Resource (project management), Supercomputadors, Parallel computer programs, Scalability, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Performance monitoring, 020201 artificial intelligence & image processing, Performance measurement, High performance computing, Distributed computing systems, Heterogeneous, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Informàtica::Programació [Àrees temàtiques de la UPC]
Abstract: This paper presents the design, implementation, and application of TALP, a lightweight, portable, extensible, and scalable tool for online parallel performance measurement. The efficiency metrics reported by TALP allow HPC users to evaluate the parallel efficiency of their executions, both post-mortem and at runtime. The API that TALP provides allows the running application or resource managers to collect performance metrics at runtime. This enables the opportunity to adapt the execution based on the metrics collected dynamically. The set of metrics collected by TALP is well defined, independent of the tool, and consolidated. We extend the collection of metrics with two additional ones that can differentiate between load imbalance originating from intranode and from internode imbalance. We evaluate the potential of TALP with three parallel applications that present various parallel issues and carefully analyze the overhead introduced to determine its limitations. This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European POP CoE (GA n. 824080).
Published: 2021
Full Text: View/download PDF

34. Enhancing OpenMP tasking model: performance and portability

Author: Sara Royuela, Chenle Yu, Eduardo Quinones, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Tasking model, Computer science, Parallel programming (Computer science), 020206 networking & telecommunications, 02 engineering and technology, Parallel computing, Multiprocessadors, Programació en paral·lel (Informàtica), Supercomputers, OpenMP specification, 020202 computer hardware & architecture, Software portability, Supercomputadors, Symmetric multiprocessing, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Parallelism (grammar), Overhead (computing), Multiprocessors, Orchestration (computing), Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Runtime overhead, Scaling
Abstract: OpenMP, as the de-facto standard programming model in symmetric multiprocessing for HPC, has seen its performance boosted continuously by the community, either through implementation enhancements or specification augmentations. Furthermore, the language has evolved from a prescriptive nature, as defined by the thread-centric model, to a descriptive behavior, as defined by the task-centric model. However, the overhead related to the orchestration of tasks is still relatively high. Applications exploiting very fine-grained parallelism and systems with a large number of cores available might fail to scale. In this work, we propose to include the concept of Task Dependency Graph (TDG) in the specification by introducing a new clause, named taskgraph, attached to task or target directives. By design, the TDG allows alleviating the overhead associated with the OpenMP tasking model, and it also facilitates linking OpenMP with other programming models that support task parallelism. According to our experiments, a GCC implementation of the taskgraph is able to significantly reduce the execution time of fine-grained task applications and increase their scalability with regard to the number of threads. This work has been supported by the EU H2020 project AMPERE under the grant agreement no. 871669.
Published: 2021

35. Implementation of a high-accuracy phase unwrapping algorithm using parallel-hybrid programming approach for displacement sensing using self-mixing interferometry

Author: Saqib Amin, Eduard Ayguadé, Usman Zabit, Tassadaq Hussain, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Interferometria, Application program interfaces (Computer software), Computer science, Message Passing Interface, Theoretical Computer Science, law.invention, Supercomputadors, law, Code (cryptography), Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Supercomputing, Interfícies de programació d'aplicacions (Programari), Supercomputer, Laser, Supercomputers, Microarchitecture, Interferometry, Self-mixing interferometry, Phase unwrapping, Hardware and Architecture, Scalability, Metric (mathematics), HPC, Algorithm, Software, Information Systems
Abstract: Phase unwrapping is an integral part of multiple algorithms with diverse applications. Detailed phase unwrapping is also necessary for achieving high-accuracy metric sensing using laser feedback-based self-mixing interferometry (SMI). Among SMI-specific phase unwrapping approaches, a technique called Improved Phase Unwrapping Method (IPUM) provides the highest accuracy. However, due to its complex, sequential, and compute-intensive nature, this method requires a high-performance computing architecture, capable of scalable parallel processing so that such a high-accuracy algorithm can be used for high-bandwidth sensing applications. In this work, the existing sequential IPUM C program is parallelized by using hybrid OpenMP/MPI (Open Multi-Processing/Message Passing Interface) parallel programming models and tested on the Barcelona Supercomputing Center Nord-III Supercomputer. The computational performance of the proposed parallel-hybrid IPUM algorithm is compared with the existing sequential IPUM code by executing them on multi-core and uni-core processor architectures, respectively. When comparing the performance of sequential IPUM with the parallel-hybrid IPUM algorithm on 16 nodes of the Nord-III supercomputer, the results show that the parallel-hybrid algorithm achieves a 345.9x performance improvement over IPUM’s standard, sequential implementation on a single-node system. The results show that the parallel-hybrid version of IPUM gives scalable performance for different target velocities and different numbers of processing cores. The research leading to these results has received funding from the Higher Education Commission under TDF03-097.
Published: 2021

36. OmpSs@FPGA framework for high performance FPGA computing

Author: Miquel Vidal, Daniel Jiménez-González, Eduard Ayguadé, Antonio Filgueras, Carlos Alvarez, Xavier Martorell, Jaume Bosch, Jesús Labarta, Juan Miguel de Haro, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Computer science, 02 engineering and technology, Parallel computing, computer.software_genre, Porting, Theoretical Computer Science, Runtime system, Parallel architectures, High-level synthesis, 0202 electrical engineering, electronic engineering, information engineering, Field-programmable gate array, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], FPGA, Compilers (Computer programs), Matrius de portes programables per l'usuari, Parallel processing (Electronic computers), Task-based programming models, Processament en paral·lel (Ordinadors), Compiladors (Programes d'ordinador), Local variable, Field programmable gate arrays, Reconfigurable computing, Reconfigurable hardware, 020202 computer hardware & architecture, Computational Theory and Mathematics, Hardware and Architecture, Programming paradigm, Compiler, computer, Software
Abstract: This paper presents the new features of the OmpSs@FPGA framework. OmpSs is a data-flow programming model that supports task nesting and dependencies to target asynchronous parallelism and heterogeneity. OmpSs@FPGA is the extension of the programming model addressed specifically to FPGAs. The OmpSs environment is built on top of the Mercurium source-to-source compiler and the Nanos++ runtime system. To address FPGA specifics, the Mercurium compiler implements several FPGA-related features such as local variable caching, wide memory accesses or accelerator replication. In addition, part of the Nanos++ runtime has been ported to hardware. Driven by the compiler, this new hardware runtime adds new features to FPGA codes, such as task creation and dependence management, providing both performance increases and ease of programming. To demonstrate these new capabilities, different high performance benchmarks have been evaluated over different FPGA platforms using the OmpSs programming model. The results demonstrate that programs that use the OmpSs programming model achieve very competitive performance with low to moderate porting effort compared to other FPGA implementations. This work has received funding from EuroEXA project (European Union’s Horizon 2020 Research and Innovation Programme, under grant agreement No 754337), from Spanish Government (projects PID2019-107255GB and SEV-2015-0493, grant BES-2016-078046), and Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328).
Published: 2021

37. Combining dynamic concurrency throttling with voltage and frequency scaling on task-based programming models

Author: Eduard Ayguadé Parra, Vicenç Beltran Querol, Antoni Navarro Muñoz, Arthur F. Lorenzon, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Optimization, Uncore, Computer science, Distributed computing, Concurrency, Power-performance, 02 engineering and technology, 01 natural sciences, Càlcul intensiu (Informàtica) -- Estalvi d'energia, Scheduling (computing), Runtime system, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), DVFS, Frequency scaling, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], OmpSs-2, 010302 applied physics, Parallel processing (Electronic computers), Scheduling, Processament en paral·lel (Ordinadors), DCT, Modeling and prediction, OpenMP, 020202 computer hardware & architecture, Task (computing), Parallel programming model, High performance computing -- Energy conservation, Energy-awareness
Abstract: Being on the verge of exascale performance has shifted the prioritization of performance in applications to the inclusion of power-performance efficiency as a primary objective in the High Performance Computing (HPC) community. Simultaneously, this has surfaced hardware and software efforts that employ techniques such as dynamic voltage and frequency scaling (DVFS) for core and uncore units or dynamic concurrency throttling (DCT) to exploit hardware resources efficiently, by saving energy while maintaining performance. These techniques are complementary, so they can be used together. However, employing them is not a straightforward task, as they have to be adjusted based on the workload, and it is even more complex to combine them properly. Thus, these techniques should be applied transparently by a runtime system, without relying on application developers. In this paper, we extend a task-based runtime system with an infrastructure that categorizes workloads based on their computational profile – memory-bounded, compute-bounded, or balanced. This categorization is done in an on-line manner and with a negligible overhead. With this additional information, we enhance the CPU-manager and scheduler of OmpSs-2, a task-based parallel programming model, to automatically combine DVFS and DCT techniques based on workloads. Moreover, we show that our heuristics transparently improve energy efficiency on average by 15% with no significant performance loss and either equal or surpass the energy efficiency of the best static configuration available. This research has received funding from the European Union’s Horizon 2020/EuroHPC research and innovation programme under grant agreement N.955606 (DEEP-SEA), and is supported by the Spanish State Research Agency - Ministry of Science and Innovation (contract PID2019-107255GB), and by the Generalitat de Catalunya (2017-SGR-1414). 
This work was also supported by Project HPC-EUROPA3, with the support of the EC Research Innovation Action under the H2020 Programme.
Published: 2021
Full Text: View/download PDF

38. PH-RLS: A parallel hybrid recursive least square algorithm for self-mixing interferometric laser sensor

Author: Eduard Ayguadé, Zohaib A. Khan, Usman Zabit, Muhammad Usman, Tassadaq Hussain, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Least mean square algorithm, Physics, Interferometria, Barcelona Supercomputing Center CTE-Power9 Supercomputer, Real-time data processing, Parallel programming (Computer science), Real time systems, Programació en paral·lel (Informàtica), Supercomputers, Atomic and Molecular Physics, and Optics, Software testing, TA1501-1820, Interferometry, Supercomputadors, Mixing, Laser sensor, Applied optics. Photonics, Electrical and Electronic Engineering, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Algorithm, Temps real (Informàtica), Mixing (physics)
Abstract: The authors present the parallel-hybrid recursive least square (PH-RLS) algorithm for an accurate self-mixing interferometric laser vibration sensor coupled with an accelerometer under industrial conditions. Previously, this was achieved by using a conventional RLS algorithm to cancel the parasitic vibrations where the sensor itself is not in a stationary environment. This algorithm operates in sequential mode and, due to its compute- and data-intensive nature, does not work for real-time applications; hence, it requires parallel computing. Therefore, the existing conventional RLS C program is parallelized by using hybrid OpenACC/MPI (Open Accelerators/Message Passing Interface) parallel programming models and tested on the Barcelona Supercomputing Center CTE-Power9 supercomputer. The computational performance of the proposed PH-RLS algorithm is compared with the existing conventional RLS code by executing them on multiple distributed processors and a single-core processor architecture, respectively. When comparing the performance of conventional RLS with the PH-RLS algorithm on eight nodes of the CTE-Power9 supercomputer, the results show that the PH-RLS algorithm achieves a 5857-fold performance improvement over the conventional RLS implementation on a single-node system. The results also show that the proposed PH-RLS gives scalable performance across a range of vibration signals, making it a suitable choice for real-time self-mixing interferometer sensing systems working under industrial conditions.
Published: 2021
Full Text: View/download PDF

39. Improving HPC system throughput and response time using memory disaggregation

Author: Felippe Vieira Zacarias, Paul Carpenter, Vinicius Petrucci, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Disaggregation, Performance degradation, Resource scheduling, Memory management (Computer science), Assignació de recursos, Performance prediction, Gestió de memòria (Informàtica), High performance computing, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Resource allocation, Slurm, Càlcul intensiu (Informàtica)
Abstract: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. HPC clusters are cost-effective, well understood, and scalable, but the rigid boundaries between compute nodes may lead to poor utilization of compute and memory resources. HPC jobs may vary, by orders of magnitude, in memory consumption per core. Thus, even when the system is provisioned to accommodate normal and large capacity nodes, a mismatch between the system and the memory demands of the scheduled jobs can lead to inefficient usage of both memory and compute resources. Disaggregated memory has recently been proposed as a way to mitigate this problem by flexibly allocating memory capacity across cluster nodes. This paper presents a simulation approach for at-scale evaluation of job schedulers with disaggregated memories and it introduces a new disaggregated-aware job allocation policy for the Slurm resource manager. Our results show that using disaggregated memories, depending on the imbalance between the system and the submitted jobs, a similar throughput and job response time can be achieved on a system with up to 33% less total memory provisioning. This work is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 754337 (EuroEXA); it has been supported by the Spanish Ministry of Science and Innovation (project TIN2015-65316-P and Ramon y Cajal fellowship RYC2018-025628-I), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and the Severo Ochoa Programme (SEV-2015-0493).
Published: 2021

40. Human biventricular electromechanical simulations on the progression of electrocardiographic and mechanical abnormalities in post-myocardial infarction

Author: Mariano Vázquez, C Kelly, Francesc Levrero-Florencio, Erica Dall’Armellina, Zhinuo J. Wang, Lei Wang, Francesca Margara, Arka Das, Blanca Rodriguez, Xin Zhou, Alfonso Santiago, and Barcelona Supercomputing Center
Subjects: Ejection fraction, medicine.medical_specialty, Systole, Diastole, Myocardial Infarction, Infarction, Electrocardiograms, 030204 cardiovascular system & hematology, QT interval, Ventricular Function, Left, 030218 nuclear medicine & medical imaging, 03 medical and health sciences, QRS complex, Electrocardiography, 0302 clinical medicine, Simulació per ordinador, Physiology (medical), Internal medicine, T wave, Medicine, Repolarization, Humans, AcademicSubjects/MED00200, cardiovascular diseases, Myocardial infarction, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], business.industry, Stroke Volume, Computer simulation, medicine.disease, Electrocardiogram, Electromechanical simulations, Supplement Papers, Computer modelling, cardiovascular system, Cardiology, Cardiology and Cardiovascular Medicine, business
Abstract: Aims: Develop, calibrate and evaluate with clinical data a human electromechanical modelling and simulation framework for multiscale, mechanistic investigations in healthy and post-myocardial infarction (MI) conditions, from ionic to clinical biomarkers. Methods and results: Human healthy and post-MI electromechanical simulations were conducted with a novel biventricular model, calibrated and evaluated with experimental and clinical data, including torso/biventricular anatomy from clinical magnetic resonance, state-of-the-art human-based membrane kinetics, excitation–contraction and active tension models, and orthotropic electromechanical coupling. Electromechanical remodelling of the infarct/ischaemic region and the border zone were simulated for ischaemic, acute, and chronic states in a fully transmural anterior infarct and a subendocardial anterior infarct. The results were compared with clinical electrocardiogram and left ventricular ejection fraction (LVEF) data at similar states. Healthy model simulations show LVEF 63%, with 11% peak systolic wall thickening, QRS duration and QT interval of 100 ms and 330 ms. LVEF in ischaemic, acute, and chronic post-MI states were 56%, 51%, and 52%, respectively. In linking the three post-MI simulations, it was apparent that elevated resting potential due to hyperkalaemia in the infarcted region led to ST-segment elevation, while a large repolarization gradient corresponded to T-wave inversion. Mechanically, the chronic stiffening of the infarct region had the benefit of improving systolic function by reducing infarct bulging at the expense of reducing diastolic function by inhibiting inflation. Conclusion: Our human-based multiscale modelling and simulation framework enables mechanistic investigations into patho-physiological electrophysiological and mechanical behaviour and can serve as a testbed to guide the optimization of pharmacological and electrical therapies.
This work was funded by a Wellcome Trust Fellowship in Basic Biomedical Sciences to B.R. (214290/Z/18/Z), Personalised In-Silico Cardiology (PIC) project, European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement 764738, the CompBioMed 1 and 2 Centre of Excellence in Computational Biomedicine (European Commission Horizon 2020 research and innovation programme, grant agreements No. 675451 and No. 823712), an NC3Rs Infrastructure for Impact Award (NC/P001076/1), the TransQST project (Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No 116030, receiving support from the European Union’s Horizon 2020 research and innovation programme and EFPIA), and the Oxford BHF Centre of Research Excellence (RE/13/1/30181). This paper is part of a supplement supported by an unrestricted grant from the Theo-Rossi di Montelera (TRM) foundation.
Published: 2020

41. OpenMP to CUDA graphs

Author: Chenle Yu, Eduardo Quinones, Sara Royuela, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors
Subjects: CUDA graphs, 050101 languages & linguistics, Computer science, Parallel programming (Computer science), Optimizing compiler, Symmetric multiprocessor system, 02 engineering and technology, Parallel computing, Programació en paral·lel (Informàtica), Software_PROGRAMMINGTECHNIQUES, computer.software_genre, Domain (software engineering), CUDA, Supercomputadors, 0202 electrical engineering, electronic engineering, information engineering, Compiler optimization, 0501 psychology and cognitive sciences, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Field-programmable gate array, Matrius de portes programables in situ, 05 social sciences, Field programmable gate arrays, OpenMP, Supercomputers, Unitats de processament gràfic, Task (computing), Programmability, Programming paradigm, 020201 artificial intelligence & image processing, Heterogeneous computing, High performance computing, Compiler, Graphics processing units, computer
Abstract: Heterogeneous computing is increasingly being used in a diversity of computing systems, ranging from HPC to the real-time embedded domain, to cope with the performance requirements. Due to the variety of accelerators, e.g., FPGAs and GPUs, the use of high-level parallel programming models is desirable to exploit their performance capabilities while maintaining an adequate productivity level. In that regard, OpenMP is a well-known high-level programming model that incorporates powerful task and accelerator models capable of efficiently exploiting structured and unstructured parallelism in heterogeneous computing. This paper presents a novel compiler transformation technique that automatically transforms OpenMP code into CUDA graphs, combining the benefits of programmability of a high-level programming model such as OpenMP, with the performance benefits of a low-level programming model such as CUDA. Evaluations have been performed on two NVIDIA GPUs from the HPC and embedded domains, i.e., the V100 and the Jetson AGX respectively. This work has been supported by the EU H2020 project AMPERE under the grant agreement no. 871669.
Published: 2020
Full Text: View/download PDF

42. Worksharing Tasks: An Efficient Way to Exploit Irregular and Fine-Grained Loop Parallelism

Author: Eduard Ayguadé, Kevin Sala, Marcos Maronas, Vicenç Beltran, Sergi Mateo, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: FOS: Computer and information sciences, Exploit, Computer science, 010103 numerical & computational mathematics, 02 engineering and technology, Parallel computing, 01 natural sciences, Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Multiprocessors, Runtime systems, 0101 mathematics, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Execution model, 020203 distributed computing, Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Programming models, Fine grained loop parallelism, Multiprocessadors, Structured parallelism, Task (computing), Shared memory, Computer Science - Distributed, Parallel, and Cluster Computing, Programming paradigm, Parallelism (grammar), Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Shared memory programming models usually provide worksharing and task constructs. The former relies on the efficient fork-join execution model to exploit structured parallelism; while the latter relies on fine-grained synchronization among tasks and a flexible data-flow execution model to exploit dynamic, irregular, and nested parallelism. On applications that show both structured and unstructured parallelism, both worksharing and task constructs can be combined. However, it is difficult to mix both execution models without penalizing the data-flow execution model. Hence, on many applications structured parallelism is also exploited using tasks to leverage the full benefits of a pure data-flow execution model. However, task creation and management might introduce a non-negligible overhead that prevents the efficient exploitation of fine-grained structured parallelism, especially on many-core processors. In this work, we propose worksharing tasks. These are tasks that internally leverage worksharing techniques to exploit fine-grained structured loop-based parallelism. The evaluation shows promising results on several benchmarks and platforms. This work is supported by the Spanish Ministerio de Ciencia, Innovacion y Universidades (TIN2015-65316-P), by the Generalitat de Catalunya (2014-SGR-1051) and by the European Union’s Seventh Framework Programme (FP7/2007-2013) and the H2020 funding framework under grant agreement no. H2020-FETHPC-754304 (DEEP-EST).
Published: 2020

43. sLASs: a fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)

Author: Tetsuzo Usui, Pedro Valero-Lara, Xavier Martorell, Sandra Catalán, Jesús Labarta, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Computer Networks and Communications, Computer science, Auto-tuning, Degree of parallelism, Parallel programming (Computer science), 020206 networking & telecommunications, 02 engineering and technology, Parallel computing, LASs, Programació en paral·lel (Informàtica), Execution time, Theoretical Computer Science, Task (computing), Matrix (mathematics), OmpSs, Artificial Intelligence, Hardware and Architecture, Fully automatic, Linear algebra, 0202 electrical engineering, electronic engineering, information engineering, Code (cryptography), 020201 artificial intelligence & image processing, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Software
Abstract: © 2019 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ In this work we have implemented a novel Linear Algebra Library on top of the task-based runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features, weak dependencies and regions, together with the final clause, for the implementation of auto-tunable code for the BLAS-3 trsm routine and the LAPACK routines npgetrf and npgesv. All these implementations are part of the first prototype of the sLASs library, a novel library for auto-tunable codes for linear algebra operations based on the LASs library. In all these cases, the use of the OmpSs-2 features presents an improvement in terms of execution time against other reference libraries such as the original LASs library, PLASMA, ATLAS and Intel MKL. These codes are able to reduce the execution time by about 18% on big matrices, by increasing the IPC on gemm and reducing the time of task instantiation. For a few medium matrices, benefits are also seen. For small matrices and a subset of medium matrices, specific optimizations that increase the degree of parallelism in both gemm and trsm tasks are applied. This strategy achieves an increment in performance of up to 40%. This project has received funding from the Spanish Ministry of Economy and Competitiveness, Spain under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, Spain, under project MPEXPAR: Models de Programació i Entorns d’Execució Parallels (2014-SGR-1051), and the Juan de la Cierva Grant Agreement No IJCI-2017-33511. We also acknowledge the funding provided by Fujitsu, Japan under the BSC-Fujitsu joint project: Math Libraries Migration and Optimization.
Published: 2020

44. Shortest path computing in directed graphs with weighted edges mapped on random networks of memristors

Author: Ioannis Vourkas, Antonio Rubio, Carlos Fernandez, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, and Universitat Politècnica de Catalunya. HIPICS - Grup de Circuits i Sistemes Integrats d'Altes Prestacions
Subjects: Shortest path, Computer science, Parallel programming (Computer science), Computational memory, Context (language use), 02 engineering and technology, Memristor, Programació en paral·lel (Informàtica), Topology, Theoretical Computer Science, law.invention, Enginyeria electrònica::Components electrònics [Àrees temàtiques de la UPC], law, 0202 electrical engineering, electronic engineering, information engineering, Resistive switching, Resistive computing, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Memristor network, 020208 electrical & electronic engineering, Directed graph, 021001 nanoscience & nanotechnology, Electrònica -- Aparells i instruments, Resistive random-access memory, Hardware and Architecture, Shortest path problem, Electronic apparatus and appliances, 0210 nano-technology, Software
Abstract: Electronic version of an article published as [Fernandez, Carlos, Ioannis Vourkas, and Antonio Rubio. "Shortest Path Computing in Directed Graphs with Weighted Edges Mapped on Random Networks of Memristors." Parallel Processing Letters 30.01 (2020): 2050002] [https://doi.org/10.1142/S0129626420500024] © [copyright World Scientific Publishing Company] [https://www.worldscientific.com/worldscinet/ppl] To accelerate the execution of advanced computing tasks, in-memory computing with resistive memory provides a promising solution. In this context, networks of memristors could be used as parallel computing medium for the solution of complex optimization problems. Lately, the solution of the shortest-path problem (SPP) in a two-dimensional memristive grid has been given wide consideration. Some still open problems in such computing approach concern the time required for the grid to reach to a steady state, and the time required to read the result, stored in the state of a subset of memristors that represent the solution. This paper presents a circuit simulation-based performance assessment of memristor networks as SPP solvers. A previous methodology was extended to support weighted directed graphs. We tried memristor device models with fundamentally different switching behavior to check their suitability for such applications and the impact on the timely detection of the solution. Furthermore, the requirement of binary vs. analog operation of memristors was evaluated. Finally, the memristor network-based computing approach was compared to known algorithmic solutions to the SPP over a large set of random graphs of different sizes and topologies. Our results contribute to the proper development of bio-inspired memristor network-based SPP solvers. This work was supported by the Chilean research grants CONICYT REDES ETAPA INICIAL Convocatoria 2017 No. REDI170604, CONICYT BASAL FB0008, and by the Spanish MINECO and ERDF (TEC2016-75151-C3-2-R).
Published: 2020

45. Performance and energy effects on task-based parallelized applications

Author: Diego Caballero, Roger Ferrer, Marc Casas, Mateo Valero, Juan M. Cebrian, Helena Caminal, Xavier Martorell, Miquel Moreto, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Data parallelism, Computer science, 020209 energy, Microprocessors -- Energy consumption, Task-level parallelism, 02 engineering and technology, Parallel computing, Data-level parallelism, Theoretical Computer Science, Software portability, Vectorization, 0202 electrical engineering, electronic engineering, information engineering, SIMD, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Programmer, 020203 distributed computing, Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Vector processing (Computer science), Task (computing), Energy efficiency, Microprocessadors -- Consum d'energia, Hardware and Architecture, Vectorization (mathematics), Scalability, Programming paradigm, Software, Information Systems
Abstract: Heterogeneity, parallelization and vectorization are key techniques to improve the performance and energy efficiency of modern computing systems. However, programming and maintaining code for these architectures poses a huge challenge due to the ever-increasing architecture complexity. Task-based environments hide most of this complexity, improving scalability and usage of the available resources. In these environments, while there has been a lot of effort to ease parallelization and improve the usage of heterogeneous resources, vectorization has been considered a secondary objective. Furthermore, there has been a swift and unstoppable burst of vector architectures at all market segments, from embedded to HPC. Vectorization can no longer be ignored, but manual vectorization is tedious, error-prone and not practical for the average programmer. This work evaluates the feasibility of user-directed vectorization in task-based applications. Our evaluation is based on the OmpSs programming model, extended to support user-directed vectorization for different SIMD architectures (i.e., SSE, AVX2, AVX512). Results show that user-directed codes achieve manually optimized code performance and energy efficiency with minimal code modifications, favoring portability across different SIMD architectures.
Published: 2018
Full Text: View/download PDF

46. Static Analysis to Enhance Programmability and Performance in OmpSs-2

Author: Sara Royuela, Roger Ferrer, Raul Peñacoba, Adrian Munera, Eduardo Quinones, and Barcelona Supercomputing Center
Subjects: 020203 distributed computing, Real-time programming, Computer science, Performance, Parallel programming (Computer science), 02 engineering and technology, Parallel computing, Static analysis, computer.software_genre, Task (project management), Set (abstract data type), Supercomputadors, Programmability, 0202 electrical engineering, electronic engineering, information engineering, Parallelism (grammar), 020201 artificial intelligence & image processing, High performance computing, Compiler, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], computer, OmpSs-2
Abstract: Task-based parallel programming models based on compiler directives have proved their effectiveness at describing parallelism in High-Performance Computing (HPC) applications. Recent studies show that cutting-edge Real-Time applications, such as those for unmanned vehicles, can successfully exploit these models. In this scenario, OpenMP is a de facto standard for HPC, and is being studied for Real-Time systems due to its time-predictability and delimited functional safety. However, changes in OpenMP take time to be standardized because it sweeps along a large community. OmpSs, instead, is a task-based model for fast-prototyping that has been a forerunner of OpenMP since its inception. OmpSs-2, its successor, aims at the same goal, and defines several features that can be introduced in future versions of OpenMP. This work targets compiler-based optimizations to enhance the programmability and performance of OmpSs-2. Regarding the former, we present an algorithm to determine the data-sharing attributes of OmpSs-2 tasks. Regarding the latter, we introduce a new algorithm to automatically release OmpSs-2 task dependencies before a task has completed. This work evaluates both algorithms in a set of well-known benchmarks, and discusses their applicability to the current and future specifications of OpenMP.
Published: 2020
Full Text: View/download PDF

47. Breaking master-slave model between host and FPGAs

Author: Antonio Filgueras, Xavier Martorell, Miquel Vidal, Daniel Jiménez-González, Jaume Bosch, Eduard Ayguadé, Carlos Alvarez, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: 020203 distributed computing, Heterogeneous (hybrid) systems, Computer science, business.industry, Parallel programming (Computer science), 020207 software engineering, Symmetric multiprocessor system, Master/slave, 02 engineering and technology, Programació en paral·lel (Informàtica), Toolchain, Parallel programming languages, Task (computing), Embedded system, 0202 electrical engineering, electronic engineering, information engineering, Programming paradigm, Code (cryptography), business, Field-programmable gate array, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Host (network)
Abstract: This paper proposes to enhance current task-based programming models by breaking their current master-slave approach between the main processor and its hardware accelerators. As a proof-of-concept, it presents an extension of the OmpSs@FPGA toolchain that allows the tasks offloaded into the FPGA to create and synchronize nested tasks on their own without involving the host. Those FPGA spawned tasks may target the host to execute code not suitable for the FPGA, like system calls or I/O operations; or target other kernel accelerators inside the same FPGA. In addition to the programmability benefits of this new feature, the proposed system presents significant performance improvements and a better productivity over the classical master-slave approach. This work has received funding from EPEEC project (European Union’s Horizon 2020 Research and Innovation Programme, under grant agreement No 801051), from Spanish Government (projects SEV-2015-0493 and TIN2015-65316-P, grant BES-2016-078046), and from Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328).
Published: 2020

48. Wavefront parallelization of recurrent neural networks on multi-core architectures

Author: Robin Kumar Sharma, Marc Casas, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, and Barcelona Supercomputing Center
Subjects: Source code, Computer science, Parallel algorithms, media_common.quotation_subject, Pipeline (computing), Inference, Wavefront Parallelization, 02 engineering and technology, Parallel computing, 010501 environmental sciences, 01 natural sciences, Stencil, Neural networks (Computer science), Deep Neural Network (DNN), 0202 electrical engineering, electronic engineering, information engineering, Xarxes neuronals (Informàtica), Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], 0105 earth and related environmental sciences, media_common, 020203 distributed computing, Multi-core processor, Memory hierarchy, Recurrent Neural Networks (RNNs), CPU Task Parallelism, Long-Short Term Memory (LSTM), Dynamic programming, Algorismes paral·lels, Recurrent neural network, OmpSs, Gated Recurrent Units (GRUs)
Abstract: Recurrent neural networks (RNNs) are widely used for natural language processing, time-series prediction, or text analysis tasks. The internal structure of RNN inference and training, in terms of data or control dependencies across their fundamental numerical kernels, complicates the exploitation of model parallelism, which is the reason why just data-parallelism has been traditionally applied to accelerate RNNs. This paper presents W-Par (Wavefront-Parallelization), a comprehensive approach for RNN inference and training on CPUs that relies on applying model parallelism to RNN models. We use fine-grained pipeline parallelism in terms of wavefront computations to accelerate multi-layer RNNs running on multi-core CPUs. Wavefront computations have been widely applied in many scientific computing domains like stencil kernels or dynamic programming. W-Par divides RNN workloads across different parallel tasks by defining input and output dependencies for each RNN cell. Our experiments considering different RNN models demonstrate that W-Par achieves up to 6.6X speed-up for RNN model inference and training in comparison to current state-of-the-art implementations on modern multi-core CPU architectures. Importantly, W-Par maximizes performance on a wide range of scenarios, including different core counts or memory hierarchy configurations, without requiring any change at the source code level. This work has been supported by the European Union's Horizon 2020 research and innovation program (MB2020 project, grant agreement 779877), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328). M. Casas has been partially supported by the Spanish Ministry of Economy, Industry, and Competitiveness under Ramon y Cajal fellowship number RYC2017-23269.
Published: 2020

49. Towards a qualifiable OpenMP framework for embedded systems

Author: Eduardo Quinones, Adrian Munera, Sara Royuela, and Barcelona Supercomputing Center
Subjects: Computer science, Memory allocation, Embedded systems, 02 engineering and technology, OpenMP (Interfície de programació d'aplicacions), computer.software_genre, 01 natural sciences, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Qualification, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Implementation, 010302 applied physics, Parallel processing (Electronic computers), business.industry, Processament en paral·lel (Ordinadors), Integrated software, OpenMP, Data structure, 020202 computer hardware & architecture, System requirements, Task (computing), Memory management, Embedded system, Programming paradigm, Parallelism (grammar), Compiler, High performance computing, business, computer, Parallel execution
Abstract: OpenMP is a very convenient programming model for critical real-time parallel applications due to its powerful tasking model and its proven time predictability. However, current implementations are not suitable for critical environments because they rely on the intensive use of dynamically allocated memory to efficiently manage the parallel execution. This jeopardizes the qualification processes needed to ensure that the integrated software stack is compliant with system requirements. This paper proposes a novel OpenMP framework that statically allocates the data structures needed to efficiently manage the parallel execution of OpenMP tasks. Our framework is composed of a compiler that captures the environment of the OpenMP tasks instantiated along the parallel execution and bounds the exposed parallelism, and a runtime implementing a lazy task creation policy that significantly reduces the runtime memory requirements, whilst exploiting parallelism efficiently. The evaluation shows that our tool achieves the same performance as current OpenMP implementations, while bounding and drastically reducing the dynamic memory requirements at run-time.
Published: 2020

50. HRM: Merging Hardware Event Monitors for Improved Timing Analysis of Complex MPSoCs

Author: Jaume Abella, Enrico Mezzetti, Francisco J. Cazorla, Isabel Serra, Roberto Santalla, Sergi Vilardell, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Subjects: Focus (computing), Event (computing), Computer science, business.industry, Reading (computer), Process (computing), Static timing analysis, Multiprocessing, 02 engineering and technology, Embedded computer systems, Computer Graphics and Computer-Aided Design, 020202 computer hardware & architecture, Software, Sistemes incrustats (Informàtica), 0202 electrical engineering, electronic engineering, information engineering, High performance computing, Noise (video), Electrical and Electronic Engineering, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], business, Càlcul intensiu (Informàtica), Computer hardware
Abstract: The Performance Monitoring Unit (PMU) in MPSoCs is at the heart of the latest measurement-based timing analysis techniques in Critical Embedded Systems. In particular, hardware event monitors (HEMs) in the PMU are used as building blocks in the process of budgeting and verifying software timing by tracking and controlling access counts to shared resources. While the number of HEMs in current MPSoCs reaches hundreds, they are read via Performance Monitoring Counters whose number is usually limited to 4-8, thus requiring multiple runs of each experiment in order to collect all desired HEMs. Despite the effort of engineers in controlling the execution conditions of each experiment, the complexity of current MPSoCs makes it arguably impossible to completely remove the noise affecting each run. As a result, HEMs read in different runs are subject to different variability, and hence, those HEMs captured in different runs cannot be ‘blindly’ merged. In this work, we focus on the NXP T2080 platform where we observed up to 59% variability across different runs of the same experiment for some relevant HEMs (e.g. processor cycles). We develop a HEM reading and merging (HRM) approach to reliably join HEMs across different runs as a fundamental element of any measurement-based timing budgeting and verification technique. Our method builds on order statistics and the selection of an anchor HEM read in all runs to derive the most plausible combination of HEM readings that keeps the distribution of each HEM and their relationship with the anchor HEM intact. This work has been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 772773) and the HiPEAC Network of Excellence.
Published: 2020
