19 results on '"Soria Pardos, Víctor"'
Search Results
2. GenArchBench: A genomics benchmark suite for arm HPC processors
- Author
-
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament de Ciències de la Computació, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. ALBCOM - Algorísmia, Bioinformàtica, Complexitat i Mètodes Formals, López Villellas, Lorien, Langarita Benítez, Rubén, Badouh, Asaf, Soria Pardos, Víctor, Aguado Puig, Quim, López Paradís, Guillem, Doblas Font, Max, Setoain, Javier, Kim, Chulho, Ono, Makoto, Armejach Sanosa, Adrià, Marco Sola, Santiago, Alastruey Benedé, Jesús, Ibáñez Marín, Pablo, and Moretó Planas, Miquel
- Abstract
Arm usage has grown substantially in the High-Performance Computing (HPC) community. Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022, currently sitting in the fourth position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like European Mont-Blanc and U.S. DOE/NNSA Astra are further examples of Arm's irruption into HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data have placed a significant bottleneck on the computational side. While most genomics applications have been thoroughly tested and optimized for x86 systems, just a few are prepared to perform efficiently on Arm machines. Moreover, these applications do not exploit the newly introduced Scalable Vector Extensions (SVE). This paper presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected computationally demanding kernels from the most widely used tools in genome data analysis and ported them to Arm-based A64FX and Graviton3 processors. Overall, the GenArch benchmark suite comprises 13 multi-core kernels from critical stages of widely-used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Our benchmark suite includes different input data sets per kernel (small and large), each with a corresponding regression test to verify the correctness of each execution automatically. Moreover, the porting features the usage of the novel Arm SVE instructions, algorithmic and code optimizations, and the exploitation of Arm-optimized libraries. 
We present the optimizations implemented in each kernel and a detailed performance evaluation and comparison of their performance on four different HPC machines (i.e., A64FX, Graviton3, Intel Xeon, This work has been partially supported by the Spanish Ministry of Science and Innovation MCIN/AEI/10.13039/501100011033 (contracts PID2019-107255GB-C21, PID2019-105660RB-C21, PID2022136454NB-C22, and TED2021-132634A-I00), by the Generalitat de Catalunya, Spain (contract 2021-SGR-763), by the Gobierno de Aragón (T58_23R research group), by the European Union NextGenerationEU/ PRTR, and by Lenovo BSC Contract-Framework Contract (2020)., Peer Reviewed, Postprint (published version)
- Published
- 2024
3. On the use of many-core Marvell ThunderX2 processor for HPC workloads
- Author
-
Soria-Pardos, Víctor, Armejach, Adrià, Suárez, Darío, and Moretó, Miquel
- Published
- 2021
4. A Tensor Marshaling Unit for Sparse Tensor Algebra on General-Purpose Processors
- Author
-
Siracusa, Marco, primary, Soria-Pardos, Víctor, additional, Sgherzi, Francesco, additional, Randall, Joshua, additional, Joseph, Douglas J., additional, Moretó Planas, Miquel, additional, and Armejach, Adrià, additional
- Published
- 2023
5. DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations
- Author
-
Soria-Pardos, Víctor, primary, Armejach, Adrià, additional, Mück, Tiago, additional, Suárez-Gracia, Dario, additional, Joao, José, additional, Rico, Alejandro, additional, and Moretó, Miquel, additional
- Published
- 2023
6. A Tensor Marshaling Unit for sparse tensor algebra on general-purpose processors
- Author
-
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Siracusa, Marco, Soria Pardos, Víctor, Sgherzi, Francesco, Randall, Joshua, Joseph, Douglas J., Moretó Planas, Miquel, and Armejach Sanosa, Adrià
- Abstract
This paper proposes the Tensor Marshaling Unit (TMU), a near-core programmable dataflow engine for multicore architectures that accelerates tensor traversals and merging, the most critical operations of sparse tensor workloads running on today’s computing infrastructures. The TMU leverages a novel multi-lane design that enables parallel tensor loading and merging, which naturally produces vector operands that are marshaled into the core for efficient SIMD computation. The TMU supports all the necessary primitives to be tensor-format and tensor-algebra complete. We evaluate the TMU on a simulated multicore system using a broad set of tensor algebra workloads, achieving 3.6×, 2.8×, and 4.9× speedups over memory-intensive, compute-intensive, and merge-intensive vectorized software implementations, respectively., This work has been partially supported by the Spanish Ministry of Science and Innovation MCIN/AEI/10.13039/501100011033 (contract PID2019-107255GB-C21), the Generalitat of Catalunya (contract 2021-SGR-00763), the Arm-BSC Center of Excellence, the European HiPEAC Network of Excellence, and the European Processor Initiative (EPI), which is part of the European Union’s Horizon 2020 research and innovation program under grant agreement No. 826647. M. Siracusa has been supported through an FI fellowship [2022FI_B 00969] and V. Soria-Pardos through an FPU fellowship [FPU20-02132]. A. Armejach is a Serra Hunter Fellow., Peer Reviewed, Postprint (author's final draft)
- Published
- 2023
7. Sargantana: an academic SoC RISC-V processor in 22nm FDSOI technology
- Author
-
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. EFRICS - Efficient and Robust Integrated Circuits and Systems, Doblas Font, Max, Candón Arenas, Gerard, Carril Gil, Xavier, Dominguez de la Rocha, Marc, Erra, Enric, González Trejo, Alberto, Jiménez, Víctor, Kostalampros, Ioannis-Vatistas, Langarita Benítez, Rubén, Leyva Santes, Neiel, López Paradís, Guillem, Mendoza Escobar, Jonnatan, Oltra Oltra, Josep Angel, Pavón Rivera, Julián, Ramírez Lazo, Cristóbal, Rodas Quiroga, Narcís, Reggiani, Enrico, Rodriguez, Mario, Rojas Morales, Carlos, Ruiz Ramirez, Abraham Josafat, Safadi Figueroa, Hugo Ernesto, Soria Pardos, Víctor, Vargas Valdivieso, Iván, Arreza, Fernando, Figueras Bagué, Roger, Fontova Muste, Pau, Marimon Illana, Joan, Aragonès Cervera, Xavier, Cristal Kestelman, Adrián, Mateo Peña, Diego, Moll Echeto, Francisco de Borja, Moretó Planas, Miquel, Palomar Pérez, Óscar, Sonmez, Nehir, Unsal, Osman Sabri, and Valero Cortés, Mateo
- Abstract
This paper describes the Sargantana System on Chip (SoC), a 64-bit RISC-V single-core processor manufactured in 22 nm FDSOI technology and designed by a number of academic institutions: BSC, UPC, UB, UAB, CIC-IPN and IMB-CNM (CSIC). The SoC includes the processor as well as, among other components, a Phase Locked Loop (PLL) operating up to 2 GHz, interfaces to HyperRAM and a Serdes up to 8 Gbps. The processor has been experimentally shown to operate correctly at 800 MHz., The DRAC project is co-financed by the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of total eligible cost. The authors are part of RedRISCV which promotes activities around open hardware. The Lagarto Project is supported by the Research and Graduate Secretary (SIP) of the Instituto Politécnico Nacional (IPN) from Mexico, and by the CONACyT scholarship for Center for Research in Computing (CIC-IPN)., Peer Reviewed, Article signed by 48 authors: Max Doblas∗, Gerard Candón∗, Xavier Carril∗, Marc Domínguez∗, Enric Erra∗, Alberto González∗, César Hernández†, Víctor Jiménez∗, Vatistas Kostalampros∗, Rubén Langarita∗, Neiel Leyva†, Guillem López-Paradís∗, Jonnatan Mendoza∗, Josep Oltra∗, Julián Pavón∗, Cristóbal Ramírez∗, Narcís Rodas∗, Enrico Reggiani∗, Mario Rodríguez∗, Carlos Rojas∗, Abraham Ruiz∗, Hugo Safadi∗, Víctor Soria∗, Alejandro Suanes‡, Iván Vargas∗, Fernando Arreza∗, Roger Figueras∗, Pau Fontova-Musté∗, Joan Marimon∗, Ricardo Martínez‡, Sergio Moreno¶, Jordi Sacristán‡, Oscar Alonso¶, Xavier Aragonés§, Adrián Cristal∗, Ángel Diéguez¶, Manuel López¶, Diego Mateo§, Francesc Moll∗§, Miquel Moretó∗§, Oscar Palomar∗, Marco A. Ramírez†, Francesc Serra-Graells∥‡, Nehir Sonmez∗, Lluís Terés‡, Osman Unsal∗, Mateo Valero∗§, Luis Villa† / ∗Barcelona Supercomputing Center (BSC), Barcelona, Spain. 
Email: name.surname@bsc.es; †Centro de Investigación en Computación, Instituto Politécnico Nacional (CIC-IPN), Mexico City, Mexico; ‡Institut de Microelectrònica de Barcelona, IMB-CNM (CSIC), Spain. Email: name.surname@imb-cnm.csic.es; §Universitat Politècnica de Catalunya (UPC), Barcelona, Spain. Email: name.surname@upc.edu; ¶Universitat de Barcelona (UB), Barcelona, Spain. Email: name.surname@ub.edu; ∥Universitat Autònoma de Barcelona (UAB), Barcelona, Spain. Email: name.surname@uab.cat, Postprint (author's final draft)
- Published
- 2023
8. DynAMO: Improving parallelism through dynamic placement of atomic memory operations
- Author
-
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Soria Pardos, Víctor, Armejach Sanosa, Adrià, Mück, Tiago, Suárez Gracía, Dario, Joao, Jose A., Rico, Alejandro, and Moretó Planas, Miquel
- Abstract
With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far). This paper evaluates currently available static AMO execution policies implemented in multi-core Systems-on-Chip (SoC) designs, which select AMOs' execution placement (near or far) based on the cache block coherence state. We propose three static policies and show that the performance of static policies is application dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations. Furthermore, we propose DynAMO, a predictor that selects the best location to execute the AMOs. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09× across all workloads and 1.31× on AMO-intensive applications with respect to executing all AMOs near., This research was supported by the Spanish Ministry of Science and Innovation (MCIN) through contracts [PID2019-107255GB-C21], [TED2021-132634A-I00], and [PID2019-105660RB-C21]; the Generalitat of Catalunya through contract [2021-SGR-00763]; the Government of Aragon [T5820R]; the Arm-BSC Center of Excellence, and the European Processor Initiative (EPI) which is part of the European Union’s Horizon 2020 research and innovation program under grant agreement No. 826647. V. Soria-Pardos has been supported through an FPU fellowship [FPU20-02132]; A. Armejach is a Serra Hunter Fellow and has been partially supported by the Grant [IJCI-2017-33945] funded by MCIN/AEI/10.13039/501100011033; M. 
Moreto through a Ramón y Cajal fellowship [RYC-2016-21104]., Peer Reviewed, Postprint (author's final draft)
- Published
- 2023
9. DynAMO: Improving parallelism through dynamic placement of atomic memory operations
- Author
-
Soria Pardos, Víctor, Armejach Sanosa, Adrià, Mück, Tiago, Suárez Gracía, Dario, Joao, Jose A., Rico, Alejandro, Moreto Planas, Miquel, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Barcelona Supercomputing Center
- Subjects
Atomic memory operations, Parallel processing (Electronic computers), Processament en paral·lel (Ordinadors), Sistemes monoxip, Systems on a chip, Multi-core architectures, Data placement, Microarchitecture, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC]
- Abstract
With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far). This paper evaluates currently available static AMO execution policies implemented in multi-core Systems-on-Chip (SoC) designs, which select AMOs' execution placement (near or far) based on the cache block coherence state. We propose three static policies and show that the performance of static policies is application dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations. Furthermore, we propose DynAMO, a predictor that selects the best location to execute the AMOs. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09× across all workloads and 1.31× on AMO-intensive applications with respect to executing all AMOs near. This research was supported by the Spanish Ministry of Science and Innovation (MCIN) through contracts [PID2019-107255GB-C21], [TED2021-132634A-I00], and [PID2019-105660RB-C21]; the Generalitat of Catalunya through contract [2021-SGR-00763]; the Government of Aragon [T5820R]; the Arm-BSC Center of Excellence, and the European Processor Initiative (EPI) which is part of the European Union’s Horizon 2020 research and innovation program under grant agreement No. 826647. V. Soria-Pardos has been supported through an FPU fellowship [FPU20-02132]; A. Armejach is a Serra Hunter Fellow and has been partially supported by the Grant [IJCI-2017-33945] funded by MCIN/AEI/10.13039/501100011033; M. Moreto through a Ramón y Cajal fellowship [RYC-2016-21104].
- Published
- 2023
10. DVINO: A RISC-V vector processor implemented in 65nm technology
- Author
-
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. EFRICS - Efficient and Robust Integrated Circuits and Systems, Cabo Pitarch, Guillem, Candon, Gerard, Carril, Xavier, Doblas Font, Max, Dominguez de la Rocha, Marc, González Trejo, Alberto, Hernández Calderón, César Alejandro, Jiménez Arador, Víctor, Kostalampros, Ioannis-Vatistas, Langarita Benítez, Rubén, Leyva Santes, Neiel Israel, López Paradís, Guillem, Mendoza Escobar, Jonnatan, Minervini Minervini, Francesco, Pavón Rivera, Julián, Ramírez Lazo, Cristóbal, Rodas, Narcis, Reggiani, Enrico, Rodriguez, Mario, Rojas Morales, Carlos, Ruíz Ramírez, Abraham Josafat, Soria Pardos, Víctor, Vargas Valdivieso, Iván, Figueras Bagué, Roger, Fontova, Pau, Marimon Illana, Joan, Montabes, Víctor, Cristal Kestelman, Adrián, Hernández Luz, Carles, Moretó Planas, Miquel, Moll Echeto, Francisco de Borja, Palomar Pérez, Óscar, Rubio Sola, Jose Antonio, Sonmez, Nehir, Unsal, Osman Sabri, and Valero Cortés, Mateo
- Abstract
This paper describes the design, verification, implementation and fabrication of the Drac Vector IN-Order (DVINO) processor, a RISC-V vector processor capable of booting Linux jointly developed by BSC, CIC-IPN, IMB-CNM (CSIC), and UPC. The DVINO processor includes an internally developed two-lane vector processor unit as well as a Phase Locked Loop (PLL) and an Analog-to-Digital Converter (ADC). The paper summarizes the design from the architectural level through logic synthesis and physical design in CMOS 65nm technology., The DRAC project is co-financed by the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of total eligible cost. The authors are part of RedRISCV which promotes activities around open hardware. The Lagarto Project is supported by the Research and Graduate Secretary (SIP) of the Instituto Politecnico Nacional (IPN) from Mexico, and by the CONACyT scholarship for Center for Research in Computing (CIC-IPN)., Peer Reviewed, Article signed by 43 authors: Guillem Cabo∗, Gerard Candón∗, Xavier Carril∗, Max Doblas∗, Marc Domínguez∗, Alberto González∗, Cesar Hernández†, Víctor Jiménez∗, Vatistas Kostalampros∗, Rubén Langarita∗, Neiel Leyva†, Guillem López-Paradís∗, Jonnatan Mendoza∗, Francesco Minervini∗, Julian Pavón∗, Cristobal Ramírez∗, Narcís Rodas∗, Enrico Reggiani∗, Mario Rodríguez∗, Carlos Rojas∗, Abraham Ruiz∗, Víctor Soria∗, Alejandro Suanes‡, Iván Vargas∗, Roger Figueras∗, Pau Fontova∗, Joan Marimon∗, Víctor Montabes∗, Adrián Cristal∗, Carles Hernández∗, Ricardo Martínez‡, Miquel Moretó∗§, Francesc Moll∗§, Oscar Palomar∗§, Marco A. Ramírez†, Antonio Rubio§, Jordi Sacristán‡, Francesc Serra-Graells‡, Nehir Sonmez∗, Lluís Terés‡, Osman Unsal∗, Mateo Valero∗§, Luís Villa† // ∗Barcelona Supercomputing Center (BSC), Barcelona, Spain. 
Email: name.surname@bsc.es; †Centro de Investigación en Computación, Instituto Politécnico Nacional (CIC-IPN), Mexico City, Mexico; ‡ Institut de Microelectronica de Barcelona, IMB-CNM (CSIC), Spain. Email: name.surname@imb-cnm.csic.es; §Universitat Politecnica de Catalunya (UPC), Barcelona, Spain. Email: name.surname@upc.edu, Postprint (author's final draft)
- Published
- 2022
11. Sargantana: A 1 GHz+ in-order RISC-V processor with SIMD vector extensions in 22nm FD-SOI
- Author
-
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Soria Pardos, Víctor, Doblas Font, Max, López Paradís, Guillem, Candón Arenas, Gerard, Rodas Quiroga, Narcís, Carril Gil, Xavier, Fontova Muste, Pau, Leyva Santes, Neiel Israel, Marco-Sola, Santiago, and Moretó Planas, Miquel
- Abstract
The RISC-V open Instruction Set Architecture (ISA) has proven to be a solid alternative to licensed ISAs. In the past 5 years, a plethora of industrial and academic cores and accelerators have been developed implementing this open ISA. In this paper, we present Sargantana, a 64-bit processor based on RISC-V that implements the RV64G ISA, a subset of the vector instructions extension (RVV 0.7.1), and custom application-specific instructions. Sargantana features a highly optimized 7-stage pipeline implementing out-of-order write-back, register renaming, and a non-blocking memory pipeline. Moreover, Sargantana features a Single Instruction Multiple Data (SIMD) unit that accelerates domain-specific applications. Sargantana achieves a 1.26 GHz frequency in the typical corner, and up to 1.69 GHz in the fast corner using 22nm FD-SOI commercial technology. As a result, Sargantana delivers a 1.77× higher Instructions Per Cycle (IPC) than our previous 5-stage in-order DVINO core, reaching 2.44 CoreMark/MHz. Our core design delivers comparable or even higher performance than other state-of-the-art academic cores under the Autobench EEMBC benchmark suite. This way, Sargantana lays the foundations for future RISC-V based core designs able to meet industrial-class performance requirements for scientific, real-time, and high-performance computing applications., This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contract PID2019-107255GB-C21), by the Generalitat de Catalunya (contract 2017-SGR-1328), by the European Union within the framework of the ERDF of Catalonia 2014-2020 under the DRAC project [001-P-001723], and by Lenovo-BSC Contract-Framework (2020). The Spanish Ministry of Economy, Industry and Competitiveness has partially supported M. Doblas and V. Soria-Pardos through FPU fellowships no. FPU20-04076 and FPU20-02132, respectively. G. 
Lopez-Paradis has been supported by the Generalitat de Catalunya through a FI fellowship 2021FI-B00994. S. Marco-Sola was supported by Juan de la Cierva fellowship grant IJC2020-045916-I funded by MCIN/AEI/10.13039/501100011033 and by “European Union NextGenerationEU/PRTR”, and M. Moretó through a Ramon y Cajal fellowship no. RYC-2016-21104., Peer Reviewed, Postprint (author's final draft)
- Published
- 2022
12. Characterization and modeling of atomic memory operations in arm based architectures
- Author
-
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universidad de Zaragoza, Armejach Sanosa, Adrià, Moretó Planas, Miquel, Suárez, Darío, and Soria Pardos, Víctor
- Abstract
Efficient fine-grain synchronization is a classic computer architecture challenge that has been extensively addressed in the past. Load Link and Store Conditional (LL/SC) became one of the few solutions to this problem and today it is still part of the state of the art. However, as the core count keeps growing, many Instruction Set Architectures (ISAs) have started to support other synchronization instructions that scale better, such as Atomic Memory Operations (AMOs). In this work we present a characterization of LL/SC and AMO instructions in two current Arm-based server machines. Furthermore, Arm has released its Network-on-Chip (NoC) specification, enabling different hardware implementations of how AMOs are executed in a multicore. Since the adoption of this new standard is still in its first stages, we have modeled six different AMO policies to explore the hardware design trade-offs. We find that no single implementation outperforms the rest. Therefore, we have designed a hardware solution to dynamically select the best configuration, obtaining up to 1.15x speed-ups on relevant benchmarks from the Splash-3 benchmark suite.
- Published
- 2022
13. Characterization and modeling of atomic memory operations in arm based architectures
- Author
-
Soria Pardos, Víctor, Armejach Sanosa, Adrià, Moreto Planas, Miquel, Suárez, Darío, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universidad de Zaragoza
- Subjects
Predictors, Arm, Computer architecture, Synchronization, Multicores, Atomic, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Arquitectura d'ordinadors
- Abstract
Efficient fine-grain synchronization is a classic computer architecture challenge that has been extensively addressed in the past. Load Link and Store Conditional (LL/SC) became one of the few solutions to this problem and today it is still part of the state of the art. However, as the core count keeps growing, many Instruction Set Architectures (ISAs) have started to support other synchronization instructions that scale better, such as Atomic Memory Operations (AMOs). In this work we present a characterization of LL/SC and AMO instructions in two current Arm-based server machines. Furthermore, Arm has released its Network-on-Chip (NoC) specification, enabling different hardware implementations of how AMOs are executed in a multicore. Since the adoption of this new standard is still in its first stages, we have modeled six different AMO policies to explore the hardware design trade-offs. We find that no single implementation outperforms the rest. Therefore, we have designed a hardware solution to dynamically select the best configuration, obtaining up to 1.15x speed-ups on relevant benchmarks from the Splash-3 benchmark suite.
- Published
- 2022
14. On the use of many-core Marvell ThunderX2 processor for HPC workloads
- Author
-
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Soria Pardos, Víctor, Armejach Sanosa, Adrià, Suárez Gracía, Dario, and Moretó Planas, Miquel
- Abstract
Marvell’s ThunderX2 has been the first Arm-based processor with deployments in large-scale HPC production systems, challenging the dominance that x86 processors have had in recent decades. While x86 processors and their software stack have been characterized in detail, the behavior of their Arm counterparts is not well known, limiting their adoption. This work methodically characterizes the performance and power efficiency of the ThunderX2 running different HPC workloads compiled with two state-of-the-art compilers, GCC and the Arm HPC Compiler. We study the maturity of available compilers and find that the Arm HPC Compiler is able to apply additional optimizations, resulting in better performance than GCC. In addition, we compare both performance and power with respect to an Intel Skylake processor. Despite the faster single-thread performance of Skylake, ThunderX2 is able to match its performance on multi-threaded workloads due to its superior memory bandwidth. However, the power efficiency of ThunderX2 is far from matching that of Skylake-based processors when AVX512 extensions are used., Peer Reviewed, Postprint (author's final draft)
- Published
- 2021
15. An academic RISC-V silicon implementation based on open-source components
- Author
-
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Universitat Politècnica de Catalunya. HIPICS - Grup de Circuits i Sistemes Integrats d'Altes Prestacions, Abella Ferrer, Jaume, Bulla, Calvin, Cabo Pitarch, Guillem, Cazorla Almeida, Francisco Javier, Cristal Kestelman, Adrián, Doblas Font, Max, Figueras Bagué, Roger, González Trejo, Alberto, Hernández Luz, Carles, Hernández Calderón, César Alejandro, Jiménez Arador, Víctor, Kosmidis, Leonidas, Kostalampros, Ioannis-Vatistas, Langarita Benítez, Rubén, Leyva Santes, Neiel, López Paradís, Guillem, Marimon Illana, Joan, Martínez Martínez, Ricardo, Mendoza Escobar, Jonnatan, Moll Echeto, Francisco de Borja, Moretó Planas, Miquel, Pavón Rivera, Julián, Ramírez Lazo, Cristóbal, Ramírez Salinas, Marco Antonio, Rojas Morales, Carlos, Rubio Sola, Jose Antonio, Ruiz, Abraham Josafat, Sonmez, Nehir, Soria Pardos, Víctor, Teres Teres, Lluis, Unsal, Osman Sabri, Valero Cortés, Mateo, Vargas Valdivieso, Iván, and Villa Vargas, Luis Alfonso
- Abstract
©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The design presented in this paper, called preDRAC, is a RISC-V general-purpose processor capable of booting Linux, jointly developed by BSC, CIC-IPN, IMB-CNM (CSIC), and UPC. The preDRAC processor is the first RISC-V processor designed and fabricated by a Spanish or Mexican academic institution, and will be the basis of future RISC-V designs jointly developed by these institutions. This paper summarizes the design tasks, for FPGA first and for SoC later, from high-level architectural descriptions down to RTL, and then through logic synthesis and physical design to get the layout ready for its final tapeout in 65 nm CMOS technology. The DRAC project is co-financed by the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of total eligible cost. The authors are part of RedRISCV, which promotes activities around open hardware. The Lagarto Project is supported by the Research and Graduate Secretary (SIP) of the Instituto Politecnico Nacional (IPN) from Mexico, and by the CONACyT scholarship for the Center for Research in Computing (CIC-IPN). Peer Reviewed, Postprint (author's final draft)
- Published
- 2020
16. On the use of many-core Marvell ThunderX2 processor for HPC workloads
- Author
-
Soria-Pardos, Víctor, primary, Armejach, Adrià, additional, Suárez, Darío, additional, and Moretó, Miquel, additional
- Published
- 2020
17. Caracterización de aplicaciones HPC para extensiones vectoriales de ARM
- Author
-
Soria Pardos, Víctor, Armejach Sanosa, Adrià, and Moretó Planas, Miquel
- Abstract
Nowadays, most instruction set architectures (ISAs) include instructions that process multiple data elements in a single instruction. These instructions are used to speed up High Performance Computing (HPC) applications. The first part of this work characterizes HPC applications that have been optimized using NEON, the current subset of vector instructions supported by processors based on the ARMv8 ISA. For this purpose, we have at our disposal two high-end ARMv8 processors, ThunderX and ThunderX2, and two of the main compilers on the market, GCC and Arm HPC Compiler. With them we have characterized a collection of benchmarks extracted from the RAJAPerf benchmark suite, together with the HACCKernels and HPCG applications. The characterization includes a series of experiments that measure speed-up, scalability, energy efficiency, and power consumption. In addition, we have analyzed the assembly code to identify which optimizations have been applied and which characteristics make some experiments faster than others. The second part of this work focuses on Arm's new Scalable Vector Extension (SVE), which is specified in the ARMv8.2 ISA. This specification introduces the Vector-Length Agnostic (VLA) programming model, which lets processor manufacturers choose vector lengths between 128 and 2048 bits when implementing their microarchitectures. To this day, no machine implements this new instruction set, so we have used an emulation tool developed by Arm (ArmIE). This tool allows binaries compiled with SVE support to run on ARMv8 processors. Our work analyzes how the GCC and Arm HPC Compiler compilers vectorize these benchmarks and also proposes some low-level optimizations to improve code generation.
- Published
- 2019
18. Characterization of HPC applications for ARM SIMD instructions
- Author
-
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Armejach Sanosa, Adrià, Moretó Planas, Miquel, and Soria Pardos, Víctor
- Abstract
Nowadays, most Instruction Set Architectures (ISAs) include Single Instruction, Multiple Data (SIMD) instructions to speed up High Performance Computing (HPC) applications. The first part of this work aims to characterize HPC applications optimized using the NEON extension, which is the current SIMD extension supported by ARMv8 processors. For this purpose, we have two high-end ARMv8 processors, ThunderX and ThunderX2, and two mainstream commercial ARMv8 compilers, GCC and Arm HPC Compiler. With this setup we have characterized a collection of benchmarks extracted from the RAJAPerf, HACCKernels, and HPCG benchmarks. The characterization includes experimental work to obtain speed-up, scalability, energy efficiency, and power efficiency measurements for all benchmarks. Moreover, we have examined the assembly code to identify which optimizations each compiler uses that make benchmarks run faster or slower. The second part of this work focuses on the novel Scalable Vector Extension (SVE) specified in the ARMv8.2 ISA. This SIMD specification introduces a Vector-Length Agnostic programming model, which enables implementation choices for vector lengths that scale from 128 to 2048 bits. To this day, no real processor implements this new ISA; therefore, we have used the Arm Instruction Emulator (ArmIE), an emulation tool developed by Arm that allows SVE-compiled binaries to run on an ARMv8 processor. Our work analyzes how compilers that support SVE (GCC and Arm HPC Compiler) vectorize the benchmarks and assesses the quality of the generated assembly code. We also propose some low-level optimizations to improve code generation.
- Published
- 2019
19. Characterization of HPC applications for ARM SIMD instructions
- Author
-
Soria Pardos, Víctor, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Armejach Sanosa, Adrià, and Moreto Planas, Miquel
- Subjects
efficiency, optimization, vectorization, speed-up, OpenMP, NEON, Skylake, ThunderX2, compiler, HPC, Arm, High performance computing, GCC, Vector analysis, SVE, performance, ThunderX
- Abstract
Nowadays, most Instruction Set Architectures (ISAs) include Single Instruction, Multiple Data (SIMD) instructions to speed up High Performance Computing (HPC) applications. The first part of this work aims to characterize HPC applications optimized using the NEON extension, which is the current SIMD extension supported by ARMv8 processors. For this purpose, we have two high-end ARMv8 processors, ThunderX and ThunderX2, and two mainstream commercial ARMv8 compilers, GCC and Arm HPC Compiler. With this setup we have characterized a collection of benchmarks extracted from the RAJAPerf, HACCKernels, and HPCG benchmarks. The characterization includes experimental work to obtain speed-up, scalability, energy efficiency, and power efficiency measurements for all benchmarks. Moreover, we have examined the assembly code to identify which optimizations each compiler uses that make benchmarks run faster or slower. The second part of this work focuses on the novel Scalable Vector Extension (SVE) specified in the ARMv8.2 ISA. This SIMD specification introduces a Vector-Length Agnostic programming model, which enables implementation choices for vector lengths that scale from 128 to 2048 bits. To this day, no real processor implements this new ISA; therefore, we have used the Arm Instruction Emulator (ArmIE), an emulation tool developed by Arm that allows SVE-compiled binaries to run on an ARMv8 processor. Our work analyzes how compilers that support SVE (GCC and Arm HPC Compiler) vectorize the benchmarks and assesses the quality of the generated assembly code. We also propose some low-level optimizations to improve code generation.