Universitat Politècnica de València. Departamento de Informática de Sistemas y Computadores - Departament d'Informàtica de Sistemes i Computadors, European Commission, Generalitat Valenciana, European Regional Development Fund, Ministerio de Economía y Competitividad, Ministerio de Educación, Cultura y Deporte, Comisión Interministerial de Ciencia y Tecnología, Catalán, Sandra, Castelló, Adrián, Igual, Francisco D., Rodríguez-Sánchez, Rafael, and Quintana Ortí, Enrique Salvador
[EN] We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded version of the Basic Linear Algebra Subprograms (BLAS). The proposed approach also differs from the more sophisticated runtime-based implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and by realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified at a high level of abstraction, the actual implementation can be easily derived from them, paving the way toward a high-performance implementation of a considerable fraction of the Linear Algebra PACKage (LAPACK) functionality on any multicore platform with an OpenMP-like runtime.
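To make the idea of static look-ahead concrete, the following is a minimal sketch in C with OpenMP, not the authors' implementation. It assumes a row-major matrix, an LU factorization without pivoting (so the input should be, for example, diagonally dominant), and plain loops in place of tuned BLAS kernels. The names `factor_panel`, `update_cols`, `lu_lookahead`, and `lu_residual` are hypothetical. In each iteration one section updates and factorizes the next panel while the other section applies the current panel to the remaining trailing columns, so the panel factorization is taken off the critical path.

```c
#include <assert.h>

/* Unblocked LU (no pivoting) of the panel formed by rows k..n-1 and
   columns k..k+kb-1 of the row-major n x n matrix A. */
static void factor_panel(double *A, int n, int k, int kb) {
    for (int j = k; j < k + kb; j++) {
        double piv = A[j * n + j];
        for (int i = j + 1; i < n; i++) {
            A[i * n + j] /= piv;                     /* multiplier l_ij */
            for (int c = j + 1; c < k + kb; c++)
                A[i * n + c] -= A[i * n + j] * A[j * n + c];
        }
    }
}

/* Apply the already factored panel k (width kb) to columns [c0, c1):
   a triangular solve on rows k..k+kb-1, then a rank-kb update below. */
static void update_cols(double *A, int n, int k, int kb, int c0, int c1) {
    for (int i = k; i < k + kb; i++)                 /* U12 := inv(L11) * A12 */
        for (int p = k; p < i; p++)
            for (int j = c0; j < c1; j++)
                A[i * n + j] -= A[i * n + p] * A[p * n + j];
    for (int i = k + kb; i < n; i++)                 /* A22 := A22 - L21 * U12 */
        for (int p = k; p < k + kb; p++)
            for (int j = c0; j < c1; j++)
                A[i * n + j] -= A[i * n + p] * A[p * n + j];
}

/* Blocked right-looking LU with static look-ahead of depth one.
   On exit A holds L (unit diagonal, strictly below) and U (on and above). */
void lu_lookahead(double *A, int n, int nb) {
    assert(n > 0 && nb > 0);
    factor_panel(A, n, 0, nb < n ? nb : n);          /* first panel, up front */
    for (int k = 0; k < n; k += nb) {
        int kb = (n - k < nb) ? n - k : nb;
        int j1 = k + kb;                             /* first column of next panel */
        if (j1 >= n) break;
        int nb1 = (n - j1 < nb) ? n - j1 : nb;
        int j2 = j1 + nb1;                           /* first column of the rest */
        /* The two sections write disjoint column ranges and only read panel k. */
        #pragma omp parallel sections
        {
            #pragma omp section
            {   /* look-ahead branch: update the next panel, then factor it */
                update_cols(A, n, k, kb, j1, j2);
                factor_panel(A, n, j1, nb1);
            }
            #pragma omp section
            {   /* remainder of the trailing update for the current panel */
                update_cols(A, n, k, kb, j2, n);
            }
        }
    }
}

/* Largest absolute entry of A0 - L*U, with L and U packed in LU. */
double lu_residual(const double *A0, const double *LU, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            int m = (i < j) ? i : j;
            for (int p = 0; p <= m; p++)
                s += ((p == i) ? 1.0 : LU[i * n + p]) * LU[p * n + j];
            double d = A0[i * n + j] - s;
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
    return worst;
}
```

Calling `lu_lookahead(A, n, nb)` overwrites `A` with its factors, and `lu_residual` can be used to check the result against a saved copy of the input. If the code is compiled without OpenMP support the pragmas are ignored and the two branches run one after the other, producing the same factors. In a production setting the second section would typically be executed by a team of threads through a multi-threaded BLAS, as the abstract describes, rather than by a single thread.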