Level-3 BLAS on a GPU: Picking the Low Hanging Fruit

Authors :: Quintana-Ortí, Gregorio
Van de Geijn, Robert A.
Source :: Repositori Universitat Jaume I, Universitat Jaume I
Publication Year :: 2009
Publisher :: Departament d'Enginyeria i Ciència dels Computadors, Universitat Jaume I, 2009.
Abstract: The arrival of hardware accelerators has created a new gold rush to be the first to deliver their promise of high performance for numerical applications. Since they are relatively hard to program, with limited language and compiler support, it is generally accepted that one needs to roll up one's sleeves and tough it out, not unlike the early days of distributed memory parallel computing (or any other period after the introduction of a drastically different architecture). In this paper we remind the community that while this is a noble endeavor, there is a lot of low hanging fruit that can be harvested easily. Picking this low hanging fruit benefits the scientific computing community immediately and prototypes the approach that further optimizations may wish to follow. We demonstrate this by focusing on a widely used set of operations, the level-3 BLAS, targeting the NVIDIA family of GPUs.