Start Over

GVLE: a highly optimized GPU-based implementation of variable-length encoding.

Authors :: Fuentes-Alventosa, Antonio
Gómez-Luna, Juan
Medina-Carnicer, R.
Source :: Journal of Supercomputing. May2023, Vol. 79 Issue 8, p8447-8474. 28p.
Publication Year :: 2023
Abstract: Nowadays, the massive use of multimedia data gives to data compression a fundamental role in reducing the storage requirements and communication bandwidth. Variable-length encoding (VLE) is a relevant data compression method that reduces input data size by assigning shorter codewords to mostly used symbols, and longer codewords to rarely utilized symbols. As it is a common strategy in many compression algorithms, such as the popular Huffman coding, speeding VLE up is essential to accelerate them. For this reason, during the last decade and a half, efficient VLE implementations have been presented in the area of General Purpose Graphics Processing Units (GPGPU). The main performance issues of the state-of-the-art GPU-based implementations of VLE are the following. First, the way in which the codeword look-up table is stored in shared memory is not optimized to reduce the bank conflicts. Second, input/output data are read/written through inefficient strided global memory accesses. Third, the way in which the thread-codes are built is not optimized to reduce the number of executed instructions. Our goal in this work is to significantly speed up the state-of-the-art implementations of VLE by solving their performance issues. To this end, we propose GVLE, a highly optimized implementation of VLE on GPU, which uses the following optimization strategies. First, the caching of the codeword look-up table is done in a way that minimizes the bank conflicts. Second, input data are read by using vectorized loads to exploit fully the available global memory bandwidth. Third, each thread encoding is performed efficiently in the register space with high instruction-level parallelism and lower number of executed instructions. Fourth, a novel inter-block scan method, which outperforms those of state-of-the-art solutions, is used to calculate the bit-positions of the thread-blocks encodings in the output bit-stream. Our proposed mechanism is based on a regular segmented scan performed efficiently on sequences of bit-lengths of 32 consecutive thread-blocks encodings by using global atomic additions. Fifth, output data are written efficiently by executing coalesced global memory stores. An exhaustive experimental evaluation shows that our solution is on average 2.6 × faster than the best state-of-the-art implementation. Additionally, it shows that the scan algorithm is on average 1.62 × faster if it utilizes our inter-block scan method instead of that of the best state-of-the-art VLE solution. Hence, our inter-block scan method offers promising possibilities to accelerate algorithms that require it, such as the scan itself or the stream compaction. [ABSTRACT FROM AUTHOR]