Back to Search Start Over

Efficient String Sort with Multi-Character Encoding and Adaptive Sampling

Authors :
Aoying Zhou
Weining Qian
Wen Jin
Source :
SIGMOD Conference
Publication Year :
2021
Publisher :
ACM, 2021.

Abstract

Sorting plays a fundamental role in computer science. It has far reaching applications in database operations and data science tasks. An important class of sorting keys are strings and among all string sorting methods, radix sort is a simple but effective algorithm. Many works have been studied to accelerate radix string sort. One typical approach is to process multiple characters in each sorting pass. However, this approach incurs the crucial issue of the radix being too large. To address the problem, we introduce a novel multi-character encoding based method that can significantly reduce the radix. This new encoding scheme takes advantage of the sparse alphabet space usage as well as the sparsity of distinguishing prefixes of the inputs which are commonly seen in real-world datasets. Combining the effective encoding scheme with an adaptive sampling process to generate the encoding efficiently, our proposed sorting algorithm essentially blends radix sort with sample sort and achieves substantial improvement over other sorting approaches. The results on both real datasets and synthetic datasets show that our method yields an average 4.85× performance improvement over C++ STL sort[21], 1.47× improvement over the state-of-the-art Radix Sort on strings implementation[19] and 2.55× over the multikey quicksort[6]. Preliminary tests in a multi-core environment also show it is competitive or better than the most recent parallel string sorting algorithm pS5[8] which demonstrates the scalability of our method.

Details

Database :
OpenAIRE
Journal :
Proceedings of the 2021 International Conference on Management of Data
Accession number :
edsair.doi...........83a1aad0071f007dc5dc5441d247aef4
Full Text :
https://doi.org/10.1145/3448016.3457319