Back to Search
Start Over
Efficient String Sort with Multi-Character Encoding and Adaptive Sampling
- Source :
- SIGMOD Conference
- Publication Year :
- 2021
- Publisher :
- ACM, 2021.
-
Abstract
- Sorting plays a fundamental role in computer science. It has far reaching applications in database operations and data science tasks. An important class of sorting keys are strings and among all string sorting methods, radix sort is a simple but effective algorithm. Many works have been studied to accelerate radix string sort. One typical approach is to process multiple characters in each sorting pass. However, this approach incurs the crucial issue of the radix being too large. To address the problem, we introduce a novel multi-character encoding based method that can significantly reduce the radix. This new encoding scheme takes advantage of the sparse alphabet space usage as well as the sparsity of distinguishing prefixes of the inputs which are commonly seen in real-world datasets. Combining the effective encoding scheme with an adaptive sampling process to generate the encoding efficiently, our proposed sorting algorithm essentially blends radix sort with sample sort and achieves substantial improvement over other sorting approaches. The results on both real datasets and synthetic datasets show that our method yields an average 4.85× performance improvement over C++ STL sort[21], 1.47× improvement over the state-of-the-art Radix Sort on strings implementation[19] and 2.55× over the multikey quicksort[6]. Preliminary tests in a multi-core environment also show it is competitive or better than the most recent parallel string sorting algorithm pS5[8] which demonstrates the scalability of our method.
- Subjects :
- 020203 distributed computing
Theoretical computer science
Sorting algorithm
Computer science
Radix sort
String (computer science)
Sorting
Character encoding
0102 computer and information sciences
02 engineering and technology
01 natural sciences
010201 computation theory & mathematics
0202 electrical engineering, electronic engineering, information engineering
sort
Radix
Quicksort
Subjects
Details
- Database :
- OpenAIRE
- Journal :
- Proceedings of the 2021 International Conference on Management of Data
- Accession number :
- edsair.doi...........83a1aad0071f007dc5dc5441d247aef4
- Full Text :
- https://doi.org/10.1145/3448016.3457319