1. Fast, parallel, and cache-friendly suffix array construction
- Author
-
Jamshed Khan, Tobias Rubel, Erin Molloy, Laxman Dhulipala, and Rob Patro
- Subjects
Suffix array ,Longest common prefix ,Data structures ,Indexing ,Parallel algorithms ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Purpose String indexes such as the suffix array (sa) and the closely related longest common prefix (lcp) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. Methods In this paper we present caps-sa, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, caps-sa has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. Results We show that despite its simple design, caps-sa outperforms existing state-of-the-art parallel sa and lcp-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context sa and show that caps-sa can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://github.com/jamshed/CaPS-SA .
- Published
- 2024
- Full Text
- View/download PDF