1. Efficient pattern matching in degenerate strings with the Burrows–Wheeler transform
- Author
-
Thierry Lecroq, Yannick Guesnet, Jacqueline W. Daykin, Arnaud Lefebvre, Martine Léonard, Laurent Mouchard, Bruce W. Watson, Élise Prieur-Gaston, Richard Groult, Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV), Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), and Normandie Université (NU)
- Subjects
Burrows–Wheeler transform ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,0102 computer and information sciences ,02 engineering and technology ,01 natural sciences ,Theoretical Computer Science ,Combinatorics ,0202 electrical engineering, electronic engineering, information engineering ,Pattern matching ,Time complexity ,Mathematics ,Discrete mathematics ,Sequence ,String (computer science) ,Degenerate energy levels ,String ,Computer Science Applications ,Algorithm ,010201 computation theory & mathematics ,Signal Processing ,020201 artificial intelligence & image processing ,Degenerate ,Alphabet ,Constant (mathematics) ,Information Systems - Abstract
A degenerate or indeterminate string on an alphabet Σ is a sequence of non-empty subsets of Σ. Given a degenerate string t of length n and its Burrows–Wheeler transform we present a new method for searching for a degenerate pattern of length m in t running in O ( m n ) time on a constant size alphabet Σ. Furthermore, it is a hybrid pattern matching technique that works on both regular and degenerate strings. A degenerate string is said to be conservative if its number of non-solid letters is upper-bounded by a fixed positive constant q; in this case we show that the search time complexity is O ( q m 2 ) for counting the number of occurrences and O ( q m 2 + occ ) for reporting the found occurrences where occ is the number of occurrences of the pattern in t. Experimental results show that our method performs well in practice.
- Published
- 2019
- Full Text
- View/download PDF