132 results on '"Cazaux, Bastien"'
Search Results
2. A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs
- Author
-
Park, Sangsoo, Park, Sung Gwan, Cazaux, Bastien, Park, Kunsoo, and Rivals, Eric
- Subjects
Computer Science - Data Structures and Algorithms - Abstract
The hierarchical overlap graph (HOG) is a graph that encodes overlaps from a given set P of n strings, as the overlap graph does. A best known algorithm constructs HOG in O(||P|| log n) time and O(||P||) space, where ||P|| is the sum of lengths of the strings in P. In this paper we present a new algorithm to construct HOG in O(||P||) time and space. Hence, the construction time and space of HOG are better than those of the overlap graph, which are O(||P|| + n^2)., Comment: 8 pages, 2 figures, submitted to CPM 2021
- Published
- 2021
3. Algorithms and Complexity on Indexing Founder Graphs
- Author
-
Equi, Massimo, Norri, Tuukka, Alanko, Jarno, Cazaux, Bastien, Tomescu, Alexandru I., and Mäkinen, Veli
- Subjects
Computer Science - Data Structures and Algorithms ,Computer Science - Computational Complexity ,E.1 ,E.4 ,F.1.3 ,F.2.2 - Abstract
We study the problem of matching a string in a labeled graph. Previous research has shown that unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently (Equi et al. ICALP 2019, SOFSEM 2021). These conditional lower-bounds cover even deterministic graphs with binary alphabet, but there naturally exist also graph classes that are easy to index: E.g. Wheeler graphs (Gagie et al. Theor. Comp. Sci. 2017) cover graphs admitting a Burrows-Wheeler transform -based indexing scheme. However, it is NP-complete to recognize if a graph is a Wheeler graph (Gibney, Thankachan, ESA 2019). We propose an approach to alleviate the construction bottleneck of Wheeler graphs. Rather than starting from an arbitrary graph, we study graphs induced from multiple sequence alignments (MSAs). Elastic degenerate strings (Bernadini et al. SPIRE 2017, ICALP 2019) can be seen as such graphs, and we introduce here their generalization: elastic founder graphs. We first prove that even such induced graphs are hard to index under OVH. Then we introduce two subclasses, repeat-free and semi-repeat-free graphs, that are easy to index. We give a linear time algorithm to construct a repeat-free non-elastic founder graph from a gapless MSA, and (parameterized) near-linear time algorithms to construct semi-repeat-free (repeat-free, respectively) elastic founder graphs from general MSAs. Finally, we show that repeat-free elastic founder graphs admit a reduction to Wheeler graphs in polynomial time., Comment: This is an extended full version of WABI 2020 paper (https://doi.org/10.4230/LIPIcs.WABI.2020.7), whose preprint is in arXiv:2005.09342, and of ISAAC 2021 paper (to appear)
- Published
- 2021
4. Greedy-reduction from Shortest Linear Superstring to Shortest Circular Superstring
- Author
-
Cazaux, Bastien and Rivals, Eric
- Subjects
Computer Science - Data Structures and Algorithms ,68W32 ,F.2.2 - Abstract
A superstring of a set of strings correspond to a string which contains all the other strings as substrings. The problem of finding the Shortest Linear Superstring is a well-know and well-studied problem in stringology. We present here a variant of this problem, the Shortest Circular Superstring problem where the sought superstring is a circular string. We show a strong link between these two problems and prove that the Shortest Circular Superstring problem is NP-complete. Moreover, we propose a new conjecture on the approximation ratio of the Shortest Circular Superstring problem.
- Published
- 2020
5. Linear Time Construction of Indexable Founder Block Graphs
- Author
-
Mäkinen, Veli, Cazaux, Bastien, Equi, Massimo, Norri, Tuukka, and Tomescu, Alexandru I.
- Subjects
Computer Science - Data Structures and Algorithms ,E.1 ,F.2.2 ,J.3 - Abstract
We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SAR-CoV-2 strains are reported. An MSA of size $410\times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels is 12 and 34968, respectively. The index on the graph takes only $3\%$ of the size of the MSA.
- Published
- 2020
6. Strong link between BWT and XBW via Aho-Corasick automaton and applications to Run-Length Encoding
- Author
-
Cazaux, Bastien and Rivals, Eric
- Subjects
Computer Science - Data Structures and Algorithms - Abstract
The boom of genomic sequencing makes compression of set of sequences inescapable. This underlies the need for multi-string indexing data structures that helps compressing the data. The most prominent example of such data structures is the Burrows-Wheeler Transform (BWT), a reversible permutation of a text that improves its compressibility. A similar data structure, the eXtended Burrows-Wheeler Transform (XBW), is able to index a tree labelled with alphabet symbols. A link between a multi-string BWT and the Aho-Corasick automaton has already been found and led to a way to build a XBW from a multi-string BWT. We exhibit a stronger link between a multi-string BWT and a XBW by using the order of the concatenation in the multi-string. This bijective link has several applications: first, it allows to build one data structure from the other; second, it enables one to compute an ordering of the input strings that optimises a Run-Length measure (i.e., the compressibility) of the BWT or of the XBW.
- Published
- 2018
- Full Text
- View/download PDF
7. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time
- Author
-
Norri, Tuukka, Cazaux, Bastien, Kosolobov, Dmitry, and Mäkinen, Veli
- Subjects
Computer Science - Data Structures and Algorithms - Abstract
Given a threshold $L$ and a set $\mathcal{R} = \{R_1, \ldots, R_m\}$ of $m$ haplotype sequences, each having length $n$, the minimum segmentation problem for founder reconstruction is to partition the sequences into disjoint segments $\mathcal{R}[i_1{+}1,i_2], \mathcal{R}[i_2{+}1, i_3], \ldots, \mathcal{R}[i_{r-1}{+}1, i_r]$, where $0 = i_1 < \cdots < i_r = n$ and $\mathcal{R}[i_{j-1}{+}1, i_j]$ is the set $\{R_1[i_{j-1}{+}1, i_j], \ldots, R_m[i_{j-1}{+}1, i_j]\}$, such that the length of each segment, $i_j - i_{j-1}$, is at least $L$ and $K = \max_j\{ |\mathcal{R}[i_{j-1}{+}1, i_j]| \}$ is minimized. The distinct substrings in the segments $\mathcal{R}[i_{j-1}{+}1, i_j]$ represent founder blocks that can be concatenated to form $K$ founder sequences representing the original $\mathcal{R}$ such that crossovers happen only at segment boundaries. We give an optimal $O(mn)$ time algorithm to solve the problem, improving over earlier $O(mn^2)$. This improvement enables to exploit the algorithm on a pan-genomic setting of haplotypes being complete human chromosomes, with a goal of finding a representative set of references that can be indexed for read alignment and variant calling.
- Published
- 2018
- Full Text
- View/download PDF
8. Hierarchical Overlap Graph
- Author
-
Cazaux, Bastien and Rivals, Eric
- Subjects
Computer Science - Data Structures and Algorithms ,68W32, 68W40 ,F.2.2 - Abstract
Given a set of finite words, the Overlap Graph (OG) is a complete weighted digraph where each word is a node and where the weight of an arc equals the length of the longest overlap of one word onto the other (Overlap is an asymmetric notion). The OG serves to assemble DNA fragments or to compute shortest superstrings which are a compressed representation of the input. The OG requires a space is quadratic in the number of words, which limits its scalability. The Hierarchical Overlap Graph (HOG) is an alternative graph that also encodes all maximal overlaps, but uses a space that is linear in the sum of the lengths of the input words. We propose the first algorithm to build the HOG in linear space for words of equal length., Comment: 11 pages, 3 figures, appendix, 6 graphics, 2 algorithms, 13 references
- Published
- 2018
- Full Text
- View/download PDF
9. The Compressed Overlap Index
- Author
-
Canovas, Rodrigo, Cazaux, Bastien, and Rivals, Eric
- Subjects
Computer Science - Data Structures and Algorithms - Abstract
For analysing text algorithms, for computing superstrings, or for testing random number generators, one needs to compute all overlaps between any pairs of words in a given set. The positions of overlaps of a word onto itself, or of two words, are needed to compute the absence probability of a word in a random text, or the numbers of common words shared by two random texts. In all these contexts, one needs to compute or to query overlaps between pairs of words in a given set. For this sake, we designed COvI, a compressed overlap index that supports multiple queries on overlaps: like computing the correlation of two words, or listing pairs of words whose longest overlap is maximal among all possible pairs. COvI stores overlaps in a hierarchical and non-redundant manner. We propose an implementation that can handle datasets of millions of words and still answer queries efficiently. Comparison with a baseline solution - called FullAC - relying on the Aho-Corasick automaton shows that COvI provides significant advantages. For similar construction times, COvI requires half the memory FullAC, and still solves complex queries much faster.
- Published
- 2017
10. Efficient Construction of Hierarchical Overlap Graphs
- Author
-
Park, Sung Gwan, Cazaux, Bastien, Park, Kunsoo, Rivals, Eric, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Boucher, Christina, editor, and Thankachan, Sharma V., editor
- Published
- 2020
- Full Text
- View/download PDF
11. Read Mapping on de Bruijn graph
- Author
-
Limasset, Antoine, Cazaux, Bastien, Rivals, Eric, and Peterlongo, Pierre
- Subjects
Computer Science - Data Structures and Algorithms ,Quantitative Biology - Genomics - Abstract
Background Next Generation Sequencing (NGS) has dramatically enhanced our ability to sequence genomes, but not to assemble them. In practice, many published genome sequences remain in the state of a large set of contigs. Each contig describes the sequence found along some path of the assembly graph, however, the set of contigs does not record all the sequence information contained in that graph. Although many subsequent analyses can be performed with the set of contigs, one may ask whether mapping reads on the contigs is as informative as mapping them on the paths of the assembly graph. Currently, one lacks practical tools to perform mapping on such graphs. Results Here, we propose a formal definition of mapping on a de Bruijn graph, analyse the problem complexity which turns out to be NP-complete, and provide a practical solution.We propose a pipeline called GGMAP (Greedy Graph MAPping). Its novelty is a procedure to map reads on branching paths of the graph, for which we designed a heuristic algorithm called BGREAT (de Bruijn Graph REAd mapping Tool). For the sake of efficiency, BGREAT rewrites a read sequence as a succession of unitigs sequences. GGMAP can map millions of reads per CPU hour on a de Bruijn graph built from a large set of human genomic reads. Surprisingly, results show that up to 22% more reads can be mapped on the graph but not on the contig set. Conclusions Although mapping reads on a de Bruijn graph is complex task, our proposal offers a practical solution combining efficiency with an improved mapping capacity compared to assembly-based mapping even for complex eukaryotic data. Availability: github.com/Malfoy/BGREAT Keywords: Read mapping; De bruijn graphs; NGS; NP-completeness, Comment: BMC Bioinformatics
- Published
- 2015
- Full Text
- View/download PDF
12. Conway–Bromage–Lyndon (CBL): an exact, dynamic representation of k-mer sets.
- Author
-
Martayan, Igor, Cazaux, Bastien, Limasset, Antoine, and Marchet, Camille
- Subjects
- *
ROTATIONAL motion , *KNIVES , *LIBRARIES , *VOCABULARY - Abstract
Summary In this article, we introduce the Conway–Bromage–Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k -mer sets. Originating from Conway and Bromage's concept, CBL innovatively employs the smallest cyclic rotations of k -mers, akin to Lyndon words, to leverage lexicographic redundancies. In order to support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with Elias-Fano's scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k -mer set methods. Unique to this work, CBL stands out as the only known exact k -mer structure offering in-place set operations. Its different combined abilities position it as a flexible Swiss knife structure for k -mer set management. Availability and implementation https://github.com/imartayan/CBL. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
13. Linear Time Maximum Segmentation Problems in Column Stream Model
- Author
-
Cazaux, Bastien, Kosolobov, Dmitry, Mäkinen, Veli, Norri, Tuukka, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Brisaboa, Nieves R., editor, and Puglisi, Simon J., editor
- Published
- 2019
- Full Text
- View/download PDF
14. Tinted de Bruijn Graphs for efficient read extraction from sequencing datasets
- Author
-
Vandamme, Lea, primary, Cazaux, Bastien, additional, and Limasset, Antoine, additional
- Published
- 2024
- Full Text
- View/download PDF
15. Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets
- Author
-
Martayan, Igor, primary, Cazaux, Bastien, additional, Limasset, Antoine, additional, and Marchet, Camille, additional
- Published
- 2024
- Full Text
- View/download PDF
16. Efficient Construction of Hierarchical Overlap Graphs
- Author
-
Park, Sung Gwan, primary, Cazaux, Bastien, additional, Park, Kunsoo, additional, and Rivals, Eric, additional
- Published
- 2020
- Full Text
- View/download PDF
17. Relationship between superstring and compression measures: New insights on the greedy conjecture
- Author
-
Cazaux, Bastien and Rivals, Eric
- Published
- 2018
- Full Text
- View/download PDF
18. A survey of mapping algorithms in the long-reads era
- Author
-
Sahlin, Kristoffer, primary, Baudeau, Thomas, additional, Cazaux, Bastien, additional, and Marchet, Camille, additional
- Published
- 2023
- Full Text
- View/download PDF
19. Superstring Graph: A New Approach for Genome Assembly
- Author
-
Cazaux, Bastien, Sacomoto, Gustavo, Rivals, Eric, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Dondi, Riccardo, editor, Fertin, Guillaume, editor, and Mauri, Giancarlo, editor
- Published
- 2016
- Full Text
- View/download PDF
20. Finding all maximal perfect haplotype blocks in linear time
- Author
-
Alanko, Jarno, Bannai, Hideo, Cazaux, Bastien, Peterlongo, Pierre, and Stoye, Jens
- Published
- 2020
- Full Text
- View/download PDF
21. Linear Time Maximum Segmentation Problems in Column Stream Model
- Author
-
Cazaux, Bastien, primary, Kosolobov, Dmitry, additional, Mäkinen, Veli, additional, and Norri, Tuukka, additional
- Published
- 2019
- Full Text
- View/download PDF
22. Construction of a de Bruijn Graph for Assembly from a Truncated Suffix Tree
- Author
-
Cazaux, Bastien, Lecroq, Thierry, Rivals, Eric, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Dediu, Adrian-Horia, editor, Formenti, Enrico, editor, Martín-Vide, Carlos, editor, and Truthe, Bianca, editor
- Published
- 2015
- Full Text
- View/download PDF
23. dipwmsearch: a Python package for searching di-PWM motifs
- Author
-
Mille, Marie, primary, Ripoll, Julie, additional, Cazaux, Bastien, additional, and Rivals, Eric, additional
- Published
- 2023
- Full Text
- View/download PDF
24. A survey of mapping algorithms in the long-reads era
- Author
-
Sahlin, Kristoffer, Baudeau, Thomas, Cazaux, Bastien, Marchet, Camille, Sahlin, Kristoffer, Baudeau, Thomas, Cazaux, Bastien, and Marchet, Camille
- Abstract
It has been over a decade since the first publication of a method dedicated entirely to mapping long-reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http:// bcazaux.polytech-lille.net/Minimap2/).
- Published
- 2023
- Full Text
- View/download PDF
25. The power of greedy algorithms for approximating Max-ATSP, Cyclic Cover, and superstrings
- Author
-
Cazaux, Bastien and Rivals, Eric
- Published
- 2016
- Full Text
- View/download PDF
26. From Indexing Data Structures to de Bruijn Graphs
- Author
-
Cazaux, Bastien, Lecroq, Thierry, Rivals, Eric, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Kobsa, Alfred, editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Weikum, Gerhard, editor, Kulikov, Alexander S., editor, Kuznetsov, Sergei O., editor, and Pevzner, Pavel, editor
- Published
- 2014
- Full Text
- View/download PDF
27. Linear time minimum segmentation enables scalable founder reconstruction
- Author
-
Norri, Tuukka, Cazaux, Bastien, Kosolobov, Dmitry, and Mäkinen, Veli
- Published
- 2019
- Full Text
- View/download PDF
28. HaploBlocks: Efficient Detection of Positive Selection in Large Population Genomic Datasets
- Author
-
Kirsch-Gerweck, Benedikt, primary, Bohnenkämper, Leonard, additional, Henrichs, Michel T, additional, Alanko, Jarno N, additional, Bannai, Hideo, additional, Cazaux, Bastien, additional, Peterlongo, Pierre, additional, Burger, Joachim, additional, Stoye, Jens, additional, and Diekmann, Yoan, additional
- Published
- 2023
- Full Text
- View/download PDF
29. dipwmsearch: a python package for searching di-PWM motifs
- Author
-
Mille, Marie, primary, Ripoll, Julie, additional, Cazaux, Bastien, additional, and Rivals, Eric, additional
- Published
- 2022
- Full Text
- View/download PDF
30. Algorithms and Complexity on Indexing Founder Graphs
- Author
-
Equi, Massimo, primary, Norri, Tuukka, additional, Alanko, Jarno, additional, Cazaux, Bastien, additional, Tomescu, Alexandru I., additional, and Mäkinen, Veli, additional
- Published
- 2022
- Full Text
- View/download PDF
31. Toward Optimal Fingerprint Indexing for Large Scale Genomics
- Author
-
Clément Agret and Bastien Cazaux and Antoine Limasset, Agret, Clément, Cazaux, Bastien, Limasset, Antoine, Clément Agret and Bastien Cazaux and Antoine Limasset, Agret, Clément, Cazaux, Bastien, and Limasset, Antoine
- Abstract
Motivation. To keep up with the scale of genomic databases, several methods rely on local sensitive hashing methods to efficiently find potential matches within large genome collections. Existing solutions rely on Minhash or Hyperloglog fingerprints and require reading the whole index to perform a query. Such solutions can not be considered scalable with the growing amount of documents to index. Results. We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming state-of-the-art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of Hyperminhash fingerprints in (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure able to index any kind of fingerprints based on inverted indexes that provide optimal queries, namely linear with the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than state-of-the-art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
- Published
- 2022
- Full Text
- View/download PDF
32. A survey of mapping algorithms in the long-reads era
- Author
-
Sahlin, Kristoffer, primary, Baudeau, Thomas, additional, Cazaux, Bastien, additional, and Marchet, Camille, additional
- Published
- 2022
- Full Text
- View/download PDF
33. Superstring Graph: A New Approach for Genome Assembly
- Author
-
Cazaux, Bastien, primary, Sacomoto, Gustavo, additional, and Rivals, Eric, additional
- Published
- 2016
- Full Text
- View/download PDF
34. Algorithms and Complexity on Indexing Elastic Founder Graphs
- Author
-
Equi, Massimo, Norri, Tuukka, Alanko, Jarno, Cazaux, Bastien, Tomescu, Alexandru I., M��kinen, Veli, Ahn, Hee-Kap, Sadakane, Kunihiko, Department of Computer Science, Algorithmic Bioinformatics, Genome-scale Algorithmics research group / Veli Mäkinen, Bioinformatics, and Helsinki Institute for Information Technology
- Subjects
Theory of computation ��� Sorting and searching ,Theory of computation ��� Graph algorithms analysis ,segmentation ,Theory of computation → Problems, reductions and completeness ,113 Computer and information sciences ,Theory of computation ��� Pattern matching ,Applied computing ��� Genomics ,Applied computing → Genomics ,Theory of computation → Graph algorithms analysis ,Theory of computation → Sorting and searching ,Theory of computation → Dynamic programming ,Theory of computation ��� Problems, reductions and completeness ,Theory of computation ��� Dynamic programming ,multiple sequence alignment ,orthogonal vectors hypothesis ,Theory of computation → Pattern matching - Abstract
We study the problem of matching a string in a labeled graph. Previous research has shown that unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently (Equi et al. ICALP 2019, SOFSEM 2021). These conditional lower-bounds cover even deterministic graphs with binary alphabet, but there naturally exist also graph classes that are easy to index: E.g. Wheeler graphs (Gagie et al. Theor. Comp. Sci. 2017) cover graphs admitting a Burrows-Wheeler transform -based indexing scheme. However, it is NP-complete to recognize if a graph is a Wheeler graph (Gibney, Thankachan, ESA 2019). We propose an approach to alleviate the construction bottleneck of Wheeler graphs. Rather than starting from an arbitrary graph, we study graphs induced from multiple sequence alignments. Elastic degenerate strings (Bernadini et al. SPIRE 2017, ICALP 2019) can be seen as such graphs, and we introduce here their generalization: elastic founder graphs. We first prove that even such induced graphs are hard to index under OVH. Then we introduce two subclasses that are easy to index. Moreover, we give a near-linear time algorithm to construct indexable elastic founder graphs. This algorithm is based on an earlier segmentation algorithm for gapless multiple sequence alignments inducing non-elastic founder graphs (M��kinen et al., WABI 2020), but uses more involved techniques to cope with repetitive string collections synchronized with gaps. Finally, we show that one of the subclasses admits a reduction to Wheeler graphs in polynomial time., LIPIcs, Vol. 212, 32nd International Symposium on Algorithms and Computation (ISAAC 2021), pages 20:1-20:18
- Published
- 2021
35. Construction of a de Bruijn Graph for Assembly from a Truncated Suffix Tree
- Author
-
Cazaux, Bastien, primary, Lecroq, Thierry, additional, and Rivals, Eric, additional
- Published
- 2015
- Full Text
- View/download PDF
36. Search algorithms for dinucleotide Position Weight Matrices
- Author
-
Mille, Marie, Ripoll, Julie, Cazaux, Bastien, Rivals, Eric, Méthodes et Algorithmes pour la Bioinformatique (MAB), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), Mémoires - Université de Montpellier - Faculté des sciences (UM FS), Université de Montpellier (UM), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), project FluoRib (INCa grant n°2018-131), Institut Pasteur, Paris, Société Française de Bioinformatique (S.F.B.I.), and ANR-10-LABX-0020,NUMEV,Digital and Hardware Solutions and Modeling for the Environement and Life Sciences(2010)
- Subjects
index ,PSSM ,online search ,pattern matching ,binding site ,motif ,genomics ,RNA ,PWM ,DNA ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,offline search ,transcription factor - Abstract
National audience; Transcription regulation is an important cellular process. Specialized proteins, called Transcription Factors (TF), bind on short, specific, DNA sequences to regulate the expression of nearby genes. The sequences recognized by a TF in the vicinity of different genes are not identical, but similar. One captures the similarity of those binding site in different representations, which are generally called /motifs/. The most widely used sort of motifs are Position Weight Matrix (PWM) (also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM)). A PWM is built from a multiple alignment of "true" binding sequences and capture the observed variation of nucleotides at the different positions. Several databases (JASPAR, TRANSFAC, etc.) collect PWMs for known TFs. Those PWMs are used to scan new DNA sequences to find putative binding sites and possibly to annotate them. In the case of complete genomes, the scanning procedure for many PWM may last a long time \cite{Pizzi2011}.PWM assume that the distinct positions of the bound sequence are independent of each other. However, several works have observed that a mutation at given position influence the probability of mutation at neighboring positions. To overcome this limitation of PWMs, Kulakovskiy et al. have proposed a more complex sort of motif, called di-PWMs, which model the frequency of occurrence of dinucleotides in the binding sites (instead of mononucleotides for PWMs) \cite{KULAKOVSKIY2013}. Their studies show that di-PWMs improve in sensitivity compared to PWMs, and thus produce less false positives when scanning a sequence.Our aim is to design new search algorithms for di-PWMs, either online or offline, and to implement them. Our online scanning algorithm computes a partial score for some positions in the current window, and estimates the maximum achievable score for the whole window. If this score does not match requested threshold, the window can be discarded. Our offline approach works in two steps. As for read mapping, the genome is first preprocessed to produce an index data structure. Then, for any given motif and a threshold score, we enumerate potential matching words (that word whose score lies above the threshold) and search their occurrences in the index in optimal time. The difficulty is to design an efficient enumeration algorithm. Such an approach was developed for PWMs in the MOTIF software \cite{MOTIF}.If time suffices, we plan to compare experimentally our implementations to that of an existing search tool for di-PWMs, called SPRY-SARUS \cite{Pizzi2011}.
- Published
- 2021
37. Toward optimal fingerprint indexing for large scale genomics
- Author
-
Agret, Clement, primary, Cazaux, Bastien, additional, and Limasset, Antoine, additional
- Published
- 2021
- Full Text
- View/download PDF
38. From Indexing Data Structures to de Bruijn Graphs
- Author
-
Cazaux, Bastien, primary, Lecroq, Thierry, additional, and Rivals, Eric, additional
- Published
- 2014
- Full Text
- View/download PDF
39. Founder reconstruction enables scalable and seamless pangenomic analysis
- Author
-
Norri, Tuukka, primary, Cazaux, Bastien, additional, Dönges, Saska, additional, Valenzuela, Daniel, additional, and Mäkinen, Veli, additional
- Published
- 2021
- Full Text
- View/download PDF
40. Algorithms and Complexity on Indexing Elastic Founder Graphs
- Author
-
Massimo Equi and Tuukka Norri and Jarno Alanko and Bastien Cazaux and Alexandru I. Tomescu and Veli Mäkinen, Equi, Massimo, Norri, Tuukka, Alanko, Jarno, Cazaux, Bastien, Tomescu, Alexandru I., Mäkinen, Veli, Massimo Equi and Tuukka Norri and Jarno Alanko and Bastien Cazaux and Alexandru I. Tomescu and Veli Mäkinen, Equi, Massimo, Norri, Tuukka, Alanko, Jarno, Cazaux, Bastien, Tomescu, Alexandru I., and Mäkinen, Veli
- Abstract
We study the problem of matching a string in a labeled graph. Previous research has shown that unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently (Equi et al. ICALP 2019, SOFSEM 2021). These conditional lower-bounds cover even deterministic graphs with binary alphabet, but there naturally exist also graph classes that are easy to index: E.g. Wheeler graphs (Gagie et al. Theor. Comp. Sci. 2017) cover graphs admitting a Burrows-Wheeler transform -based indexing scheme. However, it is NP-complete to recognize if a graph is a Wheeler graph (Gibney, Thankachan, ESA 2019). We propose an approach to alleviate the construction bottleneck of Wheeler graphs. Rather than starting from an arbitrary graph, we study graphs induced from multiple sequence alignments. Elastic degenerate strings (Bernadini et al. SPIRE 2017, ICALP 2019) can be seen as such graphs, and we introduce here their generalization: elastic founder graphs. We first prove that even such induced graphs are hard to index under OVH. Then we introduce two subclasses that are easy to index. Moreover, we give a near-linear time algorithm to construct indexable elastic founder graphs. This algorithm is based on an earlier segmentation algorithm for gapless multiple sequence alignments inducing non-elastic founder graphs (Mäkinen et al., WABI 2020), but uses more involved techniques to cope with repetitive string collections synchronized with gaps. Finally, we show that one of the subclasses admits a reduction to Wheeler graphs in polynomial time.
- Published
- 2021
- Full Text
- View/download PDF
41. A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs
- Author
-
Sangsoo Park and Sung Gwan Park and Bastien Cazaux and Kunsoo Park and Eric Rivals, Park, Sangsoo, Park, Sung Gwan, Cazaux, Bastien, Park, Kunsoo, Rivals, Eric, Sangsoo Park and Sung Gwan Park and Bastien Cazaux and Kunsoo Park and Eric Rivals, Park, Sangsoo, Park, Sung Gwan, Cazaux, Bastien, Park, Kunsoo, and Rivals, Eric
- Abstract
The hierarchical overlap graph (HOG) is a graph that encodes overlaps from a given set P of n strings, as the overlap graph does. A best known algorithm constructs HOG in O(||P|| log n) time and O(||P||) space, where ||P|| is the sum of lengths of the strings in P. In this paper we present a new algorithm to construct HOG in O(||P||) time and space. Hence, the construction time and space of HOG are better than those of the overlap graph, which are O(||P|| + n²).
- Published
- 2021
- Full Text
- View/download PDF
42. Hierarchical Overlap Graph
- Author
-
Cazaux, Bastien and Rivals, Eric
- Published
- 2020
- Full Text
- View/download PDF
43. Linear Time Construction of Indexable Founder Block Graphs
- Author
-
Cazaux, Bastien, Equi, Massimo, Norri, Tuukka, Tomescu, Alexandru I., Cazaux, Bastien, Equi, Massimo, Norri, Tuukka, and Tomescu, Alexandru I.
- Published
- 2020
- Full Text
- View/download PDF
44. Linear Time Construction of Indexable Founder Block Graphs
- Author
-
Veli Mäkinen and Bastien Cazaux and Massimo Equi and Tuukka Norri and Alexandru I. Tomescu, Mäkinen, Veli, Cazaux, Bastien, Equi, Massimo, Norri, Tuukka, Tomescu, Alexandru I., Veli Mäkinen and Bastien Cazaux and Massimo Equi and Tuukka Norri and Alexandru I. Tomescu, Mäkinen, Veli, Cazaux, Bastien, Equi, Massimo, Norri, Tuukka, and Tomescu, Alexandru I.
- Abstract
We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SARS-CoV-2 strains are reported. An MSA of size 410× 29811 is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels is 12 and 34968, respectively. The index on the graph takes only 3% of the size of the MSA.
- Published
- 2020
- Full Text
- View/download PDF
45. Linking BWT and XBW via Aho-Corasick Automaton: Applications to Run-Length Encoding
- Author
-
Cazaux, Bastien, Rivals, Eric, Méthodes et Algorithmes pour la Bioinformatique (MAB), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), Institut de Biologie Computationnelle (IBC), Université de Montpellier (UM)-Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Support from the Institut de Biologie Computationnelle (ANR-11-BINF-0002).Eric Rivals: thanks the GEM Flagship project funded from Labex NUMEV (ANR-10-LABX-0020)., University of Pisa, Nadia Pisanti, Solon P. Pissis, ANR-11-BINF-0002,IBC,Institut de biologie Computationnelle(2011), ANR-10-LABX-0020,NUMEV,Digital and Hardware Solutions and Modeling for the Environement and Life Sciences(2010), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), ANR-11-BINF-0002,IBC,Institut de Biologie Computationnelle de Montpellier(2011), and ANR-10-LABX-0020/10-LABX-0020,NUMEV,Digital and Hardware Solutions and Modeling for the Environement and Life Sciences(2010)
- Subjects
Run Length Encoding (RLE) ,Aho-Corasick automaton ,Aho-Corasick Tree ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Burrows Wheeler Transform ,Compression ,XBWT ,[INFO.INFO-DM]Computer Science [cs]/Discrete Mathematics [cs.DM] ,Data Structures ,Algorithm ,2012 ACM Subject Classification Mathematics of computing → Discrete mathematics ,Theory of computation → Randomness, geometry and discrete structures ,Theory of computation → Data structures and algorithms for data management ,Indexing ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,Stringology - Abstract
URN: urn:nbn:de:0030-drops-104955URL: http://drops.dagstuhl.de/opus/volltexte/2019/10495/ISBN ={978-3-95977-103-0},ISSN ={1868-8969},; International audience; The boom of genomic sequencing makes compression of sets of sequences inescapable. This underlies the need for multi-string indexing data structures that helps compressing the data. The most prominent example of such data structures is the Burrows-Wheeler Transform (BWT), a reversible permutation of a text that improves its compressibility. A similar data structure, the eXtended Burrows-Wheeler Transform (XBW), is able to index a tree labelled with alphabet symbols. A link between a multi-string BWT and the Aho-Corasick automaton has already been found and led to a way to build a XBW from a multi-string BWT. We exhibit a stronger link between a multi-string BWT and a XBW by using the order of the concatenation in the multi-string. This bijective link has several applications: first, it allows one to build one data structure from the other; second, it enables one to compute an ordering of the input strings that optimises a Run-Length measure (i.e., the compressibility) of the BWT or of the XBW.
- Published
- 2019
46. Lien entre Transformée de Burrows-Wheeler (BWT) et BWT étendue (XBW) via l'automate d'Aho-Corasick : applications en compression par encodage des longueurs
- Author
-
Cazaux, Bastien, Rivals, Eric, Pisanti, Nadia, Pissis, Solon P., Department of Computer Science, Méthodes et Algorithmes pour la Bioinformatique (MAB), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Institut de Biologie Computationnelle (IBC), Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Support from the Institut de Biologie Computationnelle (ANR-11-BINF-0002).Eric Rivals: thanks the GEM Flagship project funded from Labex NUMEV (ANR-10-LABX-0020)., University of Pisa, Nadia Pisanti, Solon P. Pissis, ANR-11-BINF-0002,IBC,Institut de Biologie Computationnelle de Montpellier(2011), ANR-10-LABX-0020/10-LABX-0020,NUMEV,Digital and Hardware Solutions and Modeling for the Environement and Life Sciences(2010), ANR-11-BINF-0002,IBC,Institut de biologie Computationnelle(2011), ANR-10-LABX-0020,NUMEV,Digital and Hardware Solutions and Modeling for the Environement and Life Sciences(2010), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), and Université de Montpellier (UM)-Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Run Length Encoding (RLE) ,000 Computer science, knowledge, general works ,Aho-Corasick automaton ,Aho-Corasick Tree ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,education ,Burrows Wheeler Transform ,Compression ,Data_CODINGANDINFORMATIONTHEORY ,0102 computer and information sciences ,02 engineering and technology ,XBWT ,[INFO.INFO-DM]Computer Science [cs]/Discrete Mathematics [cs.DM] ,Data Structures ,113 Computer and information sciences ,01 natural sciences ,Algorithm ,2012 ACM Subject Classification Mathematics of computing → Discrete mathematics ,Theory of computation → Randomness, geometry and discrete structures ,Theory of computation → Data structures and algorithms for data management ,010201 computation theory & mathematics ,Computer Science ,0202 electrical engineering, electronic engineering, information engineering ,Indexing ,020201 artificial intelligence & image processing ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,Stringology - Abstract
URN: urn:nbn:de:0030-drops-104955URL: http://drops.dagstuhl.de/opus/volltexte/2019/10495/ISBN ={978-3-95977-103-0},ISSN ={1868-8969},; International audience; The boom of genomic sequencing makes compression of sets of sequences inescapable. This underlies the need for multi-string indexing data structures that helps compressing the data. The most prominent example of such data structures is the Burrows-Wheeler Transform (BWT), a reversible permutation of a text that improves its compressibility. A similar data structure, the eXtended Burrows-Wheeler Transform (XBW), is able to index a tree labelled with alphabet symbols. A link between a multi-string BWT and the Aho-Corasick automaton has already been found and led to a way to build a XBW from a multi-string BWT. We exhibit a stronger link between a multi-string BWT and a XBW by using the order of the concatenation in the multi-string. This bijective link has several applications: first, it allows one to build one data structure from the other; second, it enables one to compute an ordering of the input strings that optimises a Run-Length measure (i.e., the compressibility) of the BWT or of the XBW.
- Published
- 2019
47. Finding All Maximal Perfect Haplotype Blocks in Linear Time
- Author
-
Alanko, Jarno, Bannai, Hideo, Cazaux, Bastien, Peterlongo, Pierre, Stoye, Jens, Alanko, Jarno, Bannai, Hideo, Cazaux, Bastien, Peterlongo, Pierre, and Stoye, Jens
- Abstract
Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Proceedings of BSB 2019) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.
- Published
- 2019
- Full Text
- View/download PDF
48. Finding All Maximal Perfect Haplotype Blocks in Linear Time
- Author
-
Jarno Alanko and Hideo Bannai and Bastien Cazaux and Pierre Peterlongo and Jens Stoye, Alanko, Jarno, Bannai, Hideo, Cazaux, Bastien, Peterlongo, Pierre, Stoye, Jens, Jarno Alanko and Hideo Bannai and Bastien Cazaux and Pierre Peterlongo and Jens Stoye, Alanko, Jarno, Bannai, Hideo, Cazaux, Bastien, Peterlongo, Pierre, and Stoye, Jens
- Abstract
Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Proceedings of BSB 2019) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.
- Published
- 2019
- Full Text
- View/download PDF
49. Linking BWT and XBW via Aho-Corasick Automaton: Applications to Run-Length Encoding
- Author
-
Bastien Cazaux and Eric Rivals, Cazaux, Bastien, Rivals, Eric, Bastien Cazaux and Eric Rivals, Cazaux, Bastien, and Rivals, Eric
- Abstract
The boom of genomic sequencing makes compression of sets of sequences inescapable. This underlies the need for multi-string indexing data structures that helps compressing the data. The most prominent example of such data structures is the Burrows-Wheeler Transform (BWT), a reversible permutation of a text that improves its compressibility. A similar data structure, the eXtended Burrows-Wheeler Transform (XBW), is able to index a tree labelled with alphabet symbols. A link between a multi-string BWT and the Aho-Corasick automaton has already been found and led to a way to build a XBW from a multi-string BWT. We exhibit a stronger link between a multi-string BWT and a XBW by using the order of the concatenation in the multi-string. This bijective link has several applications: first, it allows one to build one data structure from the other; second, it enables one to compute an ordering of the input strings that optimises a Run-Length measure (i.e., the compressibility) of the BWT or of the XBW.
- Published
- 2019
- Full Text
- View/download PDF
50. Superstrings with multiplicities
- Author
-
Rivals, Eric, Cazaux, Bastien, Institut de Biologie Computationnelle (IBC), Université de Montpellier (UM)-Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Méthodes et Algorithmes pour la Bioinformatique (MAB), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), Défi MASTODONS C3G http://www.lirmm.fr/~rivals/C3G/ from CNRS, Qingdao University, Gonzalo Navarro, David Sankoff, Binhai Zhu, ANR-11-BINF-0002,IBC,Institut de biologie Computationnelle(2011), Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), ANR-11-BINF-0002,IBC,Institut de Biologie Computationnelle de Montpellier(2011), Navarro, Gonzalo, Sankoff, David, Zhu, Binhai, and Department of Computer Science
- Subjects
[INFO.INFO-CC]Computer Science [cs]/Computational Complexity [cs.CC] ,APX ,subset system ,ACM: F.: Theory of Computation/F.2: ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Cyclic covers ,02 engineering and technology ,ACM: G.: Mathematics of Computing/G.2: DISCRETE MATHEMATICS ,0202 electrical engineering, electronic engineering, information engineering ,overlap ,algorithm analysis ,Approximation ,060201 languages & linguistics ,000 Computer science, knowledge, general works ,Pattern matching, Approximation ,06 humanities and the arts ,Cyclic cover ,Subset system, String theory ,113 Computer and information sciences ,ACM: F.: Theory of Computation ,Approximation algorithms ,Greedy algorithms ,greedy algorithm ,0602 languages and literature ,Computer Science ,020201 artificial intelligence & image processing ,ACM: G.: Mathematics of Computing/G.1: NUMERICAL ANALYSIS/G.1.2: Approximation ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] - Abstract
A superstring of a set of words P = s1, · · · , sp is a string that contains each word of P as substring. Given P, the well known Shortest Linear Superstring problem (SLS), asks for a shortest superstring of P. In a variant of SLS, called Multi-SLS, each word si comes with an integer m(i), its multiplicity, that sets a constraint on its number of occurrences, and the goal is to find a shortest superstring that contains at least m(i) occurrences of si. Multi-SLS generalizes SLS and is obviously as hard to solve, but it has been studied only in special cases (with words of length 2 or with a fixed number of words). The approximability of Multi-SLS in the general case remains open. Here, we study the approximability of Multi-SLS and that of the companion problem Multi-SCCS, which asks for a shortest cyclic cover instead of shortest superstring. First, we investigate the approximation of a greedy algorithm for maximizing the compression offered by a superstring or by a cyclic cover: the approximation ratio is 1/2 for Multi-SLS and 1 for Multi-SCCS. Then, we exhibit a linear time approximation algorithm, Concat-Greedy, and show it achieves a ratio of 4 regarding the superstring length. This demonstrates that for both measures Multi-SLS belongs to the class of APX problems. © 2018 Yoshifumi Sakai; licensed under Creative Commons License CC-BY.
- Published
- 2018
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.