12 results on '"Thierry Lecroq"'
Search Results
2. Fast string matching for DNA sequences
- Author
-
Thierry Lecroq, Kunsoo Park, Cheol Ryu, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA), School of Computer Science and Engineering [Seoul] (School of CSE), and Seoul National University [Seoul] (SNU)
- Subjects
General Computer Science ,Computer science ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Window (computing) ,0102 computer and information sciences ,02 engineering and technology ,String searching algorithm ,ComputingMethodologies_ARTIFICIALINTELLIGENCE ,01 natural sciences ,Theoretical Computer Science ,010201 computation theory & mathematics ,0202 electrical engineering, electronic engineering, information engineering ,Order (group theory) ,020201 artificial intelligence & image processing ,Algorithm ,ComputingMilieux_MISCELLANEOUS - Abstract
In this paper we propose the Maximal Average Shift (MAS) algorithm that finds a pattern scan order that maximizes the average shift length. We also present two extensions of MAS: one improves the scan speed of MAS by using the scan result of the previous window, and the other improves the running time of MAS by using q-grams. These algorithms show better average performances in scan speed than previous string matching algorithms for DNA sequences.
- Published
- 2020
3. UMI-Gen: A UMI-based read simulator for variant calling evaluation in paired-end sequencing NGS libraries
- Author
-
Caroline Bérard, Thierry Lecroq, Fabrice Jardin, Vincent Sater, Pierre-Julien Viailly, Philippe Ruminy, Élise Prieur-Gaston, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Normandie Université (NU), Génomique et Médecine Personnalisée du Cancer et des Maladies Neuropsychiatriques (GPMCND), Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Normandie Université (NU)-Institut National de la Santé et de la Recherche Médicale (INSERM), CCSD, Accord Elsevier, Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), and Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)
- Subjects
Computer science ,lcsh:Biotechnology ,Biophysics ,Word error rate ,Biochemistry ,DNA sequencing ,chemistry.chemical_compound ,03 medical and health sciences ,0302 clinical medicine ,Structural Biology ,lcsh:TP248.13-248.65 ,Variant calling ,Genetics ,Nucleotide ,Copy-number variation ,MIT License ,ComputingMilieux_MISCELLANEOUS ,Paired-end tag ,Simulation ,[INFO.INFO-BI] Computer Science [cs]/Bioinformatics [q-bio.QM] ,030304 developmental biology ,chemistry.chemical_classification ,0303 health sciences ,Biological data ,Sequence analysis ,UMI ,Pipeline (software) ,Computer Science Applications ,Identifier ,chemistry ,NGS ,030220 oncology & carcinogenesis ,Simulator ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,DNA ,Research Article ,Biotechnology - Abstract
MotivationWith Next Generation Sequencing becoming more affordable every year, NGS technologies asserted themselves as the fastest and most reliable way to detect Single Nucleotide Variants (SNV) and Copy Number Variations (CNV) in cancer patients. These technologies can be used to sequence DNA at very high depths thus allowing to detect abnormalities in tumor cells with very low frequencies. A lot of different variant callers are publicly available and usually do a good job at calling out variants. However, when frequencies begin to drop under 1%, the specificity of these tools suffers greatly as true variants at very low frequencies can be easily confused with sequencing or PCR artifacts. The recent use of Unique Molecular Identifiers (UMI) in NGS experiments offered a way to accurately separate true variants from artifacts. UMI-based variant callers are slowly replacing raw-reads based variant callers as the standard method for an accurate detection of variants at very low frequencies. However, benchmarking done in the tools publication are usually realized on real biological data in which real variants are not known, making it difficult to assess their accuracy.ResultsWe present UMI-Gen, a UMI-based reads simulator for targeted sequencing paired-end data. UMI-Gen generates reference reads covering the targeted regions at a user customizable depth. After that, using a number of control files, it estimates the background error rate at each position and then modifies the generated reads to mimic real biological data. Finally, it will insert real variants in the reads from a list provided by the user.AvailabilityThe entire pipeline is available athttps://gitlab.com/vincent-sater/umigen-masterunder MIT license.Contactvincent.sater@gmail.com
- Published
- 2020
4. FM-index of alignment with gaps
- Author
-
Thierry Lecroq, Joong Chae Na, Hyunjoon Kim, Kunsoo Park, Laurent Mouchard, Seunghwan Min, M. Léonard, Heejin Park, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA), School of Computer Science and Engineering [Seoul] (School of CSE), Seoul National University [Seoul] (SNU), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU), and Lecroq, Thierry
- Subjects
FOS: Computer and information sciences ,0301 basic medicine ,General Computer Science ,Computer science ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,[INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,0102 computer and information sciences ,[INFO] Computer Science [cs] ,01 natural sciences ,Genome ,Pattern search ,Theoretical Computer Science ,law.invention ,03 medical and health sciences ,law ,Computer Science - Data Structures and Algorithms ,Data_FILES ,Data Structures and Algorithms (cs.DS) ,[INFO]Computer Science [cs] ,ComputingMilieux_MISCELLANEOUS ,Suffix array ,Siren (codec) ,030104 developmental biology ,Transformation (function) ,Index (publishing) ,010201 computation theory & mathematics ,Key (cryptography) ,Algorithm ,FM-index - Abstract
Recently, a compressed index for similar strings, called the FM-index of alignment (FMA), has been proposed with the functionalities of pattern search and random access. The FMA is quite efficient in space requirement and pattern search time, but it is applicable only for an alignment of similar strings without gaps. In this paper we propose the FM-index of alignment with gaps, a realistic index for similar strings, which allows gaps in their alignment. For this, we design a new version of the suffix array of alignment by using alignment transformation and a new definition of the alignment-suffix. The new suffix array of alignment enables us to support the LF-mapping and backward search, the key functionalities of the FM-index, regardless of gap existence in the alignment. We experimentally compared our index with RLCSA due to Makinen et al. on 100 genome sequences from the 1000 Genomes Project. The index size of our index is less than one third of that of RLCSA., 15pages
- Published
- 2018
5. Binary block order Rouen Transform
- Author
-
Jacqueline W. Daykin, Martine Léonard, Élise Prieur-Gaston, Arnaud Lefebvre, Richard Groult, Thierry Lecroq, Yannick Guesnet, Department of Computer Science, Royal Holloway [University of London] (RHUL), Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA), Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), and Normandie Université (NU)
- Subjects
General Computer Science ,Burrows–Wheeler transform ,bijective ,suffix array ,Binary number ,suffix-sorting ,0102 computer and information sciences ,02 engineering and technology ,Lyndon word ,01 natural sciences ,Theoretical Computer Science ,law.invention ,Combinatorics ,law ,lexicographic order ,B-word ,0202 electrical engineering, electronic engineering, information engineering ,[INFO]Computer Science [cs] ,word ,Computer Science::Data Structures and Algorithms ,Time complexity ,binary alphabet ,ComputingMilieux_MISCELLANEOUS ,Mathematics ,algorithm ,String (computer science) ,Suffix array ,inverse transform ,Lexicographical order ,data clustering ,linear ,block order ,010201 computation theory & mathematics ,string ,Bijection ,020201 artificial intelligence & image processing ,Burrows-Wheeler Transform ,Computer Science::Formal Languages and Automata Theory ,Word (computer architecture) - Abstract
Novel twin binary Burrows-Wheeler type transforms are introduced.The transforms are defined for Lyndon-like B-words which apply binary block order.We call this approach the B-BWT Rouen Transform.These bijective Rouen Transforms and inverses are computed in linear time.Preliminary experimental results indicate potential value of binary transforms. We introduce bijective Burrows-Wheeler type transforms for binary strings.1 The original method by Burrows and Wheeler 4 is based on lexicographic order for general alphabets, and the transform is defined to be the last column of the ordered BWT matrix. This new approach applies binary block order, B-order, which yields not one, but twin transforms: one based on Lyndon words, the other on a repetition of Lyndon words. These binary B-BWT transforms are constructed here for B-words, analogous structures to Lyndon words. A key computation in the transforms is the application of a linear-time suffix-sorting technique, such as 18,21,22,27, to sort the cyclic rotations of a binary input string into their B-order. Moreover, like the original lexicographic transform, we show that computing the B-BWT inverses is also achieved in linear time by using straightforward combinatorial arguments.
- Published
- 2016
6. FM-index of alignment: A compressed index for similar strings
- Author
-
Heejin Park, Laurent Mouchard, Martine Lonard, Kunsoo Park, Joong Chae Na, Hyunjoon Kim, Thierry Lecroq, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA), School of Computer Science and Engineering [Seoul] (School of CSE), Seoul National University [Seoul] (SNU), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), and Normandie Université (NU)
- Subjects
0301 basic medicine ,Compressed suffix array ,Theoretical computer science ,General Computer Science ,Computer science ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Suffix array ,02 engineering and technology ,Genome ,Pattern search ,Theoretical Computer Science ,law.invention ,03 medical and health sciences ,030104 developmental biology ,Index (publishing) ,law ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,020201 artificial intelligence & image processing ,Algorithm ,ComputingMilieux_MISCELLANEOUS ,FM-index - Abstract
In this paper we propose the FM-index of alignment, a compressed index for similar strings with the functionalities of pattern search and random access. For this, we first design a new and improved version of the suffix array of alignment. The FM-index of alignment is an FM-index of this suffix array of alignment. The FM-index of alignment supports the LF-mapping and backward search, the key functionalities of the FM-index, but the LF-mapping and backward search of our index is significantly more involved than the original FM-index. We implemented the FM-index of alignment and did experiments on 100 genome sequences from the 1000 Genomes Project. The index size of the FM-index of alignment is about a half of that of RLCSA due to Mkinen et al.
- Published
- 2016
7. Linear computation of unbordered conjugate on unordered alphabet
- Author
-
Jean-Pierre Duval, Arnaud Lefebvre, and Thierry Lecroq
- Subjects
Discrete mathematics ,General Computer Science ,Computation ,MathematicsofComputing_NUMERICALANALYSIS ,Theoretical Computer Science ,Combinatorics ,Combinatorics on words ,Bounded function ,Alphabet ,Time complexity ,Computer Science::Formal Languages and Automata Theory ,Word (computer architecture) ,Mathematics ,Conjugate - Abstract
We present an algorithm that, given a word w of length n on an unordered alphabet, computes one of its unbordered conjugates. If such a conjugate does not exist, the algorithm computes one of its conjugates that is a power of an unbordered word. The time complexity of the algorithm is O(n): the number of comparisons between letters of w is bounded by 4n.
- Published
- 2014
8. Fast exact string matching algorithms
- Author
-
Thierry Lecroq, Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU), and Lecroq, Thierry
- Subjects
Theoretical computer science ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Hash function ,String (computer science) ,Commentz-Walter algorithm ,[INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,0102 computer and information sciences ,02 engineering and technology ,String searching algorithm ,Approximate string matching ,01 natural sciences ,Computer Science Applications ,Theoretical Computer Science ,010201 computation theory & mathematics ,Signal Processing ,3-dimensional matching ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,String metric ,Algorithm ,ComputingMilieux_MISCELLANEOUS ,Information Systems ,Mathematics - Abstract
String matching is the problem of finding all the occurrences of a pattern in a text. We propose a very fast new family of string matching algorithms based on hashing q-grams. The new algorithms are the fastest on many cases, in particular, on small size alphabets.
- Published
- 2007
9. On special families of morphisms related to δ-matching and don't care symbols
- Author
-
Wojciech Plandowski, Richard Cole, Thierry Lecroq, Wojciech Rytter, and Costas S. Iliopoulos
- Subjects
Matching (graph theory) ,Existence theorem ,Computer Science Applications ,Theoretical Computer Science ,Combinatorics ,Morphism ,Integer ,Symbol (programming) ,Signal Processing ,Interval (graph theory) ,Pattern matching ,Alphabet ,Information Systems ,Mathematics - Abstract
The δ-matching problem is a special version of approximate pattern-matching, motivated by applications in musical information retrieval, where the alphabet Σ is an interval of integers. We investigate relations between δ-matching and pattern-matching with don't care symbol * (a symbol matching every symbol, including itself). We show that the δ-matching is reducible to k instances of pattern-matching with don't cares. We investigate how the numbers δ and k are related by introducing δ-distinguishing families H of morphisms. The size of H corresponds to k. We show that for minimal families |H| we have |H|=Θ(δ).
- Published
- 2003
10. Compror: On-line lossless data compression with a factor oracle
- Author
-
Thierry Lecroq and Arnaud Lefebvre
- Subjects
Lossless compression ,Computer science ,Computation ,Data compression ratio ,Data_CODINGANDINFORMATIONTHEORY ,Lossy compression ,Computer Science Applications ,Theoretical Computer Science ,Signal Processing ,Compression ratio ,Factor oracle ,Algorithm ,Time complexity ,Information Systems ,Data compression - Abstract
We present in this article a linear time and space data compression method. This method, based on a factor oracle and the computation of the length of repeated suffixes, is easy to implement, fast and gives good compression ratios.
- Published
- 2002
11. A variation on the Boyer-Moore algorithm
- Author
-
Thierry Lecroq
- Subjects
Discrete mathematics ,General Computer Science ,Matching (graph theory) ,Theoretical Computer Science ,law.invention ,Prefix ,law ,Position (vector) ,Suffix ,Boyer–Moore string search algorithm ,Time complexity ,Word (computer architecture) ,Computer Science(all) ,Complement (set theory) ,Mathematics - Abstract
String-matching consists in finding all the occurrences of a word w in a text t. Several algorithms have been found for solving this problem. They are presented by Aho in a recent book [l]. Among these algorithms, the Boyer-Moore approach [S, 1 l] seems to lead to the fastest algorithms for the search phase. Even if the original version of the Bayer-Moore algorithm has a quadratic worst case, its behavior in practice seems to be sublinear. Furthermore, other authors [9,2] have improved this worst-case time complexity for the search phase so that it becomes linear in the length of the text. The best bound for the number of letter comparisons is due to Apostolico and Giancarlo [2] and is 2n-m+ 1, where n is the length of the text and m the length of the word. Another particularity of the Boyer-Moore algorithm is that the study of its complexity is not obvious; see [lo, 73. Basically, the Boyer-Moore algorithm tries to find for a given position in the text the longest suffix of the word which ends at that position. A new approach can possess the ability for a given position in the text to compute the length of the longest prefix of the word which ends at that position. When we know this length, we are able to compute a better shift than the Boyer-Moore approach. In the first version we make a new attempt at matching, forgetting all the previous prefixes matched. This leads to a very simple algorithm but it has a quadratic worst-case running time. In an improved version we memorize the position where the previous longest prefix found ends and we make a new attempt at matching only the number of characters corresponding to the complement of this prefix. We are then able to compute a shift without reading again backwards more than half the characters of the prefix found in the previous attempt. This leads to a linear-time algorithm which scans the text characters at most three times each.
- Published
- 1992
12. Preface
- Author
-
Thierry Lecroq, Jan Bucek, Jorge Bernardino, Anna Esposito, Daniel Cozzolino, Alexander Samardak, Ivan Ganchev, Boaventura DaCosta, Jorge Freire de Sousa, Vijay Kumar Thakur, Elena Magaril, Petr Haušild, and Seppo Virtanen
- Subjects
Computational Theory and Mathematics ,Discrete Mathematics and Combinatorics ,Theoretical Computer Science - Published
- 2014
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.