41 results on '"Élise, Prieur-Gaston"'
Search Results
2. UMI-Gen: A UMI-based read simulator for variant calling evaluation in paired-end sequencing NGS libraries
- Author
-
Vincent Sater, Pierre-Julien Viailly, Thierry Lecroq, Philippe Ruminy, Caroline Bérard, Élise Prieur-Gaston, and Fabrice Jardin
- Subjects
Sequence analysis ,UMI ,Simulator ,Variant calling ,NGS ,Biotechnology ,TP248.13-248.65 - Abstract
Motivation: With Next Generation Sequencing becoming more affordable every year, NGS technologies asserted themselves as the fastest and most reliable way to detect Single Nucleotide Variants (SNV) and Copy Number Variations (CNV) in cancer patients. These technologies can be used to sequence DNA at very high depths thus allowing to detect abnormalities in tumor cells with very low frequencies. Multiple variant callers are publicly available and are usually efficient at calling out variants. However, when frequencies begin to drop under 1%, the specificity of these tools suffers greatly as true variants at very low frequencies can be easily confused with sequencing or PCR artifacts. The recent use of Unique Molecular Identifiers (UMI) in NGS experiments has offered a way to accurately separate true variants from artifacts. UMI-based variant callers are slowly replacing raw-read based variant callers as the standard method for an accurate detection of variants at very low frequencies. However, benchmarking done in the tools publication are usually realized on real biological data in which real variants are not known, making it difficult to assess their accuracy. Results: We present UMI-Gen, a UMI-based read simulator for targeted sequencing paired-end data. UMI-Gen generates reference reads covering the targeted regions at a user customizable depth. After that, using a number of control files, it estimates the background error rate at each position and then modifies the generated reads to mimic real biological data. Finally, it will insert real variants in the reads from a list provided by the user. Availability: The entire pipeline is available at https://gitlab.com/vincent-sater/umigen under MIT license.
- Published
- 2020
- Full Text
- View/download PDF
3. Practical fast on-line exact pattern matching algorithms for highly similar sequences.
- Author
-
Nadia Ben Nsira, Thierry Lecroq, and élise Prieur-Gaston
- Published
- 2018
- Full Text
- View/download PDF
4. Three Strategies for the Dead-Zone String Matching Algorithm.
- Author
-
Jacqueline W. Daykin, Richard Groult, Yannick Guesnet, Thierry Lecroq, Arnaud Lefebvre, Martine Léonard, Laurent Mouchard, élise Prieur-Gaston, and Bruce W. Watson
- Published
- 2018
5. Fast practical online exact single and multiple pattern matching algorithms in highly similar sequences.
- Author
-
Nadia Ben Nsira, Thierry Lecroq, and élise Prieur-Gaston
- Published
- 2019
- Full Text
- View/download PDF
6. A survey of string orderings and their application to the Burrows-Wheeler transform.
- Author
-
Jacqueline W. Daykin, Richard Groult, Yannick Guesnet, Thierry Lecroq, Arnaud Lefebvre, Martine Léonard, and élise Prieur-Gaston
- Published
- 2018
- Full Text
- View/download PDF
7. Online Computation of Abelian Runs.
- Author
-
Gabriele Fici, Thierry Lecroq, Arnaud Lefebvre, and élise Prieur-Gaston
- Published
- 2015
- Full Text
- View/download PDF
8. A note on easy and efficient computation of full abelian periods of a word.
- Author
-
Gabriele Fici, Thierry Lecroq, Arnaud Lefebvre, élise Prieur-Gaston, and William F. Smyth
- Published
- 2016
- Full Text
- View/download PDF
9. Abelian powers and repetitions in Sturmian words.
- Author
-
Gabriele Fici, Alessio Langiu, Thierry Lecroq, Arnaud Lefebvre, Filippo Mignosi, Jarkko Peltomäki, and élise Prieur-Gaston
- Published
- 2016
- Full Text
- View/download PDF
10. Binary block order Rouen Transform.
- Author
-
Jacqueline W. Daykin, Richard Groult, Yannick Guesnet, Thierry Lecroq, Arnaud Lefebvre, Martine Léonard, and élise Prieur-Gaston
- Published
- 2016
- Full Text
- View/download PDF
11. Fast computation of abelian runs.
- Author
-
Gabriele Fici, Tomasz Kociumaka, Thierry Lecroq, Arnaud Lefebvre, and élise Prieur-Gaston
- Published
- 2016
- Full Text
- View/download PDF
12. Abelian Repetitions in Sturmian Words.
- Author
-
Gabriele Fici, Alessio Langiu, Thierry Lecroq, Arnaud Lefebvre, Filippo Mignosi, and élise Prieur-Gaston
- Published
- 2013
- Full Text
- View/download PDF
13. Quasi-linear Time Computation of the Abelian Periods of a Word.
- Author
-
Gabriele Fici, Thierry Lecroq, Arnaud Lefebvre, élise Prieur-Gaston, and William F. Smyth
- Published
- 2012
14. Computing Abelian Periods in Words.
- Author
-
Gabriele Fici, Thierry Lecroq, Arnaud Lefebvre, and élise Prieur-Gaston
- Published
- 2011
15. Efficient pattern matching in degenerate strings with the Burrows-Wheeler transform.
- Author
-
Jacqueline W. Daykin, Richard Groult, Yannick Guesnet, Thierry Lecroq, Arnaud Lefebvre, Martine Léonard, Laurent Mouchard, élise Prieur-Gaston, and Bruce W. Watson
- Published
- 2017
16. Algorithms for computing Abelian periods of words.
- Author
-
Gabriele Fici, Thierry Lecroq, Arnaud Lefebvre, and élise Prieur-Gaston
- Published
- 2014
- Full Text
- View/download PDF
17. UMI-Varcal: A Low-Frequency Variant Caller for UMI-Tagged Paired-End Sequencing Data
- Author
-
Vincent, Sater, Pierre-Julien, Viailly, Thierry, Lecroq, Élise, Prieur-Gaston, Élodie, Bohers, Mathieu, Viennot, Philippe, Ruminy, Hélène, Dauchel, Pierre, Vera, and Fabrice, Jardin
- Subjects
DNA Copy Number Variations ,Computational Biology ,High-Throughput Nucleotide Sequencing ,DNA ,Artifacts - Abstract
The rapid transition from traditional sequencing methods to Next-Generation Sequencing (NGS) has allowed for a faster and more accurate detection of somatic variants (Single-Nucleotide Variant (SNV) and Copy Number Variation (CNV)) in tumor cells. NGS technologies require a succession of steps during which false variants can be silently added at low frequencies. Filtering these artifacts can be a rather difficult task especially when the experiments are designed to look for very low frequency variants. Recently, adding unique molecular barcodes called UMI (Unique Molecular Identifier) to the DNA fragments appears to be a very effective strategy to specifically filter out false variants from the variant calling results (Kukita et al. DNA Res 22(4):269-277, 2015; Newman et al. Nat Biotechnol 34(5):547-555, 2016; Schmitt et al. Proc Natl Acad Sci U S A 109(36):14508-14513). Here, we describe UMI-VarCal (Sater et al. Bioinformatics 36:2718-2724, 2020), which can use the UMI information from UMI-tagged reads to offer a faster and more accurate variant calling analysis.
- Published
- 2022
18. UMI-Gen: A UMI-based read simulator for variant calling evaluation in paired-end sequencing NGS libraries
- Author
-
Caroline Bérard, Thierry Lecroq, Fabrice Jardin, Vincent Sater, Pierre-Julien Viailly, Philippe Ruminy, Élise Prieur-Gaston, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Normandie Université (NU), Génomique et Médecine Personnalisée du Cancer et des Maladies Neuropsychiatriques (GPMCND), Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Normandie Université (NU)-Institut National de la Santé et de la Recherche Médicale (INSERM), CCSD, Accord Elsevier, Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), and Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)
- Subjects
Computer science ,lcsh:Biotechnology ,Biophysics ,Word error rate ,Biochemistry ,DNA sequencing ,chemistry.chemical_compound ,03 medical and health sciences ,0302 clinical medicine ,Structural Biology ,lcsh:TP248.13-248.65 ,Variant calling ,Genetics ,Nucleotide ,Copy-number variation ,MIT License ,ComputingMilieux_MISCELLANEOUS ,Paired-end tag ,Simulation ,[INFO.INFO-BI] Computer Science [cs]/Bioinformatics [q-bio.QM] ,030304 developmental biology ,chemistry.chemical_classification ,0303 health sciences ,Biological data ,Sequence analysis ,UMI ,Pipeline (software) ,Computer Science Applications ,Identifier ,chemistry ,NGS ,030220 oncology & carcinogenesis ,Simulator ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,DNA ,Research Article ,Biotechnology - Abstract
MotivationWith Next Generation Sequencing becoming more affordable every year, NGS technologies asserted themselves as the fastest and most reliable way to detect Single Nucleotide Variants (SNV) and Copy Number Variations (CNV) in cancer patients. These technologies can be used to sequence DNA at very high depths thus allowing to detect abnormalities in tumor cells with very low frequencies. A lot of different variant callers are publicly available and usually do a good job at calling out variants. However, when frequencies begin to drop under 1%, the specificity of these tools suffers greatly as true variants at very low frequencies can be easily confused with sequencing or PCR artifacts. The recent use of Unique Molecular Identifiers (UMI) in NGS experiments offered a way to accurately separate true variants from artifacts. UMI-based variant callers are slowly replacing raw-reads based variant callers as the standard method for an accurate detection of variants at very low frequencies. However, benchmarking done in the tools publication are usually realized on real biological data in which real variants are not known, making it difficult to assess their accuracy.ResultsWe present UMI-Gen, a UMI-based reads simulator for targeted sequencing paired-end data. UMI-Gen generates reference reads covering the targeted regions at a user customizable depth. After that, using a number of control files, it estimates the background error rate at each position and then modifies the generated reads to mimic real biological data. Finally, it will insert real variants in the reads from a list provided by the user.AvailabilityThe entire pipeline is available athttps://gitlab.com/vincent-sater/umigen-masterunder MIT license.Contactvincent.sater@gmail.com
- Published
- 2020
- Full Text
- View/download PDF
19. UMI-Varcal: A Low-Frequency Variant Caller for UMI-Tagged Paired-End Sequencing Data
- Author
-
Vincent Sater, Pierre-Julien Viailly, Thierry Lecroq, Élise Prieur-Gaston, Élodie Bohers, Mathieu Viennot, Philippe Ruminy, Hélène Dauchel, Pierre Vera, and Fabrice Jardin
- Published
- 2022
- Full Text
- View/download PDF
20. A Note on Easy and Efficient Computation of Full Abelian Periods of a Word.
- Author
-
Gabriele Fici, Thierry Lecroq, Arnaud Lefebvre, élise Prieur-Gaston, and William F. Smyth
- Published
- 2015
21. Fast Computation of Abelian Runs.
- Author
-
Gabriele Fici, Tomasz Kociumaka, Thierry Lecroq, Arnaud Lefebvre, and élise Prieur-Gaston
- Published
- 2015
22. Abelian Powers and Repetitions in Sturmian Words.
- Author
-
Gabriele Fici, Alessio Langiu, Thierry Lecroq, Arnaud Lefebvre, Filippo Mignosi, and élise Prieur-Gaston
- Published
- 2015
23. Algorithms for Computing Abelian Periods of Words
- Author
-
Gabriele Fici, Thierry Lecroq, Arnaud Lefebvre, and élise Prieur-Gaston
- Published
- 2012
24. Abelian Repetitions in Sturmian Words
- Author
-
Gabriele Fici, Alessio Langiu, Thierry Lecroq, Arnaud Lefebvre, Filippo Mignosi, and élise Prieur-Gaston
- Published
- 2012
25. UMI-VarCal: a new UMI-based variant caller that efficiently improves low-frequency variant detection in paired-end sequencing NGS libraries
- Author
-
Philippe Ruminy, Fabrice Jardin, Hélène Dauchel, Élise Prieur-Gaston, Vincent Sater, Pierre Vera, Pierre-Julien Viailly, Mathieu Viennot, Elodie Bohers, Thierry Lecroq, Génomique et Médecine Personnalisée du Cancer et des Maladies Neuropsychiatriques (GPMCND), Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Normandie Université (NU)-Institut National de la Santé et de la Recherche Médicale (INSERM), Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Normandie Université (NU), and Equipe Quantification en Imagerie Fonctionnelle (QuantIF-LITIS)
- Subjects
Statistics and Probability ,Computer science ,Pipeline (computing) ,computer.software_genre ,Biochemistry ,Polymerase Chain Reaction ,law.invention ,03 medical and health sciences ,chemistry.chemical_compound ,0302 clinical medicine ,law ,Molecular Biology ,Polymerase chain reaction ,Paired-end tag ,ComputingMilieux_MISCELLANEOUS ,030304 developmental biology ,0303 health sciences ,High-Throughput Nucleotide Sequencing ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,chemistry ,Filter (video) ,030220 oncology & carcinogenesis ,Data mining ,Noise (video) ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,computer ,DNA ,Algorithms - Abstract
Motivation Next-generation sequencing has become the go-to standard method for the detection of single-nucleotide variants in tumor cells. The use of such technologies requires a PCR amplification step and a sequencing step, steps in which artifacts are introduced at very low frequencies. These artifacts are often confused with true low-frequency variants that can be found in tumor cells and cell-free DNA. The recent use of unique molecular identifiers (UMI) in targeted sequencing protocols has offered a trustworthy approach to filter out artefactual variants and accurately call low-frequency variants. However, the integration of UMI analysis in the variant calling process led to developing tools that are significantly slower and more memory consuming than raw-reads-based variant callers. Results We present UMI-VarCal, a UMI-based variant caller for targeted sequencing data with better sensitivity compared to other variant callers. Being developed with performance in mind, UMI-VarCal stands out from the crowd by being one of the few variant callers that do not rely on SAMtools to do their pileup. Instead, at its core runs an innovative homemade pileup algorithm specifically designed to treat the UMI tags in the reads. After the pileup, a Poisson statistical test is applied at every position to determine if the frequency of the variant is significantly higher than the background error noise. Finally, an analysis of UMI tags is performed, a strand bias and a homopolymer length filter are applied to achieve better accuracy. We illustrate the results obtained using UMI-VarCal through the sequencing of tumor samples and we show how UMI-VarCal is both faster and more sensitive than other publicly available solutions. Availability and implementation The entire pipeline is available at https://gitlab.com/vincent-sater/umi-varcal-master under MIT license. Supplementary information Supplementary data are available at Bioinformatics online.
- Published
- 2019
- Full Text
- View/download PDF
26. UMI-Gen: a UMI-based reads simulator for variant calling evaluation in paired-end sequencing NGS libraries
- Author
-
Élise Prieur-Gaston, Fabrice Jardin, Hélène Dauchel, Thierry Lecroq, Vincent Sater, Pierre-Julien Viailly, Elodie Bohers, Pierre Vera, Philippe Ruminy, and Mathieu Viennot
- Subjects
chemistry.chemical_classification ,0303 health sciences ,Computer science ,business.industry ,Pattern recognition ,Filter (signal processing) ,DNA sequencing ,law.invention ,03 medical and health sciences ,chemistry.chemical_compound ,0302 clinical medicine ,chemistry ,law ,030220 oncology & carcinogenesis ,Nucleotide ,Noise (video) ,Artificial intelligence ,business ,Polymerase chain reaction ,Paired-end tag ,DNA ,030304 developmental biology - Abstract
Next Generation Sequencing (NGS) has become the go-to standard method for the detection of Single Nucleotide Variants (SNV) in tumor cells. The use of such technologies requires a PCR amplification step and a sequencing step, steps in which artifacts are introduced at very low frequencies. These artifacts are often confused with true low-frequency variants that can be found in tumor cells and cell-free DNA. The recent use of Unique Molecular Identifiers (UMI) in targeted sequencing protocols has offered a trustworthy approach to filter out artifactual variants and accurately call low frequency variants. However, the integration of UMI analysis in the variant calling process led to developing tools that are significantly slower and more memory consuming than raw-reads-based variant callers. We present UMI-VarCal, a UMI-based variant caller for targeted sequencing data with better sensitivity compared to other variant callers. Being developed with performance in mind, UMI-VarCal stands out from the crowd by being one of the few variant callers that don9t rely on SAMtools to do their pileup. Instead, at its core runs an innovative homemade pileup algorithm specifically designed to treat the UMI tags in the reads. After the pileup, a Poisson statistical test is applied at every position to determine if the frequency of the variant is significantly higher than the background error noise. Finally, an analysis of UMI tags is performed, a strand bias and a homopolymer length filter are applied to achieve better accuracy. We illustrate the results obtained using UMI-VarCal through the sequencing of tumor samples and we show how UMI-VarCal is both faster and more sensitive than other publicly available solutions.
- Published
- 2019
- Full Text
- View/download PDF
27. Efficient pattern matching in degenerate strings with the Burrows–Wheeler transform
- Author
-
Thierry Lecroq, Yannick Guesnet, Jacqueline W. Daykin, Arnaud Lefebvre, Martine Léonard, Laurent Mouchard, Bruce W. Watson, Élise Prieur-Gaston, Richard Groult, Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV), Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), and Normandie Université (NU)
- Subjects
Burrows–Wheeler transform ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,0102 computer and information sciences ,02 engineering and technology ,01 natural sciences ,Theoretical Computer Science ,Combinatorics ,0202 electrical engineering, electronic engineering, information engineering ,Pattern matching ,Time complexity ,Mathematics ,Discrete mathematics ,Sequence ,String (computer science) ,Degenerate energy levels ,String ,Computer Science Applications ,Algorithm ,010201 computation theory & mathematics ,Signal Processing ,020201 artificial intelligence & image processing ,Degenerate ,Alphabet ,Constant (mathematics) ,Information Systems - Abstract
A degenerate or indeterminate string on an alphabet Σ is a sequence of non-empty subsets of Σ. Given a degenerate string t of length n and its Burrows–Wheeler transform we present a new method for searching for a degenerate pattern of length m in t running in O ( m n ) time on a constant size alphabet Σ. Furthermore, it is a hybrid pattern matching technique that works on both regular and degenerate strings. A degenerate string is said to be conservative if its number of non-solid letters is upper-bounded by a fixed positive constant q; in this case we show that the search time complexity is O ( q m 2 ) for counting the number of occurrences and O ( q m 2 + occ ) for reporting the found occurrences where occ is the number of occurrences of the pattern in t. Experimental results show that our method performs well in practice.
- Published
- 2019
- Full Text
- View/download PDF
28. Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text.
- Author
-
Antonio Jimeno-Yepes, élise Prieur-Gaston, and Aurélie Névéol
- Published
- 2013
- Full Text
- View/download PDF
29. EVA: Exome Variation Analyzer, an efficient and versatile tool for filtering strategies in medical genomics.
- Author
-
Sophie Coutant, Chloé Cabot, Arnaud Lefebvre, Martine Léonard, élise Prieur-Gaston, Dominique Campion, Thierry Lecroq, and Hélène Dauchel
- Published
- 2012
- Full Text
- View/download PDF
30. Matching health information seekers' queries to medical terms.
- Author
-
Lina Fatima Soualmia, élise Prieur-Gaston, Zied Moalla, Thierry Lecroq, and Stéfan Jacques Darmoni
- Published
- 2012
- Full Text
- View/download PDF
31. Abelian Powers and Repetitions in Sturmian Words
- Author
-
Alessio Langiu, Élise Prieur-Gaston, Gabriele Fici, Filippo Mignosi, Thierry Lecroq, Jarkko Peltomäki, Arnaud Lefebvre, Università degli studi di Palermo - University of Palermo, Department of Informatics [King's College London], King‘s College London, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Normandie Université (NU), Dipartimento di Informatica [Italy] (DI), Università degli Studi dell'Aquila (UNIVAQ), Fici, G., Langiu, A., Lecroq, T., Lefebvre, A., Mignosi, F., Peltomäki, J., Prieur-Gaston, É., Dipartimento di Energia, ingegneria dell'Informazione e modelli Matematici [Palermo] (DEIM), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), and Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)
- Subjects
FOS: Computer and information sciences ,Fibonacci number ,General Computer Science ,Discrete Mathematics (cs.DM) ,Formal Languages and Automata Theory (cs.FL) ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Computer Science - Formal Languages and Automata Theory ,0102 computer and information sciences ,01 natural sciences ,Theoretical Computer Science ,Combinatorics ,FOS: Mathematics ,Mathematics - Combinatorics ,[INFO]Computer Science [cs] ,Number Theory (math.NT) ,0101 mathematics ,Abelian group ,Continued fraction ,Fibonacci word ,ComputingMilieux_MISCELLANEOUS ,Quotient ,Mathematics ,Mathematics - Number Theory ,ta111 ,010102 general mathematics ,Computer Science (all) ,Sturmian word ,Abelian period ,Abelian power ,Critical exponent ,Lagrange constant ,010201 computation theory & mathematics ,Bounded function ,Exponent ,Combinatorics (math.CO) ,Computer Science::Formal Languages and Automata Theory ,Computer Science - Discrete Mathematics - Abstract
Richomme, Saari and Zamboni (J. Lond. Math. Soc. 83: 79-95, 2011) proved that at every position of a Sturmian word starts an abelian power of exponent $k$ for every $k > 0$. We improve on this result by studying the maximum exponents of abelian powers and abelian repetitions (an abelian repetition is an analogue of a fractional power) in Sturmian words. We give a formula for computing the maximum exponent of an abelian power of abelian period $m$ starting at a given position in any Sturmian word of rotation angle $\alpha$. vAs an analogue of the critical exponent, we introduce the abelian critical exponent $A(s_\alpha)$ of a Sturmian word $s_\alpha$ of angle $\alpha$ as the quantity $A(s_\alpha) = limsup\ k_{m}/m=limsup\ k'_{m}/m$, where $k_{m}$ (resp. $k'_{m}$) denotes the maximum exponent of an abelian power (resp.~of an abelian repetition) of abelian period $m$ (the superior limits coincide for Sturmian words). We show that $A(s_\alpha)$ equals the Lagrange constant of the number $\alpha$. This yields a formula for computing $A(s_\alpha)$ in terms of the partial quotients of the continued fraction expansion of $\alpha$. Using this formula, we prove that $A(s_\alpha) \geq \sqrt{5}$ and that the equality holds for the Fibonacci word. We further prove that $A(s_\alpha)$ is finite if and only if $\alpha$ has bounded partial quotients, that is, if and only if $s_{\alpha}$ is $\beta$-power-free for some real number $\beta$. Concerning the infinite Fibonacci word, we prove that: i) The longest prefix that is an abelian repetition of period $F_j$, $j>1$, has length $F_j( F_{j+1}+F_{j-1} +1)-2$ if $j$ is even or $F_j( F_{j+1}+F_{j-1} )-2$ if $j$ is odd, where $F_{j}$ is the $j$th Fibonacci number; ii) The minimum abelian period of any factor is a Fibonacci number. Further, we derive a formula for the minimum abelian periods of the finite Fibonacci words, Comment: To appear in Theoretical Computer Science
- Published
- 2018
- Full Text
- View/download PDF
32. Practical fast on-line exact pattern matching algorithms for highly similar sequences
- Author
-
Élise Prieur-Gaston, Nadia Ben Nsira, Thierry Lecroq, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), and Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)
- Subjects
0301 basic medicine ,Set (abstract data type) ,03 medical and health sciences ,030104 developmental biology ,Efficient algorithm ,Computer science ,Line (geometry) ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Window (computing) ,Pattern matching ,Variation (game tree) ,Algorithm ,ComputingMilieux_MISCELLANEOUS - Abstract
With the advent of high-throughput sequencing technologies there are more and more genomic sequences of individuals of the same species available. These sequences only differ by a very small amount of variations. There is thus a strong need for efficient algorithms for performing fast pattern matching in such specific sets of sequences. In this paper we propose efficient practical algorithms that solve on-line exact pattern matching problem in a set of highly similar DNA sequences. We first present a method for exact single pattern matching when $k$ variations are allowed in a window which size is equal to the pattern length. We then propose an algorithm for exact multiple pattern matching when only one variation is allowed in a window which size is equal to the length of the longest pattern. Experimental results show that our algorithms, though not optimal in the worst case, have good performances in practice.
- Published
- 2018
33. Three Strategies for the Dead-Zone String Matching Algorithm
- Author
-
Jacqueline Daykin, Richard Groult, Yannick Guesnet, Thierry Lecroq, Arnaud Lefebvre, Martine Léonard, Laurent Mouchard, Élise Prieur-Gaston, Bruce Watson, Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV), Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Normandie Université (NU), Jan Holub and Jan Ž\vdárek, and Lecroq, Thierry
- Subjects
[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,[INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,ComputingMilieux_MISCELLANEOUS - Abstract
International audience
- Published
- 2017
34. Efficient pattern matching in degenerate strings with the Burrows-Wheeler transform
- Author
-
Jacqueline Daykin, Richard Groult, Yannick Guesnet, Thierry Lecroq, Arnaud Lefebvre, Martine Léonard, Laurent Mouchard, Élise Prieur-Gaston, Bruce Watson, Modélisation, Information et Systèmes - UR UPJV 4290 (MIS), Université de Picardie Jules Verne (UPJV), Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Normandie Université (NU), and Lecroq, Thierry
- Subjects
FOS: Computer and information sciences ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Computer Science - Data Structures and Algorithms ,[INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Data Structures and Algorithms (cs.DS) ,[INFO]Computer Science [cs] ,[INFO] Computer Science [cs] - Abstract
A degenerate or indeterminate string on an alphabet $\Sigma$ is a sequence of non-empty subsets of $\Sigma$. Given a degenerate string $t$ of length $n$, we present a new method based on the Burrows--Wheeler transform for searching for a degenerate pattern of length $m$ in $t$ running in $O(mn)$ time on a constant size alphabet $\Sigma$. Furthermore, it is a hybrid pattern-matching technique that works on both regular and degenerate strings. A degenerate string is said to be conservative if its number of non-solid letters is upper-bounded by a fixed positive constant $q$; in this case we show that the search complexity time is $O(qm^2)$. Experimental results show that our method performs well in practice., Comment: 7 pages, 1 figure
- Published
- 2017
35. Fast practical online exact single and multiple pattern matching algorithms in highly similar sequences
- Author
-
Thierry Lecroq, Nadia Ben Nsira, Élise Prieur-Gaston, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), and Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)
- Subjects
Computer science ,Bioinformatics ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,0206 medical engineering ,Variation (game tree) ,String searching algorithm ,DNA sequences ,02 engineering and technology ,Library and Information Sciences ,General Biochemistry, Genetics and Molecular Biology ,Set (abstract data type) ,Computational biology ,Landau-Vishkin algorithm ,Pattern matching ,Genomic sequences ,Similar sequences ,Efficient algorithm ,Window (computing) ,Algorithm design ,Aho-Corasick algorithm ,Aho–Corasick string matching algorithm ,String matching ,Algorithm ,020602 bioinformatics ,Information Systems - Abstract
International audience; With the advent of high-throughput sequencing technologies there are more and more genomic sequences of individuals of the same species available. These sequences only differ by a very small amount of variations. There is thus a strong need for efficient algorithms for performing fast pattern matching in such specific sets of sequences. In this paper, we propose efficient practical algorithms that solve on-line exact pattern matching problem in a set of highly similar DNA sequences. We first present a method for exact single pattern matching when k variations are allowed in a window which size is equal to the pattern length. We then propose an algorithm for exact multiple pattern matching when only one variation is allowed in a window which size is equal to the length of the longest pattern. Experimental results show that our algorithms, though not optimal in the worst case, have good performances in practice.
- Published
- 2019
- Full Text
- View/download PDF
36. A note on easy and efficient computation of full abelian periods of a word
- Author
-
Arnaud Lefebvre, William F. Smyth, Thierry Lecroq, Gabriele Fici, Élise Prieur-Gaston, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Normandie Université (NU), McMaster University [Hamilton, Ontario], Università degli studi di Palermo - University of Palermo, Department of Computing and Software (McMaster University), Fici, G., Lecroq, T., Lefebvre, A., Prieur-Gaston, É., Smyth, W., and Lecroq, Thierry
- Subjects
FOS: Computer and information sciences ,Discrete Mathematics (cs.DM) ,Formal Languages and Automata Theory (cs.FL) ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,[INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Elementary abelian group ,Computer Science - Formal Languages and Automata Theory ,0102 computer and information sciences ,02 engineering and technology ,[INFO] Computer Science [cs] ,01 natural sciences ,Rank of an abelian group ,Combinatorics ,Simple (abstract algebra) ,Computer Science - Data Structures and Algorithms ,0202 electrical engineering, electronic engineering, information engineering ,Discrete Mathematics and Combinatorics ,Data Structures and Algorithms (cs.DS) ,[INFO]Computer Science [cs] ,Abelian group ,Hidden subgroup problem ,Discrete Mathematics and Combinatoric ,ComputingMilieux_MISCELLANEOUS ,Mathematics ,Combinatorics on word ,Discrete mathematics ,Applied Mathematics ,020206 networking & telecommunications ,Abelian period ,Text algorithm ,Weak repetition ,Free abelian group ,Abelian power ,Combinatorics on words ,Design of algorithm ,010201 computation theory & mathematics ,Word (computer architecture) ,Computer Science::Formal Languages and Automata Theory ,Computer Science - Discrete Mathematics - Abstract
Constantinescu and Ilie (Bulletin of the EATCS 89, 167-170, 2006) introduced the idea of an Abelian period with head and tail of a finite word. An Abelian period is called full if both the head and the tail are empty. We present a simple and easy-to-implement $O(n\log\log n)$-time algorithm for computing all the full Abelian periods of a word of length $n$ over a constant-size alphabet. Experiments show that our algorithm significantly outperforms the $O(n)$ algorithm proposed by Kociumaka et al. (Proc. of STACS, 245-256, 2013) for the same problem., Accepted for publication in Discrete Applied Mathematics
- Published
- 2016
- Full Text
- View/download PDF
37. Online Computation of Abelian Runs
- Author
-
Arnaud Lefebvre, Thierry Lecroq, Gabriele Fici, Élise Prieur-Gaston, Università degli studi di Palermo - University of Palermo, Equipe Traitement de l'information en Biologie Santé (TIBS - LITIS), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Normandie Université (NU), Adrian-Horia Dediu, AH, Formenti, E, Vide, CM, Truthe, B, Fici, G., Lecroq, T., Lefebvre, A., Prieur-Gaston, É., and Lecroq, Thierry
- Subjects
FOS: Computer and information sciences ,Formal Languages and Automata Theory (cs.FL) ,Abelian run ,[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,[INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] ,Computer Science - Formal Languages and Automata Theory ,0102 computer and information sciences ,02 engineering and technology ,[INFO] Computer Science [cs] ,01 natural sciences ,Online computation ,Theoretical Computer Science ,Combinatorics ,Computer Science - Data Structures and Algorithms ,0202 electrical engineering, electronic engineering, information engineering ,Data Structures and Algorithms (cs.DS) ,[INFO]Computer Science [cs] ,Abelian group ,ComputingMilieux_MISCELLANEOUS ,Mathematics ,Combinatorics on word ,Discrete mathematics ,Computer Science (all) ,020206 networking & telecommunications ,Abelian period ,Text algorithm ,16. Peace & justice ,Substring ,Combinatorics on words ,010201 computation theory & mathematics ,Word (group theory) ,Computer Science::Formal Languages and Automata Theory - Abstract
Given a word $w$ and a Parikh vector $\mathcal{P}$, an abelian run of period $\mathcal{P}$ in $w$ is a maximal occurrence of a substring of $w$ having abelian period $\mathcal{P}$. We give an algorithm that finds all the abelian runs of period $\mathcal{P}$ in a word of length $n$ in time $O(n\times |\mathcal{P}|)$ and space $O(\sigma+|\mathcal{P}|)$., Comment: To appear in the Proceedings of LATA 2015
- Published
- 2015
38. Abelian Repetitions in Sturmian Words
- Author
-
Élise Prieur-Gaston, Thierry Lecroq, Arnaud Lefebvre, Gabriele Fici, Alessio Langiu, Filippo Mignosi, Fici, G, Langiu, A, Lecroq, T, Lefebvre, A, Mignosi, F, and Prieur-Gaston, E
- Subjects
FOS: Computer and information sciences ,Fibonacci number ,Discrete Mathematics (cs.DM) ,Formal Languages and Automata Theory (cs.FL) ,Computer Science - Formal Languages and Automata Theory ,G.2.1 ,68R15 ,FOS: Mathematics ,Combinatorics on words, Sturmian word ,Mathematics - Combinatorics ,Abelian group ,Fibonacci word ,Mathematics ,Discrete mathematics ,Mathematics::Combinatorics ,Sturmian word ,Combinatorics on words ,Number theory ,F.2.2 ,F.4.3 ,Exponent ,Combinatorics (math.CO) ,Word (group theory) ,Computer Science::Formal Languages and Automata Theory ,Computer Science - Discrete Mathematics - Abstract
We investigate abelian repetitions in Sturmian words. We exploit a bijection between factors of Sturmian words and subintervals of the unitary segment that allows us to study the periods of abelian repetitions by using classical results of elementary Number Theory. We prove that in any Sturmian word the superior limit of the ratio between the maximal exponent of an abelian repetition of period $m$ and $m$ is a number $\geq\sqrt{5}$, and the equality holds for the Fibonacci infinite word. We further prove that the longest prefix of the Fibonacci infinite word that is an abelian repetition of period $F_j$, $j>1$, has length $F_j(F_{j+1}+F_{j-1} +1)-2$ if $j$ is even or $F_j(F_{j+1}+F_{j-1})-2$ if $j$ is odd. This allows us to give an exact formula for the smallest abelian periods of the Fibonacci finite words. More precisely, we prove that for $j\geq 3$, the Fibonacci word $f_j$ has abelian period equal to $F_n$, where $n = \lfloor{j/2}\rfloor$ if $j = 0, 1, 2\mod{4}$, or $n = 1 + \lfloor{j/2}\rfloor$ if $ j = 3\mod{4}$., Accepted to DLT 2013
- Published
- 2013
39. Correction orthographique de requêtes: L’apport des distances de Levenshtein et Stoilos
- Author
-
Élise Prieur-Gaston, Stéfan Jacques Darmoni, Zied Moalla, and Lina Fatima Soualmia
- Subjects
Matching (graph theory) ,Computer science ,business.industry ,String (computer science) ,computer.software_genre ,Term (time) ,String distance ,Index (publishing) ,Normalized edit distance ,Metric (mathematics) ,Edit distance ,Artificial intelligence ,business ,computer ,Natural language processing - Abstract
Background: Medical text repositories not only constitute a significant amount of data but also represent an interesting scientific test bed for those willing to apply natural language processing to information retrieval. In order to improve retrieval performance of the Catalogue and Index of Health Resources in French (CISMeF) and its search tool Doc’CISMeF, we tested a new method to correct misspellings of the queries written by the users. Methods: In addition to exact phonetic term matching, we tested two approximate string comparators. The approximate comparators are the string distance metric of Stoilos and the Levenshtein edit distance. We also calculated the results of the two-combined algorithm to examine whether it improves misspelling correction of the queries. Results: At a threshold comparator score of 0.2, the normalized Levenshtein algorithm achieved the highest recall of 76% but the highest precision 94% is achieved by combining the distances of Levenshtein and Stoilos. Conclusion: Although the well-known good performance of the normalized edit distance of Levenshtein, we have demonstrated in this paper that its combination with the Stoilos algorithm improves the results for misspelling correction.
- Published
- 2011
- Full Text
- View/download PDF
40. Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text
- Author
-
Élise Prieur-Gaston, Aurélie Névéol, Antonio Jimeno Yepes, National Library of Medicine (NLM), National Institutes of Health [Bethesda] (NIH)-National Center for Biotechnology Information (NCBI), National ICT Australia [Sydney] (NICTA), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), Normandie Université (NU), Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), and Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)
- Subjects
Statistical machine translation ,020205 medical informatics ,Machine translation ,Multilingual corpus generation ,Computer science ,media_common.quotation_subject ,MEDLINE ,Automatic translation ,02 engineering and technology ,computer.software_genre ,Biochemistry ,Domain (software engineering) ,Structural Biology ,0202 electrical engineering, electronic engineering, information engineering ,Quality (business) ,Official language ,Molecular Biology ,media_common ,Biomedical domain ,Publishing ,Information retrieval ,Models, Statistical ,business.industry ,Applied Mathematics ,Statistical model ,Linguistics ,Translating ,Computer Science Applications ,Biomedical text ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,020201 artificial intelligence & image processing ,Artificial intelligence ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,business ,computer ,Natural language processing ,Research Article - Abstract
International audience; Background: Most of the institutional and research information in the biomedical domain is available in the formof English text. Even in countries where English is an official language, such as the United States, language can be abarrier for accessing biomedical information for non-native speakers. Recent progress in machine translationsuggests that this technique could help make English texts accessible to speakers of other languages. However, thelack of adequate specialized corpora needed to train statistical models currently limits the quality of automatictranslations in the biomedical domain.Results: We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain,using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINEand abstract text automatically retrieved from journal websites, which substantially extends the corpora used inprevious work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish)we use the Moses package to train a statistical machine translation model that outperforms previous models forautomatic translation of biomedical text.Conclusions: We have built translation data sets in the biomedical domain that can easily be extended to otherlanguages available in MEDLINE. These sets can successfully be applied to train statistical machine translationmodels. While further progress should be made by incorporating out-of-domain corpora and domain-specificlexicons, we believe that this work improves the automatic translation of biomedical texts.
- Published
- 2013
- Full Text
- View/download PDF
41. Matching health information seekers' queries to medical terms
- Author
-
Thierry Lecroq, Zied Moalla, Stéfan Jacques Darmoni, Lina Fatima Soualmia, Élise Prieur-Gaston, Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Université Le Havre Normandie (ULH), and Normandie Université (NU)
- Subjects
Matching (statistics) ,020205 medical informatics ,Computer science ,Score ,Information Storage and Retrieval ,02 engineering and technology ,Query language ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Query expansion ,Structural Biology ,Online search ,Controlled vocabulary ,0202 electrical engineering, electronic engineering, information engineering ,Humans ,lcsh:QH301-705.5 ,Molecular Biology ,Language ,Internet ,Information retrieval ,Applied Mathematics ,Research ,String (computer science) ,3. Good health ,Computer Science Applications ,lcsh:Biology (General) ,Vocabulary, Controlled ,[INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR] ,lcsh:R858-859.7 ,020201 artificial intelligence & image processing ,Edit distance ,Algorithms ,Medical Informatics ,Coding (social sciences) - Abstract
Background The Internet is a major source of health information but most seekers are not familiar with medical vocabularies. Hence, their searches fail due to bad query formulation. Several methods have been proposed to improve information retrieval: query expansion, syntactic and semantic techniques or knowledge-based methods. However, it would be useful to clean those queries which are misspelled. In this paper, we propose a simple yet efficient method in order to correct misspellings of queries submitted by health information seekers to a medical online search tool. Methods In addition to query normalizations and exact phonetic term matching, we tested two approximate string comparators: the similarity score function of Stoilos and the normalized Levenshtein edit distance. We propose here to combine them to increase the number of matched medical terms in French. We first took a sample of query logs to determine the thresholds and processing times. In the second run, at a greater scale we tested different combinations of query normalizations before or after misspelling correction with the retained thresholds in the first run. Results According to the total number of suggestions (around 163, the number of the first sample of queries), at a threshold comparator score of 0.3, the normalized Levenshtein edit distance gave the highest F-Measure (88.15%) and at a threshold comparator score of 0.7, the Stoilos function gave the highest F-Measure (84.31%). By combining Levenshtein and Stoilos, the highest F-Measure (80.28%) is obtained with 0.2 and 0.7 thresholds respectively. However, queries are composed by several terms that may be combination of medical terms. The process of query normalization and segmentation is thus required. The highest F-Measure (64.18%) is obtained when this process is realized before spelling-correction. Conclusions Despite the widely known high performance of the normalized edit distance of Levenshtein, we show in this paper that its combination with the Stoilos algorithm improved the results for misspelling correction of user queries. Accuracy is improved by combining spelling, phoneme-based information and string normalizations and segmentations into medical terms. These encouraging results have enabled the integration of this method into two projects funded by the French National Research Agency-Technologies for Health Care. The first aims to facilitate the coding process of clinical free texts contained in Electronic Health Records and discharge summaries, whereas the second aims at improving information retrieval through Electronic Health Records.
- Published
- 2012
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.