179 results on '"Text indexing"'
Search Results
2. Logarithmic-Time Internal Pattern Matching Queries in Compressed and Dynamic Texts
- Author
-
Duyster, Anouk, Kociumaka, Tomasz, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Lipták, Zsuzsanna, editor, Moura, Edleno, editor, Figueroa, Karina, editor, and Baeza-Yates, Ricardo, editor
- Published
- 2025
- Full Text
- View/download PDF
3. Text Indexing for Faster Gapped Pattern Matching.
- Author
-
Hossen, Md Helal, Gibney, Daniel, and Thankachan, Sharma V.
- Subjects
- *
TEXT mining , *HARDNESS , *ALGORITHMS , *INDEXING , *COMPUTATIONAL biology - Abstract
We revisit the following version of the Gapped String Indexing problem, where the goal is to preprocess a text T [ 1.. n ] to enable efficient reporting of all occ occurrences of a gapped pattern P = P 1 [ α.. β ] P 2 in T. An occurrence of P in T is defined as a pair (i , j) where substrings T [ i.. i + | P 1 |) and T [ j.. j + | P 2 |) match P 1 and P 2 , respectively, with a gap j − (i + | P 1 |) lying within the interval [ α.. β ] . This problem has significant applications in computational biology and text mining. A hardness result on this problem suggests that any index with polylogarithmic query time must occupy near quadratic space. In a recent study [STACS 2024], Bille et al. presented a sub-quadratic space index using space O ˜ (n 2 − δ / 3) , where 0 ≤ δ ≤ 1 is a parameter fixed at the time of index construction. Its query time is O ˜ (| P 1 | + | P 2 | + n δ · (1 + occ)) , which is sub-linear per occurrence when δ < 1 . We show how to achieve a gap-sensitive query time of O ˜ (| P 1 | + | P 2 | + n δ · (1 + occ 1 − δ) + ∑ g ∈ [ α.. β ] occ g · g δ) using the same space, where occ g denotes the number of occurrences with gap g. This is faster when there are many occurrences with small gaps. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
4. Text Indexing
- Author
-
Jo, Taeho, Kacprzyk, Janusz, Series Editor, and Jo, Taeho
- Published
- 2024
- Full Text
- View/download PDF
5. Near-Optimal Search Time in δ-Optimal Space, and Vice Versa.
- Author
-
Kociumaka, Tomasz, Navarro, Gonzalo, and Olivares, Francisco
- Subjects
- *
COMPRESSIBILITY - Abstract
Two recent lower bounds on the compressibility of repetitive sequences, δ ≤ γ , have received much attention. It has been shown that a length-n string S over an alphabet of size σ can be represented within the optimal O (δ log n log σ δ log n ) space, and further, that within that space one can find all the occ occurrences in S of any length-m pattern in time O (m log n + o c c log ϵ n) for any constant ϵ > 0 . Instead, the near-optimal search time O (m + (o c c + 1) log ϵ n) has been achieved only within O (γ log n γ) space. Both results are based on considerably different locally consistent parsing techniques. The question of whether the better search time could be supported within the δ -optimal space remained open. In this paper, we prove that both techniques can indeed be combined to obtain the best of both worlds: O (m + (o c c + 1) log ϵ n) search time within O (δ log n log σ δ log n ) space. Moreover, the number of occurrences can be computed in O (m + log 2 + ϵ n) time within O (δ log n log σ δ log n ) space. We also show that an extra sublogarithmic factor on top of this space enables optimal O (m + o c c) search time, whereas an extra logarithmic factor enables optimal O(m) counting time. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. CSA-MEM: Enhancing Circular DNA Multiple Alignment Through Text Indexing Algorithms
- Author
-
Salgado, André, Fernandes, Francisco, Freitas, Ana Teresa, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Guo, Xuan, editor, Mangul, Serghei, editor, Patterson, Murray, editor, and Zelikovsky, Alexander, editor
- Published
- 2023
- Full Text
- View/download PDF
7. Index Expansion
- Author
-
Jo, Taeho and Jo, Taeho
- Published
- 2023
- Full Text
- View/download PDF
8. Computing All-vs-All MEMs in Run-Length-Encoded Collections of HiFi Reads
- Author
-
Díaz-Domínguez, Diego, Puglisi, Simon J., Salmela, Leena, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Arroyuelo, Diego, editor, and Poblete, Barbara, editor
- Published
- 2022
- Full Text
- View/download PDF
9. Ranked Document Retrieval in External Memory.
- Author
-
SHAH, RAHUL, CHENG SHENG, THANKACHAN, SHARMA, and VITTER, JEFFREY
- Subjects
INFORMATION retrieval ,DATA structures - Abstract
The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T1,T2, . . .,Td} of d strings (called documents) of total length n into a data structure, such that for any given query (P, k), where P is a string (called pattern) of length p ≥ 1 and k ∈ [1,d] is an integer, the identifiers of those k documents that are most relevant to P can be reported, ideally in the sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/logσ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takes O(n)-space and answer queries in O(p/B +logB n +k/B +log ∗ (n/B)) I/Os, where B is the block size. The second one takes O(n log ∗ (n/B)) space and answer queries in optimal O(p/B + logB n + k/B) I/Os. In both cases, the answers are reported in the unsorted order of relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B)) space data structure with optimal query cost. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
10. Data Encoding
- Author
-
Jo, Taeho and Jo, Taeho
- Published
- 2021
- Full Text
- View/download PDF
11. Improved pangenomic classification accuracy with chain statistics.
- Author
-
Brown NK, Shivakumar VS, and Langmead B
- Abstract
Compressed full-text indexes enable efficient sequence classification against a pangenome or tree-of-life index. Past work on compressed-index classification used matching statistics or pseudo-matching lengths to capture the fine-grained co-linearity of exact matches. But these fail to capture coarse-grained information about whether seeds appear co-linearly in the reference. We present a novel approach that additionally obtains coarse-grained co-linearity ("chain") statistics. We do this without using a chaining algorithm, which would require superlinear time in the number of matches. We start with a collection of strings, avoiding the multiple-alignment step required by graph approaches. We rapidly compute multi-maximal unique matches (multi-MUMs) and identify BWT sub-runs that correspond to these multi-MUMs. From these, we select those that can be "tunneled," and mark these with the corresponding multi-MUM identifiers. This yields an O ( r + n / d ) -space index for a collection of d sequences having a length- n BWT consisting of r maximal equal-character runs. Using the index, we simultaneously compute fine-grained matching statistics and coarse-grained chain statistics in linear time with respect to query length. We found that this substantially improves classification accuracy compared to past compressed-indexing approaches and reaches the same level of accuracy as less efficient alignment-based methods., Competing Interests: Disclosure of Interests. Ben Langmead is the founder of InOrder Labs, LLC.
- Published
- 2024
- Full Text
- View/download PDF
12. An Efficient Elastic-Degenerate Text Index? Not Likely
- Author
-
Gibney, Daniel, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Boucher, Christina, editor, and Thankachan, Sharma V., editor
- Published
- 2020
- Full Text
- View/download PDF
13. Reconstructing Scanned Documents for Full-Text Indexing to Empower Digital Library Services
- Author
-
Nitu, Melania, Dascalu, Mihai, Dascalu, Maria-Iuliana, Cotet, Teodor-Mihai, Tomescu, Silvia, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Popescu, Elvira, editor, Hao, Tianyong, editor, Hsu, Ting-Chia, editor, Xie, Haoran, editor, Temperini, Marco, editor, and Chen, Wei, editor
- Published
- 2020
- Full Text
- View/download PDF
14. Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures.
- Author
-
NAVARRO, GONZALO
- Subjects
- *
MOORE'S law , *COLLECTIONS , *MAGNITUDE (Mathematics) - Abstract
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures. In this first part, we describe the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings. In the quest for an ideal measure of repetitiveness, we uncover a fascinating web of relations between those measures, as well as the limits up to which the data can be recovered, and up to which direct access to the compressed data can be provided. This is the basic aspect of indexability, which is covered in the second part of this survey. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
15. Indexing Highly Repetitive String Collections, Part II: Compressed Indexes.
- Author
-
NAVARRO, GONZALO
- Subjects
- *
MOORE'S law , *COLLECTIONS , *MAGNITUDE (Mathematics) - Abstract
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability of handling them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures. In this second part, we describe the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations. We conclude with the current challenges in this fascinating field. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
16. Text Indexing
- Author
-
Jo, Taeho, Kacprzyk, Janusz, Series Editor, and Jo, Taeho
- Published
- 2019
- Full Text
- View/download PDF
17. Applications of Multilingual Thesauri for the Texts Indexing in the Field of Agriculture
- Author
-
Karwowski, Waldemar, Orłowski, Arkadiusz, Rusek, Marian, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Pejaś, Jerzy, editor, El Fray, Imed, editor, and Hyla, Tomasz, editor
- Published
- 2019
- Full Text
- View/download PDF
18. SACABench: Benchmarking Suffix Array Construction
- Author
-
Bahne, Johannes, Bertram, Nico, Böcker, Marvin, Bode, Jonas, Fischer, Johannes, Foot, Hermann, Grieskamp, Florian, Kurpicz, Florian, Löbel, Marvin, Magiera, Oliver, Pink, Rosa, Piper, David, Poeplau, Christopher, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Brisaboa, Nieves R., editor, and Puglisi, Simon J., editor
- Published
- 2019
- Full Text
- View/download PDF
19. Identifying Enemy Item Pairs using Natural Language Processing.
- Author
-
Becker, Kirk A. and Shu-chuan Kao
- Subjects
NATURAL language processing ,DIPLOMATICS ,MACHINE translating - Abstract
Natural Language Processing (NLP) offers methods for understanding and quantifying the similarity between written documents. Within the testing industry these methods have been used for automatic item generation, automated scoring of text and speech, modeling item characteristics, automatic question answering, machine translation, and automated referencing. This paper presents research into the use of NLP for the identification of enemy and duplicate items to improve the maintenance of test item banks. Similar pairs of items can be identified using NLP, limiting the number of items content experts must review to identify enemy and duplicat items. Results from multiple testing programs show that previousely unidentified enemy pairs can be discovered with this method. [ABSTRACT FROM AUTHOR]
- Published
- 2022
20. Grammar-compressed indexes with logarithmic search time.
- Author
-
Claude, Francisco, Navarro, Gonzalo, and Pacheco, Alejandro
- Subjects
- *
GRAMMAR , *SIGNS & symbols , *TIME , *TEXT files , *COLLECTIONS - Abstract
Let a text T [ 1.. n ] be the only string generated by a context-free grammar with g (terminal and nonterminal) symbols, and of size G (measured as the sum of the lengths of the right-hand sides of the rules). Such a grammar, called a grammar-compressed representation of T , can be encoded using G lg G bits. We introduce the first grammar-compressed index that uses O (G lg n) bits (precisely, G lg n + (2 + ϵ) G lg g for any constant ϵ > 0) and can find the occ occurrences of patterns P [ 1.. m ] in time O ((m 2 + o c c) lg G). We implement the index and demonstrate its practicality in comparison with the state of the art, on highly repetitive text collections. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
21. FM-index for Dummies
- Author
-
Grabowski, Szymon, Raniszewski, Marcin, Deorowicz, Sebastian, Diniz Junqueira Barbosa, Simone, Series editor, Chen, Phoebe, Series editor, Du, Xiaoyong, Series editor, Filipe, Joaquim, Series editor, Kara, Orhun, Series editor, Kotenko, Igor, Series editor, Liu, Ting, Series editor, Sivalingam, Krishna M., Series editor, Washio, Takashi, Series editor, Kozielski, Stanisław, editor, Mrozek, Dariusz, editor, Kasprowski, Paweł, editor, Małysiak-Mrozek, Bożena, editor, and Kostrzewa, Daniel, editor
- Published
- 2017
- Full Text
- View/download PDF
22. Efficient Online String Matching Based on Characters Distance Text Sampling.
- Author
-
Faro, Simone, Marino, Francesco Pio, and Pavone, Arianna
- Subjects
- *
INFORMATION retrieval , *COMPUTER science , *ELECTRONIC information resource searching , *ALGORITHMS , *APPLICATION software , *HAMMING distance , *PATTERN matching , *NATURAL language processing - Abstract
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced in order to overcome the prohibitive space requirements of an index construction, on the one hand, and drastically reduce searching time for the online solutions, on the other hand. In this paper we present a new algorithm for the sampled string matching problem, based on a characters distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search online the sampled data for any occurrence of the sampled pattern, before verifying the original text. From a theoretical point of view we prove that, under suitable conditions, our solution can achieve both linear worst-case time complexity and optimal average-time complexity. From a practical point of view it turns out that our solution shows a sub-linear behaviour in practice and speeds up online searching by a factor of up to 9, using limited additional space whose amount goes from 11 to 2.8% of the text size, with a gain up to 50% if compared with previous solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
23. Text Indexing
- Author
-
Aluru, Srinivas and Kao, Ming-Yang, editor
- Published
- 2016
- Full Text
- View/download PDF
24. Suffix Array Construction
- Author
-
Kärkkäinen, Juha and Kao, Ming-Yang, editor
- Published
- 2016
- Full Text
- View/download PDF
25. LOTUS: Adaptive Text Search for Big Linked Data
- Author
-
Ilievski, Filip, Beek, Wouter, van Erp, Marieke, Rietveld, Laurens, Schlobach, Stefan, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Sack, Harald, editor, Blomqvist, Eva, editor, d'Aquin, Mathieu, editor, Ghidini, Chiara, editor, Ponzetto, Simone Paolo, editor, and Lange, Christoph, editor
- Published
- 2016
- Full Text
- View/download PDF
26. Bidirectional String Anchors for Improved Text Indexing and Top-K Similarity Search
- Author
-
Loukides, Grigorios, Pissis, Solon P., Sweering, Michelle, Loukides, Grigorios, Pissis, Solon P., and Sweering, Michelle
- Abstract
The minimizers sampling mechanism is a popular mechanism for string sampling. However, minimizers sampling mechanisms lack good guarantees on the expected size of their samples for different combinations of their input parameters. Furthermore, indexes constructed over minimizers samples lack good worst-case guarantees for on-line pattern searches. In response, we propose bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given an integer ℓ, our mechanism selects the lexicographically smallest rotation in every length-ℓ fragment. We show that, like minimizers samples, bd-anchors samples are approximately uniform, locally consistent, and computable in linear time. Furthermore, our experiments demonstrate that the bd-anchors sample sizes decrease proportionally to ℓ; and that these sizes are competitive to or smaller than the minimizers sample sizes. We theoretically justify these results by analyzing the expected size of bd-anchors samples. We also prove that computing a total order on the input alphabet which minimizes the bd-anchors sample size is NP-hard. We next highlight the benefits of bd-anchors in two important applications: text indexing and top-K similarity search. For the first application, we develop an index for performing on-line pattern searches in near-optimal time, and show experimentally that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index; we also show that it is substantially faster than two classic text indexes. For the second application, we develop a heuristic for top-K similarity search under edit distance, and show experimentally that it is generally as accurate as the state-of-the-art tool for the same purpose but more than one order of magnitude faster.
- Published
- 2023
- Full Text
- View/download PDF
27. Implementasi Sistem Text Indexing pada Aplikasi Pencarian Data Dokumen Menggunakan Mongodb Berbasis Web
- Author
-
Frankie and Susetyo, Yeremia Alfa
- Subjects
Text Indexing ,Fulltext Search ,Flask ,MongoDB ,Pencarian Data ,Python - Abstract
Pertumbuhan teknologi informasi yang sangat pesat membuat jumlah data yang tersimpan dalam database semakin banyak setiap harinya. Basis data relasional (SQL) yang sudah lama digunakan kini mengalami perkembangan dengan munculnya database NoSQL seperti MongoDB. MongoDB menyimpan data dalam format BSON dan memiliki fitur Text Indexes yang berguna untuk mempercepat pencarian teks pada konten string. Fitur ini sangat berguna dalam mencari data dalam bentuk teks atau string dalam jumlah yang besar. Text Indexes MongoDB memiliki skema yang fleksibel sehingga tidak memerlukan struktur skema yang ketat untuk mengindeks data teks dibandingkan dengan database SQL yang memerlukan kolom dengan tipe data yang tepat untuk melakukan indeks. Text Indexes MongoDB mendukung bahasa yang lebih banyak dibandingkan SQL karena menggunakan mesin pencarian teks open source yang disebut Apache Lucene. Dalam penelitian ini, peneliti akan mengimplementasikan Text Indexing pada data dokumen (PDF) yang sudah dikonversi menjadi teks, kemudian dimasukkan ke dalam MongoDB sebelum melakukan indexing. Setelah itu, peneliti akan melakukan perbandingan performa query pencarian antara data yang sudah terindeks dan belum terindeks dalam database MongoDB dari segi kecepatan. Hasil perbandingan akan dibuat dalam bentuk tabel dan grafik agar lebih mudah dipahami. Berdasarkan hasil penelitian yang dilakukan, dapat disimpulkan bahwa penggunaan fitur text indexing pada MongoDB dapat mempercepat waktu pencarian kata atau string. Dalam percobaan yang dilakukan dengan menggunakan 5000 record data, didapatkan hasil bahwa penggunaan text indexing pada pencarian 1 keyword menghasilkan peningkatan kecepatan pencarian sebesar 11705,88%, pada pencarian 2 keyword sebesar 60833,33%, dan pada pencarian 3 keyword sebesar 44320%. The rapid growth of information technology has led to an increase in the amount of data stored in databases every day. Relational databases (SQL) that have been in use for a long time are now being developed with the emergence of NoSQL databases such as MongoDB. MongoDB stores data in BSON format and has a Text Indexes feature that is useful for speeding up text search on string content. This feature is particularly useful in searching for data in the form of texts or strings in large quantities. MongoDB's Text Indexes have a flexible schema that does not require a strict schema structure to index text data, unlike SQL databases that require columns with the appropriate data type to perform indexing. MongoDB's Text Indexes support more languages than SQL because they use an open-source text search engine called Apache Lucene. In this study, the researcher will implement Text Indexing on document data (PDF) that has been converted into text, then inserted into MongoDB before indexing. Afterward, the researcher will compare the performance of search queries between indexed and non-indexed data in MongoDB in terms of speed. The comparison results will be presented in tables and graphs to facilitate understanding. Based on the research conducted, it can be concluded that the use of the text indexing feature in MongoDB can speed up keyword or string search time. In the experiment conducted using 5000 data records, the results showed that the use of text indexing for searching 1 keyword resulted in a search speed improvement of 11705,88%, for searching 2 keywords it was 60833,33%, and for searching 3 keywords it was 44320%.
- Published
- 2023
28. Fixed Block Compression Boosting in FM-Indexes: Theory and Practice.
- Author
-
Gog, Simon, Kärkkäinen, Juha, Kempa, Dominik, Petri, Matthias, and Puglisi, Simon J.
- Subjects
- *
PATTERN matching , *THEORY-practice relationship , *DATA structures - Abstract
The FM index (Ferragina and Manzini in J ACM 52(4):552–581, 2005) is a widely-used compressed data structure that stores a string T in a compressed form and also supports fast pattern matching queries. In this paper, we describe new FM-index variants that combine nice theoretical properties, simple implementation and improved practical performance. Our main theoretical result is a new technique called fixed block compression boosting, which is a simpler and faster alternative to optimal compression boosting and implicit compression boosting used in previous FM-indexes. We also describe several new techniques for implementing fixed-block boosting efficiently, including a new, fast, and space-efficient implementation of wavelet trees. Our extensive experiments show the new indexes to be consistently fast and small relative to the state-of-the-art, and thus they make a good "off-the-shelf" choice for many applications. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
29. Text Indexing for Regular Expression Matching
- Author
-
Daniel Gibney and Sharma V. Thankachan
- Subjects
regular expressions ,text indexing ,pattern matching ,Industrial engineering. Management engineering ,T55.4-60.8 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Finding substrings of a text T that match a regular expression p is a fundamental problem. Despite being the subject of extensive research, no solution with a time complexity significantly better than O(|T||p|) has been found. Backurs and Indyk in FOCS 2016 established conditional lower bounds for the algorithmic problem based on the Strong Exponential Time Hypothesis that helps explain this difficulty. A natural question is whether we can improve the time complexity for matching the regular expression by preprocessing the text T? We show that conditioned on the Online Matrix–Vector Multiplication (OMv) conjecture, even with arbitrary polynomial preprocessing time, a regular expression query on a text cannot be answered in strongly sublinear time, i.e., O(|T|1−ε) for any ε>0. Furthermore, if we extend the OMv conjecture to a plausible conjecture regarding Boolean matrix multiplication with polynomial preprocessing time, which we call Online Matrix–Matrix Multiplication (OMM), we can strengthen this hardness result to there being no solution with a query time that is O(|T|3/2−ε). These results hold for alphabet sizes three or greater. We then provide data structures that answer queries in O(|T||p|τ) time where τ∈[1,|T|] is fixed at construction. These include a solution that works for all regular expressions with Expτ·|T| preprocessing time and space. For patterns containing only ‘concatenation’ and ‘or’ operators (the same type used in the hardness result), we provide (1) a deterministic solution which requires Expτ·|T|log2|T| preprocessing time and space, and (2) when |p|≤|T|z for z=2o(log|T|), a randomized solution with amortized query time which answers queries correctly with high probability, requiring Expτ·|T|2Ωlog|T| preprocessing time and space.
- Published
- 2021
- Full Text
- View/download PDF
30. A Full and Linear Index of a Tree for Tree Patterns
- Author
-
Janoušek, Jan, Melichar, Bořivoj, Polách, Radomír, Poliak, Martin, Trávníček, Jan, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Kobsa, Alfred, editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Weikum, Gerhard, editor, Jürgensen, Helmut, editor, Karhumäki, Juhani, editor, and Okhotin, Alexander, editor
- Published
- 2014
- Full Text
- View/download PDF
31. Improved and Extended Locating Functionality on Compressed Suffix Arrays
- Author
-
Gog, Simon, Navarro, Gonzalo, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Kobsa, Alfred, editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Weikum, Gerhard, editor, Gudmundsson, Joachim, editor, and Katajainen, Jyrki, editor
- Published
- 2014
- Full Text
- View/download PDF
32. Hypotheses, Questions, and Evidence
- Author
-
Zobel, Justin and Zobel, Justin
- Published
- 2014
- Full Text
- View/download PDF
33. Indexing labeled sequences
- Author
-
Tatiana Rocher, Mathieu Giraud, and Mikaël Salson
- Subjects
Data structures ,Text indexing ,Burrows–Wheeler transform ,Wavelet Tree ,V(D)J recombination ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Background Labels are a way to add some information on a text, such as functional annotations such as genes on a DNA sequences. V(D)J recombinations are DNA recombinations involving two or three short genes in lymphocytes. Sequencing this short region (500 bp or less) produces labeled sequences and brings insight in the lymphocyte repertoire for onco-hematology or immunology studies. Methods We present two indexes for a text with non-overlapping labels. They store the text in a Burrows–Wheeler transform (BWT) and a compressed label sequence in a Wavelet Tree. The label sequence is taken in the order of the text (TL-index) or in the order of the BWT (TLBW-index). Both indexes need a space related to the entropy of the labeled text. Results These indexes allow efficient text–label queries to count and find labeled patterns. The TLBW-index has an overhead on simple label queries but is very efficient on combined pattern–label queries. We implemented the indexes in C++ and compared them against a baseline solution on pseudo-random as well as on V(D)J labeled texts. Discussion New indexes such as the ones we proposed improve the way we index and query labeled texts as, for instance, lymphocyte repertoire for hematological and immunological studies.
- Published
- 2018
- Full Text
- View/download PDF
34. Orthogonal Range Searching for Text Indexing
- Author
-
Lewenstein, Moshe, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Brodnik, Andrej, editor, López-Ortiz, Alejandro, editor, Raman, Venkatesh, editor, and Viola, Alfredo, editor
- Published
- 2013
- Full Text
- View/download PDF
35. Words Context Analysis for Improvement of Information Retrieval
- Author
-
Szymański, Julian, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Nguyen, Ngoc-Thanh, editor, Hoang, Kiem, editor, and Jȩdrzejowicz, Piotr, editor
- Published
- 2012
- Full Text
- View/download PDF
36. The Position Heap of a Trie
- Author
-
Nakashima, Yuto, I, Tomohiro, Inenaga, Shunsuke, Bannai, Hideo, Takeda, Masayuki, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Calderón-Benavides, Liliana, editor, González-Caro, Cristina, editor, Chávez, Edgar, editor, and Ziviani, Nivio, editor
- Published
- 2012
- Full Text
- View/download PDF
37. Text Indexing : 1993; Manber, Myers 1993; Manber, Myers
- Author
-
Aluru, Srinivas and Kao, Ming-Yang, editor
- Published
- 2008
- Full Text
- View/download PDF
38. An(other) Entropy-Bounded Compressed Suffix Tree
- Author
-
Fischer, Johannes, Mäkinen, Veli, Navarro, Gonzalo, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Ferragina, Paolo, editor, and Landau, Gad M., editor
- Published
- 2008
- Full Text
- View/download PDF
39. A Framework for Dynamizing Succinct Data Structures
- Author
-
Gupta, Ankur, Hon, Wing-Kai, Shah, Rahul, Vitter, Jeffrey Scott, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Arge, Lars, editor, Cachin, Christian, editor, Jurdziński, Tomasz, editor, and Tarlecki, Andrzej, editor
- Published
- 2007
- Full Text
- View/download PDF
40. Factor Automata of Automata and Applications
- Author
-
Mohri, Mehryar, Moreno, Pedro, Weinstein, Eugene, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Holub, Jan, editor, and Žďárek, Jan, editor
- Published
- 2007
- Full Text
- View/download PDF
41. A Faster Query Algorithm for the Text Fingerprinting Problem
- Author
-
Chan, Chi-Yuan, Yu, Hung-I, Hon, Wing-Kai, Wang, Biing-Feng, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Arge, Lars, editor, Hoffmann, Michael, editor, and Welzl, Emo, editor
- Published
- 2007
- Full Text
- View/download PDF
42. Dotted Suffix Trees A Structure for Approximate Text Indexing
- Author
-
Coelho, Luís Pedro, Oliveira, Arlindo L., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Crestani, Fabio, editor, Ferragina, Paolo, editor, and Sanderson, Mark, editor
- Published
- 2006
- Full Text
- View/download PDF
43. Reverse-Safe Text Indexing
- Author
-
Solon P. Pissis, Huiping Chen, Grigorios Loukides, Gabriele Fici, Giulia Bernardini, Bernardini, G., Chen, H., Fici, G., Loukides, G., Pissis, S. P., Bernardini G., Chen H., Fici G., Loukides G., and Pissis S.P.
- Subjects
Data structures ,Computer science ,Suffix tree ,suffix tree ,0102 computer and information sciences ,02 engineering and technology ,text indexing ,01 natural sciences ,Theoretical Computer Science ,law.invention ,Set (abstract data type) ,law ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Pattern matching ,data privacy ,Settore INF/01 - Informatica ,Search engine indexing ,pattern matching ,Data structure ,Matrix multiplication ,010201 computation theory & mathematics ,Algorithm ,Adversary model ,Integer (computer science) - Abstract
We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z - reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D . The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z , we propose an algorithm that constructs a z -reverse-safe data structure ( z -RSDS) that has size O(n) and answers decision and counting pattern matching queries of length at most d optimally, where d is maximal for any such z -RSDS. The construction algorithm takes O(nɷ log d) time, where ɷ is the matrix multiplication exponent. We show that, despite the nɷ factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We also show that plugging our method in data analysis applications gives insignificant or no data utility loss. Furthermore, we show how our technique can be extended to support applications under realistic adversary models. Finally, we show a z -RSDS for decision pattern matching queries, whose size can be sublinear in n . A preliminary version of this article appeared in ALENEX 2020.
- Published
- 2021
- Full Text
- View/download PDF
44. Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk)
- Author
-
Sharma V. Thankachan, Thankachan, Sharma V., Sharma V. Thankachan, and Thankachan, Sharma V.
- Abstract
In the past two decades, we have witnessed the design of various compact data structures for pattern matching over an indexed text [Navarro, 2016]. Popular indexes like the FM-index [Paolo Ferragina and Giovanni Manzini, 2005], compressed suffix arrays/trees [Roberto Grossi and Jeffrey Scott Vitter, 2005; Kunihiko Sadakane, 2007], the recent r-index [Travis Gagie et al., 2020; Takaaki Nishimoto and Yasuo Tabei, 2021], etc., capture the key functionalities of classic suffix arrays/trees [Udi Manber and Eugene W. Myers, 1993; Peter Weiner, 1973] in compact space. Mostly, they rely on the Burrows-Wheeler Transform (BWT) and its associated operations [Burrows and Wheeler, 1994]. However, compactly encoding some advanced suffix tree (ST) variants, like parameterized ST [Brenda S. Baker, 1993; S. Rao Kosaraju, 1995; Juan Mendivelso et al., 2020], order-isomorphic/preserving ST [Maxime Crochemore et al., 2016], two-dimensional ST [Raffaele Giancarlo, 1995; Dong Kyue Kim et al., 1998], etc. [Sung Gwan Park et al., 2019; Tetsuo Shibuya, 2000]- collectively known as suffix trees with missing suffix links [Richard Cole and Ramesh Hariharan, 2003], has been challenging. The previous techniques are not easily extendable because these variants do not hold some structural properties of the standard ST that enable compression. However, some limited progress has been made in these directions recently [Arnab Ganguly et al., 2017; Travis Gagie et al., 2017; Gianni Decaroli et al., 2017; Dhrumil Patel and Rahul Shah, 2021; Arnab Ganguly et al., 2021; Sung{-}Hwan Kim and Hwan{-}Gue Cho, 2021; Sung{-}Hwan Kim and Hwan{-}Gue Cho, 2021; Arnab Ganguly et al., 2017; Arnab Ganguly et al., 2022; Arnab Ganguly et al., 2021]. This talk will briefly survey them and highlight some interesting open problems.
- Published
- 2022
- Full Text
- View/download PDF
45. A Taxonomy of Algorithms for Constructing Minimal Acyclic Deterministic Finite Automata
- Author
-
Watson, Bruce W., Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Boldt, Oliver, editor, and Jürgensen, Helmut, editor
- Published
- 2001
- Full Text
- View/download PDF
46. Full-Fledged Real-Time Indexing for Constant Size Alphabets.
- Author
-
Kucherov, Gregory and Nekrich, Yakov
- Subjects
- *
ALPHABET , *INDEXING , *DATA structures , *PATTERN matching , *QUERY languages (Computer science) - Abstract
In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet of constant size. Each new symbol can be prepended to T in O(1) worst-case time. At any moment, we can report all occurrences of a pattern P in the current text in $$O(|P|+k)$$ time, where | P| is the length of P and k is the number of occurrences. This resolves, under assumption of constant size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see Amir and Nor in Real-time indexing over fixed finite alphabets, pp. 1086-1095, 2008). [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
47. A PRACTICAL INDEX FOR APPROXIMATE DICTIONARY MATCHING WITH FEW MISMATCHES.
- Author
-
CISŁAK, Aleksander and GRABOWSKI, Szymon
- Subjects
ONLINE library catalogs ,DIRICHLET principle ,C++ ,HAMMING distance ,KEYWORDS - Abstract
Approximate dictionary matching (checking if a pattern occurs in a collection of strings) is a classic problem with applications in e.g. spellchecking, online catalogs, and web searchers. We present a simple solution called split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is bene cial for the search speed. We compare our solution with other algorithms and we show that it is faster when the Hamming distance is used. Query times in the order of 1 microsecond were reported for one mismatch for a few-megabyte natural language dictionary on a medium-end PC. We also demonstrate that a basic compression technique consisting in q-gram substitution can signi cantly reduce the index size (up to 50% of the input text size for the DNA sequences). [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
48. Position-restricted substring searching over small alphabets.
- Author
-
Biswas, Sudip, Ku, Tsung-Han, Shah, Rahul, and Thankachan, Sharma V.
- Abstract
We consider the problem of indexing a given text T [ 0 . . . n − 1 ] of n characters over an alphabet set Σ of size σ , in order to answer the position-restricted substring searching queries. The query input consists of a pattern P (of length p ) and two indices ℓ and r and the output is the set of all o c c ℓ , r occurrences of P in T [ ℓ . . . r ] . In this paper, we propose an O ( n log σ ) -word space index with O ( p + o c c ℓ , r log log n ) query time. Our solution is interesting when the alphabet size is small. For example, when the alphabet set is of constant size, we achieve significant improvement over the previously best-known linear space index by Navarro and Nekrich [SWAT 2012] with O ( p + o c c ℓ , r log ϵ n ) query time, where ϵ > 0 is an arbitrarily small positive constant. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
49. A Bloom filter based semi-index on q-grams.
- Author
-
Grabowski, Szymon, Susik, Robert, and Raniszewski, Marcin
- Subjects
DATA structures ,MAGNITUDE estimation ,INFORMATION storage & retrieval systems ,INDEXING ,ALGORITHMS - Abstract
We present a simple q-gram based semi-index, which allows to look for a pattern typically only in a small fraction of text blocks. Several space-time tradeoffs are presented. Experiments on Pizza & Chili datasets show that our solution is up to three orders of magnitude faster than the Claude et al. (Journal of Discrete Algorithms 2012; 11:37) semi-index at a comparable space usage. Moreover, the construction of our data structure is fast and easily parallelizable. Copyright © 2016 John Wiley & Sons, Ltd. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
50. A framework for space-efficient read clustering in metagenomic samples.
- Author
-
Alanko, Jarno, Cunial, Fabio, Belazzougui, Djamal, and Mäkinen, Veli
- Subjects
- *
METAGENOMICS , *DNA analysis , *CLUSTER analysis (Statistics) , *TAXONOMY , *ALGORITHMS - Abstract
Background: A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed. Results: We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length l each, on an alphabet of total size σ, our algorithms take O(n(t + log σ)) time and just 2n + o(n) + O(max{lσ log n, K logm}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure. Conclusions: Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the ar [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.