3,964 results on '"edit distance"'
Search Results
2. Elastic-Degenerate String Matching with 1 Error or Mismatch.
- Author
-
Bernardini, Giulia, Gabory, Esteban, Pissis, Solon P., Stougie, Leen, Sweering, Michelle, and Zuba, Wiktor
- Subjects
- *
HAMMING distance , *COMPUTATIONAL geometry , *MATRIX multiplications , *STATISTICAL decision making , *PAN-genome - Abstract
An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an O ~ (n m ω - 1) + O (N) -time algorithm [Bernardini et al., SIAM J. Comput. 2022], where ω denotes the matrix multiplication exponent and the O ~ (·) notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in O (k 2 m G + k N) time, under edit distance, or O (k m G + k N) time, under Hamming distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k = 1 , the existing algorithms run in Ω (m N) time in the worst case. In this paper we make progress in this direction. We show that 1-EDSM can be solved in O ((n m 2 + N) log m) or O (n m 3 + N) time under edit distance. For the decision version of the problem, we present a faster O (n m 2 log m + N log log m) -time algorithm. We also show that 1-EDSM can be solved in O (n m 2 + N log m) time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the k-errata trees for indexing with errors [Cole et al., STOC 2004]. This is an extended version of a paper presented at LATIN 2022. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
3. An Improved Algorithm for The k-Dyck Edit Distance Problem.
- Author
-
Fried, Dvir, Golan, Shay, Kociumaka, Tomasz, Kopelowitz, Tsvi, Porat, Ely, and Starikovskaya, Tatiana
- Subjects
MATRIX multiplications - Abstract
A Dyck sequence is a sequence of opening and closing parentheses (of various types) that is balanced. The Dyck edit distance of a given sequence of parentheses S is the smallest number of edit operations (insertions, deletions, and substitutions) needed to transform S into a Dyck sequence. We consider the threshold Dyck edit distance problem, where the input is a sequence of parentheses S and a positive integer k, and the goal is to compute the Dyck edit distance of S only if the distance is at most k, and otherwise report that the distance is larger than k. Backurs and Onak [PODS'16] showed that the threshold Dyck edit distance problem can be solved in O(n+k
16 ) time. In this work, we design new algorithms for the threshold Dyck edit distance problem which costs O(n+k4.544184 ) time with high probability or O(n+k4.853059 ) deterministically. Our algorithms combine several new structural properties of the Dyck edit distance problem, a refined algorithm for fast (min, +) matrix product, and a careful modification of ideas used in Valiant's parsing algorithm. [ABSTRACT FROM AUTHOR]- Published
- 2024
- Full Text
- View/download PDF
4. Computing All Minimal Ways to Reach a Context-Free Language
- Author
-
Bruse, Florian, Lange, Martin, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Kovács, Laura, editor, and Sokolova, Ana, editor
- Published
- 2024
- Full Text
- View/download PDF
5. Approximate Cartesian Tree Pattern Matching
- Author
-
Kim, Sungmin, Han, Yo-Sub, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Day, Joel D., editor, and Manea, Florin, editor
- Published
- 2024
- Full Text
- View/download PDF
6. C2 and Phishing Domains Detection Using DNS Analysis
- Author
-
Singh, Neelam, Vinod, Gopika, Kakkar, Akshat, Pratap, Surya, Joseph, Gigi, Chaari, Fakher, Series Editor, Gherardini, Francesco, Series Editor, Ivanov, Vitalii, Series Editor, Haddar, Mohamed, Series Editor, Cavas-Martínez, Francisco, Editorial Board Member, di Mare, Francesca, Editorial Board Member, Kwon, Young W., Editorial Board Member, Tolio, Tullio A. M., Editorial Board Member, Trojanowska, Justyna, Editorial Board Member, Schmitt, Robert, Editorial Board Member, Xu, Jinyang, Editorial Board Member, Varde, Prabhakar V., editor, Vinod, Gopika, editor, and Joshi, N. S., editor
- Published
- 2024
- Full Text
- View/download PDF
7. Model-Independent Error Bound Estimation for Conformance Checking Approximation
- Author
-
Fani Sani, Mohammadreza, Kabierski, Martin, van Zelst, Sebastiaan J., van der Aalst, Wil M. P., van der Aalst, Wil, Series Editor, Ram, Sudha, Series Editor, Rosemann, Michael, Series Editor, Szyperski, Clemens, Series Editor, Guizzardi, Giancarlo, Series Editor, De Weerdt, Jochen, editor, and Pufahl, Luise, editor
- Published
- 2024
- Full Text
- View/download PDF
8. Needleman-Wunsch Attention: A Framework for Enhancing DNA Sequence Embedding
- Author
-
Kyelim Lee and Albert No
- Subjects
Attention ,edit distance ,DNA sequence ,Needleman-Wunsch ,sequence embedding ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
In many biological research studies that rely on DNA sequence data, calculating the edit distance between two sequences is a vital component. However, computing the edit distance involves dynamic programming, which can be computationally intensive. To address this challenge, numerous works have focused on embedding sequences into the vector space while preserving the distance metric. This means that the edit distance between sequences is analogous to the distance between their corresponding vectors. In this study, we propose a novel Needleman-Wunsch Attention (NWA) framework for sequence embedding that leverages the relationship between the Needleman-Wunsch (NW) matrix and attention maps to improve the accuracy and efficiency of edit distance approximation methods. Our approach applies to any deep learning-based sequence embedding network and provides a general solution to improve the accuracy and efficiency of edit distance approximation methods. We validate the effectiveness of our proposed method by applying it to various existing embedding networks, demonstrating improved edit distance-preserving embedding in an actual dataset. The code is publicly available at https://github.com/thisislim/nw-attention/.
- Published
- 2024
- Full Text
- View/download PDF
9. On the Generative Power of ReLU Network for Generating Similar Strings
- Author
-
Mamoona Ghafoor and Tatsuya Akutsu
- Subjects
ReLU neural network ,hamming distance ,edit distance ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Recently, generative networks are widely used in different applied fields including computational biology for data augmentation, DNA sequence generation, and drug discovery. The core idea of these networks is to generate new data instances that resemble a given set of data. However it is unclear how many nodes and layers are required to generate the desirable data. In this context, we study the problem of generating strings with a given Hamming distance and edit distance which are commonly used for sequence comparison, error detection, and correction in computational biology to comprehend genetic variations, mutations, and evolutionary changes. More precisely, for a given string $e$ of length $n$ over a symbol set $\Sigma $ , $m = |\Sigma |$ , we proved that all strings over $\Sigma $ with hamming distance and edit distance at most $d$ from $e$ can be generated by a generative network with rectified linear unit function as an activation function. The depth of these networks is constant and are of size $\mathcal {O}(nd)$ and $\mathcal {O}(\max (md, nd))$ .
- Published
- 2024
- Full Text
- View/download PDF
10. Adaptable DNA Storage Coding: An Efficient Framework for Homopolymer Constraint Transitions
- Author
-
Yunfei Gao and Albert No
- Subjects
DNA storage ,DNA-to-DNA coding ,edit distance ,GC contents ,homopolymer constraint ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Many DNA storage codes take into account homopolymer and GC-content constraints. Still, these codes often need to meet additional practical database requirements, such as error correction and data queries, necessitating considerable financial and time investment in their training or design. As DNA storage technologies, including sequencing and synthesis, continue to evolve rapidly, these codes may need to be retrained or redesigned to adapt to new constraints. In this study, we aim to design a method for adapting an existing DNA storage code to satisfy a new constraint, specifically concerning homopolymer variations. We present a simple and effective framework known as Transfer Coding, which directly maps DNA sequences from an original homopolymer constraint $h_{1}$ to a new constraint $h_{2}$ . This approach essentially combines the existing coding scheme with a Transfer encoder. The proposed method uses strategic base replacements to ensure compliance with constraints, achieving results close to the theoretical limit while keeping alterations to the original sequence minimal.
- Published
- 2024
- Full Text
- View/download PDF
11. MinJoin++: a fast algorithm for string similarity joins under edit distance.
- Author
-
Karpov, Nikolai, Zhang, Haoyu, and Zhang, Qin
- Abstract
We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted a lot of attention in the past two decades. However, all previous algorithms either cannot scale to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper, we propose a new algorithm for edit similarity joins using a novel string partition-based approach. We show that, theoretically, our algorithm finds all similar pairs with high probability and runs in linear time (plus a data-dependent verification step). The algorithm can also be easily parallelized. Experiments on real-world datasets show that our algorithm outperforms the state-of-the-art algorithms for edit similarity joins by orders of magnitudes in running time and achieves perfect accuracy on most datasets that we have tested. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
12. 基于哈希和编辑距离算法的 SCD 双层向量化与 变更校验技术.
- Author
-
叶远波, 王吉文, 汪伟, 毛玉荣, and 王志华
- Abstract
Copyright of Electric Power is the property of Electric Power Editorial Office and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
13. Geo-topology, Complexity and Resilience
- Author
-
Papadimitriou, Fivos, Warf, Barney, Series Editor, and Papadimitriou, Fivos
- Published
- 2023
- Full Text
- View/download PDF
14. Privacy-Preserving Edit Distance Computation Using Secret-Sharing Two-Party Computation
- Author
-
Vanegas, Hernán, Cabarcas, Daniel, Aranha, Diego F., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Aly, Abdelrahaman, editor, and Tibouchi, Mehdi, editor
- Published
- 2023
- Full Text
- View/download PDF
15. Isometric Words and Edit Distance: Main Notions and New Variations
- Author
-
Castiglione, Giuseppa, Flores, Manuela, Giammarresi, Dora, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Manzoni, Luca, editor, Mariot, Luca, editor, and Roy Chowdhury, Dipanwita, editor
- Published
- 2023
- Full Text
- View/download PDF
16. A Novel Pattern-Based Edit Distance for Automatic Log Parsing: Implementation and Reproducibility Notes
- Author
-
Raynal, Maxime, Buob, Marc-Olivier, Quénot, Georges, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Kerautret, Bertrand, editor, Colom, Miguel, editor, Krähenbühl, Adrien, editor, Lopresti, Daniel, editor, Monasse, Pascal, editor, and Perret, Benjamin, editor
- Published
- 2023
- Full Text
- View/download PDF
17. Assessment of Divergent Thinking in a Social and Modular Robotics Task: An Edit Distance Approach at the Configuration Level
- Author
-
Kohler, Louis, Romero, Margarida, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Zaphiris, Panayiotis, editor, and Ioannou, Andri, editor
- Published
- 2023
- Full Text
- View/download PDF
18. Extended Pairwise Sequence Alignment
- Author
-
Araujo, Eloi, Martinez, Fábio V., Rozante, Luiz C., Almeida, Nalvo F., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Gervasi, Osvaldo, editor, Murgante, Beniamino, editor, Taniar, David, editor, Apduhan, Bernady O., editor, Braga, Ana Cristina, editor, Garau, Chiara, editor, and Stratigea, Anastasia, editor
- Published
- 2023
- Full Text
- View/download PDF
19. Heuristics for the de Bruijn Graph Sequence Mapping Problem
- Author
-
Rocha, Lucas B., Adi, Said Sadique, Araujo, Eloi, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Gervasi, Osvaldo, editor, Murgante, Beniamino, editor, Taniar, David, editor, Apduhan, Bernady O., editor, Braga, Ana Cristina, editor, Garau, Chiara, editor, and Stratigea, Anastasia, editor
- Published
- 2023
- Full Text
- View/download PDF
20. Bangla Spelling Error Detection and Correction Using N-Gram Model
- Author
-
Bagchi, Promita, Arafin, Mursalin, Akther, Aysha, Alam, Kazi Masudul, Akan, Ozgur, Editorial Board Member, Bellavista, Paolo, Editorial Board Member, Cao, Jiannong, Editorial Board Member, Coulson, Geoffrey, Editorial Board Member, Dressler, Falko, Editorial Board Member, Ferrari, Domenico, Editorial Board Member, Gerla, Mario, Editorial Board Member, Kobayashi, Hisashi, Editorial Board Member, Palazzo, Sergio, Editorial Board Member, Sahni, Sartaj, Editorial Board Member, Shen, Xuemin, Editorial Board Member, Stan, Mircea, Editorial Board Member, Jia, Xiaohua, Editorial Board Member, Zomaya, Albert Y., Editorial Board Member, Satu, Md. Shahriare, editor, Moni, Mohammad Ali, editor, Kaiser, M. Shamim, editor, and Arefin, Mohammad Shamsul, editor
- Published
- 2023
- Full Text
- View/download PDF
21. Fault Diagnosis for Lithium-Ion Batteries in Electric Vehicles Based on VMD and Edit Distance
- Author
-
Li, Xianglong, Zhang, Qian, Jin, Yuan, Chen, Huimin, Yang, Hongqing, Du, Shaohua, Li, Shuowei, Zhang, Caiping, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Zhang, Junjie James, Series Editor, Sun, Fengchun, editor, Yang, Qingxin, editor, Dahlquist, Erik, editor, and Xiong, Rui, editor
- Published
- 2023
- Full Text
- View/download PDF
22. BEDSpell: Spelling Error Correction Using BERT-Based Masked Language Model and Edit Distance
- Author
-
Tohidian, Fatemeh, Kashiri, Amin, Lotfi, Fariba, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Troya, Javier, editor, Mirandola, Raffaela, editor, Navarro, Elena, editor, Delgado, Andrea, editor, Segura, Sergio, editor, Ortiz, Guadalupe, editor, Pautasso, Cesare, editor, Zirpins, Christian, editor, Fernández, Pablo, editor, and Ruiz-Cortés, Antonio, editor
- Published
- 2023
- Full Text
- View/download PDF
23. Mechanistic Explanatory Texts
- Author
-
Kohár, Matej, Piccinini, Gualtiero, Series Editor, Brogaard, Berit, Editorial Board Member, Craver, Carl, Editorial Board Member, Machery, Edouard, Editorial Board Member, Shagrir, Oron, Editorial Board Member, Sprevak, Mark, Editorial Board Member, and Kohár, Matej
- Published
- 2023
- Full Text
- View/download PDF
24. Pattern Recognition Using Graph Edit Distance
- Author
-
Dwivedi, Shri Prakash, Singh, Ravi Shankar, Awasthi, Shashank, editor, Sanyal, Goutam, editor, Travieso-Gonzalez, Carlos M., editor, Kumar Srivastava, Pramod, editor, Singh, Dinesh Kumar, editor, and Kant, Rama, editor
- Published
- 2023
- Full Text
- View/download PDF
25. Logical Synonymy Construction for Translated Texts Based on Edge Technology
- Author
-
Zou Haiyan and Xiao Qinhua
- Subjects
edge technology ,edit distance ,longest common substring ,similarity algorithm ,translated text synonymity ,97m50 ,Mathematics ,QA1-939 - Abstract
This paper combines edge technology and communication transmission control technology to characterize the translation logical synonymity of machine-automatically converted translated text, the mapping results of logical synonymity of machine-automatically converted translated text, and construct the logical synonymity model of machine-automatically translated text. Starting from the classification of synonym identification, the similarity algorithm is used to calculate the editing distance and the longest common substring of the synonymous terms of the translated text, and according to the translated text, the synonym matching items are extracted from an arbitrary text collection, and the pairing determines whether there is any synonymity in it. The analysis of data analysis involves analyzing the logical synonymity features of translated texts that use edge technology and Chinese-English and English-Chinese metaphor translation. The results show that the lexical distribution of the translated corpus based on edge technology is basically the same as that of the original Chinese language, except for the gap in the use of nouns (3.35%), verbs (5.19%), adjectives (3.21%), and pronouns (0.30%), the rest of the lexical distributions have very small gaps, and thus show an obvious trend of paradigm, which further confirms that the study of logical synonymy of text based on edge technology is Feasibility. Both explicit metaphors and borrowed metaphors are more than 20% higher in the proportion of direct translations from English to Chinese than from Chinese to English, indicating that the direct translation method is more frequently used in English to Chinese translation. This study contributes to the creation of a linguistic and culturally diverse situation and positively influences the translation process.
- Published
- 2024
- Full Text
- View/download PDF
26. Characterization of random walks on space of unordered trees using efficient metric simulation.
- Author
-
Ben Naoum, Farah, Godin, Christophe, and Azaïs, Romain
- Subjects
- *
RANDOM walks , *TREE size , *TREES - Abstract
The simple random walk on Z p shows two drastically different behaviors depending on the value of p : it is recurrent when p ∈ { 1 , 2 } while it escapes (with a rate increasing with p) as soon as p ≥ 3. This classical example illustrates that the asymptotic properties of a random walk provides some information on the structure of its state space. This paper aims to explore analogous questions on space made up of combinatorial objects with no algebraic structure. We take as a model for this problem the space of unordered unlabeled rooted trees endowed with Zhang edit distance. To this end, it defines the canonical unbiased random walk on the space of trees and provides an efficient algorithm to evaluate its escape rate. Compared to Zhang algorithm, it is incremental and computes the edit distance along the random walk approximately 100 times faster on trees of size 500 on average. The escape rate of the random walk on trees is precisely estimated using intensive numerical simulations, out of reasonable reach without the incremental algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
27. Accelerating similarity-based model matching using dual hashing
- Author
-
He, Xiao, Liu, Yi, and He, Huihong
- Published
- 2024
- Full Text
- View/download PDF
28. Approximating Edit Distance in Truly Subquadratic Time: Quantum and MapReduce.
- Author
-
BOROUJENI, MAHDI, EHSANI, SOHEIL, GHODSI, MOHAMMAD, HAJIAGHAYI, MOHAMMADTAGHI, and SEDDIGHIN, SAEED
- Subjects
ALGORITHMS ,APPROXIMATION algorithms ,PARALLEL algorithms - Abstract
The edit distance between two strings is defined as the smallest number of insertions, deletions, and substitutions that need to be made to transform one of the strings to another one. Approximating edit distance in subquadratic time is "one of the biggest unsolved problems in the field of combinatorial pattern matching" [37]. Our main result is a quantum constant approximation algorithm for computing the edit distance in truly subquadratic time. More precisely, we give an O(n1.810) quantum algorithm that approximates the edit distance within a factor of 3. We further extend this result to an O(n1.708) quantum algorithm that approximates the edit distance within a larger constant factor. Our solutions are based on a framework for approximating edit distance in parallel settings. This framework requires as black box an algorithm that computes the distances of several smaller strings all at once. For a quantum algorithm, we reduce the black box to metric estimation and provide efficient algorithms for approximating it.We further show that this framework enables us to approximate edit distance in distributed settings. To this end, we provide a MapReduce algorithm to approximate edit distance within a factor of 1+∈, with sublinearly many machines and sublinear memory. Also, our algorithm runs in a logarithmic number of rounds. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
29. Military realm entity links based on improved editing distances
- Author
-
XIA Xudong, YU Ronghuan
- Subjects
entity link ,bm25 ,edit distance ,Military Science - Abstract
In order to accurately link the entity references in the commander’s demand statement to the standardized entity nodes in the knowledge graph, an entity linking method in the military domain based on improved edit distance is proposed. By summarizing the non-standard forms of entity referencing to establish an index, and using the BM25 model to integrate the improved edit distance algorithm that considers the character position exchange, inclusion similarity and other similarities, the entities to be linked are sorted to achieve the link. The experimental results show that the entity linking method in the military field can effectively improve the matching accuracy in similarity calculation.
- Published
- 2023
- Full Text
- View/download PDF
30. An Improved Sketching Algorithm for Edit Distance
- Author
-
Jin, Ce, Nelson, Jelani, and Wu, Kewen
- Subjects
edit distance ,sketching - Published
- 2021
31. Approximating Edit Distance Within Constant Factor in Truly Sub-quadratic Time.
- Author
-
CHAKRABORTY, DIPTARKA, DAS, DEBARATI, GOLDENBERG, ELAZAR, KOUCKÝ, MICHAL, and SAKS, MICHAEL
- Subjects
DISTANCES ,DYNAMIC programming ,APPROXIMATION algorithms - Abstract
Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer, and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within approximation factor poly(logn). In this article, we provide an algorithm with running time O(n
2-2/7 ) that approximates the edit distance within a constant factor. [ABSTRACT FROM AUTHOR]- Published
- 2020
- Full Text
- View/download PDF
32. Further Contributions to Romance Dialectometry
- Author
-
Nerbonne, John
- Published
- 2023
- Full Text
- View/download PDF
33. DEVELOPMENT OF AN AUGMENTED DAMERAU– LEVENSHTEIN METHOD FOR CORRECTING SPELLING ERRORS IN KAZAKH TEXTS.
- Author
-
Mukazhanov, Nurzhan, Alibiyeva, Zhibek, Yerimbetova, Aigerim, Kassymova, Aizhan, and Alibiyeva, Nursulu
- Subjects
SPELLING errors ,CHATBOTS ,EMAIL ,TEXT messages - Abstract
The presented paper is devoted to the development of a method for identifying and correcting spelling errors in Kazakh texts. In this paper, the study object is methods for more accurate correction of spelling errors in Kazakh texts. The aim of the study is to develop an augmented version of the Damerau-Levenshtein method for correcting spelling errors in Kazakh language texts. Automatic detection and correction of spelling errors have become a default feature in modern text editors for working with text data, in text messaging applications such as chatbots, messengers, etc. However, although this task is well solved in geographically widespread languages, it has not been fully solved in languages with a small audience, such as the Kazakh language. The methods developed so far cannot correct all spelling errors found in Kazakh texts. Therefore, the development of a method with specific algorithms for spelling error correction in Kazakh texts is considered. As a result of the research work, algorithms for correcting errors found in Kazakh language texts were developed, and the developed algorithms were included in the Damerau-Levenshtein method. The experimental testing results of the augmented Damerau- Levenshtein method showed 97.2 % accuracy in correcting specific errors found only in Kazakh words and 92.8 % accuracy in correcting common errors from letter symbols. The standard Damerau-Levenshtein method testing results showed 76.4 % accuracy in correcting specific errors found only in Kazakh words. The results of the tests in correcting common errors from letter symbols with the standard Damerau-Levenshtein were approximately the same with the augmented Damerau-Levenshtein method, the accuracy is 92.2 %. The extent and conditions of practical application of the results are implemented by including them in text editors, messengers, e-mails and similar applications that work with text data. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
34. Weighted edit distance optimized using genetic algorithm for SMILES-based compound similarity.
- Author
-
Choi, In-Hyuk and Oh, Il-Seok
- Subjects
- *
GENETIC algorithms , *MOLECULAR structure , *GENETIC distance , *SMILING - Abstract
A method for developing new drugs is the ligand-based approach, which requires intermolecular similarity computation. The simplified molecular input line entry system (SMILES) is primarily used to represent the molecular structure in one dimension. It is a representation of molecular structure; the properties can be completely different even if only one character is changed. Applying the conventional edit distance method makes it difficult to obtain optimal results, because the insertion, deletion, and substitution of molecules are considered the same in calculating the distance. This study proposes a novel edit distance using an optimal weight set for three operations. To determine the optimal weight set, we present a genetic algorithm with suitable hyperparameters. To emphasize the impact of the proposed genetic algorithm, we compare it with the exhaustive search algorithm. The experiments performed with four well-known datasets showed that the weighted edit distance optimized with the genetic algorithm resulted in an average performance improvement in approximately 20%. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
35. Hierarchical filtering: improving similar substring matching under edit distance.
- Author
-
Qiu, Tao, Zong, Chuanyu, Yang, Xiaochun, Wang, Bin, and Li, Bing
- Subjects
- *
FILTERS & filtration , *SCIENTIFIC community , *PROBLEM solving - Abstract
Similar substring matching, as an essential operation in applications including read mapping and text retrieval, has attracted significant attention in the research community. In this paper, we study the problem of similar substring matching with edit distance constraints. Existing methods generally utilize a filtering-and-verification framework to solve this problem – a filtering procedure is employed to reduce the searching space before going to a computationally expensive verification step, and the efficiency depends critically on balancing the cost of filtering and verification. The common filtering paradigm is based on the principle of Pigeonhole stating that a matching result must exactly match at least a certain number of substrings from the query, where the substrings act as a filter. However, the polynomial growth of filters caused by enlarging the number of substrings in filters, leading to the cost of filtering and verification is not well-balanced for the existing methods. To this end, we propose a novel filtering paradigm hierarchical filtering, aiming at achieving a fine-grained balance on the cost of filtering and verification. Unlike using a fixed number of substrings in a filter, our method allows the filters contain a different number of substrings that avoids the polynomial growth of filters. The filters are picked in accord with a scoring metric. We devise a tree-based filtering framework for hierarchical filtering. Also, the cost of filtering and verification is further reduced by eliminating the duplication of filters. Extensive experiments have been conducted on four real-world datasets by comparing to state-of-the-art methods Hobbes3, BWA, and BLAST, etc. The results show that our method outperforms the competing methods under a wide range of parameter settings. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
36. Fast Top-k Similar Sequence Search on DNA Databases
- Author
-
Yagi, Ryuichi, Shiokawa, Hiroaki, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Pardede, Eric, editor, Delir Haghighi, Pari, editor, Khalil, Ismail, editor, and Kotsis, Gabriele, editor
- Published
- 2022
- Full Text
- View/download PDF
37. Matching Patterns with Variables Under Edit Distance
- Author
-
Gawrychowski, Paweł, Manea, Florin, Siemer, Stefan, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Arroyuelo, Diego, editor, and Poblete, Barbara, editor
- Published
- 2022
- Full Text
- View/download PDF
38. Elastic-Degenerate String Matching with 1 Error
- Author
-
Bernardini, Giulia, Gabory, Esteban, Pissis, Solon P., Stougie, Leen, Sweering, Michelle, Zuba, Wiktor, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Castañeda, Armando, editor, and Rodríguez-Henríquez, Francisco, editor
- Published
- 2022
- Full Text
- View/download PDF
39. OCR Error Correction for Vietnamese OCR Text with Different Edit Distances
- Author
-
Nguyen, Quoc-Dung, Phan, Nguyet-Minh, Kromer, Pavel, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Barolli, Leonard, editor, and Miwa, Hiroyoshi, editor
- Published
- 2022
- Full Text
- View/download PDF
40. Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds
- Author
-
Ivanov, Pesho, Bichsel, Benjamin, Vechev, Martin, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, and Pe'er, Itsik, editor
- Published
- 2022
- Full Text
- View/download PDF
41. Co-linear Chaining with Overlaps and Gap Costs
- Author
-
Jain, Chirag, Gibney, Daniel, Thankachan, Sharma V., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, and Pe'er, Itsik, editor
- Published
- 2022
- Full Text
- View/download PDF
42. 基于改进编辑距离的军事领域实体链接.
- Author
-
夏旭东 and 于荣欢
- Abstract
In order to accurately link the entity references in the commander’s demand statement to the standardized entity nodes in the knowledge graph, an entity linking method in the military domain based on improved edit distance is proposed. By summarizing the non-standard forms of entity referencing to establish an index, and using the BM25 model to integrate the improved edit distance algorithm that considers the character position exchange, inclusion similarity and other similarities, the entities to be linked are sorted to achieve the link. The experimental results show that the entity linking method in the military field can effectively improve the matching accuracy in similarity calculation. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
43. An easy way to improve scoring of memory span tasks: The edit distance, beyond "correct recall in the correct serial position".
- Author
-
Gonthier, Corentin
- Subjects
- *
RECOLLECTION (Psychology) , *MEMORY span , *PSYCHOMETRICS , *ITEM response theory , *VERBAL memory , *GENOME editing - Abstract
For researchers and psychologists interested in estimating a subject's memory capacity, the current standard for scoring memory span tasks is the partial-credit method: subjects are credited with the number of stimuli that they manage to recall correctly in the correct serial position. A critical issue with this method, however, is that intrusions and omissions can radically change the scores depending on where they occur. For example, when recalling the sequence ABCDE, "ABCD" is worth 4 points but "BCDE" is worth 0 points. This paper presents an improved scoring method based on the edit distance, meaning the number of changes required to edit the recalled sequence into the target. Edit-distance scoring gives results close to partial-credit scoring, but without the corresponding vulnerability to positional shifts. A reanalysis of memory performance in two large datasets (N = 1093 and N = 758) confirms that in addition to being more logically consistent, edit-distance scoring demonstrates similar or better psychometric properties than partial-credit, with comparable validity, a small increase in reliability, and a substantial increase of test information (measurement precision in the context of item response theory). Test information was especially improved for harder items and for subjects with ability in the lower range, whose scores tend to be severely underestimated by partial-credit scoring. Code to compute edit-distance scores with various software is made available at https://osf.io/wdb83/. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
44. Adaptation Rates of West Old Turkic Loanwords in Hungarian: A Quantitative Study.
- Author
-
Yalçınkaya, Ali Can, Parapatics, Andrea, and Szentgyörgyi, Szilárd
- Subjects
LOANWORDS ,QUANTITATIVE research ,TURKIC languages ,ASSIMILATION (Phonetics) ,PHONEME (Linguistics) ,LINGUISTIC typology - Abstract
Copyright of Turkish Studies - Language & Literature is the property of Electronic Turkish Studies and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2023
- Full Text
- View/download PDF
45. Authenticating q -Gram-Based Similarity Search Results for Outsourced String Databases.
- Author
-
Yang, Liangyong, Ye, Haizhou, Liu, Xuyang, Mao, Yijun, and Zhang, Jilian
- Subjects
- *
SEARCH engines , *LOCATION-based services , *DATABASES , *PATTERN matching - Abstract
Approximate string searches have been widely applied in many fields, such as bioinformatics, text retrieval, search engines, and location-based services (LBS). However, the approximate string search results from third-party servers may be incorrect due to the possibility of malicious third parties or compromised servers. In this paper, we design an authenticated index structure (AIS) for string databases, which is based on the Merkle hash tree (MHT) method and the q-gram inverted index. Our AIS can facilitate verification object (VO) construction for approximate string searches with edit distance thresholds. We design an efficient algorithm named GS2 for VO construction at the server side and search result verification at the user side. We also introduce an optimization method called GS2-opt that can reduce VO size dramatically. Finally, we conduct extensive experiments on real datasets to show that our proposed methods are efficient and promising. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
46. Privacy-preserving Hamming and Edit Distance Computation and Applications
- Author
-
DOU Jia-wei
- Subjects
smc ,hamming distance ,edit distance ,semi-honest model ,simulation paradigm ,Computer software ,QA76.75-76.765 ,Technology (General) ,T1-995 - Abstract
With the rapid development of information technology,privacy-preserving multiparty cooperative computation is becoming more and more popular.Secure multiparty computation is a key technology to address such problems.In scientific research and practical applications,people measure the similarity of two strings with Hamming(edit) distance.It is of great significance to study privacy-preserving Hamming(edit) distance computation.This paper studies privacy-preserving Hamming(edit) distance computation.First,we transform Hamming distance computation to inner product computation of vectors,and then use Okamoto-Uchiyama(OU) cryptosystem and encryption-and-choose technique to design protocol for Hamming distance.Second,we give each alphabet of a string a number,transform edit distance to determine whether the difference of the number of two alphabets is 0,and use OU cryptosystem to design a privacy-preserving edit distance computation protocol.The security of the protocol is strictly proved,the computational complexity of the protocol is analyzed,the actual implementation efficiency of the protocol is tested and compared with the existing results.Theoretical analysis and experimental results show that our protocols are efficient.
- Published
- 2022
- Full Text
- View/download PDF
47. Towards efficient top-k fuzzy auto-completion queries
- Author
-
Magdy AbdelNaby, Mohamed E. Khalefa, Yousry Taha, and Ahmed Hassan
- Subjects
Fuzzy string similarity ,Preference-aware query processing ,Edit distance ,Engineering (General). Civil engineering (General) ,TA1-2040 - Abstract
Finding relevant objects in a large repository is a fundamental research problem occurring in many applications, such as: data cleaning, data integration, web search, and information retrieval. Instant type-ahead fuzzy search, where user types her query character by character and find the top-k relevant objects, has become widely involved in many applications because it provides the users with rapid response results and improves the user’s experience. The state-of-the-art algorithms are generally inefficient due to their breadth first search algorithm that results in repeated computations.To this end, we propose a novel depth-oriented instant type-ahead fuzzy search algorithm, that largely avoids repeated computations. The efficiency and effectiveness of the proposed approach are empirically demonstrated using real-world datasets. Experimental results show that our approach is 5–10 times faster than state-of-the-art approaches.
- Published
- 2022
- Full Text
- View/download PDF
48. Application of bioinformatics methods to recognition of network threats
- Author
-
Adam Kozakiewicz, Anna Felkner, Piotr Kijewski, and Tomasz Jordan Kruk
- Subjects
network threat analysis ,sequence alignment ,edit distance ,bioinformatics ,Telecommunication ,TK5101-6720 ,Information technology ,T58.5-58.64 - Abstract
Bioinformatics is a large group of methods used in biology, mostly for analysis of gene sequences. The algorithms developed for this task have recently found a new application in network threat detection. This paper is an introduction to this area of research, presenting a survey of bioinformatics methods applied to this task, outlining the individual tasks and methods used to solve them. It is argued that the early conclusion that such methods are ineffective against polymorphic attacks is in fact too pessimistic.
- Published
- 2023
- Full Text
- View/download PDF
49. Implementation of an automated grading tool for phonetic transcription training.
- Author
-
Speights Atkins, Marisha, Bailey, Dallin J., and Seals, Cheryl D.
- Subjects
- *
STATISTICS , *INTERNET , *COMPUTER assisted instruction , *LEARNING strategies , *SURVEYS , *PHONETICS , *DESCRIPTIVE statistics , *RESEARCH funding , *TECHNOLOGY , *DATA analysis software , *DATA analysis , *ALGORITHMS - Abstract
Clinical phonetic transcription is regarded as a highly specialised skill requiring hours of practice for mastery. Although this skill is a critical part of students' clinical preparation to become speech-language pathologists, students often report feeling unprepared to apply the skill in clinical practice. Previous studies suggest that increased opportunities for practice and timely feedback on transcriptions are needed in order to develop skill confidence. However, providing more opportunities for practice can be impeded by the limited resources to manage the grading of additional assignments. The purpose of this study is to show the implementation of a web-based learning management system (LMS) designed in our labs for phonetics instruction. The Automated Phonetic Transcription Grading Tool (APTgt LMS) was developed to provide a platform for assignment delivery and automated grading of transcription assignments. The APTgt LMS has three embedded IPA keyboards (basic, advanced, and full IPA) and an automated edit distance algorithm modified by phonetic alignment principles, which allows for individualised scoring and visual course-level feedback in an interactive online environment. For pilot testing, student confidence was queried before and after practice opportunities using APTgt. A concurrent mixed methods research design was used to analyse four Likert scale and three open-ended questions. Student confidence in transcribing disordered speech was found to significantly increase (p <0.001) following additional practice. Students reported concerns related to accurate transcription of disordered speech and that additional practice is still needed. Tools like APTgt can aid in facilitating student learning and increasing student confidence in applied transcription. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
50. Hybrid Tamil spell checker with combined character splitting.
- Author
-
Sampath, Anbukkarasi and Shanmugavel, Varadhaganapathy
- Subjects
SOCIAL media ,LONG short-term memory ,NATURAL language processing ,SPELLING errors ,CHINESE language ,SPEECH perception - Abstract
Summary: Spell checker is the application, which helps in finding the spelling errors in a given text. Applications like word processors, mails, search engines, speech recognition and social media forums need these kinds of spell checking tools to increase the correctness of the system. Spell checking is completely implemented in languages such as English, French, and Chinese. But as far as Indian regional languages is concerned, very few works have been carried out, that too partially. Tamil is one such Indian regional language, which requires a fully implemented spell checking application as many people started using this language in social media platforms like Facebook and Twitter. Spelling errors fall on different categories in Tamil language, which involves Sandhi errors, Homophone errors (Mayangoli), and misspelt words error. To tackle all these errors, a new ensemble approach is proposed in this paper. The proposed approach consists of Levenshtein's edit distance algorithm, rule‐based algorithm, Soundex algorithm along with LSTM (Long Short Term Memory) model. We have used a special feature called combine character splitting of Tamil alphabets for feeding the LSTM model to improve the performance of the system. Proposed system produced an accuracy of 95.67%, which is approved by the Tamil scholar. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.