955 results for '"String metric"'
Search Results
2. NEMA: Automatic Integration of Large Network Management Databases
- Author
-
Narendra Anand, Jiangtao Yin, Fubao Wu, Han Hee Song, Mario Baldi, and Lixin Gao
- Subjects
Database Matching ,Graph Database ,Indexes ,Measurement ,Monitoring ,NEMA ,Network management ,Reliability ,Semantics ,FOS: Computer and information sciences ,Matching (statistics) ,Computer Science - Artificial Intelligence ,Computer Networks and Communications ,Computer science ,Reliability (computer networking) ,computer.software_genre ,Data type ,Field (computer science) ,Computer Science - Databases ,Electrical and Electronic Engineering ,Graph database ,Database ,business.industry ,Databases (cs.DB) ,Artificial Intelligence (cs.AI) ,String metric ,business ,computer ,Data integration - Abstract
Network management, whether for malfunction analysis, failure prediction, or performance monitoring and improvement, generally involves large amounts of data from different sources. To effectively integrate and manage these sources, automatically finding semantic matches among their schemas or ontologies is crucial. Existing approaches to database matching fall mainly into two categories. One focuses on schema-level matching based on schema properties such as field names, data types, constraints, and schema structures. Network management databases contain massive numbers of tables (e.g., network products, incidents, security alerts, and logs) from different departments and groups with nonuniform field names and schema characteristics, so it is not reliable to match them by those schema properties. The other category is instance-level matching using general string similarity techniques, which are not applicable to the matching of large network management databases. In this paper, we develop a matching technique for large NEtwork MAnagement databases (NEMA) that deploys instance-level matching for effective data integration and connection. We design matching metrics and scores for both numerical and non-numerical fields and propose algorithms for matching these fields. The effectiveness and efficiency of NEMA are evaluated in experiments based on ground-truth field pairs in large network management databases. Our measurement on large databases with 1,458 fields, each of which contains over 10 million records, reveals that the accuracy of NEMA reaches up to 95%. It achieves 2%-10% higher accuracy and a 5x-14x speedup over baseline methods. Comment: 14 pages, 13 figures, 7 tables
- Published
- 2021
- Full Text
- View/download PDF
3. A Tiling Algorithm-Based String Similarity Measure
- Author
-
Peter Z. Revesz
- Subjects
General Computer Science ,Computer science ,General Engineering ,Measure (physics) ,String metric ,Algorithm - Abstract
This paper describes a similarity measure for strings based on a tiling algorithm. The algorithm is applied to pairs of proteins that are described by their respective amino acid sequences. The paper also describes how the algorithm can be used to find highly conserved amino acid sequences and examples of horizontal gene transfer between different species.
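The abstract does not spell out the tiling algorithm itself; the sketch below shows one plausible greedy variant (repeatedly extract the longest common substring as a tile, then mask it out) scored as twice the tiled length over the combined length. The function names and the `min_tile` cutoff are illustrative assumptions, not the paper's definitions.

```python
# Greedy tiling similarity sketch (illustrative, not the paper's exact
# algorithm): repeatedly take the longest common substring as a tile,
# mask it out, and score by the total tiled length.

def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest common run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] != '\0' and a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best

def tiling_similarity(s1, s2, min_tile=3):
    """Similarity in [0, 1]: twice the tiled length over total length."""
    a, b = list(s1), list(s2)
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    tiled = 0
    while True:
        i, j, length = longest_common_substring(a, b)
        if length < min_tile:
            break
        tiled += length
        for k in range(length):
            a[i + k] = b[j + k] = '\0'   # mask so tiles never overlap
    return 2 * tiled / total

# Toy amino-acid-like sequences.
print(tiling_similarity("MKVLAAGICQT", "MKVLGAGICQA"))
```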
- Published
- 2021
- Full Text
- View/download PDF
4. Spell corrector for Bangla language using Norvig’s algorithm and Jaro-Winkler distance
- Author
-
Tajbia Karim, Istiak Ahamed, Maliha Jahan, Selim Reza, Dilshad Ara Hossain, and Zarin Tasnim
- Subjects
Control and Optimization ,Grammar ,Computer Networks and Communications ,Computer science ,media_common.quotation_subject ,Spell ,Spelling ,language.human_language ,Bengali ,Hardware and Architecture ,Control and Systems Engineering ,Computer Science (miscellaneous) ,language ,Edit distance ,Jaro–Winkler distance ,Electrical and Electronic Engineering ,String metric ,Instrumentation ,Algorithm ,Word (computer architecture) ,Information Systems ,media_common - Abstract
In the online world, especially on social media platforms, most of us write without much regard for correct spelling and grammar, and spelling mistakes are proportionally far more common in the Bangla language. In this paper, we present a method for detecting and correcting errors in the spelling of Bangla words. Our system can detect a misspelled Bangla word and provide two services: suggesting correct spellings for the word, and correcting the word. We used Norvig's algorithm for this purpose, but instead of using word probabilities to prepare the suggestions and corrections, we used the Jaro-Winkler distance. Previous work in this field for the Bangla language is either very slow or less accurate. Our system achieved 97% accuracy when evaluated on 1000 Bangla words.
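The paper's Norvig-style candidate generation over a Bangla lexicon is not reproduced here; the sketch below is a self-contained Jaro-Winkler similarity of the kind used to rank suggestions, and it works on any Unicode string. The toy English word list stands in for a real Bangla dictionary.

```python
# Self-contained Jaro-Winkler similarity, the ranking measure the paper
# substitutes for Norvig's word probabilities.

def jaro(s, t):
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_chars = [c for i, c in enumerate(s) if s_matched[i]]
    t_chars = [c for j, c in enumerate(t) if t_matched[j]]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) // 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s, t, p=0.1, max_prefix=4):
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Rank dictionary words against a misspelling (toy English stand-in
# for a Bangla lexicon).
words = ["spelling", "spelled", "speaking"]
print(max(words, key=lambda w: jaro_winkler("speling", w)))
```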
- Published
- 2021
- Full Text
- View/download PDF
5. WEB APP: String Similarity Search - A Hash-based Approach
- Author
-
Snehal Bobhate
- Subjects
Information retrieval ,Computer science ,business.industry ,Hash function ,Web application ,String metric ,business - Abstract
In this project, we study string similarity search based on edit distance, which is supported by many database management systems such as Oracle and PostgreSQL. Given the edit distance ed(s, t) between two strings s and t, string similarity search finds every string t in a string database D that is similar to a query string s, i.e., such that ed(s, t) <= τ for a given threshold τ. In the literature, most existing work takes a filter-and-verify approach, where the filter step is introduced to reduce the high verification cost between two strings by utilizing an index built offline for D. The two state-of-the-art approaches are prefix filtering and local filtering. We propose two new hash-based labeling techniques, named OX label and XX label, for string similarity search. We assign a hash label H_s to a string s, and prune dissimilar strings by comparing the two hash labels H_s and H_t of strings s and t in the filter step. The key idea is to exploit the dissimilar bit patterns between two hash labels. Our hash-based approaches achieve high efficiency while keeping the index size and index construction time an order of magnitude smaller than existing approaches in our experiments.
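The abstract does not define the OX and XX label constructions, so the sketch below shows only the generic idea of hash-label filtering: OR each q-gram's hash into a fixed-width bit signature. Since τ edits disturb at most q·τ q-grams per string, popcount(H_s XOR H_t) > 2·q·τ can safely prune a pair (hash collisions only make the filter more permissive, never incorrect).

```python
# Generic bit-signature filter in the spirit of hash-label pruning
# (not the paper's actual OX/XX labels).

import hashlib

def signature(s, q=2, bits=64):
    """OR each q-gram's hash into a fixed-width bit signature."""
    h = 0
    for i in range(len(s) - q + 1):
        digest = hashlib.blake2b(s[i:i + q].encode(), digest_size=8).digest()
        h |= 1 << (int.from_bytes(digest, "big") % bits)
    return h

def maybe_similar(s, t, tau, q=2):
    """False only when ed(s, t) > tau is certain; True means 'verify'."""
    diff_bits = bin(signature(s, q) ^ signature(t, q)).count("1")
    return diff_bits <= 2 * q * tau

print(maybe_similar("postgres", "postgers", tau=2))  # True: goes to verification
print(maybe_similar("postgres", "oracle", tau=1))    # likely False: pruned
```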
- Published
- 2021
- Full Text
- View/download PDF
6. ProGOMap: Automatic Generation of Mappings From Property Graphs to Ontologies
- Author
-
Mohamed Hashem, Nagwa Badr, Walaa Gad, and Naglaa Fathy
- Subjects
Graph database ,Theoretical computer science ,General Computer Science ,Ontology learning ,Relational database ,Computer science ,General Engineering ,computer.file_format ,Ontology (information science) ,resource description framework ,computer.software_genre ,ontology alignment ,Domain (software engineering) ,Data modeling ,ontology engineering ,TK1-9971 ,graph model heterogeneity ,Property graph database ,General Materials Science ,Electrical engineering. Electronics. Nuclear engineering ,RDF ,String metric ,computer - Abstract
Property Graph databases (PGs) are emerging as efficient graph stores with flexible schemata. This raises the need for a unified view over the heterogeneous data produced from these stores. Ontology-Based Data Access (OBDA) has become the dominant approach to integrating heterogeneous data sources by providing a unified conceptual view (an ontology) over them. The cornerstone of any OBDA system is defining mappings between the data source and the target (domain) ontology. However, manual mapping generation is time-consuming and requires great effort. This paper proposes the ProGOMap (Property Graph to Ontology Mapper) system, which automatically generates mappings from property graphs to a domain ontology. ProGOMap starts by generating a putative ontology with direct axioms from the PG. A novel ontology learning algorithm is proposed to enrich the putative ontology with subclass axioms inferred from the PG. The putative ontology is then aligned to an existing domain ontology using string similarity metrics. Another algorithm is proposed to align object properties between the two ontologies, considering different modelling criteria. Finally, mappings are generated from the alignment results. Experiments were done on eight datasets with different scenarios to evaluate the effectiveness of the generated mappings. The experimental results achieved mapping accuracy of up to 97% and 81% when addressing PG-to-ontology terminological and structural heterogeneities, respectively. Ontology learning by inferring subclass axioms from a property graph helps to address the heterogeneity between the PG and ontology models.
- Published
- 2021
7. Arabic real time entity resolution using inverted indexing
- Author
-
Ghazi Al-Naymat, Banda Ramadan, and Marwah Alian
- Subjects
Space (punctuation) ,050101 languages & linguistics ,Linguistics and Language ,Matching (statistics) ,Arabic ,Computer science ,02 engineering and technology ,Library and Information Sciences ,computer.software_genre ,Inverted index ,Language and Linguistics ,Education ,Similarity (network science) ,0202 electrical engineering, electronic engineering, information engineering ,0501 psychology and cognitive sciences ,business.industry ,05 social sciences ,Search engine indexing ,Object (computer science) ,language.human_language ,language ,020201 artificial intelligence & image processing ,Artificial intelligence ,String metric ,Computational linguistics ,business ,computer ,Natural language processing - Abstract
Arabic datasets often contain two or more records for the same real-world entity (e.g. a person or an object); without any mechanism for detecting these duplicates, institutions suffer from low data quality and degraded performance. The operation that distinguishes records referring to the same real-world entity is called Entity Resolution (ER). It serves as a tool for linking records across databases as well as for matching query records against existing databases in real time. Indexing is a major step in the ER process that aims at reducing the search space. Several indexing techniques are available for use with the ER process for English databases; however, such techniques have not been validated for other languages, such as Arabic. The Dynamic Similarity-Aware Inverted Index (DySimII) is one of the indexing techniques utilized with dynamic databases to match query records in real time, and it has been demonstrated to work well with English. In this paper, we propose a framework—Arabic Real Time Entity Resolution (ARTER)—that uses DySimII with Arabic databases to perform real-time ER. We also examine different string similarity functions for comparing records in the matching process, with the aim of evaluating which similarity function is most suitable for comparing Arabic strings. A real-world Arabic database is used for our experimental evaluation, in which two stemmers and three similarity functions are tested for their effect on DySimII with an Arabic dataset. The results show that matching accuracy is improved using the Asem stemmer as the number of corrupted attributes increases; testing the three similarity functions shows that the Winkler similarity function provides better matching accuracy, while N-gram provides better results when used with the Asem stemmer.
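The DySimII structure itself is more elaborate (dynamic and similarity-aware); the sketch below shows only the basic mechanism such an index exploits: a q-gram inverted index retrieves candidate records that share grams with the query, and only those candidates are compared with a similarity function. The Arabic names, threshold, and choice of `difflib` ratio are illustrative assumptions.

```python
# Toy inverted-index candidate retrieval for real-time matching
# (illustrative; not the actual DySimII data structure).

from collections import defaultdict
import difflib

def qgrams(s, q=2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # q-gram -> record ids
        self.records = {}

    def insert(self, rid, value):
        self.records[rid] = value
        for g in qgrams(value):
            self.postings[g].add(rid)

    def query(self, value, threshold=0.6):
        # Only records sharing at least one q-gram become candidates.
        candidates = set()
        for g in qgrams(value):
            candidates |= self.postings.get(g, set())
        return [(rid, self.records[rid]) for rid in candidates
                if difflib.SequenceMatcher(
                    None, value, self.records[rid]).ratio() >= threshold]

idx = InvertedIndex()
for rid, name in enumerate(["محمد علي", "محمود علي", "سارة حسن"]):
    idx.insert(rid, name)
print(idx.query("محمد عالي"))   # a misspelled query still finds matches
```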
- Published
- 2020
- Full Text
- View/download PDF
8. Handling data-skewness in character based string similarity join using Hadoop
- Author
-
Devendra K. Tayal, Oscar Castillo, Amita Jain, and Kanak Meena
- Subjects
Theoretical computer science ,Zipf's law ,Computer science ,Joins ,02 engineering and technology ,Computer Science Applications ,Set (abstract data type) ,Similarity (network science) ,Skewness ,020204 information systems ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,Join (sigma algebra) ,020201 artificial intelligence & image processing ,String metric ,Software ,Information Systems - Abstract
The scalability of similarity joins is threatened by the unexpected data characteristic of data skewness, a pervasive problem in scientific data. Skewness produces an uneven distribution of attribute values, which can cause a severe load-imbalance problem, and the effect is amplified when database join operations are applied to such datasets. All the algorithms developed to date for implementing database joins are highly skew-sensitive. This paper presents a new approach for handling data skewness in a character-based string similarity join using the MapReduce framework. In the literature, no work exists to handle data skewness in character-based string similarity joins, although work on set-based string similarity joins exists. The proposed work is divided into three stages, and every stage is further divided into mapper and reducer phases dedicated to specific tasks. The first stage finds the lengths of strings in the dataset. For valid candidate-pair generation, the MR-Pass Join framework is suggested in the second stage. MRFA concepts are incorporated for the string similarity join, named "MRFA-SSJ" (MapReduce Frequency Adaptive - String Similarity Join), in the third stage, which is further divided into four MapReduce phases. Hence, MRFA-SSJ is proposed to handle skewness in the string similarity join. The experiments were run on three datasets, namely DBLP, a query log, and a real dataset of IP addresses and cookies, by deploying the Hadoop framework. The proposed algorithm was compared with three known algorithms, and all of these algorithms fail when data is highly skewed, whereas our proposed method handles highly skewed data without any problem. A 15-node cluster was used in this experiment, and the Zipf distribution is used to control the skewness factor. A comparison among existing and proposed techniques shows that existing techniques survive up to a Zipf factor of 0.5, whereas the proposed algorithm survives up to a Zipf factor of 1. Hence, the proposed algorithm is skew-insensitive and ensures scalability with a reasonable query-processing time for string similarity database joins. It also ensures an even distribution of attribute values.
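The paper's stages run as Hadoop jobs; the sketch below is a single-process stand-in for the first stage only (computing string lengths) together with the standard length filter that motivates it. MR-Pass Join and the MRFA-SSJ phases are not reproduced.

```python
# Toy single-process mimic of the first MapReduce stage: key every
# string by its length, shuffle, and collect the length groups.

from collections import defaultdict

def mapper(record):
    rid, s = record
    yield len(s), (rid, s)

def run_stage(records):
    shuffled = defaultdict(list)          # simulated shuffle by key
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return dict(shuffled)                 # "reducer": one group per length

records = [(0, "hadoop"), (1, "hadopo"), (2, "spark"), (3, "hive")]
groups = run_stage(records)

# Candidate pairs are generated only across length groups within tau,
# since ed(s, t) <= tau implies |len(s) - len(t)| <= tau.
tau = 1
for l1 in groups:
    for l2 in groups:
        if l1 <= l2 <= l1 + tau:
            print((l1, l2), groups[l1], groups[l2])
```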
- Published
- 2020
- Full Text
- View/download PDF
9. The Challenge of Pairing Big Datasets: Probabilistic Record Linkage Methods and Diagnosis of Their Empirical Viability
- Author
-
Lucas Ferreira Mation and Yaohao Peng
- Subjects
Economics and Econometrics ,Computational complexity theory ,Computer science ,business.industry ,Big data ,Probabilistic logic ,Machine learning ,computer.software_genre ,Code (cryptography) ,Artificial intelligence ,Statistics, Probability and Uncertainty ,Business and International Management ,String metric ,business ,Heuristics ,Implementation ,computer ,Finance ,Record linkage - Abstract
In this paper, we evaluated the predictive performance of probabilistic record linkage algorithms, discussing the implications of different configurations of blocking keys, string similarity functions, and phonetic codes on the predictions' overall performance and computational complexity. Furthermore, we carried out a bibliographical survey of the main deterministic and probabilistic record linkage methods, of recent advances combining machine learning techniques, and of the main packages and implementations available in the open-source R language. The results can provide heuristics for problems of administrative-record integration at the national level and have potential value for the formulation and evaluation of public policies.
- Published
- 2020
- Full Text
- View/download PDF
10. A self-verifying clustering approach to unsupervised matching of product titles
- Author
-
Leonidas Akritidis, Athanasios Fevgas, Christos Makris, and Panayiotis Bozanis
- Subjects
Linguistics and Language ,Matching (statistics) ,Process (engineering) ,Computer science ,Volume (computing) ,02 engineering and technology ,computer.software_genre ,Language and Linguistics ,Data point ,Similarity (network science) ,Artificial Intelligence ,020204 information systems ,Product (mathematics) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,String metric ,Cluster analysis ,computer - Abstract
The continuous growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities onto the Web, the volume and diversity of product-related information increase quickly. These factors make it difficult for users to identify and compare the features of their desired products. Recent studies showed that standard similarity metrics cannot effectively identify identical products, since similar titles often refer to different products and vice versa. Other studies employ external data sources to enrich the titles; these solutions are rather impractical, since the process of fetching external data is inefficient. In this paper, we introduce UPM, an unsupervised algorithm for matching products by their titles that is independent of any external sources. UPM consists of three stages. During the first stage, the algorithm analyzes the titles and extracts combinations of words from them. These combinations are evaluated in stage 2 according to several criteria, and the most appropriate of them are selected to form the initial clusters. The third phase is a post-processing verification stage that refines the initial clusters by correcting erroneous matches. This stage is designed to operate in combination with all clustering approaches, especially when the data possess properties that prevent the co-existence of two data points within the same cluster. The experimental evaluation of UPM on multiple datasets demonstrates its superiority over state-of-the-art clustering approaches and string similarity metrics, in terms of both efficiency and effectiveness.
- Published
- 2020
- Full Text
- View/download PDF
11. Time and Space Efficient Large Scale Link Discovery using String Similarities
- Author
-
George A. Vouros and Andreas Karampelas
- Subjects
Matching (statistics) ,Algebra and Number Theory ,Triangle inequality ,Computer science ,String (computer science) ,String searching algorithm ,computer.software_genre ,Theoretical Computer Science ,Computational Theory and Mathematics ,Metric (mathematics) ,Edit distance ,Data mining ,Pruning (decision trees) ,String metric ,computer ,Information Systems - Abstract
This paper proposes and evaluates time- and space-efficient methods for discovering links between matching entities in large datasets, using state-of-the-art methods for measuring edit distance as a string similarity metric. The paper proposes and compares three filtering methods that build on a basic blocking technique to organize the target dataset, facilitating efficient pruning of dissimilar pairs. The proposed filtering methods are compared in terms of runtime and memory usage: the first method exploits the blocking structure using the triangle inequality in conjunction with the substring-matching criterion; the second method uses only the substring-matching criterion; the third method uses the substring-matching criterion in conjunction with the frequency-matching criterion. Evaluation results show the pruning power of the different criteria used, also in comparison to the string-matching functionality provided in LIMES and SILK, which are state-of-the-art tools for large-scale link discovery.
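Because edit distance is a metric, distances to a shared pivot bound the distance between any two strings; a minimal sketch of that pruning idea follows (the paper's blocking structure and the other two criteria are omitted, and the pivot choice is an assumption).

```python
# Triangle-inequality pruning sketch: |ed(q,p) - ed(p,t)| <= ed(q,t),
# so when the lower bound exceeds tau, t is pruned without computing
# the expensive distance ed(q, t).

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def link(query, targets, pivot, tau):
    d_qp = edit_distance(query, pivot)
    results = []
    for t, d_pt in targets:                  # d_pt precomputed offline
        if abs(d_qp - d_pt) > tau:
            continue                         # pruned by the lower bound
        if edit_distance(query, t) <= tau:
            results.append(t)
    return results

# Distances to the pivot are computed once when the dataset is indexed.
pivot = "amsterdam"
dataset = ["amsterdam", "amsterdm", "rotterdam", "berlin"]
targets = [(t, edit_distance(t, pivot)) for t in dataset]
print(link("amsterdan", targets, pivot, tau=1))
```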
- Published
- 2020
- Full Text
- View/download PDF
12. Optimized Signature Selection for Efficient String Similarity Search
- Author
-
Jongik Kim, Tae-Sun Chung, and Taegyoung Lee
- Subjects
partition signature scheme ,hierarchical tree index ,General Computer Science ,Query string ,Computer science ,Edit distance ,General Engineering ,02 engineering and technology ,string similarity search ,Tree (data structure) ,Similarity (network science) ,020204 information systems ,Node (computer science) ,optimized signature selection ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,020201 artificial intelligence & image processing ,General Materials Science ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,String metric ,lcsh:TK1-9971 ,Algorithm - Abstract
In this paper, we study the problem of string similarity search: retrieving from a database all strings similar to a query string within a given threshold, using edit distance to measure the similarity between strings. Many algorithms have been proposed under a filter-and-verification framework to solve this problem. To reduce the overhead of edit distance verification, it is crucial to efficiently generate a small number of candidates in the filtering phase. Recently, an index structure named HSTree has been proposed for efficiently generating candidate strings; it selects and utilizes HSTree nodes at a specific level calculated from the given threshold. In this paper, we observe that there are many alternative ways to select HSTree nodes, and we propose a novel technique that selects HSTree nodes in an optimized way based on this observation. We also propose a modified HSTree, named the threaded HSTree, which connects the inverted lists of an HSTree node to the inverted lists of its child nodes. With a threaded HSTree, we can reduce the overhead of index lookups in HSTree nodes while selecting optimal tree nodes. Experimental results show that the proposed technique significantly outperforms the existing technique using the HSTree.
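HSTree-style indexes build on the partition (pigeonhole) principle; the sketch below shows that principle in isolation, without the tree levels, inverted lists, or the paper's optimized node selection: split a string into τ+1 segments, and any string within edit distance τ must contain at least one segment verbatim.

```python
# Pigeonhole segment filter underlying partition-based indexes such as
# the HSTree: tau edits can touch at most tau of the tau+1 segments,
# so one segment survives untouched.

def segments(s, tau):
    """Split s into tau+1 contiguous segments of near-equal length."""
    n = tau + 1
    base, extra = divmod(len(s), n)
    segs, start = [], 0
    for k in range(n):
        length = base + (1 if k < extra else 0)
        segs.append(s[start:start + length])
        start += length
    return segs

def passes_filter(data_string, query, tau):
    """True: candidate, verify with edit distance. False: safe to prune."""
    return any(seg and seg in query for seg in segments(data_string, tau))

print(passes_filter("similarity", "simiIarity", tau=1))  # True -> verify
print(passes_filter("similarity", "metric", tau=1))      # False -> pruned
```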
- Published
- 2020
- Full Text
- View/download PDF
13. Building a Better SQL Automarker for Database Courses
- Author
-
Ilir Dema, Michael Liut, Muyu Wang, Naaz Sibia, and Carlos Aníbal Suárez
- Subjects
SQL ,Database ,Process (engineering) ,Computer science ,Aggregate (data warehouse) ,Public institution ,computer.software_genre ,Channel (programming) ,Similarity (psychology) ,ComputingMilieux_COMPUTERSANDEDUCATION ,String metric ,Grading (education) ,computer ,computer.programming_language - Abstract
This work introduces and demonstrates the viability of a novel SQL automarking tool ("SQAM") that: (1) provides a fair grade to the student, one which matches the student's effort and understanding of the course material; and (2) provides personalized feedback, allowing the student to remain engaged in the material and learn from their mistakes while still in that headspace. Additionally, we strive to ensure that our tool maintains the same standards (grade and feedback) that a highly qualified member of teaching staff would produce, so we compare and contrast our automarker's results with those of teaching assistants over several historic offerings of the same database course at a large, research-intensive public institution, while reducing the grading time and thus enabling the teaching staff to channel more time into instruction. Furthermore, we describe SQAM's design and our model, which applies the aggregate result of four different string similarity metrics to compute solution similarity, in conjunction with our discretization process, to fairly evaluate a student's submission. Our results show that SQAM produces grades very similar to those historically given by teaching assistants.
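The abstract does not name SQAM's four metrics, so the four below (token Jaccard, Ratcliff/Obershelp ratio, common-prefix overlap, q-gram Dice) are stand-ins chosen to illustrate the aggregate-then-discretize design; the grade cutoffs are likewise assumptions.

```python
# Aggregate several string similarities between student and reference
# SQL, then discretize the average into grade bins.

import difflib

def jaccard_tokens(a, b):
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B) if A | B else 1.0

def ratcliff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def prefix_overlap(a, b):
    n = min(len(a), len(b))
    same = next((i for i in range(n) if a[i] != b[i]), n)
    return same / max(len(a), len(b), 1)

def qgram_dice(a, b, q=3):
    A = {a[i:i + q] for i in range(len(a) - q + 1)}
    B = {b[i:i + q] for i in range(len(b) - q + 1)}
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 1.0

def grade(student_sql, reference_sql):
    s, r = student_sql.lower(), reference_sql.lower()
    score = sum(f(s, r) for f in
                (jaccard_tokens, ratcliff, prefix_overlap, qgram_dice)) / 4
    for cutoff, mark in [(0.9, "full"), (0.7, "partial"), (0.5, "minimal")]:
        if score >= cutoff:
            return mark, round(score, 2)
    return "zero", round(score, 2)

print(grade("SELECT name FROM students WHERE gpa > 3",
            "SELECT name FROM students WHERE gpa > 3.0"))
```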
- Published
- 2021
- Full Text
- View/download PDF
14. Fuzzy Classification of Multi-intent Utterances
- Author
-
Julia Taylor Rayz and Geetanjali Bihani
- Subjects
Fuzzy classification ,business.industry ,Computer science ,media_common.quotation_subject ,Fuzzy set ,02 engineering and technology ,Ambiguity ,computer.software_genre ,Fuzzy logic ,Class (biology) ,030507 speech-language pathology & audiology ,03 medical and health sciences ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,String metric ,0305 other medical science ,business ,computer ,Natural language ,Natural language processing ,Utterance ,media_common - Abstract
Current intent classification approaches assign binary intent class memberships to natural language utterances while disregarding the inherent vagueness in language and the corresponding vagueness in intent class boundaries. In this work, we propose a scheme to address the ambiguity in single-intent as well as multi-intent natural language utterances by creating degree memberships over fuzzified intent classes. To our knowledge, this is the first work to address and quantify the impact of the fuzzy nature of natural language utterances on intent category memberships. Additionally, our approach overcomes the sparsity of multi-intent utterance data for training classification models by using a small database of single-intent utterances to generate class memberships over multi-intent utterances. We evaluate our approach on two task-oriented dialog datasets, across different fuzzy membership generation techniques and approximate string similarity measures. Our results reveal the impact of lexical overlap between utterances of different intents, and of the underlying data distributions, on the fuzzification of intent memberships. Moreover, we evaluate the accuracy of our approach by comparing the defuzzified memberships to their binary counterparts, across different combinations of membership functions and string similarity measures.
- Published
- 2021
- Full Text
- View/download PDF
15. Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity
- Author
-
Bin Yu, Briton Park, Nicholas Altieri, John DeNero, and Anobel Y. Odisho
- Subjects
Pathology ,medicine.medical_specialty ,AcademicSubjects/SCI01060 ,Information extraction ,Computer science ,Health Informatics ,Research and Applications ,computer.software_genre ,Annotation ,Prior probability ,Machine learning ,medicine ,cancer ,natural language processing ,Lung ,business.industry ,Deep learning ,Natural language processing ,Lung Cancer ,Class (biology) ,Networking and Information Technology R&D ,pathology ,Artificial intelligence ,AcademicSubjects/SCI01530 ,String metric ,AcademicSubjects/MED00010 ,Transfer of learning ,business ,computer ,Natural language - Abstract
Objective: We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer-to-cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations, which give both location-based information and document-level labels for each pathology report. Materials and Methods: Our data consist of 250 pathology reports each for kidney, colon, and lung cancer, from 2002 to 2019, from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations, and compare HCTC and ZSS to the state of the art, including conventional machine learning methods as well as deep learning methods. Results: For our HCTC method, we see an improvement of up to 0.1 micro-F1 and 0.04 macro-F1 averaged across cancers and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancers and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations. Conclusions: Methods based on transfer learning across cancers and on augmenting information extraction with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.
- Published
- 2021
16. Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese
- Author
-
Richard Dobson and Hegler Tissot
- Subjects
Matching (statistics) ,Computer Networks and Communications ,Computer science ,Nearest neighbor search ,Similarity search ,Misspelt names of drugs ,Information Storage and Retrieval ,Health Informatics ,02 engineering and technology ,computer.software_genre ,lcsh:Computer applications to medicine. Medical informatics ,Medical Records ,Set (abstract data type) ,Similarity (network science) ,Phonetics ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Phonetic similarity ,Language ,Natural Language Processing ,Portugal ,business.industry ,Research ,String (computer science) ,Spelling ,Computer Science Applications ,Information extraction ,Pharmaceutical Preparations ,lcsh:R858-859.7 ,020201 artificial intelligence & image processing ,Artificial intelligence ,String metric ,business ,computer ,Natural language processing ,Algorithms ,Information Systems - Abstract
Background: There is an increasing amount of unstructured medical data that can be analysed for different purposes. However, information extraction from free-text data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text, coupled with a supporting dictionary. However, they are not rich enough to encode both typing and phonetic misspellings. Results: Experimental results showed that a joint string and language-dependent phonetic similarity is more accurate than traditional string distance metrics when identifying misspelt names of drugs in a set of medical records written in Portuguese. Conclusion: We present a hybrid approach to efficiently perform similarity matching that overcomes the loss of information inherent in using either exact-match search or string-based similarity search alone.
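A sketch of the hybrid idea follows. The paper's phonetic component is language-dependent (tuned to Portuguese); plain English Soundex stands in for it here, combined with a normalized string ratio under an assumed 50/50 weighting.

```python
# Hybrid string + phonetic similarity sketch (English Soundex as a
# stand-in for a Portuguese-specific phonetic encoder).

import difflib

SOUNDEX = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
           **dict.fromkeys("DT", "3"), "L": "4",
           **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(word):
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return "0000"
    code = word[0]
    prev = SOUNDEX.get(word[0], "")
    for c in word[1:]:
        digit = SOUNDEX.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":        # H/W do not separate duplicate codes
            prev = digit
    return (code + "000")[:4]

def hybrid_similarity(a, b, w_string=0.5):
    string_sim = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    phonetic_sim = 1.0 if soundex(a) == soundex(b) else 0.0
    return w_string * string_sim + (1 - w_string) * phonetic_sim

# A phonetic misspelling scores higher than string distance alone suggests.
print(hybrid_similarity("amoxicillin", "amoxycilin"))
```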
- Published
- 2019
- Full Text
- View/download PDF
17. Framework for syntactic string similarity measures
- Author
-
Pasi Fränti, Damien Hostettler, Najlah Gali, and Radu Mariescu-Istodor
- Subjects
0209 industrial biotechnology ,Matching (statistics) ,business.industry ,Computer science ,Semantic analysis (machine learning) ,String (computer science) ,General Engineering ,02 engineering and technology ,Similarity measure ,Document clustering ,computer.software_genre ,Automatic summarization ,Computer Science Applications ,020901 industrial engineering & automation ,Similarity (network science) ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,Question answering ,020201 artificial intelligence & image processing ,Artificial intelligence ,String metric ,business ,computer ,Natural language processing - Abstract
Similarity measures are an essential component of information retrieval, document clustering, text summarization, and question answering, among others. In this paper, we introduce a general framework of syntactic similarity measures for matching short text. We thoroughly analyze the measures by dividing them into three components: character-level similarity, string segmentation, and matching technique. Soft variants of the measures are also introduced. With the help of two existing toolkits (SecondString and SimMetric), we provide an open-source Java toolkit of the proposed framework, which integrates the individual components so that completely new combinations can be created. Experimental results reveal that the performance of the similarity measures depends on the type of dataset. For well-maintained datasets, using a token-level measure is important, but the basic (crisp) variant is usually enough. For uncontrolled datasets where typing errors are expected, the soft variants of the token-level measures are necessary. Among all tested measures, a soft token-level measure that combines set matching and q-grams at the character level performs best. A gap between human perception and syntactic measures still remains due to the lack of semantic analysis.
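The best-performing family combines token-level set matching with character-level q-grams; the minimal soft-Jaccard sketch below illustrates that combination (the threshold theta and q are illustrative, not the paper's tuned values).

```python
# Soft token-level measure: tokens "match" when their character-level
# q-gram similarity clears a threshold, then score as Jaccard.

def qgram_sim(a, b, q=2):
    A = {a[i:i + q] for i in range(len(a) - q + 1)}
    B = {b[i:i + q] for i in range(len(b) - q + 1)}
    return len(A & B) / len(A | B) if A | B else float(a == b)

def soft_jaccard(s, t, theta=0.6):
    S, T = s.lower().split(), t.lower().split()
    used, matches = set(), 0
    for a in S:
        sim, j = max(((qgram_sim(a, b), j) for j, b in enumerate(T)
                      if j not in used), default=(0.0, -1))
        if sim >= theta:
            matches += 1
            used.add(j)          # each target token matches at most once
    return matches / (len(S) + len(T) - matches) if S or T else 1.0

# The typo "galaxi" still matches "galaxy", unlike with crisp tokens.
print(soft_jaccard("samsung galaxy note", "samsung galaxi note"))
```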
- Published
- 2019
- Full Text
- View/download PDF
18. A novel information hiding scheme based on social networking site viewers’ public comments
- Author
-
Danish Ali Khan, Dilip Kumar Yadav, and Susmita Mahato
- Subjects
Traffic analysis ,Information retrieval ,Steganography ,Computer Networks and Communications ,Computer science ,020206 networking & telecommunications ,02 engineering and technology ,Huffman coding ,symbols.namesake ,Information hiding ,0202 electrical engineering, electronic engineering, information engineering ,symbols ,020201 artificial intelligence & image processing ,Communication source ,String metric ,Safety, Risk, Reliability and Quality ,Set (psychology) ,Software ,Sentence - Abstract
In earlier chat-based steganography techniques, only direct communication between the sender and receiver was considered, which can raise suspicion and prompt further investigation by an attacker. If the adversary traces the communication between the two, he may investigate the data shared during communication, alter the content, or destroy it. In such a situation, a steganography system is required that can bypass adverse attention through indirect rather than direct communication. In this paper, we propose a new framework to camouflage hidden communication between the transceivers. The framework is based on communication via online social networking and video-sharing websites. The secret message is communicated using the comment features of these websites, in a way entirely new compared with earlier approaches. A stego-comment is generated by performing Huffman-code-based synonym substitution on the auto-summarized previous comments of a chosen post. The stego-comment does not raise any suspicion, being technically similar to other comments that may not carry any hidden message, which is what makes this method successful. Similarity indices of the stego-comment with respect to the other comments are calculated for one sample dataset using a string similarity tool that applies fuzzy comparison functions between strings. The proposed method gives an average bit rate (as a measure of embedding efficiency) of 9.04 bits per sentence over a set of five different case studies, which is high compared with the average bit rates found in the literature. The communication cannot easily be traced through traffic analysis, owing to the absence of any direct communication between the communicators.
- Published
- 2019
- Full Text
- View/download PDF
19. Towards a unified framework for string similarity joins
- Author
-
Pengfei Xu, Jiaheng Lu, Department of Computer Science, and Unified DataBase Management System research group / Jiaheng Lu
- Subjects
Structure (mathematical logic) ,Theoretical computer science ,Computer science ,String (computer science) ,General Engineering ,Joins ,02 engineering and technology ,Similarity measure ,113 Computer and information sciences ,string processing ,Similarity (network science) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Feature (machine learning) ,020201 artificial intelligence & image processing ,String metric ,Time complexity ,database - Abstract
A similarity join aims to find all similar pairs between two collections of records. Established algorithms utilise different similarity measures, either syntactic or semantic, to quantify the similarity between two records. However, when records are similar in the form of a mixture of syntactic and semantic relations, utilising a single measure becomes inadequate to disclose the real similarity between records, and hence unable to obtain high-quality join results. In this paper, we study a unified framework to find similar records by combining multiple similarity measures. To achieve this goal, we first develop a new similarity framework that unifies the existing three kinds of similarity measures simultaneously: syntactic (typographic) similarity, synonym-based similarity, and taxonomy-based similarity. We then theoretically prove that finding the maximum unified similarity between two strings is generally NP-hard, and furthermore develop an approximate algorithm which runs in polynomial time with a non-trivial approximation guarantee. To support efficient string joins based on our unified similarity measure, we adopt the filter-and-verification framework and propose a new signature structure, called pebble, which can be simultaneously adapted to handle multiple similarity measures. The salient feature of our approach is that it can judiciously select the best pebble signatures and the overlap thresholds to maximise the filtering power. Extensive experiments show that our methods are capable of finding similar records having mixed types of similarity relations, while exhibiting high efficiency and scalability for similarity joins. The implementation can be downloaded at https://github.com/HY-UDBMS/AU-Join.
- Published
- 2019
- Full Text
- View/download PDF
20. Balance-aware distributed string similarity-based query processing system
- Author
-
Zhifeng Bao, Ji Sun, Zeyuan Shang, Guoliang Li, and Dong Deng
- Subjects
SQL ,Computer science ,General Engineering ,02 engineering and technology ,computer.software_genre ,Query optimization ,Similarity (network science) ,020204 information systems ,Spark (mathematics) ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,String metric ,computer ,computer.programming_language ,DIMA ,Data integration - Abstract
Data analysts spend more than 80% of their time on data cleaning and integration in the whole process of data analytics, due to data errors and inconsistencies. Similarity-based query processing is an important way to tolerate such errors and inconsistencies; however, it is rather costly, and traditional databases cannot afford such an expensive requirement. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports four core similarity operations, i.e., similarity selection, similarity join, top-k selection, and top-k join, and extends SQL so that users can easily invoke these similarity-based operations in their data analysis tasks. To avoid expensive data transmission in a distributed environment, we propose balance-aware signatures, where two records are similar if they share common signatures, and we can adaptively select the signatures to balance the workload. Dima builds signature-based global and local indexes to support similarity operations. Since Spark is one of the widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support complex similarity-based query processing on large-scale datasets. We have conducted extensive experiments on four real-world datasets. Experimental results show that Dima outperforms state-of-the-art studies by 1-3 orders of magnitude and has good scalability.
- Published
- 2019
- Full Text
- View/download PDF
21. Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)
- Author
-
Ravi Shankar Mishra, Kartik Mehta, and Nikhil Rasiwasia
- Subjects
FOS: Computer and information sciences ,Normalization (statistics) ,Matching (statistics) ,Computer Science - Machine Learning ,Jaccard index ,Computer Science - Computation and Language ,Computer science ,business.industry ,Pattern recognition ,Approximate string matching ,Machine Learning (cs.LG) ,Similarity (network science) ,Word2vec ,Canonical form ,Artificial intelligence ,String metric ,business ,Computation and Language (cs.CL) - Abstract
In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (referred to as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that cosine similarity leads to the best results, showing a 2.7% improvement over the commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization, as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly at distinguishing between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss, and we propose an embedding learning task leveraging raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised-embedding-based techniques for attribute normalization. Experiments on a real-world attribute normalization dataset of 50 attributes show that the embeddings trained using our proposed approach obtain a 2.3% improvement over the best string matching and a 19.3% improvement over the best unsupervised embeddings. Comment: Accepted at the ECNLP workshop of ACL-IJCNLP 2021 (https://sites.google.com/view/ecnlp)
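The paper's finding that cosine similarity is the strongest syntactic matcher can be illustrated with character n-gram count vectors; the learned twin-network embeddings are a separate component and are not sketched here. The padding and n=3 are assumptions.

```python
# Cosine similarity over character n-gram count vectors, used to map a
# raw attribute value to its nearest canonical form.

from collections import Counter
import math

def ngram_vector(s, n=3):
    s = f" {s.lower()} "                 # pad so edges form n-grams too
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(u, v):
    dot = sum(u[g] * v[g] for g in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def normalize(raw_value, canonical_values):
    vec = ngram_vector(raw_value)
    return max(canonical_values, key=lambda c: cosine(vec, ngram_vector(c)))

print(normalize("Win 10 Pro", ["Windows 10", "Windows 8", "macOS"]))
```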
- Published
- 2021
22. PolyU-CBS at the FinSim-2 Task: Combining Distributional, String-Based and Transformers-Based Features for Hypernymy Detection in the Financial Domain
- Author
-
Emmanuele Chersoni and Chu-Ren Huang
- Subjects
Finance ,Computer science ,business.industry ,Simple (abstract algebra) ,String (computer science) ,Classifier (linguistics) ,String metric ,business ,Word (computer architecture) ,Transformer (machine learning model) ,Task (project management) ,Domain (software engineering) - Abstract
In this contribution, we describe the systems presented by the PolyU CBS Team at the second Shared Task on Learning Semantic Similarities for the Financial Domain (FinSim-2), where participating teams had to identify the right hypernyms for a list of target terms from the financial domain. For this task, we ran our classification experiments with several distributional, string-based, and Transformer features. Our results show that a simple logistic regression classifier, when trained on a combination of word embeddings, semantic and string similarity metrics and BERT-derived probabilities, achieves a strong performance (above 90%) in financial hypernymy detection.
- Published
- 2021
- Full Text
- View/download PDF
23. New approach to the study of nonrelativistic bosonic string in flat spacetime
- Author
-
Sk. Moinuddin, Rabin Banerjee, and Pradip Mukherjee
- Subjects
Physics ,010308 nuclear & particles physics ,Constraint analysis ,01 natural sciences ,High Energy Physics::Theory ,Theoretical physics ,symbols.namesake ,0103 physical sciences ,Homogeneous space ,Minkowski space ,symbols ,Embedding ,String metric ,010306 general physics ,Hamiltonian (quantum mechanics) - Abstract
A new approach to the study of nonrelativistic bosonic string in flat spacetime is introduced, based on a holistic Hamiltonian analysis of the minimal action for the string. This leads to a structurally new form of the action which is, however, equivalent to the known results since, under appropriate limits, it interpolates between the minimal action (Nambu-Goto type) where the string metric is taken to be that induced by the embedding and the Polyakov type of action where the world sheet metric components are independent fields. The equivalence among different actions is established by a detailed study of symmetries using constraint analysis. Various vexing issues in the existing literature are clarified. The interpolating action mooted here is shown to reveal the geometry of the string and may be useful in analyzing nonrelativistic string coupled with curved background.
- Published
- 2021
- Full Text
- View/download PDF
24. Jodani: A spell checking and suggesting tool for Gujarati language
- Author
-
Bankim Patel, Kalpesh Lad, and Himadri Patel
- Subjects
Root (linguistics) ,Machine translation ,business.industry ,Computer science ,String (computer science) ,Sentiment analysis ,Spell ,String searching algorithm ,computer.software_genre ,Automatic summarization ,Artificial intelligence ,String metric ,business ,computer ,Natural language processing - Abstract
A spell checker is used in the pre-processing phase of Natural Language Processing systems for applications like opinion mining, text summarization, machine translation, chatbots, etc. A traditional spell-checker tool compares the inputted string with an available dictionary of correct words; because it works on string-matching concepts, it is not able to handle inflected words. So, there is a need for a spell checker that overcomes this limitation. In this paper, the Gujarati spell-checker tool Jodani is proposed, which works on root words, uses string similarity measures to identify wrongly spelled words, and attempts to auto-correct the word or suggest syntactically relevant words.
- Published
- 2021
- Full Text
- View/download PDF
25. TabbyLD: A Tool for Semantic Interpretation of Spreadsheets Data
- Author
-
Aleksandr Yu. Yurin and Nikita O. Dorodnykh
- Subjects
Entity linking ,Information retrieval ,Semantic similarity ,Computer science ,Semantic interpretation ,Similarity (psychology) ,Context (language use) ,Linked data ,String metric ,Semantic Web - Abstract
Spreadsheets are one of the most convenient ways to structure and represent statistical and other data. Consequently, automatic processing and semantic interpretation of spreadsheet data have become an active area of scientific research, especially in the context of integrating this data into the Semantic Web. In this paper, we propose TabbyLD, a tool for the semantic interpretation of data extracted from spreadsheets. The main features of our software are: (1) original metrics for defining semantic similarity between cell values and entities of a global knowledge graph: string similarity, NER-label similarity, heading similarity, semantic similarity, and context similarity; (2) a unified canonicalized form for representing arbitrary spreadsheets; (3) integration of TabbyLD with the TabbyDOC project's tools in the context of the overall pipeline. We present TabbyLD's architecture, its main functions, a method for annotating spreadsheets including the original similarity metrics, an illustrative example, and a preliminary experimental evaluation. In our evaluation, we used the T2Dv2 Gold Standard dataset. Experiments have shown the applicability of TabbyLD for the semantic interpretation of spreadsheet data. We also identified some issues in this process.
- Published
- 2021
- Full Text
- View/download PDF
26. Semantic Ontology Alignment: Survey and Analysis
- Author
-
Lakhdar El Amine Boudaoud
- Subjects
Information retrieval ,Computer science ,Web page ,State (computer science) ,Ontology (information science) ,String metric ,Semantics ,Semantic Web ,Ontology alignment ,Field (computer science) - Abstract
Ontology alignment is an important part of helping the Semantic Web reach its full potential. Recently, ontologies have become increasingly common on the World Wide Web, where they provide generic semantics for annotations in Web pages. This paper aims to catalogue the work in the ontology alignment field and to analyze the approaches according to the different techniques used (terminological, structural, extensional, and semantic). This can clear the way for researchers to choose the appropriate solution to their issue: they can see where existing approaches fall short and propose new approaches for stronger alignment, determine possible inconsistencies in the state of an ontology that result from a user's actions, and suggest ways to remedy these inconsistencies.
- Published
- 2020
- Full Text
- View/download PDF
27. Research on Structured Information Extraction Method of Electronic Medical Records of Traditional Chinese Medicine
- Author
-
Chenjun Hu, Jiadong Xie, Jiayi He, Kongfa Hu, Rongrong Jiang, and Weiming He
- Subjects
Set (abstract data type) ,Matching (statistics) ,Information extraction ,Information retrieval ,Standardization ,Computer science ,Medical record ,Key (cryptography) ,Data set (IBM mainframe) ,String metric ,computer.software_genre ,computer - Abstract
Objective: To study a structured-information extraction method and realize the structuring and standardization of inpatient electronic medical records of TCM (Traditional Chinese Medicine). Methods: Based on the key terms included in the "WS 445-2014 Electronic Medical Records Basic Data Set", a keyword set for TCM hospitalization electronic medical records was constructed. Then, based on a string similarity algorithm and entity-matching technology, the diagnosis and treatment information in the electronic medical records of TCM was extracted and formed into key-value pairs with the keyword set, establishing a structured diagnosis and treatment database. Finally, the diagnoses of traditional Chinese and Western medicine, physical and chemical examinations, and prescription data were further standardized to build a standardized diagnosis and treatment database. Results: The experimental results showed that the KS-CCD (Keyword Sequencing of Clinical Case Data) method proposed in this paper can effectively extract information from inpatient electronic medical records of TCM. Conclusion: The KS-CCD method is suitable for the extraction of structured information from electronic medical records of TCM. It provides rich data for scientific research and is conducive to the inheritance and development of TCM experience.
- Published
- 2020
- Full Text
- View/download PDF
28. Discovering Entity Profiles Candidate for Entity Resolution on Linked Open Data Halal Food Products
- Author
-
Ahmad Choirun Najib and Nur Aini Rakhmawati
- Subjects
Graph database ,Matching (graph theory) ,Computer science ,Graph embedding ,02 engineering and technology ,computer.file_format ,Linked data ,computer.software_genre ,Set (abstract data type) ,Similarity (network science) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,RDF ,String metric ,computer - Abstract
Entity resolution is a common task on the Web of data. Most recent studies in this field aim to discover appropriate entity-profile candidates so as to reduce the likelihood of missing matches and to place matching entity profiles in the same blocks. We propose a method to discover entity-profile candidates for entity resolution. We utilize Node2vec graph embedding to obtain entity representations and perform link prediction. We employed a graph database to generate the nodes and relations from the RDF-triple dataset file; the nodes and relations were then transformed into vectors and saved to a vector embedding file. We calculate the vector similarity between an entity-source vector and all of the entity vectors in the embedding file. The vector similarity produces a set of relevant entities, and the top-k results are selected as entity-profile candidates representing the entities most similar to the entity source. Finally, we perform the entity resolution task using string similarity comparisons between the attribute values of the entity source and each entity-profile candidate, with predetermined parameters and a threshold, and we assign the owl:sameAs property to matched entities. The results show 87%, 80%, and 83% for precision, recall, and F-measure, respectively.
- Published
- 2020
- Full Text
- View/download PDF
29. Query Interface Schema Extraction for Hidden Web Resources Searching
- Author
-
Zhang Huan, Yang Panfei, and Yu Zitong
- Subjects
Schema (genetic algorithms) ,Information retrieval ,Computer science ,Interface (Java) ,Web page ,Static web page ,String metric ,Ontology (information science) ,Semantics ,Field (computer science) - Abstract
Satisfying people's demand for effective access to high-quality Web content is an urgent task in the Web search field. Instead of specifying a URL to send an HTTP request for static page information, accessing hidden Web resources (the deep Web) requires posting queries to the query interface provided by a website. The query interface is the entrance to the Web database information; therefore, research on schema extraction from deep Web query interfaces is a key step in hidden Web resource mining. This paper presents a novel approach to extract interface schemas from the deep Web based on domain ontology. It also proposes a new representation of query-interface attributes that reflects the semantic relationships between labels, on the basis of location, label semantic relationship, and string similarity. Experimental results show that our system is feasible and efficient, and achieves high precision, recall, and F-measure across a variety of databases.
- Published
- 2020
- Full Text
- View/download PDF
30. Neural text normalization leveraging similarities of strings and sounds
- Author
-
Hidetaka Kamigaito, Manabu Okumura, Tatsuya Aoki, Riku Kawamura, and Hiroya Takamura
- Subjects
FOS: Computer and information sciences ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,05 social sciences ,010501 environmental sciences ,01 natural sciences ,Similarity (network science) ,0502 economics and business ,Text normalization ,050207 economics ,String metric ,Baseline (configuration management) ,Computation and Language (cs.CL) ,Word (computer architecture) ,0105 earth and related environmental sciences - Abstract
We propose neural models that can normalize text by considering the similarities of word strings and sounds. We experimentally compared a model that considers the similarities of both word strings and sounds, models that consider only the similarity of word strings or of sounds, and a model without the similarities as a baseline. Results showed that leveraging word-string similarity succeeded in dealing with misspellings and abbreviations, while taking sound similarity into account succeeded in dealing with phonetic substitutions and emphasized characters, so the proposed models achieved higher F1 scores than the baseline. Comment: 6 pages, accepted to COLING 2020
- Published
- 2020
31. Matching System for Animal-Assisted Therapy Based on the Levenshtein and Gale-Shapley Algorithms
- Author
-
Juan Gutierrez-Cardenas and Giuliana Gutiérrez-Rondón
- Subjects
Matching (statistics) ,Computer science ,medicine.medical_treatment ,05 social sciences ,Animal-assisted therapy ,Field (computer science) ,Preference ,030227 psychiatry ,Task (project management) ,Set (abstract data type) ,03 medical and health sciences ,Behavioral traits ,0302 clinical medicine ,medicine ,0501 psychology and cognitive sciences ,String metric ,Algorithm ,050104 developmental & child psychology - Abstract
This research is based on the implementation of an algorithm that assigns pets (cats or dogs) to persons with depressive disorders such as low self-esteem. Although different institutions have assigned pets to patients, we were not able to find one that uses an IT tool for this task. For this reason, we decided to adapt the well-known Gale-Shapley algorithm, which has been used successfully in various situations that require a stable matching between two parties. The results obtained have been validated by experts in the field of animal and human psychology. Because the Gale-Shapley algorithm needs a preference array for each of the parties involved, and an animal cannot state such preferences, we used a string-similarity-based algorithm to derive preference arrays from the behavioral traits of an animal or person.
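A compact Gale-Shapley sketch follows. In the paper, the animals' preference arrays are derived from string similarity over behavioral-trait descriptions; here both sides' preferences are supplied directly, and all names are invented.

```python
# Gale-Shapley stable matching: persons propose, pets accept or trade up.

from collections import deque

def gale_shapley(person_prefs, pet_prefs):
    """Return a stable matching {person: pet}; persons propose."""
    # Lower rank value means the pet prefers that person more.
    rank = {pet: {p: r for r, p in enumerate(prefs)}
            for pet, prefs in pet_prefs.items()}
    free = deque(person_prefs)
    next_choice = {p: 0 for p in person_prefs}
    engaged = {}                          # pet -> person
    while free:
        person = free.popleft()
        pet = person_prefs[person][next_choice[person]]
        next_choice[person] += 1
        if pet not in engaged:
            engaged[pet] = person
        elif rank[pet][person] < rank[pet][engaged[pet]]:
            free.append(engaged[pet])     # displaced partner re-proposes
            engaged[pet] = person
        else:
            free.append(person)           # rejected, tries next choice
    return {person: pet for pet, person in engaged.items()}

persons = {"ana": ["rex", "mia"], "luis": ["mia", "rex"]}
pets = {"rex": ["luis", "ana"], "mia": ["ana", "luis"]}
print(gale_shapley(persons, pets))
```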
- Published
- 2020
- Full Text
- View/download PDF
32. Automatic Grading System for Spreadsheet Formula
- Author
-
Fitra Arifiansyah, Saiful Akbar, and Kurniandha Sukma Yunastrian
- Subjects
Matching (statistics) ,Information retrieval ,Similarity (network science) ,Computer science ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,020207 software engineering ,02 engineering and technology ,String metric - Abstract
A spreadsheet is one of the tools that can be used to learn data analysis. Data analysis in a spreadsheet can be done using formulas, and spreadsheet tools can also be used for exams. For assessment, a problem arises when the number of answers to be checked is large: it takes a long time to check them all. For this reason, an automatic grading system (autograder) that can evaluate formulas in spreadsheets is needed. The method used in developing the autograder is to match the answer key formula against the student's answer formula; the system assesses an answer by calculating the similarity between the two formulas. This paper explains how to build such an autograder. In the end, an autograder system was built successfully; it has been tested with 43 test cases, and all of them passed.
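A minimal sketch of grading by formula similarity, with an invented tokenizer and the stdlib difflib ratio as the similarity; the paper's actual metric and grading scale are not reproduced here:

```python
import difflib
import re

def tokenize_formula(formula: str):
    """Split a spreadsheet formula into names, cell refs, numbers, operators."""
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+\.?\d*|[^\s]", formula.upper())

def grade(answer_key: str, student: str, full_marks: float = 10.0) -> float:
    """Score = token-level similarity between key formula and student formula."""
    sim = difflib.SequenceMatcher(None,
                                  tokenize_formula(answer_key),
                                  tokenize_formula(student)).ratio()
    return round(sim * full_marks, 2)

print(grade("=SUM(A1:A10)/COUNT(A1:A10)", "=AVERAGE(A1:A10)"))  # partial credit
print(grade("=SUM(A1:A10)", "=SUM(A1:A10)"))                    # full marks: 10.0
```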
- Published
- 2020
- Full Text
- View/download PDF
33. Convolutional Embedding for Edit Distance
- Author
-
Kaiwen Zhou, Xiao Yan, Yuxuan Wang, Xinyan Dai, Han Yang, and James Cheng
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,business.industry ,Computer science ,Nearest neighbor search ,Deep learning ,Computer Science::Neural and Evolutionary Computation ,Databases (cs.DB) ,Sequence alignment ,02 engineering and technology ,Convolutional neural network ,Machine Learning (cs.LG) ,Euclidean distance ,Computer Science - Databases ,Margin (machine learning) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Embedding ,020201 artificial intelligence & image processing ,Edit distance ,Artificial intelligence ,String metric ,business ,Algorithm - Abstract
Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) is used to generate fixed-length vector embeddings for a dataset of strings and the loss function is a combination of the triplet loss and the approximation error. To justify our choice of using CNN instead of other structures (e.g., RNN) as the model, theoretical analysis is conducted to show that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms data-independent CGK embedding and RNN-based GRU embedding in terms of both accuracy and efficiency by a large margin. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude., Comment: Accepted by the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020
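A toy sketch of the idea, assuming PyTorch is available: strings are one-hot encoded, a convolution plus global max pooling yields a fixed-length vector, and a triplet loss orders embeddings by edit distance. The alphabet, dimensions, and single-layer architecture are placeholders, not the paper's CNN-ED:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ALPHABET = "ACGT"        # assumption: a DNA-like alphabet for the demo
MAX_LEN = 32

def one_hot(s: str) -> torch.Tensor:
    """(alphabet, MAX_LEN) one-hot matrix, zero-padded on the right."""
    x = torch.zeros(len(ALPHABET), MAX_LEN)
    for i, ch in enumerate(s[:MAX_LEN]):
        x[ALPHABET.index(ch), i] = 1.0
    return x

class CnnEmbed(nn.Module):
    """Convolution + global max pooling gives a fixed-length embedding."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(len(ALPHABET), dim, kernel_size=3, padding=1)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):              # x: (batch, alphabet, MAX_LEN)
        h = F.relu(self.conv(x))
        h = h.max(dim=2).values        # global max pooling over positions
        return self.fc(h)

model = CnnEmbed()
anchor, pos, neg = (one_hot(s).unsqueeze(0)
                    for s in ("ACGTACGT", "ACGTACGA", "TTTTGGGG"))
# The triplet loss pulls the anchor toward the low-edit-distance string (pos)
# and away from the high-edit-distance one (neg).
loss = F.triplet_margin_loss(model(anchor), model(pos), model(neg), margin=1.0)
loss.backward()   # an optimizer step would follow in a real training loop
```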
- Published
- 2020
- Full Text
- View/download PDF
34. Top-k String Similarity Joins
- Author
-
Shuyao Qi, Nikos Mamoulis, and Panagiotis Bouros
- Subjects
Theoretical computer science ,Similarity (network science) ,Computer science ,String (computer science) ,Joins ,Join (sigma algebra) ,Edit distance ,String metric ,Aggregate function ,Ranking (information retrieval) - Abstract
Top-k joins have been extensively studied in relational databases as ranking operations where every object has, among others, at least one ranking attribute. However, the focus has mostly been on the case where the join attributes are of primitive data types (e.g., numerical values) and the join predicate is equality. In this work, we consider string objects assigned such ranking attributes, or simply scores. Given two collections of string objects and a string similarity measure (e.g., the edit distance), we introduce the top-k string similarity join, which returns k sufficiently similar pairs of objects with respect to a similarity threshold ϵ that have the highest combined score computed by a monotone aggregate function γ (e.g., SUM). Such a join operation finds application in data integration, data cleaning and de-duplication scenarios, and in emerging scientific fields such as bioinformatics. We investigate how existing top-k join methods can be adapted and optimized for this operation, taking into account the semantics and the special characteristics of string similarity joins. We present techniques to avoid computing the entire string join, and indexing that enables pruning candidates with respect to both the string join and the ranking component of the query. Our extensive experimental analysis demonstrates the efficiency of our methodology by comparing solutions that either prioritize the ranking/join component or are able to handle both components of the query at the same time.
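A nested-loop baseline that pins down the query semantics (none of the paper's pruning or indexing): keep pairs within edit distance eps and return the k pairs with the highest SUM of scores. The collections and scores below are invented:

```python
import heapq

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def topk_similarity_join(R, S, eps: int, k: int):
    """R, S: lists of (string, score). Returns the k highest-SUM pairs
    among those with edit distance at most eps (nested-loop baseline)."""
    results = []
    for r, score_r in R:
        for s, score_s in S:
            if edit_distance(r, s) <= eps:
                heapq.heappush(results, (score_r + score_s, r, s))
                if len(results) > k:
                    heapq.heappop(results)   # drop the lowest combined score
    return sorted(results, reverse=True)

R = [("apple", 0.9), ("apply", 0.4), ("orange", 0.8)]
S = [("appl", 0.7), ("orang", 0.6)]
print(topk_similarity_join(R, S, eps=1, k=2))
```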
- Published
- 2020
- Full Text
- View/download PDF
35. Boosting toponym interlinking by paying attention to both machine and deep learning
- Author
-
Konstantinos Alexis, Giorgos Giannopoulos, and Vassilis Kaffes
- Subjects
Information retrieval ,Boosting (machine learning) ,Geospatial analysis ,Artificial neural network ,Computer science ,business.industry ,Deep learning ,Language model ,Artificial intelligence ,String metric ,business ,computer.software_genre ,computer - Abstract
Toponym interlinking is the problem of identifying the same spatio-textual entities within two or more different data sources, based exclusively on their names. It comprises a significant task in geospatial data management and integration, with applications in fields such as geomarketing, cadastration and navigation. Previous works have assessed the effectiveness of unsupervised string similarity functions, while more recent ones have deployed similarity-based Machine Learning techniques and language-model-based Deep Learning techniques, achieving significantly higher interlinking accuracy. In this paper, we demonstrate the suitability of Attention-based neural networks for the problem, as well as the fact that all of the different approaches bring merit to it, and we propose a hybrid scheme that achieves the highest accuracy reported for toponym interlinking on the widely used Geonames dataset.
- Published
- 2020
- Full Text
- View/download PDF
36. An Analysis of Automated Answer Evaluation Systems based on Machine Learning
- Author
-
Rohan Kokate, Prajwal G. Chanore, Birpal Singh J. Kapoor, Sushil S. Kolhatkar, Mohan M. Vishwakarma, and Shubham M. Nagpure
- Subjects
Information retrieval ,business.industry ,Computer science ,Process (engineering) ,Digital era ,05 social sciences ,050301 education ,02 engineering and technology ,Outcome (game theory) ,Data recovery ,Weighting ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Special care ,String metric ,business ,Grading (education) ,0503 education - Abstract
Evaluation of answers remains one of the most important factors in the learning and teaching process. Automatic evaluation of answers is thus very necessary, and many systems have been developed in this digital era. Subjective answers usually come in either short or long form. The existing systems available for evaluation have shown mediocre results in evaluating and scoring answers. Such frameworks use information retrieval techniques to gauge the similarity between a student's answer and a reference answer, but such scoring does not yet give the best outcome. Short answers contain very few keywords, and answers with such a limited number of keywords need special care, especially when calculating their weighted score. In the presented study, we summarize the existing mechanisms and analyze the performance of systems used for automatic grading of long, descriptive answers.
- Published
- 2020
- Full Text
- View/download PDF
37. Canonicalizing Organization Names for Recruitment Domain
- Author
-
Nausheen Fatma, Niharika Sachdeva, and Nitendra Rajput
- Subjects
Vocabulary ,Information retrieval ,Similarity (network science) ,Computer science ,media_common.quotation_subject ,Redundancy (engineering) ,Context (language use) ,Ambiguity ,Recommender system ,String metric ,Cluster analysis ,media_common - Abstract
The online recruitment industry relies on various Knowledge Bases (KBs) for enabling search and recommendation systems. These KBs comprise diverse, non-standard, and large volumes of named entities, as they are created from vast unstructured user-generated content (mostly CVs). Such non-standard representation of each entity causes a significant vocabulary gap in the KB, which results in redundancy, incompleteness, and ambiguity in the retrieved information. The problem is even more challenging in domains where external sources of context do not exist. To address these challenges, we propose a two-tier architecture that (a) finds the distance parameter for clustering entities using a novel pairwise similarity between all entity mentions, and (b) then uses these similarity scores to create canonical clusters, each representing a unique entity in the KB. Our experiments on proprietary data of 25,602 unique companies and 23,690 unique institutes show that the pairwise similarity score using a Siamese network outperforms (97% and 82% F1-score) standard string similarity measures. Finally, clustering over the similarity scores achieves 90% and 80% micro F1-scores.
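A sketch of the second tier under simplifying assumptions: a plain string similarity stands in for the Siamese-network score, and a union-find groups mentions whose pairwise similarity clears a threshold into canonical clusters. The names and threshold are invented:

```python
import difflib

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def canonical_clusters(mentions, threshold=0.8):
    """Group mentions whose pairwise similarity meets the threshold."""
    parent = list(range(len(mentions)))
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            sim = difflib.SequenceMatcher(None, mentions[i].lower(),
                                          mentions[j].lower()).ratio()
            if sim >= threshold:
                parent[find(parent, i)] = find(parent, j)
    clusters = {}
    for i, m in enumerate(mentions):
        clusters.setdefault(find(parent, i), []).append(m)
    return list(clusters.values())

names = ["Acme Corp", "Acme Corp.", "Globex Inc", "Globex Inc."]
print(canonical_clusters(names))   # two canonical clusters expected
```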
- Published
- 2020
- Full Text
- View/download PDF
38. NGNC: A Flexible and Efficient Framework for Error-Tolerant Query Autocompletion
- Author
-
Sheng Hu, Jianbin Qin, Makoto Onizuka, Yuyang Dong, Yukai Miao, and Yoshiharu Ishikawa
- Subjects
Computer science ,computer.software_genre ,Ranking (information retrieval) ,Search engine ,Ranking ,Trie ,Edit distance ,Input method ,Data mining ,Language model ,String metric ,computer ,Noisy channel model ,Data integration - Abstract
Query autocompletion (QAC) is an important feature that automatically completes a query and saves users' keystrokes. It has been widely adopted in Web search engines, desktop search, input method editors, etc. In some applications, especially on mobile devices, typing accurately is laborious and error-prone; hence, advanced QAC methods tolerate errors while users are typing, and some data integration tasks also adopt this feature to process string similarity searches. Most existing work uses edit distance to measure the similarity between the input and correct strings. These methods overlook the quality of the suggested completions, and their efficiency needs to be improved. In this paper, we present NGNC, a framework that supports error-tolerant QAC in a flexible and efficient way. The framework is designed on the basis of a noisy channel model which separates the query prediction into two estimations, one by a language model and the other by an error model. Many QAC ranking methods and spelling correction methods can be easily plugged into the framework. To address the efficiency issue, we devise a neighborhood generation method accompanied by a trie index to quickly find candidates for the error model, as well as a fast top-k retrieval method based on caching and pruning. We develop a QAC system based on NGNC. It is able to evaluate combinations of various ranking and spelling correction methods using query logs and automatically choose the best combination for online query workloads. We highlight research challenges, present our solutions, overview the system architecture, and perform an experimental evaluation on a real dataset to showcase how NGNC improves the state of the art of error-tolerant QAC.
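A stripped-down sketch of the noisy channel split: a toy query-log unigram model serves as the language model and an edit-distance penalty as the error model. The trie index, neighborhood generation, caching, and pruning of the paper are omitted, and the log frequencies are invented:

```python
import math

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Language model: query-log frequencies (toy numbers, purely illustrative).
QUERY_LOG = {"similarity": 120, "similarity join": 45, "simulation": 300,
             "simulated annealing": 80, "smiley": 60}
TOTAL = sum(QUERY_LOG.values())

def score(prefix: str, completion: str, err_weight: float = 1.0) -> float:
    """log P(completion) + log P(prefix | completion): the noisy channel split."""
    lm = math.log(QUERY_LOG[completion] / TOTAL)
    # Error model: penalize edits between the typed prefix and the
    # completion's prefix of the same length.
    err = -err_weight * edit_distance(prefix, completion[:len(prefix)])
    return lm + err

def autocomplete(prefix: str, k: int = 3):
    return sorted(QUERY_LOG, key=lambda c: score(prefix, c), reverse=True)[:k]

print(autocomplete("simil"))   # typo-tolerant ranking over the toy log
```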
- Published
- 2020
- Full Text
- View/download PDF
39. Learning Advanced Similarities and Training Features for Toponym Interlinking
- Author
-
Georgios Kostoulas, Vassilis Kaffes, and Giorgos Giannopoulos
- Subjects
Information retrieval ,010504 meteorology & atmospheric sciences ,Computer science ,Process (engineering) ,media_common.quotation_subject ,Feature extraction ,0211 other engineering and technologies ,02 engineering and technology ,01 natural sciences ,Focus (linguistics) ,Similarity (psychology) ,Quality (business) ,String metric ,Function (engineering) ,Geomarketing ,021101 geological & geomatics engineering ,0105 earth and related environmental sciences ,media_common - Abstract
Interlinking of spatio-textual entities is an open and quite challenging research problem, with applications in several commercial fields, including geomarketing, navigation and social networks. It comprises the process of identifying, between different data sources, entity descriptions that refer to the same real-world entity. In this work, we focus on toponym interlinking, that is, we handle spatio-textual entities that are exclusively represented by their name; additional properties, such as categories, coordinates, etc., are considered either absent or of too low quality to be exploited in this setting. Toponyms are inherently heterogeneous entities; quite often several alternative names exist for the same toponym, with varying degrees of similarity between them. State-of-the-art approaches mostly adopt generic, domain-agnostic similarity functions and use them as is, or incorporate them as training features within classifiers for toponym interlinking. We claim that capturing the specificities of toponyms and exploiting them in elaborate meta-similarity functions and derived training features can significantly increase the effectiveness of interlinking methods. To this end, we propose the LGM-Sim meta-similarity function and a series of novel, similarity-based and statistical training features that can be utilized in similarity-based and classification-based interlinking settings, respectively. We demonstrate that the proposed methods achieve large increases in accuracy, in both settings, compared to several methods from the literature on the widely used Geonames toponym dataset.
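A sketch of turning several similarity signals between two toponym names into a feature vector for a classifier; the four features below are generic illustrations, not LGM-Sim or the paper's feature set:

```python
import difflib

def jaccard_tokens(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def common_prefix_ratio(a: str, b: str) -> float:
    n = 0
    for ca, cb in zip(a.lower(), b.lower()):
        if ca != cb:
            break
        n += 1
    return n / max(len(a), len(b), 1)

def toponym_features(a: str, b: str):
    """Feature vector for a 'same toponym?' classifier."""
    return [
        difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio(),  # char similarity
        jaccard_tokens(a, b),                                         # word overlap
        common_prefix_ratio(a, b),                                    # shared prefix
        abs(len(a) - len(b)) / max(len(a), len(b), 1),                # length gap
    ]

print(toponym_features("Athens", "Athina"))
print(toponym_features("Mount Olympus", "Olympus Mons"))
```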
- Published
- 2020
- Full Text
- View/download PDF
40. Clinical Concept Linking with Contextualized Neural Representations
- Author
-
Andriy Mulyar, Mark Dredze, and Elliot Schumacher
- Subjects
0303 health sciences ,Information retrieval ,Computer science ,business.industry ,Synonym ,String (computer science) ,Context (language use) ,010501 environmental sciences ,Ontology (information science) ,01 natural sciences ,Ranking (information retrieval) ,03 medical and health sciences ,Entity linking ,Ranking ,Knowledge base ,Similarity (psychology) ,Ontology ,String metric ,business ,030304 developmental biology ,0105 earth and related environmental sciences - Abstract
In traditional approaches to entity linking, linking decisions are based on three sources of information -- the similarity of the mention string to an entity's name, the similarity of the context of the document to the entity, and broader information about the knowledge base (KB). In some domains, there is little contextual information present in the KB and thus we rely more heavily on mention string similarity. We consider one example of this, concept linking, which seeks to link mentions of medical concepts to a medical concept ontology. We propose an approach to concept linking that leverages recent work in contextualized neural models, such as ELMo (Peters et al. 2018), which create a token representation that integrates the surrounding context of the mention and concept name. We find a neural ranking approach paired with contextualized embeddings provides gains over a competitive baseline (Leaman et al. 2013). Additionally, we find that a pre-training step using synonyms from the ontology offers a useful initialization for the ranker.
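A minimal sketch of the ranking step only, with random placeholder vectors standing in for contextualized encodings; a real linker would encode the mention in its document context and each concept name (plus ontology synonyms) with a model such as ELMo:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def link_mention(mention_vec, ontology):
    """Rank ontology concepts by cosine similarity to the mention vector."""
    ranked = sorted(ontology, key=lambda c: cosine(mention_vec, ontology[c]),
                    reverse=True)
    return ranked[0], ranked

# Placeholder vectors; concept names are illustrative, not ontology entries.
rng = np.random.default_rng(0)
ontology = {"hypertension": rng.normal(size=8),
            "diabetes mellitus": rng.normal(size=8),
            "myocardial infarction": rng.normal(size=8)}
mention_vec = ontology["hypertension"] + 0.1 * rng.normal(size=8)
print(link_mention(mention_vec, ontology)[0])   # expected: 'hypertension'
```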
- Published
- 2020
- Full Text
- View/download PDF
41. Stable assessment of the quality of similarity algorithms of character strings and their normalizations
- Author
-
Sergej Vital'evich Znamenskij
- Subjects
business.industry ,Computer science ,Quality assessment ,media_common.quotation_subject ,Geography, Planning and Development ,Pattern recognition ,Development ,Character (mathematics) ,Similarity (network science) ,Quality (business) ,Artificial intelligence ,String metric ,business ,media_common - Abstract
The choice of search tools for hidden commonality in data of a new nature requires stable and reproducible comparative assessments of the quality of abstract string proximity algorithms. Conventional estimates based on artificially generated or manually labeled tests vary significantly, evaluating the method of artificial generation with respect to the similarity algorithms rather than the algorithms themselves, while estimates based on user data cannot be accurately reproduced. We propose a simple, transparent, objective and reproducible numerical quality assessment of a string metric. Parallel texts of book translations in different languages are used. The quality of a measure is estimated by the percentage of errors over the possible trials of determining the translation of a given paragraph between two paragraphs of a book in another language, one of which is actually the translation. The stability of the assessments is verified by their independence from the choice of book and language pair. The numerical experiment produced a stable quality ranking of abstract character string comparison algorithms and showed a strong dependence on the choice of normalization.
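A small harness implementing the described protocol: a measure errs whenever it rates a distractor paragraph at least as close as the true translation. The tiny corpus below is only a placeholder; real evaluations use whole parallel books:

```python
import difflib
import random

def evaluate_measure(sim, paragraphs_src, paragraphs_dst, trials=1000, seed=0):
    """Error rate of a similarity measure at picking the true translation.
    paragraphs_src[i] and paragraphs_dst[i] are aligned translations."""
    rng = random.Random(seed)
    errors = 0
    n = len(paragraphs_src)
    for _ in range(trials):
        i = rng.randrange(n)
        j = rng.choice([x for x in range(n) if x != i])   # distractor paragraph
        true_sim = sim(paragraphs_src[i], paragraphs_dst[i])
        false_sim = sim(paragraphs_src[i], paragraphs_dst[j])
        if false_sim >= true_sim:
            errors += 1
    return errors / trials

ratio = lambda a, b: difflib.SequenceMatcher(None, a, b).ratio()
# A lower error rate indicates a better string measure for this task.
src = ["the old man went to the sea", "a storm rose in the night",
       "the boat returned empty"]
dst = ["el viejo fue al mar", "una tormenta se levanto en la noche",
       "el bote volvio vacio"]
print(evaluate_measure(ratio, src, dst, trials=100))
```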
- Published
- 2018
- Full Text
- View/download PDF
42. Reasoning about attribute value equivalence in relational data
- Author
-
Zhanhuai Li, Fengfeng Fan, Qun Chen, and Lei Chen
- Subjects
Computer science ,Relational database ,Evidential reasoning approach ,Probabilistic logic ,02 engineering and technology ,computer.software_genre ,Hardware and Architecture ,020204 information systems ,Correlation analysis ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,String metric ,Functional dependency ,Equivalence (measure theory) ,computer ,Software ,Information Systems - Abstract
In relational data, identifying the distinct attribute values that refer to the same real-world entities is an essential task for many data cleaning and mining applications (e.g., duplicate record detection and functional dependency mining). The state-of-the-art approaches for attribute value matching are mainly based on string similarity among attribute values. However, these approaches may not perform well in the cases where the specified string similarity metric is not a reliable indicator for attribute value equivalence. To alleviate such limitations, we propose a new framework for attribute value matching in relational data. Firstly, we propose a novel probabilistic approach to reason about attribute value equivalence by value correlation analysis. We also propose effective methods for probabilistic equivalence reasoning with multiple attributes. Next, we present a unified framework, which incorporates both string similarity measurement and value correlation analysis by evidential reasoning. Finally, we demonstrate the effectiveness of our framework empirically on real-world datasets. Through extensive experiments, we show that our framework outperforms the string-based approaches by considerable margins on matching accuracy and achieves the desired efficiency.
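A toy illustration of the correlation idea (not the paper's probabilistic model): two values of one attribute whose co-occurrence profiles over another attribute nearly coincide are equivalence candidates, even with zero string similarity. The table and attribute names are invented:

```python
from collections import Counter

def cooccurrence_profile(rows, attr, value, other_attr):
    """Distribution of other_attr values among rows where attr == value."""
    counts = Counter(r[other_attr] for r in rows if r[attr] == value)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def profile_similarity(p, q):
    """Overlap of two distributions (1.0 = identical)."""
    return sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in set(p) | set(q))

rows = [
    {"maker": "VW", "model": "Golf"}, {"maker": "VW", "model": "Passat"},
    {"maker": "Volkswagen", "model": "Golf"},
    {"maker": "Volkswagen", "model": "Passat"},
    {"maker": "Toyota", "model": "Corolla"},
]
p_vw = cooccurrence_profile(rows, "maker", "VW", "model")
p_full = cooccurrence_profile(rows, "maker", "Volkswagen", "model")
p_toy = cooccurrence_profile(rows, "maker", "Toyota", "model")
# "VW" and "Volkswagen" share model profiles despite zero string similarity.
print(profile_similarity(p_vw, p_full), profile_similarity(p_vw, p_toy))
```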
- Published
- 2018
- Full Text
- View/download PDF
43. A Universal String Matching Approach to Screen Content Coding
- Author
-
Jing Guo, Shuhui Wang, Liping Zhao, Tao Lin, and Kailun Zhou
- Subjects
Computer science ,030229 sport sciences ,02 engineering and technology ,String searching algorithm ,Computer Science Applications ,World Wide Web ,03 medical and health sciences ,0302 clinical medicine ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,Media Technology ,Test suite ,020201 artificial intelligence & image processing ,Pattern matching ,Electrical and Electronic Engineering ,Graphics ,String metric ,Algorithm ,Decoding methods ,Coding (social sciences) - Abstract
This paper proposes a universal string matching (USM) approach to screen content coding (SCC). USM uses a primary reference buffer and a secondary reference buffer for string matching and includes three modes: general string (GS) mode, constrained string 1 (CS1) mode, and constrained string 2 (CS2) mode. The CS1 and CS2 modes are constrained cases of the GS mode. Due to the diversity of screen content, each of the three modes plays an indispensable role in coding some types of screen content. When using USM to code a coding unit (CU), one of the three modes is selected to code the CU. Compared with the high-efficiency video coding (HEVC) SCC reference software HM-16.6 + SCM-5.2 with a full-frame search range for intra block copy, USM achieves an average Y BD-rate of –28.4% for five text and graphics with motion (TGM) sequences from the audio video coding standard SCC common test condition (CTC) test suite and –5.8% for eight TGM test sequences from the HEVC SCC CTC test suite in all-intra configuration, with a nearly 10% decrease in encoding runtime and almost the same decoding runtime.
- Published
- 2018
- Full Text
- View/download PDF
44. RETRACTED ARTICLE: Research on smart city service system based on adaptive algorithm
- Author
-
Xin Gu and Yin Zhang
- Subjects
Service system ,Adaptive algorithm ,Computer Networks and Communications ,business.industry ,Computer science ,020206 networking & telecommunications ,Cloud computing ,02 engineering and technology ,010501 environmental sciences ,computer.software_genre ,01 natural sciences ,Software ,Smart city ,0202 electrical engineering, electronic engineering, information engineering ,Information system ,Redundancy (engineering) ,Data mining ,String metric ,business ,computer ,0105 earth and related environmental sciences - Abstract
The information resources of various departments are limited, resulting in duplicated information systems and extensive redundancy of hardware and software resources. To change this situation, a catalogue management service system for the smart city is introduced, and the structural levels of government information resources are analyzed. Cloud platform, information recommendation, string similarity matching and other technologies are used to connect and share the departmental governments' system information, so that catalogue information resources are used effectively. The LPvA-Index structure is presented, based on the existing Gram–Trie tree. LPvA-Index adaptively selects an appropriate prefix filter length, resulting in fewer inverted lists; in addition, filters are applied during reads, reducing the number of candidate sets and increasing the query efficiency for strings. The length segmentation and location segmentation methods are also applied to the Gram–Trie tree. Finally, a test platform is built, and the efficiency of the LPvA-Index structure and the existing LPA-Index structure is tested and compared under controlled conditions.
- Published
- 2018
- Full Text
- View/download PDF
45. String Similarity Search: A Hash-Based Approach
- Author
-
Can Lu, Jeffrey Xu Yu, and Hao Wei
- Subjects
Discrete mathematics ,Query string ,Computer science ,Nearest neighbor search ,String (computer science) ,Hash function ,Commentz-Walter algorithm ,020206 networking & telecommunications ,02 engineering and technology ,String searching algorithm ,Computer Science Applications ,law.invention ,Prefix ,Computational Theory and Mathematics ,String kernel ,law ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Edit distance ,Boyer–Moore string search algorithm ,String metric ,Information Systems - Abstract
String similarity search is a fundamental query that has been widely used for DNA sequencing, error-tolerant query auto-completion, and the data cleaning needed in databases, data warehouses, and data mining. In this paper, we study string similarity search based on edit distance, which is supported by many database management systems such as Oracle and PostgreSQL. Given the edit distance ed(s, t) between two strings s and t, string similarity search finds every string t in a string database D which is similar to a query string s such that ed(s, t) ≤ τ for a given threshold τ. In the literature, most existing work takes a filter-and-verify approach, where the filter step is introduced to reduce the high verification cost between two strings by utilizing an index built offline for D. The two up-to-date approaches are prefix filtering and local filtering. In this paper, we study string similarity search where strings can be either short or long. Our approach can support long strings, which are not well supported by the existing approaches due to the size of the index built and the time to build such an index. We propose two new hash-based labeling techniques, named OX label and XX label, for string similarity search. We assign a hash label H_s to a string s, and prune dissimilar strings by comparing two hash labels, H_s and H_t, for strings s and t in the filter step. The key idea is to exploit the dissimilar bit patterns between two hash labels. We discuss our hash-based approaches, address their pruning power, and give the algorithms. Our hash-based approaches achieve high efficiency while keeping the index size and index construction time one order of magnitude smaller than the existing approaches in our experiments.
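A toy bit-signature filter in the spirit of comparing hash labels (not the paper's OX/XX labels): each string's q-grams are OR-hashed into one machine word, and a popcount bound on the XOR prunes pairs safely, since if ed(s, t) ≤ τ the q-gram sets differ by at most 2qτ grams and each differing bit needs one such gram:

```python
def qgrams(s: str, q: int = 2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def hash_label(s: str, q: int = 2, bits: int = 64) -> int:
    """OR all q-gram hashes of s into one machine word."""
    label = 0
    for g in qgrams(s, q):
        label |= 1 << (hash(g) % bits)
    return label

def may_be_similar(s: str, t: str, tau: int, q: int = 2) -> bool:
    """Safe pruning test: if ed(s, t) <= tau, the q-gram sets differ by at
    most 2*q*tau grams, and each differing XOR bit needs one such gram."""
    diff = bin(hash_label(s, q) ^ hash_label(t, q)).count("1")
    return diff <= 2 * q * tau

# Note: Python's str hash is salted per process, so labels are only
# comparable within one run. Survivors still need exact verification.
print(may_be_similar("similarity", "similarly", tau=2))      # survives the filter
print(may_be_similar("similarity", "convolution", tau=1))    # likely pruned
```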
- Published
- 2018
- Full Text
- View/download PDF
46. How does that name sound? Name representation learning using accent-specific speech generation
- Author
-
Rami Puzis, Aviad Elyashar, and Michael Fire
- Subjects
Information Systems and Management ,business.industry ,Computer science ,Deep learning ,02 engineering and technology ,Pronunciation ,computer.software_genre ,Management Information Systems ,Artificial Intelligence ,020204 information systems ,Encoding (memory) ,Online search ,Stress (linguistics) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,Pattern matching ,String metric ,business ,Feature learning ,computer ,Software ,Natural language processing - Abstract
Searching for information about a specific person is a frequent online activity. In most cases, users are aided in the search process by name-containing queries in Web search engines. Typically, Web search engines provide just a few accurate results for such queries. Most existing solutions for suggesting synonyms in online search are based on pattern matching and phonetic encoding, but very often the performance of these solutions is less than optimal. In this paper, we propose SpokenName2Vec, a novel and generic algorithm which addresses the synonym suggestion problem by utilizing automated speech generation and deep learning to produce spoken name embeddings. These embeddings capture the way people pronounce names in a particular language and accent. Utilizing a name's pronunciation can help detect names that sound alike but are written differently. We demonstrated the proposed approach on a large-scale dataset with more than 250,000 forenames and surnames and evaluated it on two ground truth datasets containing 7,400 forenames and 25,000 surnames (including their verified synonyms). The performance of SpokenName2Vec was found superior to the 10 other algorithms evaluated, including phonetic encoding, string similarity, and machine learning algorithms. The results obtained emphasize the potential of spoken name embeddings for improved synonym suggestion.
- Published
- 2021
- Full Text
- View/download PDF
47. New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance
- Author
-
Seung-Rohk Oh, HyunJin Kim, and ThienLuan Ho
- Subjects
Bitap algorithm ,Computer science ,Commentz-Walter algorithm ,020206 networking & telecommunications ,Hamming distance ,0102 computer and information sciences ,02 engineering and technology ,String searching algorithm ,Approximate string matching ,01 natural sciences ,Theoretical Computer Science ,010201 computation theory & mathematics ,Hardware and Architecture ,3-dimensional matching ,0202 electrical engineering, electronic engineering, information engineering ,String metric ,Hamming weight ,Algorithm ,Software ,Information Systems - Abstract
This paper proposes new algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. Both are special cases of approximate string matching and have numerous direct applications in bioinformatics and text searching. Firstly, a counter-vector-mismatches (CVM) algorithm is proposed to solve fixed-length approximate string matching with k mismatches. The CVM algorithm is based on the parallel summation of counters located in the same machine word. Secondly, a parallel counter-vector-mismatches (PCVM) algorithm is proposed to accelerate CVM. The PCVM algorithm exploits two levels of parallelism: word-level parallelism and data parallelism via parallel environments such as multi-core processors and graphics processing units (GPUs). In the particular case of GPUs, a shared-memory parallel counter-vector-mismatches (PCVMsmem) scheme can be implemented from the PCVM algorithm; this scheme exploits the memory model of GPUs to optimize the performance of PCVM. Finally, this paper shows several methods to adopt the CVM and PCVM algorithms when the input pattern has a circular structure. In experiments with real DNA packages, our proposed algorithms and scheme run considerably faster than previous bit-vector-mismatches and parallel bit-vector-mismatches algorithms.
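For reference, the problem the CVM algorithm solves, stated as a plain sliding-window baseline without the packed machine-word counters or GPU parallelism of the paper:

```python
def hamming_kmismatch_search(text: str, pattern: str, k: int):
    """All alignments where pattern matches text with at most k mismatches."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:      # early exit once the budget is spent
                    break
        if mismatches <= k:
            hits.append(i)
    return hits

print(hamming_kmismatch_search("ACGTTACGAACGT", "ACGA", k=1))  # [0, 5, 9]
```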
- Published
- 2017
- Full Text
- View/download PDF
48. A review on parameterized string matching algorithms
- Author
-
Deepak Rai, Rama Singh, and Rajesh Prasad
- Subjects
Theoretical computer science ,Hash function ,Parameterized complexity ,Commentz-Walter algorithm ,0102 computer and information sciences ,02 engineering and technology ,String searching algorithm ,Approximate string matching ,01 natural sciences ,Identification (information) ,010201 computation theory & mathematics ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,String metric - Abstract
Identification of candidate genes and nucleotides is a basic use of bioinformatics research. Molecular biology deals with molecules whose behavior is both functional and structural, imparting the need of wel...
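One classic building block in this literature is Baker's prev-encoding, shown below with every symbol treated as a parameter (real parameterized matching also allows a fixed alphabet): each symbol becomes the distance to its previous occurrence, so two strings parameterized-match exactly when their encodings are equal:

```python
def prev_encode(s: str):
    """Baker's prev-encoding: each symbol becomes the distance to its
    previous occurrence (0 on first occurrence), so consistently renaming
    parameter symbols leaves the encoding unchanged."""
    last = {}
    out = []
    for i, ch in enumerate(s):
        out.append(i - last[ch] if ch in last else 0)
        last[ch] = i
    return tuple(out)

def p_match(a: str, b: str) -> bool:
    """True iff a and b are equal up to a one-to-one renaming of symbols."""
    return prev_encode(a) == prev_encode(b)

# 'xyxy' and 'abab' p-match (x->a, y->b); 'xyxy' and 'aaaa' do not.
print(p_match("xyxy", "abab"), p_match("xyxy", "aaaa"))
```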
- Published
- 2017
- Full Text
- View/download PDF
49. FrepJoin: an efficient partition-based algorithm for edit similarity join
- Author
-
Jizhou Luo, Jianzhong Li, Shengfei Shi, and Hongzhi Wang
- Subjects
Alternative methods ,Exploit ,Computer Networks and Communications ,Computer science ,020207 software engineering ,02 engineering and technology ,Hardware and Architecture ,020204 information systems ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,Leverage (statistics) ,Edit distance ,Electrical and Electronic Engineering ,String metric ,Algorithm - Abstract
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework; they cannot catch the dissimilarity between string subsets and do not fully exploit statistics such as character frequencies. We investigate a partition-based algorithm that uses such statistics. Frequency vectors are used to partition the dataset into data chunks whose mutual dissimilarity is caught easily, and a novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed that leverages the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
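A sketch of the kind of frequency statistic such partition-based methods exploit (the partitioning itself is not reproduced): half the L1 distance between character-frequency vectors lower-bounds the edit distance, so cheap vector arithmetic can prune pairs before the expensive DP verification:

```python
from collections import Counter

def freq_lower_bound(s: str, t: str) -> int:
    """Half the L1 distance between character-frequency vectors: each edit
    operation changes the vector's L1 norm by at most 2, so this value
    never exceeds the true edit distance."""
    fs, ft = Counter(s), Counter(t)
    l1 = sum(abs(fs[c] - ft[c]) for c in set(fs) | set(ft))
    return (l1 + 1) // 2

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity_join(R, S, tau: int):
    """Filter with the cheap frequency bound, verify survivors with DP."""
    return [(r, s) for r in R for s in S
            if freq_lower_bound(r, s) <= tau and edit_distance(r, s) <= tau]

print(similarity_join(["kitten", "flaw"], ["sitting", "lawn", "flown"], tau=3))
```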
- Published
- 2017
- Full Text
- View/download PDF
50. Efficient String Matching Algorithm for Searching Large DNA and Binary Texts
- Author
-
Mohammed Arafah, Hassan Mathkour, and Abdulrakeeb M. Al-Ssulami
- Subjects
Theoretical computer science ,Computer Networks and Communications ,Computer science ,Hash function ,Commentz-Walter algorithm ,0102 computer and information sciences ,02 engineering and technology ,String searching algorithm ,01 natural sciences ,Rabin–Karp algorithm ,Hybrid algorithm ,law.invention ,010201 computation theory & mathematics ,law ,Binary data ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,String metric ,Boyer–Moore string search algorithm ,Information Systems - Abstract
Exact string matching is essential in application areas such as bioinformatics and intrusion detection systems. Speeding up the string matching algorithm will therefore accelerate the search process in DNA and binary data. Previously, two types of fast algorithms existed: bit-parallel-based algorithms and hashing algorithms. The bit-parallel-based algorithms are efficient when dealing with short patterns, of length less than 64, but slow on long patterns. On the other hand, hashing algorithms have an optimal sublinear average case on large alphabets and long patterns, but their efficiency is not as good on small alphabets such as DNA and binary texts. In this paper, the authors present a hybrid algorithm to overcome the shortcomings of those previous algorithms. The proposed algorithm is based on q-gram hashing while guaranteeing the maximal shift in advance. Experimental results on random texts and the complete human genome confirm that the proposed algorithm is efficient on various pattern lengths and small alphabets.
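A small q-gram shift search in the spirit of this algorithm family (a simplified sketch, not the paper's hybrid): shifts are precomputed from the pattern's q-grams, and each shift is safe by construction because any closer occurrence would have to contain the window-ending q-gram:

```python
def qgram_search(text: str, pattern: str, q: int = 3):
    """Shift-based exact search: the q-gram at the window end determines a
    safe shift precomputed from the pattern's q-grams."""
    m, n = len(pattern), len(text)
    assert m >= q
    default = m - q + 1                        # gram absent from the pattern
    shift = {}
    for i in range(m - q + 1):                 # rightmost occurrence wins
        shift[pattern[i:i + q]] = m - q - i    # the last gram gets shift 0
    hits = []
    pos = 0
    while pos + m <= n:
        gram = text[pos + m - q: pos + m]      # q-gram at the window end
        s = shift.get(gram, default)
        if s == 0:
            if text[pos:pos + m] == pattern:
                hits.append(pos)
            s = 1                              # conservative shift after a check
        pos += s
    return hits

print(qgram_search("GATTACAGATTTACAGATTACA", "GATTACA"))   # [0, 15]
```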
- Published
- 2017
- Full Text
- View/download PDF