Framework for syntactic string similarity measures
- Source :
- Expert Systems with Applications. 129:169-185
- Publication Year :
- 2019
- Publisher :
- Elsevier BV, 2019.
Abstract
- A similarity measure is an essential component of information retrieval, document clustering, text summarization, and question answering, among others. In this paper, we introduce a general framework of syntactic similarity measures for matching short text. We thoroughly analyze the measures by dividing them into three components: character-level similarity, string segmentation, and matching technique. Soft variants of the measures are also introduced. With the help of two existing toolkits (SecondString and SimMetric), we provide an open-source Java toolkit of the proposed framework, which integrates the individual components so that completely new combinations can be created. Experimental results reveal that the performance of the similarity measures depends on the type of dataset. For well-maintained datasets, using a token-level measure is important, but the basic (crisp) variant is usually enough. For uncontrolled datasets, where typing errors are expected, the soft variants of the token-level measures are necessary. Among all tested measures, a soft token-level measure that combines set matching and q-grams at the character level performs best. A gap between human perception and syntactic measures remains because the measures lack semantic analysis.
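The three components named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's Java toolkit and does not reproduce its exact definitions: the function names, the whitespace segmentation, the bigram Dice coefficient, the greedy one-to-one matching, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def qgrams(token, q=2):
    """Character-level component: the set of q-grams of a token (hypothetical helper)."""
    token = token.lower()
    if len(token) < q:
        return {token} if token else set()
    return {token[i:i + q] for i in range(len(token) - q + 1)}


def qgram_dice(a, b, q=2):
    """Dice coefficient over q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def soft_token_similarity(s1, s2, threshold=0.5, q=2):
    """Soft Jaccard-style set matching over whitespace tokens.

    Assumption: tokens are matched greedily one-to-one, and each matched
    pair contributes its q-gram similarity instead of a crisp 0 or 1.
    """
    # Segmentation component: split on whitespace.
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    if not t1 or not t2:
        return 0.0

    # Matching component: best-scoring pairs first, each token used once.
    pairs = sorted(
        ((qgram_dice(a, b, q), a, b) for a in t1 for b in t2),
        reverse=True,
    )
    used1, used2 = set(), set()
    soft_overlap = 0.0
    for sim, a, b in pairs:
        if sim < threshold:
            break
        if a in used1 or b in used2:
            continue
        used1.add(a)
        used2.add(b)
        soft_overlap += sim

    return soft_overlap / (len(t1) + len(t2) - soft_overlap)
```

With a typing error in one token, for example `soft_token_similarity("string similarty", "string similarity")`, the crisp token-level Jaccard score would be 1/3, while this soft variant gives the misspelled token partial credit and scores noticeably higher.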
- Subjects :
- 0209 industrial biotechnology
Matching (statistics)
Computer science
Semantic analysis (machine learning)
String (computer science)
General Engineering
02 engineering and technology
Similarity measure
Document clustering
Automatic summarization
Computer Science Applications
020901 industrial engineering & automation
Similarity (network science)
Artificial Intelligence
0202 electrical engineering, electronic engineering, information engineering
Question answering
020201 artificial intelligence & image processing
Artificial intelligence
String metric
Natural language processing
Details
- ISSN :
- 09574174
- Volume :
- 129
- Database :
- OpenAIRE
- Journal :
- Expert Systems with Applications
- Accession number :
- edsair.doi...........01d435989512422a210ffec969cc1e40
- Full Text :
- https://doi.org/10.1016/j.eswa.2019.03.048