101. Identifying XML Entities Via Virtual Keys
- Author
- Jing Tian
- Subjects
Computer science, computer.internet_protocol, Relational database, Efficient XML Interchange, XML validation, Context (language use), computer.file_format, computer.software_genre, Identification (information), Information gain ratio, Data mining, computer, XML, Data integration
- Abstract
Because data are acquired from many sources, a real-world entity commonly has multiple representations in different formats. Heterogeneity remains a prominent concern in data integration and data mining, and entity identification is an effective means of resolving duplicates. However, work on complex structures such as XML data has not been as extensive as the work carried out in the context of relational databases, despite its practical relevance. This article focuses on duplicate detection of XML elements based on the Sorted Neighbor Method (SNM) and Multi-Pass SNM (MPS). The XEIVK (XML Entities Identification via Virtual Keys) algorithm first homogenizes the structures by mapping labels to a template. Virtual keys are then created by extracting the content of nodes once their weights have been determined. The algorithm calculates the degree of textual similarity with an orchestrated function, within a set number of clusters, in terms of the nodes' information gain ratio. The experiments show that XEIVK significantly outperforms both SNM and MPS in precision, while its lower running time benefits from the filtering strategies.
- Published
- 2017