1. An improved method for classifying XML documents based on structure and content.
- Author
- Zhang Na, Zhang Dongzhan, Yu Ye, and Duan Jiangjiao
- Subjects
XML (Extensible Markup Language), DATA structures, DATA mining, DOCUMENT type definitions, PROOF theory -- Data processing, MEASURE theory, CLASSIFICATION, COMPUTER science
- Abstract
As more structured and semistructured data is stored and exchanged in XML format, XML mining is becoming increasingly important, and the classification of XML documents in particular is being studied more widely. To address a weakness of current methods that classify XML documents by structure and content, this paper presents an improved similarity measure called NM-Similarity, which maintains a high accuracy rate when XML documents are similar in structure but different in content. The measure is applied within the KNN (K-Nearest Neighbor) method for classification. The structural similarity between two XML documents is computed using Euclidean distance, and the content similarity is computed using the cosine measure. The method gives better results, and classifies more effectively, when classification depends mainly on content (that is, when the XML documents are created from the same DTD and have similar structure). The experiments show that when XML documents are similar in structure but different in content, NM-Similarity significantly improves classification accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2010
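
The abstract names three ingredients: Euclidean distance for structure, the cosine measure for content, and KNN for classification. The following is a minimal sketch of how those pieces could fit together; it is not the authors' NM-Similarity, whose actual formula the abstract does not give. The tag-path feature vectors, the 1 / (1 + distance) conversion, the weighted sum with `alpha`, and all function names are assumptions made for illustration.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def features(xml_text):
    """Return (tag-path counts, word counts) for one XML document.

    Assumption: structure is summarized by root-to-node tag paths and
    content by lowercase words found in element text (tail text is ignored).
    """
    paths, words = Counter(), Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths[path] += 1
        words.update(re.findall(r"[a-z]+", (node.text or "").lower()))
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths, words


def structure_similarity(a, b):
    """Euclidean distance between path-count vectors, mapped into (0, 1]."""
    dist = math.sqrt(sum((a[k] - b[k]) ** 2 for k in set(a) | set(b)))
    return 1.0 / (1.0 + dist)  # assumed conversion from distance to similarity


def content_similarity(a, b):
    """Cosine similarity between word-count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def combined_similarity(x, y, alpha=0.5):
    """Assumed weighted sum of structure and content similarity."""
    return alpha * structure_similarity(x[0], y[0]) + (1 - alpha) * content_similarity(
        x[1], y[1]
    )


def knn_classify(query, training, k=3, alpha=0.5):
    """Label `query` by majority vote among its k most similar training documents.

    `training` is a list of (xml_text, label) pairs.
    """
    q = features(query)
    scored = sorted(
        ((combined_similarity(q, features(doc), alpha), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical documents sharing one structure but differing in content.
training = [
    ("<paper><title>xml mining</title><body>classification of xml documents</body></paper>", "xml"),
    ("<paper><title>xml schema</title><body>xml documents and dtd</body></paper>", "xml"),
    ("<paper><title>protein folding</title><body>protein structure in cells</body></paper>", "bio"),
    ("<paper><title>cell biology</title><body>protein and cells</body></paper>", "bio"),
]
query = "<paper><title>xml classification</title><body>mining xml documents</body></paper>"
predicted = knn_classify(query, training, k=3)
```

Because every document in this toy set has the same tag paths, the structure term is identical for all of them and the vote is decided by the content term alone, which is the situation the abstract describes as the method's target case.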