SVM-based web content mining with leaf classification unit from DOM-tree

Authors :
Seungwoo Lee
Yeongsu Kim
Source :
KST
Publication Year :
2017
Publisher :
IEEE, 2017.

Abstract

To analyze a news article dataset, we first extract the important information, such as the title, date, and body paragraphs, while removing unnecessary content such as images, captions, footers, advertisements, navigation, and recommended news. The problem is that the formats of news articles change over time and also vary by news source and even by section within a source. It is therefore important for a model to generalize when predicting unseen article formats. Our experiments confirm that a machine-learning-based model predicts new data better than a rule-based model. We also suggest that noise within the body can be removed, because we define the classification unit as the leaf node itself. In contrast, typical machine-learning-based models cannot remove such noise: they take an intermediate node, consisting of a set of leaf nodes, as the classification unit, and so cannot classify an individual leaf node.
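
The leaf-level classification unit described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how text-bearing leaf nodes might be pulled from a DOM tree with Python's standard `html.parser`, so that each leaf can be labeled on its own (for example by an SVM). The names `LeafCollector`, `extract_leaves`, and `leaf_features`, as well as the feature set, are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LeafCollector(HTMLParser):
    """Collect text-bearing leaf nodes together with their ancestor tag path."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open tags
        self.leaves = []  # (tag path, text) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.path), text))


def extract_leaves(html):
    """Return every non-empty text leaf as a (tag path, text) pair."""
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def leaf_features(path, text):
    """Hypothetical per-leaf features; the paper's actual features are not stated here."""
    return {
        "depth": path.count("/") + 1 if path else 0,
        "parent": path.rsplit("/", 1)[-1],
        "n_chars": len(text),
        "n_words": len(text.split()),
    }
```

Because each leaf is a separate sample, a short advertisement string nested inside the article body receives its own label instead of inheriting the label of the enclosing block, which is the property the abstract attributes to leaf-level classification.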

Details

Database :
OpenAIRE
Journal :
2017 9th International Conference on Knowledge and Smart Technology (KST)
Accession number :
edsair.doi...........9beaf84a8507376642a9212cac6351ff