SVM-based web content mining with leaf classification unit from DOM-tree

Authors :: Seungwoo Lee
Yeongsu Kim
Source :: KST
Publication Year :: 2017
Publisher :: IEEE, 2017.
Abstract: To analyze a news article dataset, we first extract the important information, such as the title, the date, and the paragraphs of the body. At the same time, we remove unnecessary information such as images, captions, footers, advertisements, navigation, and recommended news. The difficulty is that news article formats change over time and vary by news source, and even by section within a source. A model must therefore generalize well when predicting on unseen article formats. Our experiments confirm that a machine-learning-based model predicts new data better than a rule-based model. We also suggest that noise inside the body can potentially be removed, because we define the classification unit as the leaf node itself. General machine-learning-based models, by contrast, cannot remove such noise: they take the classification unit to be an intermediate node made up of a set of leaf nodes, so they cannot classify an individual leaf node.
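
The leaf-level classification unit described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LeafCollector`, `extract_leaves`, and `leaf_features` are hypothetical, and the three features are assumptions chosen only for illustration, since the record does not list the paper's feature set. The sketch uses Python's standard `html.parser` to emit one record per text leaf; each record could then be turned into a feature vector and passed to an SVM.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LeafCollector(HTMLParser):
    """Collect each non-empty text leaf together with its ancestor tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.leaves = []  # (tag_path, text) pairs, one per text leaf

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def extract_leaves(html):
    """Return a list of (tag_path, text) pairs, one per text leaf."""
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def leaf_features(path, text):
    """Hypothetical per-leaf features (an assumption, not the paper's set)."""
    return {
        "depth": path.count("/") + 1 if path else 0,  # number of ancestors
        "length": len(text),                          # characters in the leaf
        "parent": path.rsplit("/", 1)[-1],            # immediate parent tag
    }
```

Because two sibling `p` elements under the same `div` become two separate leaves, a classifier working at this level can keep a body paragraph and discard an advertisement next to it, which a classifier working on the enclosing `div` could not do.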

Subjects :: 060201 languages & linguistics
Computer science
Feature extraction
06 humanities and the arts
02 engineering and technology
computer.software_genre
Data modeling
Support vector machine
Set (abstract data type)
Web mining
0602 languages and literature
Node (computer science)
0202 electrical engineering, electronic engineering, information engineering
020201 artificial intelligence & image processing
Data mining
Noise (video)
Document Object Model
computer

Database :: OpenAIRE
Journal :: 2017 9th International Conference on Knowledge and Smart Technology (KST)
Accession number :: edsair.doi...........9beaf84a8507376642a9212cac6351ff
