Content extraction based on statistic and position relationship between title and content

Authors :
Pingping Xu
Chencheng Yang
Mingdong Li
Source :
ICCC
Publication Year :
2014
Publisher :
IEEE, 2014.

Abstract

Web page content extraction is a fundamental step in data mining applications, supplying a clean data source with little noise. An original web page is mixed with noise because it is embedded with content-irrelevant information such as JavaScript and advertisements. The purity of the data affects the downstream application. Consequently, this paper proposes a web information extraction model based on the statistical and positional relationship between the title and the content. Exact localization of the title improves the precision of content extraction, and conversely, accurately extracted content provides positive feedback that helps ensure the right title is extracted. First, each text node is compared with the text taken from the title tag to obtain a similarity score. The final score of each node is then obtained by adding its node attribute score. The node with the highest score is regarded as the current title. According to the position of the title, the scope of the main content, which is distributed after the title, is narrowed. With the help of statistical information about the web page, the DOM tree is then traversed to obtain the content contained in the node with maximal weight. Experimental results show that the algorithm performs much better than previous extraction rules and is applicable to extracting main content from web pages.
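
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the similarity measure (`difflib.SequenceMatcher`), the heading bonuses in `TAG_BONUS`, the choice of container tags, and the weight (count of non-link characters) are all assumptions made for illustration. The feedback from extracted content back to title selection is not modelled.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

SKIP = {"script", "style", "noscript"}            # noise, never text nodes
VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag
CONTAINERS = {"div", "article", "section", "td", "body"}  # assumed block units
TAG_BONUS = {"h1": 0.3, "h2": 0.2, "h3": 0.1}     # assumed node attribute scores


class TextNodeCollector(HTMLParser):
    """Collects visible text nodes in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open elements as (tag, element_id)
        self.page_title = ""  # text of the <title> tag
        self.nodes = []       # (tag, container_id, text, inside_link)
        self._next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self._next_id += 1
            self.stack.append((tag, self._next_id))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        tags = [t for t, _ in self.stack]
        if not text or not tags or any(t in SKIP for t in tags):
            return
        if tags[-1] == "title":
            self.page_title = text
            return
        container = next(
            (eid for t, eid in reversed(self.stack) if t in CONTAINERS), 0)
        self.nodes.append((tags[-1], container, text, "a" in tags))


def extract(html):
    """Return (title, main_content) for an HTML string."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    nodes = parser.nodes
    if not nodes:
        return "", ""

    def title_score(node):
        tag, _, text, _ = node
        similarity = SequenceMatcher(
            None, text.lower(), parser.page_title.lower()).ratio()
        return similarity + TAG_BONUS.get(tag, 0.0)

    # Step 1: the highest-scoring text node is taken as the title.
    title_index = max(range(len(nodes)), key=lambda i: title_score(nodes[i]))

    # Step 2: only nodes after the title are candidates for main content.
    # Step 3: weight each container by its amount of non-link text.
    weights, texts = {}, {}
    for _, container, text, inside_link in nodes[title_index + 1:]:
        if inside_link:
            continue
        weights[container] = weights.get(container, 0) + len(text)
        texts.setdefault(container, []).append(text)
    if not weights:
        return nodes[title_index][2], ""
    best = max(weights, key=weights.get)
    return nodes[title_index][2], " ".join(texts[best])


SAMPLE_PAGE = """<html><head><title>Solar power grows - Example News</title>
<script>var x = "Solar power grows";</script></head><body>
<div id="nav"><a href="/">Home</a> <a href="/w">World</a></div>
<div id="main"><h1>Solar power grows</h1>
<p>Solar capacity rose sharply last year.</p>
<p>Analysts expect the trend to continue.</p></div>
<div id="footer">Copyright Example News</div>
</body></html>"""

if __name__ == "__main__":
    print(extract(SAMPLE_PAGE))
```

On the hypothetical `SAMPLE_PAGE`, the `h1` node is selected as the title because it is both similar to the title tag text and carries the heading bonus, and the two paragraphs that follow it are returned as the main content, while the navigation links and the footer are left out.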

Details

Database :
OpenAIRE
Journal :
2014 IEEE/CIC International Conference on Communications in China (ICCC)
Accession number :
edsair.doi...........d3dd38efb3486d793c3c45b431f2f05d
Full Text :
https://doi.org/10.1109/iccchina.2014.7008295