Content extraction based on statistic and position relationship between title and content
- Source :
- ICCC
- Publication Year :
- 2014
- Publisher :
- IEEE, 2014.
Abstract
- Web page content extraction is a fundamental step in data mining applications, supplying a clean data source with little noise. An original web page is embedded with content-irrelevant information, such as JavaScript and advertisements, that acts as noise. Because the purity of the data affects downstream applications, this paper proposes a web information extraction model based on the statistical and positional relationship between the title and the content. Locating the title exactly improves the precision of content extraction and, conversely, accurately extracted content provides positive feedback that helps ensure the right title is extracted. First, each text node is compared with the content of the title tag to obtain a similarity score. The final score of each node is obtained by adding in its node attribute score, and the node with the highest score is taken as the current title. Based on the position of the title, the scope of the main content, which is distributed after the title, is narrowed. Using statistical information about the web page, the DOM tree is then traversed to obtain the content contained in the node with maximal weight. Experimental results show that the algorithm performs much better than previous extraction rules and is applicable to extracting main content from web pages.
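The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the similarity measure (`difflib.SequenceMatcher`), the per-tag bonus values in `TAG_BONUS`, and the use of text length as the "weight" of a node are all assumptions made for illustration, as are the names `TextNodes`, `extract`, and `SAMPLE`.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumed attribute scores; the abstract does not list the real values.
TAG_BONUS = {"h1": 0.3, "h2": 0.2, "h3": 0.1}
SKIP = {"script", "style", "title"}           # text inside these is never content
VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag


class TextNodes(HTMLParser):
    """Collect (tag, text) pairs in document order, plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags
        self.page_title = ""   # text of the <title> tag
        self.nodes = []        # candidate text nodes

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop until the matching open tag has been removed.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        if self.stack[-1] == "title":
            self.page_title = text
        elif not SKIP.intersection(self.stack):
            self.nodes.append((self.stack[-1], text))


def extract(html):
    """Return (title, content) guessed from an HTML string."""
    parser = TextNodes()
    parser.feed(html)
    nodes = parser.nodes
    if not nodes:
        return "", ""

    def title_score(node):
        tag, text = node
        # Similarity to the <title> text plus an assumed attribute bonus.
        sim = SequenceMatcher(None, text.lower(), parser.page_title.lower()).ratio()
        return sim + TAG_BONUS.get(tag, 0.0)

    title_idx = max(range(len(nodes)), key=lambda i: title_score(nodes[i]))
    # Narrow the search to nodes after the title, then pick the "heaviest" one.
    # Text length stands in for the statistical weight used in the paper.
    after = nodes[title_idx + 1:]
    content = max(after, key=lambda n: len(n[1]), default=("", ""))
    return nodes[title_idx][1], content[1]


SAMPLE = (
    "<html><head><title>Cats are great - Example News</title>"
    "<script>var x = 1;</script></head><body>"
    "<div>Home | About</div><h1>Cats are great</h1>"
    "<p>Cats are great companions because they are quiet, clean and "
    "independent animals.</p><div>Ad: buy now</div></body></html>"
)

if __name__ == "__main__":
    title, content = extract(SAMPLE)
    print("title:  ", title)
    print("content:", content)
```

The sketch scores flat text nodes rather than DOM subtrees, so it only hints at the tree traversal the abstract describes; a fuller version would aggregate weights over parent nodes.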
Details
- Database :
- OpenAIRE
- Journal :
- 2014 IEEE/CIC International Conference on Communications in China (ICCC)
- Accession number :
- edsair.doi...........d3dd38efb3486d793c3c45b431f2f05d
- Full Text :
- https://doi.org/10.1109/iccchina.2014.7008295