Content extraction based on statistic and position relationship between title and content

Authors :: Pingping Xu
Chencheng Yang
Mingdong Li
Source :: ICCC
Publication Year :: 2014
Publisher :: IEEE, 2014.
Abstract: Web page content extraction is a fundamental step in data mining applications, supplying a clean data source with little noise. The original web page is full of embedded content-irrelevant information, such as JavaScript and advertisements, and is therefore mixed with noise. The purity of the data makes a difference in applications. Consequently, this paper proposes a web information extraction model based on the statistical and positional relationship between the title and the content. Exact localization of the title improves the precision of content extraction and, conversely, accurately extracted content provides positive feedback that helps ensure the right title is extracted. First, each text node is compared with the text taken from the title tag to obtain a similarity score. The final score of each individual node is obtained by summing up its node attribute score. The node with the highest score is regarded, for the time being, as the correct title. Based on the position of the title, the scope of the main content, which is distributed after the title, is narrowed. Using statistical information about the web page, the DOM tree is then traversed to obtain the content contained in the node with maximal weight. Experimental results show that the algorithm performs much better than previous extraction rules and is applicable to extracting main content from web pages.
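
The two stages described in the abstract (title localization by similarity scoring, then content selection after the title) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the heading bonus values, the use of difflib's similarity ratio, the set of container tags, and the weight (non-link text length per container) are all assumptions chosen for illustration, as are the names TextNodeCollector, locate_title, extract_content, and extract.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# All constants below are illustrative assumptions, not values from the paper.
HEADING_BONUS = {"h1": 0.3, "h2": 0.2, "h3": 0.1}  # stand-in "node attribute score"
SKIPPED = {"script", "style", "noscript"}
VOID = {"br", "hr", "img", "input", "link", "meta"}
CONTAINERS = {"article", "body", "div", "section", "td"}


class TextNodeCollector(HTMLParser):
    """Collect text nodes in document order, plus the text of the <title> tag."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open elements as (tag, element_id)
        self.next_id = 0
        self.page_title = ""
        self.nodes = []       # (bonus, text, container_id, in_link)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.next_id))
            self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in [t for t, _ in self.stack]:
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        open_tags = [t for t, _ in self.stack]
        if not text or not open_tags or SKIPPED.intersection(open_tags):
            return
        if open_tags[-1] == "title":
            self.page_title += text
            return
        bonus = max(HEADING_BONUS.get(t, 0.0) for t in open_tags)
        container = next(
            (i for t, i in reversed(self.stack) if t in CONTAINERS), None
        )
        self.nodes.append((bonus, text, container, "a" in open_tags))


def locate_title(page_title, nodes):
    """Return the index of the text node scoring highest against <title>."""
    scores = [
        SequenceMatcher(None, page_title.lower(), text.lower()).ratio() + bonus
        for bonus, text, _, _ in nodes
    ]
    return max(range(len(nodes)), key=scores.__getitem__)


def extract_content(nodes, title_index):
    """Among nodes after the title, return the text of the heaviest container."""
    after_title = nodes[title_index + 1:]
    weights = {}
    for _, text, container, in_link in after_title:
        if container is not None and not in_link:
            weights[container] = weights.get(container, 0) + len(text)
    if not weights:
        return ""
    best = max(weights, key=weights.get)
    return " ".join(t for _, t, c, _ in after_title if c == best)


def extract(html):
    """Return (title, content) for an HTML string with at least one text node."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    index = locate_title(collector.page_title, collector.nodes)
    return collector.nodes[index][1], extract_content(collector.nodes, index)


SAMPLE = """<html><head><title>Solar power grows - Example News</title>
<script>var x = 1;</script></head><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main"><h1>Solar power grows</h1>
<p>Solar capacity rose sharply last year.</p>
<p>Analysts expect the trend to continue.</p></div>
<div id="footer">Copyright Example News</div>
</body></html>"""

title, content = extract(SAMPLE)
```

Restricting the candidate nodes to those after the located title is what keeps the navigation links out of the content weights in this sketch; the feedback from extracted content back to title selection that the abstract mentions is not modeled here.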

Subjects :: Information retrieval
Computer science
Web page
Node (computer science)
Data mining
Noise (video)
JavaScript
Document Object Model
Statistic

Details

Database :: OpenAIRE
Journal :: 2014 IEEE/CIC International Conference on Communications in China (ICCC)
Accession number :: edsair.doi...........d3dd38efb3486d793c3c45b431f2f05d
Full Text :: https://doi.org/10.1109/iccchina.2014.7008295
