Boilerplate Removal and Content Extraction from Dynamic Web Pages

Authors :: Pan Ei San
Source :: International Journal of Computer Science, Engineering and Applications. 4:27-33
Publication Year :: 2014
Publisher :: Academy and Industry Research Collaboration Center (AIRCC), 2014.
Abstract: Web pages contain not only main content but also other elements such as navigation panels, advertisements and links to related documents. To ensure the high quality of web page content, a good boilerplate removal algorithm is needed to extract only the relevant content from a web page. The main textual content is contained in the HTML source code that makes up the page files. The goal of content extraction, or boilerplate detection, is to separate the main content from navigation chrome, advertising blocks, and copyright notices in web pages. The system removes boilerplate and extracts the main content in two phases: a Feature Extraction phase and a Clustering phase. It classifies each part of an HTML web page as noise or content. The content extraction algorithm is designed to achieve high performance without parsing DOM trees. Based on the observation of HTML tags that one line may not contain a complete piece of information and that long texts are distributed over nearby lines, the system uses the Line-Block concept to determine the distance between any two neighboring lines with text, and uses features such as text-to-tag ratio (TTR), anchor text-to-text ratio (ATTR) and a new content feature, Title Keywords Density (TKD), to distinguish noise from content. After extracting the features, the system uses them as parameters in a threshold method to classify each block as content or non-content.
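
The pipeline outlined in the abstract (line blocks, three features, threshold classification) could be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the abstract does not give the feature formulas or threshold values, so the definitions of TTR, ATTR and TKD below, the function names, the regex-based tag handling, and the parameters `max_distance`, `min_ttr`, `max_attr` and `min_tkd` are all assumptions chosen for the example.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
WORD_RE = re.compile(r"[a-z0-9]+")


def strip_tags(html):
    """Return the visible text of an HTML fragment (no DOM tree is built)."""
    return TAG_RE.sub("", html).strip()


def line_blocks(lines, max_distance=2):
    """Merge text-bearing lines into blocks (assumed Line-Block rule).

    Two neighboring lines with text join the same block when their line
    numbers differ by at most max_distance (a hypothetical parameter).
    """
    blocks, current, last = [], [], None
    for i, line in enumerate(lines):
        if not strip_tags(line):
            continue  # lines without text never start or extend a block
        if last is not None and i - last > max_distance:
            blocks.append(current)
            current = []
        current.append(line)
        last = i
    if current:
        blocks.append(current)
    return ["\n".join(block) for block in blocks]


def features(block, title):
    """Compute (TTR, ATTR, TKD) under the assumed definitions.

    TTR  = text characters / number of tags
    ATTR = anchor-text characters / text characters
    TKD  = share of block words that also occur in the page title
    """
    text = strip_tags(block)
    n_tags = len(TAG_RE.findall(block))
    anchor_text = "".join(strip_tags(a) for a in ANCHOR_RE.findall(block))
    title_words = set(WORD_RE.findall(title.lower()))
    block_words = WORD_RE.findall(text.lower())
    ttr = len(text) / max(n_tags, 1)
    attr = len(anchor_text) / max(len(text), 1)
    tkd = sum(w in title_words for w in block_words) / max(len(block_words), 1)
    return ttr, attr, tkd


def is_content(block, title, min_ttr=20.0, max_attr=0.3, min_tkd=0.1):
    """Threshold rule (assumed): few links, and either dense text or title words."""
    ttr, attr, tkd = features(block, title)
    return attr <= max_attr and (ttr >= min_ttr or tkd >= min_tkd)


def extract_main_content(lines, title):
    """Return the text of every block classified as content."""
    return [strip_tags(b) for b in line_blocks(lines) if is_content(b, title)]
```

Calling `extract_main_content(html.splitlines(), page_title)` on a page's source would return the text of the blocks that pass the thresholds. In practice the thresholds would have to be tuned on labeled pages, and the Clustering phase mentioned in the abstract is not modeled here.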

Subjects :: Parsing
Source code
Information retrieval
Computer science
Feature extraction
Dynamic web page
HTML element
Boilerplate text
Web page
Cluster analysis

Details

ISSN :: 2230-9616 and 2231-0088
Volume :: 4
Database :: OpenAIRE
Journal :: International Journal of Computer Science, Engineering and Applications
Accession number :: edsair.doi...........06be0b18336349883086a29a1389e1d1
