Title identification of web article pages using HTML and visual features

Authors :
Jian Fan
Parag Joshi
Ping Luo
Source :
SPIE Proceedings.
Publication Year :
2011
Publisher :
SPIE, 2011.

Abstract

Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not easy, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground-truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
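
The abstract names four features (the HTML title field, the HTML tag of a DOM node, font size, and horizontal alignment) but does not say how they are combined. The following is a minimal sketch of one way such features could be scored together, not the authors' method. The `Candidate` class, the `TAG_WEIGHT` table, the feature weights, the `content_left` parameter, and the way alignment is measured are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A DOM text node that might be the article title (hypothetical structure)."""
    text: str         # visible text of the node
    tag: str          # HTML tag name, e.g. "h1"
    font_size: float  # rendered font size in pixels
    left: float       # left edge of the rendered box in pixels


# Assumed tag weights: heading tags are treated as more title-like.
TAG_WEIGHT = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


def score(c, title_field, max_font, content_left, page_width):
    """Combine the four features into one score using assumed weights."""
    # Similarity between the node text and the HTML <title> field.
    text_sim = SequenceMatcher(None, c.text.lower(), title_field.lower()).ratio()
    tag_score = TAG_WEIGHT.get(c.tag.lower(), 0.0)
    font_score = c.font_size / max_font if max_font > 0 else 0.0
    # Assumption: a title tends to share its left edge with the article body.
    offset = abs(c.left - content_left) / page_width if page_width > 0 else 1.0
    align_score = 1.0 - min(offset, 1.0)
    return 0.4 * text_sim + 0.2 * tag_score + 0.3 * font_score + 0.1 * align_score


def identify_title(candidates, title_field, content_left, page_width):
    """Return the highest-scoring candidate."""
    max_font = max(c.font_size for c in candidates)
    return max(
        candidates,
        key=lambda c: score(c, title_field, max_font, content_left, page_width),
    )


def font_size_baseline(candidates):
    """Baseline in the spirit of the abstract: pick the largest font."""
    return max(candidates, key=lambda c: c.font_size)


if __name__ == "__main__":
    page = [
        Candidate("Example News", "div", 40.0, 0.0),  # site banner
        Candidate("Storm hits coast as thousands evacuate", "h1", 28.0, 200.0),
        Candidate("Related stories", "h3", 16.0, 900.0),
    ]
    title_field = "Storm hits coast as thousands evacuate | Example News"
    print(identify_title(page, title_field, content_left=200.0, page_width=1200.0).text)
    print(font_size_baseline(page).text)
```

On this made-up page the site banner has the largest font, so a font-size-only baseline selects it, while the combined score favors the headline because it also matches the title field, uses a heading tag, and lines up with the article body.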

Details

ISSN :
0277-786X
Database :
OpenAIRE
Journal :
SPIE Proceedings
Accession number :
edsair.doi...........d6077d1cac57cda088098afe72fb07a6
Full Text :
https://doi.org/10.1117/12.876708