Title identification of web article pages using HTML and visual features

Authors :: Jian Fan
Parag Joshi
Ping Luo
Source :: SPIE Proceedings.
Publication Year :: 2011
Publisher :: SPIE, 2011.
Abstract: Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground-truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
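
The abstract names four features (HTML title field, HTML tag, font size, horizontal alignment) but does not say how they are combined. The following is only an illustrative sketch of one way such features could be scored together and compared against a font-size-only baseline. The `Candidate` type, the `TAG_SCORES` table, the weights, and the pixel values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A rendered DOM text node (hypothetical structure for illustration)."""
    text: str
    tag: str          # HTML tag name, e.g. "h1" or "div"
    font_size: float  # rendered font size in pixels
    center_x: float   # horizontal center of the node's box in pixels


# Assumed tag preferences; the paper's actual treatment of tags is not given.
TAG_SCORES = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


def title_similarity(text: str, html_title: str) -> float:
    """Similarity in [0, 1] between node text and the page's <title> field."""
    return SequenceMatcher(None, text.lower(), html_title.lower()).ratio()


def score(c: Candidate, html_title: str, max_font: float,
          body_center_x: float, page_width: float) -> float:
    """Weighted sum of the four features; the weights are arbitrary assumptions."""
    sim = title_similarity(c.text, html_title)
    tag = TAG_SCORES.get(c.tag.lower(), 0.0)
    font = c.font_size / max_font if max_font > 0 else 0.0
    # Alignment: 1.0 when centered on the article body, lower when offset.
    align = 1.0 - min(abs(c.center_x - body_center_x) / page_width, 1.0)
    return 0.4 * sim + 0.2 * tag + 0.3 * font + 0.1 * align


def pick_title(cands: list[Candidate], html_title: str,
               body_center_x: float, page_width: float) -> Candidate:
    """Return the highest-scoring candidate."""
    max_font = max(c.font_size for c in cands)
    return max(cands, key=lambda c: score(c, html_title, max_font,
                                          body_center_x, page_width))


def pick_by_font_only(cands: list[Candidate]) -> Candidate:
    """Baseline: the node with the largest font size."""
    return max(cands, key=lambda c: c.font_size)


# Made-up page: a large site banner competes with the real headline.
candidates = [
    Candidate("Example News", "div", 36.0, 150.0),
    Candidate("Storm closes coastal roads", "h1", 28.0, 400.0),
    Candidate("By A. Reporter", "p", 12.0, 400.0),
]
html_title = "Storm closes coastal roads - Example News"

baseline = pick_by_font_only(candidates)
combined = pick_title(candidates, html_title,
                      body_center_x=400.0, page_width=1000.0)
print("font-size baseline:", baseline.text)
print("combined features: ", combined.text)
```

In this made-up page the site banner has the largest font, so a font-size-only rule selects it, while the combined score favors the headline because it matches the HTML title field, uses an `h1` tag, and is aligned with the article body.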