Title identification of web article pages using HTML and visual features
- Source :
- SPIE Proceedings.
- Publication Year :
- 2011
- Publisher :
- SPIE, 2011.
Abstract
- Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
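
As a rough illustration of how the four features named in the abstract might be combined, the following is a minimal sketch and not the authors' method. The `Candidate` class, the tag scores, the weights, and the linear scoring rule are all assumptions made for this example; the abstract does not say how the features are actually weighted or combined.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Candidate:
    """A hypothetical text block taken from a rendered page."""
    text: str
    tag: str          # HTML tag of the DOM node, e.g. "h1"
    font_size: float  # rendered font size in pixels
    centered: bool    # True if the block is horizontally centered


# Assumed tag scores and feature weights, chosen only for illustration.
TAG_SCORE = {"h1": 1.0, "h2": 0.7, "h3": 0.4}
W_TITLE_FIELD, W_TAG, W_FONT, W_ALIGN = 0.4, 0.2, 0.3, 0.1


def score(c: Candidate, page_title: str, max_font: float) -> float:
    """Combine the four features into one score (assumed linear form)."""
    # Similarity between the block text and the HTML <title> field.
    similarity = SequenceMatcher(None, c.text.lower(), page_title.lower()).ratio()
    tag = TAG_SCORE.get(c.tag.lower(), 0.0)
    font = c.font_size / max_font if max_font > 0 else 0.0
    align = 1.0 if c.centered else 0.0
    return W_TITLE_FIELD * similarity + W_TAG * tag + W_FONT * font + W_ALIGN * align


def identify_title(cands: List[Candidate], page_title: str) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if there are none."""
    if not cands:
        return None
    max_font = max(c.font_size for c in cands)
    return max(cands, key=lambda c: score(c, page_title, max_font))


def font_size_baseline(cands: List[Candidate]) -> Optional[Candidate]:
    """Baseline in the spirit of the abstract: pick the largest font only."""
    return max(cands, key=lambda c: c.font_size) if cands else None
```

On a page where a site banner is set in the largest font, `font_size_baseline` would select the banner, whereas `identify_title` can still prefer the headline because the title field and tag features outweigh the font-size term under these assumed weights.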
Details
- ISSN :
- 0277786X
- Database :
- OpenAIRE
- Journal :
- SPIE Proceedings
- Accession number :
- edsair.doi...........d6077d1cac57cda088098afe72fb07a6
- Full Text :
- https://doi.org/10.1117/12.876708