Leveraging HTML in Free Text Web Named Entity Recognition
- Source :
- COLING
- Publication Year :
- 2020
- Publisher :
- International Committee on Computational Linguistics, 2020.
Abstract
- HTML tags are typically discarded in free text Named Entity Recognition from Web pages. We investigate whether these discarded tags might be used to improve NER performance. We compare Text+Tags sentences with their Text-Only equivalents, over five datasets, two free text segmentation granularities and two NER models. We find an increased F1 performance for Text+Tags of between 0.9% and 13.2% over all datasets, variants and models. This performance increase, over datasets of varying entity types, HTML density and construction quality, indicates our method is flexible and adaptable. These findings imply that a similar technique might be of use in other Web-aware NLP tasks, including the enrichment of deep language models.
- Subjects :
- business.industry
Computer science
media_common.quotation_subject
0102 computer and information sciences
02 engineering and technology
computer.software_genre
01 natural sciences
HTML element
Named-entity recognition
010201 computation theory & mathematics
020204 information systems
Web page
0202 electrical engineering, electronic engineering, information engineering
Text messaging
Quality (business)
Artificial intelligence
Language model
business
computer
Natural language processing
media_common
Details
- Database :
- OpenAIRE
- Journal :
- Proceedings of the 28th International Conference on Computational Linguistics
- Accession number :
- edsair.doi...........5342b366ae3ccb589d39041c33793e5c