Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics
- Source :
- Batista-Navarro, R, Rak, R & Ananiadou, S 2015, 'Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics', Journal of Cheminformatics, vol. 7, no. Suppl 1, S6. https://doi.org/10.1186/1758-2946-7-S1-S6
- Publication Year :
- 2015
- Publisher :
- BioMed Central, 2015.
- Abstract :
- Background: The development of robust methods for chemical named entity recognition, a challenging natural language processing task, was previously hindered by the lack of publicly available, large-scale, gold standard corpora. The recent public release of a large chemical entity-annotated corpus as a resource for the CHEMDNER track of the Fourth BioCreative Challenge Evaluation (BioCreative IV) workshop greatly alleviated this problem and allowed us to develop a conditional random fields-based chemical entity recogniser. To optimise its performance, we customised several aspects of our solution: the selection of specialised pre-processing analytics, the incorporation of chemistry knowledge-rich features in the training and application of the statistical model, and the addition of post-processing rules.
Results: Our evaluation shows that optimal performance is obtained when our customisations are integrated into the chemical entity recogniser. When its performance is compared with that of state-of-the-art methods under comparable experimental settings, our solution achieves a competitive advantage. We also show that our recogniser, using a model trained on the CHEMDNER corpus, is suitable for recognising names in a wide range of corpora, consistently outperforming two popular chemical NER tools.
Conclusion: The contributions resulting from this work are two-fold. Firstly, we present the details of a chemical entity recognition methodology whose demonstrated performance is competitive with, if not superior to, that of state-of-the-art methods. Secondly, the developed suite of solutions has been made publicly available as a configurable workflow in the interoperable text mining workbench Argo. This allows interested users to conveniently apply and evaluate our solutions in the context of other chemical text mining tasks.
- Subjects :
- Text mining
Sequence labelling
Feature engineering
Research
Configurable workflows
Physical and Theoretical Chemistry
Library and Information Sciences
Conditional random fields
Chemical named entity recognition
Computer Graphics and Computer-Aided Design
Workflow optimisation
Computer Science Applications
Details
- Language :
- English
- ISSN :
- 1758-2946
- Volume :
- 7
- Issue :
- Suppl 1
- Database :
- OpenAIRE
- Journal :
- Journal of Cheminformatics
- Accession number :
- edsair.pmid.dedup....c59d6db4d3f091b92b3ee2fdf60c39c1
- Full Text :
- https://doi.org/10.1186/1758-2946-7-S1-S6