Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles
- Source :
- BMC Bioinformatics, Vol 18, Iss 1, Pp 1-14 (2017)
- Publication Year :
- 2017
- Publisher :
- Springer Science and Business Media LLC, 2017.
Abstract
- Background: Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. To better understand the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations.
Results: The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached an F-measure of 0.46. Following the IDENTITY chains in the data would add 106,263 additional named entities in the full 97-paper corpus, an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus.
Conclusions: The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. The work raised issues in the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain, because they refer to specific classes in domain-specific ontologies. The comparison of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results that are consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain, and that also suggest that the baseline performance difference is quite large.
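The abstract reports system performance as a B3 F-measure. As a rough illustration of what that metric measures, the following is a minimal sketch of the standard B-cubed definition (per-mention precision and recall, averaged over mentions). It is not the scorer used in the paper. The function names and the toy mentions are hypothetical, and the sketch assumes that the gold and system chains cover exactly the same set of mentions, which real scorers do not require.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Tuple


def _mention_to_chain(chains: Iterable[Iterable[Hashable]]) -> Dict[Hashable, FrozenSet]:
    """Map every mention to the (frozen) chain that contains it."""
    index: Dict[Hashable, FrozenSet] = {}
    for chain in chains:
        members = frozenset(chain)
        for mention in members:
            index[mention] = members
    return index


def b_cubed(gold, system) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the B-cubed metric.

    Assumption (for illustration only): both partitions cover the
    same mentions; otherwise a ValueError is raised.
    """
    gold_of = _mention_to_chain(gold)
    sys_of = _mention_to_chain(system)
    if set(gold_of) != set(sys_of):
        raise ValueError("gold and system must cover the same mentions")
    n = len(gold_of)
    if n == 0:
        return 0.0, 0.0, 0.0
    # Per-mention overlap between the gold chain and the system chain.
    precision = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in gold_of) / n
    recall = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in gold_of) / n
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with hypothetical mention identifiers.
gold_chains = [{"a", "b", "c"}, {"d", "e"}]
system_chains = [{"a", "b"}, {"c", "d", "e"}]
p, r, f = b_cubed(gold_chains, system_chains)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In this toy case one mention is attached to the wrong chain, so both precision and recall drop below 1.0 even though most links are correct.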
- Subjects :
- 0301 basic medicine
Computer science
Annotation
02 engineering and technology
lcsh:Computer applications to medicine. Medical informatics
Corpus
Semantics
Referent
Biochemistry
Domain (software engineering)
03 medical and health sciences
Coreference
Structural Biology
Anaphora
0202 electrical engineering, electronic engineering, information engineering
Data Mining
lcsh:QH301-705.5
Molecular Biology
Information retrieval
Applied Mathematics
Noun phrase
Computer Science Applications
Benchmarking
Information extraction
030104 developmental biology
lcsh:Biology (General)
Identity (object-oriented programming)
lcsh:R858-859.7
020201 artificial intelligence & image processing
Artificial intelligence
Periodicals as Topic
Resolution
Natural language processing
Research Article
Details
- ISSN :
- 14712105
- Volume :
- 18
- Database :
- OpenAIRE
- Journal :
- BMC Bioinformatics
- Accession number :
- edsair.doi.dedup.....d8145eefda76dc2e09f972e0ad9e4421
- Full Text :
- https://doi.org/10.1186/s12859-017-1775-9