Extraction of data events from the computational biology literature

Authors :
Albahlal, Manal
Nenadic, Goran
Stevens, Robert
Publication Year :
2022
Publisher :
University of Manchester, 2022.

Abstract

With the current rate of research activities, it is widely accepted that scientists face the challenge of keeping up to date with new findings, even within a sub-field of a discipline. This difficulty extends to the methods used in the research. Understanding reported methods gives us confidence that the findings have resulted from an appropriate, rigorous and sound scientific process. However, the modern dynamic of science is also characterised by ever-changing methods, so scientists need to be able to learn about new ones and identify the common or most appropriate methods to use in a given situation. One of the best sources of information about methods is the scientific literature. In this thesis, we developed a computational model to automatically represent the text that describes reported methods as an abstract method workflow. We focus on computational sciences, which centre on data processing. Specifically, we consider data events as a representation of processes and changes that happen to data. A data event contains the main components of each step in computational experiments, such as input/output data, processes and operations on data, databases where the data is stored, and software and tools that are used in these processes. An abstract method workflow then models relationships between data events, ordering them in a way that represents the methodology as reported in the literature. This thesis introduces ODNoRFlow, a text mining method that extracts and represents an abstract method workflow from the Methods section of a publication. It relies on a hybrid text mining approach (ODNoR) that combines machine learning and a rule-based method to recognise data event components, normalise them to existing ontologies and identify the links and relations between them.
Specifically, we fine-tuned a pre-trained transformer model (BioBERT) to extract mentions of data and operations, and used an existing named entity recognition system (bioNerDS) to extract software and database mentions. Mentions were normalised to the EDAM ontology. We used a combination of syntactic rules and a pre-trained attention-based BiLSTM model to identify relations and links between components, and considered whether an automated discourse analysis tool can be used to improve the outcomes. We used the microarray analysis literature as a case study to demonstrate the feasibility of the proposed approaches. At the data event level, the approach achieved F-scores for the identification and normalisation of components between 78% (for data) and 92% (for operations), whereas the relationship extraction F-scores were between 62% and 92.5%. At the workflow level, we manually analysed automatically reconstructed workflows from 25 papers, with F-scores between 61% and 93.5%. We also applied ODNoRFlow to a large corpus of the microarray analysis literature to identify and analyse the distribution of data event components, the differences in their usage and the associations between them. Overall, the thesis provides a new computational framework that contributes to the automated extraction, representation and analysis of methods used in the computational biology literature.
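
To make the notions of a data event and an abstract method workflow more concrete, the following is a minimal sketch in Python. It is not the ODNoRFlow implementation: the class name DataEvent, its field names, the function order_events and the example values are all hypothetical, chosen only to mirror the components listed in the abstract (input/output data, operation, software, databases). The ordering rule used here (an event follows the event that produces its input) is a simplifying assumption; the thesis instead derives links from syntactic rules and a BiLSTM relation model.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class DataEvent:
    """One step of a computational method (hypothetical field names)."""
    operation: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    software: list[str] = field(default_factory=list)
    databases: list[str] = field(default_factory=list)


def order_events(events: list[DataEvent]) -> list[DataEvent]:
    """Order events so that each producer precedes its consumers.

    Assumption (not from the thesis): event B depends on event A
    whenever one of B's inputs is one of A's outputs.
    """
    # Map each output data item to the index of the event producing it.
    producer = {out: i for i, ev in enumerate(events) for out in ev.outputs}

    # Build a dependency graph: node -> set of predecessor nodes.
    graph: dict[int, set[int]] = {i: set() for i in range(len(events))}
    for i, ev in enumerate(events):
        for item in ev.inputs:
            j = producer.get(item)
            if j is not None and j != i:
                graph[i].add(j)

    # static_order() yields predecessors before the nodes that need them.
    return [events[i] for i in TopologicalSorter(graph).static_order()]


# Illustrative events, deliberately listed out of order.
events = [
    DataEvent(
        operation="differential expression analysis",
        inputs=["normalised expression"],
        outputs=["gene list"],
    ),
    DataEvent(
        operation="normalisation",
        inputs=["raw intensities"],
        outputs=["normalised expression"],
    ),
]

for step, ev in enumerate(order_events(events), start=1):
    print(step, ev.operation)
```

Representing the workflow as a dependency graph rather than a flat list keeps the sketch open to branching methods, where several data events consume the same intermediate data.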

Details

Language :
English
Database :
British Library EThOS
Publication Type :
Dissertation/Thesis
Accession number :
edsble.886021
Document Type :
Electronic Thesis or Dissertation