1. Biomedical named entity recognition through improved balanced undersampling for addressing class imbalance and preserving contextual information
- Author
- Archana, S. M. and Prakash, Jay
- Abstract
Biomedical Named Entity Recognition (Bio-NER) identifies and categorises named entities in biomedical text, such as diseases, chemicals, proteins, and genes. Since most biomedical data originates from the real world, the majority of data instances do not pertain to the specific named entity of interest. This imbalance adversely impacts the performance of machine learning models for Bio-NER, as their learning objective is usually dominated by the majority non-entity tokens. Various undersampling techniques have been introduced to address this issue. Balanced Undersampling (BUS) is one such approach, operating at the sentence level to enhance Bio-NER. However, BUS does not preserve contextual information during the undersampling procedure. To overcome this limitation, we introduce an improved Balanced Undersampling method (iBUS) for Bio-NER. During the undersampling process, iBUS considers the importance of individual instances and generates a balanced dataset while retaining essential instances. To validate the effectiveness of the proposed method against competitive methods, we perform experiments on the NCBI disease, CHEMDNER, and BC5CDR chemical datasets. The experimental results show that the proposed method achieves a higher F1 score than competitive approaches.
- Published
- 2024