1. Exploiting the sequential nature of genomic data for improved analysis and identification.
- Author
- Nawaz MS, Nawaz MZ, Junyi Z, Fournier-Viger P, and Qu JF
- Abstract
Genomic data is growing exponentially, posing new challenges for sequence analysis and classification, particularly for managing and understanding harmful new viruses that may later cause pandemics. Recent genome sequence classification models yield promising performance. However, the majority of them do not consider the sequential arrangement of nucleotides and amino acids, a critical aspect for uncovering their inherent structure and function. To overcome this, we introduce GenoAnaCla, a novel approach for analyzing and classifying genome sequences, based on sequential pattern mining (SPM). The proposed approach first constructs and preprocesses datasets comprising RNA virus genome sequences in three formats: nucleotide, coding region, and protein. Then, to capture sequential features for the analysis and classification of viruses, GenoAnaCla extracts frequent sequential patterns and rules in three forms and in codons. Eight classifiers are utilized, and their effectiveness is assessed by employing a variety of evaluation metrics. A performance comparison demonstrates that the suggested approach surpasses the current state-of-the-art genome sequence classification and detection techniques with a 3.18% performance increase in accuracy on average.
Competing Interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Copyright © 2024 Elsevier Ltd. All rights reserved.
- Published
- 2024