1. A mapping-free NLP-based technique for sequence search in Nanopore long-reads
- Author
- Strzoda, Tomasz, Cruz-Garcia, Lourdes, Najim, Mustafa, Badie, Christophe, and Polanska, Joanna
- Subjects
- Quantitative Biology - Genomics
- Abstract
In unforeseen situations, such as nuclear power plant or civilian radiation accidents, there is a need for effective and computationally inexpensive methods to determine the expression level of a selected gene panel, allowing for rough dose estimates in thousands of donors. A new-generation in-situ mapper that is fast, consumes little energy, and works at the level of single nanopore output is in demand. We aim to create a sequence identification tool that utilizes Natural Language Processing (NLP) techniques and ensures a high negative predictive value (NPV) compared to the classical approach. The training dataset consisted of RNA-Seq data from 6 samples. After testing multiple NLP models, we found that the best configuration analyses the entire sequence and uses a word length of 3 base pairs with one neighboring word on each side. For the considered FDXR gene, the achieved mean balanced accuracy (BACC) was 98.29% and the NPV was 99.25%, measured against minimap2's performance in a cross-validation scenario. Reducing the dictionary from 1024 to 145 changed the BACC to 96.49% and the NPV to 98.15%. The obtained NLP model, validated on an external independent genome sequencing dataset, gave an NPV of 99.64% for the complete dictionary and 95.87% for the reduced one. The salmon-estimated read counts differed from the classical approach on average by 3.48% for the complete dictionary and by 5.82% for the reduced one. We conclude that for long Oxford Nanopore reads, an NLP-based approach can successfully replace classical mapping in an emergency. The developed NLP model can be easily retrained to identify selected transcripts and/or work with various long-read sequencing techniques. The results of the study demonstrate the potential of applying techniques known from classical text processing to nucleotide sequences and represent a significant advancement in this field.
Comment: 25 pages, 9 figures
- Published
- 2024
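
The abstract describes reads split into 3-base-pair words with one neighboring word on each side, evaluated by NPV and balanced accuracy. The following is only an illustrative sketch of those ideas, not the authors' implementation. The function names, the non-overlapping stride, and the (center, neighbor) pair representation are assumptions made for this example; the abstract does not state how words are extracted or fed to the model.

```python
from typing import List, Tuple


def split_into_words(read: str, k: int = 3, stride: int = 3) -> List[str]:
    """Cut a nucleotide read into k-base 'words'.

    stride=3 gives non-overlapping words (an assumption of this sketch);
    stride=1 would give overlapping k-mers instead.
    Trailing bases that do not fill a whole word are dropped.
    """
    read = read.upper()
    return [read[i:i + k] for i in range(0, len(read) - k + 1, stride)]


def context_pairs(words: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Pair each word with its neighbors up to `window` words on each side."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


def npv(tn: int, fn: int) -> float:
    """Negative predictive value: TN / (TN + FN)."""
    return tn / (tn + fn)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


if __name__ == "__main__":
    words = split_into_words("ACGTTGCAA")
    print(words)
    print(context_pairs(words))
```

In this sketch the confusion-matrix counts would come from comparing the classifier's per-read decisions with a reference mapper's assignments, which is how the abstract frames the comparison with minimap2.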