Natural language processing strategies for discovery of cell type-specific DNA regulatory elements

Authors :
Buzatu, Rafaella
Kenna, Kevin (Thesis Advisor)
Wang Y., Kenna K.
Publication Year :
2024

Abstract

Understanding the gene transcription rules present in non-coding DNA is essential for unraveling the genetic code that establishes cellular fate. In this study, we aim to narrow down the regulatory regions and motifs within the central nervous system (CNS) that determine cell specificity. While ATAC-seq data has proven efficient in defining relevant regions of open chromatin, further analysis is required to obtain insights into specific regulatory elements. To that end, we propose a strategy that uses natural language processing techniques to identify DNA transcription factor (TF) binding sites relevant to each cell type. We employ topic modelling to co-cluster ATAC-seq peak sequences and cell types; as a result, we can retrieve ‘topics’ consisting of functionally related non-coding DNA regions that provide a starting point for further analysis and identification of cell-specific feature combinations. Furthermore, we fine-tune a BigBird language model, pre-trained on the human genome, to distinguish between GABAergic, glutamatergic, and non-neuronal cells. The Byte-Pair Encoding tokenization method allows us to extract the DNA motifs that are most important for the class predictions, together with their corresponding attention scores, which can be mapped back to the peak sequences to identify TF binding sites. We show that this method allows identification of known regulatory elements, and we propose new strategies to extract more meaningful and specific information from the language models.
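
As an illustration of the Byte-Pair Encoding step mentioned in the abstract, the following is a minimal sketch of how BPE can build motif-like tokens from DNA sequences. It is not the tokenizer used in the thesis: the function names (`apply_merge`, `learn_bpe_merges`, `tokenize`), the toy sequences, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of DNA strings."""
    # Start from single-nucleotide tokens.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken alphabetically
        # (an assumption, chosen only to keep the sketch deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a DNA string by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Hypothetical toy input, not data from the thesis.
merges = learn_bpe_merges(["TATATATA"], num_merges=2)
print(tokenize("GTATAC", merges))
```

Because frequent subsequences are merged into single tokens, each token covers a contiguous stretch of the original sequence, which is what makes it possible to map per-token scores back to positions in a peak sequence.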

Details

Database :
OAIster
Notes :
EN
Publication Type :
Electronic Resource
Accession number :
edsoai.on1430693014
Document Type :
Electronic Resource