1. DeepG4: A deep learning approach to predict cell-type specific active G-quadruplex regions
- Author
Elissar Nassereddine, Raphaël Mourad, Vincent Rocher, and Matthieu Genais
- Subjects
Gene Expression; chemistry.chemical_compound; Database and Informatics Methods; Neoplasms; Breast Tumors; Medicine and Health Sciences; Biology (General); Genome; Ecology; Chromosome Biology; Genomics; Chromatin; Computational Theory and Mathematics; Oncology; Modeling and Simulation; 293T cells; Cell lines; Epigenetics; Biological cultures; Sequence Analysis; Network Analysis; Algorithms; Research Article; Computer and Information Sciences; Chromatin Immunoprecipitation; QH301-705.5; Bioinformatics; Computational biology; Biology; G-quadruplex; Research and Analysis Methods; Network Motifs; DNA sequencing; Human Genomics; Cellular and Molecular Neuroscience; Deep Learning; Sequence Motif Analysis; Breast Cancer; Genetics; Humans; Nucleic acid structure; Molecular Biology; Transcription factor; Ecology, Evolution, Behavior and Systematics; Sequence (medicine); Biology and Life Sciences; Cancers and Neoplasms; Cell Biology; G-Quadruplexes; chemistry; Neural Networks, Computer; DNA
- Abstract
DNA is a complex molecule carrying the instructions an organism needs to develop, live and reproduce. In 1953, Watson and Crick discovered that DNA is composed of two chains forming a double helix. Later on, other DNA structures were discovered and shown to play important roles in the cell, in particular the G-quadruplex (G4). Following genome sequencing, several bioinformatic algorithms were developed to map G4s in vitro based on a canonical sequence motif, G-richness and G-skewness, or alternatively on sequence features including k-mers, and more recently on machine/deep learning. Recently, new sequencing techniques were developed to map G4s in vitro (G4-seq) and in vivo (G4 ChIP-seq) at a resolution of a few hundred bases. Here, we propose a novel convolutional neural network (DeepG4) to map cell-type specific active G4 regions (i.e. regions within which G4s form both in vitro and in vivo). DeepG4 predicts active G4 regions in different cell types with high accuracy. Moreover, DeepG4 identifies key DNA motifs that are predictive of G4 region activity. We found that such motifs do not follow the very flexible sequence pattern that current algorithms search for. Instead, active G4 regions are determined by numerous specific motifs. Among those motifs, we identified known transcription factors (TFs) that could play important roles in G4 activity, either directly by contributing to the G4 structures themselves or indirectly by participating in G4 formation in the vicinity. In addition, we used DeepG4 to predict active G4 regions in a large number of tissues and cancers, thereby providing a comprehensive resource for researchers. Availability: https://github.com/morphos30/DeepG4.
Author summary: DNA is a molecule carrying genetic information and found in all living cells. In 1953, Watson and Crick found that DNA has a double helix structure. However, other DNA structures were later identified, most notably the G-quadruplex (G4). In 2000, the Human Genome Project revealed the widespread presence of G4s in the genome using algorithms. To date, all G4 mapping algorithms have been developed to map G4s on naked DNA, without knowing whether they could form in a given cell type. Here, we designed a novel artificial intelligence algorithm that can map G4 regions active in the cell from the DNA sequence and chromatin accessibility. Moreover, we identified key transcription factor motifs that could explain G4 activity depending on cell type. Lastly, we used our new algorithm to map active G4 regions in multiple tissues and cancers as a comprehensive resource for the G4 community.
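Illustrative note: the abstract mentions earlier sequence-based approaches that rely on a canonical motif, G-richness and G-skewness. The Python sketch below shows what such features can look like. It is not DeepG4 and not the code of any specific tool. The pattern assumed here (four runs of at least three guanines separated by loops of 1 to 7 bases) is the commonly used canonical form, the richness and skewness formulas are common definitions rather than ones given in this record, and all function names are hypothetical.

```python
import re

# Assumed canonical G4 pattern: four G-runs (>= 3 G) separated by loops of 1-7 nt.
# This regex scans the given strand only; C-rich (reverse-strand) G4s are not handled.
G4_MOTIF = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def canonical_g4_hits(seq: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of non-overlapping canonical motif matches."""
    return [(m.start(), m.end()) for m in G4_MOTIF.finditer(seq.upper())]


def g_richness(seq: str) -> float:
    """Fraction of bases that are G (0.0 for an empty sequence)."""
    seq = seq.upper()
    return seq.count("G") / len(seq) if seq else 0.0


def g_skewness(seq: str) -> float:
    """(G - C) / (G + C); defined as 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


if __name__ == "__main__":
    example = "TTGGGAGGGAGGGAGGGTT"  # toy sequence, not real data
    print(canonical_g4_hits(example))
    print(g_richness(example), g_skewness(example))
```

Features of this kind describe the DNA sequence alone, which is why, as the author summary notes, they cannot indicate whether a G4 actually forms in a given cell type.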
- Published
- 2021