1. Deciphering the Knowledge of Human Genome with Graphs
- Author
- Feng, Fan
- Subjects
- deep learning, knowledge graph, human genomics, 3D genomics
- Abstract
Transcriptional regulation in human cells is a complex process that requires the collaboration of diverse genomic elements and chemicals. To understand the underlying mechanisms, projects including the Encyclopedia of DNA Elements (ENCODE), Roadmap Epigenomics, and 4D Nucleome (4DN) have generated thousands of genomic and epigenomic datasets. These datasets annotate functional elements of the human genome (e.g., enhancers and promoters), summarize experimental results for epigenomic features (e.g., protein binding locations), and link different modalities with statistical models (e.g., GWAS and eQTLs). From the available data, it has become apparent that the human genome should not be oversimplified as a 1D linear sequence. Long-range dependencies along the DNA sequence play a vital role in human transcriptional regulation. For example, enhancers, the primary units of gene expression regulation, often reside hundreds of kilobases away from their target genes and activate them through physical interactions across these genomic distances. Interpreting the human genome therefore requires a data structure capable of capturing long-distance and complex relationships. This dissertation discusses how to decipher the human genome as a graph. Graphs, composed of nodes (or vertices) and edges, provide a powerful framework for modeling relationships and have proven effective in diverse real-world scenarios, such as social networks, transportation systems, and communication networks. The subsequent chapters introduce and explore representations of the human genome as a graph. Chapter 2 introduces the application of chromosome conformation capture (3C) technology, which reveals physical interactions among genomic regions. Analyzing the large-scale contact maps generated by 3C technology is instrumental in uncovering the long-range dependencies of genomic entities and understanding transcriptional regulation, so we developed computational tools, including scHiCTools and Quagga, to extract structural features from these maps. Chapter 3 addresses the need for high-resolution, high-quality chromatin contact maps. To meet this need, we developed CAESAR, a computational model that connects epigenomics and high-resolution chromatin structure. CAESAR imputes an unprecedented number of high-resolution human chromatin contact maps, allowing users to easily navigate these fine-scale chromatin structures and the corresponding regulatory mechanisms. Beyond 3D interactions, numerous data consortia and databases describe the characteristics of genomic entities and their relationships. Despite the invaluable insights these consortia provide, their data are stored in separate tables within a 1D sequential framework, which is inconvenient for genomic research and scientific discovery. To address this challenge, we introduce the Genomic Knowledgebase (GenomicKB) in Chapter 4. GenomicKB is a knowledge graph that integrates datasets and annotations related to the human genome. Through a graph-based interpretation of the human genome, we anticipate that genomic research will become increasingly data-driven. GenomicKB aims to provide high-quality, integrated data for large-scale machine learning methods, thereby facilitating scientific discoveries.
- Published
- 2024
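
The abstract's central idea, that a graph of nodes and edges can capture long-range relationships that a 1D coordinate system hides, can be illustrated with a minimal sketch. This is not the dissertation's implementation and does not reflect the GenomicKB schema; the class `GenomeGraph`, the region names, and all coordinates below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


class GenomeGraph:
    """Toy graph: nodes are genomic regions, edges are physical contacts."""

    def __init__(self):
        self.nodes = {}               # name -> region attributes
        self.adj = defaultdict(set)   # name -> names of contacted regions

    def add_region(self, name, chrom, start, end, kind):
        self.nodes[name] = {"chrom": chrom, "start": start, "end": end, "kind": kind}

    def add_contact(self, a, b):
        # Undirected edge, e.g. a contact observed in a 3C-style experiment.
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both regions must be added before linking them")
        self.adj[a].add(b)
        self.adj[b].add(a)

    def linear_distance(self, a, b):
        """Gap in base pairs along the 1D sequence (None across chromosomes)."""
        na, nb = self.nodes[a], self.nodes[b]
        if na["chrom"] != nb["chrom"]:
            return None
        gap = max(na["start"], nb["start"]) - min(na["end"], nb["end"])
        return max(0, gap)

    def hops(self, a, b):
        """Number of edges on the shortest path from a to b (None if unreachable)."""
        if a == b:
            return 0
        seen, queue = {a}, deque([(a, 0)])
        while queue:
            node, dist = queue.popleft()
            for nxt in self.adj[node]:
                if nxt == b:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None


# Hypothetical regions: an enhancer, a nearby promoter, and a distant promoter.
g = GenomeGraph()
g.add_region("enhancer_A", "chr1", 1_000_000, 1_002_000, "enhancer")
g.add_region("promoter_near", "chr1", 1_010_000, 1_011_000, "promoter")
g.add_region("promoter_far", "chr1", 1_400_000, 1_401_000, "promoter")

# Assume the enhancer contacts the distant promoter, not the nearby one.
g.add_contact("enhancer_A", "promoter_far")

for target in ("promoter_near", "promoter_far"):
    print(target, g.linear_distance("enhancer_A", target), g.hops("enhancer_A", target))
```

In this toy example the nearby promoter is only 8 kb from the enhancer but is not connected to it, while the distant promoter is 398 kb away yet a single edge apart. Encoding relationships as edges makes that kind of long-range link explicit instead of leaving it to be inferred from coordinates.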