1. Characterizing Regulatory Elements and Non-Coding Variants in the Human Genome
- Author
- Siraj, Layla; Lander, Eric S; Finucane, Hilary K; Gusev, Alexander; Raychaudhuri, Soumya; Sunyaev, Shamil R
- Subjects
Biological programs, Complex disease, Gene regulation, Genomics, Regulatory networks, Transcription factor binding, Genetics, Biophysics, Bioinformatics
- Abstract
Regulatory elements and the non-coding variants within them govern the spatiotemporal expression of genes as part of coordinated sets of networks, mediated by the combinatorial binding of transcription factors. The mechanisms by which non-coding variants affect the regulatory ability of the elements in which they reside, as well as the structural organization of regulatory elements, remain poorly understood. In this dissertation, I approach the characterization of regulatory elements and non-coding variants through three distinct and complementary perspectives. In chapter 2, I employ a functional approach. I present and analyze a rich resource of functional effect data for over 300,000 fine-mapped complex trait variants and robust controls. I demonstrate that massively parallel reporter assays (MPRAs) provide important and salient functional effect information for elements residing in endogenous regulatory elements. I present mechanistic evidence for epistasis between non-coding variants and dissect cases of multiple causal variants across independent signals and within the same signal. I also characterize the individual nucleotide contribution across the entire regulatory element for 164 loci and uncover new sequence motifs contributing to regulatory element activity. In chapter 3, I employ a positional and biochemical approach. In characterizing regulatory elements by the transcription factor binding sites that lie within them, I ultimately uncover serious confounding effects of cut coverage and residual enzymatic bias that hamper the ability to infer TF binding using ATAC-seq data. I also present a framework for ascertaining residual bias in footprinting algorithms. Finally, in chapter 4, I employ a statistical approach. I use Latent Dirichlet Allocation, a model from natural language processing, to identify the biological programs common to subsets of non-coding variants and phenotypes. Using data from the United Kingdom Biobank, I generated 15 clusters and annotated them biologically using cell-type-specific enrichment of nearby genes. I present our preliminary findings, with 4 biologically meaningful clusters, and discuss improvements and challenges ahead in comprehensively characterizing biological programs.
- Published
- 2023