3,054 results on '"Gerstein, Mark"'
Search Results
2. Step-Back Profiling: Distilling User History for Personalized Scientific Writing
- Author
Tang, Xiangru, Zhang, Xingyao, Shao, Yanjun, Wu, Jie, Zhao, Yilun, Cohan, Arman, Gong, Ming, Zhang, Dongmei, and Gerstein, Mark
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence
- Abstract
Large language models (LLMs) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals, particularly in real-world scenarios like scientific writing. Addressing this challenge, we introduce STEP-BACK PROFILING to personalize LLMs by distilling user history into concise profiles, including essential traits and preferences of users. To conduct the experiments, we construct a Personalized Scientific Writing (PSW) dataset to study multi-user personalization. PSW requires the models to write scientific papers given specialized author groups with diverse academic backgrounds. As for the results, we demonstrate the effectiveness of capturing user characteristics via STEP-BACK PROFILING for collaborative writing. Moreover, our approach outperforms the baselines by up to 3.6 points on the general personalization benchmark (LaMP), including 7 personalization LLM tasks. Our ablation studies validate the contributions of different components in our method and provide insights into our task definition. Our dataset and code are available at https://github.com/gersteinlab/step-back-profiling.
- Published
- 2024
3. Single-cell genomics and regulatory networks for 388 human brains.
- Author
Emani, Prashant, Liu, Jason, Clarke, Declan, Jensen, Matthew, Warrell, Jonathan, Gupta, Chirag, Meng, Ran, Lee, Che Yu, Xu, Siwei, Dursun, Cagatay, Lou, Shaoke, Chen, Yuhang, Chu, Zhiyuan, Galeev, Timur, Hwang, Ahyeon, Li, Yunyang, Ni, Pengyu, Zhou, Xiao, Bakken, Trygve, Bendl, Jaroslav, Bicks, Lucy, Chatterjee, Tanima, Cheng, Lijun, Cheng, Yuyan, Dai, Yi, Duan, Ziheng, Flaherty, Mary, Fullard, John, Gancz, Michael, Garrido-Martín, Diego, Gaynor-Gillett, Sophia, Grundman, Jennifer, Hawken, Natalie, Henry, Ella, Hoffman, Gabriel, Huang, Ao, Jiang, Yunzhe, Jin, Ting, Jorstad, Nikolas, Kawaguchi, Riki, Khullar, Saniya, Liu, Jianyin, Liu, Junhao, Liu, Shuang, Ma, Shaojie, Margolis, Michael, Mazariegos, Samantha, Moore, Jill, Moran, Jennifer, Nguyen, Eric, Phalke, Nishigandha, Pjanic, Milos, Pratt, Henry, Quintero, Diana, Rajagopalan, Ananya, Riesenmy, Tiernon, Shedd, Nicole, Shi, Manman, Spector, Megan, Terwilliger, Rosemarie, Travaglini, Kyle, Wamsley, Brie, Wang, Gaoyuan, Xia, Yan, Xiao, Shaohua, Yang, Andrew, Zheng, Suchen, Gandal, Michael, Lee, Donghoon, Lein, Ed, Roussos, Panos, Sestan, Nenad, Weng, Zhiping, White, Kevin, Won, Hyejung, Girgenti, Matthew, Zhang, Jing, Wang, Daifeng, Geschwind, Daniel, and Gerstein, Mark
- Subjects
Humans ,Aging ,Brain ,Cell Communication ,Chromatin ,Gene Regulatory Networks ,Genomics ,Mental Disorders ,Prefrontal Cortex ,Quantitative Trait Loci ,Single-Cell Analysis
- Abstract
Single-cell genomics is a powerful tool for studying heterogeneous tissues such as the brain. Yet little is understood about how genetic variants influence cell-level gene expression. Addressing this, we uniformly processed single-nuclei, multiomics datasets into a resource comprising >2.8 million nuclei from the prefrontal cortex across 388 individuals. For 28 cell types, we assessed population-level variation in expression and chromatin across gene families and drug targets. We identified >550,000 cell type-specific regulatory elements and >1.4 million single-cell expression quantitative trait loci, which we used to build cell-type regulatory and cell-to-cell communication networks. These networks manifest cellular changes in aging and neuropsychiatric disorders. We further constructed an integrative model accurately imputing single-cell expression and simulating perturbations; the model prioritized ~250 disease-risk genes and drug targets with associated cell types.
- Published
- 2024
4. Cross-ancestry atlas of gene, isoform, and splicing regulation in the developing human brain
- Author
Wen, Cindy, Margolis, Michael, Dai, Rujia, Zhang, Pan, Przytycki, Pawel F, Vo, Daniel D, Bhattacharya, Arjun, Matoba, Nana, Tang, Miao, Jiao, Chuan, Kim, Minsoo, Tsai, Ellen, Hoh, Celine, Aygün, Nil, Walker, Rebecca L, Chatzinakos, Christos, Clarke, Declan, Pratt, Henry, Peters, Mette A, Gerstein, Mark, Daskalakis, Nikolaos P, Weng, Zhiping, Jaffe, Andrew E, Kleinman, Joel E, Hyde, Thomas M, Weinberger, Daniel R, Bray, Nicholas J, Sestan, Nenad, Geschwind, Daniel H, Roeder, Kathryn, Gusev, Alexander, Pasaniuc, Bogdan, Stein, Jason L, Love, Michael I, Pollard, Katherine S, Liu, Chunyu, Gandal, Michael J, Akbarian, Schahram, Abyzov, Alexej, Ahituv, Nadav, Arasappan, Dhivya, Almagro Armenteros, Jose Juan, Beliveau, Brian J, Bendl, Jaroslav, Berretta, Sabina, Bharadwaj, Rahul A, Bicks, Lucy, Brennand, Kristen, Capauto, Davide, Champagne, Frances A, Chatterjee, Tanima, Chatzinakos, Chris, Chen, Yuhang, Chen, H Isaac, Cheng, Yuyan, Cheng, Lijun, Chess, Andrew, Chien, Jo-fan, Chu, Zhiyuan, Clement, Ashley, Collado-Torres, Leonardo, Cooper, Gregory M, Crawford, Gregory E, Davila-Velderrain, Jose, Deep-Soboslay, Amy, Deng, Chengyu, DiPietro, Christopher P, Dracheva, Stella, Drusinsky, Shiron, Duan, Ziheng, Duong, Duc, Dursun, Cagatay, Eagles, Nicholas J, Edelstein, Jonathan, Emani, Prashant S, Fullard, John F, Galani, Kiki, Galeev, Timur, Gaynor, Sophia, Girdhar, Kiran, Goes, Fernando S, Greenleaf, William, Grundman, Jennifer, Guo, Hanmin, Guo, Qiuyu, Gupta, Chirag, Hadas, Yoav, Hallmayer, Joachim, Han, Xikun, Haroutunian, Vahram, Hawken, Natalie, He, Chuan, Henry, Ella, Hicks, Stephanie C, Ho, Marcus, Ho, Li-Lun, Hoffman, Gabriel E, Huang, Yiling, Huuki-Myers, Louise A, and Hwang, Ahyeon
- Subjects
Biological Sciences ,Biomedical and Clinical Sciences ,Genetics ,Biological Psychology ,Psychology ,Mental Illness ,Mental Health ,Human Genome ,Neurosciences ,Brain Disorders ,Mental health ,Humans ,Alternative Splicing ,Atlases as Topic ,Autism Spectrum Disorder ,Brain ,Gene Expression Regulation ,Developmental ,Gene Regulatory Networks ,Genome-Wide Association Study ,Protein Isoforms ,Quantitative Trait Loci ,Schizophrenia ,Transcriptome ,Mental Disorders ,PsychENCODE Consortium† ,PsychENCODE Consortium ,General Science & Technology
- Abstract
Neuropsychiatric genome-wide association studies (GWASs), including those for autism spectrum disorder and schizophrenia, show strong enrichment for regulatory elements in the developing brain. However, prioritizing risk genes and mechanisms is challenging without a unified regulatory atlas. Across 672 diverse developing human brains, we identified 15,752 genes harboring gene, isoform, and/or splicing quantitative trait loci, mapping 3739 to cellular contexts. Gene expression heritability drops during development, likely reflecting both increasing cellular heterogeneity and the intrinsic properties of neuronal maturation. Isoform-level regulation, particularly in the second trimester, mediated the largest proportion of GWAS heritability. Through colocalization, we prioritized mechanisms for about 60% of GWAS loci across five disorders, exceeding adult brain findings. Finally, we contextualized results within gene and isoform coexpression networks, revealing the comprehensive landscape of transcriptome regulation in development and disease.
- Published
- 2024
5. Massively parallel characterization of regulatory elements in the developing human cortex
- Author
Deng, Chengyu, Whalen, Sean, Steyert, Marilyn, Ziffra, Ryan, Przytycki, Pawel F, Inoue, Fumitaka, Pereira, Daniela A, Capauto, Davide, Norton, Scott, Vaccarino, Flora M, Pollen, Alex A, Nowakowski, Tomasz J, Ahituv, Nadav, Pollard, Katherine S, Akbarian, Schahram, Abyzov, Alexej, Arasappan, Dhivya, Almagro Armenteros, Jose Juan, Beliveau, Brian J, Bendl, Jaroslav, Berretta, Sabina, Bharadwaj, Rahul A, Bhattacharya, Arjun, Bicks, Lucy, Brennand, Kristen, Champagne, Frances A, Chatterjee, Tanima, Chatzinakos, Chris, Chen, Yuhang, Chen, H Isaac, Cheng, Yuyan, Cheng, Lijun, Chess, Andrew, Chien, Jo-fan, Chu, Zhiyuan, Clarke, Declan, Clement, Ashley, Collado-Torres, Leonardo, Cooper, Gregory M, Crawford, Gregory E, Dai, Rujia, Daskalakis, Nikolaos P, Davila-Velderrain, Jose, Deep-Soboslay, Amy, DiPietro, Christopher P, Dracheva, Stella, Drusinsky, Shiron, Duan, Ziheng, Duong, Duc, Dursun, Cagatay, Eagles, Nicholas J, Edelstein, Jonathan, Emani, Prashant S, Fullard, John F, Galani, Kiki, Galeev, Timur, Gandal, Michael J, Gaynor, Sophia, Gerstein, Mark, Geschwind, Daniel H, Girdhar, Kiran, Goes, Fernando S, Greenleaf, William, Grundman, Jennifer, Guo, Hanmin, Guo, Qiuyu, Gupta, Chirag, Hadas, Yoav, Hallmayer, Joachim, Han, Xikun, Haroutunian, Vahram, Hawken, Natalie, He, Chuan, Henry, Ella, Hicks, Stephanie C, Ho, Marcus, Ho, Li-Lun, Hoffman, Gabriel E, Huang, Yiling, Huuki-Myers, Louise A, Hwang, Ahyeon, Hyde, Thomas M, Iatrou, Artemis, Jajoo, Aarti, Jensen, Matthew, Jiang, Lihua, Jin, Peng, Jin, Ting, Jops, Connor, Jourdon, Alexandre, Kawaguchi, Riki, Kellis, Manolis, Khullar, Saniya, Kleinman, Joel E, Kleopoulos, Steven P, and Kozlenkov, Alex
- Subjects
Biological Sciences ,Bioinformatics and Computational Biology ,Biomedical and Clinical Sciences ,Stem Cell Research - Embryonic - Human ,Stem Cell Research ,Human Genome ,Genetics ,Neurosciences ,Underpinning research ,Aetiology ,1.1 Normal biological development and functioning ,2.1 Biological and endogenous factors ,Neurological ,Humans ,Cerebral Cortex ,Chromatin ,Deep Learning ,Enhancer Elements ,Genetic ,Gene Expression Regulation ,Developmental ,Neurogenesis ,Neurons ,Organoids ,Regulatory Sequences ,Nucleic Acid ,Promoter Regions ,Genetic ,Regulatory Elements ,Transcriptional ,PsychENCODE Consortium‡ ,PsychENCODE Consortium ,General Science & Technology
- Abstract
Nucleotide changes in gene regulatory elements are important determinants of neuronal development and diseases. Using massively parallel reporter assays in primary human cells from mid-gestation cortex and cerebral organoids, we interrogated the cis-regulatory activity of 102,767 open chromatin regions, including thousands of sequences with cell type-specific accessibility and variants associated with brain gene regulation. In primary cells, we identified 46,802 active enhancer sequences and 164 variants that alter enhancer activity. Activity was comparable in organoids and primary cells, suggesting that organoids provide an adequate model for the developing cortex. Using deep learning we decoded the sequence basis and upstream regulators of enhancer activity. This work establishes a comprehensive catalog of functional gene regulatory elements and variants in human neuronal development.
- Published
- 2024
6. MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise
- Author
Deng, Chunyuan, Tang, Xiangru, Zhao, Yilun, Wang, Hanming, Wang, Haoran, Zhou, Wangchunshu, Cohan, Arman, and Gerstein, Mark
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence
- Abstract
Recently, large language models (LLMs) have evolved into interactive agents, proficient in planning, tool use, and task execution across a wide variety of tasks. However, without specific agent tuning, open-source models like LLaMA currently struggle to match the efficiency of GPT-4, particularly given the scarcity of agent-tuning datasets for fine-tuning. In response, we introduce Mimir: a streamlined platform offering a customizable pipeline that enables users to leverage both private knowledge and publicly available, legally compliant datasets at scale for personalized agent tuning. Additionally, Mimir supports the generation of general instruction-tuning datasets from the same input. This dual capability ensures that language agents developed through the platform possess both specific agent abilities and general competencies. Mimir integrates these features into a cohesive end-to-end platform, facilitating everything from the uploading of personalized files to one-click agent fine-tuning.
- Published
- 2024
7. $\zeta$-QVAE: A Quantum Variational Autoencoder utilizing Regularized Mixed-state Latent Representations
- Author
Wang, Gaoyuan, Warrell, Jonathan, Emani, Prashant S., and Gerstein, Mark
- Subjects
Quantum Physics
- Abstract
A major challenge in near-term quantum computing is its application to large real-world datasets due to scarce quantum hardware resources. One approach to enabling tractable quantum models for such datasets involves compressing the original data to manageable dimensions while still representing essential information for downstream analysis. In classical machine learning, variational autoencoders (VAEs) facilitate efficient data compression, representation learning for subsequent tasks, and novel data generation. However, no model has been proposed that exactly captures all of these features for direct application to quantum data on quantum computers. Some existing quantum models for data compression lack regularization of latent representations, thus preventing direct use for generation and control of generalization. Others are hybrid models with only some internal quantum components, impeding direct training on quantum data. To bridge this gap, we present a fully quantum framework, $\zeta$-QVAE, which encompasses all the capabilities of classical VAEs and can be directly applied for both classical and quantum data compression. Our model utilizes regularized mixed states to attain optimal latent representations. It accommodates various divergences for reconstruction and regularization. Furthermore, by accommodating mixed states at every stage, it can utilize the full-data density matrix and allow for a "global" training objective. Doing so, in turn, makes efficient optimization possible and has potential implications for private and federated learning. In addition to exploring the theoretical properties of $\zeta$-QVAE, we demonstrate its performance on representative genomics and synthetic data. Our results consistently indicate that $\zeta$-QVAE exhibits similar or better performance compared to matched classical models.
- Published
- 2024
8. A Survey of Generative AI for de novo Drug Design: New Frontiers in Molecule and Protein Generation
- Author
Tang, Xiangru, Dai, Howard, Knight, Elizabeth, Wu, Fang, Li, Yunyang, Li, Tianxiao, and Gerstein, Mark
- Subjects
Quantitative Biology - Biomolecules ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning
- Abstract
Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.
- Published
- 2024
9. ChatCell: Facilitating Single-Cell Analysis with Natural Language
- Author
Fang, Yin, Liu, Kangwei, Zhang, Ningyu, Deng, Xinle, Yang, Penghui, Chen, Zhuo, Tang, Xiangru, Gerstein, Mark, Fan, Xiaohui, and Chen, Huajun
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Computational Engineering, Finance, and Science ,Computer Science - Human-Computer Interaction ,Computer Science - Machine Learning
- Abstract
As Large Language Models (LLMs) rapidly evolve, their influence in science is becoming increasingly prominent. The emerging capabilities of LLMs in task generalization and free-form dialogue can significantly advance fields like chemistry and biology. However, the field of single-cell biology, which forms the foundational building blocks of living organisms, still faces several challenges. High knowledge barriers and limited scalability in current methods restrict the full exploitation of LLMs in mastering single-cell data, impeding direct accessibility and rapid iteration. To this end, we introduce ChatCell, which signifies a paradigm shift by facilitating single-cell analysis with natural language. Leveraging vocabulary adaptation and unified sequence generation, ChatCell has acquired profound expertise in single-cell biology and the capability to accommodate a diverse range of analysis tasks. Extensive experiments further demonstrate ChatCell's robust performance and potential to deepen single-cell insights, paving the way for more accessible and intuitive exploration in this pivotal field. Our project homepage is available at https://zjunlp.github.io/project/ChatCell. Comment: I have decided to temporarily withdraw this draft as I am in the process of making further revisions to improve its content. Code: https://github.com/zjunlp/ChatCell Dataset: https://huggingface.co/datasets/zjunlp/ChatCell-Instructions Demo: https://chat.openai.com/g/g-vUwj222gQ-chatcell
- Published
- 2024
10. Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
- Author
Tang, Xiangru, Jin, Qiao, Zhu, Kunlun, Yuan, Tongxin, Zhang, Yichi, Zhou, Wangchunshu, Qu, Meng, Zhao, Yilun, Tang, Jian, Zhang, Zhuosheng, Cohan, Arman, Lu, Zhiyong, and Gerstein, Mark
- Subjects
Computer Science - Computers and Society ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning
- Abstract
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents, called scientific LLM agents, also introduce novel vulnerabilities that demand careful consideration for safety. However, there exists a notable gap in the literature, as there has been no comprehensive exploration of these vulnerabilities. This perspective paper fills this gap by conducting a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures. We begin by providing a comprehensive overview of the potential risks inherent to scientific LLM agents, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we delve into the origins of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding scientific agents and advocate for the development of improved models, robust benchmarks, and comprehensive regulations to address these issues effectively.
- Published
- 2024
11. The Development of a Practical Artificial Intelligence Tool for Diagnosing and Evaluating Autism Spectrum Disorder: Multicenter Study
- Author
Chen, Tao, Chen, Ye, Yuan, Mengxue, Gerstein, Mark, Li, Tingyu, Liang, Huiying, Froehlich, Tanya, and Lu, Long
- Subjects
Computer applications to medicine. Medical informatics ,R858-859.7
- Abstract
Background: Autism spectrum disorder (ASD) is a complex neurodevelopmental disorder with an unknown etiology. Early diagnosis and intervention are key to improving outcomes for patients with ASD. Structural magnetic resonance imaging (sMRI) has been widely used in clinics to facilitate the diagnosis of brain diseases such as brain tumors. However, sMRI is less frequently used to investigate neurological and psychiatric disorders, such as ASD, owing to the subtle, if any, anatomical changes of the brain. Objective: This study aimed to investigate the possibility of identifying structural patterns in the brain of patients with ASD as potential biomarkers in the diagnosis and evaluation of ASD in clinics. Methods: We developed a novel 2-level histogram-based morphometry (HBM) classification framework in which an algorithm based on a 3D version of the histogram of oriented gradients (HOG) was used to extract features from sMRI data. We applied this framework to distinguish patients with ASD from healthy controls using 4 datasets from the second edition of the Autism Brain Imaging Data Exchange, including the ETH Zürich (ETH), NYU Langone Medical Center: Sample 1, Oregon Health and Science University, and Stanford University (SU) sites. We used a stratified 10-fold cross-validation method to evaluate the model performance, and we applied the Naive Bayes approach to identify the predictive ASD-related brain regions based on classification contributions of each HOG feature. Results: On the basis of the 3D HOG feature extraction method, our proposed HBM framework achieved an area under the curve (AUC) of >0.75 in each dataset, with the highest AUC of 0.849 in the ETH site. We compared the 3D HOG algorithm with the original 2D HOG algorithm, which showed an accuracy improvement of >4% in each dataset, with the highest improvement of 14% (6/42) in the SU site. A comparison of the 3D HOG algorithm with the scale-invariant feature transform algorithm showed an AUC improvement of >18% in each dataset. Furthermore, we identified ASD-related brain regions based on the sMRI images. Some of these regions (eg, frontal gyrus, temporal gyrus, cingulate gyrus, postcentral gyrus, precuneus, caudate, and hippocampus) are known to be implicated in ASD in prior neuroimaging literature. We also identified less well-known regions that may play unrecognized roles in ASD and be worth further investigation. Conclusions: Our research suggested that it is possible to identify neuroimaging biomarkers that can distinguish patients with ASD from healthy controls based on the more cost-effective sMRI images of the brain. We also demonstrated the potential of applying data-driven artificial intelligence technology in the clinical setting of neurological and psychiatric disorders, which usually harbor subtle anatomical changes in the brain that are often invisible to the human eye.
- Published
- 2020
12. Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
- Author
Zhang, Zhuosheng, Yao, Yao, Zhang, Aston, Tang, Xiangru, Ma, Xinbei, He, Zhiwei, Wang, Yiming, Gerstein, Mark, Wang, Rui, Liu, Gongshen, and Zhao, Hai
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Human-Computer Interaction ,Computer Science - Multiagent Systems
- Abstract
Large language models (LLMs) have dramatically enhanced the field of language intelligence, as demonstrably evidenced by their formidable empirical performance across a spectrum of complex reasoning tasks. Additionally, theoretical proofs have illuminated their emergent reasoning capabilities, providing a compelling showcase of their advanced cognitive abilities in linguistic contexts. Critical to their remarkable efficacy in handling complex reasoning tasks, LLMs leverage the intriguing chain-of-thought (CoT) reasoning techniques, obliging them to formulate intermediate steps en route to deriving an answer. The CoT reasoning approach has not only exhibited proficiency in amplifying reasoning performance but also in enhancing interpretability, controllability, and flexibility. In light of these merits, recent research endeavors have extended CoT reasoning methodologies to nurture the development of autonomous language agents, which adeptly adhere to language instructions and execute actions within varied environments. This survey paper orchestrates a thorough discourse, penetrating vital research dimensions, encompassing: (i) the foundational mechanics of CoT techniques, with a focus on elucidating the circumstances and justification behind its efficacy; (ii) the paradigm shift in CoT; and (iii) the burgeoning of language agents fortified by CoT approaches. Prospective research avenues envelop explorations into generalization, efficiency, customization, scaling, and safety. This paper caters to a wide audience, including beginners seeking comprehensive knowledge of CoT reasoning and language agents, as well as experienced researchers interested in foundational mechanics and engaging in cutting-edge discussions on these topics. A repository for the related papers is available at https://github.com/Zoeyyao27/CoT-Igniting-Agent.
- Published
- 2023
13. ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
- Author
Tang, Xiangru, Liu, Yuliang, Cai, Zefan, Shao, Yanjun, Lu, Junjie, Zhang, Yichi, Deng, Zexuan, Hu, Helan, An, Kaikai, Huang, Ruijun, Si, Shuzheng, Chen, Sheng, Zhao, Haozhe, Chen, Liang, Wang, Yan, Liu, Tianyu, Jiang, Zhiwei, Chang, Baobao, Fang, Yin, Qin, Yujia, Zhou, Wangchunshu, Zhao, Yilun, Cohan, Arman, and Gerstein, Mark
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence
- Abstract
Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses 9,641 annotated examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate surpassing 50%, there remains significant scope for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. Our code, dataset, and models are available at https://github.com/gersteinlab/ML-bench.
- Published
- 2023
14. MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
- Author
Tang, Xiangru, Zou, Anni, Zhang, Zhuosheng, Li, Ziming, Zhao, Yilun, Zhang, Xingyao, Cohan, Arman, and Gerstein, Mark
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence
- Abstract
Large language models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these issues, we propose MedAgents, a novel multi-disciplinary collaboration framework for the medical domain. MedAgents leverages LLM-based agents in a role-playing setting that participate in a collaborative multi-round discussion, thereby enhancing LLM proficiency and reasoning capabilities. This training-free framework encompasses five critical steps: gathering domain experts, proposing individual analyses, summarising these analyses into a report, iterating over discussions until a consensus is reached, and ultimately making a decision. Our work focuses on the zero-shot setting, which is applicable in real-world scenarios. Experimental results on nine datasets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) establish that our proposed MedAgents framework excels at mining and harnessing the medical expertise within LLMs, as well as extending its reasoning abilities. Our code can be found at https://github.com/gersteinlab/MedAgents.
- Published
- 2023
15. Investigating Data Contamination in Modern Benchmarks for Large Language Models
- Author
Deng, Chunyuan, Zhao, Yilun, Tang, Xiangru, Gerstein, Mark, and Cohan, Arman
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence
- Abstract
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named Testset Slot Guessing (TS-Guessing), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52% and 57%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field. Comment: NAACL 2024 Version
- Published
- 2023
16. Improved prediction of ligand-protein binding affinities by meta-modeling
- Author
Lee, Ho-Joon, Emani, Prashant S., and Gerstein, Mark B.
- Subjects
Computer Science - Machine Learning ,Quantitative Biology - Quantitative Methods
- Abstract
The accurate screening of candidate drug ligands against target proteins through computational approaches is of prime interest to drug development efforts. Such virtual screening depends in part on methods to predict the binding affinity between ligands and proteins. Many computational models for binding affinity prediction have been developed, but with varying results across targets. Given that ensembling or meta-modeling methods have shown great promise in reducing model-specific biases, we develop a framework to integrate published force-field-based empirical docking and sequence-based deep learning models. In building this framework, we evaluate many combinations of individual base models, training databases, and several meta-modeling approaches. We show that many of our meta-models significantly improve affinity predictions over base models. Our best meta-models achieve comparable performance to state-of-the-art deep learning tools exclusively based on structures, while allowing for improved database scalability and flexibility through the explicit inclusion of features such as physicochemical properties or molecular descriptors. Overall, we demonstrate that diverse modeling approaches can be ensembled together to gain improvement in binding affinity prediction. Comment: 52 pages, 5 main tables, 6 main figures, 7 supplementary figures, and supporting information. For 11 supplementary tables and code, see https://github.com/Lee1701/Lee2023a
- Published
- 2023
17. Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
- Author
-
Tang, Xiangru, Zong, Yiming, Phang, Jason, Zhao, Yilun, Zhou, Wangchunshu, Cohan, Arman, and Gerstein, Mark
- Subjects
Computer Science - Computation and Language - Abstract
Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting format-specific instructions from the intended outputs to populate this benchmark. Addressing the gap in task-centered evaluation, we propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains, outshining its LLM counterparts across most measures. In-depth error analysis and creating an ability map across six dimensions -- coverage, formatting, reasoning, comprehension, pragmatics, and hallucination -- highlight areas for future enhancements and suggest forthcoming research trajectories. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
- Published
- 2023
18. BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models
- Author
-
Tang, Xiangru, Qian, Bill, Gao, Rick, Chen, Jiakang, Chen, Xinyun, and Gerstein, Mark
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
Pre-trained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1,026 Python functions and 1,243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (1) Successful models accommodate a long prompt (> 2,600 tokens) with full context, including functional dependencies. (2) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% vs. up to 25%). Availability and implementation: Code is available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
- Published
- 2023
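Note on entry 18: the abstract reports gains in Pass@K without defining it. A widely used unbiased estimator for this metric is sketched below; whether BioCoder computes Pass@K in exactly this way is an assumption, and the function name and argument names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples passes.

    n: samples generated per problem, c: samples that passed, k: budget.
    Uses 1 - C(n - c, k) / C(n, k), the probability that a random size-k
    subset of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over all benchmark problems gives the reported score for a given k.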
19. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
- Author
-
Qin, Yujia, Liang, Shihao, Ye, Yining, Zhu, Kunlun, Yan, Lan, Lu, Yaxi, Lin, Yankai, Cong, Xin, Tang, Xiangru, Qian, Bill, Zhao, Sihan, Hong, Lauren, Tian, Runchu, Xie, Ruobing, Zhou, Jie, Gerstein, Mark, Li, Dahai, Liu, Zhiyuan, and Sun, Maosong
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning - Abstract
Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.
- Published
- 2023
20. GersteinLab at MEDIQA-Chat 2023: Clinical Note Summarization from Doctor-Patient Conversations through Fine-tuning and In-context Learning
- Author
-
Tang, Xiangru, Tran, Andrew, Tan, Jeffrey, and Gerstein, Mark
- Subjects
Computer Science - Computation and Language - Abstract
This paper presents our contribution to the MEDIQA-2023 Dialogue2Note shared task, encompassing both subtask A and subtask B. We approach the task as a dialogue summarization problem and implement two distinct pipelines: (a) a fine-tuning of a pre-trained dialogue summarization model and GPT-3, and (b) few-shot in-context learning (ICL) using a large language model, GPT-4. Both methods achieve excellent results in terms of ROUGE-1 F1, BERTScore F1 (deberta-xlarge-mnli), and BLEURT, with scores of 0.4011, 0.7058, and 0.5421, respectively. Additionally, we predict the associated section headers using RoBERTa and SciBERT based classification models. Our team ranked fourth among all teams, while each team is allowed to submit three runs as part of their submission. We also utilize expert annotations to demonstrate that the notes generated through the ICL GPT-4 are better than all other baselines. The code for our submission is available.
- Published
- 2023
21. exRNA-eCLIP intersection analysis reveals a map of extracellular RNA binding proteins and associated RNAs across major human biofluids and carriers
- Author
-
LaPlante, Emily L, Stürchler, Alessandra, Fullem, Robert, Chen, David, Starner, Anne C, Esquivel, Emmanuel, Alsop, Eric, Jackson, Andrew R, Ghiran, Ionita, Pereira, Getulio, Rozowsky, Joel, Chang, Justin, Gerstein, Mark B, Alexander, Roger P, Roth, Matthew E, Franklin, Jeffrey L, Coffey, Robert J, Raffai, Robert L, Mansuy, Isabelle M, Stavrakis, Stavros, deMello, Andrew J, Laurent, Louise C, Wang, Yi-Ting, Tsai, Chia-Feng, Liu, Tao, Jones, Jennifer, Van Keuren-Jensen, Kendall, Van Nostrand, Eric, Mateescu, Bogdan, and Milosavljevic, Aleksandar
- Subjects
Biological Sciences ,Bioinformatics and Computational Biology ,Genetics ,Human Genome ,Biotechnology ,Underpinning research ,1.1 Normal biological development and functioning ,Generic health relevance ,NIH ERCC ,RNA binding proteins ,RNA footprint correlation ,cell-free RNAs ,cell-free biomarkers ,eCLIP ,exRNA carriers ,human biofluids ,liquid biopsies ,public resource - Abstract
Although the role of RNA binding proteins (RBPs) in extracellular RNA (exRNA) biology is well established, their exRNA cargo and distribution across biofluids are largely unknown. To address this gap, we extend the exRNA Atlas resource by mapping exRNAs carried by extracellular RBPs (exRBPs). This map was developed through an integrative analysis of ENCODE enhanced crosslinking and immunoprecipitation (eCLIP) data (150 RBPs) and human exRNA profiles (6,930 samples). Computational analysis and experimental validation identified exRBPs in plasma, serum, saliva, urine, cerebrospinal fluid, and cell-culture-conditioned medium. exRBPs carry exRNA transcripts from small non-coding RNA biotypes, including microRNA (miRNA), piRNA, tRNA, small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), Y RNA, and lncRNA, as well as protein-coding mRNA fragments. Computational deconvolution of exRBP RNA cargo reveals associations of exRBPs with extracellular vesicles, lipoproteins, and ribonucleoproteins across human biofluids. Overall, we mapped the distribution of exRBPs across human biofluids, presenting a resource for the community.
- Published
- 2023
22. Disentangled Wasserstein Autoencoder for T-Cell Receptor Engineering
- Author
-
Li, Tianxiao, Guo, Hongyu, Grazioli, Filippo, Gerstein, Mark, and Min, Martin Renqiang
- Subjects
Quantitative Biology - Biomolecules - Abstract
In protein biophysics, the separation between the functionally important residues (forming the active site or binding surface) and those that create the overall structure (the fold) is a well-established and fundamental concept. Identifying and modifying those functional sites is critical for protein engineering but computationally non-trivial, and requires significant domain knowledge. To automate this process from a data-driven perspective, we propose a disentangled Wasserstein autoencoder with an auxiliary classifier, which isolates the function-related patterns from the rest with theoretical guarantees. This enables one-pass protein sequence editing and improves the understanding of the resulting sequences and editing actions involved. To demonstrate its effectiveness, we apply it to T-cell receptors (TCRs), a well-studied structure-function case. We show that our method can be used to alter the function of TCRs without changing the structural backbone, outperforming several competing methods in generation quality and efficiency, and requiring only 10% of the running time needed by baseline models. To our knowledge, this is the first approach that utilizes disentangled representations for TCR engineering.
- Published
- 2022
23. DeepVelo: Single-cell transcriptomic deep velocity field learning with neural ordinary differential equations.
- Author
-
Chen, Zhanlin, King, William, Hwang, Aheyon, Gerstein, Mark, and Zhang, Jing
- Abstract
Recent advances in single-cell sequencing technologies have provided unprecedented opportunities to measure the gene expression profile and RNA velocity of individual cells. However, modeling transcriptional dynamics is computationally challenging because of the high-dimensional, sparse nature of the single-cell gene expression measurements and the nonlinear regulatory relationships. Here, we present DeepVelo, a neural network-based ordinary differential equation that can model complex transcriptome dynamics by describing continuous-time gene expression changes within individual cells. We apply DeepVelo to public datasets from different sequencing platforms to (i) formulate transcriptome dynamics on different time scales, (ii) measure the instability of cell states, and (iii) identify developmental driver genes via perturbation analysis. Benchmarking against the state-of-the-art methods shows that DeepVelo can learn a more accurate representation of the velocity field. Furthermore, our perturbation studies reveal that single-cell dynamical systems could exhibit chaotic properties. In summary, DeepVelo allows data-driven discoveries of differential equations that delineate single-cell transcriptome dynamics.
- Published
- 2022
24. Broad transcriptomic dysregulation occurs across the cerebral cortex in ASD
- Author
-
Gandal, Michael J, Haney, Jillian R, Wamsley, Brie, Yap, Chloe X, Parhami, Sepideh, Emani, Prashant S, Chang, Nathan, Chen, George T, Hoftman, Gil D, de Alba, Diego, Ramaswami, Gokul, Hartl, Christopher L, Bhattacharya, Arjun, Luo, Chongyuan, Jin, Ting, Wang, Daifeng, Kawaguchi, Riki, Quintero, Diana, Ou, Jing, Wu, Ye Emily, Parikshak, Neelroop N, Swarup, Vivek, Belgard, T Grant, Gerstein, Mark, Pasaniuc, Bogdan, and Geschwind, Daniel H
- Subjects
Biological Sciences ,Biomedical and Clinical Sciences ,Genetics ,Human Genome ,Autism ,Mental Health ,Eye Disease and Disorders of Vision ,Neurosciences ,Brain Disorders ,Pediatric ,Intellectual and Developmental Disabilities (IDD) ,Aetiology ,1.1 Normal biological development and functioning ,2.1 Biological and endogenous factors ,Underpinning research ,Neurological ,Mental health ,Humans ,Autism Spectrum Disorder ,Cerebral Cortex ,Neurons ,RNA ,Transcriptome ,Autopsy ,Sequence Analysis ,RNA ,Primary Visual Cortex ,Neuroglia ,Genetic Variation ,General Science & Technology - Abstract
Neuropsychiatric disorders classically lack defining brain pathologies, but recent work has demonstrated dysregulation at the molecular level, characterized by transcriptomic and epigenetic alterations. In autism spectrum disorder (ASD), this molecular pathology involves the upregulation of microglial, astrocyte and neural-immune genes, the downregulation of synaptic genes, and attenuation of gene-expression gradients in cortex. However, whether these changes are limited to cortical association regions or are more widespread remains unknown. To address this issue, we performed RNA-sequencing analysis of 725 brain samples spanning 11 cortical areas from 112 post-mortem samples from individuals with ASD and neurotypical controls. We find widespread transcriptomic changes across the cortex in ASD, exhibiting an anterior-to-posterior gradient, with the greatest differences in primary visual cortex, coincident with an attenuation of the typical transcriptomic differences between cortical regions. Single-nucleus RNA-sequencing and methylation profiling demonstrate that this robust molecular signature reflects changes in cell-type-specific gene expression, particularly affecting excitatory neurons and glia. Both rare and common ASD-associated genetic variation converge within a downregulated co-expression module involving synaptic signalling, and common variation alone is enriched within a module of upregulated protein chaperone genes. These results highlight widespread molecular changes across the cerebral cortex in ASD, extending beyond association cortex to broadly involve primary sensory regions.
- Published
- 2022
25. Scalable privacy-preserving cancer type prediction with homomorphic encryption
- Author
-
Sarkar, Esha, Chielle, Eduardo, Gursoy, Gamze, Chen, Leo, Gerstein, Mark, and Maniatakos, Michail
- Subjects
Computer Science - Cryptography and Security ,Computer Science - Artificial Intelligence - Abstract
Machine Learning (ML) alleviates the challenges of high-dimensional data analysis and improves decision making in critical applications like healthcare. Effective cancer type prediction from high-dimensional genetic mutation data can be useful for cancer diagnosis and treatment, if the distinguishable patterns between cancer types are identified. At the same time, analysis of high-dimensional data is computationally expensive and is often outsourced to cloud services. Privacy concerns in outsourced ML, especially in the field of genetics, motivate the use of encrypted computation, like Homomorphic Encryption (HE). But restrictive overheads of encrypted computation deter its usage. In this work, we explore the challenges of privacy preserving cancer detection using a real-world dataset consisting of more than 2 million genetic information for several cancer types. Since the data is inherently high-dimensional, we explore smaller ML models for cancer prediction to enable fast inference in the privacy preserving domain. We develop a solution for privacy preserving cancer inference which first leverages the domain knowledge on somatic mutations to efficiently encode genetic mutations and then uses statistical tests for feature selection. Our logistic regression model, built using our novel encoding scheme, achieves 0.98 micro-average area under curve with 13% higher test accuracy than similar studies. We exhaustively test our model's predictive capabilities by analyzing the genes used by the model. Furthermore, we propose a fast matrix multiplication algorithm that can efficiently handle high-dimensional data. Experimental results show that, even with 40,000 features, our proposed matrix multiplication algorithm can speed up concurrent inference of multiple individuals by approximately 10x and inference of a single individual by approximately 550x, in comparison to standard matrix multiplication.
- Published
- 2022
26. Higher-Order Generalization Bounds: Learning Deep Probabilistic Programs via PAC-Bayes Objectives
- Author
-
Warrell, Jonathan and Gerstein, Mark
- Subjects
Computer Science - Machine Learning ,Statistics - Machine Learning - Abstract
Deep Probabilistic Programming (DPP) allows powerful models based on recursive computation to be learned using efficient deep-learning optimization techniques. Additionally, DPP offers a unified perspective, where inference and learning algorithms are treated on a par with models as stochastic programs. Here, we offer a framework for representing and learning flexible PAC-Bayes bounds as stochastic programs using DPP-based methods. In particular, we show that DPP techniques may be leveraged to derive generalization bounds that draw on the compositionality of DPP representations. In turn, the bounds we introduce offer principled training objectives for higher-order probabilistic programs. We offer a definition of a higher-order generalization bound, which naturally encompasses single- and multi-task generalization perspectives (including transfer- and meta-learning) and a novel class of bound based on a learned measure of model complexity. Further, we show how modified forms of all higher-order bounds can be efficiently optimized as objectives for DPP training, using variational techniques. We test our framework using single- and multi-task generalization settings on synthetic and biological data, showing improved performance and generalization prediction using flexible DPP model representations and learned complexity measures., Comment: 19 pages, 2 figures
- Published
- 2022
27. Venus: An efficient virus infection detection and fusion site discovery method using single-cell and bulk RNA-seq data.
- Author
-
Lee, Che, Chen, Yuhang, Duan, Ziheng, Xu, Min, Girgenti, Matthew, Xu, Ke, Gerstein, Mark, and Zhang, Jing
- Subjects
Humans ,HIV Infections ,Liver Neoplasms ,RNA-Seq ,Sequence Analysis ,RNA ,Software ,Viruses - Abstract
Early and accurate detection of viruses in clinical and environmental samples is essential for effective public healthcare, treatment, and therapeutics. While PCR detects potential pathogens with high sensitivity, it is difficult to scale and requires knowledge of the exact sequence of the pathogen. With the advent of next-gen single-cell sequencing, it is now possible to scrutinize viral transcriptomics at the finest possible resolution: cells. This newfound ability to investigate individual cells opens new avenues to understand viral pathophysiology with unprecedented resolution. To leverage this ability, we propose an efficient and accurate computational pipeline, named Venus, for virus detection and integration site discovery in both single-cell and bulk-tissue RNA-seq data. Specifically, Venus addresses two main questions: whether a tissue/cell type is infected by viruses or a virus of interest? And if infected, whether and where has the virus inserted itself into the human genome? Our analysis can be broken into two parts: validation and discovery. Firstly, for validation, we applied Venus on well-studied viral datasets, such as HBV-hepatocellular carcinoma and HIV-infection treated with antiretroviral therapy. Secondly, for discovery, we analyzed datasets such as HIV-infected neurological patients and deeply sequenced T-cells. We detected viral transcripts in the novel target of the brain and high-confidence integration sites in immune cells. In conclusion, here we describe Venus, a publicly available software which we believe will be a valuable virus investigation tool for the scientific community at large.
- Published
- 2022
28. Phase 2 of extracellular RNA communication consortium charts next-generation approaches for extracellular RNA research
- Author
-
Mateescu, Bogdan, Jones, Jennifer C, Alexander, Roger P, Alsop, Eric, An, Ji Yeong, Asghari, Mohammad, Boomgarden, Alex, Bouchareychas, Laura, Cayota, Alfonso, Chang, Hsueh-Chia, Charest, Al, Chiu, Daniel T, Coffey, Robert J, Das, Saumya, De Hoff, Peter, deMello, Andrew, D’Souza-Schorey, Crislyn, Elashoff, David, Eliato, Kiarash R, Franklin, Jeffrey L, Galas, David J, Gerstein, Mark B, Ghiran, Ionita H, Go, David B, Gould, Stephen, Grogan, Tristan R, Higginbotham, James N, Hladik, Florian, Huang, Tony Jun, Huo, Xiaoye, Hutchins, Elizabeth, Jeppesen, Dennis K, Jovanovic-Talisman, Tijana, Kim, Betty YS, Kim, Sung, Kim, Kyoung-Mee, Kim, Yong, Kitchen, Robert R, Knouse, Vaughan, LaPlante, Emily L, Lebrilla, Carlito B, Lee, L James, Lennon, Kathleen M, Li, Guoping, Li, Feng, Li, Tieyi, Liu, Tao, Liu, Zirui, Maddox, Adam L, McCarthy, Kyle, Meechoovet, Bessie, Maniya, Nalin, Meng, Yingchao, Milosavljevic, Aleksandar, Min, Byoung-Hoon, Morey, Amber, Ng, Martin, Nolan, John, De Oliveira, Getulio P, Paulaitis, Michael E, Phu, Tuan Anh, Raffai, Robert L, Reátegui, Eduardo, Roth, Matthew E, Routenberg, David A, Rozowsky, Joel, Rufo, Joseph, Senapati, Satyajyoti, Shachar, Sigal, Sharma, Himani, Sood, Anil K, Stavrakis, Stavros, Stürchler, Alessandra, Tewari, Muneesh, Tosar, Juan P, Tucker-Schwartz, Alexander K, Turchinovich, Andrey, Valkov, Nedyalka, Van Keuren-Jensen, Kendall, Vickers, Kasey C, Vojtech, Lucia, Vreeland, Wyatt N, Wang, Ceming, Wang, Kai, Wang, ZeYu, Welsh, Joshua A, Witwer, Kenneth W, Wong, David TW, Xia, Jianping, Xie, Ya-Hong, Yang, Kaichun, Zaborowski, Mikołaj P, Zhang, Chenguang, Zhang, Qin, Zivkovic, Angela M, and Laurent, Louise C
- Subjects
Biological Sciences ,Biomedical and Clinical Sciences ,Genetics ,Biochemistry ,Biological sciences ,Cell biology ,Molecular biology - Abstract
The extracellular RNA communication consortium (ERCC) is an NIH-funded program aiming to promote the development of new technologies, resources, and knowledge about exRNAs and their carriers. After Phase 1 (2013-2018), Phase 2 of the program (ERCC2, 2019-2023) aims to fill critical gaps in knowledge and technology to enable rigorous and reproducible methods for separation and characterization of both bulk populations of exRNA carriers and single EVs. ERCC2 investigators are also developing new bioinformatic pipelines to promote data integration through the exRNA atlas database. ERCC2 has established several Working Groups (Resource Sharing, Reagent Development, Data Analysis and Coordination, Technology Development, nomenclature, and Scientific Outreach) to promote collaboration between ERCC2 members and the broader scientific community. We expect that ERCC2's current and future achievements will significantly improve our understanding of exRNA biology and the development of accurate and efficient exRNA-based diagnostic, prognostic, and theranostic biomarker assays.
- Published
- 2022
29. Standardized annotation of translated open reading frames
- Author
-
Mudge, Jonathan M, Ruiz-Orera, Jorge, Prensner, John R, Brunet, Marie A, Calvet, Ferriol, Jungreis, Irwin, Gonzalez, Jose Manuel, Magrane, Michele, Martinez, Thomas F, Schulz, Jana Felicitas, Yang, Yucheng T, Albà, M Mar, Aspden, Julie L, Baranov, Pavel V, Bazzini, Ariel A, Bruford, Elspeth, Martin, Maria Jesus, Calviello, Lorenzo, Carvunis, Anne-Ruxandra, Chen, Jin, Couso, Juan Pablo, Deutsch, Eric W, Flicek, Paul, Frankish, Adam, Gerstein, Mark, Hubner, Norbert, Ingolia, Nicholas T, Kellis, Manolis, Menschaert, Gerben, Moritz, Robert L, Ohler, Uwe, Roucou, Xavier, Saghatelian, Alan, Weissman, Jonathan S, and van Heesch, Sebastiaan
- Subjects
Molecular Sequence Annotation ,Open Reading Frames ,Protein Biosynthesis ,Ribosomes - Published
- 2022
30. Author Correction: Perspectives on ENCODE
- Author
-
Snyder, Michael P, Gingeras, Thomas R, Moore, Jill E, Weng, Zhiping, Gerstein, Mark B, Ren, Bing, Hardison, Ross C, Stamatoyannopoulos, John A, Graveley, Brenton R, Feingold, Elise A, Pazin, Michael J, Pagan, Michael, Gilchrist, Daniel A, Hitz, Benjamin C, Cherry, J Michael, Bernstein, Bradley E, Mendenhall, Eric M, Zerbino, Daniel R, Frankish, Adam, Flicek, Paul, and Myers, Richard M
- Subjects
ENCODE Project Consortium ,General Science & Technology - Abstract
In this Article, the authors Rizi Ai (Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA, USA) and Shantao Li (Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA) were mistakenly omitted from the ENCODE Project Consortium author list. The original Article has been corrected online.
- Published
- 2022
31. Author Correction: Expanded encyclopaedias of DNA elements in the human and mouse genomes
- Author
-
Moore, Jill E, Purcaro, Michael J, Pratt, Henry E, Epstein, Charles B, Shoresh, Noam, Adrian, Jessika, Kawli, Trupti, Davis, Carrie A, Dobin, Alexander, Kaul, Rajinder, Halow, Jessica, Van Nostrand, Eric L, Freese, Peter, Gorkin, David U, Shen, Yin, He, Yupeng, Mackiewicz, Mark, Pauli-Behn, Florencia, Williams, Brian A, Mortazavi, Ali, Keller, Cheryl A, Zhang, Xiao-Ou, Elhajjajy, Shaimae I, Huey, Jack, Dickel, Diane E, Snetkova, Valentina, Wei, Xintao, Wang, Xiaofeng, Rivera-Mulia, Juan Carlos, Rozowsky, Joel, Zhang, Jing, Chhetri, Surya B, Zhang, Jialing, Victorsen, Alec, White, Kevin P, Visel, Axel, Yeo, Gene W, Burge, Christopher B, Lécuyer, Eric, Gilbert, David M, Dekker, Job, Rinn, John, Mendenhall, Eric M, Ecker, Joseph R, Kellis, Manolis, Klein, Robert J, Noble, William S, Kundaje, Anshul, Guigó, Roderic, Farnham, Peggy J, Cherry, J Michael, Myers, Richard M, Ren, Bing, Graveley, Brenton R, Gerstein, Mark B, Pennacchio, Len A, Snyder, Michael P, Bernstein, Bradley E, Wold, Barbara, Hardison, Ross C, Gingeras, Thomas R, Stamatoyannopoulos, John A, and Weng, Zhiping
- Subjects
ENCODE Project Consortium ,General Science & Technology - Abstract
In the version of this article initially published, two members of the ENCODE Project Consortium were missing from the author list. Rizi Ai (Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA, USA) and Shantao Li (Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA) are now included in the author list. These errors have been corrected in the online version of the article.
- Published
- 2022
32. Transcriptional determinism and stochasticity contribute to the complexity of autism-associated SHANK family genes
- Author
-
Lu, Xiaona, Ni, Pengyu, Suarez-Meade, Paola, Ma, Yu, Forrest, Emily Niemitz, Wang, Guilin, Wang, Yi, Quiñones-Hinojosa, Alfredo, Gerstein, Mark, and Jiang, Yong-hui
- Published
- 2024
33. Predicting A/B compartments from histone modifications using deep learning
- Author
-
Zheng, Suchen, Thakkar, Nitya, Harris, Hannah L., Liu, Susanna, Zhang, Megan, Gerstein, Mark, Aiden, Erez Lieberman, Rowley, M. Jordan, Noble, William Stafford, Gürsoy, Gamze, and Singh, Ritambhara
- Published
- 2024
34. Privacy-preserving cancer type prediction with homomorphic encryption
- Author
-
Sarkar, Esha, Chielle, Eduardo, Gursoy, Gamze, Chen, Leo, Gerstein, Mark, and Maniatakos, Michail
- Published
- 2023
35. Genetic determination of regional connectivity in modelling the spread of COVID-19 outbreak for more efficient mitigation strategies
- Author
-
Salichos, Leonidas, Warrell, Jonathan, Cevasco, Hannah, Chung, Alvin, and Gerstein, Mark
- Published
- 2023
36. Forest Fire Clustering for Single-cell Sequencing with Iterative Label Propagation and Parallelized Monte Carlo Simulation
- Author
-
Chen, Zhanlin, Goldwasser, Jeremy, Tuckman, Philip, Liu, Jason, Zhang, Jing, and Gerstein, Mark
- Subjects
Computer Science - Machine Learning ,Statistics - Machine Learning - Abstract
In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis., Comment: 30 pages, 6 figures
- Published
- 2021
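Note on entry 36: the abstract mentions per-cell "label entropies" derived from posterior label probabilities but gives no formula. A plausible reading is the Shannon entropy of each cell's posterior; the sketch below assumes that definition, and the function name is illustrative.

```python
import math


def label_entropy(posterior):
    """Shannon entropy (in nats) of one cell's label posterior.

    `posterior` holds non-negative scores, one per candidate label; they
    are normalized here so that unnormalized vote counts also work.
    """
    total = sum(posterior)
    if total <= 0:
        raise ValueError("posterior must contain positive mass")
    probs = [p / total for p in posterior]
    # Zero-probability labels contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Under this reading, a cell with all its mass on one label has entropy 0, while a cell split evenly between two labels has entropy log 2, which is the kind of value one would expect for cells in transition between states.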
37. Security Vulnerabilities and Countermeasures for the Biomedical Data Life Cycle
- Author
-
Ni, Eric, Gürsoy, Gamze, Gerstein, Mark, and Greenbaum, Dov, editor
- Published
- 2023
38. DECODE: a Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays
- Author
-
Chen, Zhanlin, Zhang, Jing, Liu, Jason, Dai, Yi, Lee, Donghoon, Min, Martin Renqiang, Xu, Min, and Gerstein, Mark
- Subjects
Information and Computing Sciences ,Biological Sciences ,Machine Learning ,Genetics ,Human Genome ,Animals ,Deep Learning ,Enhancer Elements ,Genetic ,Genome-Wide Association Study ,Mice ,Neural Networks ,Computer ,Software ,Mathematical Sciences ,Bioinformatics ,Biological sciences ,Information and computing sciences ,Mathematical sciences - Abstract
Motivation: Mapping distal regulatory elements, such as enhancers, is a cornerstone for elucidating how genetic variations may influence diseases. Previous enhancer-prediction methods have used either unsupervised approaches or supervised methods with limited training data. Moreover, past approaches have implemented enhancer discovery as a binary classification problem without accurate boundary detection, producing low-resolution annotations with superfluous regions and reducing the statistical power for downstream analyses (e.g. causal variant mapping and functional validations). Here, we addressed these challenges via a two-step model called Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays (DECODE). First, we employed direct enhancer-activity readouts from novel functional characterization assays, such as STARR-seq, to train a deep neural network for accurate cell-type-specific enhancer prediction. Second, to improve the annotation resolution, we implemented a weakly supervised object detection framework for enhancer localization with precise boundary detection (to a 10 bp resolution) using Gradient-weighted Class Activation Mapping. Results: Our DECODE binary classifier outperformed a state-of-the-art enhancer prediction method by 24% in transgenic mouse validation. Furthermore, the object detection framework can condense enhancer annotations to only 13% of their original size, and these compact annotations have significantly higher conservation scores and genome-wide association study variant enrichments than the original predictions. Overall, DECODE is an effective tool for enhancer classification and precise localization. Availability and implementation: DECODE source code and pre-processing scripts are available at decode.gersteinlab.org. Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2021
39. SCAN-ATAC-Sim: a scalable and efficient method for simulating single-cell ATAC-seq data from bulk-tissue experiments.
- Author
-
Chen, Zhanlin, Zhang, Jing, Liu, Jason, Zhang, Zixuan, Zhu, Jiangqi, Lee, Donghoon, Xu, Min, and Gerstein, Mark
- Abstract
SUMMARY: scATAC-seq is a powerful approach for characterizing cell-type-specific regulatory landscapes. However, it is difficult to benchmark the performance of various scATAC-seq analysis techniques (such as clustering and deconvolution) without having a priori a known set of gold-standard cell types. To simulate scATAC-seq experiments with known cell-type labels, we introduce an efficient and scalable scATAC-seq simulation method (SCAN-ATAC-Sim) that down-samples bulk ATAC-seq data (e.g. from representative cell lines or tissues). Our protocol uses a consistent but tunable signal-to-noise ratio across cell types in a scATAC-seq simulation for integrating bulk experiments with different levels of background noise, and it independently samples twice without replacement to account for the diploid genome. Because it uses an efficient weighted reservoir sampling algorithm and is highly parallelizable with OpenMP, our implementation in C++ allows millions of cells to be simulated in less than an hour on a laptop computer. AVAILABILITY AND IMPLEMENTATION: SCAN-ATAC-Sim is available at scan-atac-sim.gersteinlab.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
- Published
- 2021
40. Quantum Computing at the Frontiers of Biological Sciences
- Author
-
Emani, Prashant S., Warrell, Jonathan, Anticevic, Alan, Bekiranov, Stefan, Gandal, Michael, McConnell, Michael J., Sapiro, Guillermo, Aspuru-Guzik, Alán, Baker, Justin, Bastiani, Matteo, McClure, Patrick, Murray, John, Sotiropoulos, Stamatios N, Taylor, Jacob, Senthil, Geetha, Lehner, Thomas, Gerstein, Mark B., and Harrow, Aram W.
- Subjects
Quantum Physics ,Quantitative Biology - Genomics ,Quantitative Biology - Neurons and Cognition ,Quantitative Biology - Quantitative Methods - Abstract
The search for meaningful structure in biological data has relied on cutting-edge advances in computational technology and data science methods. However, challenges arise as we push the limits of scale and complexity in biological problems. Innovation in massively parallel, classical computing hardware and algorithms continues to address many of these challenges, but there is a need to simultaneously consider new paradigms to circumvent current barriers to processing speed. Accordingly, we articulate a view towards quantum computation and quantum information science, where algorithms have demonstrated potential polynomial and exponential computational speedups in certain applications, such as machine learning. The maturation of the field of quantum computing, in hardware and algorithm development, also coincides with the growth of several collaborative efforts to address questions across length and time scales, and scientific disciplines. We use this coincidence to explore the potential for quantum computing to aid in one such endeavor: the merging of insights from genetics, genomics, neuroimaging and behavioral phenotyping. By examining joint opportunities for computational innovation across fields, we highlight the need for a common language between biological data analysis and quantum computing. Ultimately, we consider current and future prospects for the employment of quantum computing algorithms in the biological sciences. Comment: 22 pages, 3 figures, Perspective.
- Published
- 2019
41. Recurrent repeat expansions in human cancer genomes
- Author
-
Erwin, Graham S., Gürsoy, Gamze, Al-Abri, Rashid, Suriyaprakash, Ashwini, Dolzhenko, Egor, Zhu, Kevin, Hoerner, Christian R., White, Shannon M., Ramirez, Lucia, Vadlakonda, Ananya, Vadlakonda, Alekhya, von Kraut, Konor, Park, Julia, Brannon, Charlotte M., Sumano, Daniel A., Kirtikar, Raushun A., Erwin, Alicia A., Metzner, Thomas J., Yuen, Ryan K. C., Fan, Alice C., Leppert, John T., Eberle, Michael A., Gerstein, Mark, and Snyder, Michael P.
- Published
- 2023
42. DiNeR: a Differential graphical model for analysis of co-regulation Network Rewiring
- Author
-
Zhang, Jing, Liu, Jason, Lee, Donghoon, Lou, Shaoke, Chen, Zhanlin, Gürsoy, Gamze, and Gerstein, Mark
- Subjects
Biological Sciences ,Bioinformatics and Computational Biology ,Genetics ,Cancer ,Human Genome ,Hematology ,Networking and Information Technology R&D (NITRD) ,Aetiology ,2.1 Biological and endogenous factors ,Generic health relevance ,Chromatin Immunoprecipitation ,Gene Expression Regulation ,Gene Regulatory Networks ,Genome ,Humans ,K562 Cells ,Leukemia ,Myelogenous ,Chronic ,BCR-ABL Positive ,Models ,Genetic ,Protein Binding ,Software ,Transcription Factors ,Transcription ,Genetic ,Transcription factor co-regulation network ,ENCODE ,TF dysregulation ,Network changes ,Mathematical Sciences ,Information and Computing Sciences ,Bioinformatics ,Biological sciences ,Information and computing sciences ,Mathematical sciences - Abstract
BACKGROUND: During transcription, numerous transcription factors (TFs) bind to targets in a highly coordinated manner to control the gene expression. Alterations in groups of TF-binding profiles (i.e. "co-binding changes") can affect the co-regulating associations between TFs (i.e. "rewiring the co-regulator network"). This, in turn, can potentially drive downstream expression changes, phenotypic variation, and even disease. However, quantification of co-regulatory network rewiring has not been comprehensively studied. RESULTS: To address this, we propose DiNeR, a computational method to directly construct a differential TF co-regulation network from paired disease-to-normal ChIP-seq data. Specifically, DiNeR uses a graphical model to capture the gained and lost edges in the co-regulation network. Then, it adopts a stability-based, sparsity-tuning criterion (sub-sampling the complete binding profiles to remove spurious edges) to report only significant co-regulation alterations. Finally, DiNeR highlights hubs in the resultant differential network as key TFs associated with disease. We assembled genome-wide binding profiles of 104 TFs in the K562 and GM12878 cell lines, which loosely model the transition between normal and cancerous states in chronic myeloid leukemia (CML). In total, we identified 351 significantly altered TF co-regulation pairs. In particular, we found that the co-binding of the tumor suppressor BRCA1 and RNA polymerase II, a well-known transcriptional pair in healthy cells, was disrupted in tumors. Thus, DiNeR successfully extracted hub regulators and discovered well-known risk genes. CONCLUSIONS: Our method DiNeR makes it possible to quantify changes in co-regulatory networks and identify alterations to TF co-binding patterns, highlighting key disease regulators.
- Published
- 2020
43. RADAR: annotation and prioritization of variants in the post-transcriptional regulome of RNA-binding proteins
- Author
-
Zhang, Jing, Liu, Jason, Lee, Donghoon, Feng, Jo-Jo, Lochovsky, Lucas, Lou, Shaoke, Rutenberg-Schoenberg, Michael, and Gerstein, Mark
- Subjects
Biological Sciences ,Bioinformatics and Computational Biology ,Genetics ,Human Genome ,1.1 Normal biological development and functioning ,Underpinning research ,Breast Neoplasms ,Genomics ,Humans ,RNA Processing ,Post-Transcriptional ,RNA-Binding Proteins ,Software ,RNA-binding protein ,Post-transcriptional regulation ,Variant prioritization ,Variant functional impact ,Environmental Sciences ,Information and Computing Sciences ,Bioinformatics - Abstract
RNA-binding proteins (RBPs) play key roles in post-transcriptional regulation and disease. Their binding sites cover more of the genome than coding exons; nevertheless, most noncoding variant prioritization methods only focus on transcriptional regulation. Here, we integrate the portfolio of ENCODE-RBP experiments to develop RADAR, a variant-scoring framework. RADAR uses conservation, RNA structure, network centrality, and motifs to provide an overall impact score. Then, it further incorporates tissue-specific inputs to highlight disease-specific variants. Our results demonstrate RADAR can successfully pinpoint variants, both somatic and germline, associated with RBP-function dysregulation, which cannot be found by most current prioritization methods, for example, variants affecting splicing.
- Published
- 2020
44. Author Correction: Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples.
- Author
-
Bailey, Matthew H, Meyerson, William U, Dursi, Lewis Jonathan, Wang, Liang-Bo, Dong, Guanlan, Liang, Wen-Wei, Weerasinghe, Amila, Li, Shantao, Li, Yize, Kelso, Sean, MC3 Working Group, PCAWG novel somatic mutation calling methods working group, Saksena, Gordon, Ellrott, Kyle, Wendl, Michael C, Wheeler, David A, Getz, Gad, Simpson, Jared T, Gerstein, Mark B, Ding, Li, and PCAWG Consortium
- Subjects
MC3 Working Group ,PCAWG novel somatic mutation calling methods working group ,PCAWG Consortium - Abstract
Correction to this paper has been published: https://doi.org/10.1038/s41467-020-20128-w.
- Published
- 2020
45. The association between evening social media use and delayed sleep may be causal: Suggestive evidence from 120 million Reddit timestamps
- Author
-
Meyerson, William U., Fineberg, Sarah K., Andrade, Fernanda C., Corlett, Philip, Gerstein, Mark B., and Hoyle, Rick H.
- Published
- 2023
46. Minor intron splicing is critical for survival of lethal prostate cancer
- Author
-
Augspach, Anke, Drake, Kyle D., Roma, Luca, Qian, Ellen, Lee, Se Ri, Clarke, Declan, Kumar, Sushant, Jaquet, Muriel, Gallon, John, Bolis, Marco, Triscott, Joanna, Galván, José A., Chen, Yu, Thalmann, George N., Kruithof-de Julio, Marianna, Theurillat, Jean-Philippe P., Wuchty, Stefan, Gerstein, Mark, Piscuoglio, Salvatore, Kanadia, Rahul N., and Rubin, Mark A.
- Published
- 2023
47. Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples.
- Author
-
Bailey, Matthew H, Meyerson, William U, Dursi, Lewis Jonathan, Wang, Liang-Bo, Dong, Guanlan, Liang, Wen-Wei, Weerasinghe, Amila, Li, Shantao, Li, Yize, Kelso, Sean, MC3 Working Group, PCAWG novel somatic mutation calling methods working group, Saksena, Gordon, Ellrott, Kyle, Wendl, Michael C, Wheeler, David A, Getz, Gad, Simpson, Jared T, Gerstein, Mark B, Ding, Li, and PCAWG Consortium
- Subjects
MC3 Working Group ,PCAWG novel somatic mutation calling methods working group ,PCAWG Consortium ,Humans ,Neoplasms ,DNA ,Intergenic ,Retrospective Studies ,Base Composition ,Mutation ,Genome ,Human ,Exons ,Databases ,Genetic ,Exome ,Whole Genome Sequencing ,Whole Exome Sequencing ,Cancer ,Biotechnology ,Genetics ,Human Genome - Abstract
The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF
- Published
- 2020
48. Supervised enhancer prediction with epigenetic pattern recognition and targeted validation
- Author
-
Sethi, Anurag, Gu, Mengting, Gumusgoz, Emrah, Chan, Landon, Yan, Koon-Kiu, Rozowsky, Joel, Barozzi, Iros, Afzal, Veena, Akiyama, Jennifer A, Plajzer-Frick, Ingrid, Yan, Chengfei, Novak, Catherine S, Kato, Momoe, Garvin, Tyler H, Pham, Quan, Harrington, Anne, Mannion, Brandon J, Lee, Elizabeth A, Fukuda-Yuzawa, Yoko, Visel, Axel, Dickel, Diane E, Yip, Kevin Y, Sutton, Richard, Pennacchio, Len A, and Gerstein, Mark
- Subjects
Biological Sciences ,Bioinformatics and Computational Biology ,Machine Learning and Artificial Intelligence ,Human Genome ,Networking and Information Technology R&D (NITRD) ,Genetics ,1.1 Normal biological development and functioning ,Animals ,Cell Line ,Drosophila ,Epigenesis ,Genetic ,Histones ,Humans ,Mice ,Mice ,Transgenic ,Pattern Recognition ,Automated ,Reproducibility of Results ,Technology ,Medical and Health Sciences ,Developmental Biology ,Biological sciences - Abstract
Enhancers are important non-coding elements, but they have traditionally been hard to characterize experimentally. The development of massively parallel assays allows the characterization of large numbers of enhancers for the first time. Here, we developed a framework using Drosophila STARR-seq to create shape-matching filters based on meta-profiles of epigenetic features. We integrated these features with supervised machine-learning algorithms to predict enhancers. We further demonstrated that our model could be transferred to predict enhancers in mammals. We comprehensively validated the predictions using a combination of in vivo and in vitro approaches, involving transgenic assays in mice and transduction-based reporter assays in human cell lines (153 enhancers in total). The results confirmed that our model can accurately predict enhancers in different species without re-parameterization. Finally, we examined the transcription factor binding patterns at predicted enhancers versus promoters. We demonstrated that these patterns enable the construction of a secondary model that effectively distinguishes enhancers and promoters.
- Published
- 2020
49. Perspectives on ENCODE
- Author
-
Snyder, Michael P, Gingeras, Thomas R, Moore, Jill E, Weng, Zhiping, Gerstein, Mark B, Ren, Bing, Hardison, Ross C, Stamatoyannopoulos, John A, Graveley, Brenton R, Feingold, Elise A, Pazin, Michael J, Pagan, Michael, Gilchrist, Daniel A, Hitz, Benjamin C, Cherry, J Michael, Bernstein, Bradley E, Mendenhall, Eric M, Zerbino, Daniel R, Frankish, Adam, Flicek, Paul, and Myers, Richard M
- Subjects
Biological Sciences ,Bioinformatics and Computational Biology ,Genetics ,Biotechnology ,Human Genome ,1.1 Normal biological development and functioning ,Animals ,Binding Sites ,Chromatin ,DNA Methylation ,Databases ,Genetic ,Gene Expression Regulation ,Genome ,Genome ,Human ,Genomics ,Histones ,Humans ,Mice ,Molecular Sequence Annotation ,Quality Control ,Regulatory Sequences ,Nucleic Acid ,Transcription Factors ,ENCODE Project Consortium ,General Science & Technology - Abstract
The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.
- Published
- 2020
50. Expanded encyclopaedias of DNA elements in the human and mouse genomes
- Author
-
Moore, Jill E, Purcaro, Michael J, Pratt, Henry E, Epstein, Charles B, Shoresh, Noam, Adrian, Jessika, Kawli, Trupti, Davis, Carrie A, Dobin, Alexander, Kaul, Rajinder, Halow, Jessica, Van Nostrand, Eric L, Freese, Peter, Gorkin, David U, Shen, Yin, He, Yupeng, Mackiewicz, Mark, Pauli-Behn, Florencia, Williams, Brian A, Mortazavi, Ali, Keller, Cheryl A, Zhang, Xiao-Ou, Elhajjajy, Shaimae I, Huey, Jack, Dickel, Diane E, Snetkova, Valentina, Wei, Xintao, Wang, Xiaofeng, Rivera-Mulia, Juan Carlos, Rozowsky, Joel, Zhang, Jing, Chhetri, Surya B, Zhang, Jialing, Victorsen, Alec, White, Kevin P, Visel, Axel, Yeo, Gene W, Burge, Christopher B, Lécuyer, Eric, Gilbert, David M, Dekker, Job, Rinn, John, Mendenhall, Eric M, Ecker, Joseph R, Kellis, Manolis, Klein, Robert J, Noble, William S, Kundaje, Anshul, Guigó, Roderic, Farnham, Peggy J, Cherry, J Michael, Myers, Richard M, Ren, Bing, Graveley, Brenton R, Gerstein, Mark B, Pennacchio, Len A, Snyder, Michael P, Bernstein, Bradley E, Wold, Barbara, Hardison, Ross C, Gingeras, Thomas R, Stamatoyannopoulos, John A, and Weng, Zhiping
- Subjects
Biological Sciences ,Bioinformatics and Computational Biology ,Genetics ,Human Genome ,1.1 Normal biological development and functioning ,Animals ,Chromatin ,DNA ,DNA Footprinting ,DNA Methylation ,DNA Replication Timing ,Databases ,Genetic ,Deoxyribonuclease I ,Genome ,Genome ,Human ,Genomics ,Histones ,Humans ,Mice ,Mice ,Transgenic ,Molecular Sequence Annotation ,RNA-Binding Proteins ,Registries ,Regulatory Sequences ,Nucleic Acid ,Transcription ,Genetic ,Transposases ,ENCODE Project Consortium ,General Science & Technology - Abstract
The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.
- Published
- 2020