Lu, Kezhi, Yang, Kuo, Sun, Hailong, Zhang, Qian, Zheng, Qiguang, Xu, Kuan, Chen, Jianxin, and Zhou, Xuezhong
Phenotypes (i.e., symptoms and clinical signs) are essential for clinical diagnosis and for research related to symptom science and precision health. As the clinically observable manifestations of a disease, symptoms are clinically significant because they are the direct reason patients seek medical care and the primary indicators clinicians use for diagnosis and treatment. However, a comprehensive phenotypic knowledge base and high-quality symptom–gene associations are lacking. Therefore, a thorough understanding of the relationships between symptoms and other entities is urgently needed to support scientific research and clinical health care. In this paper, we constructed a systematic, large-scale, and high-quality symptom–gene associations network system named SympGAN (accessible at http://www.sympgan.org/). We provide access to a database with millions of associations between symptoms, genes, diseases, and drugs, as well as a system that lets users search and analyze the data, perform knowledge inference, and visualize the results. We utilize state-of-the-art machine learning and deep learning algorithms as the backbone to form the final dataset. In addition, we utilize the RoBERTa-PubMed neural network for named entity recognition to assist in data screening. A knowledge graph is adopted to organize the relationships between different entities. We adopt the ConvE, TuckER, and HypER methods for knowledge completion experiments to validate the quality of the final knowledge graph triples. Based on the results, we provide online automatic knowledge inference interfaces. The system, SympGAN, has promising value for disease diagnosis, decision support in health care, precision health, and scientific research, as researchers and practitioners can easily access information about symptoms, diseases, targets, gene ontology, and drugs.
SympGAN is a comprehensive framework designed for the integration of symptom phenotypes, utilizing neural network embeddings and deep information extraction models. We have developed an integrative framework that establishes connections between symptoms and genes. This framework encompasses relationship inference through deep network embedding, literature mining using named entity recognition methods, and manual curation. Consequently, we have created a robust database and knowledge graph containing millions of associations between symptoms, genes, diseases, and drugs. SympGAN is readily accessible at http://www.sympgan.org/, providing users with the ability to search, analyze, perform knowledge inference, and visualize information pertaining to these terminologies. The construction of SympGAN has filled knowledge gaps and established millions of high-quality associations between symptoms, genotypes, diseases, and drugs. It holds considerable potential for advancing precision health and symptom science.
• We have developed SympGAN, a large-scale, high-quality knowledge graph-based system built on a comprehensive terminology set of 12,560 symptom phenotypes and their associations with genes, diseases, and drugs.
• SympGAN provides a comprehensive dataset for the knowledge graph, comprising 401,126 symptom–gene triples. These triples, along with their accompanying data, were collected through careful procedures to ensure quality. Our methodology employs the RoBERTa-PubMed model for named entity recognition (NER) and literature mining of biomedical studies to gather pertinent information. Furthermore, we utilize high-precision algorithms to infer phenotypic associations with genes.
• The website http://www.sympgan.org/ offers a comprehensive platform that enables users to conduct integrative searches and perform online knowledge inference and analysis. It serves as a centralized hub for exploring clinical knowledge associated with symptoms, as well as related diseases, genes, drugs, and molecular networks. This robust resource facilitates the interpretation and exploration of symptom phenotypes, particularly in understanding their genetic origins. By promoting precision health research and advancing the field of symptom science, http://www.sympgan.org/ significantly contributes to enhancing our understanding of symptoms and the underlying genetic factors involved. [ABSTRACT FROM AUTHOR]
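To make the notion of knowledge graph triples concrete, the following is a minimal sketch, not the SympGAN implementation. It stores (head, relation, tail) triples in an index and proposes candidate symptom–gene links by following symptom → disease → gene paths. This two-hop path rule is a deliberately simple stand-in for the embedding-based inference the abstract describes (ConvE, TuckER, HypER). All identifiers (`S1`, `D1`, `G1`, `R1`), the relation names, and the function names are hypothetical placeholders, not entries or APIs from the database.

```python
from collections import defaultdict

# Toy triples with placeholder identifiers (S = symptom, D = disease,
# G = gene, R = drug). These are illustrative only, not real SympGAN data.
TRIPLES = [
    ("S1", "symptom_gene", "G1"),
    ("S1", "symptom_disease", "D1"),
    ("S2", "symptom_disease", "D1"),
    ("D1", "disease_gene", "G2"),
    ("D1", "disease_drug", "R1"),
]


def build_index(triples):
    """Map each (head, relation) pair to the set of tails it points to."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index


def direct_genes(index, symptom):
    """Genes linked to the symptom by an explicit symptom_gene triple."""
    return set(index.get((symptom, "symptom_gene"), set()))


def candidate_genes_via_disease(index, symptom):
    """Genes reachable via symptom -> disease -> gene that are not already
    directly linked. A naive path rule, assumed here for illustration."""
    candidates = set()
    for disease in index.get((symptom, "symptom_disease"), set()):
        candidates |= index.get((disease, "disease_gene"), set())
    return candidates - direct_genes(index, symptom)


INDEX = build_index(TRIPLES)

if __name__ == "__main__":
    print("direct:", sorted(direct_genes(INDEX, "S1")))
    print("candidates:", sorted(candidate_genes_via_disease(INDEX, "S1")))
```

In a real knowledge completion setting, the path rule would be replaced by a learned scoring function over entity and relation embeddings, and candidates would be ranked by score rather than returned as an unordered set.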