Author: "Dominik Heider" - Searchworks@Jio Institute Digital Library Search Results

1. Large language models and their applications in bioinformatics

Author: Oluwafemi A. Sarumi and Dominik Heider
Subjects: Bioinformatics, Large language models, Natural language processing, Omics data, Biotechnology, TP248.13-248.65
Abstract: Recent advancements in Natural Language Processing (NLP) have been significantly driven by the development of Large Language Models (LLMs), representing a substantial leap in language-based technology capabilities. These models, built on sophisticated deep learning architectures, typically transformers, are characterized by billions of parameters and extensive training data, enabling them to achieve high accuracy across various tasks. The transformer architecture of LLMs allows them to effectively handle context and sequential information, which is crucial for understanding and generating human language. Beyond traditional NLP applications, LLMs have shown significant promise in bioinformatics, transforming the field by addressing challenges associated with large and complex biological datasets. In genomics, proteomics, and personalized medicine, LLMs facilitate identifying patterns, predicting protein structures, or understanding genetic variations. This capability is crucial, e.g., for advancing drug discovery, where accurate prediction of molecular interactions is essential. This review discusses the current trends in LLM research and their potential to revolutionize the field of bioinformatics and accelerate novel discoveries in the life sciences.
Published: 2024
Full Text: View/download PDF

2. Privacy-preserving decentralized learning methods for biomedical applications

Author: Mohammad Tajabadi, Roman Martin, and Dominik Heider
Subjects: Federated learning, Split learning, Swarm learning, Gossip learning, Edge learning, Biotechnology, TP248.13-248.65
Abstract: In recent years, decentralized machine learning has emerged as a significant advancement in biomedical applications, offering robust solutions for data privacy, security, and collaboration across diverse healthcare environments. In this review, we examine various decentralized learning methodologies, including federated learning, split learning, swarm learning, gossip learning, edge learning, and some of their applications in the biomedical field. We delve into the underlying principles, network topologies, and communication strategies of each approach, highlighting their advantages and limitations. Ultimately, the selection of a suitable method should be based on specific needs, infrastructures, and computational capabilities.
Published: 2024
Full Text: View/download PDF

3. NeuralBeds: Neural embeddings for efficient DNA data compression and optimized similarity search

Author: Oluwafemi A. Sarumi, Maximilian Hahn, and Dominik Heider
Subjects: DNA similarity, Neural embeddings, Artificial intelligence, Biotechnology, TP248.13-248.65
Abstract: The availability of high-throughput sequencing tools coupled with the declining costs in the production of DNA sequences has led to the generation of enormous amounts of omics data curated in several databases such as NCBI and EMBL. Identification of similar DNA sequences from these databases is one of the fundamental tasks in bioinformatics. It is essential for discovering homologous sequences in organisms, phylogenetic studies of evolutionary relationships among several biological entities, or detection of pathogens. Improving DNA similarity search is of utmost importance because of the increased complexity of the ever-growing repositories of sequences. Therefore, instead of using the conventional approach of comparing raw sequences, e.g., in FASTA format, a numerical representation of the sequences can be used to calculate their similarities and optimize the search process. In this study, we analyzed different approaches for numerical embeddings, including Chaos Game Representation, hashing, and neural networks, and compared them with classical approaches such as principal component analysis. It turned out that neural networks generate embeddings that are able to capture the similarity between DNA sequences as a distance measure and significantly outperform the other approaches on DNA similarity search.
Published: 2024
Full Text: View/download PDF
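
The abstract above names Chaos Game Representation (CGR) as one of the numerical embeddings compared in the study. The following is a minimal sketch of that general technique, not the authors' code: the corner assignment, the grid resolution `k`, and the function names `cgr_points`, `fcgr`, and `distance` are all assumptions made for illustration.

```python
import math

# Assumed corner convention; other assignments are also in use.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: each base moves halfway to its corner."""
    x, y = 0.5, 0.5  # start in the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            raise ValueError(f"unsupported base: {base!r}")
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr(seq, k=3):
    """Count CGR points on a 2^k x 2^k grid (a fixed-size numerical embedding)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    # Only points reached after at least k bases are counted.
    for x, y in cgr_points(seq)[k - 1:]:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


def distance(grid_a, grid_b):
    """Euclidean distance between two grids of equal size."""
    return math.sqrt(
        sum((a - b) ** 2
            for row_a, row_b in zip(grid_a, grid_b)
            for a, b in zip(row_a, row_b))
    )
```

Two sequences can then be compared with `distance(fcgr(s1), fcgr(s2))` rather than by aligning the raw strings. According to the abstract, neural-network embeddings outperformed this kind of representation on similarity search.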

4. Turbo autoencoders for the DNA data storage channel with Autoturbo-DNA

Author: Marius Welzel, Hagen Dreßler, and Dominik Heider
Subjects: Biotechnology, Devices, Science
Abstract: DNA, with its high storage density and long-term stability, is a potential candidate for a next-generation storage device. The DNA data storage channel, composed of synthesis, amplification, storage, and sequencing, exhibits error probabilities and error profiles specific to the components of the channel. Here, we present Autoturbo-DNA, a PyTorch framework for training error-correcting, overcomplete autoencoders specifically tailored for the DNA data storage channel. It allows training different architecture combinations and using a wide variety of channel component models for noise generation during training. It further supports training the encoder to generate DNA sequences that adhere to user-defined constraints. Autoturbo-DNA exhibits error-correction capabilities close to non-neural-network state-of-the-art error correction and constrained codes for DNA data storage. Our results indicate that neural-network-based codes can be a viable alternative to traditionally designed codes for the DNA data storage channel.
Published: 2024
Full Text: View/download PDF

5. Natrix2 – Improved amplicon workflow with novel Oxford Nanopore Technologies support and enhancements in clustering, classification and taxonomic databases

Author: Aman Deep, Dana Bludau, Marius Welzel, Sandra Clemens, Dominik Heider, Jens Boenigk, and Daniela Beisser
Subjects: Ecology, QH540-549.5
Abstract: Sequencing of amplified DNA is the first step towards the generation of Amplicon Sequence Variants (ASVs) or Operational Taxonomic Units (OTUs) for biodiversity assessment and comparative analyses of environmental communities and microbiomes. Notably, the rapid advancements in sequencing technologies have paved the way for the growing utilization of third-generation long-read approaches in recent years. These sequence data imply increasing read lengths, higher error rates, and altered sequencing chemistry. Likewise, methods for amplicon classification and reference databases have progressed, leading to the expansion of taxonomic application areas and higher classification accuracy. With Natrix, a user-friendly and reducible workflow solution, processing of prokaryotic and eukaryotic environmental Illumina sequences using 16S or 18S is possible. Here, we present an updated version of the pipeline, Natrix2, which incorporates VSEARCH as an alternative clustering method with better performance for 16S metabarcoding approaches and mothur for taxonomic classification on further databases, including PR2, UNITE and SILVA. Additionally, Natrix2 includes the handling of Nanopore reads, which entails initial error correction and refinement of reads using Medaka and Racon to subsequently determine their taxonomic classification.
Published: 2023
Full Text: View/download PDF

6. Editorial: Artificial intelligence and bioinformatics applications for omics and multi-omics studies

Author: Angelo Facchiano, Dominik Heider, and Margherita Mutarelli
Subjects: artificial intelligence, multi-omics, systems biology, biomedical data science, machine learning, Genetics, QH426-470
Published: 2024
Full Text: View/download PDF

7. Unsupervised encoding selection through ensemble pruning for biomedical classification

Author: Sebastian Spänig, Alexander Michel, and Dominik Heider
Subjects: Biomedical classification, Antimicrobial peptides, Encodings, Machine learning, Ensemble learning, Computer applications to medicine. Medical informatics, R858-859.7, Analysis, QA299.6-433
Abstract: Background: Owing to the rising levels of multi-resistant pathogens, antimicrobial peptides, an alternative strategy to classic antibiotics, have received more attention. A crucial part is thereby the costly identification and validation. With the ever-growing amount of annotated peptides, researchers leverage artificial intelligence to circumvent the cumbersome, wet-lab-based identification and automate the detection of promising candidates. However, the prediction of a peptide’s function is not limited to antimicrobial efficiency. To date, multiple studies successfully classified additional properties, e.g., antiviral or cell-penetrating effects. In this light, ensemble classifiers are employed aiming to further improve the prediction. Although we recently presented a workflow to significantly diminish the initial encoding choice, an entirely unsupervised encoding selection, considering various machine learning models, is still lacking. Results: We developed a workflow, automatically selecting encodings and generating classifier ensembles by employing sophisticated pruning methods. We observed that the Pareto frontier pruning is a good method to create encoding ensembles for the datasets at hand. In addition, encodings combined with the Decision Tree classifier as the base model are often superior. However, our results also demonstrate that none of the ensemble building techniques is outstanding for all datasets. Conclusion: The workflow conducts multiple pruning methods to evaluate ensemble classifiers composed from a wide range of peptide encodings and base models. Consequently, researchers can use the workflow for unsupervised encoding selection and ensemble creation. Ultimately, the extensible workflow can be used as a plugin for the PEPTIDE REACToR, further establishing it as a versatile tool in the domain.
Published: 2023
Full Text: View/download PDF

8. DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage

Author: Marius Welzel, Peter Michael Schwarz, Hannah F. Löchel, Tolganay Kabdullayeva, Sandra Clemens, Anke Becker, Bernd Freisleben, and Dominik Heider
Subjects: Science
Abstract: The extensive information capacity of DNA makes it an attractive alternative to traditional data storage. DNA-Aeon is a DNA data storage solution that can correct all error types commonly observed in DNA storage, while encoding data into sequences that meet user-defined constraints such as GC content, homopolymer length, and no undesired motifs.
Published: 2023
Full Text: View/download PDF
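
The abstract above lists three kinds of user-defined sequence constraints: GC content, homopolymer length, and undesired motifs. The sketch below only checks a finished sequence against such constraints. It is not DNA-Aeon's interface or its arithmetic-coding scheme, and the function names and default thresholds are assumptions chosen for illustration.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    longest = run = 0
    prev = None
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        prev = base
        longest = max(longest, run)
    return longest


def meets_constraints(seq, gc_range=(0.4, 0.6), max_homopolymer=3,
                      forbidden_motifs=()):
    """Check a sequence against the three constraint types (assumed defaults)."""
    seq = seq.upper()
    low, high = gc_range
    if not low <= gc_content(seq) <= high:
        return False
    if longest_homopolymer(seq) > max_homopolymer:
        return False
    return not any(motif.upper() in seq for motif in forbidden_motifs)
```

For example, `meets_constraints("ACGTACGT", forbidden_motifs=("GTAC",))` rejects the sequence because it contains the listed motif.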

9. Why loss of Y? A pan-cancer genome analysis of tumors with loss of Y chromosome

Author: Philipp Müller, Oscar Velazquez Camacho, Ali M. Yazbeck, Christina Wölwer, Weiwei Zhai, Johannes Schumacher, Dominik Heider, Reinhard Buettner, Alexander Quaas, and Axel M. Hillmer
Subjects: Loss of Y, Y chromosome, Aneuploidy, TCGA, Pan-Cancer, Biotechnology, TP248.13-248.65
Abstract: Loss of the Y chromosome (LoY) is frequently observed in somatic cells of elderly men. However, LoY is highly increased in tumor tissue and correlates with an overall worse prognosis. The underlying causes and downstream effects of LoY are largely unknown. Therefore, we analyzed genomic and transcriptomic data of 13 cancer types (2375 patients) and classified tumors of male patients according to loss or retention of the Y chromosome (LoY or RoY, average LoY fraction: 0.46). The frequencies of LoY ranged from near absence (glioblastoma, glioma, thyroid carcinoma) to 77% (kidney renal papillary cell carcinoma). Genomic instability, aneuploidy, and mutation burden were enriched in LoY tumors. In addition, we found that the gatekeeping tumor suppressor gene TP53 was mutated more frequently in LoY tumors of three cancer types (colon adenocarcinoma, head and neck squamous carcinoma, lung adenocarcinoma) and that the oncogenes MET, CDK6, KRAS, and EGFR were amplified in multiple cancer types. On the transcriptomic level, we observed MMP13, known to be involved in invasion, to be up-regulated in LoY of three adenocarcinomas and down-regulation of the tumor suppressor gene GPC5 in LoY of three cancer types. Furthermore, we found enrichment of a smoking-related mutation signature in LoY tumors of head and neck and lung cancer. Strikingly, we observed a correlation between cancer type-specific sex bias in incidence rates and frequencies of LoY, in line with the hypothesis that LoY increases cancer risk in males. Overall, LoY is a frequent phenomenon in cancer that is enriched in genomically unstable tumors. It correlates with genomic features beyond the Y chromosome and might contribute to higher incidence rates in males.
Published: 2023
Full Text: View/download PDF

10. DNAsmart: Multiple attribute ranking tool for DNA data storage systems

Author: Chisom Ezekannagha, Marius Welzel, Dominik Heider, and Georges Hattab
Subjects: Data storage, DNA, Medium, Attribute, Ranking, Visual analytic, Biotechnology, TP248.13-248.65
Abstract: With the ever-growing need for data storage capacity, the Deoxyribonucleic Acid (DNA) molecule is gaining traction as a new storage medium with a larger capacity, higher density, and a longer lifespan than conventional storage media. To effectively use DNA for data storage, it is important to understand the different methods of encoding information in DNA and compare their effectiveness. This requires evaluating which decoded DNA sequences carry the most encoded information based on various attributes. However, navigating the field of coding theory requires years of experience and domain expertise. For instance, domain experts rely on various mathematical functions and attributes to score and evaluate their encodings. To enable such analytical tasks, we provide an interactive and visual analytical framework for multi-attribute ranking in DNA storage systems. Our framework follows a three-step view with user-settable parameters. It enables users to find the optimal en-/de-coding approaches by setting different weights and combining multiple attributes. We assess the validity of our work through a task-specific user study on domain experts by relying on three tasks. Results indicate that all participants completed their tasks successfully in under two minutes, then rated the framework for design choices, perceived usefulness, and intuitiveness. In addition, two real-world use cases are shared and analyzed as direct applications of the proposed tool. DNAsmart enables the ranking of decoded sequences based on multiple attributes. In sum, this work makes the evaluation of en-/de-coding approaches accessible and tractable through visualization and interactivity to solve comparison and ranking tasks.
Published: 2023
Full Text: View/download PDF
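
The abstract above describes ranking by setting weights and combining multiple attributes. One simple way to do that is a weighted sum over min-max-normalized attributes, sketched below. This scoring rule is an assumption; the abstract does not state which function DNAsmart uses, and the function name and the attribute names in the usage note are hypothetical.

```python
def rank_by_weighted_score(items, weights):
    """Rank items by a weighted sum of min-max-normalized attribute values.

    items:   {item_name: {attribute: value}}
    weights: {attribute: weight}, higher weight means more influence
    """
    attrs = list(weights)
    lo = {a: min(values[a] for values in items.values()) for a in attrs}
    hi = {a: max(values[a] for values in items.values()) for a in attrs}

    def normalize(attr, value):
        span = hi[attr] - lo[attr]
        # An attribute with no spread cannot separate the items.
        return 0.0 if span == 0 else (value - lo[attr]) / span

    scores = {
        name: sum(weights[a] * normalize(a, values[a]) for a in attrs)
        for name, values in items.items()
    }
    # Highest score first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Calling it with two candidate encodings scored on, say, `density` and `recovery` and weights of 0.75 and 0.25 returns the candidates ordered by their combined score.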

11. Multivalent binding kinetics resolved by fluorescence proximity sensing

Author: Clemens Schulte, Alice Soldà, Sebastian Spänig, Nathan Adams, Ivana Bekić, Werner Streicher, Dominik Heider, Ralf Strasser, and Hans Michael Maric
Subjects: Biology (General), QH301-705.5
Abstract: Fluorescence proximity sensing enables high throughput determination of binding affinities and kinetics of peptide inhibitors with varying valency and multivalent architecture.
Published: 2022
Full Text: View/download PDF

12. Sharing Data With Shared Benefits: Artificial Intelligence Perspective

Author: Mohammad Tajabadi, Linus Grabenhenrich, Adèle Ribeiro, Michael Leyer, and Dominik Heider
Subjects: Computer applications to medicine. Medical informatics, R858-859.7, Public aspects of medicine, RA1-1270
Abstract: Artificial intelligence (AI) and data sharing go hand in hand. In order to develop powerful AI models for medical and health applications, data need to be collected and brought together over multiple centers. However, due to various reasons, including data privacy, not all data can be made publicly available or shared with other parties. Federated and swarm learning can help in these scenarios. However, in the private sector, such as between companies, the incentive is limited, as the resulting AI models would be available for all partners irrespective of their individual contribution, including the amount of data provided by each party. Here, we explore a potential solution to this challenge as a viewpoint, aiming to establish a fairer approach that encourages companies to engage in collaborative data analysis and AI modeling. Within the proposed approach, each individual participant could gain a model commensurate with their respective data contribution, ultimately leading to better diagnostic tools for all participants in a fair manner.
Published: 2023
Full Text: View/download PDF

13. Corrigendum to 'Dissecting the genetic heterogeneity of gastric cancer'

Author: Timo Hess, Carlo Maj, Jan Gehlen, Oleg Borisov, Stephan L. Haas, Ines Gockel, Michael Vieth, Guillaume Piessen, Hakan Alakus, Yogesh Vashist, Carina Pereira, Michael Knapp, Vitalia Schüller, Alexander Quaas, Heike I. Grabsch, Jessica Trautmann, Ewa Malecka-Wojciesko, Anna Mokrowiecka, Jan Speller, Andreas Mayr, Julia Schröder, Axel M. Hillmer, Dominik Heider, Florian Lordick, Ángeles Pérez-Aísa, Rafael Campo, Jesús Espinel, Fernando Geijo, Concha Thomson, Luis Bujanda, Federico Sopeña, Ángel Lanas, María Pellisé, Claudia Pauligk, Thorsten Oliver Goetze, Carolin Zelck, Julian Reingruber, Emadeldin Hassanin, Peter Elbe, Sandra Alsabeah, Mats Lindblad, Magnus Nilsson, Nicole Kreuser, René Thieme, Francesca Tavano, Roberta Pastorino, Dario Arzani, Roberto Persiani, Jin-On Jung, Henrik Nienhüser, Katja Ott, Ralf R. Schumann, Oliver Kumpf, Susen Burock, Volker Arndt, Anna Jakubowska, Małgorzta Ławniczak, Victor Moreno, Vicente Martín, Manolis Kogevinas, Marina Pollán, Justyna Dąbrowska, Antonio Salas, Olivier Cussenot, Anne Boland-Auge, Delphine Daian, Jean-Francois Deleuze, Erika Salvi, Maris Teder-Laving, Gianluca Tomasello, Margherita Ratti, Chiara Senti, Valli De Re, Agostino Steffan, Arnulf H. Hölscher, Katharina Messerle, Christiane Josephine Bruns, Armands Sīviņš, Inga Bogdanova, Jurgita Skieceviciene, Justina Arstikyte, Markus Moehler, Hauke Lang, Peter P. Grimminger, Martin Kruschewski, Nikolaos Vassos, Claus Schildberg, Philipp Lingohr, Karsten Ridwelski, Hans Lippert, Nadine Fricker, Peter Krawitz, Per Hoffmann, Markus M. Nöthen, Lothar Veits, Jakob R. Izbicki, Adrianna Mostowska, Federico Martinón-Torres, Daniele Cusi, Rolf Adolfsson, Geraldine Cancel-Tassin, Aksana Höblinger, Ernst Rodermann, Monika Ludwig, Gisela Keller, Andres Metspalu, Hermann Brenner, Joerg Heller, Markus Neef, Michael Schepke, Franz Ludwig Dumoulin, Lutz Hamann, Renato Cannizzaro, Michele Ghidini, Dominik Plaßmann, Michael Geppert, Peter Malfertheiner, Olivier Glehen, Tomasz Skoczylas, Marek Majewski, Jan Lubiński, Orazio Palmieri, Stefania Boccia, Anna Latiano, Nuria Aragones, Thomas Schmidt, Mário Dinis-Ribeiro, Rui Medeiros, Salah-Eddin Al-Batran, Mārcis Leja, Juozas Kupcinskas, María A. García-González, Marino Venerito, and Johannes Schumacher
Subjects: Medicine, Medicine (General), R5-920
Published: 2023
Full Text: View/download PDF

14. The FeatureCloud Platform for Federated Learning in Biomedicine: Unified Approach

Author: Julian Matschinske, Julian Späth, Mohammad Bakhtiari, Niklas Probul, Mohammad Mahdi Kazemi Majdabadi, Reza Nasirigerdeh, Reihaneh Torkzadehmahani, Anne Hartebrodt, Balazs-Attila Orban, Sándor-József Fejér, Olga Zolotareva, Supratim Das, Linda Baumbach, Josch K Pauling, Olivera Tomašević, Béla Bihari, Marcus Bloice, Nina C Donner, Walid Fdhila, Tobias Frisch, Anne-Christin Hauschild, Dominik Heider, Andreas Holzinger, Walter Hötzendorfer, Jan Hospes, Tim Kacprowski, Markus Kastelitz, Markus List, Rudolf Mayer, Mónika Moga, Heimo Müller, Anastasia Pustozerova, Richard Röttger, Christina C Saak, Anna Saranti, Harald H H W Schmidt, Christof Tschohl, Nina K Wenke, and Jan Baumbach
Subjects: Computer applications to medicine. Medical informatics, R858-859.7, Public aspects of medicine, RA1-1270
Abstract: Background: Machine learning and artificial intelligence have shown promising results in many areas and are driven by the increasing amount of available data. However, these data are often distributed across different institutions and cannot be easily shared owing to strict privacy regulations. Federated learning (FL) allows the training of distributed machine learning models without sharing sensitive data. However, the implementation is time-consuming and requires advanced programming skills and complex technical infrastructures. Objective: Various tools and frameworks have been developed to simplify the development of FL algorithms and provide the necessary technical infrastructure. Although there are many high-quality frameworks, most focus only on a single application case or method. To our knowledge, there are no generic frameworks, meaning that the existing solutions are restricted to a particular type of algorithm or application field. Furthermore, most of these frameworks provide an application programming interface that needs programming knowledge. There is no collection of ready-to-use FL algorithms that are extendable and allow users (eg, researchers) without programming knowledge to apply FL. A central FL platform for both FL algorithm developers and users does not exist. This study aimed to address this gap and make FL available to everyone by developing FeatureCloud, an all-in-one platform for FL in biomedicine and beyond. Methods: The FeatureCloud platform consists of 3 main components: a global frontend, a global backend, and a local controller. Our platform uses Docker to separate the local acting components of the platform from the sensitive data systems. We evaluated our platform using 4 different algorithms on 5 data sets for both accuracy and runtime. Results: FeatureCloud removes the complexity of distributed systems for developers and end users by providing a comprehensive platform for executing multi-institutional FL analyses and implementing FL algorithms. Through its integrated artificial intelligence store, federated algorithms can easily be published and reused by the community. To secure sensitive raw data, FeatureCloud supports privacy-enhancing technologies to secure the shared local models and assures high standards in data privacy to comply with the strict General Data Protection Regulation. Our evaluation shows that applications developed in FeatureCloud can produce highly similar results compared with centralized approaches and scale well for an increasing number of participating sites. Conclusions: FeatureCloud provides a ready-to-use platform that integrates the development and execution of FL algorithms while reducing the complexity to a minimum and removing the hurdles of federated infrastructure. Thus, we believe that it has the potential to greatly increase the accessibility of privacy-preserving and distributed data analyses in biomedicine and beyond.
Published: 2023
Full Text: View/download PDF

15. Comparative analysis of full-length 16s ribosomal RNA genome sequencing in human fecal samples using primer sets with different degrees of degeneracy

Author: Christian Waechter, Leon Fehse, Marius Welzel, Dominik Heider, Lek Babalija, Juan Cheko, Julian Mueller, Jochen Pöling, Thomas Braun, Sabine Pankuweit, Eberhard Weihe, Ralf Kinscherf, Bernhard Schieffer, Ulrich Luesebrink, Muhidien Soufi, and Volker Ruppert
Subjects: 16S rRNA, gut microbiome, human fecal microbiome, next-generation sequencing (NGS), nanopore sequencing, oxford nanopore technologies (ONT), Genetics, QH426-470
Abstract: Next-generation sequencing has revolutionized the field of microbiology research and greatly expanded our knowledge of complex bacterial communities. Nanopore sequencing provides distinct advantages, combining cost-effectiveness, ease of use, high throughput, and high taxonomic resolution through its ability to process long amplicons, such as the entire 16S rRNA gene. We examine the performance of the conventional 27F primer (27F-I) included in the 16S Barcoding Kit distributed by Oxford Nanopore Technologies (ONT) and that of a more degenerate 27F primer (27F-II) in the context of highly complex bacterial communities in 73 human fecal samples. The results show striking differences in both taxonomic diversity and relative abundance of a substantial number of taxa between the two primer sets. Primer 27F-I reveals a significantly lower biodiversity and, for example, at the taxonomic level of the phyla, a dominance of Firmicutes and Proteobacteria as determined by relative abundances, as well as an unusually high ratio of Firmicutes/Bacteroidetes when compared to the more degenerate primer set (27F-II). Considering the findings in the context of the gut microbiomes common in Western industrial societies, as reported in the American Gut Project, the more degenerate primer set (27F-II) reflects the composition and diversity of the fecal microbiome significantly better than the 27F-I primer. This study provides a fundamentally relevant comparative analysis of the in situ performance of two primer sets designed for sequencing of the entire 16S rRNA gene and suggests that the more degenerate primer set (27F-II) should be preferred for nanopore sequencing-based analyses of the human fecal microbiome.
Published: 2023
Full Text: View/download PDF

16. AI-based multi-PRS models outperform classical single-PRS models

Author: Jan Henric Klau, Carlo Maj, Hannah Klinkhammer, Peter M. Krawitz, Andreas Mayr, Axel M. Hillmer, Johannes Schumacher, and Dominik Heider
Subjects: polygenic risk score, machine learning, deep learning, breast cancer, regression, Genetics, QH426-470
Abstract: Polygenic risk scores (PRS) calculate the risk for a specific disease based on the weighted sum of associated alleles from different genetic loci in the germline estimated by regression models. Recent advances in genetics made it possible to create polygenic predictors of complex human traits, including risks for many important complex diseases, such as cancer, diabetes, or cardiovascular diseases, typically influenced by many genetic variants, each of which has a negligible effect on overall risk. In the current study, we analyzed whether adding PRS from other diseases to the prediction models and replacing the regressions with machine learning models can improve overall predictive performance. Results showed that multi-PRS models significantly outperform single-PRS models on different diseases. Moreover, replacing regression models with machine learning models, i.e., deep learning, can also improve overall accuracy.
Published: 2023
Full Text: View/download PDF
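
The abstract above defines a PRS as a weighted sum of associated alleles and describes feeding several PRS into one prediction model. The sketch below illustrates only that arithmetic. The function names, the dosage encoding (0, 1, or 2 copies of the risk allele), and the disease labels in the usage note are assumptions, not the authors' pipeline.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (assumed coding: 0, 1, or 2 copies)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def multi_prs_features(dosages, weight_sets):
    """Build a feature vector with one PRS per disease-specific weight set.

    weight_sets: {disease_name: per-variant weights}; the resulting vector
    could serve as input to a regression or machine learning model.
    """
    return [polygenic_risk_score(dosages, w) for w in weight_sets.values()]
```

With dosages `[0, 1, 2]` and weights `[0.5, 0.25, 1.0]`, the single score is 2.25. Passing a dictionary of several hypothetical weight sets (for example `disease_a` and `disease_b`) yields one score per set, which is the multi-PRS input the abstract refers to.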

17. Novel protein biomarkers for pneumonia and acute exacerbations in COPD: a pilot study

Author: Anna Lena Jung, Maria Han, Kathrin Griss, Wilhelm Bertrams, Christoph Nell, Timm Greulich, Andreas Klemmer, Hendrik Pott, Dominik Heider, Claus F. Vogelmeier, Stefan Hippenstiel, Norbert Suttorp, and Bernd Schmeck
Subjects: pneumonia, COPD, acute exacerbation, severity, biomarker, inflammation, Medicine (General), R5-920
Abstract: Introduction: Community-acquired pneumonia (CAP) and acute exacerbations of chronic obstructive pulmonary disease (AECOPD) result in high morbidity, mortality, and socio-economic burden. The usage of easily accessible biomarkers informing on disease entity, severity, prognosis, and pathophysiological endotypes is limited in clinical practice. Here, we have analyzed selected plasma markers for their value in differential diagnosis and severity grading in a clinical cohort. Methods: A pilot cohort of hospitalized patients suffering from CAP (n = 27), AECOPD (n = 10), and healthy subjects (n = 22) were characterized clinically. Clinical scores (PSI, CURB, CRB65, GOLD I-IV, and GOLD ABCD) were obtained, and interleukin-6 (IL-6), interleukin-8 (IL-8), interleukin-2-receptor (IL-2R), lipopolysaccharide-binding protein (LBP), resistin, thrombospondin-1 (TSP-1), lactotransferrin (LTF), neutrophil gelatinase-associated lipocalin (NGAL), neutrophil-elastase-2 (ELA2), hepatocyte growth factor (HGF), soluble Fas (sFas), as well as TNF-related apoptosis-inducing ligand (TRAIL) were measured in plasma. Results: In CAP patients and healthy volunteers, we found significantly different levels of ELA2, HGF, IL-2R, IL-6, IL-8, LBP, resistin, LTF, and TRAIL. The panel of LBP, sFas, and TRAIL could discriminate between uncomplicated and severe CAP. AECOPD patients showed significantly different levels of LTF and TRAIL compared to healthy subjects. Ensemble feature selection revealed that CAP and AECOPD can be discriminated by IL-6, resistin, together with IL-2R. These factors even allow the differentiation between COPD patients suffering from an exacerbation or pneumonia. Discussion: Taken together, we identified immune mediators in patient plasma that provide information on differential diagnosis and disease severity and can therefore serve as biomarkers. Further studies are required for validation in bigger cohorts.
Published: 2023
Full Text: View/download PDF

18. Dissecting the genetic heterogeneity of gastric cancer

Author: Timo Hess, Carlo Maj, Jan Gehlen, Oleg Borisov, Stephan L. Haas, Ines Gockel, Michael Vieth, Guillaume Piessen, Hakan Alakus, Yogesh Vashist, Carina Pereira, Michael Knapp, Vitalia Schüller, Alexander Quaas, Heike I. Grabsch, Jessica Trautmann, Ewa Malecka-Wojciesko, Anna Mokrowiecka, Jan Speller, Andreas Mayr, Julia Schröder, Axel M. Hillmer, Dominik Heider, Florian Lordick, Ángeles Pérez-Aísa, Rafael Campo, Jesús Espinel, Fernando Geijo, Concha Thomson, Luis Bujanda, Federico Sopeña, Ángel Lanas, María Pellisé, Claudia Pauligk, Thorsten Oliver Goetze, Carolin Zelck, Julian Reingruber, Emadeldin Hassanin, Peter Elbe, Sandra Alsabeah, Mats Lindblad, Magnus Nilsson, Nicole Kreuser, René Thieme, Francesca Tavano, Roberta Pastorino, Dario Arzani, Roberto Persiani, Jin-On Jung, Henrik Nienhüser, Katja Ott, Ralf R. Schumann, Oliver Kumpf, Susen Burock, Volker Arndt, Anna Jakubowska, Małgorzta Ławniczak, Victor Moreno, Vicente Martín, Manolis Kogevinas, Marina Pollán, Justyna Dąbrowska, Antonio Salas, Olivier Cussenot, Anne Boland-Auge, Delphine Daian, Jean-Francois Deleuze, Erika Salvi, Maris Teder-Laving, Gianluca Tomasello, Margherita Ratti, Chiara Senti, Valli De Re, Agostino Steffan, Arnulf H. Hölscher, Katharina Messerle, Christiane Josephine Bruns, Armands Sīviņš, Inga Bogdanova, Jurgita Skieceviciene, Justina Arstikyte, Markus Moehler, Hauke Lang, Peter P. Grimminger, Martin Kruschewski, Nikolaos Vassos, Claus Schildberg, Philipp Lingohr, Karsten Ridwelski, Hans Lippert, Nadine Fricker, Peter Krawitz, Per Hoffmann, Markus M. Nöthen, Lothar Veits, Jakob R. Izbicki, Adrianna Mostowska, Federico Martinón-Torres, Daniele Cusi, Rolf Adolfsson, Geraldine Cancel-Tassin, Aksana Höblinger, Ernst Rodermann, Monika Ludwig, Gisela Keller, Andres Metspalu, Hermann Brenner, Joerg Heller, Markus Neef, Michael Schepke, Franz Ludwig Dumoulin, Lutz Hamann, Renato Cannizzaro, Michele Ghidini, Dominik Plaßmann, Michael Geppert, Peter Malfertheiner, Olivier Glehen, Tomasz Skoczylas, Marek Majewski, Jan Lubiński, Orazio Palmieri, Stefania Boccia, Anna Latiano, Nuria Aragones, Thomas Schmidt, Mário Dinis-Ribeiro, Rui Medeiros, Salah-Eddin Al-Batran, Mārcis Leja, Juozas Kupcinskas, María A. García-González, Marino Venerito, and Johannes Schumacher
Subjects: Gastric cancer, Oesophageal adenocarcinoma, Genome-wide association study (GWAS), Transcriptome-wide association study (TWAS), Medicine, Medicine (General), R5-920
Abstract: Background: Gastric cancer (GC) is clinically heterogeneous according to location (cardia/non-cardia) and histopathology (diffuse/intestinal). We aimed to characterize the genetic risk architecture of GC according to its subtypes. Another aim was to examine whether cardia GC and oesophageal adenocarcinoma (OAC) and its precursor lesion Barrett’s oesophagus (BO), which are all located at the gastro-oesophageal junction (GOJ), share polygenic risk architecture. Methods: We did a meta-analysis of ten European genome-wide association studies (GWAS) of GC and its subtypes. All patients had a histopathologically confirmed diagnosis of gastric adenocarcinoma. For the identification of risk genes among GWAS loci we did a transcriptome-wide association study (TWAS) and expression quantitative trait locus (eQTL) study from gastric corpus and antrum mucosa. To test whether cardia GC and OAC/BO share genetic aetiology we also used a European GWAS sample with OAC/BO. Findings: Our GWAS consisting of 5816 patients and 10,999 controls highlights the genetic heterogeneity of GC according to its subtypes. We newly identified two and replicated five GC risk loci, all of them with subtype-specific association. The gastric transcriptome data consisting of 361 corpus and 342 antrum mucosa samples revealed that an upregulated expression of MUC1, ANKRD50, PTGER4, and PSCA are plausible GC-pathomechanisms at four GWAS loci. At another risk locus, we found that the blood-group 0 exerts protective effects for non-cardia and diffuse GC, while blood-group A increases risk for both GC subtypes. Furthermore, our GWAS on cardia GC and OAC/BO (10,279 patients, 16,527 controls) showed that both cancer entities share genetic aetiology at the polygenic level and identified two new risk loci on the single-marker level. Interpretation: Our findings show that the pathophysiology of GC is genetically heterogeneous according to location and histopathology. Moreover, our findings point to common molecular mechanisms underlying cardia GC and OAC/BO. Funding: German Research Foundation (DFG).
Published: 2023
Full Text: View/download PDF

19. Inkjet-printed quantum dots on paper as concept towards high-density long-term data storage

Author: Nils Mengel, Marius Welzel, Woldemar Niedenthal, Markus Stein, Dominik Heider, and Sangam Chatterjee
Subjects: quantum dots, data storage, photoluminescence, concept, Science, Physics, QC1-999
Abstract: Handling and storing the immense amounts of data native to the information age is a major challenge in terms of technological sustainability and energy demand. To date, tape storage remains the most widespread method for data archiving, while DNA data storage appears to offer the best data density and long-term stability in the future. However, DNA data storage is still in its infancy primarily due to economic and accessibility challenges. This emphasizes the need for more practical and readily available alternatives. We present a method for data storage utilizing inkjet printable quantum dots on paper with photoluminescence (PL) readout. Our proof of principle study showcases the ability to print and stack multiple bits of data on a single spot by exploiting the unique PL properties of quantum dots. This approach utilizes easily accessible resources, including a consumer-grade printer and paper as the substrate. Additionally, we perform initial stability tests, investigate scalability by controlling emission intensity, and evaluate the potential data density achievable by our approach.
Published: 2024
Full Text: View/download PDF

20. Phenotypic differences between female and male individuals with suspicion of autism spectrum disorder

Author: Sanna Stroth, Johannes Tauscher, Nicole Wolff, Charlotte Küpper, Luise Poustka, Stefan Roepke, Veit Roessner, Dominik Heider, and Inge Kamp-Becker
Subjects: Female autism, Sex, ASD, ADOS, ADI-R, Diagnostics, Neurology. Diseases of the nervous system, RC346-429
Abstract: Background: Although autism spectrum disorder (ASD) is a common developmental disorder, our knowledge about a behavioral and neurobiological female phenotype is still scarce. As the conceptualization and understanding of ASD are mainly based on the investigation of male individuals, females with ASD may not be adequately identified by routine clinical diagnostics. The present machine learning approach aimed to identify diagnostic information from the Autism Diagnostic Observation Schedule (ADOS) that discriminates best between ASD and non-ASD in females and males. Methods: Random forests (RF) were used to discover patterns of symptoms in diagnostic data from the ADOS (modules 3 and 4) in 1057 participants with ASD (18.1% female) and 1230 participants with non-ASD (17.9% female). Predictive performances of reduced feature models were explored and compared between females and males without intellectual disabilities. Results: Reduced feature models relied on considerably fewer features from the ADOS in females compared to males, while still yielding similar classification performance (e.g., sensitivity, specificity). Limitations: As in previous studies, the current sample of females with ASD is smaller than the male sample and thus, females may still be underrepresented, limiting the statistical power to detect small to moderate effects. Conclusion: Our results do not suggest the need for new or altered diagnostic algorithms for females with ASD. Although we identified some phenotypic differences between females and males, the existing diagnostic tools seem to sufficiently capture the core autistic features in both groups.
Published: 2022
Full Text: View/download PDF

21. sPLINK: a hybrid federated tool as a robust alternative to meta-analysis in genome-wide association studies

Author: Reza Nasirigerdeh, Reihaneh Torkzadehmahani, Julian Matschinske, Tobias Frisch, Markus List, Julian Späth, Stefan Weiss, Uwe Völker, Esa Pitkänen, Dominik Heider, Nina Kerstin Wenke, Georgios Kaissis, Daniel Rueckert, Tim Kacprowski, and Jan Baumbach
Subjects: sPLINK, PLINK, Federated learning, Genome-wide association studies, GWAS, Meta-analysis, Biology (General), QH301-705.5, Genetics, QH426-470
Abstract: Meta-analysis has been established as an effective approach to combining summary statistics of several genome-wide association studies (GWAS). However, the accuracy of meta-analysis can be attenuated in the presence of cross-study heterogeneity. We present sPLINK, a hybrid federated and user-friendly tool, which performs privacy-aware GWAS on distributed datasets while preserving the accuracy of the results. sPLINK is robust against heterogeneous distributions of data across cohorts while meta-analysis considerably loses accuracy in such scenarios. sPLINK achieves practical runtime and acceptable network usage for chi-square and linear/logistic regression tests. sPLINK is available at https://exbio.wzw.tum.de/splink.
Published: 2022
Full Text: View/download PDF

22. MOVIS: A multi-omics software solution for multi-modal time-series clustering, embedding, and visualizing tasks

Author: Aleksandar Anžel, Dominik Heider, and Georges Hattab
Subjects: Time-series, Multi-omics, Visualization, Data exploration, Temporal multi-omics, Longitudinal multi-omics, Biotechnology, TP248.13-248.65
Abstract: Thanks to recent advances in sequencing and computational technologies, many researchers with biological and/or medical backgrounds are now producing multiple data sets with an embedded temporal dimension. Multi-modalities enable researchers to explore and investigate different biological and physico-chemical processes with various technologies. Motivated to explore multi-omics data and time-series multi-omics specifically, the exploration process has been hindered by the separation introduced by each omics-type. To effectively explore such temporal data sets, discover anomalies, find patterns, and better understand their intricacies, expertise in computer science and bioinformatics is required. Here we present MOVIS, a modular time-series multi-omics exploration tool with a user-friendly web interface that facilitates the data exploration of such data. It brings into equal participation each time-series omic-type for analysis and visualization. As of the time of writing, two time-series multi-omics data sets have been integrated and successfully reproduced. The resulting visualizations are task-specific, reproducible, and publication-ready. MOVIS is built on open-source software and is easily extendable to accommodate different analytical tasks. An online version of MOVIS is available under https://movis.mathematik.uni-marburg.de/ and on Docker Hub (https://hub.docker.com/r/aanzel/movis).
Published: 2022
Full Text: View/download PDF

23. Multi-label classification for multi-drug resistance prediction of Escherichia coli

Author: Yunxiao Ren, Trinad Chakraborty, Swapnil Doijad, Linda Falgenhauer, Jane Falgenhauer, Alexander Goesmann, Oliver Schwengers, and Dominik Heider
Subjects: Multi-drug resistance, Machine learning, Multi-label classification, Biotechnology, TP248.13-248.65
Abstract: Antimicrobial resistance (AMR) is a global health and development threat. In particular, multi-drug resistance (MDR) is increasingly common in pathogenic bacteria. It has become a serious problem to public health, as MDR can lead to the failure of treatment of patients. MDR is typically the result of mutations and the accumulation of multiple resistance genes within a single cell. Machine learning methods have a wide range of applications for AMR prediction. However, these approaches typically focus on single drug resistance prediction and do not incorporate information on accumulating antimicrobial resistance traits over time. Thus, identifying multi-drug resistance simultaneously and rapidly remains an open challenge. In our study, we could demonstrate that multi-label classification (MLC) methods can be used to model multi-drug resistance in pathogens. Importantly, we found the ensemble of classifier chains (ECC) model achieves accurate MDR prediction and outperforms other MLC methods. Thus, our study extends the available tools for MDR prediction and paves the way for improving diagnostics of infections in patients. Furthermore, the MLC methods we introduced here would contribute to reducing the threat of antimicrobial resistance and related deaths in the future by improving the speed and accuracy of the identification of pathogens and resistance.
Published: 2022
Full Text: View/download PDF

24. Guideline for software life cycle in health informatics

Author: Anne-Christin Hauschild, Roman Martin, Sabrina Celine Holst, Joachim Wienbeck, and Dominik Heider
Subjects: Health informatics, Bioinformatics, Software engineering, Science
Abstract: Summary: The long-lasting trend of medical informatics is to adapt novel technologies in the medical context. In particular, incorporating artificial intelligence to support clinical decision-making can significantly improve monitoring, diagnostics, and prognostics for the patient’s and medic’s sake. However, obstacles hinder a timely technology transfer from research to the clinic. Due to the pressure for novelty in the research context, projects rarely implement quality standards. Here, we propose a guideline for academic software life cycle processes tailored to the needs and capabilities of research organizations. While the complete implementation of a software life cycle according to commercial standards is not feasible in scientific work, we propose a subset of elements that we are convinced will provide a significant benefit while keeping the effort within a feasible range. Ultimately, the emerging quality checks for academic software development can pave the way for an accelerated deployment of academic advances in clinical practice.
Published: 2022
Full Text: View/download PDF

25. Gaussian noise up-sampling is better suited than SMOTE and ADASYN for clinical decision making

Author: Jacqueline Beinecke and Dominik Heider
Subjects: Machine learning, Clinical data, Data augmentation, Synthetic data, Computer applications to medicine. Medical informatics, R858-859.7, Analysis, QA299.6-433
Abstract: Clinical data sets have very special properties and suffer from many caveats in machine learning. They typically show a high class imbalance, have a small number of samples and a large number of parameters, and have missing values. While feature selection approaches and imputation techniques address the former problems, the class imbalance is typically addressed using augmentation techniques. However, these techniques have been developed for big data analytics, and their suitability for clinical data sets is unclear. This study analyzed different augmentation techniques for use in clinical data sets and subsequent employment of machine learning-based classification. It turns out that Gaussian Noise Up-Sampling (GNUS) is not always, but generally, as good as SMOTE and ADASYN and even outperforms them on some datasets. However, it has also been shown that augmentation does not improve classification at all in some cases.
Published: 2021
Full Text: View/download PDF
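
The up-sampling idea named in entry 25 can be illustrated with a short Python sketch. It is only a minimal reading of "Gaussian noise up-sampling": copy randomly chosen minority-class rows and add zero-mean Gaussian noise. The function name `gaussian_noise_upsample`, the `noise_scale` parameter, and the choice to scale noise by each feature's standard deviation are assumptions made for illustration and are not taken from the paper.

```python
import random


def gaussian_noise_upsample(minority, n_new, noise_scale=0.05, seed=0):
    """Return n_new synthetic rows: random minority rows plus Gaussian noise.

    Assumption (not from the paper): noise for feature j has standard
    deviation noise_scale * std_j, where std_j is the feature's spread
    within the minority class.
    """
    rng = random.Random(seed)
    n_features = len(minority[0])

    # Per-feature (population) standard deviation within the minority class.
    stds = []
    for j in range(n_features):
        col = [row[j] for row in minority]
        mean = sum(col) / len(col)
        stds.append((sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5)

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)  # pick an existing minority sample
        synthetic.append(
            [v + rng.gauss(0.0, noise_scale * s) for v, s in zip(base, stds)]
        )
    return synthetic


# Hypothetical toy data: three minority-class rows with two features each.
minority_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
new_rows = gaussian_noise_upsample(minority_rows, n_new=5)
```

Unlike SMOTE and ADASYN, this sketch never interpolates between neighbouring samples, so it needs no nearest-neighbour search and works even with very few minority rows.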

26. Machine learning with asymmetric abstention for biomedical decision-making

Author: Mariem Gandouz, Hajo Holzmann, and Dominik Heider
Subjects: Medical data science, Machine learning, Classification, Diagnostics, Computer applications to medicine. Medical informatics, R858-859.7
Abstract: Machine learning and artificial intelligence have entered biomedical decision-making for diagnostics, prognostics, or therapy recommendations. However, these methods need to be interpreted with care because of the severe consequences for patients. In contrast to human decision-making, computational models typically make a decision even with low confidence. Machine learning with abstention better reflects human decision-making by introducing a reject option for samples with low confidence. The abstention intervals are typically symmetric intervals around the decision boundary. In the current study, we use asymmetric abstention intervals, which we demonstrate to be better suited for biomedical data that is typically highly imbalanced. We evaluate symmetric and asymmetric abstention on three real-world biomedical datasets and show that both approaches can significantly improve classification performance. However, asymmetric abstention rejects as many or fewer samples compared to symmetric abstention and thus, should be used in imbalanced data.
Published: 2021
Full Text: View/download PDF
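
The reject option described in entry 26 can be sketched in a few lines of Python. This is a hedged illustration only: the function name `classify_with_abstention`, the decision boundary of 0.5, and the example thresholds are assumptions, and the paper's procedure for choosing the interval is not reproduced here.

```python
def classify_with_abstention(prob_positive, lower=0.3, upper=0.6):
    """Return 1, 0, or None (abstain) for a predicted positive-class probability.

    Probabilities strictly between `lower` and `upper` are rejected.
    With a decision boundary at 0.5, lower=0.4/upper=0.6 is a symmetric
    interval, while lower=0.3/upper=0.6 is asymmetric.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("need 0 <= lower <= upper <= 1")
    if prob_positive >= upper:
        return 1
    if prob_positive <= lower:
        return 0
    return None  # low confidence: leave the decision to a human


# Hypothetical predicted probabilities from some upstream classifier.
scores = [0.10, 0.35, 0.55, 0.90]
symmetric = [classify_with_abstention(p, lower=0.4, upper=0.6) for p in scores]
asymmetric = [classify_with_abstention(p, lower=0.3, upper=0.6) for p in scores]
```

In this toy setting the two intervals differ only for the score 0.35, which the symmetric rule labels as negative and the asymmetric rule rejects.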

27. A theoretical basis for bioindication in complex ecosystems

Author: Theodor Sperlea, Dominik Heider, and Georges Hattab
Subjects: Ecology, Biomonitoring, Ecological theory, Information theory, Bioindication, QH540-549.5
Abstract: With increasing levels of anthropogenic stress, monitoring the state of ecosystems or their pollution levels is of growing importance. Central to this is an information transfer that results in biotic factors reflecting the ecosystem’s condition or any ecosystem state variable. However, most information theories cannot describe this process as they do not apply to complex systems such as ecosystems. In this paper, we draw upon theoretical ecology, information theory, semiotics, and the study of complex systems in sociology to develop a theoretical basis for bioindication that takes the complex nature of ecosystems into account. From this follows that the relationship between the bioindicator and the variable(s) of interest in bioindication must be regarded as one of structural coupling, and biomonitoring schemes with the goal of bioindication are best described as cases of second-order observation. These theoretical results are highly relevant for the development of machine learning-based methods for bioindication.
Published: 2022
Full Text: View/download PDF

28. Design considerations for advancing data storage with synthetic DNA for long-term archiving

Author: Chisom Ezekannagha, Anke Becker, Dominik Heider, and Georges Hattab
Subjects: Data, Synthetic DNA, Encoding, Data storage, Design considerations, Medicine (General), R5-920, Biology (General), QH301-705.5
Abstract: Deoxyribonucleic acid (DNA) is increasingly emerging as a serious medium for long-term archival data storage because of its remarkable high-capacity, high-storage-density characteristics and its lasting ability to store data for thousands of years. Various encoding algorithms are generally required to store digital information in DNA and to maintain data integrity. Indeed, since DNA is the information carrier, its performance under different processing and storage conditions significantly impacts the capabilities of the data storage system. Therefore, the design of a DNA storage system must meet specific design considerations to be less error-prone, robust and reliable. In this work, we summarize the general processes and technologies employed when using synthetic DNA as a storage medium. We also share the design considerations for sustainable engineering to include viability. We expect this work to provide insight into how sustainable design can be used to develop an efficient and robust synthetic DNA-based storage system for long-term archiving.
Published: 2022
Full Text: View/download PDF

29. Mushroom data creation, curation, and simulation to support classification tasks

Author: Dennis Wagner, Dominik Heider, and Georges Hattab
Subjects: Medicine, Science
Abstract: Predicting if a set of mushrooms is edible or not corresponds to the task of classifying them into two groups—edible or poisonous—on the basis of a classification rule. To support this binary task, we have collected the largest and most comprehensive attribute based data available. In this work, we detail the creation, curation and simulation of a data set for binary classification. Thanks to natural language processing, the primary data are based on a text book for mushroom identification and contain 173 species from 23 families. The secondary data comprise simulated or hypothetical entries that are structurally comparable to the 1987 data and serve as pilot data for classification tasks. We evaluated different machine learning algorithms, namely, naive Bayes, logistic regression, linear discriminant analysis (LDA), and random forests (RF). We found that the RF provided the best results with a five-fold Cross-Validation accuracy and F2-score of 1.0 (μ = 1, σ = 0), respectively. The results of our pilot are conclusive and indicate that our data were not linearly separable, unlike the 1987 data, which showed good results using a linear decision boundary with the LDA. Our data set contains 23 families and is the largest available. We further provide a fully reproducible workflow and provide the data under the FAIR principles.
Published: 2021
Full Text: View/download PDF

30. The visual story of data storage: From storage properties to user interfaces

Author: Aleksandar Anžel, Dominik Heider, and Georges Hattab
Subjects: Storage, Device, Medium, Usage, Capacity, Lifespan, Biotechnology, TP248.13-248.65
Abstract: About fifty times more data has been created than there are stars in the observable universe. Current trends in data creation and consumption mean that the devices and storage media we use will require more physical space. Novel data storage media such as DNA are considered a viable alternative. Yet, the introduction of new storage technologies should be accompanied by an evaluation of user requirements. To assess such needs, we designed and conducted a survey to rank different storage properties adapted for visualization. That is, accessibility, capacity, usage, mutability, lifespan, addressability, and typology. Withal, we reported different storage devices over time while ranking them by their properties. Our results indicated a timeline of three distinct periods: magnetic, optical and electronic, and alternative media. Moreover, by investigating user interfaces across different operating systems, we observed a predominant presence of bar charts and tree maps for the usage of a medium and its file directory hierarchy, respectively. Taken together with the results of our survey, this allowed us to create a customized user interface that includes data visualizations that can be toggled for both user groups: Experts and Public.
Published: 2021
Full Text: View/download PDF

31. MOSGA 2: Comparative genomics and validation tools

Author: Roman Martin, Hagen Dreßler, Georges Hattab, Thomas Hackl, Matthias G. Fischer, and Dominik Heider
Subjects: Genome annotation, Comparative genomics, Phylogenetics, Quality control, Framework, Pipeline, Biotechnology, TP248.13-248.65
Abstract: Due to the rapidly growing amount of available genomic information, the need for accessible and easy-to-use analysis tools is increasing. To facilitate eukaryotic genome annotations, we created MOSGA. In this work, we show how MOSGA 2 is developed by including several advanced analyses for genomic data. Since the genomic data quality greatly impacts the annotation quality, we included multiple tools to validate and ensure high-quality user-submitted genome assemblies. Moreover, thanks to the integration of comparative genomics methods, users can benefit from a broader genomic view by analyzing multiple genomic data sets simultaneously. Further, we demonstrate the new functionalities of MOSGA 2 by different use-cases and practical examples. MOSGA 2 extends the already established application to the quality control of the genomic data and integrates and analyzes multiple genomes in a larger context, e.g., by phylogenetics.
Published: 2021
Full Text: View/download PDF

32. Chaos game representation and its applications in bioinformatics

Author: Hannah Franziska Löchel and Dominik Heider
Subjects: Chaos game representation, Bioinformatics, Sequence analysis, Alignment-free sequence comparison, DNA and protein encoding, Machine learning, Biotechnology, TP248.13-248.65
Abstract: Chaos game representation (CGR), a milestone in graphical bioinformatics, has become a powerful tool regarding alignment-free sequence comparison and feature encoding for machine learning. The algorithm maps a sequence to 2-dimensional space, while an extension of the CGR, the so-called frequency matrix representation (FCGR), transforms sequences of different lengths into equal-sized images or matrices. The CGR is a generalized Markov chain and includes various properties, which allow a unique representation of a sequence. Therefore, it has a broad spectrum of applications in bioinformatics, such as sequence comparison and phylogenetic analysis and as an encoding of sequences for machine learning. This review introduces the construction of CGRs and FCGRs, their applications on DNA and proteins, and gives an overview of recent applications and progress in bioinformatics.
Published: 2021
Full Text: View/download PDF
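
The mapping described in entry 32 can be sketched for DNA as follows. This is a minimal Python illustration, not the review's reference implementation: the corner assignment in `CORNERS` is one common convention and is an assumption here, as are the function names `cgr_points` and `fcgr`.

```python
# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Map a DNA sequence to CGR coordinates in the unit square.

    Starting at the centre, each base moves the current point halfway
    towards that base's corner.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count CGR points on a 2^k x 2^k grid (frequency matrix, FCGR).

    After k steps, the grid cell of a point is fixed by the last k bases,
    so each cell counts one k-mer; the first k-1 points are skipped.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i, (x, y) in enumerate(cgr_points(seq)):
        if i + 1 >= k:
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            grid[row][col] += 1
    return grid


# Sequences of different lengths yield matrices of the same size.
short_matrix = fcgr("ACGT", k=2)
long_matrix = fcgr("ACGTACGTTGCA", k=2)
```

Because the grid size depends only on `k`, the resulting matrices can be fed to a machine learning model as fixed-size inputs regardless of sequence length.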

33. Context-Aware Phylogenetic Trees for Phylogeny-Based Taxonomy Visualization

Author: Gizem Kaya, Chisom Ezekannagha, Dominik Heider, and Georges Hattab
Subjects: phylogeny, taxonomy, genomics, phylogeny-based taxonomy, visualization, phylogenetic tree, Genetics, QH426-470
Abstract: Sustained efforts in next-generation sequencing technologies are changing the field of taxonomy. The increase in the number of resolved genomes has made the traditional taxonomy of species antiquated. With phylogeny-based methods, taxonomies are being updated and refined. Although such methods bridge the gap between phylogeny and taxonomy, phylogeny-based taxonomy currently lacks interactive visualization approaches. Motivated by enriching and increasing the consistency of evolutionary and taxonomic studies alike, we propose Context-Aware Phylogenetic Trees (CAPT) as an interactive web tool to support users in exploration- and validation-based tasks. To complement phylogenetic information with phylogeny-based taxonomy, we offer linking two interactive visualizations which compose two simultaneous views: the phylogenetic tree view and the taxonomic icicle view. Thanks to its space-filling properties, the icicle visualization follows the intuition behind taxonomies where different hierarchical rankings with equal number of child elements can be represented with same-sized rectangular areas. In other words, it provides partitions of different sizes depending on the number of elements they contain. The icicle view integrates seven taxonomic rankings: domain, phylum, class, order, family, genus, and species. CAPT enriches the clades in the phylogenetic tree view with context from the genomic data and supports interactive techniques such as linking and brushing to highlight correspondence between the two views. Four different use cases, extracted from the Genome Taxonomy DataBase, were employed to create four scenarios using our approach. CAPT was successfully used to explore the phylogenetic trees as well as the taxonomic data by providing context and using the interaction techniques. This tool is essential to increase the accuracy of categorization of newly identified species and validate updated taxonomies. 
The source code and data are freely available at https://github.com/ghattab/CAPT.
Published: 2022
Full Text: View/download PDF

34. Natrix: a Snakemake-based workflow for processing, clustering, and taxonomically assigning amplicon sequencing reads

Author: Marius Welzel, Anja Lange, Dominik Heider, Michael Schwarz, Bernd Freisleben, Manfred Jensen, Jens Boenigk, and Daniela Beisser
Subjects: Amplicon sequencing, Operational Taxonomic Units, Amplicon Sequence Variants, Snakemake, Pipeline, Illumina, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Background: Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. Results: We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies support hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix) or as a Docker container on DockerHub (https://hub.docker.com/r/mw55/natrix). Conclusion: Natrix is a user-friendly and highly extensible workflow for processing Illumina amplicon data.
Published: 2020
Full Text: View/download PDF

35. Liver parameters as part of a non-invasive model for prediction of all-cause mortality after myocardial infarction

Author: Theodor Baars, Jan-Peter Sowa, Ursula Neumann, Stefanie Hendricks, Mona Jinawy, Julia Kälsch, Guido Gerken, Tienush Rassaf, Dominik Heider, and Ali Canbay
Subjects: liver enzymes, percutaneous coronary intervention, non-invasive prediction, troponin, Medicine
Abstract: Introduction: Liver parameters are associated with cardiovascular disease risk and severity of stenosis. It is unclear whether liver parameters could predict the long-term outcome of patients after acute myocardial infarction (AMI). We performed an unbiased analysis of the predictive value of serum parameters for long-term prognosis after AMI. Material and methods: In a retrospective, observational, single-center, cohort study, 569 patients after AMI were enrolled and followed up for up to 6 years for major adverse cardiovascular events, including cardiac death. Patients were classified into non-survivors (n = 156) and survivors (n = 413). Demographic and laboratory data were analyzed using ensemble feature selection (EFS) and logistic regression. Correlations were performed for serum parameters. Results: Age (73; 64; p < 0.01), alanine aminotransferase (ALT; 93 U/l; 40 U/l; p < 0.01), aspartate aminotransferase (AST; 162 U/l; 66 U/l; p < 0.01), C-reactive protein (CRP; 4.7 U/l; 1.6 U/l; p < 0.01), creatinine (1.6; 1.3; p < 0.01), γ-glutamyltransferase (GGT; 71 U/l; 46 U/l; p < 0.01), urea (29.5; 20.5; p < 0.01), estimated glomerular filtration rate (eGFR; 49.6; 61.4; p < 0.01), troponin (13.3; 7.6; p < 0.01), myoglobin (639; 302; p < 0.01), and cardiovascular risk factors (hypercholesterolemia p < 0.02, family history p < 0.01, and smoking p < 0.01) differed significantly between non-survivors and survivors. Age, AST, CRP, eGFR, myoglobin, sodium, urea, creatinine, and troponin correlated significantly with death (r = –0.29; 0.14; 0.31; –0.27; 0.20; –0.13; 0.33; 0.24; 0.12). A prediction model was built including age, CRP, eGFR, myoglobin, and urea, achieving an AUROC of 77.6% to predict long-term survival after AMI. Conclusions: Non-invasive parameters, including liver and renal markers, can predict long-term outcome of patients after AMI.
Published: 2019
Full Text: View/download PDF

36. A multi-omics study on quantifying antimicrobial resistance in European freshwater lakes

Author: Sebastian Spänig, Lisa Eick, Julia K. Nuy, Daniela Beisser, Margaret Ip, Dominik Heider, and Jens Boenigk
Subjects: Pathogens, Antimicrobial Resistance, Multi-Omics, European Freshwater Lakes, Environmental sciences, GE1-350
Abstract: The surveillance of wastewater for the Covid-19 virus during this unprecedented pandemic and mapped to the distribution and magnitude of the infected in the population near real-time exemplifies the importance of tracking rapidly changing trends of pathogens or public health problems at a large scale. The rising trends of antimicrobial resistance (AMR) with multidrug-resistant pathogens from the environmental water have similarly gained much attention in recent years. Wastewater-based epidemiology from water samples has shown that a wide range of AMR-related genes is frequently detected. Albeit sewage is treated before release and thus, the abundance of pathogens should be significantly reduced or even pathogen-free, several studies indicated the contrary. Pathogens are still measurable in the released water, ultimately entering freshwaters, such as rivers and lakes. Furthermore, socio-economic and environmental factors, such as chemical industries and animal farming nearby, impact the presence of AMR. Many bacterial species from the environment are intrinsically resistant and also contribute to the resistome of freshwater lakes. This study collected the most extensive standardized freshwater data set from hundreds of European lakes and conducted a comprehensive multi-omics analysis on antimicrobial resistance from these freshwater lakes. Our research shows that genes encoding for AMR against tetracyclines, cephalosporins, and quinolones were commonly identified, while for some, such as sulfonamides, resistance was less frequently present. We provide an estimation of the characteristic resistance of AMR in European lakes, which can be used as a comprehensive resistome dataset to facilitate and monitor temporal changes in the development of AMR in European freshwater lakes.
Published: 2021
Full Text: View/download PDF

37. Deep Transfer Learning Enables Robust Prediction of Antimicrobial Resistance for Novel Antibiotics

Author: Yunxiao Ren, Trinad Chakraborty, Swapnil Doijad, Linda Falgenhauer, Jane Falgenhauer, Alexander Goesmann, Oliver Schwengers, and Dominik Heider
Subjects: transfer learning, antimicrobial resistance, small data with imbalanced label, Therapeutics. Pharmacology, RM1-950
Abstract: Antimicrobial resistance (AMR) has become one of the serious global health problems, threatening the effective treatment of a growing number of infections. Machine learning and deep learning show great potential in rapid and accurate AMR predictions. However, a large number of samples for the training of these models is essential. In particular, for novel antibiotics, limited training samples and data imbalance hinder the models’ generalization performance and overall accuracy. We propose a deep transfer learning model that can improve model performance for AMR prediction on small, imbalanced datasets. As our approach relies on transfer learning and secondary mutations, it is also applicable to novel antibiotics and emerging resistances in the future and enables quick diagnostics and personalized treatments.
Published: 2022
Full Text: View/download PDF

38. Is the Combination of ADOS and ADI-R Necessary to Classify ASD? Rethinking the 'Gold Standard' in Diagnosing ASD

Author: Inge Kamp-Becker, Johannes Tauscher, Nicole Wolff, Charlotte Küpper, Luise Poustka, Stefan Roepke, Veit Roessner, Dominik Heider, and Sanna Stroth
Subjects: machine learning, random forest, autism spectrum disorder, clinical characteristics, differential diagnosis behavioral aspects, ADOS, Psychiatry, RC435-571
Abstract: Diagnosing autism spectrum disorder (ASD) requires extensive clinical expertise and training as well as a focus on differential diagnoses. The diagnostic process is particularly complex given symptom overlap with other mental disorders and high rates of co-occurring physical and mental health concerns. The aim of this study was to conduct a data-driven selection of the most relevant diagnostic information collected from a behavior observation and an anamnestic interview in two clinical samples of children/younger adolescents and adolescents/adults with suspected ASD. Via random forests, the present study discovered patterns of symptoms in the diagnostic data of 2310 participants (46% ASD, 54% non-ASD, age range 4–72 years) using data from the combined Autism Diagnostic Observation Schedule (ADOS) and Autism Diagnostic Interview—Revised (ADI-R) and ADOS data alone. Classifiers built on reduced subsets of diagnostic features yield satisfactory sensitivity and specificity values. For adolescents/adults specificity values were lower compared to those for children/younger adolescents. The models including ADOS and ADI-R data were mainly built on ADOS items and in the adolescent/adult sample the classifier including only ADOS items performed even better than the classifier including information from both instruments. Results suggest that reduced subsets of ADOS and ADI-R items may suffice to effectively differentiate ASD from other mental disorders. The imbalance of ADOS and ADI-R items included in the models leads to the assumption that, particularly in adolescents and adults, the ADI-R may play a lesser role than current behavior observations.
Published: 2021
Full Text: View/download PDF

39. Fostering reproducibility, reusability, and technology transfer in health informatics

Author: Anne-Christin Hauschild, Lisa Eick, Joachim Wienbeck, and Dominik Heider
Subjects: health informatics, Bioinformatics, software engineering, software robustness, Science
Abstract: Summary: Computational methods can transform healthcare. In particular, health informatics with artificial intelligence has shown tremendous potential when applied in various fields of medical research and has opened a new era for precision medicine. The development of reusable biomedical software for research or clinical practice is time-consuming and requires rigorous compliance with quality requirements as defined by international standards. However, research projects rarely implement such measures, hindering smooth technology transfer into the research community or manufacturers as well as reproducibility and reusability. Here, we present a guideline for quality management systems (QMS) for academic organizations incorporating the essential components while confining the requirements to an easily manageable effort. It provides a starting point to implement a QMS tailored to specific needs effortlessly and greatly facilitates technology transfer in a controlled manner, thereby supporting reproducibility and reusability. Ultimately, the emerging standardized workflows can pave the way for an accelerated deployment in clinical practice.
Published: 2021
Full Text: View/download PDF

40. Identification of the most indicative and discriminative features from diagnostic instruments for children with autism

Author: Sanna Stroth, Johannes Tauscher, Nicole Wolff, Charlotte Küpper, Luise Poustka, Stefan Roepke, Veit Roessner, Dominik Heider, and Inge Kamp‐Becker
Subjects: ADI‐R, ADOS, autism spectrum disorder, diagnostic‐gold‐standard, differential‐diagnosis, machine learning, Pediatrics, RJ1-570, Psychiatry, RC435-571
Abstract: Background: Diagnosing autism spectrum disorder (ASD) is complex and time‐consuming. The present work systematically examines the importance of items from the Autism Diagnostic Interview‐Revised (ADI‐R) and Autism Diagnostic Observation Schedule (ADOS) in discerning children with and without ASD. Knowledge of the most discriminative features and their underlying concepts may prove valuable for the future training tools that assist clinicians to substantiate or extenuate a suspicion of ASD in nonverbal and minimally verbal children. Methods: In two samples of nonverbal (N = 466) and minimally verbal (N = 566) children with ASD (N = 509) and other mental disorders or developmental delays (N = 523), we applied random forests (RFs) to (i) the combination of ADI‐R and ADOS data versus (ii) ADOS data alone. We compared the predictive performance of reduced feature models against outcomes provided by models containing all features. Results: For nonverbal children, the RF classifier indicated social orientation to be most powerful in differentiating ASD from non‐ASD cases. In minimally verbal children, we find language/speech peculiarities in combination with facial/nonverbal expressions and reciprocity to be most distinctive. Conclusion: Based on machine learning strategies, we carve out those symptoms of ASD that prove to be central for the differentiation of ASD cases from those with other developmental or mental disorders (high specificity in minimally verbal children). These core concepts ought to be considered in the future training tools for clinicians.
Published: 2021
Full Text: View/download PDF

41. Encodings and models for antimicrobial peptide classification for multi-resistant pathogens

Author: Sebastian Spänig and Dominik Heider
Subjects: Machine learning, Antimicrobial peptides, Encodings, Computer applications to medicine. Medical informatics, R858-859.7, Analysis, QA299.6-433
Abstract: Antimicrobial peptides (AMPs) are part of the inherent immune system. In fact, they occur in almost all organisms including, e.g., plants, animals, and humans. Remarkably, they show effectivity also against multi-resistant pathogens with a high selectivity. This is especially crucial in times, where society is faced with the major threat of an ever-increasing amount of antibiotic resistant microbes. In addition, AMPs can also exhibit antitumor and antiviral effects, thus a variety of scientific studies dealt with the prediction of active peptides in recent years. Due to their potential, even the pharmaceutical industry is keen on discovering and developing novel AMPs. However, AMPs are difficult to verify in vitro, hence researchers conduct sequence similarity experiments against known, active peptides. Unfortunately, this approach is very time-consuming and limits potential candidates to sequences with a high similarity to known AMPs. Machine learning methods offer the opportunity to explore the huge space of sequence variations in a timely manner. These algorithms have, in principle, paved the way for an automated discovery of AMPs. However, machine learning models require a numerical input, thus an informative encoding is very important. Unfortunately, developing an appropriate encoding is a major challenge, which has not been entirely solved so far. For this reason, the development of novel amino acid encodings is established as a stand-alone research branch. The present review introduces state-of-the-art encodings of amino acids as well as their properties in sequence and structure based aggregation. Moreover, albeit a well-chosen encoding is essential, performant classifiers are required, which is reflected by a tendency towards specifically designed models in the literature. Furthermore, we introduce these models with a particular focus on encodings derived from support vector machines and deep learning approaches. Albeit a strong focus has been set on AMP predictions, not all of the mentioned encodings have been elaborated as part of antimicrobial research studies, but rather as general protein or peptide representations.
Published: 2019
Full Text: View/download PDF

42. Correction: Ten simple rules to colorize biological data visualization.

Author: Georges Hattab, Theresa-Marie Rhyne, and Dominik Heider
Subjects: Biology (General), QH301-705.5
Abstract: [This corrects the article DOI: 10.1371/journal.pcbi.1008259.].
Published: 2021
Full Text: View/download PDF

43. Ten simple rules to colorize biological data visualization.

Author: Georges Hattab, Theresa-Marie Rhyne, and Dominik Heider
Subjects: Biology (General), QH301-705.5
Published: 2020
Full Text: View/download PDF

44. CORDITE: The Curated CORona Drug InTERactions Database for SARS-CoV-2

Author: Roman Martin, Hannah F. Löchel, Marius Welzel, Georges Hattab, Anne-Christin Hauschild, and Dominik Heider
Subjects: Medicine, Virology, Drugs, Bioinformatics, Biological Database, Science
Abstract: Summary: Since the outbreak in 2019, researchers are trying to find effective drugs against the SARS-CoV-2 virus based on de novo drug design and drug repurposing. The former approach is very time consuming and needs extensive testing in humans, whereas drug repurposing is more promising, as the drugs have already been tested for side effects, etc. At present, there is no treatment for COVID-19 that is clinically effective, but there is a huge amount of data from studies that analyze potential drugs. We developed CORDITE to efficiently combine state-of-the-art knowledge on potential drugs and make it accessible to scientists and clinicians. The web interface also provides access to an easy-to-use API that allows a wide use for other software and applications, e.g., for meta-analysis, design of new clinical studies, or simple literature search. CORDITE is currently empowering many scientists across all continents and accelerates research in the knowledge domains of virology and drug design.
Published: 2020
Full Text: View/download PDF

45. Editorial: Artificial Intelligence Bioinformatics: Development and Application of Tools for Omics and Inter-Omics Studies

Author: Davide Chicco, Dominik Heider, and Angelo Facchiano
Subjects: artificial intelligence, bioinformatics, genomics, omics, inter-omics, machine learning, Genetics, QH426-470
Published: 2020
Full Text: View/download PDF

46. A Potential Role for Bile Acid Signaling in Celiac Disease-Associated Fatty Liver

Author: Paul Manka, Svenja Sydor, Julia M. Schänzer-Ocklenburg, Malte Brandenburg, Jan Best, Ramiro Vilchez-Vargas, Alexander Link, Dominik Heider, Susanne Brodesser, Anja Figge, Andreas Jähnert, Jason D. Coombes, Francisco Javier Cubero, Alisan Kahraman, Moon-Sung Kim, Julia Kälsch, Sonja Kinner, Klaas Nico Faber, Han Moshage, Guido Gerken, Wing-Kin Syn, Ali Canbay, and Lars P. Bechmann
Subjects: celiac disease, hepatic steatosis, bile acids, FGF19, non-alcoholic fatty liver disease (NAFLD), Microbiology, QR1-502
Abstract: Celiac disease (CeD) is a chronic autoimmune disorder characterized by an intolerance to storage proteins of many grains. CeD is frequently associated with liver damage and steatosis. Bile acid (BA) signaling has been identified as an important mediator in gut–liver interaction and the pathogenesis of non-alcoholic fatty liver disease (NAFLD). Here, we aimed to analyze BA signaling and liver injury in CeD patients. Therefore, we analyzed data of 20 CeD patients on a gluten-free diet compared to 20 healthy controls (HC). We furthermore analyzed transaminase levels, markers of cell death, BA, and fatty acid metabolism. Hepatic steatosis was determined via transient elastography, by MRI and non-invasive scores. In CeD, we observed an increase of the apoptosis marker M30 and more hepatic steatosis as compared to HC. Fibroblast growth factor 19 (FGF19) was repressed in CeD, while low levels were associated with steatosis, especially in patients with high levels of anti-tissue transglutaminase antibodies (anti-tTG). When comparing anti-tTG-positive CeD patients to individuals without detectable anti-tTG levels, hepatic steatosis was accentuated. CeD patients with significant sonographic steatosis (defined by CAP ≥ 283 dB/m) were exclusively anti-tTG-positive. In summary, our results suggest that even in CeD patients in clinical remission under gluten-free diet, alterations in gut–liver axis, especially BA signaling, might contribute to steatotic liver injury and should be further addressed in future studies and clinical practice.
Published: 2022
Full Text: View/download PDF

47. SEDE-GPS: socio-economic data enrichment based on GPS information

Author: Theodor Sperlea, Stefan Füser, Jens Boenigk, and Dominik Heider
Subjects: GPS, Data enrichment, Database, Ecology, Microbial ecology, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Background: Microbes are essential components of all ecosystems because they drive many biochemical processes and act as primary producers. In freshwater ecosystems, the biodiversity in and the composition of microbial communities can be used as indicators for environmental quality. Recently, some environmental features have been identified that influence microbial ecosystems. However, the impact of human action on lake microbiomes is not well understood. This is, in part, due to the fact that environmental data is, albeit theoretically accessible, not easily available. Results: In this work, we present SEDE-GPS, a tool that gathers data that are relevant to the environment of a user-provided GPS coordinate. To this end, it accesses a list of public and corporate databases and aggregates the information in a single file, which can be used for further analysis. To showcase the use of SEDE-GPS, we enriched a lake microbial ecology sequencing dataset with around 18,000 socio-economic, climate, and geographic features. The sources of SEDE-GPS are public databases such as Eurostat, the Climate Data Center, and OpenStreetMap, as well as corporate sources such as Twitter. Using machine learning and feature selection methods, we were able to identify features in the data provided by SEDE-GPS that can be used to predict lake microbiome alpha diversity. Conclusion: The results presented in this study show that SEDE-GPS is a handy and easy-to-use tool for comprehensive data enrichment for studies of ecology and other processes that are affected by environmental features. Furthermore, we present lists of environmental, socio-economic, and climate features that are predictive for microbial biodiversity in lake ecosystems. These lists indicate that human action has a major impact on lake microbiomes. SEDE-GPS and its source code are available for download at http://SEDE-GPS.heiderlab.de
Published: 2018
Full Text: View/download PDF

48. eccCL: parallelized GPU implementation of Ensemble Classifier Chains

Author: Mona Riemenschneider, Alexander Herbst, Ari Rasch, Sergei Gorlatch, and Dominik Heider
Subjects: Classifier chains, Multi label classification, High performance computing, Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5
Abstract: Background: Multi-label classification has recently gained great attention in diverse fields of research, e.g., in biomedical applications such as protein function prediction or drug resistance testing in HIV. In this context, the concept of Classifier Chains has been shown to improve prediction accuracy, especially when applied as Ensemble Classifier Chains. However, these techniques lack computational efficiency when applied on large amounts of data, e.g., derived from next-generation sequencing experiments. By adapting algorithms for the use of graphics processing units, computational efficiency can be greatly improved due to parallelization of computations. Results: Here, we provide a parallelized and optimized graphics processing unit implementation (eccCL) of Classifier Chains and Ensemble Classifier Chains. In addition to the OpenCL implementation, we provide an R-Package with an easy-to-use R-interface for parallelized graphics processing unit usage. Conclusion: eccCL is a handy implementation of Classifier Chains on GPUs, which is able to process up to over 25,000 instances per second, and thus can be used efficiently in high-throughput experiments. The software is available at http://www.heiderlab.de
Published: 2017
Full Text: View/download PDF

49. EFS: an ensemble feature selection tool implemented as R-package and web-application

Author: Ursula Neumann, Nikita Genze, and Dominik Heider
Subjects: Machine learning, Feature selection, Ensemble learning, R-package, Computer applications to medicine. Medical informatics, R858-859.7, Analysis, QA299.6-433
Abstract: Background: Feature selection methods aim at identifying a subset of features that improve the prediction performance of subsequent classification models and thereby also simplify their interpretability. Preceding studies demonstrated that single feature selection methods can have specific biases, whereas an ensemble feature selection has the advantage to alleviate and compensate for these biases. Results: The software EFS (Ensemble Feature Selection) makes use of multiple feature selection methods and combines their normalized outputs to a quantitative ensemble importance. Currently, eight different feature selection methods have been integrated in EFS, which can be used separately or combined in an ensemble. Conclusion: EFS identifies relevant features while compensating specific biases of single methods due to an ensemble approach. Thereby, EFS can improve the prediction accuracy and interpretability in subsequent binary classification models. Availability: EFS can be downloaded as an R-package from CRAN or used via a web application at http://EFS.heiderlab.de
Published: 2017
Full Text: View/download PDF

50. ContraDRG: Automatic Partial Charge Prediction by Machine Learning

Author: Roman Martin and Dominik Heider
Subjects: PRODRG, ATB, machine learning, molecular dynamics simulations, partial charge prediction, Genetics, QH426-470
Abstract: In recent years, machine learning techniques have been widely used in biomedical research to predict unseen data based on models trained on experimentally derived data. In the current study, we used machine learning algorithms to emulate computationally complex predictions in a reverse engineering–like manner and developed ContraDRG, software that can be used to predict partial charges for small molecules based on PRODRG and Automated Topology Builder (ATB) predictions. Both tools generate molecular topology files, including the partial atomic charge, by using different procedures. We show that ContraDRG can accurately predict partial charges in a fraction of the time, because it exploits existing complex models with intensive calculations by using machine learning techniques and thus can also be applied for screening projects with large amounts of molecules. We provide ContraDRG as a web server, which can be used to automatically assign partial charges to incoming user-specified molecules by using our machine learning models. In this study, we compared ContraDRG with PRODRG and ATB with regard to predictivity by statistical methods. ContraDRG allows predicting ATB-derived partial charges with an R² value up to 0.980 and for PRODRG up to 1.00. While ATB requires hours or days for the quantum mechanical accurate calculation and refinements, ContraDRG does its approximation within seconds.
Published: 2019
Full Text: View/download PDF

208 results on '"Dominik Heider"'