36 results on '"Dominik Heider"'
Search Results
2. Unsupervised encoding selection through ensemble pruning for biomedical classification
- Author
Sebastian Spänig, Alexander Michel, and Dominik Heider
- Subjects
Biomedical classification; Antimicrobial peptides; Encodings; Machine learning; Ensemble learning; Computer applications to medicine. Medical informatics; R858-859.7; Analysis; QA299.6-433
- Abstract
Background: Owing to the rising levels of multi-resistant pathogens, antimicrobial peptides, an alternative strategy to classic antibiotics, have received more attention. A crucial part is thereby the costly identification and validation. With the ever-growing amount of annotated peptides, researchers leverage artificial intelligence to circumvent the cumbersome, wet-lab-based identification and automate the detection of promising candidates. However, the prediction of a peptide’s function is not limited to antimicrobial efficiency. To date, multiple studies successfully classified additional properties, e.g., antiviral or cell-penetrating effects. In this light, ensemble classifiers are employed aiming to further improve the prediction. Although we recently presented a workflow to significantly diminish the initial encoding choice, an entire unsupervised encoding selection, considering various machine learning models, is still lacking. Results: We developed a workflow, automatically selecting encodings and generating classifier ensembles by employing sophisticated pruning methods. We observed that the Pareto frontier pruning is a good method to create encoding ensembles for the datasets at hand. In addition, encodings combined with the Decision Tree classifier as the base model are often superior. However, our results also demonstrate that none of the ensemble building techniques is outstanding for all datasets. Conclusion: The workflow conducts multiple pruning methods to evaluate ensemble classifiers composed from a wide range of peptide encodings and base models. Consequently, researchers can use the workflow for unsupervised encoding selection and ensemble creation. Ultimately, the extensible workflow can be used as a plugin for the PEPTIDE REACToR, further establishing it as a versatile tool in the domain.
- Published
- 2023
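The abstract of result 2 names Pareto frontier pruning but does not state which objectives are traded off. The following is only a minimal sketch of the general idea, assuming each candidate model is scored on two hypothetical objectives that are both maximized; the names `pareto_front`, `enc1` to `enc4`, and the scores are illustrative and not taken from the paper.

```python
def pareto_front(candidates):
    """Return names of candidates that no other candidate dominates.

    Each candidate is (name, score_a, score_b); both scores are maximized.
    The two objectives are placeholders, not the ones used in the paper.
    """
    front = []
    for name, a, b in candidates:
        # A candidate is dominated if another one is at least as good on
        # both objectives and strictly better on at least one.
        dominated = any(
            (oa >= a and ob >= b) and (oa > a or ob > b)
            for _, oa, ob in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical encoding/model combinations with two made-up scores each.
models = [
    ("enc1", 0.90, 0.20),
    ("enc2", 0.85, 0.40),
    ("enc3", 0.80, 0.30),  # dominated by enc2 on both scores
    ("enc4", 0.70, 0.60),
]
print(pareto_front(models))
```

Only the non-dominated candidates would be kept as ensemble members in this sketch; everything else is pruned.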
3. AI-based multi-PRS models outperform classical single-PRS models
- Author
Jan Henric Klau, Carlo Maj, Hannah Klinkhammer, Peter M. Krawitz, Andreas Mayr, Axel M. Hillmer, Johannes Schumacher, and Dominik Heider
- Subjects
polygenic risk score; machine learning; deep learning; breast cancer; regression; Genetics; QH426-470
- Abstract
Polygenic risk scores (PRS) calculate the risk for a specific disease based on the weighted sum of associated alleles from different genetic loci in the germline estimated by regression models. Recent advances in genetics made it possible to create polygenic predictors of complex human traits, including risks for many important complex diseases, such as cancer, diabetes, or cardiovascular diseases, typically influenced by many genetic variants, each of which has a negligible effect on overall risk. In the current study, we analyzed whether adding additional PRS from other diseases to the prediction models and replacing the regressions with machine learning models can improve overall predictive performance. Results showed that multi-PRS models outperform single-PRS models significantly on different diseases. Moreover, replacing regression models with machine learning models, i.e., deep learning, can also improve overall accuracy.
- Published
- 2023
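The weighted sum described in the abstract of result 3 can be sketched as follows. This is an illustration under stated assumptions: allele dosages are coded 0, 1, or 2, the per-variant weights are made-up numbers standing in for regression effect sizes, and the function name is hypothetical.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant).

    `weights` stands in for per-variant effect sizes from a regression
    model; the values used below are invented for illustration.
    """
    if len(dosages) != len(weights):
        raise ValueError("one weight per variant is required")
    return sum(w * d for d, w in zip(dosages, weights))


# One individual, four variants.
dosages = [0, 1, 2, 1]
weights = [0.5, 0.25, -0.125, 1.0]
score = polygenic_risk_score(dosages, weights)
print(score)
```

In a multi-PRS setting as described in the abstract, several such scores (one per disease) would be collected into a feature vector and passed to a downstream model.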
4. Multi-label classification for multi-drug resistance prediction of Escherichia coli
- Author
Yunxiao Ren, Trinad Chakraborty, Swapnil Doijad, Linda Falgenhauer, Jane Falgenhauer, Alexander Goesmann, Oliver Schwengers, and Dominik Heider
- Subjects
Multi-drug resistance; Machine learning; Multi-label classification; Biotechnology; TP248.13-248.65
- Abstract
Antimicrobial resistance (AMR) is a global health and development threat. In particular, multi-drug resistance (MDR) is increasingly common in pathogenic bacteria. It has become a serious problem to public health, as MDR can lead to the failure of treatment of patients. MDR is typically the result of mutations and the accumulation of multiple resistance genes within a single cell. Machine learning methods have a wide range of applications for AMR prediction. However, these approaches typically focus on single drug resistance prediction and do not incorporate information on accumulating antimicrobial resistance traits over time. Thus, identifying multi-drug resistance simultaneously and rapidly remains an open challenge. In our study, we could demonstrate that multi-label classification (MLC) methods can be used to model multi-drug resistance in pathogens. Importantly, we found the ensemble of classifier chains (ECC) model achieves accurate MDR prediction and outperforms other MLC methods. Thus, our study extends the available tools for MDR prediction and paves the way for improving diagnostics of infections in patients. Furthermore, the MLC methods we introduced here would contribute to reducing the threat of antimicrobial resistance and related deaths in the future by improving the speed and accuracy of the identification of pathogens and resistance.
- Published
- 2022
5. Gaussian noise up-sampling is better suited than SMOTE and ADASYN for clinical decision making
- Author
Jacqueline Beinecke and Dominik Heider
- Subjects
Machine learning; Clinical data; Data augmentation; Synthetic data; Computer applications to medicine. Medical informatics; R858-859.7; Analysis; QA299.6-433
- Abstract
Clinical data sets have very special properties and suffer from many caveats in machine learning. They typically show a high class imbalance, have a small number of samples and a large number of parameters, and have missing values. While feature selection approaches and imputation techniques address the former problems, the class imbalance is typically addressed using augmentation techniques. However, these techniques have been developed for big data analytics, and their suitability for clinical data sets is unclear. This study analyzed different augmentation techniques for use in clinical data sets and subsequent employment of machine learning-based classification. It turns out that Gaussian Noise Up-Sampling (GNUS) is generally, though not always, as good as SMOTE and ADASYN and even outperforms them on some datasets. However, it has also been shown that augmentation does not improve classification at all in some cases.
- Published
- 2021
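The abstract of result 5 does not spell out how Gaussian noise up-sampling is performed. A minimal sketch of the general idea, assuming new minority-class samples are created by copying a randomly chosen existing sample and adding zero-mean Gaussian noise to each feature, could look like this; the function name, the fixed `sigma`, and the toy data are assumptions.

```python
import random


def gaussian_noise_upsample(minority, n_new, sigma=0.1, seed=0):
    """Create n_new synthetic samples from minority-class rows.

    Each synthetic row is a randomly picked original row plus independent
    N(0, sigma^2) noise per feature. The noise scale is an assumption; the
    abstract does not specify how it is chosen.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        synthetic.append([x + rng.gauss(0.0, sigma) for x in base])
    return synthetic


# Three minority-class samples with two features each (toy values).
minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1]]
augmented = minority + gaussian_noise_upsample(minority, n_new=5)
print(len(augmented))
```

Unlike SMOTE or ADASYN, this sketch does not interpolate between neighbours; it only perturbs existing points.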
6. Machine learning with asymmetric abstention for biomedical decision-making
- Author
Mariem Gandouz, Hajo Holzmann, and Dominik Heider
- Subjects
Medical data science; Machine learning; Classification; Diagnostics; Computer applications to medicine. Medical informatics; R858-859.7
- Abstract
Machine learning and artificial intelligence have entered biomedical decision-making for diagnostics, prognostics, or therapy recommendations. However, these methods need to be interpreted with care because of the severe consequences for patients. In contrast to human decision-making, computational models typically make a decision even with low confidence. Machine learning with abstention better reflects human decision-making by introducing a reject option for samples with low confidence. The abstention intervals are typically symmetric intervals around the decision boundary. In the current study, we use asymmetric abstention intervals, which we demonstrate to be better suited for biomedical data that is typically highly imbalanced. We evaluate symmetric and asymmetric abstention on three real-world biomedical datasets and show that both approaches can significantly improve classification performance. However, asymmetric abstention rejects as many or fewer samples compared to symmetric abstention and thus should be used for imbalanced data.
- Published
- 2021
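The reject option described in the abstract of result 6 can be illustrated with a small sketch. It assumes a binary classifier that outputs a probability for the positive class and a decision boundary at 0.5; the function name and the interval bounds are hypothetical and not taken from the paper.

```python
def predict_with_abstention(prob_positive, lower, upper):
    """Return 1, 0, or None (abstain) for a predicted probability.

    The model abstains when the probability falls strictly inside the
    interval (lower, upper). A symmetric interval has equal widths on both
    sides of the boundary; an asymmetric one does not.
    """
    if prob_positive >= upper:
        return 1
    if prob_positive <= lower:
        return 0
    return None  # confidence too low: reject the sample


symmetric = (0.40, 0.60)   # 0.5 +/- 0.1
asymmetric = (0.45, 0.70)  # narrower below the boundary, wider above it

for p in (0.42, 0.65):
    print(p,
          predict_with_abstention(p, *symmetric),
          predict_with_abstention(p, *asymmetric))
```

How the bounds are tuned for imbalanced data is the subject of the paper and is not reproduced here.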
7. Chaos game representation and its applications in bioinformatics
- Author
Hannah Franziska Löchel and Dominik Heider
- Subjects
Chaos game representation; Bioinformatics; Sequence analysis; Alignment-free sequence comparison; DNA and protein encoding; Machine learning; Biotechnology; TP248.13-248.65
- Abstract
Chaos game representation (CGR), a milestone in graphical bioinformatics, has become a powerful tool regarding alignment-free sequence comparison and feature encoding for machine learning. The algorithm maps a sequence to 2-dimensional space, while an extension of the CGR, the so-called frequency matrix representation (FCGR), transforms sequences of different lengths into equal-sized images or matrices. The CGR is a generalized Markov chain and includes various properties, which allow a unique representation of a sequence. Therefore, it has a broad spectrum of applications in bioinformatics, such as sequence comparison and phylogenetic analysis and as an encoding of sequences for machine learning. This review introduces the construction of CGRs and FCGRs, their applications on DNA and proteins, and gives an overview of recent applications and progress in bioinformatics.
- Published
- 2021
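To make the mapping in the abstract of result 7 concrete, here is a minimal sketch of a CGR and an FCGR for DNA. It assumes one common corner layout (A, C, G, T at the corners of the unit square) and a start at the centre; other layouts exist, and the function names are illustrative.

```python
# Assumed corner assignment for the four nucleotides on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Map a DNA sequence to 2D points: each step moves halfway from the
    current point towards the corner of the next nucleotide."""
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count CGR points in a 2^k x 2^k grid.

    After at least k steps, the grid cell of a point is determined by the
    last k nucleotides, so each cell holds a k-mer count. Sequences of any
    length therefore yield a matrix of the same size.
    """
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq)[k - 1:]:
        matrix[int(y * size)][int(x * size)] += 1
    return matrix


print(cgr_points("A"))
for row in fcgr("ACGTACGT", k=2):
    print(row)
```

The resulting matrix can be flattened or treated as an image when used as input for a machine learning model.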
8. Is the Combination of ADOS and ADI-R Necessary to Classify ASD? Rethinking the 'Gold Standard' in Diagnosing ASD
- Author
Inge Kamp-Becker, Johannes Tauscher, Nicole Wolff, Charlotte Küpper, Luise Poustka, Stefan Roepke, Veit Roessner, Dominik Heider, and Sanna Stroth
- Subjects
machine learning; random forest; autism spectrum disorder; clinical characteristics; differential diagnosis behavioral aspects; ADOS; Psychiatry; RC435-571
- Abstract
Diagnosing autism spectrum disorder (ASD) requires extensive clinical expertise and training as well as a focus on differential diagnoses. The diagnostic process is particularly complex given symptom overlap with other mental disorders and high rates of co-occurring physical and mental health concerns. The aim of this study was to conduct a data-driven selection of the most relevant diagnostic information collected from a behavior observation and an anamnestic interview in two clinical samples of children/younger adolescents and adolescents/adults with suspected ASD. Via random forests, the present study discovered patterns of symptoms in the diagnostic data of 2310 participants (46% ASD, 54% non-ASD, age range 4–72 years) using data from the combined Autism Diagnostic Observation Schedule (ADOS) and Autism Diagnostic Interview—Revised (ADI-R) and ADOS data alone. Classifiers built on reduced subsets of diagnostic features yield satisfactory sensitivity and specificity values. For adolescents/adults specificity values were lower compared to those for children/younger adolescents. The models including ADOS and ADI-R data were mainly built on ADOS items and in the adolescent/adult sample the classifier including only ADOS items performed even better than the classifier including information from both instruments. Results suggest that reduced subsets of ADOS and ADI-R items may suffice to effectively differentiate ASD from other mental disorders. The imbalance of ADOS and ADI-R items included in the models leads to the assumption that, particularly in adolescents and adults, the ADI-R may play a lesser role than current behavior observations.
- Published
- 2021
9. Identification of the most indicative and discriminative features from diagnostic instruments for children with autism
- Author
Sanna Stroth, Johannes Tauscher, Nicole Wolff, Charlotte Küpper, Luise Poustka, Stefan Roepke, Veit Roessner, Dominik Heider, and Inge Kamp‐Becker
- Subjects
ADI‐R; ADOS; autism spectrum disorder; diagnostic‐gold‐standard; differential‐diagnosis; machine learning; Pediatrics; RJ1-570; Psychiatry; RC435-571
- Abstract
Background: Diagnosing autism spectrum disorder (ASD) is complex and time‐consuming. The present work systematically examines the importance of items from the Autism Diagnostic Interview‐Revised (ADI‐R) and Autism Diagnostic Observation Schedule (ADOS) in discerning children with and without ASD. Knowledge of the most discriminative features and their underlying concepts may prove valuable for future training tools that assist clinicians to substantiate or extenuate a suspicion of ASD in nonverbal and minimally verbal children. Methods: In two samples of nonverbal (N = 466) and minimally verbal (N = 566) children with ASD (N = 509) and other mental disorders or developmental delays (N = 523), we applied random forests (RFs) to (i) the combination of ADI‐R and ADOS data versus (ii) ADOS data alone. We compared the predictive performance of reduced feature models against outcomes provided by models containing all features. Results: For nonverbal children, the RF classifier indicated social orientation to be most powerful in differentiating ASD from non‐ASD cases. In minimally verbal children, we find language/speech peculiarities in combination with facial/nonverbal expressions and reciprocity to be most distinctive. Conclusion: Based on machine learning strategies, we carve out those symptoms of ASD that prove to be central for the differentiation of ASD cases from those with other developmental or mental disorders (high specificity in minimally verbal children). These core concepts ought to be considered in future training tools for clinicians.
- Published
- 2021
10. Encodings and models for antimicrobial peptide classification for multi-resistant pathogens
- Author
Sebastian Spänig and Dominik Heider
- Subjects
Machine learning; Antimicrobial peptides; Encodings; Computer applications to medicine. Medical informatics; R858-859.7; Analysis; QA299.6-433
- Abstract
Antimicrobial peptides (AMPs) are part of the inherent immune system. In fact, they occur in almost all organisms including, e.g., plants, animals, and humans. Remarkably, they are also effective against multi-resistant pathogens, with a high selectivity. This is especially crucial in times when society is faced with the major threat of an ever-increasing number of antibiotic-resistant microbes. In addition, AMPs can also exhibit antitumor and antiviral effects, thus a variety of scientific studies dealt with the prediction of active peptides in recent years. Due to their potential, even the pharmaceutical industry is keen on discovering and developing novel AMPs. However, AMPs are difficult to verify in vitro, hence researchers conduct sequence similarity experiments against known, active peptides. Unfortunately, this approach is very time-consuming and limits potential candidates to sequences with a high similarity to known AMPs. Machine learning methods offer the opportunity to explore the huge space of sequence variations in a timely manner. These algorithms have, in principle, paved the way for an automated discovery of AMPs. However, machine learning models require a numerical input, thus an informative encoding is very important. Unfortunately, developing an appropriate encoding is a major challenge, which has not been entirely solved so far. For this reason, the development of novel amino acid encodings is established as a stand-alone research branch. The present review introduces state-of-the-art encodings of amino acids as well as their properties in sequence and structure based aggregation. Moreover, albeit a well-chosen encoding is essential, performant classifiers are required, which is reflected by a tendency towards specifically designed models in the literature. Furthermore, we introduce these models with a particular focus on encodings derived from support vector machines and deep learning approaches. Albeit a strong focus has been set on AMP predictions, not all of the mentioned encodings have been elaborated as part of antimicrobial research studies, but rather as general protein or peptide representations.
- Published
- 2019
11. Editorial: Artificial Intelligence Bioinformatics: Development and Application of Tools for Omics and Inter-Omics Studies
- Author
Davide Chicco, Dominik Heider, and Angelo Facchiano
- Subjects
artificial intelligence; bioinformatics; genomics; omics; inter-omics; machine learning; Genetics; QH426-470
- Published
- 2020
12. EFS: an ensemble feature selection tool implemented as R-package and web-application
- Author
Ursula Neumann, Nikita Genze, and Dominik Heider
- Subjects
Machine learning; Feature selection; Ensemble learning; R-package; Computer applications to medicine. Medical informatics; R858-859.7; Analysis; QA299.6-433
- Abstract
Background: Feature selection methods aim at identifying a subset of features that improve the prediction performance of subsequent classification models and thereby also simplify their interpretability. Preceding studies demonstrated that single feature selection methods can have specific biases, whereas an ensemble feature selection has the advantage to alleviate and compensate for these biases. Results: The software EFS (Ensemble Feature Selection) makes use of multiple feature selection methods and combines their normalized outputs to a quantitative ensemble importance. Currently, eight different feature selection methods have been integrated in EFS, which can be used separately or combined in an ensemble. Conclusion: EFS identifies relevant features while compensating specific biases of single methods due to an ensemble approach. Thereby, EFS can improve the prediction accuracy and interpretability in subsequent binary classification models. Availability: EFS can be downloaded as an R-package from CRAN or used via a web application at http://EFS.heiderlab.de.
- Published
- 2017
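The abstract of result 12 says that EFS combines the normalized outputs of several feature selection methods into one ensemble importance, without giving the normalization. The sketch below illustrates that idea only; it is not the EFS package's API. Dividing by each method's maximum score and averaging across methods are assumptions, as are the function and method names.

```python
def ensemble_importance(method_scores):
    """Combine per-method feature scores into one importance per feature.

    method_scores maps a method name to {feature: non-negative raw score}.
    Each method is scaled to [0, 1] by its maximum (an assumed scheme) and
    the scaled scores are averaged across methods.
    """
    features = sorted({f for scores in method_scores.values() for f in scores})
    total = {f: 0.0 for f in features}
    for scores in method_scores.values():
        top = max(scores.values())
        for f in features:
            if top > 0:
                total[f] += scores.get(f, 0.0) / top
    n_methods = len(method_scores)
    return {f: total[f] / n_methods for f in features}


# Two hypothetical selection methods scoring two features on different scales.
raw = {
    "method_1": {"age": 2.0, "bmi": 1.0},
    "method_2": {"age": 0.25, "bmi": 1.0},
}
print(ensemble_importance(raw))
```

Scaling before averaging keeps a method with large raw scores from dominating the ensemble, which is the bias-compensation idea the abstract describes.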
13. ContraDRG: Automatic Partial Charge Prediction by Machine Learning
- Author
Roman Martin and Dominik Heider
- Subjects
PRODRG; ATB; machine learning; molecular dynamics simulations; partial charge prediction; Genetics; QH426-470
- Abstract
In recent years, machine learning techniques have been widely used in biomedical research to predict unseen data based on models trained on experimentally derived data. In the current study, we used machine learning algorithms to emulate computationally complex predictions in a reverse engineering–like manner and developed ContraDRG, a software that can be used to predict partial charges for small molecules based on PRODRG and Automated Topology Builder (ATB) predictions. Both tools generate molecular topology files, including the partial atomic charge, by using different procedures. We show that ContraDRG can accurately predict partial charges in a fraction of the time, because it exploits existing complex models with intensive calculations by using machine learning techniques and thus can also be applied for screening projects with large numbers of molecules. We provide ContraDRG as a web server, which can be used to automatically assign partial charges to incoming user-specified molecules by using our machine learning models. In this study, we compared ContraDRG with PRODRG and ATB with regard to predictivity by statistical methods. ContraDRG allows predicting ATB-derived partial charges with an R² value up to 0.980 and for PRODRG up to 1.00. While ATB requires hours or days for the quantum mechanical accurate calculation and refinements, ContraDRG does its approximation within seconds.
- Published
- 2019
14. Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning
- Author
Yunxiao Ren, Swapnil Doijad, Oliver Schwengers, Dominik Heider, Alexander Goesmann, Trinad Chakraborty, Linda Falgenhauer, Jane Falgenhauer, and Anne-Christin Hauschild
- Subjects
Statistics and Probability; Source code; AcademicSubjects/SCI01060; Computer science; media_common.quotation_subject; Machine learning; computer.software_genre; Biochemistry; Convolutional neural network; Encoding (memory); Molecular Biology; Throughput (business); media_common; Whole genome sequencing; Original Paper; business.industry; Genome Analysis; Computer Science Applications; Random forest; Support vector machine; Data set; Computational Mathematics; Computational Theory and Mathematics; Artificial intelligence; business; computer
- Abstract
Motivation: Antimicrobial resistance (AMR) is one of the biggest global problems threatening human and animal health. Rapid and accurate AMR diagnostic methods are thus very urgently needed. However, traditional antimicrobial susceptibility testing (AST) is time-consuming, low throughput and viable only for cultivable bacteria. Machine learning methods may pave the way for automated AMR prediction based on genomic data of the bacteria. However, comparing different machine learning methods for the prediction of AMR based on different encodings and whole-genome sequencing data without previously known knowledge remains to be done. Results: In this study, we evaluated logistic regression (LR), support vector machine (SVM), random forest (RF) and convolutional neural network (CNN) for the prediction of AMR for the antibiotics ciprofloxacin, cefotaxime, ceftazidime and gentamicin. We could demonstrate that these models can effectively predict AMR with label encoding, one-hot encoding and frequency matrix chaos game representation (FCGR encoding) on whole-genome sequencing data. We trained these models on a large AMR dataset and evaluated them on an independent public dataset. Generally, RFs and CNNs perform better than LR and SVM with AUCs up to 0.96. Furthermore, we were able to identify mutations that are associated with AMR for each antibiotic. Availability and implementation: Source code for data preparation and model training is provided on GitHub (https://github.com/YunxiaoRen/ML-iAMR). Supplementary information: Supplementary data are available at Bioinformatics online.
- Published
- 2021
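Two of the encodings named in the abstract of result 14, label encoding and one-hot encoding, can be illustrated on a short DNA string. This is a generic sketch and not the code from the linked repository; the integer assignment (A=1, C=2, G=3, T=4) and the function names are assumptions.

```python
ALPHABET = "ACGT"  # assumed ordering of the four nucleotides


def label_encode(seq):
    """Replace each nucleotide with an integer code (A=1 ... T=4)."""
    return [ALPHABET.index(base) + 1 for base in seq]


def one_hot_encode(seq):
    """Replace each nucleotide with a 4-element indicator vector."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


print(label_encode("GATC"))
print(one_hot_encode("AC"))
```

Label encoding is compact but implies an ordering between nucleotides; one-hot encoding avoids that at the cost of four columns per position.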
15. Quantification of the covariation of lake microbiomes and environmental variables using a machine learning‐based framework
- Author
Nico Kreuder, Jens Boenigk, Theodor Sperlea, Dominik Heider, Daniela Beisser, and Georges Hattab
- Subjects
0106 biological sciences; 0301 basic medicine; Biology; Machine learning; computer.software_genre; 010603 evolutionary biology; 01 natural sciences; Machine Learning; 03 medical and health sciences; Microbial ecology; Genetics; Systems ecology; Ecosystem; Ecology, Evolution, Behavior and Systematics; Ecology; business.industry; Microbiota; Lake ecosystem; Lakes; 030104 developmental biology; Taxon; Habitat; Microbial population biology; Artificial intelligence; business; Biologie; Bioindicator; computer
- Abstract
It is known that microorganisms are essential for the functioning of ecosystems, but the extent to which microorganisms respond to different environmental variables in their natural habitats is not clear. In the current study, we present a methodological framework to quantify the covariation of the microbial community of a habitat and environmental variables of this habitat. It is built on theoretical considerations of systems ecology, makes use of state-of-the-art machine learning techniques and can be used to identify bioindicators. We apply the framework to a data set containing operational taxonomic units (OTUs) as well as more than twenty physicochemical and geographic variables measured in a large-scale survey of European lakes. While a large part of variation (up to 61%) in many environmental variables can be explained by microbial community composition, some variables do not show significant covariation with the microbial lake community. Moreover, we have identified OTUs that act as "multitask" bioindicators, i.e., that are indicative for multiple environmental variables, and thus could be candidates for lake water monitoring schemes. Our results represent, for the first time, a quantification of the covariation of the lake microbiome and a wide array of environmental variables for lake ecosystems. Building on the results and methodology presented here, it will be possible to identify microbial taxa and processes that are essential for functioning and stability of lake ecosystems.
- Published
- 2021
16. Chaos game representation and its applications in bioinformatics
- Author
Dominik Heider and Hannah F. Löchel
- Subjects
Sequence; Markov chain; Computer science; Bioinformatics; DNA and protein encoding; Biophysics; Representation (systemics); Sequence analysis; Review Article; Extension (predicate logic); Biochemistry; Computer Science Applications; Frequency matrix; Chaos game representation; Structural Biology; Encoding (memory); Machine learning; Genetics; Feature (machine learning); Alignment-free sequence comparison; TP248.13-248.65; ComputingMethodologies_COMPUTERGRAPHICS; Biotechnology
- Abstract
Highlights: Applications in phylogeny; alignment-free sequence comparison; encoding for machine learning. Chaos game representation (CGR), a milestone in graphical bioinformatics, has become a powerful tool regarding alignment-free sequence comparison and feature encoding for machine learning. The algorithm maps a sequence to 2-dimensional space, while an extension of the CGR, the so-called frequency matrix representation (FCGR), transforms sequences of different lengths into equal-sized images or matrices. The CGR is a generalized Markov chain and includes various properties, which allow a unique representation of a sequence. Therefore, it has a broad spectrum of applications in bioinformatics, such as sequence comparison and phylogenetic analysis and as an encoding of sequences for machine learning. This review introduces the construction of CGRs and FCGRs, their applications on DNA and proteins, and gives an overview of recent applications and progress in bioinformatics.
- Published
- 2021
17. Comparative analyses of error handling strategies for next-generation sequencing in precision medicine
- Author
Dominik Heider and Hannah F. Löchel
- Subjects
Computer science; Process (engineering); lcsh:Medicine; HIV Infections; Machine learning; computer.software_genre; DNA sequencing; Article; Workflow; Genetics; Humans; Precision Medicine; lcsh:Science; Data mining; Sequence; Multidisciplinary; business.industry; lcsh:R; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; Precision medicine; HIV-1; lcsh:Q; Artificial intelligence; Personalized medicine; business; computer
- Abstract
Next-generation sequencing (NGS) offers the opportunity to sequence millions and billions of DNA sequences in a short period, leading to novel applications in personalized medicine, such as cancer diagnostics or antiviral therapy. Nevertheless, sequencing technologies have different error rates, which occur during the sequencing process. If the NGS data is used for diagnostics, these sequences with errors are typically neglected or a worst-case scenario is assumed. In the current study, we focused on the impact of ambiguous bases on therapy recommendations for Human Immunodeficiency Virus 1 (HIV-1) patients. Concretely, we analyzed the treatment recommendation with entry blockers based on prediction models for co-receptor tropism. We compared three different error handling strategies that have been used in the literature, namely (i) neglection, (ii) worst-case assumption, and (iii) deconvolution with a majority vote. We could show that for two or more ambiguous positions per sequence a reliable prediction is generally no longer possible. Moreover, also the position of ambiguity plays a crucial role. Thus, we analyzed the error probability distributions of existing sequencing technologies, e.g., Illumina MiSeq or PacBio, with respect to the aforementioned error handling strategies and it turned out that neglection outperforms the other strategies in the case where no systematic errors are present. In other cases, the deconvolution strategy with the majority vote should be preferred.
- Published
- 2020
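The third strategy in the abstract of result 17, deconvolution with a majority vote, can be sketched as follows. The sketch assumes ambiguous bases are given as IUPAC nucleotide codes, expands a sequence into every unambiguous sequence it could stand for, and lets a classifier vote. The classifier `toy_predict` and its labels are purely hypothetical stand-ins for a co-receptor tropism model.

```python
from collections import Counter
from itertools import product

# IUPAC nucleotide ambiguity codes and the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def deconvolve(seq):
    """List every unambiguous sequence consistent with seq."""
    return ["".join(combo) for combo in product(*(IUPAC[b] for b in seq))]


def majority_vote(seq, predict):
    """Predict each deconvolved sequence and return the most common label.

    On a tie, the label encountered first wins.
    """
    votes = Counter(predict(s) for s in deconvolve(seq))
    return votes.most_common(1)[0][0]


def toy_predict(s):
    # Hypothetical classifier used only to make the example runnable.
    return "pos" if "AA" in s else "neg"


print(deconvolve("AR"))
print(majority_vote("ANA", toy_predict))
```

The number of deconvolved sequences grows multiplicatively with each ambiguous position, which is one practical reason why several ambiguous positions per sequence become hard to handle.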
18. Machine learning with asymmetric abstention for biomedical decision-making
- Author
Dominik Heider, Mariem Gandouz, and Hajo Holzmann
- Subjects
Computer science; Low Confidence; Computer applications to medicine. Medical informatics; R858-859.7; Health Informatics; Machine learning; computer.software_genre; Imbalanced data; Machine Learning; Biomedical data; Artificial Intelligence; Humans; Decision-making; Diagnostics; Computational model; business.industry; Research; Medical data science; Health Policy; Contrast (statistics); Classification; Computer Science Applications; Decision boundary; Prognostics; Artificial intelligence; business; computer
- Abstract
Machine learning and artificial intelligence have entered biomedical decision-making for diagnostics, prognostics, or therapy recommendations. However, these methods need to be interpreted with care because of the severe consequences for patients. In contrast to human decision-making, computational models typically make a decision even with low confidence. Machine learning with abstention better reflects human decision-making by introducing a reject option for samples with low confidence. The abstention intervals are typically symmetric intervals around the decision boundary. In the current study, we use asymmetric abstention intervals, which we demonstrate to be better suited for biomedical data that is typically highly imbalanced. We evaluate symmetric and asymmetric abstention on three real-world biomedical datasets and show that both approaches can significantly improve classification performance. However, asymmetric abstention rejects as many or fewer samples compared to symmetric abstention and thus should be used for imbalanced data. Supplementary Information: The online version contains supplementary material available at 10.1186/s12911-021-01655-y.
- Published
- 2021
19. Is the Combination of ADOS and ADI-R Necessary to Classify ASD? Rethinking the 'Gold Standard' in Diagnosing ASD
- Author
Charlotte Küpper, Dominik Heider, Nicole Wolff, Luise Poustka, Johannes Tauscher, Inge Kamp-Becker, Sanna Stroth, Veit Roessner, and Stefan Roepke
- Subjects
ADI-R; RC435-571; autism spectrum disorder; Autism Diagnostic Observation Schedule; 03 medical and health sciences; 0302 clinical medicine; medicine; Diagnostic data; 0501 psychology and cognitive sciences; Medical diagnosis; clinical characteristics; Original Research; ADOS; Psychiatry; Autism Diagnostic Interview; business.industry; 05 social sciences; Gold standard (test); medicine.disease; Mental health; Psychiatry and Mental health; machine learning; Goldstandard; Autism spectrum disorder; Autism; business; 030217 neurology & neurosurgery; random forest; differential diagnosis behavioral aspects; 050104 developmental & child psychology; Clinical psychology
- Abstract
Diagnosing autism spectrum disorder (ASD) requires extensive clinical expertise and training as well as a focus on differential diagnoses. The diagnostic process is particularly complex given symptom overlap with other mental disorders and high rates of co-occurring physical and mental health concerns. The aim of this study was to conduct a data-driven selection of the most relevant diagnostic information collected from a behavior observation and an anamnestic interview in two clinical samples of children/younger adolescents and adolescents/adults with suspected ASD. Via random forests, the present study discovered patterns of symptoms in the diagnostic data of 2310 participants (46% ASD, 54% non-ASD, age range 4–72 years) using data from the combined Autism Diagnostic Observation Schedule (ADOS) and Autism Diagnostic Interview—Revised (ADI-R) and ADOS data alone. Classifiers built on reduced subsets of diagnostic features yield satisfactory sensitivity and specificity values. For adolescents/adults specificity values were lower compared to those for children/younger adolescents. The models including ADOS and ADI-R data were mainly built on ADOS items and in the adolescent/adult sample the classifier including only ADOS items performed even better than the classifier including information from both instruments. Results suggest that reduced subsets of ADOS and ADI-R items may suffice to effectively differentiate ASD from other mental disorders. The imbalance of ADOS and ADI-R items included in the models leads to the assumption that, particularly in adolescents and adults, the ADI-R may play a lesser role than current behavior observations.
- Published
- 2021
20. Integrative Analysis of Next-Generation Sequencing for Next-Generation Cancer Research toward Artificial Intelligence
- Author
-
Dominik Heider, Anne-Christin Hauschild, and Youngjun Park
- Subjects
0301 basic medicine ,Cancer Research ,Computer science ,Systems biology ,Big data ,Review ,Tumor heterogeneity ,DNA sequencing ,03 medical and health sciences ,0302 clinical medicine ,Research question ,RC254-282 ,biological network ,business.industry ,pathway ,deep neural network ,Neoplasms. Tumors. Oncology. Including cancer and carcinogens ,systems biology ,artificial intelligence ,030104 developmental biology ,machine learning ,Oncology ,030220 oncology & carcinogenesis ,Data analysis ,Cancer research ,Deep neural networks ,next-generation sequencing ,business ,Biological network - Abstract
Simple Summary In recent years both research areas of next-generation sequencing and artificial intelligence have grown remarkably. Their intersection simultaneously gave rise to a panacea of different algorithms and applications. This article delineates tailored machine learning and systems biology approaches and combinations thereof that tackle the various challenges that arise in the face of big data. Moreover, it provides an overview of the numerous applications of artificial intelligence aiding the analysis and interpretation of next-generation sequencing data. Abstract The rapid improvement of next-generation sequencing (NGS) technologies and their application in large-scale cohorts in cancer research led to common challenges of big data. It opened a new research area incorporating systems biology and machine learning. As large-scale NGS data accumulated, sophisticated data analysis methods became indispensable. In addition, NGS data have been integrated with systems biology to build better predictive models to determine the characteristics of tumors and tumor subtypes. Therefore, various machine learning algorithms were introduced to identify underlying biological mechanisms. In this work, we review novel technologies developed for NGS data analysis, and we describe how these computational methodologies integrate systems biology and omics data. Subsequently, we discuss how deep neural networks outperform other approaches, the potential of graph neural networks (GNN) in systems biology, and the limitations in NGS biomedical research. To reflect on the various challenges and corresponding computational solutions, we will discuss the following three topics: (i) molecular characteristics, (ii) tumor heterogeneity, and (iii) drug discovery. We conclude that machine learning and network-based approaches can add valuable insights and build highly accurate models. 
However, a well-informed choice of learning algorithm and biological network information is crucial for the success of each specific research question.
- Published
- 2021
21. A large-scale comparative study on peptide encodings for biomedical classification
- Author
-
Georges Hattab, Siba Mohsen, Dominik Heider, Anne-Christin Hauschild, and Sebastian Spänig
- Subjects
AcademicSubjects/SCI01140 ,AcademicSubjects/SCI01060 ,Computer science ,AcademicSubjects/SCI00030 ,Standard Article ,Machine learning ,computer.software_genre ,AcademicSubjects/SCI01180 ,Task (project management) ,03 medical and health sciences ,0302 clinical medicine ,Encoding (memory) ,Selection (linguistics) ,030304 developmental biology ,Structure (mathematical logic) ,0303 health sciences ,Sequence ,business.industry ,Range (mathematics) ,Workflow ,Artificial intelligence ,AcademicSubjects/SCI00980 ,business ,Completeness (statistics) ,computer ,030217 neurology & neurosurgery - Abstract
Owing to the great variety of distinct peptide encodings, working on a biomedical classification task at hand is challenging. Researchers have to determine encodings capable of representing underlying patterns as numerical input for the subsequent machine learning. A general guideline is lacking in the literature; thus, we present here the first large-scale comprehensive study to investigate the performance of a wide range of encodings on multiple datasets from different biomedical domains. For the sake of completeness, we added additional sequence- and structure-based encodings. In particular, we collected 50 biomedical datasets and defined a fixed parameter space for 48 encoding groups, leading to a total of 397 700 encoded datasets. Our results demonstrate that none of the encodings are superior for all biomedical domains. Nevertheless, some encodings often outperform others, thus reducing the initial encoding selection substantially. Our work enables researchers to objectively compare novel encodings to the state of the art. Our findings pave the way for a more sophisticated encoding optimization, for example, as part of automated machine learning pipelines. The work presented here is implemented as a large-scale, end-to-end workflow designed for easy reproducibility and extensibility. All standardized datasets and results are available for download to comply with FAIR standards.
- Published
- 2021
22. Mushroom data creation, curation, and simulation to support classification tasks
- Author
-
Dennis Wagner, Dominik Heider, and Georges Hattab
- Subjects
Databases, Factual ,Classification and taxonomy ,Science ,Discriminant Analysis ,Quality control ,Bayes Theorem ,Pilot Projects ,Scientific data ,Article ,Machine Learning ,Data processing ,Logistic Models ,Medicine ,Computational models ,Computer Simulation ,Data integration ,Agaricales ,Data mining ,Algorithms ,Data Curation - Abstract
Predicting if a set of mushrooms is edible or not corresponds to the task of classifying them into two groups, edible or poisonous, on the basis of a classification rule. To support this binary task, we have collected the largest and most comprehensive attribute-based data available. In this work, we detail the creation, curation and simulation of a data set for binary classification. Thanks to natural language processing, the primary data are based on a text book for mushroom identification and contain 173 species from 23 families. The secondary data comprise simulated or hypothetical entries that are structurally comparable to the 1987 data and serve as pilot data for classification tasks. We evaluated different machine learning algorithms, namely naive Bayes, logistic regression, linear discriminant analysis (LDA), and random forests (RF). We found that the RF provided the best results with a five-fold cross-validation accuracy and F2-score of 1.0 (μ = 1, σ = 0), respectively. The results of our pilot are conclusive and indicate that our data were not linearly separable, unlike the 1987 data, which showed good results using a linear decision boundary with the LDA. Our data set contains 23 families and is the largest available. We further provide a fully reproducible workflow and provide the data under the FAIR principles.
- Published
- 2021
23. Quantifying the information content of lake microbiomes using a machine learning-based framework
- Author
-
Nico Kreuder, Georges Hattab, Jens Boenigk, Theodor Sperlea, Daniela Beisser, and Dominik Heider
- Subjects
business.industry ,Computer science ,Content (measure theory) ,Artificial intelligence ,Machine learning ,computer.software_genre ,business ,computer - Abstract
Background: Bacteria and microbial eukaryotes occupy a wide range of ecological niches and are essential for the functioning of ecosystems. The advent of next-generation sequencing methods enabled the study of environmental microbial community compositions. Yet, many questions regarding the stability and functioning of environmental microbiomes remain open. Results: In the current study, we present a methodological framework to quantify the information shared between the microbial community of a habitat and the abiotic parameters of this habitat. It is built on theoretical considerations of systems ecology and makes use of state-of-the-art machine learning techniques. It can also be used to identify bioindicators. We apply the framework to a dataset containing operational taxonomic units (OTUs) as well as more than twenty physico-chemical and geographic parameters measured in a large-scale survey of European lakes. While a large part of variation (up to 61%) in many physico-chemical parameters can be explained by microbial community composition, some of the examined parameters only share little information with the microbiome. Moreover, we have identified OTUs that act as 'multi-task' bioindicators that could be potential candidates for lake water monitoring schemes. Conclusions: This study demonstrates the benefits of machine learning approaches in microbial ecology. Our results represent, for the first time, a quantification of information shared between the lake microbiome and a wide array of ecosystem parameters. Building on the results and methodology presented here, it will be possible to identify microbial taxa and processes central for the functioning and stability of lake ecosystems.
- Published
- 2020
24. Editorial: Artificial Intelligence Bioinformatics: Development and Application of Tools for Omics and Inter-Omics Studies
- Author
-
Dominik Heider, Davide Chicco, and Angelo Facchiano
- Subjects
bioinformatic ,lcsh:QH426-470 ,Computer science ,inter-omic ,omic ,INF/01 - INFORMATICA ,Genomics ,data mining ,bioinformatics ,inter-omics ,Proteomics ,Omics ,artificial intelligence ,Data science ,omics ,genomic ,lcsh:Genetics ,Editorial ,proteomics ,machine learning ,Genetics ,genomics ,Molecular Medicine ,proteomic ,Genetics (clinical) - Published
- 2020
25. Encodings and models for antimicrobial peptide classification for multi-resistant pathogens
- Author
-
Dominik Heider and Sebastian Spänig
- Subjects
Computer science ,Antimicrobial peptides ,lcsh:Analysis ,Review ,Computational biology ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,03 medical and health sciences ,Encoding (memory) ,Machine learning ,Genetics ,Set (psychology) ,Molecular Biology ,030304 developmental biology ,0303 health sciences ,Sequence ,business.industry ,Deep learning ,030302 biochemistry & molecular biology ,Principal (computer security) ,lcsh:QA299.6-433 ,Antimicrobial ,Computer Science Applications ,Support vector machine ,Computational Mathematics ,Computational Theory and Mathematics ,lcsh:R858-859.7 ,Artificial intelligence ,Encodings ,business - Abstract
Antimicrobial peptides (AMPs) are part of the inherent immune system. In fact, they occur in almost all organisms including, e.g., plants, animals, and humans. Remarkably, they are also effective against multi-resistant pathogens with a high selectivity. This is especially crucial at a time when society is faced with the major threat of an ever-increasing number of antibiotic-resistant microbes. In addition, AMPs can also exhibit antitumor and antiviral effects; thus, a variety of scientific studies dealt with the prediction of active peptides in recent years. Due to their potential, even the pharmaceutical industry is keen on discovering and developing novel AMPs. However, AMPs are difficult to verify in vitro; hence, researchers conduct sequence similarity experiments against known, active peptides. Unfortunately, this approach is very time-consuming and limits potential candidates to sequences with a high similarity to known AMPs. Machine learning methods offer the opportunity to explore the huge space of sequence variations in a timely manner. These algorithms have, in principle, paved the way for an automated discovery of AMPs. However, machine learning models require a numerical input; thus, an informative encoding is very important. Unfortunately, developing an appropriate encoding is a major challenge, which has not been entirely solved so far. For this reason, the development of novel amino acid encodings is established as a stand-alone research branch. The present review introduces state-of-the-art encodings of amino acids as well as their properties in sequence- and structure-based aggregation. Moreover, although a well-chosen encoding is essential, performant classifiers are required, which is reflected by a tendency towards specifically designed models in the literature. Furthermore, we introduce these models with a particular focus on encodings derived from support vector machines and deep learning approaches. 
Albeit a strong focus has been set on AMP predictions, not all of the mentioned encodings have been elaborated as part of antimicrobial research studies, but rather as general protein or peptide representations.
- Published
- 2019
26. Non-invasive assessment of NAFLD as systemic disease—A machine learning perspective
- Author
-
Monika Rau, Ali Canbay, Dominik Heider, Jan-Peter Sowa, Julia Kälsch, Andreas Geier, Hideo A. Baba, Simon Hohenester, Ursula Neumann, and Christian Rust
- Subjects
Male ,Metabolic Analysis ,0301 basic medicine ,Physiology ,Peptide Hormones ,Medizin ,Apoptosis ,Disease ,computer.software_genre ,Biochemistry ,Cohort Studies ,Machine Learning ,0302 clinical medicine ,Non-alcoholic Fatty Liver Disease ,Weight loss ,Immune Physiology ,Medicine and Health Sciences ,Medicine ,Innate Immune System ,Multidisciplinary ,Liver Diseases ,Fatty liver ,Middle Aged ,Bioassays and Physiological Analysis ,Physiological Parameters ,Cohort ,Cytokines ,Female ,030211 gastroenterology & hepatology ,Adiponectin ,Anatomy ,medicine.symptom ,Research Article ,Cohort study ,Adult ,Histology ,Waist ,Science ,Immunology ,Gastroenterology and Hepatology ,Research and Analysis Methods ,Machine learning ,03 medical and health sciences ,Adipokines ,Weight Loss ,Basal Metabolic Rate Measurement ,Humans ,Obesity ,business.industry ,Body Weight ,Computational Biology ,Biology and Life Sciences ,nutritional and metabolic diseases ,Molecular Development ,medicine.disease ,Fibrosis ,Hormones ,Fatty Liver ,030104 developmental biology ,Immune System ,Artificial intelligence ,Steatohepatitis ,business ,computer ,Biomarkers ,Developmental Biology - Abstract
Background & aims Current non-invasive scores for the assessment of severity of non-alcoholic fatty liver disease (NAFLD) and identification of patients with non-alcoholic steatohepatitis (NASH) have insufficient performance to be included in clinical routine. In the current study, we developed a novel machine learning approach to overcome the caveats of existing approaches. Methods Non-invasive parameters were selected by an ensemble feature selection (EFS) from a retrospectively collected training cohort of 164 obese individuals (age: 43.5±10.3y; BMI: 54.1±10.1kg/m²) to develop a model able to predict the histologically assessed NAFLD activity score (NAS). The model was evaluated in an independent validation cohort (122 patients, age: 45.2±11.75y, BMI: 50.8±8.61kg/m²). Results EFS identified age, γGT, HbA1c, adiponectin, and M30 as being highly associated with NAFLD. The model reached a Spearman correlation coefficient with the NAS of 0.46 in the training cohort and was able to differentiate between NAFL (NAS≤4) and NASH (NAS>4) with an AUC of 0.73. In the independent validation cohort, an AUC of 0.7 was achieved for this separation. We further analyzed the potential of the new model for disease monitoring in an obese cohort of 38 patients under lifestyle intervention for one year. While all patients lost weight under intervention, increasing scores were observed in 15 patients. Increasing scores were associated with significantly lower absolute weight loss, lower reduction of waist circumference and basal metabolic rate. Conclusions A newly developed model (http://CHek.heiderlab.de) can predict presence or absence of NASH with reasonable performance. The new score could be used to detect NASH and monitor disease progression or therapy response to weight loss interventions.
- Published
- 2019
27. FRI - Feature Relevance Intervals for Interpretable and Interactive Data Exploration
- Author
-
Ursula Neumann, Dominik Heider, Lukas Pfannschmidt, Barbara Hammer, and Christina Göpfert
- Subjects
Clustering high-dimensional data ,FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer science ,Linear classifier ,Feature selection ,Machine Learning (stat.ML) ,Machine learning ,computer.software_genre ,01 natural sciences ,Computer Science - Information Retrieval ,Machine Learning (cs.LG) ,010104 statistics & probability ,03 medical and health sciences ,feature selection ,Statistics - Machine Learning ,Feature relevance ,0101 mathematics ,Spurious relationship ,030304 developmental biology ,Interpretability ,computer.programming_language ,interactive biomarker discovery ,0303 health sciences ,business.industry ,Python (programming language) ,Batch processing ,Artificial intelligence ,business ,interpretability ,computer ,global feature relevance ,Information Retrieval (cs.IR) - Abstract
Most existing feature selection methods are insufficient for analytic purposes as soon as high dimensional data or redundant sensor signals are dealt with since features can be selected due to spurious effects or correlations rather than causal effects. To support the finding of causal features in biomedical experiments, we hereby present FRI, an open source Python library that can be used to identify all-relevant variables in linear classification and (ordinal) regression problems. Using the recently proposed feature relevance method, FRI is able to provide the base for further general experimentation or in specific can facilitate the search for alternative biomarkers. It can be used in an interactive context, by providing model manipulation and visualization methods, or in a batch process as a filter method. Comment: Accepted for CIBCB 2019 (https://cibcb2019.icas.xyz/)
- Published
- 2019
28. SEDE-GPS: socio-economic data enrichment based on GPS information
- Author
-
Theodor Sperlea, Jens Boenigk, Stefan Füser, and Dominik Heider
- Subjects
0301 basic medicine ,Ecology (disciplines) ,GPS ,Biodiversity ,lcsh:Computer applications to medicine. Medical informatics ,Biochemistry ,Freshwater ecosystem ,Environmental data ,Database ,Microbial ecology ,Machine Learning ,03 medical and health sciences ,Structural Biology ,Ecosystem ,lcsh:QH301-705.5 ,Molecular Biology ,Environmental quality ,Ecology ,business.industry ,Applied Mathematics ,Microbiota ,Environmental resource management ,Lake ecosystem ,Data enrichment ,Computer Science Applications ,Lakes ,030104 developmental biology ,Geography ,lcsh:Biology (General) ,Socioeconomic Factors ,Geographic Information Systems ,lcsh:R858-859.7 ,business ,Biologie ,Software ,Algorithms - Abstract
Background Microbes are essential components of all ecosystems because they drive many biochemical processes and act as primary producers. In freshwater ecosystems, the biodiversity in and the composition of microbial communities can be used as indicators for environmental quality. Recently, some environmental features have been identified that influence microbial ecosystems. However, the impact of human action on lake microbiomes is not well understood. This is, in part, due to the fact that environmental data is, albeit theoretically accessible, not easily available. Results In this work, we present SEDE-GPS, a tool that gathers data that are relevant to the environment of a user-provided GPS coordinate. To this end, it accesses a list of public and corporate databases and aggregates the information in a single file, which can be used for further analysis. To showcase the use of SEDE-GPS, we enriched a lake microbial ecology sequencing dataset with around 18,000 socio-economic, climate, and geographic features. The sources of SEDE-GPS are public databases such as Eurostat, the Climate Data Center, and OpenStreetMap, as well as corporate sources such as Twitter. Using machine learning and feature selection methods, we were able to identify features in the data provided by SEDE-GPS that can be used to predict lake microbiome alpha diversity. Conclusion The results presented in this study show that SEDE-GPS is a handy and easy-to-use tool for comprehensive data enrichment for studies of ecology and other processes that are affected by environmental features. Furthermore, we present lists of environmental, socio-economic, and climate features that are predictive for microbial biodiversity in lake ecosystems. These lists indicate that human action has a major impact on lake microbiomes. 
SEDE-GPS and its source code are available for download at http://SEDE-GPS.heiderlab.de. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2419-4) contains supplementary material, which is available to authorized users.
- Published
- 2018
29. GUESS: projecting machine learning scores to well-calibrated probability estimates for clinical decision-making
- Author
-
Johanna Schwarz and Dominik Heider
- Subjects
Statistics and Probability ,Computer science ,Bayesian probability ,Clinical Decision-Making ,Machine learning ,computer.software_genre ,Biochemistry ,Clinical decision support system ,Machine Learning ,03 medical and health sciences ,Bayes' theorem ,Molecular Biology ,030304 developmental biology ,Probability ,0303 health sciences ,business.industry ,030302 biochemistry & molecular biology ,Probabilistic logic ,Bayes Theorem ,Class discrimination ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,Calibration ,Artificial intelligence ,business ,computer ,Classifier (UML) ,Quantile - Abstract
Motivation Clinical decision support systems have been applied in numerous fields, ranging from cancer survival toward drug resistance prediction. Nevertheless, clinical decision support systems typically have a caveat: many of them are perceived as black-boxes by non-experts and, unfortunately, the obtained scores cannot usually be interpreted as class probability estimates. In probability-focused medical applications, it is not sufficient to perform well with regards to discrimination and, consequently, various calibration methods have been developed to enable probabilistic interpretation. The aims of this study were (i) to develop a tool for fast and comparative analysis of different calibration methods, (ii) to demonstrate their limitations for the use on clinical data and (iii) to introduce our novel method GUESS. Results We compared the performances of two different state-of-the-art calibration methods, namely histogram binning and Bayesian Binning in Quantiles, as well as our novel method GUESS on both, simulated and real-world datasets. GUESS demonstrated calibration performance comparable to the state-of-the-art methods and always retained accurate class discrimination. GUESS showed superior calibration performance in small datasets and therefore may be an optimal calibration method for typical clinical datasets. Moreover, we provide a framework (CalibratR) for R, which can be used to identify the most suitable calibration method for novel datasets in a timely and efficient manner. Using calibrated probability estimates instead of original classifier scores will contribute to the acceptance and dissemination of machine learning based classification models in cost-sensitive applications, such as clinical research. Availability and implementation GUESS as part of CalibratR can be downloaded at CRAN.
- Published
- 2018
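Histogram binning, one of the two baseline calibration methods named in the abstract of entry 29, can be illustrated with a minimal sketch. This is not the GUESS method and not the CalibratR API; the function names `fit_histogram_binning` and `calibrate`, the equal-width bins, and the toy scores are assumptions made for illustration only.

```python
def fit_histogram_binning(scores, labels, n_bins=4):
    """Learn one calibrated probability per equal-width score bin.

    Assumes scores lie in [0, 1] and labels are 0 or 1.
    """
    counts = [0] * n_bins
    positives = [0] * n_bins
    for score, label in zip(scores, labels):
        # Map the score to its bin; a score of exactly 1.0 goes to the last bin.
        b = min(int(score * n_bins), n_bins - 1)
        counts[b] += 1
        positives[b] += label
    # Calibrated probability = fraction of positives observed in the bin.
    # Empty bins fall back to the bin midpoint (an illustrative choice).
    return [
        positives[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(score, bin_probs):
    """Replace a raw classifier score by the probability of its bin."""
    n_bins = len(bin_probs)
    return bin_probs[min(int(score * n_bins), n_bins - 1)]


# Toy calibration set (hypothetical values).
scores = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.85, 0.95]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
bin_probs = fit_histogram_binning(scores, labels, n_bins=4)
```

With these toy values, `calibrate(0.9, bin_probs)` returns the positive fraction of the top bin, so a raw score is mapped to an empirical class frequency rather than being read as a probability directly.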
30. Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach
- Author
-
Ursula Neumann, Mona Riemenschneider, Jan-Peter Sowa, Theodor Baars, Julia Kälsch, Ali Canbay, and Dominik Heider
- Subjects
Research ,Ensemble learning ,Machine learning ,Feature selection ,Biomarker discovery ,Random forest - Abstract
Motivation Biomarker discovery methods are essential to identify a minimal subset of features (e.g., serum markers in predictive medicine) that are relevant to develop prediction models with high accuracy. By now, there exist diverse feature selection methods, which either are embedded, combined, or independent of predictive learning algorithms. Many preceding studies showed the defectiveness of single feature selection results, which cause difficulties for professionals in a variety of fields (e.g., medical practitioners) to analyze and interpret the obtained feature subsets. Whereas each of these methods is highly biased, an ensemble feature selection has the advantage to alleviate and compensate for such biases. Concerning the reliability, validity, and reproducibility of these methods, we examined eight different feature selection methods for binary classification datasets and developed an ensemble feature selection system. Results By using an ensemble of feature selection methods, a quantification of the importance of the features could be obtained. The prediction models that have been trained on the selected features showed improved prediction performance. Electronic supplementary material The online version of this article (doi:10.1186/s13040-016-0114-4) contains supplementary material, which is available to authorized users.
- Published
- 2016
31. Exploiting HIV-1 protease and reverse transcriptase cross-resistance information for improved drug resistance prediction by means of multi-label classification
- Author
-
Mona Riemenschneider, Robin Senge, Ursula Neumann, Eyke Hüllermeier, and Dominik Heider
- Subjects
Retrovirus ,Machine learning ,Short Report ,Infectious diseases ,HIV therapy - Abstract
Background Antiretroviral therapy is essential for human immunodeficiency virus (HIV) infected patients to inhibit viral replication and therewith to slow progression of disease and prolong a patient’s life. However, the high mutation rate of HIV can lead to a fast adaptation of the virus under drug pressure and thereby to the evolution of resistant variants. In turn, these variants will lead to the failure of antiretroviral treatment. Moreover, these mutations cannot only lead to resistance against single drugs, but also to cross-resistance, i.e., resistance against drugs that have not yet been applied. Methods 662 protease sequences and 715 reverse transcriptase sequences with complete resistance profiles were analyzed using machine learning techniques, namely binary relevance classifiers, classifier chains, and ensembles of classifier chains. Results In our study, we applied multi-label classification models incorporating cross-resistance information to predict drug resistance for two of the major drug classes used in antiretroviral therapy for HIV-1, namely protease inhibitors (PIs) and non-nucleoside reverse transcriptase inhibitors (NNRTIs). By means of multi-label learning, namely classifier chains (CCs) and ensembles of classifier chains (ECCs), we were able to improve overall prediction accuracy for all drugs compared to hitherto applied binary classification models. Conclusions The development of fast and precise models to predict drug resistance in HIV-1 is highly important to enable a highly effective personalized therapy. Cross-resistance information can be exploited to improve prediction accuracy of computational drug resistance models. Electronic supplementary material The online version of this article (doi:10.1186/s13040-016-0089-1) contains supplementary material, which is available to authorized users.
- Published
- 2015
32. Dynamic causal modeling with genetic algorithms
- Author
-
Andreas Jansen, Sascha Hauke, Martin Pyka, Tilo Kircher, and Dominik Heider
- Subjects
Fitness landscape ,Models, Neurological ,Crossover ,Bayesian inference ,Machine learning ,computer.software_genre ,Genetic algorithm ,Genetics ,Image Processing, Computer-Assisted ,Animals ,Humans ,Computer Simulation ,Selection (genetic algorithm) ,Mathematics ,Causal model ,Fitness function ,business.industry ,General Neuroscience ,Brain ,Bayes Theorem ,Magnetic Resonance Imaging ,Oxygen ,Nonlinear Dynamics ,Mutation (genetic algorithm) ,Artificial intelligence ,business ,Biologie ,computer ,Algorithms - Abstract
In the last years, dynamic causal modeling has gained increased popularity in the neuroimaging community as an approach for the estimation of effective connectivity from functional magnetic resonance imaging (fMRI) data. The algorithm calls for an a priori defined model, whose parameter estimates are subsequently computed upon the given data. As the number of possible models increases exponentially with additional areas, it rapidly becomes inefficient to compute parameter estimates for all models in order to reveal the family of models with the highest posterior probability. In the present study, we developed a genetic algorithm for dynamic causal models and investigated whether this evolutionary approach can accelerate the model search. In this context, the configuration of the intrinsic, extrinsic and bilinear connection matrices represents the genetic code and Bayesian model selection serves as a fitness function. Using crossover and mutation, populations of models are created and compared with each other. The most probable ones survive the current generation and serve as a source for the next generation of models. Tests with artificially created data sets show that the genetic algorithm approximates the most plausible models faster than a random-driven brute-force search. The fitness landscape revealed by the genetic algorithm indicates that dynamic causal modeling has excellent properties for evolution-driven optimization techniques.
- Published
- 2011
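The search loop described in the abstract of entry 32 (binary connection matrices as genetic code, a fitness function, crossover, mutation, survival of the most probable models) can be sketched as follows. This is a generic genetic algorithm under stated assumptions, not the authors' implementation: Bayesian model selection is replaced by a toy fitness function, and `genetic_search`, `toy_fitness`, and `TARGET` are hypothetical names.

```python
import random


def genetic_search(fitness, n_bits, pop_size=20, generations=40,
                   mutation_rate=0.05, seed=0):
    """Evolve bit-string genomes; return the best genome and a fitness history."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Rank models by fitness; the better half survives unchanged.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            # One-point crossover of the two parent genomes.
            cut = rng.randint(1, n_bits - 1)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness), history


# Toy stand-in for model evidence: agreement with a hypothetical
# "true" connectivity pattern (1 = connection present).
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]


def toy_fitness(genome):
    return sum(g == t for g, t in zip(genome, TARGET))


best, history = genetic_search(toy_fitness, n_bits=len(TARGET))
```

Because the surviving half is carried over unchanged, the best fitness per generation never decreases; in a real setting each fitness evaluation would be a model estimation, which is the expensive step the abstract aims to reduce.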
33. Insights into the classification of small GTPases
- Author
-
Martin Pyka, Sascha Hauke, Dominik Heider, and Daniel Kessler
- Subjects
Random Forests ,Software tool ,Medizin ,Computational biology ,GTPase ,Guanosine triphosphate ,Biology ,computer.software_genre ,Biochemistry, Genetics and Molecular Biology (miscellaneous) ,Biochemistry ,Genome ,lcsh:Biochemistry ,chemistry.chemical_compound ,cancer ,lcsh:QD415-436 ,lcsh:QH301-705.5 ,Original Research ,SUPERFAMILY ,proteins ,Computer Science Applications ,Informatik ,machine learning ,classification ,chemistry ,lcsh:Biology (General) ,Chemistry (miscellaneous) ,Advances and Applications in Bioinformatics and Chemistry ,Data mining ,Biologie ,computer ,Intracellular transport - Abstract
Dominik Heider (1), Sascha Hauke (3), Martin Pyka (4), Daniel Kessler (2). (1) Department of Bioinformatics, Center for Medical Biotechnology, (2) Institute of Cell Biology (Cancer Research), University of Duisburg-Essen, Essen, Germany; (3) Institute of Computer Science, University of Münster, Münster, Germany; (4) Interdisciplinary Center for Clinical Research, University Hospital of Münster, Münster, Germany.
Abstract: In this study we used a Random Forest-based approach for an assignment of small guanosine triphosphate proteins (GTPases) to specific subgroups. Small GTPases represent an important functional group of proteins that serve as molecular switches in a wide range of fundamental cellular processes, including intracellular transport, movement and signaling events. These proteins have further gained a special emphasis in cancer research, because within the last decades a huge variety of small GTPases from different subgroups could be related to the development of all types of tumors. Using a random forest approach, we were able to identify the most important amino acid positions for the classification process within the small GTPases superfamily and its subgroups. These positions are in line with the results of earlier studies and have been shown to be the essential elements for the different functionalities of the GTPase families. Furthermore, we provide an accurate and reliable software tool (GTPasePred) to identify potential novel GTPases and demonstrate its application to genome sequences.
Keywords: cancer, machine learning, classification, Random Forests, proteins
- Published
- 2010
34. On the Application of Supervised Machine Learning to Trustworthiness Assessment
- Author
-
Dominik Heider, Sebastian Biedermann, Sascha Hauke, and Max Mühlhäuser
- Subjects
business.industry ,Computer science ,media_common.quotation_subject ,Supervised learning ,Bayesian probability ,Probabilistic logic ,Trusted Computing ,Machine learning ,computer.software_genre ,Data modeling ,Generalizability theory ,Artificial intelligence ,business ,Complex adaptive system ,Biologie ,computer ,Reputation ,media_common - Abstract
State-of-the-art trust and reputation systems seek to apply machine learning methods to overcome generalizability issues of experience-based Bayesian trust assessment. These approaches are, however, often model-centric instead of focussing on data and the complex adaptive system that is driven by reputation-based service selection. This entails the risk of unrealistic model assumptions. We outline the requirements for robust probabilistic trust assessment using supervised learning and apply a selection of estimators to a real-world dataset, in order to show the effectiveness of supervised methods. Furthermore, we provide a representational mapping of estimator output to a belief logic representation for the modular integration of supervised methods with other trust assessment methodologies.
- Published
- 2013
35. Machine learning on normalized protein sequences
- Author
-
Dominik Heider, Daniel Hoffmann, and Jens Verheyen
- Subjects
Normalization (statistics) ,Current (mathematics) ,Computer science ,Forschungszentren » Zentrum für Medizinische Biotechnologie (ZMB) ,Short Report ,Medizin ,lcsh:Medicine ,Linear interpolation ,Machine learning ,computer.software_genre ,General Biochemistry, Genetics and Molecular Biology ,ddc:570 ,lcsh:Science (General) ,lcsh:QH301-705.5 ,Statistical hypothesis testing ,Medicine(all) ,Sequence ,Biochemistry, Genetics and Molecular Biology(all) ,business.industry ,lcsh:R ,General Medicine ,Variable length ,Random forest ,Informatik ,lcsh:Biology (General) ,Artificial intelligence ,business ,computer ,Biologie ,Interpolation ,lcsh:Q1-390 - Abstract
Background Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths. Findings We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%. Conclusions We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length.
- Published
- 2011
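The length normalization described in the abstract of entry 35 can be illustrated with a minimal sketch. It assumes each protein sequence has already been encoded as a list of numeric descriptor values, one per residue; the abstract does not state the encoding, so that step is an assumption here. The function name `normalize_length` and its parameters are hypothetical and are not taken from the paper or its software.

```python
def normalize_length(values, target_len):
    """Resample a numeric sequence to target_len points by linear interpolation.

    Assumption: `values` holds one numeric descriptor per residue; the
    encoding itself is outside the scope of this sketch.
    """
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if not values:
        raise ValueError("values must not be empty")
    n = len(values)
    # Degenerate cases: nothing to interpolate between, so repeat the first value.
    if n == 1 or target_len == 1:
        return [float(values[0])] * target_len
    out = []
    for i in range(target_len):
        # Position of output point i on the original index axis [0, n - 1].
        pos = i * (n - 1) / (target_len - 1)
        left = int(pos)
        right = min(left + 1, n - 1)
        frac = pos - left
        # Weighted average of the two neighbouring original values.
        out.append(values[left] * (1 - frac) + values[right] * frac)
    return out


# Stretch a two-point sequence to five points and shrink a five-point one to three.
stretched = normalize_length([0.0, 10.0], 5)
shrunk = normalize_length([0.0, 1.0, 2.0, 3.0, 4.0], 3)
```

Every sequence resampled this way has the same number of features, so sequences of different original lengths can be passed to a fixed-input classifier such as a random forest.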
36. Transfer learning compensates limited data, batch effects and technological heterogeneity in single-cell sequencing
- Author
-
Youngjun Park, Dominik Heider, and Anne-Christin Hauschild
- Subjects
AcademicSubjects/SCI01140 ,AcademicSubjects/SCI01060 ,Research areas ,Computer science ,Big data ,AcademicSubjects/SCI00030 ,Standard Article ,Machine learning ,computer.software_genre ,AcademicSubjects/SCI01180 ,03 medical and health sciences ,0302 clinical medicine ,Cancer genome ,030304 developmental biology ,0303 health sciences ,business.industry ,Scale (chemistry) ,General Medicine ,Single cell sequencing ,Sample size determination ,Pattern recognition (psychology) ,Artificial intelligence ,AcademicSubjects/SCI00980 ,business ,Transfer of learning ,computer ,030217 neurology & neurosurgery - Abstract
Tremendous advances in next-generation sequencing technology have enabled the accumulation of large amounts of omics data in various research areas over the past decade. However, study limitations due to small sample sizes, especially in rare disease clinical research, technological heterogeneity and batch effects limit the applicability of traditional statistics and machine learning analysis. Here, we present a meta-transfer learning approach to transfer knowledge from big data and reduce the search space in data with small sample sizes. Few-shot learning algorithms integrate meta-learning to overcome data scarcity and data heterogeneity by transferring molecular pattern recognition models from datasets of unrelated domains. We explore few-shot learning models with the large-scale public datasets TCGA (The Cancer Genome Atlas) and GTEx, and demonstrate their potential as pre-training datasets in other molecular pattern recognition tasks. Our results show that meta-transfer learning is very effective for datasets with a limited sample size. Furthermore, we show that our approach can transfer knowledge across technological heterogeneity, for example, from bulk-cell to single-cell data. Our approach can overcome study size constraints, batch effects and technical limitations in analyzing single-cell data by leveraging existing bulk-cell sequencing data.
Catalog
Discovery Service for Jio Institute Digital Library