Search Results (29 results)
2. Variable selection for binary classification using error rate p-values applied to metabolomics data.
- Author: van Reenen, Mari, Reinecke, Carolus J., Westerhuis, Johan A., and Venter, J. Hendrik
- Subjects: METABOLOMICS, ERROR rates, P-value (Statistics), DATA, MYCOBACTERIUM tuberculosis
- Abstract:
Background: Metabolomics datasets are often high-dimensional though only a limited number of variables are expected to be informative given a specific research question. The important task of selecting informative variables can therefore become complex. In this paper we look at discriminating between two groups. Two tasks need to be performed: (i) finding variables which differ between the two groups; and (ii) determining how the selected variables can be used to classify new subjects. We introduce an approach using minimum classification error rates as test statistics to find discriminatory and therefore informative variables. The thresholds resulting in the minimum error rates can be used to classify new subjects. This approach transforms error rates into p-values and is referred to as ERp. Results: We show that non-parametric hypothesis testing, based on minimum classification error rates as test statistics, can find statistically significantly shifted variables. The discriminatory ability of variables becomes more apparent when error rates are evaluated based on their corresponding p-values, as relatively high error rates can still be statistically significant. ERp can handle unequal and small group sizes, as well as account for the cost of misclassification. ERp retains (if known) or reveals (if unknown) the shift direction, aiding in biological interpretation. The threshold resulting in the minimum error rate can immediately be used to classify new subjects. We use NMR generated metabolomics data to illustrate how ERp is able to discriminate subjects diagnosed with Mycobacterium tuberculosis infected meningitis from a control group. The list of discriminatory variables produced by ERp contains all biologically relevant variables with appropriate shift directions discussed in the original paper from which this data is taken. Conclusions: ERp performs variable selection and classification, is non-parametric and aids biological interpretation while handling unequal group sizes and misclassification costs. All this is achieved by a single approach which is easy to perform and interpret. ERp has the potential to address many other characteristics of metabolomics data. Future research aims to extend ERp to account for a large proportion of observations below the detection limit, as well as expand on interactions between variables. [ABSTRACT FROM AUTHOR]
- Published: 2016
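A rough illustration of the idea described in the ERp abstract above, not the authors' exact procedure: for a single variable, the minimum misclassification error over all candidate thresholds serves as the test statistic, and a permutation p-value is attached to it. The data, group sizes, and permutation count below are made up, and misclassification costs are ignored.

```python
import numpy as np

def min_error_rate(values, labels):
    """Smallest misclassification error over all thresholds on a single variable.

    Both shift directions (group 1 above or below the threshold) are tried."""
    order = np.argsort(values)
    v, y = values[order], labels[order]
    thresholds = np.concatenate(([v[0] - 1], (v[:-1] + v[1:]) / 2, [v[-1] + 1]))
    best = 1.0
    for t in thresholds:
        pred_up = (v > t).astype(int)          # group 1 assumed shifted upwards
        for pred in (pred_up, 1 - pred_up):    # also try the downward shift
            best = min(best, np.mean(pred != y))
    return best

def permutation_p_value(values, labels, n_perm=2000, rng=None):
    """Permutation p-value for the minimum-error-rate statistic (smaller error = more extreme)."""
    rng = np.random.default_rng(rng)
    observed = min_error_rate(values, labels)
    null = np.array([min_error_rate(values, rng.permutation(labels))
                     for _ in range(n_perm)])
    return observed, (1 + np.sum(null <= observed)) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 20), rng.normal(1.0, 1, 12)])  # unequal group sizes
    y = np.array([0] * 20 + [1] * 12)
    print(permutation_p_value(x, y, rng=1))
```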
3. Integrating gene expression and protein-protein interaction network to prioritize cancer-associated genes.
- Author: Chao Wu, Jun Zhu, and Xuegong Zhang
- Subjects: CANCER, GENE expression, PROTEINS, CELLS, DATA
- Abstract:
Background: To understand the roles they play in complex diseases, genes need to be investigated in the networks they are involved in. Integration of gene expression and network data is a promising approach to prioritize disease-associated genes. Some methods have been developed in this field, but the problem is still far from being solved. Results: In this paper, we developed a method, Networked Gene Prioritizer (NGP), to prioritize cancer-associated genes. Applications on several breast cancer and lung cancer datasets demonstrated that NGP performs better than the existing methods. It provides stable top-ranking genes between independent datasets. The top-ranked genes by NGP are enriched in the cancer-associated pathways. The top-ranked genes by NGP (PLK1, MCM2, MCM3, MCM7, MCM10 and SKP2) might coordinate to promote cell cycle-related processes in cancer cells but not in normal cells. Conclusions: In this paper, we have developed a method named NGP to prioritize cancer-associated genes. Our results demonstrated that NGP performs better than the existing methods. [ABSTRACT FROM AUTHOR]
- Published: 2012
4. A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations.
- Author: Kiiveri, Harri T.
- Subjects: BIOTECHNOLOGY, BIOLOGY, DATA, LOGISTIC regression analysis, LINEAR statistical models
- Abstract:
Background: With the advent of high throughput biotechnology data acquisition platforms such as microarrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem akin to finding needles in a haystack. Whilst methods for a number of response types have been developed, a general approach has been lacking. Results: The major contribution of this paper is to present a unified methodology which allows many common (statistical) response models to be fitted to such data sets. The class of models includes virtually any model with a linear predictor in it, for example (but not limited to), multiclass logistic regression (classification), generalised linear models (regression) and survival models. A fast algorithm for finding sparse, well-fitting models is presented. The ideas are illustrated on real data sets with numbers of variables ranging from thousands to millions. R code implementing the ideas is available for download. Conclusion: The method described in this paper enables existing work on response models when there are fewer variables than observations to be leveraged to the situation when there are many more variables than observations. It is a powerful approach to finding parsimonious models for such datasets. The method is capable of handling problems with millions of variables and a large variety of response types within the one framework. The method compares favourably to existing methods such as support vector machines and random forests, but has the advantage of not requiring separate variable selection steps. It also works for data types which these methods were not designed to handle. The method usually produces very sparse models which make biological interpretation simpler and more focused. [ABSTRACT FROM AUTHOR]
- Published: 2008
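The paper above describes a unified sparse-fitting algorithm for models with a linear predictor; as a loosely analogous illustration only (not the author's algorithm or R code), an L1-penalized logistic regression performs simultaneous model fitting and variable elimination when p is much larger than n. The data and penalty strength below are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 60, 5000                      # many more variables than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                       # only the first 5 variables carry signal
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

# The L1 penalty drives most coefficients exactly to zero, eliminating variables
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

selected = np.flatnonzero(model.coef_[0])
print("non-zero coefficients:", selected.size)
print("indices of selected variables:", selected[:20])
```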
5. High-throughput cancer hypothesis testing with an integrated PhysiCell-EMEWS workflow.
- Author: Ozik, Jonathan, Collier, Nicholson, Wozniak, Justin M., Macal, Charles, Cockrell, Chase, Friedman, Samuel H., Ghaffarizadeh, Ahmadreza, Heiland, Randy, An, Gary, and Macklin, Paul
- Subjects: IMMUNOTHERAPY, CANCER, CLINICAL immunology, TUMORS, DATA
- Abstract:
Background: Cancer is a complex, multiscale dynamical system, with interactions between tumor cells and non-cancerous host systems. Therapies act on this combined cancer-host system, sometimes with unexpected results. Systematic investigation of mechanistic computational models can augment traditional laboratory and clinical studies, helping identify the factors driving a treatment's success or failure. However, given the uncertainties regarding the underlying biology, these multiscale computational models can take many potential forms, in addition to encompassing high-dimensional parameter spaces. Therefore, the exploration of these models is computationally challenging. We propose that integrating two existing technologies—one to aid the construction of multiscale agent-based models, the other developed to enhance model exploration and optimization—can provide a computational means for high-throughput hypothesis testing, and eventually, optimization. Results: In this paper, we introduce a high throughput computing (HTC) framework that integrates a mechanistic 3-D multicellular simulator (PhysiCell) with an extreme-scale model exploration platform (EMEWS) to investigate high-dimensional parameter spaces. We show early results in applying PhysiCell-EMEWS to 3-D cancer immunotherapy and show insights on therapeutic failure. We describe a generalized PhysiCell-EMEWS workflow for high-throughput cancer hypothesis testing, where hundreds or thousands of mechanistic simulations are compared against data-driven error metrics to perform hypothesis optimization. Conclusions: While key notational and computational challenges remain, mechanistic agent-based models and high-throughput model exploration environments can be combined to systematically and rapidly explore key problems in cancer. These high-throughput computational experiments can improve our understanding of the underlying biology, drive future experiments, and ultimately inform clinical practice. [ABSTRACT FROM AUTHOR]
- Published: 2018
6. EpiViewer: an epidemiological application for exploring time series data.
- Author: Thorve, Swapna, Wilson, Mandy L., Lewis, Bryan L., Swarup, Samarth, Vullikanti, Anil Kumar S., and Marathe, Madhav V.
- Subjects: TIME series analysis, DATA, SIMULATION methods & models, EPIDEMICS, HOSPITAL care
- Abstract:
Background: Visualization plays an important role in epidemic time series analysis and forecasting. Viewing time series data plotted on a graph can help researchers identify anomalies and unexpected trends that could be overlooked if the data were reviewed in tabular form; these details can influence a researcher's recommended course of action or choice of simulation models. However, there are challenges in reviewing data sets from multiple data sources – data can be aggregated in different ways (e.g., incidence vs. cumulative), measure different criteria (e.g., infection counts, hospitalizations, and deaths), or represent different geographical scales (e.g., nation, HHS Regions, or states), which can make a direct comparison between time series difficult. In the face of an emerging epidemic, the ability to visualize time series from various sources and organizations and to reconcile these datasets based on different criteria could be key in developing accurate forecasts and identifying effective interventions. Many tools have been developed for visualizing temporal data; however, none yet supports all the functionality needed for easy collaborative visualization and analysis of epidemic data. Results: In this paper, we present EpiViewer, a time series exploration dashboard where users can upload epidemiological time series data from a variety of sources and compare, organize, and track how data evolves as an epidemic progresses. EpiViewer provides an easy-to-use web interface for visualizing temporal datasets either as line charts or bar charts. The application provides enhanced features for visual analysis, such as hierarchical categorization, zooming, and filtering, to enable detailed inspection and comparison of multiple time series on a single canvas. Finally, EpiViewer provides several built-in statistical Epi-features to help users interpret the epidemiological curves. Conclusion: EpiViewer is a single page web application that provides a framework for exploring, comparing, and organizing temporal datasets. It offers a variety of features for convenient filtering and analysis of epicurves based on meta-attribute tagging. EpiViewer also provides a platform for sharing data between groups for better comparison and analysis. Our user study demonstrated that EpiViewer is easy to use and fills a particular niche in the toolspace for visualization and exploration of epidemiological data. [ABSTRACT FROM AUTHOR]
- Published: 2018
7. Prediction of scaffold proteins based on protein interaction and domain architectures.
- Author: Kimin Oh and Gwan-Su Yi
- Subjects: SCAFFOLD proteins, PROTEIN-protein interactions, CELL physiology, COMPUTATIONAL biology, DATA
- Abstract:
Background: Scaffold proteins are known for being crucial regulators of various cellular functions by assembling multiple proteins involved in signaling and metabolic pathways. Identification of scaffold proteins and the study of their molecular mechanisms can open a new aspect of cellular systemic regulation, and the results can be applied in the fields of medicine and engineering. Although the regulatory roles of dozens of scaffold proteins have been highlighted, only one computational approach to find scaffold proteins from interactomes has been carried out so far. However, there were limitations in finding diverse types of scaffold proteins because its criteria were restricted to the classical scaffold proteins. In this paper, we suggest a systematic approach to predict massive scaffold proteins from interactomes and to characterize the roles of scaffold proteins comprehensively. Results: We classified a total of 10,419 basic scaffold protein candidates in protein interactomes into three classes according to the structural evidence for scaffolding, such as domain architectures, domain interactions and protein complexes. Finally, we could define 2716 highly reliable scaffold protein candidates and characterize their functional features. To assess the accuracy of our prediction, gold standard positive and negative data sets were constructed. We prepared 158 gold standard positive data and 844 gold standard negative data based on the functional information from the Gene Ontology consortium. The precision, sensitivity and specificity of our testing were 80.3%, 51.0%, and 98.5%, respectively. Through the function enrichment analysis of highly reliable scaffold proteins, we could confirm the significantly enriched functions that are related to scaffold protein binding. We also identified functional associations between scaffold proteins and their recruited proteins. Furthermore, we checked that the disease association of scaffold proteins is higher than that of kinases. Conclusions: In conclusion, we could predict a larger volume of scaffold proteins and analyzed their functional characteristics. A deeper understanding of the roles of scaffold proteins from this study will provide a greater opportunity to find therapeutic or engineering applications of scaffold proteins using their functional characteristics. [ABSTRACT FROM AUTHOR]
- Published: 2016
8. Heterodimeric protein complex identification by naïve Bayes classifiers.
- Author: Maruyama, Osamu
- Subjects: PROTEINS, COMPLEX compounds, NAIVE Bayes classification, DATA, MAXIMUM likelihood statistics
- Abstract:
Background: Protein complexes are basic cellular entities that carry out the functions of their components. In databases of yeast protein complexes such as CYC2008, the major type of known protein complex is the heterodimeric complex. Although a number of methods have been proposed to predict sets of proteins that form arbitrary types of protein complexes simultaneously, they often fail to predict heterodimeric complexes. Results: In this paper, we have designed several features characterizing heterodimeric protein complexes based on genomic data sets, and proposed a supervised-learning method for the prediction of heterodimeric protein complexes. This method learns the parameters of the features, which are embedded in the naïve Bayes classifier. The log-likelihood ratio derived from the naïve Bayes classifier with the parameter values obtained by maximum likelihood estimation gives the score of a given pair of proteins to predict whether the pair is a heterodimeric complex or not. A five-fold cross-validation shows good performance on yeast. The trained classifiers also show higher predictability than various existing algorithms on yeast data sets with approximate and exact matching criteria. Conclusions: Heterodimeric protein complex prediction is a rather harder problem than heteromeric protein complex prediction because a heterodimeric complex is topologically simpler. However, it turns out that by designing features specialized for heterodimeric protein complexes, their predictability can be improved. Thus, the design of more sophisticated features for heterodimeric protein complexes, as well as the accumulation of more accurate and useful genome-wide data sets, will lead to higher predictability of heterodimeric protein complexes. [ABSTRACT FROM AUTHOR]
- Published: 2013
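The abstract above scores a protein pair by the log-likelihood ratio from a naïve Bayes classifier with maximum likelihood parameter estimates. A minimal generic sketch of that scoring scheme follows; the binary features, toy training pairs, and Laplace smoothing are illustrative assumptions, not the paper's actual feature set.

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Laplace-smoothed ML estimates of P(feature = 1 | class) for binary features."""
    X, y = np.asarray(X, float), np.asarray(y)
    theta1 = (X[y == 1].sum(0) + alpha) / (np.sum(y == 1) + 2 * alpha)
    theta0 = (X[y == 0].sum(0) + alpha) / (np.sum(y == 0) + 2 * alpha)
    return theta0, theta1

def log_likelihood_ratio(x, theta0, theta1):
    """Naive Bayes score: log P(x | complex) - log P(x | non-complex)."""
    x = np.asarray(x, float)
    ll1 = np.sum(x * np.log(theta1) + (1 - x) * np.log(1 - theta1))
    ll0 = np.sum(x * np.log(theta0) + (1 - x) * np.log(1 - theta0))
    return ll1 - ll0

# Toy data: rows are protein pairs, columns are hypothetical binary features
# (e.g. "co-expressed", "shared localization", "known domain interaction").
X = [[1, 1, 0], [1, 1, 1], [0, 1, 1], [0, 0, 0], [1, 0, 0], [0, 0, 1]]
y = [1, 1, 1, 0, 0, 0]            # 1 = known heterodimeric complex
theta0, theta1 = fit_naive_bayes(X, y)
print(log_likelihood_ratio([1, 1, 0], theta0, theta1))  # positive => predicted complex
```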
9. Building Markov state models with solvent dynamics.
- Author: Chen Gu, Huang-Wei Chang, Maibaum, Lutz, Pande, Vijay S., Carlsson, Gunnar E., and Guibas, Leonidas J.
- Subjects: MARKOV processes, SOLVENTS, MACROMOLECULES, DATA, CARBON nanotubes
- Abstract:
Background: Markov state models have been widely used to study conformational changes of biological macromolecules. These models are built from short timescale simulations and then propagated to extract long timescale dynamics. However, the solvent information in molecular simulations is often ignored in current methods, because of the large number of solvent molecules in a system and the indistinguishability of solvent molecules upon their exchange. Methods: We present a solvent signature that compactly summarizes the solvent distribution in the high-dimensional data, and then define a distance metric between different configurations using this signature. We next incorporate the solvent information into the construction of Markov state models and present a fast geometric clustering algorithm which combines both the solute-based and solvent-based distances. Results: We have tested our method on several different molecular dynamical systems, including alanine dipeptide, carbon nanotube, and benzene rings. With the new solvent-based signatures, we are able to identify different solvent distributions near the solute. Furthermore, when the solute has a concave shape, we can also capture the number of water molecules inside the solute structure. Finally we have compared the performances of different Markov state models. The experimental results show that our approach improves on the existing methods both in computational running time and in metastability. Conclusions: In this paper we have initiated a study to build Markov state models for molecular dynamical systems with solvent degrees of freedom. The methods we described should also be broadly applicable to a wide range of biomolecular simulation analyses. [ABSTRACT FROM AUTHOR]
- Published: 2013
10. Scenario driven data modelling: a method for integrating diverse sources of data and data streams.
- Author: Griffith, Shelton D., Quest, Daniel J., Brettin, Thomas S., and Cottingham, Robert W.
- Subjects: BIOLOGY, DATA, GRAPHIC methods, LIFE sciences, INTERNET
- Abstract:
Background: Biology is rapidly becoming a data intensive, data-driven science. It is essential that data is represented and connected in ways that best represent its full conceptual content and allow both automated integration and data driven decision-making. Recent advancements in distributed multi-relational directed graphs, implemented in the form of the Semantic Web, make it possible to deal with complicated heterogeneous data in new and interesting ways. Results: This paper presents a new approach, scenario driven data modelling (SDDM), that integrates multi-relational directed graphs with data streams. SDDM can be applied to virtually any data integration challenge with widely divergent types of data and data streams. In this work, we explored integrating genetics data with reports from traditional media. SDDM was applied to the New Delhi metallo-beta-lactamase gene (NDM-1), an emerging global health threat. The SDDM process constructed a scenario, created an RDF multi-relational directed graph that linked diverse types of data to the Semantic Web, implemented RDF conversion tools (RDFizers) to bring content into the Semantic Web, identified data streams and analytical routines to analyse those streams, and identified user requirements and graph traversals to meet end-user requirements. Conclusions: We provided an example where SDDM was applied to a complex data integration challenge. The process created a model of the emerging NDM-1 health threat, identified and filled gaps in that model, and constructed reliable software that monitored data streams based on the scenario derived multi-relational directed graph. The SDDM process significantly reduced the software requirements phase by letting the scenario and resulting multi-relational directed graph define what is possible and then set the scope of the user requirements. Approaches like SDDM will be critical to the future of data intensive, data-driven science because they automate the process of converting massive data streams into usable knowledge. [ABSTRACT FROM AUTHOR]
- Published: 2011
11. SNP and gene networks construction and analysis from classification of copy number variations data.
- Author: Yang Liu, Yiu Fai Lee, and Ng, Michael K.
- Subjects: DNA, GENES, DATA, DEOXYRIBOSE, HEREDITY
- Abstract:
Background: Detection of genomic DNA copy number variations (CNVs) can provide a complete and more comprehensive view of human disease. It is interesting to identify and represent relevant CNVs from genome-wide data due to the high data volume and the complexity of interactions. Results: In this paper, we incorporate the DNA copy number variation data derived from SNP arrays into a computational shrunken model and formalize the detection of copy number variations as a case-control classification problem. More than 80% accuracy can be obtained using our classification model, and by shrinkage the number of CNVs relevant to disease can be determined. In order to understand relevant CNVs, we study their corresponding SNPs in the genome; the statistical software PLINK is employed to compute the pair-wise SNP-SNP interactions and identify SNP networks based on their P-values. Our selected SNP networks are statistically significant compared with random SNP networks and play a role in the biological process. For the unique genes that those SNPs are located in, a gene-gene similarity value is computed using GOSemSim, and gene pairs that have similarity values greater than a threshold are selected to construct gene networks. A gene enrichment analysis shows that our gene networks are functionally important. Experimental results demonstrate that our selected SNP and gene networks based on the selected CNVs contain some functional relationships directly or indirectly related to the disease under study. Conclusions: Two datasets are given to demonstrate the effectiveness of the introduced method. Statistical and biological analyses show that this shrunken classification model is effective in identifying CNVs from genome-wide data, and our proposed framework has the potential to become a useful analysis tool for SNP data sets. [ABSTRACT FROM AUTHOR]
- Published: 2011
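The abstract above builds SNP networks from pairwise interaction P-values computed with PLINK. A minimal sketch of only the network-building step follows, starting from an already-parsed table of pairwise results; the SNP identifiers, P-values, and threshold are hypothetical, and no PLINK invocation or file format is assumed.

```python
import networkx as nx

# Hypothetical pairwise interaction results, e.g. parsed from a PLINK epistasis
# output file: (SNP1, SNP2, P-value). All values below are made up.
pairs = [
    ("rs101", "rs202", 1e-6),
    ("rs101", "rs303", 3e-5),
    ("rs202", "rs303", 0.20),
    ("rs404", "rs505", 2e-7),
]

P_THRESHOLD = 1e-4                      # keep only significant interactions
G = nx.Graph()
for snp_a, snp_b, p in pairs:
    if p < P_THRESHOLD:
        G.add_edge(snp_a, snp_b, p_value=p)

# Each connected component is one candidate SNP network
for component in nx.connected_components(G):
    print(sorted(component))
```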
12. Correction to: Enhancing SVM for survival data using local invariances and weighting.
- Author: Sanz, Hector, Reverter, Ferran, and Valim, Clarissa
- Subjects: DATA
- Abstract:
An amendment to this paper has been published and can be accessed via the original article. [ABSTRACT FROM AUTHOR]
- Published: 2020
13. FSBC: fast string-based clustering for HT-SELEX data.
- Author: Kato, Shintaro, Ono, Takayoshi, Minagawa, Hirotaka, Horii, Katsunori, Shiratori, Ikuo, Waga, Iwao, Ito, Koichi, and Aoki, Takafumi
- Subjects: APTAMERS, BASE pairs, DATA
- Abstract:
Background: The combination of systematic evolution of ligands by exponential enrichment (SELEX) and deep sequencing is termed high-throughput (HT)-SELEX, which enables searching aptamer candidates from a massive amount of oligonucleotide sequences. A clustering method is an important procedure to identify sequence groups including aptamer candidates for evaluation with experimental analysis. In general, an aptamer includes a specific target binding region, which is necessary for binding to the target molecules. The length of the target binding region varies depending on the target molecules and/or binding styles. Currently available clustering methods for HT-SELEX only estimate clusters based on the similarity of full-length sequences or motifs of limited length as target binding regions. Hence, a clustering method considering target binding regions with different lengths is required. Moreover, to handle such huge data and to save sequencing cost, a clustering method with fast calculation from a single round of HT-SELEX data, not multiple rounds, is also preferred. Results: We developed fast string-based clustering (FSBC) for HT-SELEX data. FSBC was designed to estimate clusters by searching various lengths of over-represented strings as target binding regions. FSBC was also designed for fast calculation with search space reduction from a single round, typically the final round, of HT-SELEX data, considering the imbalanced nucleobases of the aptamer selection process. The calculation time and clustering accuracy of FSBC were compared with those of four conventional clustering methods, FASTAptamer, AptaCluster, APTANI, and AptaTRACE, using HT-SELEX data (>15 million oligonucleotide sequences). FSBC, AptaCluster, and AptaTRACE could complete the clustering for all sequence data, and FSBC and AptaTRACE achieved higher clustering accuracy. FSBC showed the highest clustering accuracy and had the second-fastest calculation speed among all methods compared. Conclusion: FSBC is applicable to a large HT-SELEX dataset, which can facilitate the accurate identification of groups including aptamer candidates. Availability of data and materials: FSBC is available at http://www.aoki.ecei.tohoku.ac.jp/fsbc/. [ABSTRACT FROM AUTHOR]
- Published: 2020
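FSBC, as summarized above, searches for over-represented strings of various lengths as candidate binding regions. A toy sketch of that general idea, not the FSBC scoring or its search-space reduction, is to count how many sequences in the pool contain each short string and rank strings against a crude uniform-base background; the sequence pool, string lengths, and null model here are invented.

```python
from collections import Counter

def substring_counts(sequences, k):
    """Number of sequences containing each length-k string at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enrichment(sequences, k_range=(5, 6, 7), alphabet="ACGT"):
    """Rank strings of several lengths by observed/expected sequence counts,
    assuming a uniform-base background (a deliberately crude null model)."""
    n = len(sequences)
    scores = {}
    for k in k_range:
        # Expected chance that a sequence of this length contains a given k-mer,
        # approximated from the first sequence's length for simplicity.
        expected_per_seq = (len(sequences[0]) - k + 1) / len(alphabet) ** k
        expected = min(1.0, expected_per_seq) * n
        for s, c in substring_counts(sequences, k).items():
            scores[s] = c / max(expected, 1e-9)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

pool = ["ACGTACGTGGGTTAACGT", "TTGGGTTAACGTACGTAC", "GGGTTAACGTCCCCACGT"]
print(enrichment(pool, k_range=(6,))[:3])   # top candidate binding strings
```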
14. Searchlight: automated bulk RNA-seq exploration and visualisation using dynamically generated R scripts
- Author: Cole, John J., Faydaci, Bekir A., McGuinness, David, Shaw, Robin, Maciewicz, Rose A., Robertson, Neil A., and Goodyear, Carl S.
- Published: 2021
15. Focused multidimensional scaling: interactive visualization for exploration of high-dimensional data.
- Author: Urpa, Lea M. and Anders, Simon
- Subjects: DATA visualization, MULTIDIMENSIONAL scaling, INDIVIDUALIZED medicine, DATA, DIMENSION reduction (Statistics)
- Abstract:
Background: Visualization is an important tool for generating meaning from scientific data, but the visualization of structures in high-dimensional data (such as from high-throughput assays) presents unique challenges. Dimension reduction methods are key in solving this challenge, but these methods can be misleading, especially when apparent clustering in the dimension-reducing representation is used as the basis for reasoning about relationships within the data. Results: We present two interactive visualization tools, distnet and focusedMDS, that help in assessing the validity of a dimension-reducing plot and in interactively exploring relationships between objects in the data. The distnet tool is used to examine discrepancies between the placement of points in a two-dimensional visualization and the points' actual similarities in feature space. The focusedMDS tool is an intuitive, interactive multidimensional scaling tool for exploring the relationships of one particular data point to the others, which might be useful in a personalized medicine framework. Conclusions: We introduce here two freely available tools for visually exploring and verifying the validity of dimension-reducing visualizations and biological information gained from these. The use of such tools can confirm that conclusions drawn from dimension-reducing visualizations are not simply artifacts of the visualization method, but are real biological insights. [ABSTRACT FROM AUTHOR]
- Published: 2019
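The distnet tool described above checks how badly a 2-D plot misrepresents true feature-space similarities. The tools themselves are interactive; a minimal non-interactive sketch of the underlying check is to embed the data with metric MDS and list the point pairs whose plotted distance differs most from their feature-space distance. The data and the discrepancy measure below are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))                 # 40 samples, 50 features

D_high = squareform(pdist(X))                 # distances in feature space
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D_high)
D_low = squareform(pdist(embedding))          # distances in the 2-D plot

# Pairs whose plotted distance most misrepresents their true distance
i, j = np.triu_indices_from(D_high, k=1)
discrepancy = np.abs(D_high[i, j] - D_low[i, j])
worst = np.argsort(discrepancy)[::-1][:5]
for idx in worst:
    print(f"samples {i[idx]} and {j[idx]}: "
          f"true {D_high[i[idx], j[idx]]:.2f} vs plotted {D_low[i[idx], j[idx]]:.2f}")
```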
16. Rapid analysis of metagenomic data using signature-based clustering.
- Author: Chappell, Timothy, Geva, Shlomo, Hogan, James M., Huygens, Flavia, Rathnayake, Irani U., Rudd, Stephen, Kelly, Wayne, and Perrin, Dimitri
- Subjects: METAGENOMICS, MICROBIAL genomics, DATA, STAPHYLOCOCCUS aureus, BACTERIA
- Abstract:
Background: Sequencing highly-variable 16S regions is a common and often effective approach to the study of microbial communities, and next-generation sequencing (NGS) technologies provide abundant quantities of data for analysis. However, the speed of existing analysis pipelines may limit our ability to work with these quantities of data. Furthermore, the limited coverage of existing 16S databases may hamper our ability to characterise these communities, particularly in the context of complex or poorly studied environments. Results: In this article we present the SigClust algorithm, a novel clustering method involving the transformation of sequence reads into binary signatures. When compared to other published methods, SigClust yields superior cluster coherence and separation of metagenomic read data, while operating within substantially reduced timeframes. We demonstrate its utility on published Illumina datasets and on a large collection of labelled wound reads sourced from patients in a wound clinic. The temporal analysis is based on tracking the dominant clusters of wound samples over time. The analysis can identify markers of both healing and non-healing wounds in response to treatment. Prominent clusters are found, corresponding to bacterial species known to be associated with unfavourable healing outcomes, including a number of strains of Staphylococcus aureus. Conclusions: SigClust identifies clusters rapidly and supports an improved understanding of the wound microbiome without reliance on a reference database. The results indicate a promising use for a SigClust-based pipeline in wound analysis and prediction, and a possible novel method for wound management and treatment. [ABSTRACT FROM AUTHOR]
- Published: 2018
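SigClust, as summarized above, transforms reads into binary signatures before clustering. The abstract does not specify the signature scheme, so the sketch below uses an assumed k-mer presence bit-vector and a simple greedy Hamming-distance grouping; the reads, k, and distance cutoff are invented.

```python
import numpy as np
from itertools import product

def kmer_signature(read, k=3, alphabet="ACGT"):
    """Binary signature: one bit per possible k-mer, set if it occurs in the read."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    sig = np.zeros(len(index), dtype=np.uint8)
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in index:
            sig[index[kmer]] = 1
    return sig

def greedy_hamming_clusters(reads, k=3, max_dist=6):
    """Assign each read to the first cluster whose representative signature is
    within max_dist Hamming distance; otherwise start a new cluster."""
    reps, clusters = [], []
    for read in reads:
        sig = kmer_signature(read, k)
        for idx, rep in enumerate(reps):
            if np.count_nonzero(sig != rep) <= max_dist:
                clusters[idx].append(read)
                break
        else:
            reps.append(sig)
            clusters.append([read])
    return clusters

reads = ["ACGTACGTAAGG", "ACGTACGTAAGC", "TTTTGGGGCCCC", "TTTTGGGGCCCA"]
for cluster in greedy_hamming_clusters(reads):
    print(cluster)
```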
17. Fast and accurate branch lengths estimation for phylogenomic trees.
- Author: Binet, Manuel, Gascuel, Olivier, Scornavacca, Celine, Douzery, Emmanuel J. P., and Pardi, Fabio
- Subjects: PHYLOGENY, GENOMICS, BIOLOGICAL evolution, DATA, TOPOLOGY
- Abstract:
Background: Branch lengths are an important attribute of phylogenetic trees, providing essential information for many studies in evolutionary biology. Yet, part of the current methodology to reconstruct a phylogeny from genomic information, namely supertree methods, focuses on the topology or structure of the phylogenetic tree, rather than the evolutionary divergences associated with it. Moreover, accurate methods to estimate branch lengths, typically based on probabilistic analysis of a concatenated alignment, are limited by large demands in memory and computing time, and may become impractical when the data sets are too large. Results: Here, we present a novel phylogenomic distance-based method, named ERaBLE (Evolutionary Rates and Branch Length Estimation), to estimate the branch lengths of a given reference topology, and the relative evolutionary rates of the genes employed in the analysis. ERaBLE uses as input data a potentially very large collection of distance matrices, where each matrix is obtained from a different genomic region, either directly from its sequence alignment, or indirectly from a gene tree inferred from the alignment. Our experiments show that ERaBLE is very fast and fairly accurate when compared to other possible approaches for the same tasks. Specifically, it efficiently and accurately deals with large data sets, such as the OrthoMaM v8 database, composed of 6,953 exons from up to 40 mammals. Conclusions: ERaBLE may be used as a complement to supertree methods, or as an efficient alternative to maximum likelihood analysis of concatenated alignments, to estimate branch lengths from phylogenomic data sets. [ABSTRACT FROM AUTHOR]
- Published: 2016
18. Controlling false discoveries in high-dimensional situations: boosting with stability selection.
- Author: Hofner, Benjamin, Boccuto, Luigi, and Göker, Markus
- Subjects: BOOSTING algorithms, BIOTECHNOLOGY, DATA, RESAMPLING (Statistics), SAMPLING errors, DATA analysis, LOG-linear models
- Abstract:
Background: Modern biotechnologies often result in high-dimensional data sets with many more variables than observations (n ⪡ p). These data sets pose new challenges to statistical analysis: variable selection becomes one of the most important tasks in this setting. Similar challenges arise in modern data sets from observational studies, e.g., in ecology, where flexible, non-linear models are fitted to high-dimensional data. We assess the recently proposed flexible framework for variable selection called stability selection. By the use of resampling procedures, stability selection adds a finite sample error control to high-dimensional variable selection procedures such as Lasso or boosting. We consider the combination of boosting and stability selection and present results from a detailed simulation study that provide insights into the usefulness of this combination. The interpretation of the error bounds used is elaborated and insights for practical data analysis are given. Results: Stability selection with boosting was able to detect influential predictors in high-dimensional settings while controlling the given error bound in various simulation scenarios. The dependence on various parameters such as the sample size, the number of truly influential variables or tuning parameters of the algorithm was investigated. The results were applied to investigate phenotype measurements in patients with autism spectrum disorders using a log-linear interaction model which was fitted by boosting. Stability selection identified five differentially expressed amino acid pathways. Conclusion: Stability selection is implemented in the freely available R package stabs (http://CRAN.R-project.org/package=stabs). It proved to work well in high-dimensional settings with more predictors than observations for both linear and additive models. The original version of stability selection, which controls the per-family error rate, is quite conservative; this is much less the case for its improvement, complementary pairs stability selection. Nevertheless, care should be taken to appropriately specify the error bound. [ABSTRACT FROM AUTHOR]
- Published: 2015
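Stability selection itself is implemented in the R package stabs; the sketch below is only a rough Python re-creation of the core resampling idea (selection frequencies of a sparse learner over random half-subsamples), using the Lasso rather than boosting and without the formal per-family error bound. The data, penalty, and threshold are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, alpha=0.05, n_subsamples=100, threshold=0.6, seed=0):
    """Selection frequency of each variable over random half-subsamples.

    Variables selected in at least `threshold` of the fits are declared stable."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        coef = Lasso(alpha=alpha, max_iter=5000).fit(X[idx], y[idx]).coef_
        freq += coef != 0
    freq /= n_subsamples
    return freq, np.flatnonzero(freq >= threshold)

rng = np.random.default_rng(1)
n, p = 80, 500
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)   # only variables 0 and 1 matter
freq, stable = stability_selection(X, y)
print("stable variables:", stable)
```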
19. Normalization and missing value imputation for label-free LC-MS analysis.
- Author: Karpievitch, Yuliya V., Dabney, Alan R., and Smith, Richard D.
- Subjects: LIQUID chromatography-mass spectrometry, INFERENTIAL statistics, PROTEOMICS, PROTEIN microarrays, DATA
- Abstract:
Shotgun proteomic data are affected by a variety of known and unknown systematic biases as well as high proportions of missing values. Typically, normalization is performed in an attempt to remove systematic biases from the data before statistical inference, sometimes followed by missing value imputation to obtain a complete matrix of intensities. Here we discuss several approaches to normalization and dealing with missing values, some initially developed for microarray data and some developed specifically for mass spectrometry-based data. [ABSTRACT FROM AUTHOR]
- Published: 2012
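The review above surveys normalization and missing-value handling for label-free LC-MS data. As a minimal sketch only, and not the authors' recommended pipeline, two of the simplest operations in that spirit are median alignment of samples and a crude left-censoring imputation that fills missing peptide intensities just below the smallest observed value; the toy matrix and the one-log-unit shift are assumptions.

```python
import numpy as np

def median_normalize(log_intensities):
    """Align samples (columns) by removing differences in their median log-intensity."""
    col_medians = np.nanmedian(log_intensities, axis=0)
    return log_intensities - col_medians + np.mean(col_medians)

def impute_left_censored(log_intensities, shift=1.0):
    """Replace missing values in each peptide (row) with a value just below the
    smallest observed intensity, a crude stand-in for the detection limit."""
    filled = log_intensities.copy()
    for row in filled:                       # rows are views, edited in place
        missing = np.isnan(row)
        if missing.any() and (~missing).any():
            row[missing] = np.nanmin(row) - shift
    return filled

# Toy log2 intensity matrix: rows = peptides, columns = LC-MS runs, NaN = missing
data = np.array([[20.1, 20.9, np.nan],
                 [18.3, np.nan, 18.0],
                 [22.4, 23.0, 22.7]])
print(impute_left_censored(median_normalize(data)))
```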
20. Evaluating the consistency of gene sets used in the analysis of bacterial gene expression data.
- Author: Tintle, Nathan L, Sitarik, Alexandra, Boerema, Benjamin, Young, Kylie, Best, Aaron A, and DeJongh, Matthew
- Subjects: GENES, BACTERIA, GENE expression, DATA, GENOMES
- Abstract:
Background: Statistical analyses of whole genome expression data require functional information about genes in order to yield meaningful biological conclusions. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) are common sources of functionally grouped gene sets. For bacteria, the SEED and MicrobesOnline provide alternative, complementary sources of gene sets. To date, no comprehensive evaluation of the data obtained from these resources has been performed. Results: We define a series of gene set consistency metrics directly related to the most common classes of statistical analyses for gene expression data, and then perform a comprehensive analysis of 3581 Affymetrix gene expression arrays across 17 diverse bacteria. We find that gene sets obtained from GO and KEGG demonstrate lower consistency than those obtained from the SEED and MicrobesOnline, regardless of gene set size. Conclusions: Despite the widespread use of GO and KEGG gene sets in bacterial gene expression data analysis, the SEED and MicrobesOnline provide more consistent sets for a wide variety of statistical analyses. Increased use of the SEED and MicrobesOnline gene sets in the analysis of bacterial gene expression data may improve statistical power and the utility of expression data. [ABSTRACT FROM AUTHOR]
- Published: 2012
21. PREMIM and EMIM: tools for estimation of maternal, imprinting and interaction effects using multinomial modelling.
- Author: Howey, Richard and Cordell, Heather J
- Subjects: MOTHERHOOD, GENETICS, CHILDREN, COMPUTER software, DATA
- Abstract:
Background: Here we present two new computer tools, PREMIM and EMIM, for the estimation of parental and child genetic effects, based on genotype data from a variety of different child-parent configurations. PREMIM allows the extraction of child-parent genotype data from standard-format pedigree data files, while EMIM uses the extracted genotype data to perform subsequent statistical analysis. The use of genotype data from the parents as well as from the child in question allows the estimation of complex genetic effects such as maternal genotype effects, maternal-foetal interactions and parent-of-origin (imprinting) effects. These effects are estimated by EMIM, incorporating chosen assumptions such as Hardy-Weinberg equilibrium or exchangeability of parental matings as required. Results: In application to simulated data, we show that the inference provided by EMIM is essentially equivalent to that provided by alternative (competing) software packages such as MENDEL and LEM. However, PREMIM and EMIM (used in combination) considerably outperform MENDEL and LEM in terms of speed and ease of execution. Conclusions: Together, EMIM and PREMIM provide easy-to-use command-line tools for the analysis of pedigree data, giving unbiased estimates of parental and child genotype relative risks. [ABSTRACT FROM AUTHOR]
- Published: 2012
22. TDT-HET: A new transmission disequilibrium test that incorporates locus heterogeneity into the analysis of family-based association data.
- Subjects: HETEROGENEITY, LOCUS (Genetics), DATA, GENOTYPE-environment interaction, CHROMOSOMES, BREAST cancer
- Abstract:
The article focuses on the development of methods to address locus heterogeneity in genetic association analysis. Information is presented regarding the development of a test that extends the classic transmission disequilibrium test (TDT) to one that accounts for locus heterogeneity. It is mentioned that the method has several strengths, including estimation of parameters in the presence of heterogeneity and reasonable power even when the proportion of linked trios is small.
- Published: 2012
23. Integrated analysis of the heterogeneous microarray data.
- Author: Sung Gon Yi and Taesung Park
- Subjects: MICROARRAY technology, DATA, GENES, HOSPITALS, LABORATORIES
- Abstract:
Background: As the magnitude of the experiment increases, it is common to combine various types of microarrays, such as paired and non-paired microarrays, from different laboratories or hospitals. Thus, it is important to analyze microarray data together to derive a combined conclusion after accounting for heterogeneity among data sets. One of the main objectives of the microarray experiment is to identify differentially expressed genes among the different experimental groups. We propose the linear mixed effect model for the integrated analysis of heterogeneous microarray data sets. Results: The proposed linear mixed effect model was illustrated using the data from 133 microarrays collected at three different hospitals. Through simulation studies, we compared the proposed linear mixed effect model approach with the meta-analysis and the ANOVA model approaches. The linear mixed effect model approach was shown to provide higher powers than the other approaches. Conclusions: The linear mixed effect model has the advantage over the ANOVA model of allowing for various types of covariance structures. Further, it can easily handle correlated microarray data, such as paired microarray data and repeated microarray data from the same subject. [ABSTRACT FROM AUTHOR]
- Published: 2011
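The abstract above proposes a linear mixed effect model in which the collecting site (hospital) contributes heterogeneity. A minimal single-gene sketch of that kind of model, not the paper's exact specification or its 133-array data, fits a random intercept per hospital while estimating the group effect as a fixed effect; the simulated shifts and sample sizes are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
hospitals = np.repeat(["A", "B", "C"], 40)
hospital_shift = {"A": 0.0, "B": 0.8, "C": -0.5}            # site-to-site heterogeneity
group = np.tile([0, 1], 60)                                  # experimental group label
expr = (0.6 * group                                           # true group effect
        + np.array([hospital_shift[h] for h in hospitals])
        + rng.normal(scale=0.7, size=120))

df = pd.DataFrame({"expr": expr, "group": group, "hospital": hospitals})

# A random intercept per hospital absorbs between-site differences while the
# fixed effect estimates the expression difference between the two groups.
model = smf.mixedlm("expr ~ group", df, groups=df["hospital"]).fit()
print(model.summary())
```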
24. An adaptive optimal ensemble classifier via bagging and rank aggregation with applications to high dimensional data.
- Author: Datta, Susmita, Pihur, Vasyl, and Datta, Somnath
- Subjects: DATA, ALGORITHMS, CLASSIFICATION, PERFORMANCE, INFORMATION organization
- Abstract:
Background: Generally speaking, different classifiers tend to work well for certain types of data and conversely, it is usually not known a priori which algorithm will be optimal in any given classification application. In addition, for most classification problems, selecting the best performing classification algorithm amongst a number of competing algorithms is a difficult task for various reasons. For example, the order of performance may depend on the performance measure employed for such a comparison. In this work, we present a novel adaptive ensemble classifier constructed by combining bagging and rank aggregation that is capable of adaptively changing its performance depending on the type of data that is being classified. The attractive feature of the proposed classifier is its multi-objective nature where the classification results can be simultaneously optimized with respect to several performance measures, for example, accuracy, sensitivity and specificity. We also show that our somewhat complex strategy has better predictive performance as judged on test samples than a more naive approach that attempts to directly identify the optimal classifier based on the training data performances of the individual classifiers. Results: We illustrate the proposed method with two simulated and two real-data examples. In all cases, the ensemble classifier performs at the level of the best individual classifier comprising the ensemble or better. Conclusions: For complex high-dimensional datasets resulting from present day high-throughput experiments, it may be wise to consider a number of classification algorithms combined with dimension reduction techniques rather than a fixed standard algorithm set a priori. [ABSTRACT FROM AUTHOR]
- Published: 2010
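The abstract above combines bagging with rank aggregation over several performance measures. The simplified sketch below shows only the rank-aggregation idea: score a few candidate classifiers on a validation split under multiple metrics and aggregate the per-metric ranks with a simple Borda-style mean rank. The bagging step and the authors' weighted aggregation are omitted, and the data, candidates, and metrics are assumptions.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=40, n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=2000),
    "forest": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}
metrics = {"accuracy": accuracy_score, "sensitivity": recall_score, "precision": precision_score}

# Fit each candidate once, then score it under every performance measure
preds = {name: clf.fit(X_tr, y_tr).predict(X_val) for name, clf in candidates.items()}
scores = np.array([[metric(y_val, preds[name]) for metric in metrics.values()]
                   for name in candidates])

# Borda-style aggregation: average the per-metric ranks (higher score = better rank)
ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, scores)
mean_rank = ranks.mean(axis=1)
best = list(candidates)[int(np.argmin(mean_rank))]
print(dict(zip(candidates, mean_rank)), "->", best)
```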
25. Coverage statistics for sequence census methods.
- Author: Evans, Steven N, Hower, Valerie, and Pachter, Lior
- Subjects: GENOMES, POISSON processes, DATA, COMPUTER simulation, CENSUS
- Abstract:
Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce a coding of the shape of the coverage depth function as a tree and explain how this can be used to detect regions with anomalous coverage. This modeling perspective is especially germane to current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. Results: Under the mild assumptions that fragment start sites are Poisson distributed and successive fragment lengths are independent and identically distributed, we observe that, regardless of fragment length distribution, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the successive jumps of the coverage function, and show that they can be encoded as a random tree that is approximately a Galton-Watson tree with generation-dependent geometric offspring distributions whose parameters can be computed. Conclusions: We extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. Our approach leads to explicit determinations of the null distributions of certain test statistics, while for others it greatly simplifies the approximation of their null distributions by simulation. Our focus on fragments also leads to a new approach to visualizing sequencing data that is of independent interest. [ABSTRACT FROM AUTHOR]
- Published: 2010
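In the classic Lander-Waterman setting referenced above, per-base coverage depth with Poisson-distributed start sites and fixed-length fragments is Poisson with mean NL/G, so the expected uncovered fraction is exp(-NL/G). The sketch below checks that closed form against a quick simulation with invented genome size, fragment count, and length; it does not implement the paper's tree encoding of the coverage function.

```python
import numpy as np

G = 1_000_000        # genome length (bases)
N = 50_000           # number of fragments
L = 100              # fragment length

lam = N * L / G                           # mean coverage depth (Lander-Waterman)
p_uncovered = np.exp(-lam)                # Poisson P(depth = 0) at a given base
print(f"mean depth {lam:.1f}, expected uncovered fraction {p_uncovered:.4f}")

# Quick simulation: uniform random start sites, each extended by L bases
rng = np.random.default_rng(0)
starts = rng.integers(0, G, size=N)
depth = np.zeros(G, dtype=np.int32)
np.add.at(depth, starts, 1)                                  # mark fragment starts
depth = np.convolve(depth, np.ones(L, dtype=np.int32))[:G]   # smear each start over L bases
print("simulated uncovered fraction:", np.mean(depth == 0))
```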
26. Assessment of composite motif discovery methods.
- Author: Klepper, Kjetil, Sandve, Geir K., Abul, Osman, Johansen, Jostein, and Drablos, Finn
- Subjects: BIOINFORMATICS, TRANSCRIPTION factors, DATA, GENETIC regulation, NUCLEOTIDES
- Abstract:
Background: Computational discovery of regulatory elements is an important area of bioinformatics research and more than a hundred motif discovery methods have been published. Traditionally, most of these methods have addressed the problem of single motif discovery: discovering binding motifs for individual transcription factors. In higher organisms, however, transcription factors usually act in combination with nearby bound factors to induce specific regulatory behaviours. Hence, recent focus has shifted from single motifs to the discovery of sets of motifs bound by multiple cooperating transcription factors, so-called composite motifs or cis-regulatory modules. Given the large number and diversity of methods available, independent assessment of methods becomes important. Although there have been several benchmark studies of single motif discovery, no similar studies have previously been conducted concerning composite motif discovery. Results: We have developed a benchmarking framework for composite motif discovery and used it to evaluate the performance of eight published module discovery tools. Benchmark datasets were constructed based on real genomic sequences containing experimentally verified regulatory modules, and the module discovery programs were asked to predict both the locations of these modules and to specify the single motifs involved. To aid the programs in their search, we provided position weight matrices corresponding to the binding motifs of the transcription factors involved. In addition, selections of decoy matrices were mixed with the genuine matrices on one dataset to test the response of programs to varying levels of noise. Conclusion: Although some of the methods tested tended to score somewhat better than others overall, there were still large variations between individual datasets and no single method performed consistently better than the rest in all situations. The variation in performance on individual datasets also shows that the new benchmark datasets represent a suitable variety of challenges to most methods for module discovery. [ABSTRACT FROM AUTHOR]
- Published: 2008
27. GapCoder automates the use of indel characters in phylogenetic analysis.
- Author: Young, Nelson D. and Healy, John
- Subjects: PHYLOGENY, DATA, CHARACTER sets (Data processing), CODING theory, BIOINFORMATICS
- Abstract:
Background: Several ways of incorporating indels into phylogenetic analysis have been suggested. Simple indel coding has two strengths: (1) biological realism and (2) efficiency of analysis. In the method, each indel with different start and/or end positions is considered to be a separate character. The presence/absence of these indel characters is then added to the data set. Algorithm: We have written a program, GapCoder, to automate this procedure. The program can input PIR format aligned datasets, find the indels and add the indel-based characters. The output is a NEXUS format file, which includes a table showing which region each indel character is based on. If regions are excluded from analysis, this table makes it easy to identify the corresponding indel characters for exclusion. Discussion: Manual implementation of the simple indel coding method can be very time-consuming, especially in data sets where indels are numerous and/or overlapping. GapCoder automates this method and is therefore particularly useful during procedures where phylogenetic analyses need to be repeated many times, such as when different alignments or various taxon or character sets are being explored. GapCoder is currently available for Windows from http://www.home.duq.edu/~youngnd/GapCoder. [ABSTRACT FROM AUTHOR]
- Published: 2003
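Simple indel coding, as described above, treats each gap with distinct start/end coordinates as an extra presence/absence character appended to the data set. A simplified sketch follows, working on an in-memory toy alignment rather than GapCoder's PIR input and NEXUS output, and ignoring the special handling of indels nested inside larger gaps.

```python
def gap_runs(seq):
    """(start, end) coordinates of each run of '-' characters in an aligned sequence."""
    runs, start = [], None
    for i, ch in enumerate(seq):
        if ch == "-" and start is None:
            start = i
        elif ch != "-" and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(seq) - 1))
    return runs

def simple_indel_coding(alignment):
    """Append one 0/1 presence/absence character per distinct indel (start, end) pair."""
    indels = sorted({run for seq in alignment.values() for run in gap_runs(seq)})
    coded = {}
    for name, seq in alignment.items():
        present = gap_runs(seq)
        coded[name] = seq + "".join("1" if indel in present else "0" for indel in indels)
    return coded, indels

alignment = {
    "taxonA": "ATG--CGTAA",
    "taxonB": "ATG--CGT-A",
    "taxonC": "ATGTTCGTAA",
}
coded, indels = simple_indel_coding(alignment)
print("indel characters (start, end):", indels)
for name, seq in coded.items():
    print(name, seq)
```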
28. LabPipe: an extensible bioinformatics toolkit to manage experimental data and metadata.
- Author: Zhao, Bo, Bryant, Luke, Cordell, Rebecca, Wilde, Michael, Salman, Dahlia, Ruszkiewicz, Dorota, Ibrahim, Wadah, Singapuri, Amisha, Coats, Tim, Gaillard, Erol, Beardsmore, Caroline, Suzuki, Toru, Ng, Leong, Greening, Neil, Thomas, Paul, Monks, Paul, Brightling, Christopher, Siddiqui, Salman, and Free, Robert C.
- Subjects: METADATA, DATA management, DATA entry, ACQUISITION of data, DATA
- Abstract:
Background: Data handling in clinical bioinformatics is often inadequate. No freely available tools provide straightforward approaches for consistent, flexible metadata collection and linkage of related experimental data generated locally by vendor software. Results: To address this problem, we created LabPipe, a flexible toolkit which is driven through a local client that runs alongside vendor software and connects to a light-weight server. The toolkit allows re-usable configurations to be defined for experiment metadata and local data collection, and handles metadata entry and linkage of data. LabPipe was piloted in a multi-site clinical breathomics study. Conclusions: LabPipe provided a consistent, controlled approach for handling metadata and experimental data collection, collation and linkage in the exemplar study and was flexible enough to deal effectively with different data handling challenges. [ABSTRACT FROM AUTHOR]
- Published: 2020
29. Multiple imputation and direct estimation for qPCR data with non-detects.
- Author: Sherina, Valeriia, McMurray, Helene R., Powers, Winslow, Land, Harmut, Love, Tanzy M. T., and McCall, Matthew N.
- Subjects: MULTIPLE imputation (Statistics), ESTIMATION bias, EXPECTATION-maximization algorithms, GENE expression, DETECTION limit, DATA
- Abstract:
Background: Quantitative real-time PCR (qPCR) is one of the most widely used methods to measure gene expression. An important aspect of qPCR data that has been largely ignored is the presence of non-detects: reactions failing to exceed the quantification threshold and therefore lacking a measurement of expression. While most current software replaces these non-detects with a value representing the limit of detection, this introduces substantial bias in the estimation of both absolute and differential expression. Single imputation procedures, while an improvement on previously used methods, underestimate residual variance, which can lead to anti-conservative inference. Results: We propose to treat non-detects as non-random missing data, model the missing data mechanism, and use this model to impute missing values or obtain direct estimates of model parameters. To account for the uncertainty inherent in the imputation, we propose a multiple imputation procedure, which provides a set of plausible values for each non-detect. We assess the proposed methods via simulation studies and demonstrate the applicability of these methods to three experimental data sets. We compare our methods to mean imputation, single imputation, and a penalized EM algorithm incorporating non-random missingness (PEMM). The developed methods are implemented in the R/Bioconductor package nondetects. Conclusions: The statistical methods introduced here reduce discrepancies in gene expression values derived from qPCR experiments in the presence of non-detects, providing increased confidence in downstream analyses. [ABSTRACT FROM AUTHOR]
- Published: 2020
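Non-detects, as described above, are reactions that never cross the quantification threshold, so the Ct value is effectively right-censored at the detection limit. The sketch below shows one simple multiple-imputation scheme in that spirit (draw each missing Ct from a normal fitted to the observed replicates, truncated above the limit of detection); it is not the model implemented in the R/Bioconductor nondetects package, and the replicate values, LOD, and variance floor are assumptions.

```python
import numpy as np
from scipy import stats

def impute_nondetects(ct, lod=40.0, n_imputations=5, seed=0):
    """Multiply impute right-censored Ct values for one gene/condition.

    Missing entries (np.nan) are drawn from a normal distribution fitted to the
    observed replicates, truncated to lie above the limit of detection (lod)."""
    rng = np.random.default_rng(seed)
    ct = np.asarray(ct, dtype=float)
    observed = ct[~np.isnan(ct)]
    mu, sigma = observed.mean(), max(observed.std(ddof=1), 0.25)
    a = (lod - mu) / sigma                      # standardized lower truncation bound
    completed = []
    for _ in range(n_imputations):
        filled = ct.copy()
        n_missing = int(np.isnan(ct).sum())
        filled[np.isnan(ct)] = stats.truncnorm.rvs(a, np.inf, loc=mu, scale=sigma,
                                                   size=n_missing, random_state=rng)
        completed.append(filled)
    return completed            # one plausible complete data set per imputation

replicates = [37.2, 38.5, np.nan, 39.1, np.nan]   # two reactions never crossed threshold
for dataset in impute_nondetects(replicates):
    print(np.round(dataset, 2))
```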