32 results on '"Fernández-Baca D"'
Search Results
2. Flipping: A supertree construction method
- Author
-
Chen, D., primary, Diao, L., additional, Eulenstein, O., additional, Fernández-Baca, D., additional, and Sanderson, M., additional
- Published
- 2003
- Full Text
- View/download PDF
3. Rainbow: a toolbox for phylogenetic supertree construction and analysis
- Author
-
Chen, D., Eulenstein, O., and Fernández-Baca, D.
- Published
- 2004
4. An ILP solution for the gene duplication problem
- Author
-
Fernández-Baca David F, Burleigh Gordon J, Chang Wen-Chieh, and Eulenstein Oliver
- Subjects
Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore, most analyses use heuristics that lack any performance guarantee. Results We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6, 084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics. Conclusions Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1, 000 genes in a few hours. These are the largest instances that have been solved to optimally to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.
- Published
- 2011
- Full Text
- View/download PDF
5. iGTP: A software package for large-scale gene tree parsimony analysis
- Author
-
Fernández-Baca David, Wehe André, Bansal Mukul S, Chaudhary Ruchi, and Eulenstein Oliver
- Subjects
Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background The ever-increasing wealth of genomic sequence information provides an unprecedented opportunity for large-scale phylogenetic analysis. However, species phylogeny inference is obfuscated by incongruence among gene trees due to evolutionary events such as gene duplication and loss, incomplete lineage sorting (deep coalescence), and horizontal gene transfer. Gene tree parsimony (GTP) addresses this issue by seeking a species tree that requires the minimum number of evolutionary events to reconcile a given set of incongruent gene trees. Despite its promise, the use of gene tree parsimony has been limited by the fact that existing software is either not fast enough to tackle large data sets or is restricted in the range of evolutionary events it can handle. Results We introduce iGTP, a platform-independent software program that implements state-of-the-art algorithms that greatly speed up species tree inference under the duplication, duplication-loss, and deep coalescence reconciliation costs. iGTP significantly extends and improves the functionality and performance of existing gene tree parsimony software and offers advanced features such as building effective initial trees using stepwise leaf addition and the ability to have unrooted gene trees in the input. Moreover, iGTP provides a user-friendly graphical interface with integrated tree visualization software to facilitate analysis of the results. Conclusions iGTP enables, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication, duplication-loss, and deep coalescence reconciliation costs, all from within a convenient graphical user interface.
- Published
- 2010
- Full Text
- View/download PDF
6. Robinson-Foulds Supertrees
- Author
-
Eulenstein Oliver, Burleigh J Gordon, Bansal Mukul S, and Fernández-Baca David
- Subjects
Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Background Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees. Results We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Θ(n) and Θ(n2) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods. Conclusions Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.
- Published
- 2010
- Full Text
- View/download PDF
7. Constructing majority-rule supertrees
- Author
-
McMorris FR, Fernández-Baca David, and Dong Jianrong
- Subjects
Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Background Supertree methods combine the phylogenetic information from multiple partially-overlapping trees into a larger phylogenetic tree called a supertree. Several supertree construction methods have been proposed to date, but most of these are not designed with any specific properties in mind. Recently, Cotton and Wilkinson proposed extensions of the majority-rule consensus tree method to the supertree setting that inherit many of the appealing properties of the former. Results We study a variant of one of Cotton and Wilkinson's methods, called majority-rule (+) supertrees. After proving that a key underlying problem for constructing majority-rule (+) supertrees is NP-hard, we develop a polynomial-size exact integer linear programming formulation of the problem. We then present a data reduction heuristic that identifies smaller subproblems that can be solved independently. While this technique is not guaranteed to produce optimal solutions, it can achieve substantial problem-size reduction. Finally, we report on a computational study of our approach on various real data sets, including the 121-taxon, 7-tree Seabirds data set of Kennedy and Page. Conclusions The results indicate that our exact method is computationally feasible for moderately large inputs. For larger inputs, our data reduction heuristic makes it feasible to tackle problems that are well beyond the range of the basic integer programming approach. Comparisons between the results obtained by our heuristic and exact solutions indicate that the heuristic produces good answers. Our results also suggest that the majority-rule (+) approach, in both its basic form and with data reduction, yields biologically meaningful phylogenies.
- Published
- 2010
- Full Text
- View/download PDF
8. PhyloFinder: An intelligent search engine for phylogenetic tree databases
- Author
-
Bansal Mukul S, Burleigh J Gordon, Chen Duhong, and Fernández-Baca David
- Subjects
Evolution ,QH359-425 - Abstract
Abstract Background Bioinformatic tools are needed to store and access the rapidly growing phylogenetic data. These tools should enable users to identify existing phylogenetic trees containing a specified taxon or set of taxa and to compare a specified phylogenetic hypothesis to existing phylogenetic trees. Results PhyloFinder is an intelligent search engine for phylogenetic databases that we have implemented using trees from TreeBASE. It enables taxonomic queries, in which it identifies trees in the database containing the exact name of the query taxon and/or any synonymous taxon names, and it provides spelling suggestions for the query when there is no match. Additionally, PhyloFinder can identify trees containing descendants or direct ancestors of the query taxon. PhyloFinder also performs phylogenetic queries, in which it identifies trees that contain the query tree or topologies that are similar to the query tree. Conclusion PhyloFinder can enhance the utility of any tree database by providing tools for both taxonomic and phylogenetic queries as well as visualization tools that highlight the query results and provide links to NCBI and TBMap. An implementation of PhyloFinder using trees from TreeBASE is available from the web client application found in the availability and requirements section.
- Published
- 2008
- Full Text
- View/download PDF
9. Testing the agreement of trees with internal labels.
- Author
-
Fernández-Baca D and Liu L
- Abstract
Background: A semi-labeled tree is a tree where all leaves as well as, possibly, some internal nodes are labeled with taxa. Semi-labeled trees encompass ordinary phylogenetic trees and taxonomies. Suppose we are given a collection [Formula: see text] of semi-labeled trees, called input trees, over partially overlapping sets of taxa. The agreement problem asks whether there exists a tree [Formula: see text], called an agreement tree, whose taxon set is the union of the taxon sets of the input trees such that the restriction of [Formula: see text] to the taxon set of [Formula: see text] is isomorphic to [Formula: see text], for each [Formula: see text]. The agreement problems is a special case of the supertree problem, the problem of synthesizing a collection of phylogenetic trees with partially overlapping taxon sets into a single supertree that represents the information in the input trees. An obstacle to building large phylogenetic supertrees is the limited amount of taxonomic overlap among the phylogenetic studies from which the input trees are obtained. Incorporating taxonomies into supertree analyses can alleviate this issue., Results: We give a [Formula: see text] algorithm for the agreement problem, where n is the total number of distinct taxa in [Formula: see text], k is the number of trees in [Formula: see text], and [Formula: see text] is the maximum number of children of a node in [Formula: see text]., Conclusion: Our algorithm can aid in integrating taxonomies into supertree analyses. Our computational experience with the algorithm suggests that its performance in practice is much better than its worst-case bound indicates., (© 2021. The Author(s).)
- Published
- 2021
- Full Text
- View/download PDF
10. Genotypic Characterization of the U.S. Peanut Core Collection.
- Author
-
Otyama PI, Kulkarni R, Chamberlin K, Ozias-Akins P, Chu Y, Lincoln LM, MacDonald GE, Anglin NL, Dash S, Bertioli DJ, Fernández-Baca D, Graham MA, Cannon SB, and Cannon EKS
- Subjects
- Alleles, Genotype, Phylogeny, Arachis genetics, Genetic Variation
- Abstract
Cultivated peanut ( Arachis hypogaea ) is an important oil, food, and feed crop worldwide. The USDA peanut germplasm collection currently contains 8,982 accessions. In the 1990s, 812 accessions were selected as a core collection on the basis of phenotype and country of origin. The present study reports genotyping results for the entire available core collection. Each accession was genotyped with the Arachis_Axiom2 SNP array, yielding 14,430 high-quality, informative SNPs across the collection. Additionally, a subset of 253 accessions was replicated, using between two and five seeds per accession, to assess heterogeneity within these accessions. The genotypic diversity of the core is mostly captured in five genotypic clusters, which have some correspondence with botanical variety and market type. There is little genetic clustering by country of origin, reflecting peanut's rapid global dispersion in the 18
th and 19th centuries. A genetic cluster associated with the hypogaea/aequatoriana/peruviana varieties, with accessions coming primarily from Bolivia, Peru, and Ecuador, is consistent with these having been the earliest landraces. The genetics, phenotypic characteristics, and biogeography are all consistent with previous reports of tetraploid peanut originating in Southeast Bolivia. Analysis of the genotype data indicates an early genetic radiation, followed by regional distribution of major genetic classes through South America, and then a global dissemination that retains much of the early genetic diversity in peanut. Comparison of the genotypic data relative to alleles from the diploid progenitors also indicates that subgenome exchanges, both large and small, have been major contributors to the genetic diversity in peanut., (Copyright © 2020 Otyama et al.)- Published
- 2020
- Full Text
- View/download PDF
11. Family-Specific Gains and Losses of Protein Domains in the Legume and Grass Plant Families.
- Author
-
Yadav A, Fernández-Baca D, and Cannon SB
- Abstract
Protein domains can be regarded as sections of protein sequences capable of folding independently and performing specific functions. In addition to amino-acid level changes, protein sequences can also evolve through domain shuffling events such as domain insertion, deletion, or duplication. The evolution of protein domains can be studied by tracking domain changes in a selected set of species with known phylogenetic relationships. Here, we conduct such an analysis by defining domains as "features" or "descriptors," and considering the species (target + outgroup) as instances or data-points in a data matrix. We then look for features (domains) that are significantly different between the target species and the outgroup species. We study the domain changes in 2 large, distinct groups of plant species: legumes (Fabaceae) and grasses (Poaceae), with respect to selected outgroup species. We evaluate 4 types of domain feature matrices: domain content, domain duplication, domain abundance, and domain versatility. The 4 types of domain feature matrices attempt to capture different aspects of domain changes through which the protein sequences may evolve-that is, via gain or loss of domains, increase or decrease in the copy number of domains along the sequences, expansion or contraction of domains, or through changes in the number of adjacent domain partners. All the feature matrices were analyzed using feature selection techniques and statistical tests to select protein domains that have significant different feature values in legumes and grasses. We report the biological functions of the top selected domains from the analysis of all the feature matrices. In addition, we also perform domain-centric gene ontology (dcGO) enrichment analysis on all selected domains from all 4 feature matrices to study the gene ontology terms associated with the significantly evolving domains in legumes and grasses. Domain content analysis revealed a striking loss of protein domains from the Fanconi anemia (FA) pathway, the pathway responsible for the repair of interstrand DNA crosslinks. The abundance analysis of domains found in legumes revealed an increase in glutathione synthase enzyme, an antioxidant required from nitrogen fixation, and a decrease in xanthine oxidizing enzymes, a phenomenon confirmed by previous studies. In grasses, the abundance analysis showed increases in domains related to gene silencing which could be due to polyploidy or due to enhanced response to viral infection. We provide a docker container that can be used to perform this analysis workflow on any user-defined sets of species, available at https://cloud.docker.com/u/akshayayadav/repository/docker/akshayayadav/protein-domain-evolution-project., Competing Interests: Declaration of conflicting interests:The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article., (© The Author(s) 2020.)
- Published
- 2020
- Full Text
- View/download PDF
12. Cercis : A Non-polyploid Genomic Relic Within the Generally Polyploid Legume Family.
- Author
-
Stai JS, Yadav A, Sinou C, Bruneau A, Doyle JJ, Fernández-Baca D, and Cannon SB
- Abstract
Based on evolutionary, phylogenomic, and synteny analyses of genome sequences for more than a dozen diverse legume species as well as analysis of chromosome counts across the legume family, we conclude that the genus Cercis provides a plausible model for an early evolutionary form of the legume genome. The small Cercis genus is in the earliest-diverging clade in the earliest-diverging legume subfamily (Cercidoideae). The Cercis genome is physically small, and has accumulated mutations at an unusually slow rate compared to other legumes. Chromosome counts across 477 legume genera, combined with phylogenetic reconstructions and histories of whole-genome duplications, suggest that the legume progenitor had 7 chromosomes - as does Cercis . We propose a model in which a legume progenitor, with 7 chromosomes, diversified into species that would become the Cercidoideae and the remaining legume subfamilies; then speciation in the Cercidoideae gave rise to the progenitor of the Cercis genus. There is evidence for a genome duplication in the remaining Cercidoideae, which is likely due to allotetraploidy involving hybridization between a Cercis progenitor and a second diploid species that existed at the time of the polyploidy event. Outside the Cercidoideae, a set of probably independent whole-genome duplications gave rise to the five other legume subfamilies, at least four of which have predominant counts of 12-14 chromosomes among their early-diverging taxa. An earlier study concluded that independent duplications occurred in the Caesalpinioideae, Detarioideae, and Papilionoideae. We conclude that Cercis may be unique among legumes in lacking evidence of polyploidy, a process that has shaped the genomes of all other legumes thus far investigated.
- Published
- 2019
- Full Text
- View/download PDF
13. An efficient algorithm for testing the compatibility of phylogenies with nested taxa.
- Author
-
Deng Y and Fernández-Baca D
- Abstract
Background: Semi-labeled trees generalize ordinary phylogenetic trees, allowing internal nodes to be labeled by higher-order taxa. Taxonomies are examples of semi-labeled trees. Suppose we are given collection [Formula: see text] of semi-labeled trees over various subsets of a set of taxa. The ancestral compatibility problem asks whether there is a semi-labeled tree that respects the clusterings and the ancestor/descendant relationships implied by the trees in [Formula: see text]. The running time and space usage of the best previous algorithm for testing ancestral compatibility depend on the degrees of the nodes in the trees in [Formula: see text]., Results: We give a algorithm for the ancestral compatibility problem that runs in [Formula: see text] time and uses [Formula: see text] space, where [Formula: see text] is the total number of nodes and edges in the trees in [Formula: see text]., Conclusions: Taxonomies enable researchers to expand greatly the taxonomic coverage of their phylogenetic analyses. The running time of our method does not depend on the degrees of the nodes in the trees in [Formula: see text]. This characteristic is important when taxonomies-which can have nodes of high degree-are used.
- Published
- 2017
- Full Text
- View/download PDF
14. Assessing approaches for inferring species trees from multi-copy genes.
- Author
-
Chaudhary R, Boussau B, Burleigh JG, and Fernández-Baca D
- Subjects
- Computer Simulation, Gene Deletion, Gene Duplication, Sequence Analysis, DNA standards, Software standards, Classification methods, Phylogeny, Sequence Analysis, DNA methods
- Abstract
With the availability of genomic sequence data, there is increasing interest in using genes with a possible history of duplication and loss for species tree inference. Here we assess the performance of both nonprobabilistic and probabilistic species tree inference approaches using gene duplication and loss and coalescence simulations. We evaluated the performance of gene tree parsimony (GTP) based on duplication (Only-dup), duplication and loss (Dup-loss), and deep coalescence (Deep-c) costs, the NJst distance method, the MulRF supertree method, and PHYLDOG, which jointly estimates gene trees and species tree using a hierarchical probabilistic model. We examined the effects of gene tree and species sampling, gene tree error, and duplication and loss rates on the accuracy of phylogenetic estimates. In the 10-taxon duplication and loss simulation experiments, MulRF is more accurate than the other methods when the duplication and loss rates are low, and Dup-loss is generally the most accurate when the duplication and loss rates are high. PHYLDOG performs well in 10-taxon duplication and loss simulations, but its run time is prohibitively long on larger data sets. In the larger duplication and loss simulation experiments, MulRF outperforms all other methods in experiments with at most 100 taxa; however, in the larger simulation, Dup-loss generally performs best. In all duplication and loss simulation experiments with more than 10 taxa, all methods perform better with more gene trees and fewer missing sequences, and they are all affected by gene tree error. Our results also highlight high levels of error in estimates of duplications and losses from GTP methods and demonstrate the usefulness of methods based on generic tree distances for large analyses., (© The Author(s) 2014. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)
- Published
- 2015
- Full Text
- View/download PDF
15. STBase: one million species trees for comparative biology.
- Author
-
McMahon MM, Deepak A, Fernández-Baca D, Boss D, and Sanderson MJ
- Subjects
- User-Computer Interface, Biology methods, Databases, Genetic, Trees classification, Trees genetics
- Abstract
Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.
- Published
- 2015
- Full Text
- View/download PDF
16. MulRF: a software package for phylogenetic analysis using multi-copy gene trees.
- Author
-
Chaudhary R, Fernández-Baca D, and Burleigh JG
- Subjects
- Algorithms, Computer Simulation, Evolution, Molecular, Humans, Programming Languages, Gene Dosage genetics, Phylogeny, Sequence Analysis, DNA methods, Software
- Abstract
Summary: MulRF is a platform-independent software package for phylogenetic analysis using multi-copy gene trees. It seeks the species tree that minimizes the Robinson-Foulds (RF) distance to the input trees using a generalization of the RF distance to multi-labeled trees. The underlying generic tree distance measure and fast running time make MulRF useful for inferring phylogenies from large collections of gene trees, in which multiple evolutionary processes as well as phylogenetic error may contribute to gene tree discord. MulRF implements several features for customizing the species tree search and assessing the results, and it provides a user-friendly graphical user interface (GUI) with tree visualization. The species tree search is implemented in C++ and the GUI in Java Swing., Availability: MulRF's executable as well as sample datasets and manual are available at http://genome.cs.iastate.edu/CBL/MulRF/, and the source code is available at https://github.com/ruchiherself/MulRFRepo., Contact: ruchic@ufl.edu, Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2015
- Full Text
- View/download PDF
17. Enumerating all maximal frequent subtrees in collections of phylogenetic trees.
- Author
-
Deepak A and Fernández-Baca D
- Abstract
Background: A common problem in phylogenetic analysis is to identify frequent patterns in a collection of phylogenetic trees. The goal is, roughly, to find a subset of the species (taxa) on which all or some significant subset of the trees agree. One popular method to do so is through maximum agreement subtrees (MASTs). MASTs are also used, among other things, as a metric for comparing phylogenetic trees, computing congruence indices and to identify horizontal gene transfer events., Results: We give algorithms and experimental results for two approaches to identify common patterns in a collection of phylogenetic trees, one based on agreement subtrees, called maximal agreement subtrees, the other on frequent subtrees, called maximal frequent subtrees. These approaches can return subtrees on larger sets of taxa than MASTs, and can reveal new common phylogenetic relationships not present in either MASTs or the majority rule tree (a popular consensus method). Our current implementation is available on the web at https://code.google.com/p/mfst-miner/., Conclusions: Our computational results confirm that maximal agreement subtrees and all maximal frequent subtrees can reveal a more complete phylogenetic picture of the common patterns in collections of phylogenetic trees than maximum agreement subtrees; they are also often more resolved than the majority rule tree. Further, our experiments show that enumerating maximal frequent subtrees is considerably more practical than enumerating ordinary (not necessarily maximal) frequent subtrees.
- Published
- 2014
- Full Text
- View/download PDF
18. Characterizing compatibility and agreement of unrooted trees via cuts in graphs.
- Author
-
Vakati S and Fernández-Baca D
- Abstract
Background: Deciding whether there is a single tree -a supertree- that summarizes the evolutionary information in a collection of unrooted trees is a fundamental problem in phylogenetics. We consider two versions of this question: agreement and compatibility. In the first, the supertree is required to reflect precisely the relationships among the species exhibited by the input trees. In the second, the supertree can be more refined than the input trees. Testing for compatibility is an NP-complete problem; however, the problem is solvable in polynomial time when the number of input trees is fixed. Testing for agreement is also NP-complete, but it is not known whether it is fixed-parameter tractable. Compatibility can be characterized in terms of the existence of a specific kind of triangulation in a structure known as the display graph. Alternatively, it can be characterized as a chordal graph sandwich problem in a structure known as the edge label intersection graph. No characterization of agreement was known., Results: We present a simple and natural characterization of compatibility in terms of minimal cuts in the display graph, which is closely related to compatibility of splits. We then derive a characterization for agreement., Conclusions: Explicit characterizations of tree compatibility and agreement are essential to finding practical algorithms for these problems. The simplicity of the characterizations presented here could help to achieve this goal.
- Published
- 2014
- Full Text
- View/download PDF
19. Inferring species trees from incongruent multi-copy gene trees using the Robinson-Foulds distance.
- Author
-
Chaudhary R, Burleigh JG, and Fernández-Baca D
- Abstract
Background: Constructing species trees from multi-copy gene trees remains a challenging problem in phylogenetics. One difficulty is that the underlying genes can be incongruent due to evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. Gene tree estimation errors may further exacerbate the difficulties of species tree estimation., Results: We present a new approach for inferring species trees from incongruent multi-copy gene trees that is based on a generalization of the Robinson-Foulds (RF) distance measure to multi-labeled trees (mul-trees). We prove that it is NP-hard to compute the RF distance between two mul-trees; however, it is easy to calculate this distance between a mul-tree and a singly-labeled species tree. Motivated by this, we formulate the RF problem for mul-trees (MulRF) as follows: Given a collection of multi-copy gene trees, find a singly-labeled species tree that minimizes the total RF distance from the input mul-trees. We develop and implement a fast SPR-based heuristic algorithm for the NP-hard MulRF problem.We compare the performance of the MulRF method (available at http://genome.cs.iastate.edu/CBL/MulRF/) with several gene tree parsimony approaches using gene tree simulations that incorporate gene tree error, gene duplications and losses, and/or lateral transfer. The MulRF method produces more accurate species trees than gene tree parsimony approaches. We also demonstrate that the MulRF method infers in minutes a credible plant species tree from a collection of nearly 2,000 gene trees., Conclusions: Our new phylogenetic inference method, based on a generalized RF distance, makes it possible to quickly estimate species trees from large genomic data sets. Since the MulRF method, unlike gene tree parsimony, is based on a generic tree distance measure, it is appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplication and losses as well as phylogenetic error may contribute to gene tree discord. In experiments, the MulRF method estimated species trees accurately and quickly, demonstrating MulRF as an efficient alternative approach for phylogenetic inference from large-scale genomic data sets.
- Published
- 2013
- Full Text
- View/download PDF
20. Extracting conflict-free information from multi-labeled trees.
- Author
-
Deepak A, Fernández-Baca D, and McMahon MM
- Abstract
Background: A multi-labeled tree, or MUL-tree, is a phylogenetic tree where two or more leaves share a label, e.g., a species name. A MUL-tree can imply multiple conflicting phylogenetic relationships for the same set of taxa, but can also contain conflict-free information that is of interest and yet is not obvious., Results: We define the information content of a MUL-tree T as the set of all conflict-free quartet topologies implied by T, and define the maximal reduced form of T as the smallest tree that can be obtained from T by pruning leaves and contracting edges while retaining the same information content. We show that any two MUL-trees with the same information content exhibit the same reduced form. This introduces an equivalence relation among MUL-trees with potential applications to comparing MUL-trees. We present an efficient algorithm to reduce a MUL-tree to its maximally reduced form and evaluate its performance on empirical datasets in terms of both quality of the reduced tree and the degree of data reduction achieved., Conclusions: Our measure of conflict-free information content based on quartets is simple and topologically appealing. In the experiments, the maximally reduced form is often much smaller than the original tree, yet retains most of the taxa. The reduction algorithm is quadratic in the number of leaves and its complexity is unaffected by the multiplicity of leaf labels or the degree of the nodes.
- Published
- 2013
- Full Text
- View/download PDF
21. Incompatible quartets, triplets, and characters.
- Author
-
Shutters B, Vakati S, and Fernández-Baca D
- Abstract
We study a long standing conjecture on the necessary and sufficient conditions for the compatibility of multi-state characters: There exists a function f(r) such that, for any set C of r-state characters, C is compatible if and only if every subset of f(r) characters of C is compatible. We show that for every r≥2, there exists an incompatible set C of Ω(r2)r-state characters such that every proper subset of C is compatible. This improves the previous lower bound of f(r)≥r given by Meacham (1983), and f(4)≥5 given by Habib and To (2011). For the case when r=3, Lam, Gusfield and Sridhar (2011) recently showed that f(3)=3. We give an independent proof of this result and completely characterize the sets of pairwise compatible 3-state characters by a single forbidden intersection pattern.Our lower bound on f(r) is proven via a result on quartet compatibility that may be of independent interest: For every n≥4, there exists an incompatible set Q of Ω(n2) quartets over n labels such that every proper subset of Q is compatible. We show that such a set of quartets can have size at most 3 when n=5, and at most O(n3) for arbitrary n. We contrast our results on quartets with the case of rooted triplets: For every n≥3, if R is an incompatible set of more than n-1 triplets over n labels, then some proper subset of R is incompatible. We show this bound is tight by exhibiting, for every n≥3, a set of n-1 triplets over n taxa such that R is incompatible, but every proper subset of R is compatible.
- Published
- 2013
- Full Text
- View/download PDF
22. Fast local search for unrooted Robinson-Foulds supertrees.
- Author
-
Chaudhary R, Burleigh JG, and Fernández-Baca D
- Subjects
- Cluster Analysis, Computer Simulation, Databases, Factual, Algorithms, Computational Biology methods, Phylogeny
- Abstract
A Robinson-Foulds (RF) supertree for a collection of input trees is a tree containing all the species in the input trees that is at minimum total RF distance to the input trees. Thus, an RF supertree is consistent with the maximum number of splits in the input trees. Constructing RF supertrees for rooted and unrooted data is NP-hard. Nevertheless, effective local search heuristics have been developed for the restricted case where the input trees and the supertree are rooted. We describe new heuristics, based on the Edge Contract and Refine (ECR) operation, that remove this restriction, thereby expanding the utility of RF supertrees. Our experimental results on simulated and empirical data sets show that our unrooted local search algorithms yield better supertrees than those obtained from MRP and rooted RF heuristics in terms of total RF distance to the input trees and, for simulated data, in terms of RF distance to the true tree.
- Published
- 2012
- Full Text
- View/download PDF
23. iGTP: a software package for large-scale gene tree parsimony analysis.
- Author
-
Chaudhary R, Bansal MS, Wehe A, Fernández-Baca D, and Eulenstein O
- Subjects
- Algorithms, Databases, Genetic, Evolution, Molecular, Gene Duplication, Genome, Genomics methods, Phylogeny, Software
- Abstract
Background: The ever-increasing wealth of genomic sequence information provides an unprecedented opportunity for large-scale phylogenetic analysis. However, species phylogeny inference is obfuscated by incongruence among gene trees due to evolutionary events such as gene duplication and loss, incomplete lineage sorting (deep coalescence), and horizontal gene transfer. Gene tree parsimony (GTP) addresses this issue by seeking a species tree that requires the minimum number of evolutionary events to reconcile a given set of incongruent gene trees. Despite its promise, the use of gene tree parsimony has been limited by the fact that existing software is either not fast enough to tackle large data sets or is restricted in the range of evolutionary events it can handle., Results: We introduce iGTP, a platform-independent software program that implements state-of-the-art algorithms that greatly speed up species tree inference under the duplication, duplication-loss, and deep coalescence reconciliation costs. iGTP significantly extends and improves the functionality and performance of existing gene tree parsimony software and offers advanced features such as building effective initial trees using stepwise leaf addition and the ability to have unrooted gene trees in the input. Moreover, iGTP provides a user-friendly graphical interface with integrated tree visualization software to facilitate analysis of the results., Conclusions: iGTP enables, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication, duplication-loss, and deep coalescence reconciliation costs, all from within a convenient graphical user interface.
- Published
- 2010
- Full Text
- View/download PDF
24. Majority-rule (+) consensus trees.
- Author
-
Dong J, Fernández-Baca D, McMorris FR, and Powers RC
- Subjects
- Algorithms, Phylogeny, Systems Biology methods
- Abstract
The construction of a consensus tree to summarize the information of a given set of phylogenetic trees is now routinely a part of many studies in systematic biology. One popular method is the majority-rule consensus tree. In this paper we introduce and characterize a new consensus method that refines the majority-rule tree by adding certain compatible clusters satisfying a simple criterion., (Copyright © 2010 Elsevier Inc. All rights reserved.)
- Published
- 2010
- Full Text
- View/download PDF
25. Robinson-Foulds supertrees.
- Author
-
Bansal MS, Burleigh JG, Eulenstein O, and Fernández-Baca D
- Abstract
Background: Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees., Results: We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Theta(n) and Theta(n2) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods., Conclusions: Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.
- Published
- 2010
- Full Text
- View/download PDF
26. Constructing majority-rule supertrees.
- Author
-
Dong J, Fernández-Baca D, and McMorris FR
- Abstract
Background: Supertree methods combine the phylogenetic information from multiple partially-overlapping trees into a larger phylogenetic tree called a supertree. Several supertree construction methods have been proposed to date, but most of these are not designed with any specific properties in mind. Recently, Cotton and Wilkinson proposed extensions of the majority-rule consensus tree method to the supertree setting that inherit many of the appealing properties of the former., Results: We study a variant of one of Cotton and Wilkinson's methods, called majority-rule (+) supertrees. After proving that a key underlying problem for constructing majority-rule (+) supertrees is NP-hard, we develop a polynomial-size exact integer linear programming formulation of the problem. We then present a data reduction heuristic that identifies smaller subproblems that can be solved independently. While this technique is not guaranteed to produce optimal solutions, it can achieve substantial problem-size reduction. Finally, we report on a computational study of our approach on various real data sets, including the 121-taxon, 7-tree Seabirds data set of Kennedy and Page., Conclusions: The results indicate that our exact method is computationally feasible for moderately large inputs. For larger inputs, our data reduction heuristic makes it feasible to tackle problems that are well beyond the range of the basic integer programming approach. Comparisons between the results obtained by our heuristic and exact solutions indicate that the heuristic produces good answers. Our results also suggest that the majority-rule (+) approach, in both its basic form and with data reduction, yields biologically meaningful phylogenies.
- Published
- 2010
- Full Text
- View/download PDF
27. Properties of majority-rule supertrees.
- Author
-
Dong J and Fernández-Baca D
- Subjects
- Genetic Techniques, Phylogeny
- Published
- 2009
- Full Text
- View/download PDF
28. PhyloFinder: an intelligent search engine for phylogenetic tree databases.
- Author
-
Chen D, Burleigh JG, Bansal MS, and Fernández-Baca D
- Subjects
- Internet, User-Computer Interface, Databases, Genetic, Genomics methods, Information Storage and Retrieval methods, Phylogeny
- Abstract
Background: Bioinformatic tools are needed to store and access the rapidly growing phylogenetic data. These tools should enable users to identify existing phylogenetic trees containing a specified taxon or set of taxa and to compare a specified phylogenetic hypothesis to existing phylogenetic trees., Results: PhyloFinder is an intelligent search engine for phylogenetic databases that we have implemented using trees from TreeBASE. It enables taxonomic queries, in which it identifies trees in the database containing the exact name of the query taxon and/or any synonymous taxon names, and it provides spelling suggestions for the query when there is no match. Additionally, PhyloFinder can identify trees containing descendants or direct ancestors of the query taxon. PhyloFinder also performs phylogenetic queries, in which it identifies trees that contain the query tree or topologies that are similar to the query tree., Conclusion: PhyloFinder can enhance the utility of any tree database by providing tools for both taxonomic and phylogenetic queries as well as visualization tools that highlight the query results and provide links to NCBI and TBMap. An implementation of PhyloFinder using trees from TreeBASE is available from the web client application found in the availability and requirements section.
- Published
- 2008
- Full Text
- View/download PDF
29. Spectral partitioning of phylogenetic data sets based on compatibility.
- Author
-
Chen D, Burleigh GJ, and Fernández-Baca D
- Subjects
- Computer Simulation, Phylogeny, Software
- Abstract
We describe two new methods to partition phylogenetic data sets of discrete characters based on pairwise compatibility. The partitioning methods make no assumptions regarding the phylogeny, model of evolution, or characteristics of the data. The methods first build a compatibility graph, in which each node represents a character in the data set. Edges in the compatibility graph may represent strict compatibility of characters or they may be weighted based on a fractional compatibility scoring procedure that measures how close the characters are to being compatible. Given the desired number of partitions, the partitioning methods then seek to cluster the characters with the highest average pairwise compatibility, so that characters in each cluster are more compatible with each other than they are with characters in the other cluster(s). Partitioning according to these criteria is computationally intractable (NP-hard); however, spectral methods can quickly provide high-quality solutions. We demonstrate that the spectral partitioning effectively identifies characters with different evolutionary histories in simulated data sets, and it is better at highlighting phylogenetic conflict within empirical data sets than previously used partitioning methods.
- Published
- 2007
- Full Text
- View/download PDF
30. Improved heuristics for minimum-flip supertree construction.
- Author
-
Chen D, Eulenstein O, Fernández-Baca D, and Burleigh JG
- Abstract
The utility of the matrix representation with flipping (MRF) supertree method has been limited by the speed of its heuristic algorithms. We describe a new heuristic algorithm for MRF supertree construction that improves upon the speed of the previous heuristic by a factor of n (the number of taxa in the supertree). This new heuristic makes MRF tractable for large-scale supertree analyses and allows the first comparisons of MRF with other supertree methods using large empirical data sets. Analyses of three published supertree data sets with between 267 to 571 taxa indicate that MRF supertrees are equally or more similar to the input trees on average than matrix representation with parsimony (MRP) and modified min-cut supertrees. The results also show that large differences may exist between MRF and MRP supertrees and demonstrate that the MRF supertree method is a practical and potentially more accurate alternative to the nearly ubiquitous MRP supertree method.
- Published
- 2007
31. Performance of flip supertree construction with a heuristic algorithm.
- Author
-
Eulenstein O, Chen D, Burleigh JG, Fernández-Baca D, and Sanderson MJ
- Subjects
- Computer Simulation, Algorithms, Classification methods, Phylogeny
- Abstract
Supertree methods are used to assemble separate phylogenetic trees with shared taxa into larger trees (supertrees) in an effort to construct more comprehensive phylogenetic hypotheses. In spite of much recent interest in supertrees, there are still few methods for supertree construction. The flip supertree problem is an error correction approach that seeks to find a minimum number of changes (flips) to the matrix representation of the set of input trees to resolve their incompatibilities. A previous flip supertree algorithm was limited to finding exact solutions and was only feasible for small input trees. We developed a heuristic algorithm for the flip supertree problem suitable for much larger input trees. We used a series of 48- and 96-taxon simulations to compare supertrees constructed with the flip supertree heuristic algorithm with supertrees constructed using other approaches, including MinCut (MC), modified MC (MMC), and matrix representation with parsimony (MRP). Flip supertrees are generally far more accurate than supertrees constructed using MC or MMC algorithms and are at least as accurate as supertrees built with MRP. The flip supertree method is therefore a viable alternative to other supertree methods when the number of taxa is large.
- Published
- 2004
- Full Text
- View/download PDF
32. Fast algorithms for inferring evolutionary trees.
- Author
-
Agarwala R, Fernández-Baca D, and Slutzki G
- Subjects
- Online Systems, Phylogeny, Algorithms, Biological Evolution
- Abstract
We present algorithms for the perfect phylogeny problem restricted to binary characters. The first algorithm is faster than a previous algorithm by Gusfield when the input matrix for the problem is sparse. Next, we present two online algorithms. For the first of these, the set of species is fixed and the characters are given as input one at a time, while, for the second, the set of characters is fixed and the species are given as input one at a time. These two online algorithms are then combined into an algorithm that can process any sequence of additions and deletions of species and characters.
- Published
- 1995
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.