To the Editor:

Stephens et al. (2001) (henceforth referred to as “SSD”) introduced a new statistical method for haplotype reconstruction, called “PHASE,” that has three major advantages over existing approaches, including EM. The letter from Zhang et al. (2001 [in this issue]) (henceforth referred to as “ZPKZ”) questions one of these, namely, the increased accuracy of PHASE.

ZPKZ report two kinds of comparisons. The first is based on “empirical population haplotype frequency data,” and the second is based on data for which the true phase is determined experimentally. Only the second of these types is actually based on “real” data in the usual sense, and when these data are used, PHASE does considerably outperform EM. We report comparisons below, using three other real data sets. In each case, PHASE provides haplotype reconstructions that are more accurate than those provided by EM, sometimes considerably so.

Much of the discussion by ZPKZ (as well as, apparently, their discouraging conclusion for PHASE) is based on their first set of comparisons. Unfortunately, their terminology may cause some confusion. The “empirical” haplotype frequencies on which they base their comparisons are not, in fact, haplotype counts in real data. Instead (S.
Zhang, personal communication), although not mentioned in their letter, the “empirical” frequencies are actually estimates, provided by the EM algorithm, from genotype data.

PHASE is best thought of as a Bayesian method for haplotype reconstruction. Its potential to improve on maximum likelihood (and, hence, on EM) comes from its use of prior information. In particular, it incorporates the prior knowledge that unresolved haplotypes will tend to be the same as, or similar to, known haplotypes. When this is true in actual data, PHASE will typically provide better haplotype estimates. The comparisons by ZPKZ suggest that, when such clustering of haplotypes is not present, PHASE does not perform systematically worse than EM.

As emphasized by SSD, although PHASE uses a coalescent approximation to quantify the fact that haplotypes tend to be similar to one another, PHASE does not depend on the assumptions underlying the coalescent model, and we would expect it to perform well under much more general settings, including population structure, recombination, and selection.

In collaboration with H. Ackerman, we have compared EM and PHASE for haplotypes determined from pedigree data at the IL8 and TNF loci. At the IL8 locus, Hull et al. (2001) typed six single-nucleotide polymorphisms (SNPs) over 4.5 kb in 61 Gambian parents-child triples. Of the 122 parents, 102 had haplotypes that were unambiguous or that could be determined from the child’s genotype. At the TNF locus, H. Ackerman (unpublished data) typed 12 SNPs over 4.3 kb in 53 Gambian parents-child triples, and the same procedure gave 96 unambiguous parents.
For each locus, we applied EM and PHASE to the subset of unambiguous parents and computed the error rates. At IL8, error rates were 7/31 for EM and 6/31 for PHASE; at TNF, error rates were 24/88 for EM and 10/88 for PHASE. Thus, PHASE reduced error rates in these data sets by 14% and 58%, respectively.

We are grateful to S.M. Fullerton, G. Ybazeta, and A. DiRienzo (personal communication) for allowing us to report the following results of their unpublished comparison of PHASE and EM on molecularly determined haplotypes at the CAPN10 locus. They typed 46 individuals from four populations (n=11, 12, 11, and 12) at 14 biallelic SNPs and computed the discrepancy for each algorithm applied to the combined sample and to the four population samples separately. PHASE consistently outperformed EM, reducing discrepancy by as much as 69% (table 1).

Table 1
Discrepancies Obtained by PHASE and EM on Genotypes from the CAPN10 Locus

                Discrepancy Obtained by
Sample          EM Method    PHASE Method
Combined        .13          .05
Population 1    .14          .09
Population 2    .26          .08
Population 3    .00          .00
Population 4    .23          .13

In summary:

1. PHASE typically provides more-accurate haplotype estimates than do EM and other existing methods, when there is “clustering” in the true haplotype configuration.
2. Such clustering would usually be expected in real data, on population genetics grounds, whether or not the data are well modelled by the standard coalescent.
3. PHASE outperforms EM for the one real data set in ZPKZ and for the three other real data sets we have looked at.
4. Most of the comparisons by ZPKZ are based not on real haplotype data but rather on genotype data from which haplotype frequencies have been estimated by EM. Haplotype frequencies estimated by EM will not necessarily exhibit clustering, even if it is present in the true frequencies. It is thus not surprising (and, perhaps, not directly relevant) that, in most instances, ZPKZ observe similar behavior between EM and PHASE.
5. When the true haplotypes do not exhibit clustering, PHASE does not seem to perform systematically worse than EM.

Thus, although we admit that there will be exceptions, PHASE provides more-accurate haplotype reconstructions than EM for all the real data sets we and ZPKZ have examined and under conditions that seem likely for most other real data sets. In other settings, it performs no worse. In this sense, using PHASE is a low-risk strategy with considerable potential gains; however, increased accuracy is only one of the advantages of PHASE. We continue to regard the other advantages as being at least as important. It remains the case that PHASE is practicable for much larger problems than is EM, and it is the only available method that provides an accurate measure of the uncertainty associated with phase calls, thus guarding against inappropriate overconfidence in statistically reconstructed haplotypes.
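
As a check on the arithmetic, the percentage reductions quoted above (14% at IL8, 58% at TNF, and as much as 69% at CAPN10) can be recomputed from the reported error rates and from the discrepancies in table 1. The following is a minimal sketch only: the function name `relative_reduction` and the dictionary layout are illustrative assumptions, and nothing here is part of PHASE or EM.

```python
from fractions import Fraction


def relative_reduction(baseline, improved):
    """Return the fraction by which `improved` lowers the error relative to `baseline`.

    Returns None when the baseline error is zero, because the ratio is undefined.
    """
    if baseline == 0:
        return None
    return (baseline - improved) / baseline


# Error rates (EM, PHASE) reported in the text for the pedigree data.
pedigree = {
    "IL8": (Fraction(7, 31), Fraction(6, 31)),
    "TNF": (Fraction(24, 88), Fraction(10, 88)),
}

# Discrepancies (EM, PHASE) from table 1 for the CAPN10 locus.
capn10 = {
    "Combined": (0.13, 0.05),
    "Population 1": (0.14, 0.09),
    "Population 2": (0.26, 0.08),
    "Population 3": (0.00, 0.00),
    "Population 4": (0.23, 0.13),
}


def percent(value):
    """Format a reduction as a whole-number percentage, or 'n/a' if undefined."""
    return "n/a" if value is None else f"{round(100 * float(value))}%"


for name, (em, phase) in {**pedigree, **capn10}.items():
    print(f"{name}: {percent(relative_reduction(em, phase))}")
```

Population 3 is reported as "n/a" because both methods have zero discrepancy there, so no relative reduction can be defined.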