To the Editor: Recently, an article by Simoni et al. (2000), who used (i) SAAP analysis to analyze the population frequencies of mtDNA haplogroups and (ii) AIDA analysis to examine both the frequency and the sequence similarity of truncated mtDNA sequences, appeared in this Journal. The main outcome of their study was that “the overall patterns of mtDNA diversity appear to be poorly significant in Europe.” The raw data comprised 2,619 hypervariable segment I (HVS-I) sequences (denoted as “HVR-I” [hypervariable region I] sequences by Simoni et al. [2000]) that were obtained from 36 regions or populations of Europe, the Near East, and the Caucasus and that were collected from both the literature and unpublished sources. Simoni et al. ostensibly grouped the HVS-I sequences according to haplogroup motifs proposed elsewhere (Richards et al. 1998), and they reported the resulting frequencies for each region/population in table 3 in their study. We have checked the input data displayed in table 3 and have found serious technical errors affecting numerous entries. More critically, the mtDNA categories that they report correspond neither to their own criteria nor to the haplogroup definitions established in the literature (to which they refer). Furthermore, their decision to truncate HVS-I information (and to disregard RFLP information) renders these data inadequate to differentiate even African and East Asian sequences from European sequences in many cases.

Inspection of table 3 in the study by Simoni et al. (2000) reveals that (i) the data in the “Galicia” and “Spain: Central” rows have been, in part, crossed over, (ii) the data in the “Belgium,” “Alps,” and “Turkey” rows have been computed with the use of sample sizes smaller than those reported in table 1 in the same study, (iii) the haplogroup “J” column has been totally randomized, and (iv) the “Other” column is complementary to the last four “superhaplogroup” columns but not to the first 11 haplogroup columns.

As for item (iii), almost all positive entries in the haplogroup “J” column have been either displaced or calculated with the use of sample sizes corresponding to nearby rows. Hence, most entries in this column diverge widely from the real haplogroup J frequencies (see the last column of table 1 in the present study).

Table 1. Haplogroup J Frequencies According to Simoni et al. (2000), a Crude Default Criterion, and Inference in the Present Study

As an example of their haplogroup assignment, Simoni et al. (2000) specifically referred to the motif 16069T–16126C for haplogroup J, but they overlooked the fact that this criterion cannot formally be applied to the sequences in the study by Richards et al. (1996), since these were reported only between 16090 and 16365 and position 16069 therefore lies outside the reported range. This might explain some of the many “0” entries in the haplogroup “J” column of table 3 in the Simoni et al. study (see table 1 in the present study). Simoni et al. should have either adopted the haplogroup J frequencies reported by Richards et al. (1996), excluded these population samples from their study, or trimmed all data to the shortest common segment. In the last case, by employing the motif 16126C–16294C, one could take the default cluster JT-T (comprising all JT sequences that are not T) as a crude default criterion for haplogroup J (see table 1 in the present study).

The discrepancies in haplogroup frequencies are by no means restricted to haplogroup J. Table 2 in the present study shows the marked contrast between published haplogroup frequencies and those assumed by Simoni et al. (2000) for the well-characterized Tuscan, Druze, and Adygei samples (which were typed for RFLPs as well as for HVS-I sequences by Torroni et al. [1996] and Macaulay et al. [1999]). The large differences in frequency for haplogroup H, the most-common European haplogroup, are due to the premise of Simoni et al. (2000) that haplogroup “H contains all sequences . . . 
that show none of the 22 substitutions considered in this study.” This extreme simplification results, on the one hand, in the dumping of large numbers of haplogroup H mtDNAs mainly into the default category “Other” and, on the other hand, in the inclusion of several non-H sequences within their haplogroup H category. For instance, by their criterion, 10/20 haplogroup H mtDNAs from the Tuscan sample (Torroni et al. 1996) would no longer be scored as “H,” whereas the U sequence 16051G–16309G–16318C would be scored as “H.” In consequence, the haplogroup H category described by Simoni et al. (2000) is bound to be highly polyphyletic in the mtDNA genealogy and does not reflect the spatial patterns of haplogroup H.

Table 2. Haplogroup Frequencies, According to Simoni et al. (2000) vs. the Original Studies, in Tuscan, Druze, and Adygei Populations

At this point, it is important to clarify what haplogroup classification entails. An mtDNA haplogroup, when properly defined, is a monophyletic clade of the mtDNA genealogy. Originally, high-resolution RFLP analysis (employing 14 enzymes) had been used for identification of clades by signature sites (Torroni et al. 1992, 1993, 1994a, 1994b, 1996; Chen et al. 1995), and current haplogroup nomenclature originated in that context. In retrospect, this approach is indeed quite reliable, although recurrent changes at a few sites, such as 10394 DdeI, may occasionally cause problems. Potential ambiguities can largely be resolved by incorporation of information from other segments of mtDNA sequences or specific positions of the coding regions (Torroni et al. 1997; Brown et al. 1998; Starikovskaya et al. 1998; Macaulay et al. 1999; Quintana-Murci et al. 1999; Schurr et al. 1999). For instance, haplogroup K is now understood to be a clade (as are U1–U6) within haplogroup U. HVS-I data in combination with partial RFLPs can sometimes serve as a satisfactory substitute for a full RFLP analysis (Rando et al. 1998, 2000; Kivisild et al. 
1999a, 1999b). Unfortunately, HVS-I data alone, which have been produced en masse, often do not contain sufficient information for confident assignment of haplogroup affiliation. The truncation of the HVS-I data to only 13–22 variant positions, as performed by Simoni et al. (2000), yields even poorer results. For example, the motif 16223T–16278T, which was used by Simoni et al. to identify haplogroup X, would transfer most African L1/L2 sequences (Watson et al. 1997; Rando et al. 1998) into the then artefactual category “X.” For Europe, this is relevant insofar as a few L1/L2 sequences are present in Iberia (Rocha et al. 1999), and an African L1c sequence with the motif 16223T–16278T is even present in the British data (Piercy et al. 1993). In addition, as was previously pointed out (Torroni et al. 1996; Macaulay et al. 1999), one has to be prepared for recurrent mutations in the HVS-I motifs (compare also figs. 4, 5, 8, and 9 of the study by Richards et al. [1998]). For instance, the frequency discrepancy (17.8% vs. 26.7%) for haplogroup X in the Druze sample (see table 2 in the present study) is due to the fact that Simoni et al. did not include four haplogroup X mtDNAs that have mutated to 16223C. Another of the many possible examples of misclassification caused by the use of truncated motifs is illustrated by 16129A–16223T, the motif used by Simoni et al. for classification of haplogroup I mtDNAs. Use of this truncated motif has led them to classify both the Asian haplogroup C mtDNAs (16129A–16223T–16298C–16327T) of the Adygei (6.0%) and the East African haplogroup M1 mtDNA (16129A–16189C–16223T–16249C–16311C–16359C) of the Druze (2.2%) as members of haplogroup I (see table 2 in the present study).

The issue of haplogroups affects only the SAAP analysis. However, there are also serious difficulties with the AIDA analysis. Ideally, AIDA should be applied to full DNA-sequence data, but Simoni et al. (2000) included only 22 of the 241 variant positions (roughly 9%).
One cannot expect that such a truncated data set would show much evidence of geographic patterns within Europe. Most of the haplogroup diagnostic variants in western Eurasian mtDNA are very ancient, and they probably evolved in the Near East and subsequently spread to Europe (Torroni et al. 1998; Macaulay et al. 1999); in any event, they occur throughout western Eurasia. The more recent “rare substitutions,” which have evolved since the earlier dispersals and which Simoni et al. (2000) discarded as “statistical noise,” are precisely those that are most likely to show regional distributions. The exclusion of such mutations severely restricts the capacity to identify phylogeographic units and, thus, is bound to have seriously reduced the power of the approach to detect autocorrelation.

Even when haplogroup assignment is done with care, failure to detect significant clines in haplogroup frequencies does not prove the absence of any spatial structure in the mtDNA pool. Such structure would rather be manifest at a phylogenetically finer scale (defined on the basis of more-recent mutations). In any case, one would not expect that meaningful patterns of mtDNA diversity could emerge from analyses based on categories with no demonstrable phylogenetic support.
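
The misassignments caused by truncated motifs, as discussed above, can be illustrated with a minimal Python sketch. It is not the procedure used by Simoni et al. (2000) or in the present study: the function names, the representation of a sequence as a list of variants, and the simple subset-matching rule are assumptions made for illustration only. The motifs and the Adygei and Druze sequences are those quoted in the text; the truncated J sequence is a hypothetical example of a sequence reported only between 16090 and 16365, and the reading of “JT” as “carries 16126C” is likewise an assumption of the sketch.

```python
# Illustrative sketch only; names and matching rule are assumptions,
# not the published method. Variants are written "<position><base>".

# Truncated HVS-I motifs attributed to Simoni et al. (2000) in the text.
TRUNCATED_MOTIFS = {
    "J": {"16069T", "16126C"},
    "X": {"16223T", "16278T"},
    "I": {"16129A", "16223T"},
}


def classify_by_truncated_motif(variants):
    """Return every label whose motif is fully contained in the sequence."""
    observed = set(variants)
    return sorted(
        label for label, motif in TRUNCATED_MOTIFS.items() if motif <= observed
    )


def crude_default_j(variants):
    """Crude default criterion: JT (assumed: 16126C) that is not T (16294C)."""
    observed = set(variants)
    return "16126C" in observed and "16294C" not in observed


# Sequences quoted in the text.
ADYGEI_C = ["16129A", "16223T", "16298C", "16327T"]  # Asian haplogroup C
DRUZE_M1 = ["16129A", "16189C", "16223T", "16249C", "16311C", "16359C"]  # M1
L1C_MOTIF = ["16223T", "16278T"]  # motif carried by the African L1c sequence

# Hypothetical J sequence reported only between 16090 and 16365,
# so that 16069T cannot be observed.
TRUNCATED_J = ["16126C"]

if __name__ == "__main__":
    for name, seq in [
        ("Adygei C", ADYGEI_C),
        ("Druze M1", DRUZE_M1),
        ("L1c motif", L1C_MOTIF),
        ("truncated J", TRUNCATED_J),
    ]:
        print(name, classify_by_truncated_motif(seq), crude_default_j(seq))
```

Under these assumptions, the haplogroup C and M1 sequences are both labeled “I” and the L1c motif is labeled “X,” whereas the hypothetical truncated J sequence receives no label from the 16069T–16126C motif, although it satisfies the crude default criterion.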