Diogo B. Lima, Richard H. Valente, Julia Chamot-Rooke, Mathieu Dupré, Louise U. Kurt, André R.F. Silva, Carolina Alves Nicolau, Marlon D.M. Santos, Paulo C. Carvalho, Valmir C. Barbosa, Fiocruz Paraná - Instituto Carlos Chagas / Carlos Chagas Institute [Curitiba, Brazil] (ICC), Fundação Oswaldo Cruz / Oswaldo Cruz Foundation (FIOCRUZ), Réseau International des Instituts Pasteur (RIIP)-Réseau International des Instituts Pasteur (RIIP), Leibniz Forschungsinstitut für Molekulare Pharmakologie = Leibniz Institute for Molecular Pharmacology [Berlin, Germany] (FMP), Leibniz Association, Spectrométrie de Masse pour la Biologie – Mass Spectrometry for Biology (UTechS MSBio), Institut Pasteur [Paris] (IP)-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS)-Université Paris Cité (UPCité), Instituto Oswaldo Cruz / Oswaldo Cruz Institute [Rio de Janeiro] (IOC), Centre de Recherche en Cancérologie et Immunologie Nantes-Angers (CRCINA), Université d'Angers (UA)-Université de Nantes (UN)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS)-Centre hospitalier universitaire de Nantes (CHU Nantes), Instituto Alberto Luiz Coimbra de Pós-Graduação e Pesquisa de Engenharia (COPPE-UFRJ), Universidade Federal do Rio de Janeiro (UFRJ), Universidade Federal do Estado do Rio de Janeiro (UNIRIO), A.R.F.S., L.U.K., M.D.M.S. and V.C.B. acknowledge support from Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). A.R.F.S. acknowledges Instituto Carlos Chagas (ICC). D.B.L., M.D. and J.C.R. acknowledge Institut Pasteur, CNRS, the Agence Nationale de la Recherche (project ANR-15-CE18-0021) and the European Joint Programme One Health EJP from the European Union's Horizon 2020 research and innovation programme (Grant Agreement 773830) for financial support. R.H.V., V.C.B. and P.C.C. acknowledge support from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). V.C.B.
acknowledges a BBP grant from Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ). ANR-15-CE18-0021, PathoTOP, Identification rapide de pathogènes bactériens par protéomique top-down en contexte clinique (2015), European Project: 773830, H2020-SFS-2017-1, One Health EJP (2018), Fundação Oswaldo Cruz (FIOCRUZ), Institut Pasteur [Paris]-Institut de Chimie du CNRS (INC)-Centre National de la Recherche Scientifique (CNRS), Institut National de la Santé et de la Recherche Médicale (INSERM)-Université de Nantes - UFR de Médecine et des Techniques Médicales (UFR MEDECINE), and Université de Nantes (UN)-Université de Nantes (UN)-Centre hospitalier universitaire de Nantes (CHU Nantes)-Centre National de la Recherche Scientifique (CNRS)-Université d'Angers (UA)
International audience; In proteomics, the identification of peptides from mass spectral data can be mathematically described as the partitioning of mass spectra into clusters (i.e., groups of spectra derived from the same peptide). The way partitions are validated is just as important, having evolved side by side with the clustering algorithms themselves and given rise to many partition assessment measures. An assessment measure is said to have a selection bias if, and only if, the probability that a randomly chosen partition scores a high value depends on the number of clusters in the partition. In the context of clustering mass spectra, this might mislead the validation process into favoring clustering algorithms that generate too many (or too few) spectral clusters, regardless of the underlying peptide sequence. A selection bias toward the number of peptides is desirable for proteomics, as it estimates the number of peptides in a complex protein mixture. Here, we introduce an assessment measure that is purposely biased toward the number of peptide ion species. We also introduce a partition assessment framework for proteomics, called the Partition Assessment Tool, and demonstrate its importance by evaluating the performance of eight clustering algorithms on seven proteomics datasets while discussing the trade-offs involved. SIGNIFICANCE: Clustering algorithms are widely adopted in proteomics for several tasks, such as speeding up search engines, generating consensus mass spectra, and aiding in the classification of proteomic profiles. Choosing which algorithm is most fit for the task at hand is not simple, as each algorithm has advantages and disadvantages; furthermore, specifying clustering parameters is also a necessary and fundamental step. One example is deciding whether to generate "pure clusters" or fewer clusters at the cost of accepting noise.
With this as motivation, we verify the performance of several widely adopted algorithms on proteomic datasets and introduce a theoretical framework for drawing conclusions on which approach is suitable for the task at hand.
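
The notion of selection bias defined above can be illustrated with a small simulation. The sketch below is not the assessment measure introduced in this work, nor part of the Partition Assessment Tool; it uses cluster purity, a common measure, purely as a stand-in. The synthetic data (200 spectra from 10 peptides) and the function names `purity`, `random_partition`, and `mean_purity` are assumptions made for illustration. The point is that the mean score of random partitions changes with the number of clusters, which is what the definition calls a selection bias.

```python
import random
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority true class of their cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def random_partition(n_items, n_clusters, rng):
    """Assign each item to one of n_clusters labels uniformly at random.

    Some labels may end up unused, so n_clusters is an upper bound
    on the number of non-empty clusters.
    """
    return [rng.randrange(n_clusters) for _ in range(n_items)]


def mean_purity(labels_true, n_clusters, trials, rng):
    """Average purity over several random partitions with the same cluster budget."""
    scores = [
        purity(labels_true, random_partition(len(labels_true), n_clusters, rng))
        for _ in range(trials)
    ]
    return sum(scores) / trials


# Hypothetical ground truth: 200 spectra, 20 from each of 10 peptides.
rng = random.Random(0)
truth = [i % 10 for i in range(200)]

# Mean purity of random partitions for increasing cluster budgets.
results = {k: mean_purity(truth, k, trials=50, rng=rng) for k in (2, 10, 50, 200)}
for k, score in results.items():
    print(f"up to {k:3d} clusters: mean purity of random partitions = {score:.3f}")
```

Under these assumptions, the mean purity of random partitions rises as the cluster budget grows, even though the partitions carry no information about the peptides. A validation based on such a measure alone would therefore tend to favor algorithms that generate many small clusters.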