Characterizing the impacts of dataset imbalance on single-cell data integration.

Authors :
Maan H
Zhang L
Yu C
Geuenich MJ
Campbell KR
Wang B
Source :
Nature biotechnology [Nat Biotechnol] 2024 Mar 01. Date of Electronic Publication: 2024 Mar 01.
Publication Year :
2024
Publisher :
Ahead of Print

Abstract

Computational methods for integrating single-cell transcriptomic data from multiple samples and conditions do not generally account for imbalances in the cell types measured in different datasets. In this study, we examined how differences in the cell types present, the number of cells per cell type and the cell type proportions across samples affect downstream analyses after integration. The Iniquitate pipeline assesses the robustness of integration results after perturbing the degree of imbalance between datasets. Benchmarking of five state-of-the-art single-cell RNA sequencing integration techniques in 2,600 integration experiments indicates that sample imbalance has substantial impacts on downstream analyses and the biological interpretation of integration results. Imbalance perturbation led to statistically significant variation in unsupervised clustering, cell type classification, differential expression and marker gene annotation, query-to-reference mapping and trajectory inference. We quantified the impacts of imbalance through newly introduced properties: aggregate cell type support and minimum cell type center distance. To better characterize and mitigate impacts of imbalance, we introduce balanced clustering metrics and imbalanced integration guidelines for integration method users. (© 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.)

Details

Language :
English
ISSN :
1546-1696
Database :
MEDLINE
Journal :
Nature biotechnology
Publication Type :
Academic Journal
Accession number :
38429430
Full Text :
https://doi.org/10.1038/s41587-023-02097-9