A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-seq data
- Publication Year :
- 2019
- Publisher :
- Cold Spring Harbor Laboratory, 2019.
Abstract
- High-dimensional data are becoming increasingly common in nearly all areas of science. Developing approaches to analyze these data and understand their meaning is a pressing issue. This is particularly true for the rapidly growing field of single-cell RNA-Seq (scRNA-Seq), a technique that simultaneously measures the expression of tens of thousands of genes in thousands to millions of single cells. The emerging consensus for analysis workflows is to reduce the dimensionality of the dataset before performing downstream analysis, such as assignment of cell types. One problem with this approach is that dimensionality reduction can introduce substantial distortion into the data; consider the familiar example of trying to represent the three-dimensional earth as a two-dimensional map. It is currently unclear whether such distortion affects analysis of scRNA-Seq datasets. Here, we introduce a straightforward approach to quantifying this distortion by comparing the local neighborhoods of points before and after dimensionality reduction. We found that popular techniques like t-SNE and UMAP introduce significant distortion even for relatively simple geometries such as simulated hyperspheres. For scRNA-Seq data, we found the distortion in local neighborhoods was greater than 95% in the 2- and 3-dimensional spaces typically used for downstream analysis. This high level of distortion can readily introduce important errors into cell type identification, pseudotime ordering, and other analyses that rely on local relationships. We found that principal component analysis can generate accurate embeddings of the data, but only when using dimensionalities that are much higher than those typically used in scRNA-Seq analysis. We suggest approaches to take these findings into account and call for a new generation of dimensionality reduction algorithms that can accurately embed high-dimensional data in their true latent dimension.
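The idea of comparing local neighborhoods before and after dimensionality reduction can be illustrated with a minimal sketch. The abstract does not state the exact formula, so everything below is an assumption made for illustration: the k-nearest-neighbor definition of a neighborhood, the use of Jaccard distance to compare neighbor sets, the value k=20, the function names `knn_indices` and `mean_neighborhood_distortion`, and the synthetic Gaussian data reduced to two dimensions with PCA.

```python
import numpy as np

def knn_indices(points, k):
    """Return, for each row, the indices of its k nearest neighbors (self excluded)."""
    # Pairwise squared Euclidean distances via broadcasting.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def mean_neighborhood_distortion(high, low, k=20):
    """Mean Jaccard distance between k-NN sets in the original and reduced spaces.

    Assumed metric for illustration: 0 means every neighborhood is preserved,
    1 means no neighbor is shared between the two spaces.
    """
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    per_point = []
    for a, b in zip(nn_high, nn_low):
        shared = len(set(a) & set(b))
        union = 2 * k - shared  # both sets have exactly k members
        per_point.append(1.0 - shared / union)
    return float(np.mean(per_point))

# Synthetic stand-in for an expression matrix: 200 points in 50 dimensions.
rng = np.random.default_rng(0)
high = rng.normal(size=(200, 50))

# Two-dimensional PCA embedding computed with an SVD of the centered data.
centered = high - high.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
low = centered @ vt[:2].T

score = mean_neighborhood_distortion(high, low, k=20)
```

Passing the same matrix as both arguments gives a distortion of zero, which is a convenient sanity check before applying a function like this to t-SNE, UMAP, or PCA embeddings of real data.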
- Subjects :
- Clustering high-dimensional data
0303 health sciences
Computer science
Dimensionality reduction
computer.software_genre
03 medical and health sciences
Identification (information)
0302 clinical medicine
Dimensional reduction
Distortion
Principal component analysis
Metric (mathematics)
Data mining
computer
030217 neurology & neurosurgery
030304 developmental biology
Curse of dimensionality
Details
- Database :
- OpenAIRE
- Accession number :
- edsair.doi.dedup.....f46e3f3b9e81f0a40dee62edd76a4a33
- Full Text :
- https://doi.org/10.1101/689851