The sum of two halves may be different from the whole. Effects of splitting sequencing samples across lanes
- Author
- Eleanor C Williams [0000-0002-8023-9550], Arash Shahsavari [0000-0002-8396-3958], Ruben Chazarra-Gil [0000-0002-8016-3613], Irina Mohorianu [0000-0003-4863-761X], and Apollo - University of Cambridge Repository
- Subjects
Computer science, Human Genome, Robustness (evolution), Reproducibility of Results, High-Throughput Nucleotide Sequencing, Computational biology, 10×, DNA sequencing, Expression (mathematics), differential expression, enrichment analysis, mRNAseq, Variable (computer science), Identification (information), cell type calling, Single cell sequencing, FOS: Biological sciences, Genetics, Generic health relevance, ChIPseq, Peak calling, smartSeq, Level of detail, sample splitting
- Abstract
Over the past two decades, the advances in high throughput sequencing (HTS) enabled the characterisation of biological processes at an unprecedented level of detail; as a result the vast majority of hypotheses in molecular biology rely on analyses of HTS data. However, achieving increased robustness and reproducibility of results remains one of the main challenges across analyses. Although variability in results may be introduced at various stages, such as alignment, summarisation or detection of differences in expression, one source of variability has been systematically omitted: the consequences of choices that influence the sequencing design which propagate through analyses and introduce an additional layer of technical variation. In this study, we illustrate qualitative and quantitative differences in results arising from the splitting of samples across lanes, on bulk and single cell sequencing outputs. For bulk mRNAseq data, we focus on differential expression and enrichment analyses; for bulk ChIPseq data, we investigate the effect on peak calling, and the peaks’ properties. At single cell level, we concentrate on the identification of cell subpopulations (cells clustered based on their expression profiles). We rely on the identity of markers used for assigning cell identities; both smartSeq and 10x data are presented. We conclude that the observed reduction in the number of unique sequenced fragments reduces the level of detail on which the different prediction approaches depend. Further, the sequencing stochasticity adds in a weighting bias corroborated with variable sequencing depths.
- Published
- 2022