1. CWL-Based Analysis Pipeline for Hi-C Data: From FASTQ Files to Matrices.
- Author
- Miura H, Cerbus RT, Noda I, and Hiratani I
- Subjects
- Humans, Genomics methods, Computational Biology methods, Chromatin Immunoprecipitation Sequencing methods, High-Throughput Nucleotide Sequencing methods, Software, Chromatin genetics, Chromatin metabolism, Workflow
- Abstract
Over a decade has passed since the development of the Hi-C method for genome-wide analysis of 3D genome organization. Hi-C utilizes next-generation sequencing (NGS) technology to generate large-scale chromatin interaction data, which have accumulated across a diverse range of species and cell types, particularly in eukaryotes. There is thus a growing need to streamline the process of Hi-C data analysis to utilize these data sets effectively. Hi-C generates much larger data sets than other NGS techniques such as chromatin immunoprecipitation sequencing (ChIP-seq) or RNA-seq, making the data reanalysis process computationally expensive. In an effort to bridge this resource gap, the 4D Nucleome (4DN) Data Portal has reanalyzed approximately 600 Hi-C data sets, allowing users to access and utilize the analyzed data. In this chapter, we provide detailed instructions for the implementation of the Common Workflow Language (CWL)-based Hi-C analysis pipeline adopted by the 4DN Data Portal ecosystem. This reproducible and portable pipeline generates standard Hi-C contact matrices in formats such as .hic or .mcool from FASTQ files. It enables users to output their own Hi-C data in the same format as those registered in the 4DN Data Portal, facilitating comparative analysis using data registered in the portal. Our custom-made scripts are available on GitHub at https://github.com/kuzobuta/4dn_cwl_pipeline. (© 2025. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.)
- Published
- 2025