Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files.

Authors :
Du, Lianming
Liu, Qin
Fan, Zhenxin
Tang, Jie
Zhang, Xiuyue
Price, Megan
Yue, Bisong
Zhao, Kelei
Source :
Briefings in Bioinformatics; Jul 2021, Vol. 22 Issue 4, p1-8, 8p
Publication Year :
2021

Abstract

FASTA and FASTQ are the most widely used biological data formats and have become the de facto standard for exchanging sequence data between bioinformatics tools. With the avalanche of next-generation sequencing data, the amount of sequence data deposited and accessed in FASTA/Q formats is increasing dramatically. However, existing tools are very inefficient at random retrieval of subsequences because they must load the entire index into memory. In addition, most existing tools cannot build an index for large FASTA/Q files because of limited memory. Furthermore, these tools do not support random access to sequences in FASTA/Q files compressed by gzip, which most public databases use to save storage. In this study, we developed pyfastx, a versatile Python package with commonly used command-line tools, to overcome the above limitations. Compared to other tools, pyfastx yielded the highest performance in index building and random access to sequences, particularly when dealing with large FASTA/Q files containing hundreds of millions of sequences. A key advantage of pyfastx over other tools is that it offers an efficient way to randomly extract subsequences directly from gzip-compressed FASTA/Q files without uncompressing them beforehand. Pyfastx can easily be installed from PyPI (https://pypi.org/project/pyfastx) and the source code is freely available at https://github.com/lmdu/pyfastx. [ABSTRACT FROM AUTHOR]
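
To make the idea of index-based random access concrete, the following is a minimal sketch using only the Python standard library. It is not pyfastx's implementation or API: the function names `build_index` and `fetch` are hypothetical, the index layout is an assumption loosely modeled on the common per-sequence offset index, and the sketch handles only plain (uncompressed) FASTA with uniform line lengths within each record. Random access inside gzip files, which the abstract describes as a key feature of pyfastx, needs additional machinery that is not shown here.

```python
import os
import tempfile


def build_index(path):
    """Scan a plain FASTA file once and record, per sequence, the layout
    needed to seek to any base later. Assumes well-formed headers and
    equal line lengths within a record (except the last line)."""
    index = {}
    name = None
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # [offset of first base, total bases, bases per line, bytes per line]
                index[name] = [handle.tell(), 0, 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    # The first non-empty sequence line defines the line layout.
                    entry[2], entry[3] = bases, len(line)
                entry[1] += bases
    return index


def fetch(path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) by seeking directly
    to the needed bytes instead of reading the whole file."""
    offset, length, line_bases, line_bytes = index[name]
    if not 1 <= start <= end <= length:
        raise ValueError("interval lies outside the sequence")

    def byte_pos(i):
        # File position of the 0-based base i, skipping line terminators.
        return offset + (i // line_bases) * line_bytes + i % line_bases

    first = byte_pos(start - 1)
    last = byte_pos(end - 1)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # Drop the line terminators that fall inside the requested span.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    # Illustrative data only; the sequences below are made up.
    with tempfile.TemporaryDirectory() as tmp:
        fasta = os.path.join(tmp, "demo.fa")
        with open(fasta, "w", newline="\n") as out:
            out.write(">seq1 demo\nACGTACGTAC\nGTTTGACCAA\nGG\n>seq2\nTTTTCCCCGG\n")
        idx = build_index(fasta)
        print(fetch(fasta, idx, "seq1", 8, 13))
```

Only the small index is consulted for each lookup, so the cost of a query depends on the length of the requested interval and not on the size of the file. In this sketch the index is held in memory as a dictionary; a tool aimed at files with hundreds of millions of sequences would need to keep the index on disk to avoid the memory limitation described in the abstract.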

Details

Language :
English
ISSN :
1467-5463
Volume :
22
Issue :
4
Database :
Complementary Index
Journal :
Briefings in Bioinformatics
Publication Type :
Academic Journal
Accession number :
152575540
Full Text :
https://doi.org/10.1093/bib/bbaa368