PanPA: generation and alignment of panproteome graphs

Authors :
Fawaz Dabbaghie
Sanjay K. Srikakulam
Tobias Marschall
Olga V. Kalinina
Publication Year :
2023
Publisher :
Cold Spring Harbor Laboratory, 2023.

Abstract

Motivation: Compared to eukaryotes, prokaryote genomes are more diverse through different mechanisms, including a higher mutation rate and horizontal gene transfer. Therefore, using a linear representative reference can cause a reference bias. Graph-based pangenome methods have been developed to tackle this problem. However, comparisons in DNA space are still challenging due to this high diversity. In contrast, amino acid sequences have higher similarity due to evolutionary constraints, resulting in conserved amino acids that may nevertheless be encoded by several synonymous codons. Coding regions cover the majority of the genome in prokaryotes. Thus, building panproteomes leverages the high sequence similarity while not losing much of the genome in non-coding regions.

Results: We present PanPA, a method that takes a set of multiple sequence alignments (MSAs) of proteins or protein clusters, indexes them, and builds a graph for each MSA. In the querying step, it can align DNA or amino acid sequences back to these graphs. We first showcase that PanPA generates correct alignments on a panproteome from 1,350 E. coli. To demonstrate that panproteomes allow comparisons across longer phylogenetic distances, we compare DNA and protein alignments from 1,073 S. enterica assemblies against an E. coli reference genome, pangenome, and panproteome using BWA, GraphAligner, and PanPA, respectively; PanPA was able to produce around 22% more alignments. We also aligned a short-read DNA WGS sample from S. enterica against the E. coli reference with BWA and against the panproteome with PanPA; PanPA was able to find alignments for 69% of the reads, compared to 5% with BWA.

Availability: PanPA is available at https://github.com/fawaz-dabbaghieh/PanPA

Contact: fawaz@hhu.de, olga.kalinina@helmholtz-hzi.de

Supplementary information: Supplementary data are available at Bioinformatics online.
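
The motivation above rests on synonymous codons: two coding sequences can differ at the DNA level yet encode the same protein. The following is a minimal illustrative sketch of that point only. It is not part of PanPA and does not use its code or API. The two toy sequences, the helper names `translate` and `identity`, and the partial codon table (a small subset of the standard genetic code) are assumptions made for the example.

```python
# Illustration only: synonymous codons make protein sequences more similar
# than the DNA that encodes them. Not PanPA code; all names are hypothetical.

# Subset of the standard genetic code, limited to the codons used below.
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A",
    "CTG": "L", "TTA": "L",
    "AAA": "K", "AAG": "K",
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two toy coding sequences that differ only by synonymous substitutions.
seq_a = "ATGGCTCTGAAA"
seq_b = "ATGGCCTTAAAG"

dna_identity = identity(seq_a, seq_b)
protein_identity = identity(translate(seq_a), translate(seq_b))

print(f"DNA identity:     {dna_identity:.2f}")
print(f"Protein identity: {protein_identity:.2f}")
```

In this toy case the DNA sequences disagree at several positions while the translated proteins are identical, which is the effect the abstract appeals to when comparing more distant genomes in amino acid space.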

Details

Database :
OpenAIRE
Accession number :
edsair.doi...........b3f3a4457d0075ae4e912ae09514caab