
SubKluster: Novel method to bin scaffolds from cereal genomes into subgenomes using substring frequency analysis

Authors :
Kalbskopf, Victor
Publication Year :
2023

Abstract

The genome of the Belinda variety of hexaploid oat (Avena sativa) has recently been sequenced and assembled. This project aims to improve the assembly by clustering the thousands of scaffolds into their three ancestral subgenomes using Principal Component Analysis (PCA) of k-mer and repeat-element frequencies. The method was developed using a chromosome-level assembly of hexaploid wheat (Triticum aestivum), which formed clearly distinguishable clusters corresponding to the true subgenomes in the PCA plot, indicating that the method has merit. The longest oat scaffolds, which together make up 90% of the genome (N90), were processed in the same manner. This resulted in two clusters, one with about one third of the 3-copy BUSCOs (Benchmarking Universal Single-Copy Orthologs) and another with two thirds. The latter cluster could then be subdivided into two clusters, with about half of the 2-copy BUSCOs in each. A 1:1:1 ratio of BUSCOs across the clusters would indicate that the subgenomes are separating into their respective clusters. The clustering is not as neat or as clear as in the wheat example, but the length of the scaffolds or the state of the assembly may have a very large effect on the efficacy of the method. It is hoped that this method, with additional improvements, could be used to assess the assemblies of other large polyploid genomes and be part of a larger pipeline for understanding crop genome evolution.
Too many puzzle pieces: Oats has a messy genome
Imagine putting together a puzzle where all the puzzle pieces look very similar. And it has 12 billion pieces. And each piece has a copy. And someone mixed in two more puzzle sets that are slightly different. So you have 72 billion very similar puzzle pieces which will make 6 slightly different puzzles when assembled. Yes, even computers struggle with this. Which is why it’s taking so long to sequence the oat genome. Yes, the cereal you eat.
We’re struggling to know what’s going on in its genome because there are 6 copies and repeating puzzle pieces. When we try to sequence it, we get only little parts where we think we’ve managed to put together a few thousand pieces into a fragment here or there. But we don’t know which of the 6 puzzles (genomes) each fragment belongs to. Once we know that, we can connect up the fragments into larger parts to form the right puzzle. So I looked for patterns in the puzzle pieces, and found too many. Literally millions. Some of the patterns were the same two letters repeated over and over, for example AGAGAGAG… However, I found a pattern in these patterns. They seemed to occur more or less often depending on which puzzle they belonged to. I decided to see if this would work by testing it on the wheat genome, which has a larger genome in 6 copies. We have managed to sequence wheat quite well, so I chopped up the puzzles that make up wheat into fragments about the size of the oat fragments, but kept track of which puzzle each one belonged to. Then I counted how often the patterns occurred in each fragment, and saw a beautiful pattern. The patterns formed signatures that identified their parent puzzle. I ran into problems though. Some of the oat fragments were very short, so the signatures couldn’t be found. And over the millions of years that oats have evolved, many puzzles have swapped fragments, which means their signatures got mixed up too. However, I also know that all plants shoul
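The core idea in the abstract (count how often each short substring occurs in a scaffold, then look for structure with PCA) can be illustrated with a minimal sketch. It is not the SubKluster implementation, and everything in it is an assumption made for illustration: the function names `kmer_frequencies` and `first_pc_scores`, the toy sequences, the choice of k = 2, the use of power iteration to get only the first principal component, and the split by the sign of that component. The repeat-element frequencies mentioned in the abstract are not covered.

```python
import math
from itertools import product


def kmer_frequencies(seq, k=2):
    """Relative frequency of every DNA k-mer in seq, in a fixed A/C/G/T order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid dividing by zero on short input
    return [counts[w] / total for w in kmers]


def first_pc_scores(rows, iterations=200):
    """Project each row onto the first principal component of the data.

    Uses plain power iteration on the centered data, so no third-party
    library is needed. The sign of the component is arbitrary.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Start from the centered row with the largest norm. A uniform start
    # vector would fail here: frequency rows sum to 1, so the all-ones
    # direction carries no variance.
    vec = max(centered, key=lambda c: sum(x * x for x in c))
    for _ in range(iterations):
        proj = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
        new = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:  # all rows identical: nothing to separate
            return [0.0] * n
        vec = [x / norm for x in new]
    return [sum(c[j] * vec[j] for j in range(d)) for c in centered]


# Toy "scaffolds" (hypothetical data): two are rich in AG repeats, two in CT.
scaffolds = {
    "a1": "AG" * 50 + "ACGT" * 5,
    "a2": "AG" * 40 + "ACGT" * 10,
    "b1": "CT" * 50 + "ACGT" * 5,
    "b2": "CT" * 40 + "ACGT" * 10,
}
names = list(scaffolds)
profiles = [kmer_frequencies(scaffolds[name], k=2) for name in names]
scores = dict(zip(names, first_pc_scores(profiles)))
# Crude two-way split by the sign of the first principal component.
groups = {name: scores[name] > 0 for name in names}
```

In this toy case the AG-rich and CT-rich sequences fall on opposite sides of the first component. Real scaffolds would need longer k-mers, more components, and a proper clustering step, and, as the abstract notes, short scaffolds may not carry a usable signature at all.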

Details

Database :
OAIster
Notes :
application/pdf, English
Publication Type :
Electronic Resource
Accession number :
edsoai.on1379083283
Document Type :
Electronic Resource