Assessment and amelioration of genome assemblies with comprehensive usage of mate-pair sequences
VEZZI, ALESSANDRO;DE PASCALE, FABIO;VITULO, NICOLA;SCHIAVON, RICCARDO;CAMPAGNA, DAVIDE;VALLE, GIORGIO
2014
Abstract
In recent years the number of sequencing projects has increased dramatically, and genomes of very different organisms have been released. However, many published genomes remain in a rather meaningless “high quality” draft phase, meaning that they were submitted as contigs or scaffolds rather than as entire chromosomes. As a result, many genomic regions of a sequenced organism are still unknown or unplaced. Moreover, some contigs or scaffolds may be misassembled owing, for example, to the presence of repeated regions. The aim of our project is to develop bioinformatics tools and pipelines that take advantage of mate-pair sequences to validate and ameliorate genome sequences. This is possible because mate-pair libraries impose physical constraints between the two mate DNA sequences. Libraries of this kind are used in most genome sequencing projects to join contigs together. In our view, any discrepancy with respect to the predicted physical constraints may be useful for detecting problematic regions (i.e. misassemblies). Consequently, the original genomic sequence can be modified until the constraints are finally satisfied. The amelioration, that is, the possibility of inserting new sequences into the assembly, will take advantage of those sequences that fall within gaps, using two different strategies: (i) gap-filling via de novo assembly of mate reads; (ii) recovery of previously “discarded” draft sequences (unplaced contigs) that actually cover the gap.
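The idea of using mate-pair constraints to detect problematic regions can be illustrated with a minimal sketch. It is not the pipeline described in the abstract: the `MatePair` record, the `discordant_pairs` function, the expected insert size and the tolerance are all hypothetical, chosen only to show how a mapped distance that disagrees with the library's expected distance could be flagged.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    # Hypothetical record: both mates are assumed to map to the same scaffold.
    scaffold: str
    left_pos: int   # leftmost mapped coordinate of the pair
    right_pos: int  # rightmost mapped coordinate of the pair


def span(pair: MatePair) -> int:
    """Distance covered by the pair on the assembled sequence."""
    return pair.right_pos - pair.left_pos


def discordant_pairs(pairs, expected_insert, tolerance):
    """Return pairs whose span differs from the expected insert size
    by more than the given tolerance (both values are assumptions
    that would normally be estimated from the library itself)."""
    return [p for p in pairs if abs(span(p) - expected_insert) > tolerance]


# Illustrative data only, not taken from any real assembly.
pairs = [
    MatePair("scaffold_1", 1_000, 4_050),
    MatePair("scaffold_1", 2_000, 4_900),
    MatePair("scaffold_1", 3_000, 12_500),
]
flagged = discordant_pairs(pairs, expected_insert=3_000, tolerance=500)
for p in flagged:
    print(p.scaffold, p.left_pos, p.right_pos, span(p))
```

In this toy input only the third pair violates the constraint. A real tool would presumably also consider mate orientation and pairs mapping to different scaffolds, and would look for clusters of discordant pairs rather than single ones.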