Measuring the loss of duplicated genes in plant genomes assembled by means of short reads.
DE PASCALE, FABIO;MARTINI, PAOLO;VEZZI, ALESSANDRO
2015
Abstract
Introduction: Genome assembly is a complex task whose hardest step is the resolution of repeats. Because these regions are usually considered of minor importance for describing the features of a genome, they are often poorly characterized in genome analyses. By definition, any region present at least twice in the genome is a repeat; duplicated genes can therefore fall into this category, leading to an underestimation of duplication events in genomes. This effect could be exacerbated by k-mer-based short-read assembly algorithms.
Methods: During gap-closure experiments on the tomato genome, we found that some of the unplaced contigs corresponded to duplicated genes. To our knowledge, this loss of duplicated genes has never been measured for plant genomes. We therefore used the Arabidopsis thaliana genome as a reference sequence to generate simulated paired-end Illumina reads, which were assembled with De Bruijn graph-based algorithms. In addition, short-read data from other publicly available Arabidopsis thaliana ecotypes were assembled in the same way and compared with the corresponding reference-guided assemblies.
Results: Comparing the published genome assemblies with the De Bruijn graph-based assemblies allowed us to investigate duplicated genes in terms of: 1) how many genes are missing from the assembled genomes; 2) how k-mer length may affect the loss or presence of duplicated genes in the assembled genomes; 3) how the structure of duplicated genes can be affected by differing degrees of nucleotide conservation.
Discussion: Eukaryotic genome projects are now carried out by producing and assembling short reads. The impact of the sequencing strategy on the representation of duplicated genes should provide new insight to be considered when studying plant genomes and their evolution.
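As a purely illustrative aid (it is not the pipeline used in this work), the following minimal Python sketch shows why k-mer length could matter for duplicated genes. In a De Bruijn graph, k-mers shared by two gene copies map to the same nodes, so highly conserved copies risk being merged. The sequences `copy_a` and `copy_b` and the function names are hypothetical toy values introduced only for this example.

```python
def kmers(seq, k):
    """Return the set of distinct k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(copy_a, copy_b, k):
    """Fraction of copy_a's distinct k-mers that also occur in copy_b.

    A value near 1.0 means the two copies are almost indistinguishable
    at this k, so a k-mer-based assembler may not keep them apart.
    """
    a = kmers(copy_a, k)
    b = kmers(copy_b, k)
    if not a:  # k is longer than the sequence
        return 0.0
    return len(a & b) / len(a)


# Hypothetical toy "gene" and a duplicate that differs by one substitution.
copy_a = "ATGGCGTACGATAGCCTTTGCATCGAATCGAAGCTAGACC"
copy_b = copy_a[:17] + "C" + copy_a[18:]  # T -> C at position 17

if __name__ == "__main__":
    # Larger k means more k-mers span the single difference,
    # so the shared fraction tends to drop as k grows.
    for k in (5, 11, 17, len(copy_a)):
        frac = shared_kmer_fraction(copy_a, copy_b, k)
        print(f"k={k:2d}  shared fraction={frac:.2f}")
```

Under these assumptions, the shared fraction gives a rough feel for how nucleotide conservation between copies and the choice of k interact. Real assemblers additionally use coverage, paired-end information and error correction, none of which is modelled here.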