Identification of Structural Variations in Resequenced Genomes using Paired-End or Mate-Pair Sequences

Zamperin, Gianpiero

Next Generation Sequencing (NGS) produces far more data, at a much lower cost, than traditional Sanger sequencing. The huge amount of data recently obtained with NGS has led to the rapid production of draft sequences for many genomes, from both eukaryotic and prokaryotic organisms. The Human Genome Project was completed in 2003, less than 10 years ago: it cost billions of dollars and involved dozens of laboratories around the world. Today, NGS can deliver the equivalent of a human genome in a few weeks, at a price of 10,000 dollars. This dramatic increase in performance has opened new possibilities in biology: for example, it is now possible to genetically compare entire organisms, analyse ancient DNA, and study genetic diseases at a level that was unthinkable until a few years ago. The main fields that can benefit from this technology include genomics (for example, genome assembly and structural variation detection), transcriptomics (for example, analysis of gene expression, gene prediction and alternative splicing) and epigenetics. The data that has been produced needs to be analysed; it is very unlikely that such an analysis could be done manually, so new bioinformatic methods are needed to speed up the process. Computational resources must be optimized to store NGS data efficiently, and new algorithms designed specifically for NGS data are also needed, for instance to overcome one of the major limitations of the new sequencing technologies: the short length of the individual sequences (generally called 'reads') delivered by NGS machines. On the one hand, NGS can produce several hundred times more reads than traditional sequencing; on the other hand, these reads are much shorter: about 50-100 bases instead of the 500-1000 bases of Sanger sequencing. 
This makes data analysis more difficult, in particular for genomic repeats, which can be resolved only with longer reads. Currently the most widely used NGS machines are Solexa (Illumina), 454 (Roche) and SOLiD (Applied BioSystems). The first uses a method similar to Sanger sequencing, while the other two use different technologies, pyrosequencing and sequencing-by-ligation respectively. Read length varies: 454 produces reads of about 400 bases, while the other two produce reads of between 35 and 100 bases. The three platforms also differ in throughput, which continues to improve over time; currently the 454 produces about one million reads per run, while Solexa and SOLiD can produce several hundred million reads per run. These platforms can be used to sequence different types of libraries, including paired-end and mate-pair libraries. These libraries allow both ends of each DNA fragment to be sequenced; the result is pairs of sequences that should map at a distance compatible with the length of the library fragments. When used for re-sequencing individual genomes, these libraries generate many links ('arcs'), one for each pair of mapped reads, each of which should be compatible with the length of the library fragments. The main objective of my PhD thesis is to show that it should be possible to identify with high accuracy any structural variation occurring in individual genomes, using the data from paired-end and mate-pair libraries. The accuracy of this analysis should improve with the density of arcs covering the genome; therefore, the large number of arcs that NGS platforms can generate offers a great opportunity for structural variation studies. Structural variations are an aspect of the genome whose importance has become evident only in the past few years; before that, even their existence was in doubt. 
It has recently been observed that adult genomes carry hundreds of structural variations, which may be associated with cancer or other diseases (for example, Parkinson's disease). Several tools have been developed to detect structural variations, based on comparative genome hybridization and, more recently, on NGS. In the latter case, the available tools are still far from exploiting the full potential of NGS data, in terms of both sensitivity and specificity. The aim of my PhD was to investigate this problem and to create a bioinformatic tool able to detect structural variations with high accuracy. I initially focused only on SOLiD data, then extended my analysis to Solexa data (and, potentially, 454). The final result is SV_finder, a program that works in both base space and color space. As input it requires the list of paired-end or mate-pair reads mapped to a known reference genome; its output is a list of the structural variations found, given the data and parameters used.
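
The idea of arcs that are compatible or incompatible with the library fragment length can be illustrated with a minimal sketch. It is not the SV_finder implementation: the `Arc` record, the `classify_arc` function and the `min_span`/`max_span` thresholds are hypothetical names introduced here, and read orientation is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Arc:
    """One mapped read pair (hypothetical record, not SV_finder's input format)."""
    chrom1: str
    pos1: int      # leftmost mapped position of the first read
    chrom2: str
    pos2: int      # leftmost mapped position of the second read
    read_len: int  # read length in bases


def classify_arc(arc, min_span, max_span):
    """Label an arc by comparing its span on the reference with the
    range of fragment lengths expected for the library (assumed known)."""
    if arc.chrom1 != arc.chrom2:
        return "inter-chromosomal"
    # Span on the reference, from the start of one read to the end of the other.
    span = abs(arc.pos2 - arc.pos1) + arc.read_len
    if span > max_span:
        return "too-long"    # consistent with a deletion in the sequenced genome
    if span < min_span:
        return "too-short"   # consistent with an insertion in the sequenced genome
    return "concordant"
```

For a library with fragments of roughly 400-600 bases, a pair whose reads map 2,000 bases apart would be labeled `too-long`. A single discordant arc is weak evidence on its own, which is why the abstract expects accuracy to improve with the density of arcs covering a region.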

The new sequencing technologies (NGS) make it possible to obtain very large amounts of data at low cost compared with traditional Sanger sequencing. The huge volume of data recently produced with NGS has led to the rapid production of draft sequences of many genomes, both eukaryotic and prokaryotic. The Human Genome Project was completed in 2003, less than 10 years ago: it cost billions of dollars and involved dozens of laboratories around the world. Today NGS can produce the equivalent of a human genome in a few weeks, at a cost of 10,000 dollars. This remarkable increase in performance has opened new possibilities in biology: for example, it is now possible to genetically compare entire organisms, analyse ancient DNA, and study genetic diseases at a level considered unthinkable until a few years ago. Some of the main fields that can benefit from this technology are genomics (for example, genome assembly and identification of structural variations), transcriptomics (for example, gene prediction and alternative splicing) and epigenetics. The huge volume of data that has been produced must be analysed; it is very unlikely that such an analysis could be done manually, so new bioinformatic methods are required to speed up the process. Computational resources need to be optimized to store NGS data efficiently, and new algorithms designed specifically for NGS data are also needed, for instance to overcome one of the major limitations of the new sequencing technologies: the short length of the individual sequences (generally called 'reads') that NGS machines can produce. On the one hand, NGS can produce hundreds of times more reads than traditional sequencing; on the other hand, these reads are much shorter: about 50-100 bases instead of the 500-1000 bases of Sanger sequencing. 
This makes data analysis more difficult, particularly for genomic repeats, which can be resolved only with longer reads. Currently the most widely used NGS machines are Solexa (Illumina), 454 (Roche) and SOLiD (Applied BioSystems). The first uses a method similar to Sanger sequencing, while the other two use different technologies, pyrosequencing and sequencing-by-ligation respectively. Read length varies: the 454 produces reads of about 400 bases, while the other two produce reads of between 35 and 100 bases. The three platforms also differ in throughput, which improves continuously over time: currently the 454 produces about one million reads per run, while Solexa and SOLiD can produce several hundred million reads per run. These platforms can be used to sequence different types of libraries, including paired-end and mate-pair libraries. These are libraries that allow the ends of a DNA fragment to be sequenced; the result is pairs of sequences that should map at a distance compatible with the length of the library fragments. When used to re-sequence individual genomes, these libraries generate many links ('arcs'), one for each pair of mapped reads, which should be compatible with the length of the library fragments. The main objective of my doctoral thesis is to show that it should be possible to identify with high accuracy any structural variation occurring in the genomes of individual people, using data from paired-end and mate-pair libraries. The accuracy of this analysis should improve with the density of arcs covering the genome; therefore, the large number of arcs that NGS platforms can generate offers a great opportunity for studies of structural variations. 
Structural variations are an aspect of the genome whose importance has become evident only in recent years; before that, even their existence was in doubt. It has recently been observed that adult genomes contain hundreds of structural variations, which may be associated with cancer or other diseases (for example, Parkinson's disease). Many tools have been developed to identify structural variations, based on comparative genome hybridization and, more recently, on NGS. In the latter case, the available tools are far from being able to exploit the full potential of NGS data, in terms of both sensitivity and specificity. The aim of my PhD is to examine this problem and to create a bioinformatic tool capable of identifying structural variations with high accuracy. I initially concentrated only on SOLiD data, and later extended my analysis to Solexa data (and, potentially, 454). As a final result I devised SV_finder, a program that works in both base space and color space. As input it requires a list of the paired-end or mate-pair reads mapped to a known reference genome; the output is a list of the structural variations found, given the data and parameters used.
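
Color space, mentioned above, is the SOLiD read encoding in which each color (0-3) represents a pair of adjacent bases rather than a single base. The sketch below shows the standard two-base encoding expressed as an XOR of 2-bit base codes; the function names are illustrative and this is not code from SV_finder.

```python
# 2-bit codes chosen so that XOR of two adjacent bases gives the SOLiD color.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def to_color_space(seq):
    """Encode a base-space sequence as a string of colors, one per adjacent base pair."""
    codes = [BASE_CODE[base] for base in seq.upper()]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def from_color_space(first_base, colors):
    """Decode a color string back to bases, given the known first base."""
    code = BASE_CODE[first_base.upper()]
    bases = [CODE_BASE[code]]
    for color in colors:
        code ^= int(color)          # each color tells how to move to the next base
        bases.append(CODE_BASE[code])
    return "".join(bases)
```

Because each color depends on two neighboring bases, decoding needs a known first base, and a single wrong color changes every base decoded after it.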

Identification of Structural Variations in Resequenced Genomes using Paired-End or Mate-Pair Sequences / Zamperin, Gianpiero. - (2012 Jan).