Computational characterization of tandem repeat and non-globular proteins

Paladin, Lisanna

The first protein structure to be determined was hemoglobin, a globe-like, water-soluble protein with enzymatic activity. Since then, protein science has been biased towards this type, termed globular. However, over the last decades accumulating experimental evidence has suggested the functional importance of their counterpart, non-globular proteins (NGPs). The definition includes tandem repeats, intrinsically disordered regions, aggregating domains and transmembrane domains. Recognizing and classifying NGPs is essential to shed light on the so-called “dark proteome”, i.e. the large fraction of the proteome that we know almost nothing about. I contributed to this goal through the development of new resources dedicated to NGPs. My main focus is tandem repeat proteins (TRPs). TRPs are characterized by a repeated sequence which folds into a modular architecture, where the modules are called “units”. The unit is not only the structural but also the evolutionary module, and it forms the basis of TRP classification. TRPs are widespread in all types of organisms, where they carry out fundamental functions. The sequences of TRP units diverge quickly while maintaining their fold, which hampers detection by traditional sequence analysis methods. Conversely, the challenges of structure-based repeat detection lie in the multidimensional nature of the data. Specialized methods have been developed for TRP identification; however, few of them annotate single repeat units. RepeatsDB is a database of TRP structures annotated with the position of repeat units and insertions. I contributed to the new version of RepeatsDB, which was populated using ReUPred, a predictor of tandem repeat units. The quality of RepeatsDB data is guaranteed by manual validation, a time-consuming task which requires community annotation efforts. To facilitate this process I developed RepeatsDB-lite, a web server for the prediction and refinement of tandem repeats in protein structures.
Analysing RepeatsDB data, I compared the sequence-based and structure-based classifications of TRPs. Moreover, I provided insights into the role of TRPs in the human proteome by characterizing them in terms of function, protein-protein interaction networks and impact on diseases. As a case study, I characterized Collagen V, a repeat protein associated with Ehlers-Danlos syndrome, identifying genotype-phenotype correlations in relation to its interaction network model. Another category of NGPs is intrinsically disordered proteins (IDPs), which lack order in their native state. Intrinsic disorder has been shown to be prevalent in the human proteome, to play important signaling and regulatory roles and to be frequently involved in disease. I contributed to MobiDB, a database of protein disorder and mobility annotations that describes several aspects of NGP structure and mechanism of function. MobiDB provides consensus predictions and functional annotations for all known protein sequences. A common feature of TRPs, IDPs and other NGPs is that they contain low-complexity regions, where the distribution of amino acids deviates from common amino acid usage. The functional importance of low-complexity regions is closely related to their non-globular arrangement. I contributed to the field with a critical review focusing on the definition of the sequence features of low-complexity regions and their relationship to structural features. Finally, I used the knowledge acquired on NGPs in the previous studies to design one of the first sequence-based methods for the prediction of protein solubility, SODA. SODA uses the aggregation propensity, intrinsic disorder, hydrophobicity and secondary structure preferences of a sequence to evaluate the solubility change introduced by a mutation. The main envisaged applications of SODA are in protein engineering and in the study of the impact of protein mutations on disease onset.
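
As an illustration of what a low-complexity region means at the sequence level, the following minimal sketch flags sequence windows whose amino acid composition is strongly skewed, using Shannon entropy as the measure. It is not the method used in the thesis or in any of the resources named above; the function names, the window size and the entropy threshold are hypothetical choices made for this example only.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, size: int = 12, threshold: float = 2.2):
    """Return (start, entropy) for every window at or below the threshold.

    `size` and `threshold` are illustrative assumptions, not published values.
    """
    hits = []
    for start in range(len(seq) - size + 1):
        entropy = shannon_entropy(seq[start:start + size])
        if entropy <= threshold:
            hits.append((start, entropy))
    return hits


if __name__ == "__main__":
    # A collagen-like Gly-Pro-Pro repeat uses only two residue types,
    # so its composition is far less diverse than a mixed sequence.
    for start, entropy in low_complexity_windows("GPPGPPGPPGPPGPP"):
        print(start, round(entropy, 3))
```

A window built from a single residue type has entropy 0, while a window in which every residue is different reaches the maximum for its length, so lower values indicate a more biased composition.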

The first protein structure to be determined was that of hemoglobin, a spherical, soluble protein with enzymatic activity. Since then, science has concentrated on this type of protein, termed globular. Recent experimental evidence, however, suggests the functional importance of their counterpart, proteins termed non-globular (NGPs). Recognizing and classifying NGPs is essential to shed light on the so-called dark proteome, i.e. the fraction of the proteome that is still uncharacterized. I contributed to this goal by developing resources dedicated to NGPs, mainly to tandem repeat proteins (TRPs). TRPs are characterized by a repeated sequence that encodes a modular structure, in which the individual modules are called units. Units represent not only the minimal structural entity of TRPs but also the minimal evolutionary one: they are in fact the basis of TRP classification. TRPs are widespread in all types of organisms, where they carry out essential functions. The sequences of the repeat units diverge quickly while conserving the structure, which complicates their recognition from sequence. On the other hand, detecting repeats on the basis of structure is also complex, owing to the multidimensionality of the data. Specific methods have been developed for identifying TRPs, but few annotate the individual units. RepeatsDB is a database of repeat structures that reports the position of units and insertions. I contributed to the new version of the database, which was populated using ReUPred, a predictor of repeat units. Data quality is guaranteed by manual validation, a costly process that requires the contribution of expert annotators. To facilitate it I developed RepeatsDB-Lite, an online server for the prediction and annotation of TRPs. Analysing the data in RepeatsDB, I compared the classifications of TRPs based on sequence and on structure.
Moreover, I described the role of TRPs in the human proteome by presenting their functions, their interaction network and their impact on diseases. As a case study I characterized collagen V, a TRP associated with Ehlers-Danlos syndrome, identifying genotype-phenotype correlations in relation to the interactions that the protein establishes. Another category of NGPs is that of intrinsically disordered proteins (IDPs), which lack a fixed or ordered tertiary structure. Disorder is prevalent in the human proteome, plays a fundamental role in cell signaling and regulation and is frequently associated with disease. I contributed to MobiDB, a database of protein disorder and mobility that describes many aspects of the structure and functional mechanisms of NGPs. MobiDB presents a consensus of predictions, together with functional annotations, for all known protein sequences. A common feature of TRPs, IDPs and other NGPs is that they contain low-complexity regions, i.e. regions where the distribution of amino acids in the sequence deviates from the average. The functional importance of low-complexity regions is closely connected to their non-globular arrangement. My contribution to the field consists in defining the features of low-complexity sequences in relation to their structural features. Finally, I used the knowledge acquired on NGPs to design one of the first sequence-based solubility predictors, SODA. SODA uses the hydrophobicity of the sequence, together with its propensity for aggregation, for disorder and for forming secondary structure elements, to predict how much a given mutation changes the protein's solubility. The main applications of SODA are in protein engineering and in the study of the impact of mutations on disease onset.

Computational characterization of tandem repeat and non-globular proteins / Paladin, Lisanna. - (2018 Nov 28).