Extended and robust protein sequence annotation over conservative non hierarchical clusters. The Bologna Annotation Resource v 2.0

Piovesan, D.; Bartoli, L.; Martelli, P. L.; Fariselli, Piero; Rossi, I.; Guerzoni, G.; Donvito, G.; Maggi, G. P.; Casadio, R.

Genome annotation is one of the most important issues in the genomic era. The exponential growth rate of newly sequenced genomes and proteomes calls for fast and reliable annotation methods that can exploit all the information available in curated databases of protein sequences and structures. To this aim we developed BAR, the Bologna Annotation Resource, which is now updated (available at http://microserf.biocomp.unibo.it/bar/). The basic notion is that sequences with high identity to a counterpart can inherit the same function(s) and structure, if available. What is new in our analysis is that sequences are clustered under the constraint that sequence identity must be equal to or higher than 40% over at least 90% of the pairwise alignment length. In this way sequences are grouped into sets that can be annotated in terms of function and structure, depending on the annotation level of the sequences within the cluster. Our method starts with an all-against-all alignment of all the sequences in a GRID environment. The alignments are then regarded as an undirected graph; after the clustering procedure, which constrains both the sequence identity value and the alignment length, all connected nodes (proteins) collapse into a single group (cluster). A cluster that incorporates a UniProt entry inherits its annotations (statistically validated GO terms, PDB structures, SCOP classifications, Pfam families, if available). Clusters can contain distantly related proteins, which can thereby be annotated with high confidence. In total, the method analyses over 12 million protein sequences taken from 988 genomes and UniProt release 13. In this version, HMM models of those clusters that contain PDB templates are also provided to the end user for computing structural models of distantly related sequences.
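The clustering step described above can be illustrated with a minimal sketch. It is not the BAR implementation: the `Alignment` record, the function names, and the reading of "coverage" as the aligned fraction of the pair are assumptions made for illustration only. Alignments passing both thresholds are treated as edges of an undirected graph, connected components are found with a union-find structure, and annotations are then pooled within each cluster. The statistical validation of GO terms mentioned in the abstract is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One pairwise alignment (hypothetical record layout)."""
    query: str
    target: str
    identity: float  # percent identity, 0-100
    coverage: float  # aligned fraction, 0-1 (assumed definition)


def cluster_sequences(ids, alignments, min_identity=40.0, min_coverage=0.90):
    """Return the connected components of the thresholded alignment graph."""
    parent = {s: s for s in ids}

    def find(s):
        # Follow parent links to the root, halving the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a in alignments:
        # Keep an edge only if both constraints hold.
        if a.identity >= min_identity and a.coverage >= min_coverage:
            root_q, root_t = find(a.query), find(a.target)
            if root_q != root_t:
                parent[root_t] = root_q

    clusters = {}
    for s in ids:
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())


def transfer_annotations(clusters, annotations):
    """Give every member the union of the annotations found in its cluster."""
    result = {}
    for cluster in clusters:
        pooled = set()
        for s in cluster:
            pooled |= annotations.get(s, set())
        for s in cluster:
            result[s] = pooled
    return result


# Toy data: P1-P2 and P2-P3 pass both thresholds; P3-P4 fails on coverage.
ids = ["P1", "P2", "P3", "P4"]
alignments = [
    Alignment("P1", "P2", 55.0, 0.95),
    Alignment("P2", "P3", 42.0, 0.92),
    Alignment("P3", "P4", 80.0, 0.50),
]
clusters = cluster_sequences(ids, alignments)
annotated = transfer_annotations(clusters, {"P1": {"GO:0005524"}})
```

In this toy example P1 and P3 share no direct edge, yet they fall into the same cluster through P2, so P3 receives the annotation of P1. This is the sense in which a cluster can link distantly related proteins; P4 remains a singleton because its only alignment covers too little of the pair.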