
Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics

The UniProt knowledgebase (www.uniprot.org): a hub of integrated protein data. http://education.expasy.org/cours/Prague2011/


Presentation Transcript


  1. The UniProt knowledgebase (www.uniprot.org): a hub of integrated protein data. http://education.expasy.org/cours/Prague2011/ Marie-Claude.Blatter@isb-sib.ch, Swiss-Prot group, Geneva, SIB Swiss Institute of Bioinformatics

  2. Protein sequences • > 180 billion ‘different’ proteins on earth (∑ N species × M genes) • > 16.0 million ‘known’ protein sequences in 2011 • More than 99% of protein sequences are derived from the translation of nucleotide sequences (mRNA or DNA) • Less than 1% come from direct protein sequencing (Edman, MS/MS…)

  3. Data, knowledge: protein sequence, functional information

  4. Science cover, February 2011

  5. UniProt consortium: EBI, European Bioinformatics Institute (UK); SIB, Swiss Institute of Bioinformatics (CH); PIR, Protein Information Resource (US)

  6. www.uniprot.org

  7. UniProt databases

  8. UniProt databases. UniParc: protein sequence archive (EMBL-ENA equivalent at the protein level). Each entry contains a protein sequence, taxonomic information, and cross-links to the other databases where the sequence can be found (active or not). No annotation. You can: query, Blast, download. ~28 million entries.

  9. UniProt databases. UniRef: 3 sets of clusters of protein sequences, at 100, 90 and 50% identity; useful to speed up sequence similarity searches (BLAST). You can: query, Blast, download. UniRef100: 14 million entries; UniRef90: 9 million entries; UniRef50: 4 million entries.

  10. UniProt databases. UniMES: protein sequences derived from metagenomic projects (mostly the Global Ocean Sampling (GOS) expedition). You can: download. 10 million entries, included in UniParc.

  11. UniProt databases: the central piece

  12. UniProtKB: an encyclopedia on proteins, composed of 2 sections: UniProtKB/TrEMBL (unreviewed, automatically annotated) and UniProtKB/Swiss-Prot (reviewed, manually annotated); released every 4 weeks

  13. UniProtKB: origin of protein sequences. UniProtKB protein sequences are mainly derived from:
      • INSDC (translated submitted coding sequences, CDS)
      • Ensembl (gene prediction) and RefSeq sequences
      • Sequences of PDB structures
      • Direct submission or sequences scanned from literature (includes direct protein sequencing)
      Notes:
      - UniProt does not do any gene prediction.
      - Most non-germline immunoglobulins, T-cell receptors, most patent sequences, highly over-represented data (e.g. viral antigens) and pseudogene sequences are excluded from UniProtKB but stored in UniParc.
      - Data from the PIR database have been integrated in UniProtKB since 2003.
      Figure labels: 85 %, 15 %

  14. EMBL → TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references; automated annotation. TrEMBL → Swiss-Prot: manual annotation of the sequence and associated biological information.

  15. UniProtKB/TrEMBL: unreviewed; automatic annotation; released every 4 weeks

  16. UniProtKB/TrEMBL entry (www.uniprot.org): one protein sequence, one species. Protein and gene names; taxonomic information; references. Automated annotation: function, subcellular location, catalytic activity, sequence similarities… Automated annotation: transmembrane domains, signal peptide… Automated annotation: keywords and Gene Ontology. Cross-references to over 125 databases.

  17. UniProtKB/TrEMBL: automatic annotation
      • Protein sequence
        - The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl).
        - 100% identical sequences (same length, same organism) are merged automatically.
      • Biological information: sources of annotation
        - Provided by the submitter (EMBL, PDB, TAIR…)
        - From automated annotation: automatically generated annotation rules (e.g. SAAS) and/or manually generated annotation rules (e.g. UniRule)
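The merge criterion on slide 17 (100% identical sequence, same organism) can be illustrated with a minimal Python sketch. The record layout, the function name `merge_identical` and the accessions are hypothetical and chosen only for the example; this is not the actual UniProt pipeline.

```python
from collections import defaultdict


def merge_identical(records):
    """Group records whose organism and sequence are exactly identical.

    `records` is an iterable of (accession, organism, sequence) tuples.
    This layout is an assumption made for the illustration.
    """
    merged = defaultdict(list)
    for accession, organism, sequence in records:
        # Identical strings necessarily have the same length, so the
        # (organism, sequence) pair is a sufficient merge key.
        merged[(organism, sequence)].append(accession)
    return dict(merged)


# Hypothetical records: A1 and A2 are merged; A3 (other organism)
# and A4 (one extra residue) stay separate.
records = [
    ("A1", "Homo sapiens", "MSKEKF"),
    ("A2", "Homo sapiens", "MSKEKF"),
    ("A3", "Mus musculus", "MSKEKF"),
    ("A4", "Homo sapiens", "MSKEKFE"),
]
groups = merge_identical(records)
```

Keying on the pair rather than on the sequence alone keeps identical sequences from different species in separate entries, which matches the "one protein sequence, one species" layout of a TrEMBL entry.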

  18. UniProtKB/TrEMBL: example of fully automatic annotation (SAAS) • Rules are derived from the UniProtKB/Swiss-Prot manual annotation. • Fully automated rule generation based on the C4.5 decision tree algorithm. • One annotation, one rule. • High stringency: a rule must reach 99% or greater estimated precision to generate annotation (tested on UniProtKB/Swiss-Prot). • Rules are produced, updated and validated at each release.
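The stringency criterion on slide 18 can be sketched as a precision filter over reviewed entries. This covers only the acceptance step, not the C4.5 rule generation itself; the entry layout (`signatures`, `annotations`), the function names and the toy data are assumptions made for the illustration.

```python
def rule_precision(rule, reviewed_entries, annotation):
    """Fraction of reviewed entries matched by `rule` that carry `annotation`."""
    matched = [entry for entry in reviewed_entries if rule(entry)]
    if not matched:
        return 0.0  # a rule that matches nothing cannot be validated
    correct = sum(1 for entry in matched if annotation in entry["annotations"])
    return correct / len(matched)


def accept_rule(rule, reviewed_entries, annotation, threshold=0.99):
    """Keep a rule only if its estimated precision reaches the threshold."""
    return rule_precision(rule, reviewed_entries, annotation) >= threshold


# Hypothetical reviewed set: 100 entries with signature SIG_X, all annotated
# "Kinase"; 10 entries with signature SIG_Y, only 9 of them annotated "Kinase".
reviewed = (
    [{"signatures": {"SIG_X"}, "annotations": {"Kinase"}} for _ in range(100)]
    + [{"signatures": {"SIG_Y"}, "annotations": {"Kinase"}} for _ in range(9)]
    + [{"signatures": {"SIG_Y"}, "annotations": set()}]
)


def rule_x(entry):
    return "SIG_X" in entry["signatures"]


def rule_y(entry):
    return "SIG_Y" in entry["signatures"]


accepted_x = accept_rule(rule_x, reviewed, "Kinase")  # precision 100/100
accepted_y = accept_rule(rule_y, reviewed, "Kinase")  # precision 9/10
```

Testing against the manually annotated section is what makes the estimate meaningful: a rule is only as trustworthy as the reviewed entries it is measured on.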

  19. UniProtKB/Swiss-Prot: reviewed; manually annotated; released every 4 weeks

  20. UniProtKB/Swiss-Prot entry (www.uniprot.org): one protein sequence, one gene, one species. Protein and gene names; taxonomic information; references. Manual annotation: function, subcellular location, catalytic activity, disease, tissue specificity, pathway… Sequence: MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG Manual annotation: post-translational modifications, variants, transmembrane domains, signal peptide… Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation… Manual annotation: keywords and Gene Ontology. Cross-references to over 125 databases.

  21. UniProtKB/Swiss-Prot: manual annotation. 1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…). 2. Biological information (sequence analysis, extraction of literature information, ortholog data propagation, …).

  22. UniProtKB/Swiss-Prot 1- Protein sequence curation

  23. UniProtKB/Swiss-Prot: a gene-centric view of the protein space. 1 entry <-> 1 gene (1 species). The displayed protein sequence: …canonical, representative, consensus… + alternative sequences (described within the entry)

  24. What is the current status? • At least 20% of Swiss-Prot entries required a minimal amount of curation effort to obtain the “correct” sequence. • Typical problems: • unresolved conflicts • uncorrected initiation sites • frameshifts • wrong gene prediction • other ‘problems’

  25. UCSC genome browser: examples of CDS annotation submitted to INSDC…

  26. UniProtKB/Swiss-Prot 2- Biological data curation

  27. Extraction of literature information and protein sequence analysis, with maximum usage of controlled vocabulary. UniProtKB/Swiss-Prot gathers data from multiple sources: - publications (literature/PubMed) - prediction programs (PROSITE, TMHMM, …) - contacts with experts - other databases - nomenclature committees. An evidence attribution system makes it easy to trace the source of each annotation.

  28. Protein and gene names

  29. General annotation (Comments) …enable researchers to obtain a summary of what is known about a protein… www.uniprot.org

  30. Human protein manual annotation: some statistics (June 2011)

  31. Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein… www.uniprot.org

  32. Non-experimental qualifiers: UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two

  33. Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)

  34. ‘Protein existence’ tag • The ‘Protein existence’ tag indicates the evidence for the existence of a given protein. • Different qualifiers: 1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …); 2. Evidence at transcript level (~19%); 3. Inferred from homology (~58%); 4. Predicted (~5%); 5. Uncertain (mainly in TrEMBL). http://www.uniprot.org/docs/pe_criteria
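The five qualifiers on slide 34 form an ordered scale, which can be modelled as an integer enumeration. The sketch below is an illustration only: the class name, member names, the helper `with_evidence_at_least` and the accessions are hypothetical, not a UniProt API.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """Protein existence levels as numbered on the slide (1 to 5)."""
    PROTEIN_LEVEL = 1      # evidence at protein level
    TRANSCRIPT_LEVEL = 2   # evidence at transcript level
    HOMOLOGY = 3           # inferred from homology
    PREDICTED = 4
    UNCERTAIN = 5


def with_evidence_at_least(entries, weakest):
    """Return accessions whose level number is at most `weakest`.

    `entries` is an iterable of (accession, ProteinExistence) pairs.
    """
    return [accession for accession, level in entries if level <= weakest]


# Hypothetical entries for the illustration.
entries = [
    ("X00001", ProteinExistence.PROTEIN_LEVEL),
    ("X00002", ProteinExistence.HOMOLOGY),
    ("X00003", ProteinExistence.TRANSCRIPT_LEVEL),
    ("X00004", ProteinExistence.UNCERTAIN),
]
supported = with_evidence_at_least(entries, ProteinExistence.TRANSCRIPT_LEVEL)
```

Using `IntEnum` keeps the slide's numbering and allows direct comparison, so filtering for "protein or transcript level" is a single `<=` test.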

  35. UniProtKB Additional information can be found in the cross-references (to more than 140 databases)

  36. UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
      Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
      Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
      Sequence: EMBL, IPI, PIR, RefSeq, UniGene
      Proteomic: PeptideAtlas, PRIDE, ProMEX
      Polymorphism: dbSNP
      Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
      Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
      Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
      Ontologies: GO
      2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
      Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
      3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
      PPI: DIP, IntAct, MINT, STRING
      Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
      PTM: GlycoSuiteDB, PhosphoSite, PhosSite
      Other: BindingDB, DrugBank, NextBio, PMAP-CutDB

  37. The UniProt web site (www.uniprot.org) • Powerful search engine, Google-like and easy to use, which also supports very directed field searches • Scoring mechanism presenting relevant matches first • Entry views, search result views and downloads are customizable • The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access • Search, Blast, Align, Retrieve, ID mapping

  38. Search: a very powerful text search tool, with autocompletion and refinement options, for finding UniProt entries and documentation by biological information

  39. Find all human proteins located in the nucleus

  40. The search interface guides users with helpful suggestions and hints

  41. Advanced Search: a very powerful search tool, to be used when you know in which entry section the information is stored

  42. Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)

  43. Result pages: highly customizable

  44. Result pages: downloadable

  45. The URL can be bookmarked and manually modified.
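Because the URL of a result page reflects the query (slide 37), a search can in principle be scripted by composing the URL directly. The sketch below only builds and inspects such a URL and makes no network request. The endpoint path, the parameter names `query` and `format`, and the free-text query string are assumptions made for illustration; the site's current documentation should be checked before relying on any of them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint, for illustration only; not confirmed by the slides.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_search_url(query, fmt=None, base_url=BASE_URL):
    """Compose a bookmarkable search URL from a query string.

    `query` and `format` are assumed parameter names.
    """
    params = {"query": query}
    if fmt is not None:
        params["format"] = fmt
    # urlencode percent-encodes reserved characters and turns spaces into "+".
    return base_url + "?" + urlencode(params)


# Free-text version of the slide 39 question (the query syntax is an assumption).
url = build_search_url("human nucleus", fmt="tab")

# "Manually modifying" the URL amounts to changing one parameter and rebuilding.
params = parse_qs(urlsplit(url).query)
modified_url = build_search_url("human cytoplasm", fmt=params["format"][0])
```

Keeping the whole query in the URL is what makes results bookmarkable and shareable: the same string that a user edits by hand can be generated by a script.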
