
Removing redundancy in SWISS-PROT and TrEMBL

Presentation Transcript


  1. Removing redundancy in SWISS-PROT and TrEMBL

  2. SWISS-PROT • is a curated protein sequence data bank established in 1986 by Amos Bairoch in Geneva and maintained collaboratively with EMBL since 1987 • currently contains 75 000 protein sequence entries

  3. Essential criteria for a sequence data bank • it must be complete with minimal redundancy • it must contain as much up-to-date information as possible on each sequence • all the information items must be retrievable by computer programs in a consistent manner • it should be integrated (cross-referenced) with other sequence related data banks

  4. The Bottleneck: Annotation

  5. Annotation consists of the description of: • Function(s) of the protein • Post-translational modification(s) • Domains and sites • Secondary structure • Quaternary structure • Similarities to other proteins • Disease(s) associated with deficiencies in the protein • Sequence conflicts, variants, etc.

  6. TrEMBL • is a computer-annotated supplement to SWISS-PROT • consists of entries in SWISS-PROT format • contains the translations of coding sequences (CDS) in the EMBL Nucleotide Sequence Database that are not in SWISS-PROT • the translation tools used are based on the program trembl, written by Thure Etzold at the EMBL in Heidelberg

  7. TrEMBLNEW • Weekly update of TrEMBL, containing protein coding sequences derived from EMBLNEW • TrEMBLNEW entries are moved into TrEMBL during the quarterly release building procedure

  8. The Production of TrEMBL • Translation and entry creation • Sorting the entries • Automated post-processing of the SP-TrEMBL entries

  9. Automated post-processing of TrEMBL entries • Redundancy removal: currently affects >10% of the entries • Improvements to annotation: currently affects >20% of the entries

  10. Removing Redundancy • Causes of redundancy and the detection of redundancy • Removing redundancy

  11. Causes of redundancy • Different literature and sequence reports for the same protein • Subfragments of longer sequences • Mutations, polymorphisms, variations and conflicts of a sequence are often given as separate entries in EMBL

  12. Redundancy detection • The Cyclic Redundancy Check (CRC32) calculates a nearly unique and very compact checksum for each sequence • The Boyer-Moore sequence comparison algorithm for fast string searching • An algorithm that finds strings with errors (Landau-Vishkin)
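
As an illustration of the CRC32 step on this slide, here is a minimal Python sketch, not the TrEMBL production code. It uses the standard-library `zlib.crc32`; the entry IDs, the toy sequences, and the function names `crc32_of` and `group_identical` are hypothetical. Because the checksum is only "nearly unique", the sketch confirms every checksum match with a full sequence comparison before reporting a group.

```python
import zlib
from collections import defaultdict


def crc32_of(sequence: str) -> int:
    """Return the CRC32 checksum of a sequence (case-insensitive)."""
    return zlib.crc32(sequence.upper().encode("ascii"))


def group_identical(entries: dict[str, str]) -> list[list[str]]:
    """Group entry IDs that carry exactly the same sequence.

    Entries are first bucketed by CRC32 (cheap and compact), then each
    bucket is re-checked on the full sequence, since two different
    sequences can share a checksum.
    """
    buckets = defaultdict(list)
    for entry_id, seq in entries.items():
        buckets[crc32_of(seq)].append(entry_id)

    groups = []
    for ids in buckets.values():
        by_sequence = defaultdict(list)
        for entry_id in ids:
            by_sequence[entries[entry_id].upper()].append(entry_id)
        groups.extend(g for g in by_sequence.values() if len(g) > 1)
    return groups


# Hypothetical entries: T1 and T2 are identical, T3 differs in one residue.
entries = {"T1": "MKTAYIAKQR", "T2": "MKTAYIAKQR", "T3": "MKTAYIAKQL"}
print(group_identical(entries))  # [['T1', 'T2']]
```

Bucketing by checksum means the expensive full comparison only runs on the few entries that already share a checksum.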

  13. Removing Redundancy • Identical full-length proteins are merged into one entry • Identical fragment proteins and subfragments of longer sequences from the same organism are merged
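
The subfragment rule on this slide can be sketched as follows. This is a simplified illustration under stated assumptions: the `Entry` class, the entry IDs, and `find_subfragments` are hypothetical, and Python's built-in substring test stands in for the Boyer-Moore search named on the previous slide. Only exact containment within the same organism is handled; error-tolerant matching (Landau-Vishkin) is not attempted here.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    entry_id: str
    organism: str
    sequence: str


def find_subfragments(entries: list[Entry]) -> list[tuple[str, str]]:
    """Return (fragment_id, parent_id) pairs.

    A fragment is paired with a parent when both come from the same
    organism and the fragment is a strictly shorter, exact substring
    of the parent. Each fragment is reported at most once.
    """
    # Longest first, so a fragment is matched against longer sequences only.
    ordered = sorted(entries, key=lambda e: len(e.sequence), reverse=True)
    pairs = []
    for i, short in enumerate(ordered):
        for parent in ordered[:i]:
            if (short.organism == parent.organism
                    and len(short.sequence) < len(parent.sequence)
                    and short.sequence in parent.sequence):
                pairs.append((short.entry_id, parent.entry_id))
                break
    return pairs


# Hypothetical entries: B is a human subfragment of A; C has the same
# sequence as B but comes from another organism, so it is left alone.
sample = [
    Entry("A", "Homo sapiens", "MKTAYIAKQRQISFVK"),
    Entry("B", "Homo sapiens", "AYIAKQ"),
    Entry("C", "Mus musculus", "AYIAKQ"),
]
print(find_subfragments(sample))  # [('B', 'A')]
```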

  14. Removing Redundancy: the ‘MERGE’ procedure
      - Match CRC32
          • match TrEMBLNEW vs TrEMBLNEW (automatic merge)
          • match TrEMBLNEW vs TrEMBL (automatic merge)
          • match TrEMBLNEW vs SWISS-PROT (manual merge)
      - Subfragment assembly (LASSAP)
          • match TrEMBLNEW vs TrEMBLNEW (automatic merge and manual check)
          • match TrEMBLNEW vs TrEMBL (automatic merge and manual check)
          • match TrEMBLNEW vs SWISS-PROT (manual merge)
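
The routing in the ‘MERGE’ procedure amounts to a small lookup table: the kind of match and the database that was hit decide whether the merge is automatic, automatic with a manual check, or manual. The sketch below encodes only what the slide states; the names `MERGE_POLICY` and `merge_action` and the key strings are assumptions made for illustration.

```python
# (match kind, database hit by a TrEMBLNEW entry) -> action, as on the slide.
MERGE_POLICY = {
    ("crc32", "TrEMBLNEW"): "automatic merge",
    ("crc32", "TrEMBL"): "automatic merge",
    ("crc32", "SWISS-PROT"): "manual merge",
    ("subfragment", "TrEMBLNEW"): "automatic merge and manual check",
    ("subfragment", "TrEMBL"): "automatic merge and manual check",
    ("subfragment", "SWISS-PROT"): "manual merge",
}


def merge_action(match_kind: str, target_db: str) -> str:
    """Look up how a match of a TrEMBLNEW entry should be handled."""
    try:
        return MERGE_POLICY[(match_kind, target_db)]
    except KeyError:
        raise ValueError(f"no policy for {match_kind!r} vs {target_db!r}") from None


print(merge_action("crc32", "TrEMBL"))            # automatic merge
print(merge_action("subfragment", "SWISS-PROT"))  # manual merge
```

In both branches, any match against the curated SWISS-PROT entries goes to a manual merge.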

  15. [Diagram: production timeline. Recoverable labels: Day 1, Day 2, Day n • EMBLNEW • trembl • Between releases • PIDCheck • SP + TREMBLPIDS (Work Release) • Week 1, Week 2, Week n • TREMBLNEW • SP Updates • Replace PIDs in SP+TREMBL • Building Release • Merge TREMBL]

  16. Results • EMBL Nucleotide Sequence Database (rel 55) has 326,000 CDS • SWISS-PROT (rel 36) has 74,019 entries • TrEMBL (rel 7) has 193,860 entries • 110,000 CDS were already in 74,000 SWISS-PROT entries • 207,000 CDS were in 194,000 TrEMBL entries • 9,000 CDS are currently being processed due to redundancy procedures
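
As a quick cross-check of the rounded CDS figures on this slide, the three categories should add up to the EMBL total. The variable names below are illustrative only.

```python
# Rounded CDS counts as given on the slide.
total_cds = 326_000      # CDS in EMBL release 55
in_swissprot = 110_000   # CDS already covered by SWISS-PROT entries
in_trembl = 207_000      # CDS covered by TrEMBL entries
in_progress = 9_000      # CDS still being processed

accounted = in_swissprot + in_trembl + in_progress
print(accounted, accounted == total_cds)  # 326000 True
```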

  17. Results of redundancy removal within TrEMBL 7 production - 743 were already in SWISS-PROT - 3,380 were merged due to CRC32 matches - 4,736 were removed by subfragment matches • 8,859 entries were removed in total
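
The total on this slide is the sum of the three removal categories. A short check follows, with the share of each category computed rather than quoted; the dictionary keys are paraphrases of the slide text, not official category names.

```python
# Entries removed during TrEMBL 7 production, by reason (from the slide).
removed = {
    "already in SWISS-PROT": 743,
    "merged on CRC32 match": 3380,
    "removed as subfragment": 4736,
}

total_removed = sum(removed.values())
print(total_removed)  # 8859

# Share of each category in the total.
for reason, count in removed.items():
    print(f"{reason}: {count / total_removed:.1%}")
```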

  18. Credits • SWISS-PROT at EBI: Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Henning Hermjakob, Viv Junker, Fiona Lang, Claire O'Donovan, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Steffen Moeller, Youla Karavidopoulou, Gill Fraser, Evguenia Kriventseva • Collaborators: Amos Bairoch, Eric Glemet, Jean-Jacques Codani