
Bioinformatics for Proteomics Shu-Hui Chen (陳淑慧) Department of Chemistry National Cheng Kung University

Bioinformatics I
• DNA (5’ → 3’) → Transcription → Splicing → mRNA → Translation → Polypeptide → Folding → Protein
• Protein → Transport / Localization, Oligomerization, PTM (Post-Translational Modification) → Function
How do we find protein-coding regions, introns, and exons in genomic DNA sequences?

What is Proteomics?
Systematic analysis of:
• All protein sequences
• All protein expression patterns
• All protein interactions
This involves:
• Protein isolation
• Protein separation
• Protein identification
• Functional characterization of all proteins

The Tools of Proteomics
Traditional protein chemistry assay methods struggle to establish identity.
Identity requires:
• Specificity of measurement (precision): mass spectrometry, MS-based data acquisition algorithms
• A reference for comparison: protein sequence databases, search algorithms

MS-based Proteomics and Bioinformatics
• MS instruments are not yet sensitive enough to resolve the proteins in a biological system based solely on the measured signals.
• MS can, however, acquire enough data to map a protein to a database entry, using computer algorithms to analyze the data.
• This is the field of bioinformatics.

Instrumentation
Sample inlet → Ion source → Mass analyzer → Data acquisition
(vacuum)

“Bioanalytical Chemistry” Mikkelsen, S.R., published by John Wiley & Sons, Inc.

MS-based Protein Identification
• Mass Mapping (current topic)
• Peptide Sequencing

Conventional Methodology: Expression Proteomics

Trypsin Digestion
-NH-CH(R1)-CO-NH-CH(R2)-CO-  →(trypsin)→  -NH-CH(R1)-COOH + H2N-CH(R2)-CO-
Trypsin cleaves polypeptides C-terminal to the basic amino acids (lysine and arginine).
[Figure: mass spectrum, ion intensity vs. m/z]
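The cleavage rule above can be sketched as an in-silico digest. This is a minimal illustration, not the method of any particular search engine: the function names are hypothetical, the rule is simplified (it ignores the commonly applied exception that cleavage is suppressed before proline), and the masses are standard monoisotopic residue masses rounded to five decimals.

```python
# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum of a free peptide


def tryptic_digest(sequence, missed_cleavages=0):
    """Split a protein C-terminal to every K or R (simplified rule)."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal peptide without K/R

    # Join neighbouring pieces to model up to `missed_cleavages` missed sites.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in tryptic_digest("AGKCDRLM"):
        print(pep, round(peptide_mass(pep), 4))
```

The list of peptide masses produced this way is the theoretical "fingerprint" that measured peaks are compared against in mass mapping.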

Mass Spectrometry Protein identified by database mapping
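Database mapping by peptide masses can be sketched as follows. This is a deliberately naive illustration that only counts matching peaks; real search engines use probabilistic scoring. The function names, the toy database, and the 0.5 Da tolerance are all assumptions made for the example.

```python
def match_peaks(observed, theoretical, tolerance_da=0.5):
    """Pair each observed mass with the first theoretical mass within tolerance."""
    matched = []
    for obs in observed:
        for theo in theoretical:
            if abs(obs - theo) <= tolerance_da:
                matched.append((obs, theo))
                break
    return matched


def rank_candidates(observed, database, tolerance_da=0.5):
    """Rank database proteins by the number of observed peaks they explain."""
    ranking = [
        (len(match_peaks(observed, masses, tolerance_da)), name)
        for name, masses in database.items()
    ]
    # Most matches first; ties broken alphabetically for a stable order.
    ranking.sort(key=lambda item: (-item[0], item[1]))
    return ranking


if __name__ == "__main__":
    # Hypothetical theoretical peptide masses for two made-up proteins.
    toy_database = {
        "protA": [500.3, 800.4, 1200.6],
        "protB": [500.3, 950.5],
    }
    observed_peaks = [500.2, 800.5, 1200.9]
    print(rank_candidates(observed_peaks, toy_database))
```

A wider tolerance lets more peaks match by chance, which is why the tolerance setting affects how many hits remain significant.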

Automated Database Search
• Number 1 match: tumor necrosis factor type 1 receptor associated protein TRAP-1
• Mass (Mr): 76030.27
• Total coverage: 33.4%
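A coverage figure like the one reported above is usually the percentage of the protein's residues that fall inside at least one matched peptide. A minimal sketch under that assumption (the function name and the toy sequence are hypothetical, and the protein is assumed to be non-empty):

```python
def sequence_coverage(protein, peptides):
    """Percentage of residues covered by at least one matched peptide."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:  # mark every occurrence of the peptide
            for k in range(start, start + len(pep)):
                covered[k] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


if __name__ == "__main__":
    # Toy protein of 10 residues; two matched tryptic peptides.
    print(sequence_coverage("AGKCDRLMNK", ["AGK", "LMNK"]))
```

Overlapping peptides are counted once per residue, so coverage never exceeds 100%.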

Minimal content of a « protein sequence » db
• Sequences !!
• Accession number (AC)
• Taxonomic data
• References
• ANNOTATION/CURATION
• Keywords
• Cross-references
• Documentation

SWISS-PROT/TrEMBL
• Collaboration between the SIB (CH) and EMBL/EBI (UK)
• SWISS-PROT: fully (manually) annotated, non-redundant, cross-referenced, documented protein sequence database.
• TrEMBL: automatically generated from annotated EMBL coding sequences (CDS) and annotated using software tools.
http://www.expasy.org/sprot/

ExPASy Web Server ExPASy = Expert Protein Analysis System

History of MS Searching
• MOWSE (1993), by Pappin and Bleasby
• SEQUEST (1994), by Yates and Eng
• MOWSE II (1996)
• Molecular Weight Search (1997)
• MOWSE III (1998)
• MASCOT, by Matrix Science

Scoring algorithm
• Final score = -10 × log10(P), where P is the absolute probability that the observed match is a random event.
• E value (expected value): the number of hits one can expect to see by chance when searching a database of a particular size. A value of zero indicates that no matches would be expected by chance.
• Significant hits at the 95% confidence level (p < 0.05): there is less than a 1 in 20 chance that the observed match is a random event.
[Figure annotation: increase mass tolerance; values 7, 5]
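The score formula above is easy to check numerically. The sketch below converts between P and the score; the function names are hypothetical, and the expected-hits helper assumes the simple approximation E = P × (number of candidates), which is an illustration rather than the exact calculation used by any specific search engine.

```python
import math


def score_from_probability(p):
    """Score = -10 * log10(P); smaller P gives a larger score."""
    return -10.0 * math.log10(p)


def probability_from_score(score):
    """Invert the score back to the probability of a random match."""
    return 10.0 ** (-score / 10.0)


def expected_random_hits(p, n_candidates):
    """Assumed approximation: expected chance hits = P * candidates searched."""
    return p * n_candidates


if __name__ == "__main__":
    print(round(score_from_probability(0.05), 2))  # score at the p = 0.05 cutoff
    print(probability_from_score(20.0))            # P for a score of 20
    print(expected_random_hits(0.001, 1000))
```

Because the scale is logarithmic, every additional 10 points corresponds to a tenfold drop in the probability that the match is random.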

MS-based Protein Identification
• Mass Mapping
• Peptide Sequencing (current topic)

Tandem Mass Spectrometry (MS/MS)
MS/MS acquisition is controlled by software settings.

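Protein Identification
• Peptide sequencing using MS/MS
• Precursor ion: peptide ABCDEF
• CID fragment pairs: A | BCDEF, AB | CDEF, ABC | DEF, ABCD | EF, ABCDE | F
[Figure: MS/MS spectrum along the m/z axis with fragment peaks A, AB, ABC, ABCD, ABCDE and residue labels A, B, C, D, E]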
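The fragment ladder above can be turned into numbers. In the standard nomenclature, N-terminal fragments (A, AB, ...) are b ions and C-terminal fragments (..., EF, F) are y ions. The sketch below computes singly charged b and y m/z values from standard monoisotopic residue masses; the function name is hypothetical, and other ion types, neutral losses, and higher charge states are left out.

```python
# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (label, fragment sequence, m/z) for singly charged b and y ions."""
    ions = []
    for i in range(1, len(peptide)):
        n_term, c_term = peptide[:i], peptide[i:]
        # b ion: N-terminal residues plus one proton.
        b_mz = sum(RESIDUE_MASS[aa] for aa in n_term) + PROTON
        # y ion: C-terminal residues plus water plus one proton.
        y_mz = sum(RESIDUE_MASS[aa] for aa in c_term) + WATER + PROTON
        ions.append((f"b{i}", n_term, round(b_mz, 4)))
        ions.append((f"y{len(c_term)}", c_term, round(y_mz, 4)))
    return ions


if __name__ == "__main__":
    for label, fragment, mz in fragment_ions("GAK"):
        print(label, fragment, mz)
```

The difference between consecutive b ions (or consecutive y ions) equals one residue mass, which is what allows the sequence to be read from the peak spacing.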

Nomenclature used for CID peptide fragmentation: low energy (eV); Q, TOF, FT
“Bioanalytical Chemistry” Mikkelsen, S.R., published by John Wiley & Sons, Inc.

Protein Identification by Database Search


Sequence Tag Approach for Peptide Sequencing “Bioanalytical Chemistry” Mikkelsen, S.R., published by John Wiley & Sons, Inc.

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

NCBI BLAST (Basic Local Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/blast/

Sequence alignments and comparison
    1: MYTAILORISRICH
    2: MONTAILLEURESTRICHE

Global Alignment
    1: MY-TAIL--ORIS-RICH-
       ¦x ¦¦¦¦  x¦x¦ ¦¦¦¦
    2: MONTAILLEURESTRICHE

Two Local Alignments
    1: TAILO    RICH
       ¦¦¦¦x    ¦¦¦¦
    2: TAILL    RICHE

¦ = Identity   x = Mismatch   - = Insertion / Deletion
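A local alignment like the ones above can be computed with the Smith-Waterman dynamic-programming algorithm. The sketch below is an illustration of that classic method, not of BLAST itself, which uses a faster heuristic. The scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions, and real protein alignments normally use a substitution matrix instead.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores are floored at 0 so an alignment can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("MYTAILORISRICH", "MONTAILLEURESTRICHE")
    print(score)
    print(top)
    print(bottom)
```

With these gap penalties the best-scoring region may span both conserved blocks rather than reporting them separately; raising the gap cost favors shorter, ungapped local matches.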

Multiple Sequence Alignment (MSA)
• Programs:
• CLUSTALW
• T_COFFEE
• MULTALIGN

HBA_CHICK  VL-SAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYFPHF-DL 48
HBAD_CHICK ML-TAEDKKLIQQAWEKAASHQEEFGAEALTRMFTTYPQTKTYFPHF-DL 48
HBPI_CHICK AL-TQAEKAAVTTIWAKVATQIESIGLESLERLFASYPQTKTYFPHF-DV 48
HBB_CHICK  VHWTAEEKQLITGLWGKV--NVAECGAEALARLLIVYPWTQRFFASFGNL 48
HBE_CHICK  VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFASFGNL 48
HBRH_CHICK VHWSAEEKQLITSVWSKV--NVEECGAEALARLLIVYPWTQRFFDNFGNL 48
MYG_CHICK  GL-SDQEWQQVLTIWGKVEADIAGHGHEVLMRLFHDHPETLDRFDKFKGL 49
           .... . ..* . .. * * * *.. .* * * * ..

HBA_CHICK  SH-----GSAQIKGHGKKVVAALIEAANHIDDIAGTLSKLSDLHAHKLRV 93
HBAD_CHICK SP-----GSDQVRGHGKKVLGALGNAVKNVDNLSQAMAELSNLHAYNLRV 93
HBPI_CHICK SQ-----GSVQLRGHGSKVLNAIGEAVKNIDDIRGALAKLSELHAYILRV 93
HBB_CHICK  SSPTAILGNPMVRAHGKKVLTSFGDAVKNLDNIKNTFSQLSELHCDKLHV 98
HBE_CHICK  SSPTAIMGNPRVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCDKLHV 98
HBRH_CHICK SSPTAIIGNPKVRAHGKKVLSSFGEAVKNLDNIKNTYAKLSELHCEKLHV 98
MYG_CHICK  KTPDQMKGSEDLKKHGATVLTQLGKILKQKGNHESELKPLAQTHATKHKI 99
           . *. .. ** .*.. . . .. .. . *.. * ..

HBA_CHICK  DPVNFKLLGQCFLVVVAIHHPAALTPEVHASLDKFLCAVGTVLTAKYR-- 141
HBAD_CHICK DPVNFKLLSQCIQVVLAVHMGKDYTPEVHAAFDKFLSAVSAVLAEKYR-- 141
HBPI_CHICK DPVNFKLLSHCILCSVAARYPSDFTPEVHAEWDKFLSSISSVLTEKYR-- 141
HBB_CHICK  DPENFRLLGDILIIVLAAHFSKDFTPECQAAWQKLVRVVAHALARKYH-- 146
HBE_CHICK  DPENFRLLGDILIIVLASHFARDFTPACQFAWQKLVNVVAHALARKYH-- 146
HBRH_CHICK DPENFRLLGNILIIVLAAHFTKDFTPTCQAVWQKLVSVVAHALAYKYH-- 146
MYG_CHICK  PVKYLEFISEVIIKVIAEKHAADFGADSQAAMKKALELFRNDMASKYKEF 149
           . .... . .* . . ... . .* . .. **.

HBA_CHICK  ---- 141
HBAD_CHICK ---- 141
HBPI_CHICK ---- 141
HBB_CHICK  ---- 146
HBE_CHICK  ---- 146
HBRH_CHICK ---- 146
MYG_CHICK  GFQG 153

Consensus length: 154; Identity: 19 (12.3%); Similarity: 51 (33.1%)
'*' marks a position in the alignment that is perfectly conserved.
'.' marks a position that is well conserved.
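The identity figure reported for the alignment (19 of 154 columns, 12.3%) is a count of perfectly conserved columns divided by the alignment length. A minimal sketch of that calculation on a toy alignment; the function name is hypothetical, and treating all-gap columns as not conserved is an assumption.

```python
def column_identity(aligned):
    """Count columns where every sequence has the same residue (gaps excluded)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    identical = sum(
        1
        for column in zip(*aligned)
        if column[0] != "-" and all(c == column[0] for c in column)
    )
    return identical, 100.0 * identical / length


if __name__ == "__main__":
    # Toy alignment loosely modelled on the DPVNF / DPENF motif above.
    count, percent = column_identity(["DPVNF", "DPENF", "DP-NF"])
    print(count, percent)
```

Similarity scores extend this idea by also counting columns whose residues differ but belong to the same physicochemical group.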

Searching databases with multiple alignments
PSI-BLAST: Position-Specific Iterative BLAST (Altschul et al., 1997)
• Starting with a single sequence, PSI-BLAST searches a database using BLAST and builds a multiple sequence alignment and a profile.
• The profile is then used to search the protein database again.
• Running the program several times can further refine the profile and increase search sensitivity.
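The idea of a profile can be illustrated with a small position-specific scoring matrix built from an alignment. This is a simplified sketch, not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a fixed pseudocount, and ungapped scanning, and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """One log-odds score per residue per column (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            if seq[col] in counts:  # gap characters are skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log2((n / total) / background) for aa, n in counts.items()}
        )
    return profile


def best_window(profile, sequence):
    """Slide the profile along a sequence; return (score, start) of the best hit."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(
            profile[k].get(sequence[start + k], 0.0) for k in range(width)
        )
        if best is None or score > best[0]:
            best = (score, start)
    return best  # None if the sequence is shorter than the profile


if __name__ == "__main__":
    profile = build_profile(["DPVNF", "DPENF", "DPVNF"])
    print(best_window(profile, "AAADPVNFAAA"))
```

In an iterative search, the hits found with one profile would be added to the alignment and the profile rebuilt, which is how sensitivity grows from round to round.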
