
Sequence motifs, information content, logos, and HMM’s

Sequence motifs, information content, logos, and HMM’s. Morten Nielsen, CBS, BioCentrum, DTU. What is a binding motif? How to describe a sequence motif? Construction of scoring matrices Sequence motifs and hidden Markov models Use of HMM


Presentation Transcript


  1. Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU

  2. Outline • What is a binding motif? • How to describe a sequence motif? • Construction of scoring matrices • Sequence motifs and hidden Markov models • Use of HMM • Why are Profile HMM’s better than Anders Gorm’s sequence alignments • Or at least PSSM’s

  3. Binding motifs MHC-I MHC-II TAP

  4. Anchor positions MHC class I with peptide

  5. Sequence information SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF 
SLLQHLIGL ELTLGEFLK MINAYLDKL AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

  6. Sequence information

  7. Sequence Information (same peptide list as on slide 5)

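  8. Sequence Information • Calculate $p_a$ at each position • Entropy: $S = -\sum_a p_a \log_2 p_a$ • Information content: $I = \log_2(20) - S$ • Conserved positions • $p_V = 1$, $p_{\text{rest}} = 0$ => $S = 0$, $I = \log_2(20)$ • Mutable positions • $p_a = 1/20$ for every amino acid => $S = \log_2(20)$, $I = 0$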
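The two limiting cases on this slide can be checked numerically. The sketch below assumes base-2 logarithms and the definitions S = -sum(p log2 p) and I = log2(20) - S, which agree with the S and I columns on slide 9 (they add up to about 4.32 in every row). The function names are illustrative and not from the slides.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def entropy(freqs):
    """Shannon entropy in bits; amino acids with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)

def information_content(freqs):
    """Information content in bits relative to a uniform 20-letter alphabet."""
    return math.log2(len(AMINO_ACIDS)) - entropy(freqs)

# Fully conserved position: only V is observed.
conserved = {aa: (1.0 if aa == "V" else 0.0) for aa in AMINO_ACIDS}
# Fully mutable position: all amino acids equally likely.
mutable = {aa: 1 / 20 for aa in AMINO_ACIDS}

print(round(information_content(conserved), 2))  # log2(20), about 4.32
print(round(abs(information_content(mutable)), 2))
```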

  9. Information content

          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V     S    I
     1 0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08  3.96 0.37
     2 0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08  2.16 2.16
     3 0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07  4.06 0.26
     4 0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05  3.87 0.45
     5 0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09  4.04 0.28
     6 0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15  3.92 0.40
     7 0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08  3.98 0.34
     8 0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03  4.04 0.28
     9 0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38  2.78 1.55

  10. Sequence logos • Height of a column equal to I • Relative height of a letter is p • Highly useful tool to visualize sequence motifs HLA-A0201 High information positions http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

  11. Characterizing a sequence motif from small data sets 10 MHC restricted peptides • What can we learn? • A at P1 favors binding? • I is not allowed at P9? • K at P4 favors binding? • Which positions are important for binding? • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV

  12. Simple motifs: Yes/No rules • 10 MHC restricted peptides • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV • Only 11 of 212 peptides identified! • Need more flexible rules • If a peptide does not fit P1 but fits P2, then OK • Not all positions are equally important • We know that P2 and P9 determine binding more than other positions • Cannot discriminate between good and very good binders

  13. Simple motifs: Yes/No rules • 10 MHC restricted peptides • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV • Example • The first two peptides will not fit the motif • RLLDDTPEV 0.59 • GLLGNVSTV 0.71 • ALAKAAAAL 0.47

  14. Extended motifs • Fitness of aa at each position given by P(aa) • Example P1: PA = 6/10, PG = 2/10, PT = PK = 1/10, PC = PD = … = PV = 0 • Problems • Few data • Data redundancy/duplication • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV
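The P1 frequencies quoted on this slide follow from plain counting over the ten peptides. A minimal sketch (the helper name `position_frequencies` is illustrative, and positions are 1-based as on the slides):

```python
from collections import Counter

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_frequencies(peptides, position):
    """Raw frequency of each amino acid observed at a 1-based position."""
    counts = Counter(p[position - 1] for p in peptides)
    return {aa: n / len(peptides) for aa, n in counts.items()}

p1 = position_frequencies(PEPTIDES, 1)
p9 = position_frequencies(PEPTIDES, 9)
print(p1)         # A: 0.6, G: 0.2, T: 0.1, K: 0.1
print("I" in p9)  # False: I is never observed at P9
```

Amino acids that were never observed simply get no entry, which is the zero-frequency problem the later slides address with pseudo counts.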

  15. Sequence information: Raw sequence counting • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV

  16. Sequence weighting • Poor or biased sampling of sequence space • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV • The first five peptides are similar sequences: weight 1/5 each • Example P1: PA = 2/6, PG = 2/6, PT = PK = 1/6, PC = PD = … = PV = 0 • Example • RLLDDTPEV 0.59 • GLLGNVSTV 0.71 • ALAKAAAAL 0.47
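One way to arrive at the weights on this slide is to cluster near-identical peptides and share a total weight of 1 within each cluster. The sketch below uses a simple greedy clustering with a 60% identity threshold; both the clustering scheme and the threshold are assumptions chosen for illustration, since the slide does not say how the similar sequences were identified.

```python
from collections import defaultdict

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def identity(a, b):
    """Fraction of identical positions between two equal-length peptides."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(seqs, threshold=0.6):
    """Greedy clustering; every member of a cluster of size n gets weight 1/n."""
    clusters = []  # each cluster is a list of indices into seqs
    for i, s in enumerate(seqs):
        for c in clusters:
            if identity(s, seqs[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    weights = [0.0] * len(seqs)
    for c in clusters:
        for i in c:
            weights[i] = 1 / len(c)
    return weights

def weighted_frequencies(seqs, weights, position):
    """Weighted amino acid frequencies at a 1-based position."""
    total = sum(weights)
    freqs = defaultdict(float)
    for s, w in zip(seqs, weights):
        freqs[s[position - 1]] += w / total
    return dict(freqs)

weights = cluster_weights(PEPTIDES)
p1 = weighted_frequencies(PEPTIDES, weights, 1)
print(weights)  # 0.2 for the five ALAKAAAA* peptides, 1.0 for the rest
print(p1)       # A and G about 2/6, T and K about 1/6
```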

  17. Sequence weighting • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV

  18. Pseudo counts • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV • I is not found at position P9. Does this mean that I is forbidden (P(I)=0)? • No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9

  19. The Blosum matrix • Some amino acids are highly conserved (e.g. C), some have a high chance of mutation (e.g. I)

          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
     A 0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07
     R 0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03
     N 0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03
     D 0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02
     C 0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06
     Q 0.06 0.07 0.04 0.05 0.01 0.21 0.10 0.04 0.03 0.03 0.05 0.09 0.02 0.01 0.02 0.06 0.04 0.01 0.02 0.04
     E 0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03
     G 0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02
     H 0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02
     I 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18
     L 0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10
     K 0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03
     M 0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09
     F 0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06
     P 0.06 0.03 0.02 0.03 0.01 0.02 0.04 0.04 0.01 0.03 0.04 0.04 0.01 0.01 0.49 0.04 0.04 0.00 0.01 0.03
     S 0.11 0.04 0.05 0.05 0.02 0.03 0.05 0.07 0.02 0.03 0.04 0.05 0.02 0.02 0.03 0.22 0.08 0.01 0.02 0.04
     T 0.07 0.04 0.04 0.04 0.02 0.03 0.04 0.04 0.01 0.05 0.07 0.05 0.02 0.02 0.03 0.09 0.25 0.01 0.02 0.07
     W 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.03 0.05 0.02 0.02 0.06 0.01 0.02 0.02 0.49 0.07 0.03
     Y 0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05
     V 0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27

  20. Pseudo count estimation • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV • Calculate observed amino acid frequencies $f_a$ • Pseudo frequency for amino acid b: $g_b = \sum_a f_a \, q_{b|a}$ • Example
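As an illustration of the pseudo frequency, the sketch below estimates g for I at P9 from the ten peptides, using the usual Blosum-based form g_b = sum over a of f_a * q(b|a). The q(I|a) values are read from column I of the Blosum frequency matrix on slide 19, treating each row as the distribution conditioned on the row amino acid (an assumption, supported by the rows summing to about 1). Raw counts are used here without sequence weighting, so this is not necessarily the number shown on the original slide.

```python
from collections import Counter

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

# q(I | a) for the amino acids a observed at P9 (column I of slide 19).
Q_I_GIVEN = {"M": 0.10, "N": 0.02, "R": 0.02, "T": 0.05, "V": 0.16, "L": 0.12}

def observed_frequencies(peptides, position):
    """Raw amino acid frequencies at a 1-based position."""
    counts = Counter(p[position - 1] for p in peptides)
    return {aa: n / len(peptides) for aa, n in counts.items()}

def pseudo_frequency(f, q_b_given_a):
    """g_b = sum_a f_a * q(b|a) for one target amino acid b."""
    return sum(f_a * q_b_given_a[a] for a, f_a in f.items())

f9 = observed_frequencies(PEPTIDES, 9)
g_I = pseudo_frequency(f9, Q_I_GIVEN)
print(f9)             # I is absent from the observed frequencies
print(round(g_I, 3))  # 0.094
```

Although I is never observed at P9, its pseudo frequency is clearly above zero because V, L and M, which are observed there, readily substitute for I.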

  21. Weight on pseudo count • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV • Pseudo counts are important when only limited data is available • With large data sets only “true” observations should count • $p_a = \frac{\alpha f_a + \beta g_a}{\alpha + \beta}$ • α is the effective number of sequences (N-1), β is the weight on prior

  22. Weight on pseudo count • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV • Example • If α large, p ≈ f and only the observed data define the motif • If α small, p ≈ g and the pseudo counts (or prior) define the motif • β is [50-200] normally
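The two limits can be seen directly if the observed frequency f and the pseudo frequency g are mixed as a weighted average, p = (α f + β g) / (α + β). That formula is assumed here as the standard form matching the description of α and β; the input numbers are illustrative only.

```python
def combine(f, g, alpha, beta):
    """Weighted average of observed frequency f and pseudo frequency g."""
    return (alpha * f + beta * g) / (alpha + beta)

f_obs, g_pseudo = 0.0, 0.094   # illustrative: an unobserved amino acid

# Few sequences: the prior dominates.
print(round(combine(f_obs, g_pseudo, alpha=9, beta=50), 3))     # 0.08
# Very many sequences: the observed data dominate.
print(round(combine(f_obs, g_pseudo, alpha=5000, beta=50), 3))  # 0.001
```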

  23. Sequence weighting and pseudo counts • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV RLLDDTPEV 0.59 GLLGNVSTV 0.71 ALAKAAAAL 0.47 P7P and P7S > 0

  24. Position specific weighting • We know that positions 2 and 9 are anchor positions for most MHC binding motifs • Increase weight on high information positions • Motif found on large data set

  25. Weight matrices • Estimate amino acid frequencies from alignment including sequence weighting and pseudo count • What do the numbers mean? • P2(V)>P2(M). Does this mean that V enables binding more than M? • In nature not all amino acids are found equally often • qA = 0.070, qW = 0.013 • Finding 6% A is hence not significant, but 6% W highly significant • In nature V is found more often than M, so we must somehow rescale with the background

          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
     1 0.08 0.06 0.02 0.03 0.02 0.02 0.03 0.08 0.02 0.08 0.11 0.06 0.04 0.06 0.02 0.09 0.04 0.01 0.04 0.08
     2 0.04 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.11 0.44 0.02 0.06 0.03 0.01 0.02 0.05 0.00 0.01 0.10
     3 0.08 0.04 0.05 0.07 0.02 0.03 0.03 0.08 0.02 0.05 0.11 0.03 0.03 0.06 0.04 0.06 0.05 0.03 0.05 0.07
     4 0.08 0.05 0.03 0.10 0.01 0.05 0.08 0.13 0.01 0.05 0.06 0.05 0.01 0.03 0.08 0.06 0.04 0.02 0.01 0.05
     5 0.06 0.04 0.05 0.03 0.01 0.04 0.05 0.11 0.03 0.04 0.09 0.04 0.02 0.06 0.06 0.04 0.05 0.02 0.05 0.08
     6 0.06 0.03 0.03 0.03 0.03 0.03 0.04 0.06 0.02 0.10 0.14 0.04 0.03 0.05 0.04 0.06 0.06 0.01 0.03 0.13
     7 0.10 0.02 0.04 0.04 0.02 0.03 0.04 0.05 0.04 0.08 0.12 0.02 0.03 0.06 0.07 0.06 0.05 0.03 0.03 0.08
     8 0.05 0.07 0.04 0.03 0.01 0.04 0.06 0.06 0.03 0.06 0.13 0.06 0.02 0.05 0.04 0.08 0.07 0.01 0.04 0.05
     9 0.08 0.02 0.01 0.01 0.02 0.02 0.03 0.02 0.01 0.10 0.23 0.03 0.02 0.04 0.01 0.04 0.04 0.00 0.02 0.25

  26. How to score a sequence to a probability matrix? • $p_{ij}$ describes a motif ($i$ is the position, $j$ the amino acid; see the probability matrix on slide 25) • The probability that a peptide $a_1 a_2 \ldots a_L$ fits the motif is $P_{\text{motif}} = \prod_{i=1}^{L} p_{i a_i}$

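  27. How to score a sequence to a probability matrix? • $p_{ij}$ describes a motif • The probability that a peptide $a_1 a_2 \ldots a_L$ fits the motif is $P_{\text{motif}} = \prod_{i=1}^{L} p_{i a_i}$ • The probability that the peptide fits a random model is $P_{\text{random}} = \prod_{i=1}^{L} q_{a_i}$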

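  28. How to score a sequence to a probability matrix? • $p_{ij}$ describes a motif • The probability that a peptide $a_1 a_2 \ldots a_L$ fits the motif is $P_{\text{motif}} = \prod_{i=1}^{L} p_{i a_i}$ • The probability that the peptide fits a random model is $P_{\text{random}} = \prod_{i=1}^{L} q_{a_i}$ • The ratio of the two gives the odds: $\prod_{i=1}^{L} p_{i a_i} / q_{a_i}$ • The log gives the score: $S = \sum_{i=1}^{L} \log(p_{i a_i} / q_{a_i})$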

  29. Weight matrices • A weight matrix is given as $W_{ij} = \log(p_{ij}/q_j)$ • where i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j. • W is an L x 20 matrix, L is motif length

          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
     1  0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
     2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
     3  0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
     4 -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
     5 -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
     6 -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
     7  1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
     8 -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
     9 -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5
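The log-odds definition can be sketched as below. The slide only says "log", so base 2 is an assumption, and the function names are illustrative. The example reuses the background frequencies quoted earlier (qA = 0.070, qW = 0.013) to show why observing 6% A is unremarkable while 6% W is strongly enriched.

```python
import math

def weight(p, q):
    """Log-odds weight of one matrix cell (base 2 assumed). Requires p > 0,
    which the pseudo counts are meant to guarantee."""
    return math.log2(p / q)

def weight_matrix(p_rows, background):
    """Turn per-position frequency dicts into per-position log-odds dicts."""
    return [{aa: weight(p, background[aa]) for aa, p in row.items()}
            for row in p_rows]

print(round(weight(0.06, 0.070), 2))  # -0.22: 6% A is below background
print(round(weight(0.06, 0.013), 2))  # 2.21: 6% W is well above background
```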

  30. Example • $W_{ij} = \log(p_{ij}/q_j)$ • Calculate the weight matrix based on the following observation (use β=50): Sequence = I • Important: what is α? • Blosum frequencies $q_{b|a}$ (rows E to F) and background frequencies q:

          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
     E 0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03
     G 0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02
     H 0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02
     I 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18
     L 0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10
     K 0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03
     M 0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09
     F 0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06

     q 0.08 0.05 0.04 0.05 0.02 0.03 0.06 0.07 0.02 0.06 0.10 0.06 0.02 0.04 0.04 0.06 0.05 0.01 0.03 0.07

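  31. Example • $W_{ij} = \log(p_{ij}/q_j)$ • With a single sequence, α = N-1 = 0, so p = g, and the pseudo frequency is the Blosum row for I: $p_b = g_b = q_{b|I}$ • So the score is simply the Blosum62 row for amino acid I! • This is why β is called weight on prior. Our prior knowledge is Blosum. • We will only accept a weight matrix different from Blosum if we have much data.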
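The single-observation case can be worked through numerically. The sketch below assumes the weighted-average form p = (α f + β g) / (α + β) with α = N - 1 = 0 and β = 50, the Blosum row for I and the background frequencies q from slide 30, and base-2 logarithms (the slide does not state the base).

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Blosum frequencies q(b | I): row I on slide 30.
Q_GIVEN_I = [0.05, 0.02, 0.01, 0.02, 0.02, 0.01, 0.02, 0.02, 0.01, 0.27,
             0.17, 0.02, 0.04, 0.04, 0.01, 0.03, 0.04, 0.01, 0.02, 0.18]
# Background frequencies q: last row on slide 30.
Q_BACKGROUND = [0.08, 0.05, 0.04, 0.05, 0.02, 0.03, 0.06, 0.07, 0.02, 0.06,
                0.10, 0.06, 0.02, 0.04, 0.04, 0.06, 0.05, 0.01, 0.03, 0.07]

n_sequences = 1
alpha, beta = n_sequences - 1, 50

# One observed sequence "I": f is 1 for I and 0 for everything else,
# so g_b = sum_a f_a * q(b|a) collapses to q(b|I).
f = [1.0 if aa == "I" else 0.0 for aa in AMINO_ACIDS]
g = Q_GIVEN_I
p = [(alpha * fa + beta * ga) / (alpha + beta) for fa, ga in zip(f, g)]

weights = {aa: math.log2(pb / qb)
           for aa, pb, qb in zip(AMINO_ACIDS, p, Q_BACKGROUND)}
print(round(weights["I"], 2))  # 2.17
print(round(weights["L"], 2))  # 0.77
print(round(weights["R"], 2))  # -1.32
```

Because α is zero, the observed I contributes nothing beyond selecting the Blosum row, so the resulting weights are just that row expressed as log-odds.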

  32. Scoring a sequence to a weight matrix • Score sequences to weight matrix by looking up and adding L values from the matrix (the weight matrix of slide 29) • Which peptide is most likely to bind? Which peptide second?

     Peptide    Measured  Score
     RLLDDTPEV  0.59      11.9
     GLLGNVSTV  0.71      14.7
     ALAKAAAAL  0.47       4.3
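The look-up-and-add procedure can be reproduced with the weight matrix from slide 29. In the sketch below the matrix is copied from that slide; the function name `score` is illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Weight matrix from slide 29: one row per peptide position, columns in
# the order of AMINO_ACIDS.
WEIGHTS = [
    [0.6, 0.4, -3.5, -2.4, -0.4, -1.9, -2.7, 0.3, -1.1, 1.0, 0.3, 0.0, 1.4, 1.2, -2.7, 1.4, -1.2, -2.0, 1.1, 0.7],
    [-1.6, -6.6, -6.5, -5.4, -2.5, -4.0, -4.7, -3.7, -6.3, 1.0, 5.1, -3.7, 3.1, -4.2, -4.3, -4.2, -0.2, -5.9, -3.8, 0.4],
    [0.2, -1.3, 0.1, 1.5, 0.0, -1.8, -3.3, 0.4, 0.5, -1.0, 0.3, -2.5, 1.2, 1.0, -0.1, -0.3, -0.5, 3.4, 1.6, 0.0],
    [-0.1, -0.1, -2.0, 2.0, -1.6, 0.5, 0.8, 2.0, -3.3, 0.1, -1.7, -1.0, -2.2, -1.6, 1.7, -0.6, -0.2, 1.3, -6.8, -0.7],
    [-1.6, -0.1, 0.1, -2.2, -1.2, 0.4, -0.5, 1.9, 1.2, -2.2, -0.5, -1.3, -2.2, 1.7, 1.2, -2.5, -0.1, 1.7, 1.5, 1.0],
    [-0.7, -1.4, -1.0, -2.3, 1.1, -1.3, -1.4, -0.2, -1.0, 1.8, 0.8, -1.9, 0.2, 1.0, -0.4, -0.6, 0.4, -0.5, -0.0, 2.1],
    [1.1, -3.8, -0.2, -1.3, 1.3, -0.3, -1.3, -1.4, 2.1, 0.6, 0.7, -5.0, 1.1, 0.9, 1.3, -0.5, -0.9, 2.9, -0.4, 0.5],
    [-2.2, 1.0, -0.8, -2.9, -1.4, 0.4, 0.1, -0.4, 0.2, -0.0, 1.1, -0.5, -0.5, 0.7, -0.3, 0.8, 0.8, -0.7, 1.3, -1.1],
    [-0.2, -3.5, -6.1, -4.5, 0.7, -0.8, -2.5, -4.0, -2.6, 0.9, 2.8, -3.0, -1.8, -1.4, -6.2, -1.9, -1.6, -4.9, -1.6, 4.5],
]

def score(peptide):
    """Sum the weight of the observed amino acid at each position."""
    return sum(WEIGHTS[i][AMINO_ACIDS.index(aa)] for i, aa in enumerate(peptide))

for pep in ("RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"):
    print(pep, round(score(pep), 1))  # 11.9, 14.7 and 4.3
```

The ranking by score (GLLGNVSTV, then RLLDDTPEV, then ALAKAAAAL) follows the ranking of the values listed with the peptides on the slide.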

  33. Example from real life • 10 peptides from MHCpep database • Bind to the MHC complex • Relevant for immune system recognition • Estimate sequence motif and weight matrix • Evaluate motif “correctness” on 528 peptides • ALAKAAAAM • ALAKAAAAN • ALAKAAAAR • ALAKAAAAT • ALAKAAAAV • GMNERPILT • GILGFVFTM • TLNAWVKVV • KLNEPVLLL • AVVPFIVSV

  34. Prediction accuracy Pearson correlation 0.45

  35. Predictive performance

  36. End of first part • Take a deep breath • Smile to your neighbor

  37. Hidden Markov Models • Weight matrices do not deal with insertions and deletions • In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension • HMM is a natural framework where insertions/deletions are dealt with explicitly

  38. Why hidden? • The unfair casino: loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1 • Fair: 1:1/6 2:1/6 3:1/6 4:1/6 5:1/6 6:1/6; stay fair 0.95, switch to loaded 0.05 • Loaded: 1:1/10 2:1/10 3:1/10 4:1/10 5:1/10 6:1/2; stay loaded 0.9, switch to fair 0.10 • Model generates numbers • 312453666641 • Does not tell which die was used • Alignment (decoding) can give the most probable solution/path (Viterbi) • FFFFFFLLLLLL

  39. HMM (a simple example) • ACA---ATG • TCAACTATC • ACAC--AGC • AGA---ATC • ACCG--ATC • Example from A. Krogh • Core of alignment • Core region defines the number of states in the HMM (red) • Insertion and deletion statistics are derived from the non-core part of the alignment (black)

  40. HMM construction • ACA---ATG • TCAACTATC • ACAC--AGC • AGA---ATC • ACCG--ATC • 5 matches. A, 2xC, T, G • 5 transitions in gap region • C out, G out • A-C, C-T, T out • Out transition 3/5 • Stay transition 2/5 • ACA---ATG: 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2 • [Figure: HMM state diagram with emission probabilities for A, C, G, T in each state and the transition probabilities between states]

  41. Align sequence to HMM • ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3x10^-2 • TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10^-2 • ACAC--AGC = 1.2x10^-2 • AGA---ATC = 3.3x10^-2 • ACCG--ATC = 0.59x10^-2 • Consensus: • ACAC--ATC = 4.7x10^-2, ACA---ATC = 13.1x10^-2 • Exceptional: • TGCT--AGG = 0.0023x10^-2
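The products on this slide can be reproduced with a small model. The emission and transition values below are reconstructed from the five aligned sequences and the factors in the slide's arithmetic (for example 0.6 to enter the insert state, 0.4 to stay in it), so they are an inference rather than a copy of the original figure. The function assumes the fixed 3 + 3 + 3 column layout of this alignment.

```python
# Match-state emissions for the six core columns.
MATCH = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
# Insert-state emissions: A, 2xC, T, G observed in the gap region.
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
TO_INSERT, SKIP_INSERT = 0.6, 0.4      # transitions out of match state 3
STAY_INSERT, LEAVE_INSERT = 0.4, 0.6   # transitions out of the insert state

def path_probability(aligned):
    """Probability of a 9-column aligned sequence: 3 core, 3 insert, 3 core."""
    core = aligned[:3] + aligned[6:]
    inserted = aligned[3:6].replace("-", "")
    prob = 1.0
    for state, base in zip(MATCH, core):
        prob *= state.get(base, 0.0)
    if inserted:
        prob *= TO_INSERT * STAY_INSERT ** (len(inserted) - 1) * LEAVE_INSERT
        for base in inserted:
            prob *= INSERT[base]
    else:
        prob *= SKIP_INSERT
    return prob

for seq in ("ACA---ATG", "TCAACTATC", "ACAC--ATC", "ACA---ATC", "TGCT--AGG"):
    print(seq, f"{path_probability(seq):.2e}")
```

All other transitions in this model have probability 1 and are therefore left out of the product.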

  42. Align sequence to HMM - Null model • Score depends strongly on length • Null model is a random model. For length L the probability is 0.25^L • Log-odds score for sequence S: log(P(S)/0.25^L) • Positive score means more likely than Null model • ACA---ATG = 4.9 • TCAACTATC = 3.0 • ACAC--AGC = 5.3 • AGA---ATC = 4.9 • ACCG--ATC = 4.6 • Consensus: • ACAC--ATC = 6.7 • ACA---ATC = 6.3 • Exceptional: • TGCT--AGG = -0.97 • Note!
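The log-odds scores can be recomputed from the sequence probabilities. The sketch below assumes the natural logarithm, which reproduces the listed values, and takes L as the number of emitted letters (gaps excluded). The probabilities are the unrounded products of the factors shown on slide 41.

```python
import math

def log_odds(prob, length, null_per_symbol=0.25):
    """ln(P(S) / 0.25^L) for a sequence of L emitted letters."""
    return math.log(prob / null_per_symbol ** length)

print(round(log_odds(0.032768, 6), 1))        # ACA---ATG: 4.9
print(round(log_odds(0.000075497472, 9), 1))  # TCAACTATC: 3.0
print(round(log_odds(0.131072, 6), 1))        # ACA---ATC: 6.3
print(round(log_odds(0.00002304, 7), 2))      # TGCT--AGG: -0.97
```

Dividing by the null model removes most of the length dependence: TCAACTATC has by far the lowest raw probability among the training sequences, yet its log-odds score is still clearly positive.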

  43. Model decoding (Viterbi): The unfair casino • Log model (log10 of the probabilities) • Fair: 1:-0.78 2:-0.78 3:-0.78 4:-0.78 5:-0.78 6:-0.78; stay fair -0.02, switch to loaded -1.3 • Loaded: 1:-1 2:-1 3:-1 4:-1 5:-1 6:-0.3; stay loaded -0.05, switch to fair -1 • Example: 1245666 • FFFFLLL
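The decoding example can be checked with a short Viterbi implementation in log space. The emission and transition probabilities are those of the unfair casino; the start distribution is not given on the slides, so the sketch assumes the game starts with the fair die.

```python
import math

STATES = "FL"
START = {"F": 1.0, "L": 0.0}  # assumption: the casino starts with the fair die
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5},
}

def log10(x):
    """log10 that maps probability 0 to minus infinity."""
    return math.log10(x) if x > 0 else float("-inf")

def viterbi(rolls):
    """Most probable state path (F = fair, L = loaded) for a string of rolls."""
    v = {s: log10(START[s]) + log10(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for r in rolls[1:]:
        new_v, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda prev: v[prev] + log10(TRANS[prev][s]))
            ptr[s] = best
            new_v[s] = v[best] + log10(TRANS[best][s]) + log10(EMIT[s][r])
        v = new_v
        backpointers.append(ptr)
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("1245666"))  # FFFFLLL
```

Working with sums of logs instead of products of probabilities avoids numerical underflow on long roll sequences, which is the reason for the log model on this slide.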

  44. HMM’s and weight matrices • In the case of un-gapped alignments HMM’s become simple weight matrices • To achieve high performance, the emission frequencies are estimated using the techniques of • Sequence weighting • Pseudo counts

  45. Profile HMM’s • Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner • Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix) • Profile HMM’s are ideally suited to describe such position specific variations

  46. Profile HMM’s • Figure labels: Non-conserved • Insertion • Conserved • Deletion • Must have a G • Anything can match • Core: Position with < 2 gaps

     ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
     TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
     -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
     IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
     -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
     ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
     TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
     TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

  47. Profile HMM’s All M/D pairs must be visited once L1-Y2A3V4R5-I6 P1D2P3P4I4P5D6P7

  48. Example. Sequence profiles • Alignment of protein sequences 1PLC._ and 1GYC.A • E-value > 1000 • Profile alignment • Align 1PLC._ against Swiss-prot • Make position specific weight matrix from alignment • Use this matrix to align 1PLC._ against 1GYC.A • E-value < 10^-22. Rmsd=3.3 Å

  49. Example continued • Smith-Waterman score: 53; 26.2% identity in 61 aa overlap • Aligned segments (position rulers and match lines are not recoverable from the transcript):

     1PLC._ IDVLLGADDGSLAFVPSEFSISPG--EKIV-----FKNNAG
     1GYC.A ILRYQGAPVAEPTTTQTTSVIPLIETNLHPLARMPVPGSPTPGGVDKALNLAFNFNGTNF

     1PLC._ FPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVG
     1GYC.A FINNASFTPPTVPVLLQILSGAQTAQDLLPAGSVYPLPAHSTIEITLPATALAPGAPHPF

     1PLC._ KVTVN
     1GYC.A HLHGHAFAVVRSAGSTTYNYNDPIFRDVVSTGTPAAGDNVTIRFQTDNPGPWFLHCHIDF

  50. Example continued • Score = 97.1 bits (241), Expect = 9e-22 • Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%) • Rmsd=3.3 Å • Model red, Structure blue

     Query: 3  ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
     Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

     Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
     Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
