
Hyungwon Choi hwchoi@umich Department of Biostatistics University of Michigan MSSISS-2007

Semi-supervised Bayes Classifier for Peptide Identification in Shotgun Proteomics. Hyungwon Choi, hwchoi@umich.edu, Department of Biostatistics, University of Michigan. MSSISS-2007, March 16. Outline: High-throughput Peptide Identification using Mass Spectrometry (MS)


Presentation Transcript


  1. Semi-supervised Bayes Classifier for Peptide Identification in Shotgun Proteomics Hyungwon Choi hwchoi@umich.edu Department of Biostatistics University of Michigan MSSISS-2007 March 16

  2. Outline • High-throughput Peptide Identification using Mass Spectrometry (MS) • Simple Bayes Classifier constructed by Mixture Density Deconvolution • Bayesian Density Estimation Algorithm using reversible jump MCMC • Semi-supervised Classification using Augmented Decoy Peptide Database Search

  3. Our Research Question • Given a protein mixture, say a tumor specimen, can we identify the proteins in it? • Can we automate the large-scale protein identification process? • Can we probabilistically validate such identifications? • Approach: tandem mass spectrometry and database search, followed by a statistical model.

  4. Peptide Identification via Mass Spectrometry • It is extremely hard to manipulate and directly sequence whole proteins. • Proteins are therefore digested into small fragments of manageable size, called peptides. • We first identify these peptides, then assemble the findings back to the protein level. • Example protein: >sp|P02754|LACB_BOVIN MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI • Trypsin cleaves after K and R: MK||CLLLALALTCGAQALIVTQTMK||GLDIQK||VAGTWYSLAMAASDISLLDAQSAPLR||VYVEELK||PTPEGDLEILLQK||WENGECAQK|| …
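The digestion rule on this slide ("cleaves after K and R") can be sketched in a few lines of Python. This is only an illustration of the simplified rule as stated: the function name `trypsin_digest` is hypothetical, and real digestion rules are more involved (for example, cleavage before proline is commonly excluded, and missed cleavages occur).

```python
def trypsin_digest(sequence):
    """Split a protein sequence after every K or R (the simplified slide rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            # Cleave immediately after this residue.
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal remainder, which need not end in K or R.
        peptides.append(sequence[start:])
    return peptides


# First part of the LACB_BOVIN sequence shown on the slide.
prefix = "MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELK"
peptides = trypsin_digest(prefix)
print(peptides)
```

Joining the resulting peptides in order reproduces the original sequence, which is a quick sanity check on any digestion routine.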

  5. Tandem Mass Spectrometry [Diagram: workflow from a protein mixture through digestion (peptides), ionization (peptide ions), mass analysis, isolation, and fragmentation (peptide fragments) to a tandem mass spectrum (MS/MS, intensity vs. m/z) used for peptide identification]

  6. Sequence vs. Tandem Mass Spectrum [Figure: tandem mass spectrum of the peptide VPTPNVSVVDLTCR, relative abundance (0 to 100) vs. m/z (600 to 1800), with peaks labeled by the fragment sequences CR, TCR, LTCR, DLTCR, VDLTCR, VVDLTCR, SVVDLTCR, VSVVDLTCR, NVSVVDLTCR, PNVSVVDLTCR, and TPNVSVVDLTCR]

  7. Automation of Identification • There are hundreds, or even thousands, of potential peptide sources in a typical protein mixture. • All tandem mass spectra generated in an experiment can be queried against an existing protein database and assigned to their best matches.

  8. Validation of Peptide Identifications [Figure: every spectrum (1, 2, 3, ..., N) in the entire dataset is searched against a database of peptides (ISLLDAQSAPLR, VVEELCPTPEGK, DLLLQWCWENGK, ECDVVSNTIIAEK, GDAVFVIDALNR, VPTPNVSVVDTNR) and a decoy database of reversed sequences (RLPASQADLLSI, KGEPTPCLEEVV, KGNEWCWQLLLD, KEAIITNSVVDCE), and each match receives a score; scores shown: 2.2, 4.6, 8.1, 4.3, 3.7, 3.2 and 0.6, -0.3, -2.7, -4.1] • A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, Anal. Chem. 74, 5383 (2002)
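The decoy database on this slide is labeled "Reverse Sequence", and the decoy entries shown are exact reversals of the target peptides. A minimal sketch of that construction follows; the helper name `make_decoys` is hypothetical, and real pipelines may reverse whole protein sequences rather than individual peptides.

```python
def make_decoys(targets):
    """Build decoy entries by reversing each target sequence."""
    return [sequence[::-1] for sequence in targets]


# Target peptides that appear on the slide.
targets = ["ISLLDAQSAPLR", "VVEELCPTPEGK", "DLLLQWCWENGK", "ECDVVSNTIIAEK"]
decoys = make_decoys(targets)
for target, decoy in zip(targets, decoys):
    print(target, "->", decoy)
```

Because a reversed sequence keeps the same amino acid composition and mass, a spectrum matched to it behaves like a typical incorrect match, which is what makes these entries useful as known negatives.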

  9. Mixture Deconvolution • The distribution of database search scores can be modeled with a two-component mixture: one component for correct identifications and the other for incorrect identifications. [Figure: score distribution with the Incorrect and Correct components marked]
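In symbols, a two-component mixture of this kind can be written as follows. The notation is an assumption introduced here for illustration (the slide's own formula is not part of the transcript): x is a search score, f_0 and f_1 are the score densities for incorrect and correct identifications, and pi_0, pi_1 are their mixing proportions.

```latex
f(x) = \pi_0 \, f_0(x) + \pi_1 \, f_1(x),
\qquad \pi_0 + \pi_1 = 1, \quad \pi_0, \pi_1 \ge 0 .
```

"Deconvolution" then means estimating pi_0, pi_1, f_0 and f_1 from the pooled scores, without knowing in advance which scores came from which component.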

  10. Bayes Classifier • Given the estimated mixture density, the posterior probability of a correct identification is given by Bayes' rule:
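For a two-component mixture, Bayes' rule gives P(correct | x) = pi_1 f_1(x) / (pi_0 f_0(x) + pi_1 f_1(x)), where f_0 and f_1 are the incorrect and correct score densities and pi_0 = 1 - pi_1. The sketch below evaluates this with normal densities purely for illustration; the component shapes, the parameter values, and the function names are all assumptions, not values from the talk.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here as a stand-in component."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(score, pi_correct, f_correct, f_incorrect):
    """Bayes' rule for a two-component mixture of score densities."""
    numerator = pi_correct * f_correct(score)
    denominator = numerator + (1.0 - pi_correct) * f_incorrect(score)
    return numerator / denominator


# Hypothetical components: incorrect ~ N(0, 1), correct ~ N(4, 1.5).
f_incorrect = lambda x: normal_pdf(x, 0.0, 1.0)
f_correct = lambda x: normal_pdf(x, 4.0, 1.5)

for score in (0.0, 2.0, 4.0, 6.0):
    p = posterior_correct(score, 0.3, f_correct, f_incorrect)
    print(f"score={score:.1f}  P(correct | score)={p:.3f}")
```

Any density estimate can be plugged in for `f_correct` and `f_incorrect`; only the ratio of weighted densities matters.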

  11. Issue • There are many database search tools, each with its own scoring method. • Accordingly, the distribution of database search scores varies with the scoring method. • For parametric density estimation, a specific family of distributions must be chosen a priori, which can lead to model misspecification.

  12. Bayesian Density Estimation Model [Model equation, with one quantity labeled UNKNOWN] • We use a standard Gibbs sampler coupled with reversible jump Metropolis-Hastings steps, as elaborated in Richardson and Green (JRSS-B, 1997).

  13. Bayesian Density Estimation • Reversible jump steps introduce dimension-switching moves in the sampler, from a mixture model with n components to one with either (n-1) or (n+1) components. • This means that the dimension of the model space is itself treated as random.

  14. Idea of Reversible Jump MCMC • Allows dimension-switching moves, which are not possible with traditional Markov chain methods. • U is generated independently so that the following (simple) deterministic, one-to-one functional relationship holds:

  15. Idea of Reversible Jump MCMC • Detailed balance condition by dimension matching: enforced through the acceptance probability.
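The equations on these two slides are not part of the transcript. For orientation, the general reversible jump construction is usually written as below. The notation is an assumption and may differ from the slides: theta and theta' are the current and proposed parameter vectors, u and u' are auxiliary random variables with densities q and q', g is the one-to-one map between the two augmented spaces, and r_m, r_{m'} are the probabilities of choosing the move and its reverse.

```latex
(\theta', u') = g(\theta, u),
\qquad \dim(\theta) + \dim(u) = \dim(\theta') + \dim(u')

\alpha = \min\left\{ 1,\;
  \frac{p(\theta' \mid y)\, r_{m'}(\theta')\, q'(u')}
       {p(\theta \mid y)\, r_{m}(\theta)\, q(u)}
  \left| \frac{\partial(\theta', u')}{\partial(\theta, u)} \right|
\right\}
```

The first line is the dimension-matching requirement; accepting the proposed move with probability alpha is what keeps detailed balance across spaces of different dimension.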

  16. Reversible Jump MCMC • One should make sure that the Metropolis-Hastings steps achieve at least a minimal degree of mixing across model spaces of different dimensions. • Such mixing can be difficult, especially with large samples, because the sampler is easily trapped in a low-energy state.

  17. Reversible Jump MCMC in Mixture [Diagram: the Incorrect and Correct components each have their own RJMCMC moves, combined in one sampler]

  18. Allocation Variables in Single Iteration • Two layers of allocation: identification status (Correct/Incorrect), and subcomponent membership given identification status. [Equations: allocation variables for correct assignment and for incorrect assignment]

  19. Model in Space of Fixed Dimensions • Given the identification status, subcomponent membership, and the number of subcomponents K, for all i=1,2,…, K, the search score is distributed as: • Identifiability constraint

  20. Gibbs Sampling within Fixed Dimension • Main mixture weights (Incorrect/Correct) • Allocation to identification status (Incorrect/Correct)

  21. Gibbs Sampling within Fixed Dimension • Allocation of data point i to subcomponent j • Mixture subcomponent weights • Mean for subcomponent j • Variance for subcomponent j
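The four updates listed on this slide (allocations, weights, means, variances) can be illustrated with one Gibbs sweep for a plain normal mixture with a fixed number of components. This is a sketch under stated assumptions rather than the talk's sampler: it handles a single mixture layer only, uses conjugate priors with hypothetical hyperparameters (`delta`, `xi`, `kappa`, `alpha`, `beta`), and does not enforce an identifiability constraint.

```python
import math
import random


def gibbs_sweep(y, w, mu, var, rng, delta=1.0, xi=0.0, kappa=0.01,
                alpha=2.0, beta=1.0):
    """One Gibbs sweep for a k-component normal mixture.

    Assumed priors: w ~ Dirichlet(delta), mu_j ~ N(xi, 1/kappa),
    1/var_j ~ Gamma(shape=alpha, rate=beta).
    """
    k = len(w)
    # 1) Allocate each data point to a subcomponent.
    z = []
    for yi in y:
        logp = [math.log(w[j]) - 0.5 * math.log(var[j])
                - 0.5 * (yi - mu[j]) ** 2 / var[j] for j in range(k)]
        top = max(logp)  # subtract the max to avoid underflow
        probs = [math.exp(lp - top) for lp in logp]
        z.append(rng.choices(range(k), weights=probs)[0])
    counts = [z.count(j) for j in range(k)]
    # 2) Subcomponent weights: Dirichlet draw via normalized gamma variates.
    draws = [rng.gammavariate(delta + counts[j], 1.0) for j in range(k)]
    total = sum(draws)
    w = [d / total for d in draws]
    # 3) Mean, then 4) variance, for each subcomponent.
    new_mu, new_var = [], []
    for j in range(k):
        yj = [yi for yi, zi in zip(y, z) if zi == j]
        prec = counts[j] / var[j] + kappa
        post_mean = (sum(yj) / var[j] + kappa * xi) / prec
        m = rng.gauss(post_mean, math.sqrt(1.0 / prec))
        ss = sum((yi - m) ** 2 for yi in yj)
        # gammavariate takes (shape, scale), so the rate is inverted.
        precision = rng.gammavariate(alpha + 0.5 * counts[j],
                                     1.0 / (beta + 0.5 * ss))
        new_mu.append(m)
        new_var.append(1.0 / precision)
    return z, w, new_mu, new_var


# Simulated scores from two well-separated groups (illustrative only).
rng = random.Random(0)
y = ([rng.gauss(0.0, 1.0) for _ in range(100)]
     + [rng.gauss(5.0, 1.0) for _ in range(100)])
w, mu, var = [0.5, 0.5], [-1.0, 6.0], [1.0, 1.0]
for _ in range(50):
    z, w, mu, var = gibbs_sweep(y, w, mu, var, rng)
print(w, mu, var)
```

In the two-layer model described in the talk, a sweep like this would run within each identification status, after the status allocation itself has been sampled.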

  22. Metropolis-Hastings Steps for Varying Dimensions • After a single sweep of the Gibbs sampler, perform reversible jump steps within each identification status. [Diagram: the Incorrect component moves among K-1, K, and K+1 subcomponents; the Correct component moves among L-1, L, and L+1 subcomponents]

  23. Metropolis-Hastings Steps for Varying Dimensions • Merging two subcomponents • Proposal: matching moments

  24. Metropolis-Hastings Steps for Varying Dimensions • Dividing one component into two components • Proposal: matching moments 1) Generate 2) Set
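The moment-matching idea behind the merge and divide proposals can be sketched as follows, in the style of Richardson and Green (1997): a merged component keeps the zeroth, first, and second moments of the pair, and a divide step generates three auxiliary variables and then sets the new parameters so that merging them back returns the original component. The Beta distributions used for the auxiliary variables are an assumption about the details; the slide's formulas are not part of the transcript.

```python
import math
import random


def merge(w1, mu1, var1, w2, mu2, var2):
    """Combine two subcomponents, matching weight, mean, and second moment."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    second_moment = (w1 * (mu1 ** 2 + var1) + w2 * (mu2 ** 2 + var2)) / w
    return w, mu, second_moment - mu ** 2


def split(w, mu, var, rng):
    """Divide one subcomponent into two whose merged moments equal the input."""
    # 1) Generate the auxiliary variables.
    u1 = rng.betavariate(2, 2)
    u2 = rng.betavariate(2, 2)
    u3 = rng.betavariate(1, 1)
    # 2) Set the new parameters as deterministic functions of (w, mu, var, u).
    w1, w2 = w * u1, w * (1.0 - u1)
    sd = math.sqrt(var)
    mu1 = mu - u2 * sd * math.sqrt(w2 / w1)
    mu2 = mu + u2 * sd * math.sqrt(w1 / w2)
    var1 = u3 * (1.0 - u2 ** 2) * var * w / w1
    var2 = (1.0 - u3) * (1.0 - u2 ** 2) * var * w / w2
    return (w1, mu1, var1), (w2, mu2, var2)


rng = random.Random(1)
first, second = split(0.4, 1.0, 2.0, rng)
print(first, second)
print(merge(*first, *second))  # recovers (0.4, 1.0, 2.0) up to rounding
```

Because the divide map is one-to-one given the auxiliary variables, its Jacobian can be computed in closed form, which is what the acceptance probability for these moves requires.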

  25. Metropolis-Hastings Steps for Varying Dimensions • Proposal: create or delete empty components • Rarely happens with large-sample data

  26. Metropolis-Hastings Steps for Varying Dimensions • Acceptance probability for merging and dividing • Prior and proposal ratios • Jacobian

  27. Semi-supervised Classification • We append a decoy database to the original search database. • All peptide assignments to decoy peptides are known to be incorrect hits. • The data corresponding to these assignments contribute only to the estimation of the parameters for incorrect identifications.
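One way to picture the semi-supervised step is in the allocation of identification status: assignments to decoy peptides are fixed as incorrect, while all other assignments are sampled from their current posterior probability. The sketch below is an assumption about how this could be coded, not the author's implementation; `allocate_status` and its arguments are hypothetical names.

```python
import random


def allocate_status(p_correct, is_decoy, rng):
    """Sample identification status (True = correct) for each assignment.

    Decoy hits are known to be incorrect, so their status is fixed and they
    inform only the parameters of the incorrect component.
    """
    status = []
    for p, decoy in zip(p_correct, is_decoy):
        if decoy:
            status.append(False)
        else:
            status.append(rng.random() < p)
    return status


rng = random.Random(0)
p_correct = [0.95, 0.95, 0.10, 0.60]
is_decoy = [False, True, False, False]
print(allocate_status(p_correct, is_decoy, rng))
```

The second assignment is always labeled incorrect, even though its score-based probability is high, because it matched a decoy sequence.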

  28. Data • Mixture of 18~20 known proteins. • Around 5,000 spectra were generated and queried against a large database augmented with a reverse-sequence (decoy) database. • A set of 5,000 spectra is only a miniature example.

  29. Data • Upon calculating the posterior probability of correct identification for all peptides, we can compare them to the actual probability. • Actual probability is calculated as follows: order peptides by estimated probability, and slide a window of size K peptides within which we calculate the average of known class labels.
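The sliding-window calculation described on this slide can be sketched as follows: sort the peptides by estimated probability, then average the known class labels inside each window. The function name and the choice to report the window's mean estimated probability alongside the actual proportion are assumptions made for illustration.

```python
def sliding_window_accuracy(est_prob, is_correct, window):
    """Return (mean estimated probability, actual proportion correct) per window."""
    order = sorted(range(len(est_prob)), key=lambda i: est_prob[i])
    probs = [est_prob[i] for i in order]
    labels = [1.0 if is_correct[i] else 0.0 for i in order]
    results = []
    for start in range(len(order) - window + 1):
        stop = start + window
        mean_est = sum(probs[start:stop]) / window
        actual = sum(labels[start:stop]) / window
        results.append((mean_est, actual))
    return results


# Toy data: four peptides with known class labels.
est_prob = [0.1, 0.9, 0.2, 0.8]
is_correct = [False, True, False, True]
for mean_est, actual in sliding_window_accuracy(est_prob, is_correct, window=2):
    print(f"estimated={mean_est:.2f}  actual={actual:.2f}")
```

Windows where the actual proportion exceeds the mean estimate indicate underestimation, and the reverse indicates overestimation.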

  30. Histogram vs Fitted Density [Figure: two panels, counts vs. search score]

  31. Estimated Probability and Accuracy [Figure: panels of estimated probability vs. search score and actual probability vs. estimated probability, with underestimation and overestimation regions marked, including a panel for high-probability peptides]

  32. Mixing of Model Spaces of Varying Dimension [Figure: two panels, number of subcomponents in incorrectly identified peptides and number of subcomponents in correctly identified peptides]

  33. Comparison with EM algorithm (Gumbel / Normal distribution) [Figure: two panels, EM (Misspecified) and bayesProphet, counts vs. search score]

  34. Comparison with EM algorithm (Gumbel / Normal distribution) [Figure: two panels, EM (Misspecified) and bayesProphet, estimated probability vs. search score]

  35. Comparison with EM algorithm (Gumbel / Normal distribution) [Figure: two panels, EM (Misspecified) and bayesProphet, actual probability with underestimation and overestimation regions marked]

  36. Future Directions and Miscellaneous • Validate the robustness of the method across a wide variety of database search tools. • Achieve faster mixing across mixture models of varying dimensions. • Explore Bayesian mixture estimation using a Dirichlet process prior. • Pre-release R package bayesPeptide is available upon request.

  37. Acknowledgements • Dr. Alexey Nesvizhskii (Pathology) • Dr. Debashis Ghosh (Biostatistics)
