Hyungwon Choi hwchoi@umich Department of Biostatistics University of Michigan MSSISS-2007

Semi-supervised Bayes Classifier for Peptide Identification in Shotgun Proteomics Hyungwon Choi hwchoi@umich.edu Department of Biostatistics University of Michigan MSSISS-2007 March 16

Outline • High-throughput Peptide Identification using Mass Spectrometry (MS) • Simple Bayes Classifier constructed by Mixture Density Deconvolution • Bayesian Density Estimation Algorithm using reversible jump MCMC • Semi-supervised Classification using Augmented Decoy Peptide Database Search

Our Research Question • Given a protein mixture, say a tumor specimen, can we identify the proteins in it? • Can we automate the large-scale protein identification process? • Can we probabilistically validate such identifications? [Slide labels: Tandem Mass Spectrometry & Database Search; Statistical Model]

Peptide Identification via Mass Spectrometry • It is extremely hard to manipulate and directly sequence proteins. • Proteins are digested into small fragments of manageable size, called peptides. • We first identify these peptides, then assemble the findings back to the protein level.
>sp|P02754|LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
trypsin: cleaves after K and R
MK||CLLLALALTCGAQALIVTQTMK||GLDIQK||VAGTWYSLAMAASDISLLDAQSAPLR||VYVEELK||PTPEGDLEILLQK||WENGECAQK|| …
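The digestion rule on this slide can be sketched in a few lines of Python. This is a minimal illustration that follows the slide's simplified rule (cut after every K or R); the function name `tryptic_digest` is hypothetical, and real trypsin specificity is more nuanced (for example, cleavage before proline is usually suppressed), which the sketch deliberately ignores.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence after every K or R (simplified rule from the slide)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):  # trailing peptide that does not end in K or R
        peptides.append(protein[start:])
    return peptides


# First part of the LACB_BOVIN sequence shown on the slide.
prefix = "MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELK"
peptides = tryptic_digest(prefix)
print("||".join(peptides))
```

Joining the returned peptides with `||` reproduces the start of the digested sequence shown on the slide.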

Tandem Mass Spectrometry [Figure: workflow from protein mixture through digestion to peptides, followed by ionization (peptide ions), mass analysis, isolation, and fragmentation (peptide fragments), producing a tandem mass spectrum (MS/MS; intensity vs. m/z) that is used for peptide identification.]

Sequence vs. Tandem Mass Spectrum [Figure: tandem mass spectrum of the peptide VPTPNVSVVDLTCR; x-axis m/z (600 to 1800), y-axis Relative Abundance (0 to 100). Peaks are labeled with the C-terminal fragments CR, TCR, LTCR, DLTCR, VDLTCR, VVDLTCR, SVVDLTCR, VSVVDLTCR, NVSVVDLTCR, PNVSVVDLTCR, and TPNVSVVDLTCR, with the residue letters D L V T S V V T N P annotated along the top.]
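The peak labels in this spectrum are successive C-terminal suffixes of the peptide, each one residue longer than the last. A small sketch that enumerates this fragment ladder is given below; the function name `c_terminal_suffixes` and the `min_len` parameter are hypothetical, and no fragment masses are computed.

```python
def c_terminal_suffixes(peptide: str, min_len: int = 2) -> list[str]:
    """List C-terminal fragments from shortest to longest, excluding the full peptide."""
    return [peptide[i:] for i in range(len(peptide) - min_len, 0, -1)]


ladder = c_terminal_suffixes("VPTPNVSVVDLTCR")
for fragment in ladder:
    print(fragment)
```

Each step up the ladder adds one residue, so the mass difference between neighboring peaks identifies that residue; this is what lets the sequence be read from the spectrum.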

Automation of Identification • There are hundreds, or even thousands, of potential peptide sources in a typical protein mixture. • All tandem mass spectra generated in an experiment can be queried against an existing protein database, and each is assigned to its best match.

Validation of Peptide Identifications [Figure: every spectrum in the dataset (1, 2, …, N) is searched against the database and receives a best-matching peptide and a score. Example database peptides: ISLLDAQSAPLR, VVEELCPTPEGK, DLLLQWCWENGK, ECDVVSNTIIAEK, GDAVFVIDALNR, VPTPNVSVVDTNR. Example decoy database (reverse sequence) peptides: RLPASQADLLSI, KGEPTPCLEEVV, KGNEWCWQLLLD, KEAIITNSVVDCE. Example scores shown: 2.2, 4.6, 8.1, 4.3, 3.7, 3.2, 0.6, -0.3, -2.7, -4.1.] • A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, Anal. Chem. 74, 5383 (2002)

Mixture Deconvolution • The distribution of database search scores can be modeled with a two-component mixture: one component for correct identifications and the other for incorrect identifications. [Figure: score densities of the two components, labeled Incorrect and Correct]

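Bayes Classifier • Given the estimated mixture density, the posterior probability of correct identification for a search score x is given by Bayes' rule:
\[ P(\text{correct} \mid x) = \frac{\pi_1 f_1(x)}{\pi_0 f_0(x) + \pi_1 f_1(x)} \]
where f_1 and f_0 are the score densities for correct and incorrect identifications, and \pi_1 and \pi_0 = 1 - \pi_1 are the corresponding mixture weights.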
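As an illustration of the classifier, the sketch below evaluates the posterior probability of a correct identification from a two-component mixture. The normal component densities, their parameters, and the 0.3 mixture weight are assumptions made only for this example; they are not values from the talk, whose point is precisely that the component densities are estimated flexibly.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(x, pi_correct, f_correct, f_incorrect):
    """Bayes rule for a two-component mixture: P(correct | score x)."""
    numerator = pi_correct * f_correct(x)
    denominator = numerator + (1.0 - pi_correct) * f_incorrect(x)
    return numerator / denominator


# Hypothetical component densities (assumed for illustration only).
f_incorrect = lambda x: normal_pdf(x, 0.0, 1.0)
f_correct = lambda x: normal_pdf(x, 4.0, 1.5)

for score in (-2.0, 2.0, 6.0):
    p = posterior_correct(score, 0.3, f_correct, f_incorrect)
    print(f"score={score:5.1f}  P(correct | score)={p:.4f}")
```

When the two densities are equal at a score, the posterior reduces to the prior weight, which is a quick sanity check on any implementation.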

Issue • There are many database search tools and associated scoring methods. • Accordingly, the distribution of database search scores varies with the scoring method. • For parametric density estimation, a specific family of distributions must be specified a priori, potentially leading to model misspecification.

Bayesian Density Estimation • Model: [model equation not recovered; one quantity in it is marked UNKNOWN] • We use a standard Gibbs sampler coupled with the reversible jump Metropolis-Hastings steps elaborated in Richardson and Green (JRSS-B, 1997).

Bayesian Density Estimation • Reversible jump steps introduce dimension-switching moves in the sampler, from a mixture model with n components to one with either (n-1) or (n+1) components. • This implies that the dimension of the model space is itself treated as RANDOM.

Idea of Reversible Jump MCMC • Allows dimension-switching moves, which are not possible with traditional Markov chain methods. • A random vector U is generated independently so that a (simple) deterministic, one-to-one functional relationship holds between the current and proposed states: [equation not recovered]

Idea of Reversible Jump MCMC • The detailed balance condition is satisfied by dimension matching, enforced through the acceptance probability. [Equations not recovered]
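The equations on the two slides above did not survive extraction. For orientation, the general reversible jump construction of Green (1995) can be written as follows; the notation here (k for the current number of components, j for the move probabilities, q for the density of the auxiliary variables) is an assumption and may differ from the slides.

```latex
% Move from (k, \theta_k) to (k', \theta_{k'}): draw u \sim q(u) and set
(\theta_{k'}, u') = g(\theta_k, u),
\qquad
\dim(\theta_k) + \dim(u) = \dim(\theta_{k'}) + \dim(u')
\quad \text{(dimension matching)}.

% The move is accepted with probability
\alpha = \min\left\{ 1,\;
  \frac{p(k', \theta_{k'} \mid y)\; j(k' \to k)\; q'(u')}
       {p(k, \theta_k \mid y)\; j(k \to k')\; q(u)}
  \left| \frac{\partial(\theta_{k'}, u')}{\partial(\theta_k, u)} \right|
\right\}.
```

The Jacobian term accounts for the change of variables induced by g, and the reverse move uses the reciprocal ratio, which is what makes the pair of moves satisfy detailed balance.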

Reversible Jump MCMC • One should make sure that the Metropolis-Hastings steps achieve at least a minimal degree of mixing across model spaces of different dimensions. • Such mixing can be difficult, especially with large-sample data, because the sampler is easily anchored in a low-energy state.

Reversible Jump MCMC in Mixture [Figure: the Incorrect and Correct components are each estimated by RJMCMC, and both are combined in one sampler.]

Allocation Variables in Single Iteration • Two layers of allocation: identification status (Correct/Incorrect), and subcomponent membership given identification status. [Equations not recovered: allocation variables for correct assignment and for incorrect assignment]

Model in Space of Fixed Dimensions • Given the identification status, subcomponent membership, and the number of subcomponents K, for all i=1,2,…, K, the search score is distributed as: [equation not recovered] • Identifiability Constraint: [equation not recovered]

Gibbs Sampling within Fixed Dimension • Main mixture weights (Incorrect/Correct) • Allocation to identification status (Incorrect/Correct)

Gibbs Sampling within Fixed Dimension • Allocation of data i to subcomponent j • Mixture subcomponent weights • Mean for subcomponent j • Variance for subcomponent j
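The update equations on these slides were lost, so the following is only a sketch of what one fixed-dimension Gibbs sweep could look like for a single-layer normal mixture. The conjugate priors (Dirichlet weights, normal means, inverse-gamma variances), the hyperparameter names `delta`, `xi`, `kappa`, `a`, `b`, and the relabeling by sorted means are all assumptions chosen for illustration; the talk's sampler has two allocation layers and follows Richardson and Green (1997), whose prior is hierarchical.

```python
import numpy as np


def gibbs_sweep(x, w, mu, var, rng, delta=1.0, xi=0.0, kappa=0.01, a=2.0, b=1.0):
    """One Gibbs sweep for a K-component normal mixture with K held fixed."""
    K = len(w)
    # 1) Allocate each observation to a subcomponent.
    log_p = (np.log(w) - 0.5 * np.log(2.0 * np.pi * var)
             - 0.5 * (x[:, None] - mu) ** 2 / var)
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    u = rng.random(len(x))
    z = np.minimum((np.cumsum(p, axis=1) < u[:, None]).sum(axis=1), K - 1)
    n = np.bincount(z, minlength=K)
    # 2) Subcomponent weights: Dirichlet full conditional.
    w = rng.dirichlet(delta + n)
    # 3) Subcomponent means: normal full conditional (prior N(xi, 1/kappa)).
    sums = np.bincount(z, weights=x, minlength=K)
    precision = n / var + kappa
    mu = rng.normal((sums / var + kappa * xi) / precision, np.sqrt(1.0 / precision))
    # 4) Subcomponent variances: inverse-gamma full conditional (prior IG(a, b)).
    ss = np.bincount(z, weights=(x - mu[z]) ** 2, minlength=K)
    var = 1.0 / rng.gamma(a + 0.5 * n, 1.0 / (b + 0.5 * ss))
    # Assumed identifiability device: relabel so that the means are increasing.
    order = np.argsort(mu)
    z = np.argsort(order)[z]
    return z, w[order], mu[order], var[order]


rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 200)])  # toy data
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    z, w, mu, var = gibbs_sweep(x, w, mu, var, rng)
print(w, mu, var)
```

In the two-layer model of the talk, the same pattern would be applied once for identification status and once for subcomponent membership within each status.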

Metropolis-Hastings Steps for Varying Dimensions • After a single sweep of the Gibbs sampler, perform reversible jump steps within each identification status. [Figure: the Incorrect mixture moves among K-1, K, and K+1 components; the Correct mixture moves among L-1, L, and L+1 components.]

Metropolis-Hastings Steps for Varying Dimensions • Merging two subcomponents • Proposal: Matching Moments [equations not recovered]

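Metropolis-Hastings Steps for Varying Dimensions • Dividing one component into two components • Proposal: Matching Moments 1) Generate [auxiliary variables; equation not recovered] 2) Set [new component parameters; equation not recovered]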
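The merge and divide proposals can be sketched as follows, assuming the slides follow the moment-matching construction of Richardson and Green (1997): the merged component preserves the zeroth, first, and second moments of the pair, and the split draws three auxiliary variables from Beta(2,2), Beta(2,2), and Beta(1,1). The function names and the `(weight, mean, variance)` tuple layout are hypothetical.

```python
import math
import random


def merge(c1, c2):
    """Combine two (weight, mean, variance) components, matching moments 0, 1, 2."""
    (w1, m1, v1), (w2, m2, v2) = c1, c2
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (m1 ** 2 + v1) + w2 * (m2 ** 2 + v2)) / w - m ** 2
    return w, m, v


def split(c, u1, u2, u3):
    """Divide one component into two so that merging them recovers the original."""
    w, m, v = c
    w1, w2 = w * u1, w * (1.0 - u1)
    sd = math.sqrt(v)
    m1 = m - u2 * sd * math.sqrt(w2 / w1)
    m2 = m + u2 * sd * math.sqrt(w1 / w2)
    v1 = u3 * (1.0 - u2 ** 2) * v * w / w1
    v2 = (1.0 - u3) * (1.0 - u2 ** 2) * v * w / w2
    return (w1, m1, v1), (w2, m2, v2)


rng = random.Random(1)
# 1) Generate the auxiliary variables.
u1, u2, u3 = rng.betavariate(2, 2), rng.betavariate(2, 2), rng.betavariate(1, 1)
# 2) Set the new component parameters.
parent = (0.4, 3.0, 2.0)  # hypothetical component, for illustration only
left, right = split(parent, u1, u2, u3)
print(left, right)
print(merge(left, right))
```

Because the split is the exact inverse of the merge for given auxiliary variables, the pair forms a reversible move; the proposal is then accepted or rejected with the reversible jump acceptance probability, which is not computed here.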

Metropolis-Hastings Steps for Varying Dimensions • Proposal: create or delete empty components • Rarely happens with large-sample data

Metropolis-Hastings Steps for Varying Dimensions • Acceptance probability for merging and dividing [equation not recovered; its labeled terms are the prior and proposal ratios and the Jacobian]

Semi-supervised Classification • We append a decoy database to the original search database. • All peptide assignments to decoy peptides are known to be incorrect hits. • The data corresponding to these assignments contribute only to the estimation of the parameters for incorrect identification.
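A minimal sketch of the decoy idea is shown below: decoy entries are built by reversing each sequence, appended to the target database, and any spectrum matched to a decoy has its status fixed to incorrect. The `REV_` prefix and all function names are hypothetical conventions for this example, and the clamping step only illustrates the labeling, not the full sampler.

```python
def make_decoys(proteins: dict[str, str], prefix: str = "REV_") -> dict[str, str]:
    """Build a reverse-sequence decoy entry for every target entry."""
    return {prefix + name: seq[::-1] for name, seq in proteins.items()}


def augment(proteins: dict[str, str]) -> dict[str, str]:
    """Append the decoy database to the original search database."""
    return {**proteins, **make_decoys(proteins)}


def clamp_decoy_hits(p_correct: list[float], is_decoy: list[bool]) -> list[float]:
    """Decoy hits are known to be incorrect, so their probability of being correct is 0."""
    return [0.0 if decoy else p for p, decoy in zip(p_correct, is_decoy)]


targets = {"pep1": "ISLLDAQSAPLR", "pep2": "VVEELCPTPEGK"}
database = augment(targets)
print(database)
print(clamp_decoy_hits([0.9, 0.4, 0.7], [False, True, False]))
```

In a sampler, the same labeling would mean that allocations for decoy hits are never resampled and always count toward the incorrect component.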

Data • Mixture of 18 to 20 known proteins. • Around 5,000 spectra were generated and queried against a large database augmented with a reverse-sequence (decoy) database. • 5,000 spectra is only a miniature example compared with a typical experiment.

Data • After calculating the posterior probability of correct identification for all peptides, we can compare these estimates to the actual probability. • The actual probability is calculated as follows: order the peptides by estimated probability, slide a window of K peptides along this ordering, and within each window take the average of the known class labels.
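The sliding-window calculation described above can be sketched with NumPy. The function name `actual_probability` is hypothetical, and pairing each window's label average with the window's mean estimated probability is an assumption made so that the two quantities can be plotted against each other.

```python
import numpy as np


def actual_probability(estimated, is_correct, window):
    """Order by estimated probability, then average known labels in a sliding window."""
    order = np.argsort(estimated)
    est_sorted = np.asarray(estimated, dtype=float)[order]
    labels = np.asarray(is_correct, dtype=float)[order]
    kernel = np.ones(window) / window
    actual = np.convolve(labels, kernel, mode="valid")        # fraction correct per window
    est_mean = np.convolve(est_sorted, kernel, mode="valid")  # mean estimate per window
    return est_mean, actual


# Toy input (made up for illustration): four peptides, window of K = 2.
est_mean, actual = actual_probability([0.1, 0.9, 0.2, 0.8], [0, 1, 0, 1], window=2)
print(est_mean)
print(actual)
```

Windows where the actual value exceeds the mean estimate indicate underestimation, and the reverse indicates overestimation.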

Histogram vs Fitted Density [Figure: two panels showing score histograms with fitted densities; x-axis Search Score, y-axis Counts.]

Estimated Probability and Accuracy [Figure: estimated probability (0.0 to 1.0) plotted against search score, and actual probability (0.0 to 1.0) plotted against estimated probability, with regions marked underestimation and overestimation; a further panel covers high-probability peptides.]

Mixing of Model Spaces of Varying Dimension [Figure: two panels showing the number of subcomponents for incorrectly identified peptides and for correctly identified peptides.]

Comparison with EM algorithm: Gumbel / Normal distribution [Figure: score histograms with fitted densities for EM (Misspecified) and bayesProphet; x-axis Search Score, y-axis Counts.]

Comparison with EM algorithm: Gumbel / Normal distribution [Figure: estimated probability (0.0 to 1.0) vs. search score for EM (Misspecified) and bayesProphet.]

Comparison with EM algorithm: Gumbel / Normal distribution [Figure: actual probability (0.0 to 1.0) for EM (Misspecified) and bayesProphet, with regions marked underestimation and overestimation; x-axis labeled Search Score.]

Future Directions and Miscellaneous • Validate the robustness of the method across a wide variety of database search tools. • Achieve faster mixing across mixture models of varying dimensions. • Explore Bayesian mixture estimation using a Dirichlet process prior. • Pre-release R package bayesPeptide is available upon request.

Acknowledgements Dr. Alexey Nesvizhskii (Pathology) Dr. Debashis Ghosh (Biostatistics)
