Outline

PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Outline. Introduction Protein Subcellular Localization Document Classification PSLDoc Term and its weighting scheme Feature Reduction SVM learning

hinda

Presentation Transcript


  1. PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis

  2. Outline • Introduction • Protein Subcellular Localization • Document Classification • PSLDoc • Term and its weighting scheme • Feature Reduction • SVM learning • Evaluation and Results • Discussion

  4. Protein Subcellular Localization

  6. Document Classification

  7. Vector Space Model • Salton's Vector Space Model • Represent each document by a high-dimensional vector in the space of words [Figure: documents mapped to vectors; portrait of Gerard Salton]

  8. Vectors in Term Space

  9. Term-Document Matrix • The term-document matrix is an m × n matrix, where m is the number of terms and n is the number of documents [Figure: matrix with one row per term and one column per document]

  10. Term Weighting by TFIDF • The term frequency (tf) of term $t_i$ in a given document $d$ measures the importance of the term within that document: $tf_i = n_i / \sum_k n_k$, where $n_i$ is the number of occurrences of the considered term and the denominator is the number of occurrences of all terms in $d$ • The inverse document frequency (idf) is obtained by dividing the number of all documents by the number of documents containing the term $t_i$, and then taking the logarithm of that quotient: $idf_i = \log\left(|D| / |\{d : t_i \in d\}|\right)$ • $|D|$: total number of documents in the corpus • $|\{d : t_i \in d\}|$: number of documents in which the term $t_i$ appears • $tfidf = tf \times idf$
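
The TFIDF weighting on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `tfidf`, the toy documents, and the choice of the natural logarithm for idf are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)          # n_i for every term in this document
        total = sum(counts.values())   # occurrences of all terms in this document
        weights.append({
            # tf * idf; natural log is an assumption, the slide gives no base
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return weights

# Hypothetical toy corpus whose "words" are gapped-dipeptide terms.
docs = [["A0A", "A1C", "A0A"], ["A1C", "C2D"], ["A1C", "A0A", "D0D"]]
w = tfidf(docs)
```

A term that occurs in every document (here `A1C`) gets idf = log(1) = 0, so it carries no weight, which is the intended effect of the idf factor.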

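  11. Predicted by 1 Nearest-Neighbor based on Cosine Similarity • Similarity between document $d$ and query $q$: $sim(d, q) = \cos\theta = \frac{d \cdot q}{\|d\| \, \|q\|}$ • The query is assigned the label of its most similar training document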
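
As a rough sketch of 1-nearest-neighbor prediction with cosine similarity, the snippet below stores each document as a sparse `{term: weight}` dictionary. The names `cosine` and `predict_1nn` and the toy training set are hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)

def predict_1nn(query, training):
    """training: list of (vector, label). Returns the label of the most similar vector."""
    _, best_label = max(training, key=lambda item: cosine(query, item[0]))
    return best_label

# Hypothetical training set with two labeled "documents" (proteins).
training = [({"A0A": 1.0, "A1C": 0.2}, "CP"), ({"C2D": 1.0}, "IM")]
label = predict_1nn({"A0A": 2.0, "C2D": 0.1}, training)
```

Because cosine similarity ignores vector length, a long and a short protein with the same term proportions are treated as equally similar to the query.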

  12. Feature Reduction • Choose the best axes: those that show the most variation in the data => Found by linear algebra: Singular Value Decomposition (SVD) [Figure: true plot in k dimensions vs. reduced-dimensionality plot]

  13. Singular Value Decomposition • Term-document matrix • Reduced feature size = 40 features

  15. The Terms of Proteins - Gapped-dipeptides* • Let XdZ denote the amino acid coupling pattern of amino acid types X and Z that are separated by d amino acids • If d ranges from 0 to 20, there are 8400 (= 20 × 20 × 21) features for a vector *Liang HK, Huang CM, Ko MT, Hwang JK. The Amino Acid-Coupling Patterns in Thermophilic Proteins. Proteins: Structure, Function and Bioinformatics (2005), 59, 58-63.
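
The feature count on this slide can be checked by enumerating the terms directly. The sketch below assumes the gap d runs from 0 up to a maximum value, which is what the factor 21 in 20 × 20 × 21 implies; the function name `gapped_dipeptides` is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types

def gapped_dipeptides(max_gap):
    """All terms XdZ with X, Z amino acid types and gap d = 0 .. max_gap."""
    return [
        f"{x}{d}{z}"
        for d in range(max_gap + 1)
        for x, z in product(AMINO_ACIDS, repeat=2)
    ]

terms = gapped_dipeptides(20)
print(len(terms))  # 8400 = 20 * 20 * 21
```

The generated names follow the XdZ notation of the slides, for example `M2D` for M and D separated by two residues.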

  16. Term Weighting Scheme – TF Position Specific Score Matrix (1/2) • Position Specific Score Matrix (PSSM) : A PSSM is constructed from a multiple alignment of the highest scoring hits in the BLAST search

  17. Term Weighting Scheme – TF Position Specific Score Matrix (2/2) • The weight of XdZ in protein P: $W(XdZ, P) = \sum_{i=1}^{L-(d+1)} f(i, X) \times f(i+d+1, Z)$, where L is the length of P and f(i,Y) denotes the normalized value of the PSSM entry at the ith row and the column corresponding to amino acid type Y • An example (L = 81, d = 2): W(M2D,P) = f(1,M) × f(4,D) + f(2,M) × f(5,D) + … + f(78,M) × f(81,D) = 0.99995×0.04743 + 0.11920×0.00247 + … + 0.00669×0.26894
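
The worked example above pairs row i with row i + d + 1 of the PSSM and sums the products, which can be sketched as follows. The slide does not say how PSSM entries are normalized; the logistic function used here is an assumption, chosen because the example values are consistent with it (for instance 1 / (1 + e^-10) ≈ 0.99995). The function names and the toy PSSM are hypothetical.

```python
import math

def normalize(score):
    # ASSUMPTION: logistic normalization of a raw PSSM score to (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

def gapped_dipeptide_weight(pssm, x, d, z):
    """Weight of term XdZ for one protein.

    pssm: list with one dict per residue position, mapping amino acid -> raw score.
    Pairs position i with position i + d + 1 (0-based here, 1-based on the slide).
    """
    length = len(pssm)
    return sum(
        normalize(pssm[i][x]) * normalize(pssm[i + d + 1][z])
        for i in range(length - d - 1)
    )

# Hypothetical 5-residue PSSM restricted to the two columns used below.
toy_pssm = [
    {"M": 10, "D": 0},
    {"M": -2, "D": 0},
    {"M": 0, "D": 0},
    {"M": 0, "D": -3},
    {"M": 0, "D": -6},
]
weight = gapped_dipeptide_weight(toy_pssm, "M", 2, "D")
```

With a length-5 protein and d = 2 only two position pairs exist, (1, 4) and (2, 5) in the slide's 1-based numbering, so the toy call mirrors the first two terms of the M2D example.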

  19. Feature Reduction - Probabilistic Latent Semantic Analysis (1/3)

  20. Feature Reduction - Probabilistic Latent Semantic Analysis (2/3) • A joint probability between a term w and a document d can be modeled as: $P(d, w) = P(d) \sum_{z} P(w \mid z) \, P(z \mid d)$ • z: latent variable ("small" number of states) • P(w|z): concept expression probabilities • P(z|d): document-specific mixing proportions • The parameters can be estimated by maximizing the likelihood with the EM algorithm.
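
A compact sketch of EM for PLSA is given below, in pure Python for readability rather than speed. It estimates the concept expression probabilities P(w|z) and the document-specific mixing proportions P(z|d) from a document-by-term weight table. The function names, random initialization, fixed iteration count, and toy data are all assumptions; this is not the implementation used for PSLDoc.

```python
import random

def _normalize(values):
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)  # fall back to uniform
    return [v / total for v in values]

def plsa(counts, n_topics, n_iter=50, seed=0):
    """counts[d][w]: weight of term w in document d. Returns (P(w|z), P(z|d))."""
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])
    p_w_z = [_normalize([rng.random() + 1e-3 for _ in range(n_terms)])
             for _ in range(n_topics)]
    p_z_d = [_normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_terms for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_terms):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(w|z) * P(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                norm = sum(joint)
                if norm == 0:
                    continue
                for z in range(n_topics):
                    expected = counts[d][w] * joint[z] / norm
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        # M-step: renormalize the expected counts into probability distributions.
        p_w_z = [_normalize(row) for row in new_w_z]
        p_z_d = [_normalize(row) for row in new_z_d]
    return p_w_z, p_z_d

# Hypothetical 4-document, 4-term table with two apparent groups of documents.
toy_counts = [[4, 3, 0, 0], [5, 2, 0, 1], [0, 0, 6, 3], [0, 1, 4, 5]]
p_w_z, p_z_d = plsa(toy_counts, n_topics=2, n_iter=30)
```

Each row of `p_z_d` is a reduced feature vector for one document: its length equals the number of topics rather than the number of terms.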

  21. Feature Reduction - Probabilistic Latent Semantic Analysis (3/3)

  23. Classifier – Support Vector Machines • Support Vector Machines (SVM) • LIBSVM software* • Five one-vs-rest SVM classifiers corresponding to the five localization sites: SVM_CP (CP vs. non-CP), SVM_IM (IM vs. non-IM), SVM_PP (PP vs. non-PP), SVM_OM (OM vs. non-OM), SVM_EC (EC vs. non-EC) • Kernel: Radial Basis Function (RBF) • Parameter selection • c (cost) and γ (gamma) are optimized • five-fold cross-validation *Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

  24. System Architecture • PSLDoc: Protein Subcellular Localization prediction by Document classification

  26. Data set (1/3) • Gram-negative bacteria: PS1444 • ePSORTdb version 2.0 Gram-negative • 1444 proteins • PS1444 is split by pairwise sequence identity: PSHigh783 (783 proteins, identity > 30%) and PSLow661 (the remaining 661 proteins)

  27. Data set (2/3) • Eukaryotic proteins, 7579 proteins, 12 localization sites Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003;19(13):1656-1663.

  28. Data set (3/3) • Human data set, 2197 proteins, 9 localization sites Scott MS, Thomas DY, Hallett MT. Predicting subcellular localization via protein motif co-occurrence. Genome Res 2004;14(10A):1957-1966.

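  29. Evaluation • Accuracy (Acc): $Acc = \sum_{i=1}^{l} TP_i / \sum_{i=1}^{l} N_i$ • l = 5 is the total number of localization sites • $N_i$ is the number of proteins in localization site i • $TP_i$ is the number of proteins in site i that are predicted correctly • Matthews correlation coefficient (MCC): $MCC_i = \frac{TP_i \times TN_i - FP_i \times FN_i}{\sqrt{(TP_i + FN_i)(TP_i + FP_i)(TN_i + FP_i)(TN_i + FN_i)}}$, where $TP_i$, $TN_i$, $FP_i$ and $FN_i$ are the numbers of true positives, true negatives, false positives and false negatives for site i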
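
The two measures named on this slide can be computed as sketched below. The formulas did not survive in the transcript, so this sketch assumes the usual definitions: overall accuracy as the fraction of proteins whose site is predicted correctly, and the standard per-site Matthews correlation coefficient. Function names and toy labels are hypothetical.

```python
import math

def overall_accuracy(true_sites, predicted_sites):
    """Fraction of proteins whose localization site is predicted correctly."""
    correct = sum(t == p for t, p in zip(true_sites, predicted_sites))
    return correct / len(true_sites)

def mcc(true_sites, predicted_sites, site):
    """Matthews correlation coefficient for one site, treated as one-vs-rest."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_sites, predicted_sites):
        if p == site and t == site:
            tp += 1
        elif p == site:
            fp += 1
        elif t == site:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when the coefficient is undefined (zero denominator).
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical labels for five proteins.
true_sites = ["CP", "CP", "IM", "IM", "EC"]
predicted = ["CP", "IM", "IM", "IM", "EC"]
acc = overall_accuracy(true_sites, predicted)
mcc_cp = mcc(true_sites, predicted, "CP")
```

Unlike accuracy, MCC also penalizes false positives for a site, so it is less flattering to a classifier that over-predicts a large class.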

  30. Simple Prediction Methods (1/2) • 1NN_TFIDF : 1NN + gapped-dipeptides + TFIDF • 1NN_TFPSSM : 1NN + gapped-dipeptides + PSSM

  31. Simple Prediction Methods (2/2) • 1NN_PSI-BLASTps, 1NN_PSI-BLASTnr • 1NN_ClustalW [Figure: workflow diagram linking the query protein, PSI-BLAST, the training database, the NCBI nr database, the PSSM, ClustalW, and the retrieved similar protein]

  32. The comparison of 1NN_TFIDF and 1NN_TFPSSM on the PSHigh783 and PSLow661 data sets.

  33. Comparison of 1NN_TFPSSM, 1NN_ClustalW, 1NN_PSI-BLASTps and 1NN_PSI-BLASTnr

  34. Evaluation and Results *HYBRID combines the results of CELLO II and ALIGN.

  35. Evaluation and Results

  36. Prediction Confidence • The confidence of the final predicted class • Prediction Confidence = the largest probability - the second largest probability • Example: if SVM_CP gives the largest probability and SVM_OM the second largest, Prediction Confidence = SVM_CP - SVM_OM
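
The confidence rule on this slide, together with a threshold below which no prediction is made, can be sketched as follows. The function names, the probability values, and the threshold of 0.3 are hypothetical; the slides do not give the threshold values used.

```python
def prediction_confidence(probabilities):
    """probabilities: {site: probability}. Returns (predicted site, confidence)."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best_site, best), (_, second) = ranked[0], ranked[1]
    # Confidence = largest probability minus second largest probability.
    return best_site, best - second

def predict_with_threshold(probabilities, threshold):
    """Return the predicted site, or None when the confidence is below the threshold."""
    site, confidence = prediction_confidence(probabilities)
    return site if confidence >= threshold else None

# Hypothetical outputs of the five one-vs-rest classifiers for one protein.
probs = {"CP": 0.70, "IM": 0.05, "PP": 0.05, "OM": 0.15, "EC": 0.05}
site, confidence = prediction_confidence(probs)
```

Raising the threshold trades coverage for precision: fewer proteins receive a prediction, but the ones that do are those where one classifier clearly dominates.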

  37. Prediction Threshold (1/3)

  38. Prediction Threshold (2/3)

  39. Prediction Threshold (3/3) *The threshold is set such that the coverage is similar with PSLT.

  41. Gapped-dipeptide signature • Number of topics = 80

  42. Gapped-dipeptide signature • The site-topic preference of the topic z for a localization site l = average { P(z|d) | d (a protein) belongs to class l } [Figure: two panels labeled Acc. = 90 and Acc. = 89]

  43. Gapped-dipeptide signature • Distance = 13 (number of gapped-dipeptides = 5,600)

  44. Gapped-dipeptide signature • For each localization site, ten preferred topics are selected according to site-preference confidence (= the largest site-topic preference - the second largest site-topic preference) • For each topic, the five most frequent gapped-dipeptides are selected.
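
The selection of preferred topics can be sketched from the two definitions in these slides: the site-topic preference is the average of P(z|d) over the proteins d of a site, and the site-preference confidence of a topic is the gap between its largest and second largest site-topic preference. The function names and toy values are hypothetical, and assigning each topic to the site with the largest preference is an assumption about how the ranking is applied.

```python
from collections import defaultdict

def site_topic_preference(p_z_d, sites):
    """p_z_d[d][z] = P(z|d); sites[d] = localization site of protein d.

    Returns {site: [average P(z|d) over that site's proteins, for each topic z]}.
    """
    sums, counts = {}, defaultdict(int)
    for dist, site in zip(p_z_d, sites):
        if site not in sums:
            sums[site] = [0.0] * len(dist)
        for z, p in enumerate(dist):
            sums[site][z] += p
        counts[site] += 1
    return {site: [s / counts[site] for s in totals] for site, totals in sums.items()}

def preferred_topics(preference, top_n=10):
    """For each site, the topics it prefers most, ranked by site-preference confidence.

    Requires at least two sites.
    """
    n_topics = len(next(iter(preference.values())))
    per_site = defaultdict(list)
    for z in range(n_topics):
        ranked = sorted(((pref[z], site) for site, pref in preference.items()),
                        reverse=True)
        (best, best_site), (second, _) = ranked[0], ranked[1]
        per_site[best_site].append((best - second, z))  # (confidence, topic)
    return {site: [z for _, z in sorted(items, reverse=True)[:top_n]]
            for site, items in per_site.items()}

# Hypothetical topic distributions for three proteins and two topics.
toy_p_z_d = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
toy_sites = ["CP", "CP", "IM"]
pref = site_topic_preference(toy_p_z_d, toy_sites)
topics = preferred_topics(pref, top_n=10)
```

The most frequent gapped-dipeptides of each selected topic would then be read off from P(w|z), which this sketch leaves out.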

  45. Gapped-dipeptide signature

  46. Gapped-dipeptide signatures reflecting motifs relevant to protein localization sites • In integral membrane proteins, helix-helix interactions are stabilized by aromatic residues. Specifically, the aromatic motif (WXXW, or W2W) is involved in the dimerization of transmembrane domains by π-π interactions. • In the outer membrane class, the C-terminal signature sequence is recognized by the assembly factor OMP85, which regulates the insertion and integration of OM proteins in the outer membrane of Gram-negative bacteria. The C-terminal signature sequence contains a Phe (F) at the C-terminal position, preceded by a strong preference for a basic amino acid (K, R). => R0F

  47. The amino acid compositions of single residues and gapped-dipeptide signatures for each localization site

  48. The grouped amino acid compositions of single residues and gapped-dipeptide signature Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR), and A (aromatic: FYW)

  49. Gapped-dipeptide signatures and their amino acid compositions for each localization site Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR), and A (aromatic: FYW)

  50. Gapped-dipeptide signatures and their amino acid compositions for each localization site • IM has a high percentage of non-polar amino acids (60%) and no charged amino acids (0%). • This reflects the physico-chemical properties of the lipid bilayer: non-polar amino acids are favored in the transmembrane domains of IM proteins. • Charged amino acids are disfavored because of the energetic penalty incurred in the assembly of IM proteins. • The CP and EC classes have a high percentage of charged and polar amino acids, respectively. • The role of charged amino acids in the cytoplasm is probably related to pH homeostasis, in which they act as buffers, whereas secreted proteins in the EC class may require more polar amino acids to promote interactions in the solvent environment.
