  1. G54DMT – Data Mining Techniques and Applications http://www.cs.nott.ac.uk/~jqb/G54DMT Dr. Jaume Bacardit jqb@cs.nott.ac.uk Topic 4: Applications Lecture 2: Protein Structure Prediction. Some material taken from Arthur Lesk, Introduction to Bioinformatics, 2nd edition, Oxford University Press, 2005, and from Livingstone, C.D. & Barton, G.J.: "Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation", Computer Applications in the Biosciences 9 (1993) 745-756.

  2. Outline • Brief introduction to protein structure • Motivation and definition of PSP • PSP: a family of problems • Data mining of proteins' structural aspects • Dimensionality reduction for protein datasets • Summary

  3. Protein Structure: Introduction • Proteins are molecules of primary importance for the functioning of life • Structural proteins (collagen, nails, hair, etc.) • Enzymes • Transmembrane proteins • Proteins are polypeptide chains, constructed by joining amino acids, via peptide bonds, in a linear way • The chain of amino acids, however, folds to create very complex 3D structures • There is a general consensus that the end state of the folding process depends on the amino acid composition of the chain

  4. Protein Structure

  5. Amino Acids

  6. Protein Structure: Introduction • Different amino acids have different properties • These properties will affect the protein structure and function • Hydrophobicity, for instance, is the main driving force (but not the only one) of the folding process

  7. Protein Structure: Hierarchical nature of protein structure • Primary structure = sequence of amino acids, e.g. MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTLPFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQREKIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKKHLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYLIKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE • Secondary structure, arising from local interactions • Tertiary structure, arising from global interactions

  8. Motivation for PSP • The function of a protein depends greatly on its structure • The structure that a protein adopts is vital to its chemistry • Its structure determines which of its amino acids are exposed to carry out the protein's function • Its structure also determines what substrates it can react with • However, the structure of a protein is very difficult to determine experimentally, and in some cases almost impossible

  9. Protein Structure Prediction • That is why we have to predict it • PSP aims to predict the 3D structure of a protein based on its primary sequence

  10. Impact of PSP • PSP is an open problem: the 3D structure depends on many variables • It has been one of the main holy grails of computational biology for many decades • The potential impacts of having better protein structure models are countless • Genetic therapy • Synthesis of drugs for incurable diseases • Improved crops • Environmental remediation

  11. Prediction types of PSP • There are several kinds of prediction problems within the scope of PSP • The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least of the backbone) based on its primary sequence • There are also many structural properties of individual residues within a protein that can be predicted, for instance: • The secondary structure state of the residue • Whether a residue is buried in the core of the protein or exposed on the surface • Accurate predictions of these sub-problems can simplify the general 3D PSP problem

  12. Prediction types of PSP • There is an important distinction between these two classes of prediction • 3D PSP is generally treated as an optimisation problem • The prediction of structural aspects of protein residues is generally treated as a machine learning problem

  13. DATA MINING PROTEINS' STRUCTURAL ASPECTS

  14. Prediction of structural aspects of protein residues • Many of these features are due to local interactions between an amino acid and its immediate neighbours • Can they be predicted using information from the closest neighbours in the chain? • In this simplified example, to predict the SS state of residue i we would use information from residues i-1, i and i+1, that is, a window of ±1 residues around the target: (Ri-1, Ri, Ri+1) → SSi; (Ri, Ri+1, Ri+2) → SSi+1; (Ri+1, Ri+2, Ri+3) → SSi+2
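To make the windowing concrete, here is a minimal Python sketch of window-based instance generation. This is not the lecture's code; the function name and the 'X' padding for chain ends (which matches the ARFF example on the next slide) are illustrative choices.

```python
def make_windows(sequence, labels, half_width=1):
    """Turn a protein sequence + per-residue labels into fixed-length
    windowed instances: (R[i-w], ..., R[i], ..., R[i+w]) -> label[i].
    Positions beyond the chain ends are padded with 'X'."""
    padded = "X" * half_width + sequence + "X" * half_width
    instances = []
    for i, label in enumerate(labels):
        window = list(padded[i : i + 2 * half_width + 1])
        instances.append((window, label))
    return instances

# Toy example: predict the SS state of each residue from a +/-1 window
seq = "AEIKH"
ss = ["C", "C", "H", "H", "C"]
for window, label in make_windows(seq, ss, half_width=1):
    print(window, "->", label)
# ['X', 'A', 'E'] -> C
# ['A', 'E', 'I'] -> C  ... and so on
```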

  15. ARFF file for a simple PSP dataset

@relation AA+CN_Q2
@attribute AA_-4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_-3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_-2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_-1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
@attribute AA_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute class {0,1}
@data
X,X,X,X,A,E,I,K,H,0
X,X,X,A,E,I,K,H,Y,0
X,X,A,E,I,K,H,Y,Q,0
X,A,E,I,K,H,Y,Q,F,0
A,E,I,K,H,Y,Q,F,N,0
E,I,K,H,Y,Q,F,N,V,0
I,K,H,Y,Q,F,N,V,V,0
K,H,Y,Q,F,N,V,V,M,1
H,Y,Q,F,N,V,V,M,T,0
Y,Q,F,N,V,V,M,T,C,1

  16. What information do we include for each residue? • Early prediction methods used just the primary sequence, i.e. the AA types of the residues in the window • However, the primary sequence carries a limited amount of information • It does not contain any evolutionary information: it does not say which residues are conserved and which are not • Where can we obtain this information? • From Position-Specific Scoring Matrices, which are a product of a Multiple Sequence Alignment

  17. Position-Specific Scoring Matrices (PSSM) • For each residue in the query sequence, compute the distribution of amino acids of the corresponding residues in all aligned sequences (discarding those too similar to the query) • These distributions tell us which mutations are likely and which are less likely for each residue in the query sequence • In essence, it is similar to a substitution matrix, but tailored to the sequence that we are aligning • A PSSM profile will also tell us which residues are more conserved and which residues are more subject to insertions or deletions
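As an illustration, a toy frequency-based PSSM can be computed as below. This is a simplified sketch: real PSSM generators such as PSI-BLAST additionally apply sequence weighting and more elaborate pseudocounts; the pseudocount value and uniform background used here are arbitrary choices.

```python
import math
from collections import Counter

AA = "ARNDCQEGHILKMFPSTWYV"
BACKGROUND = {a: 1.0 / 20 for a in AA}  # uniform background, for simplicity

def simple_pssm(alignment):
    """Column-wise log-odds scores from an MSA (a list of equal-length,
    gap-free sequences)."""
    length = len(alignment[0])
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        total = sum(counts.values())
        # Small pseudocount so unseen amino acids get a finite score
        scores = {a: math.log2(((counts[a] + 0.05) / (total + 1.0))
                               / BACKGROUND[a]) for a in AA}
        pssm.append(scores)
    return pssm

msa = ["AMEKV", "AMEKV", "SMEKI", "AMDKV"]  # toy alignment
profile = simple_pssm(msa)
# Amino acids scoring positively at the first position (A and S here)
print({a: round(s, 1) for a, s in profile[0].items() if s > 0})
```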

  18. PSSM for the first 10 residues of 1n7lA

      A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
  A:  4 -1 -2 -2  0 -1 -1  0 -2 -1 -2 -1 -1 -2 -1  1  0 -3 -2  0
  M: -1 -2 -3 -4 -2 -1 -2 -3 -2  1  2 -2  7  0 -3 -2 -1 -2 -1  1
  E: -1  0  0  2 -4  2  6 -2  0 -4 -3  1 -2 -4 -1  0 -1 -3 -2 -3
  K: -1  2  0 -1 -4  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3
  V:  0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  5
  Q: -1  1  0  0 -3  6  2 -2  0 -3 -3  1 -1 -4 -2  0 -1 -2 -2 -3
  Y: -2 -1 -1 -3 -3 -1 -1 -3  6 -2 -2 -2 -1  2 -3 -2 -2  1  7 -2
  L: -2 -3 -4 -4 -2 -3 -3 -4 -3  2  5 -3  2  0 -3 -3 -1 -2 -1  1
  T:  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
  R: -2  6 -1 -2 -4  1  0 -3  0 -3 -3  2 -2 -3 -2 -1 -1 -3 -2 -3

  19. Secondary Structure Prediction • The most usual formulation is to predict whether a residue belongs to an α helix, to a β sheet, or is in coil state • Several programs can determine the actual SS state of a protein from a PDB file; the most common of them is DSSP • Typically, a window of ±7 amino acids (15 in total) is used, which means 300 attributes when using PSSM values (15 positions × 20 values) • A dataset with 1000 proteins at ~250 AA/protein would have ~250,000 instances

  20. Secondary Structure Prediction (pipeline) • Primary sequence R1 R2 R3 … Rn-1 Rn → MSA → PSSM profile of the sequence (PSSM1 PSSM2 PSSM3 … PSSMn-1 PSSMn) → window generation (window of PSSM profiles PSSMi-1, PSSMi, PSSMi+1) → prediction method → prediction of SSi

  21. Coordination Number Prediction • Two residues of a chain are said to be in contact if their distance is less than a certain threshold (e.g. 8Å) • CN of a residue: the number of contacts that the residue has • CN gives us a simplified profile of the density of packing of the protein (figure: a contact in the native state and the corresponding positions in the primary sequence)
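A minimal sketch of computing CN from residue coordinates, assuming one representative coordinate per residue (e.g. the Cα atom) and the 8Å threshold mentioned above. Excluding near-diagonal chain neighbours is a common convention; the min_separation parameter is an illustrative choice.

```python
import numpy as np

def coordination_numbers(coords, threshold=8.0, min_separation=2):
    """Count, for each residue, how many other residues lie within
    `threshold` angstroms, ignoring residues closer than
    `min_separation` positions along the chain (trivial contacts)."""
    coords = np.asarray(coords)
    n = len(coords)
    # Pairwise Euclidean distance matrix
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    contact = dists < threshold
    # Mask out the diagonal band of chain neighbours (including self)
    for offset in range(-min_separation + 1, min_separation):
        contact &= ~np.eye(n, k=offset, dtype=bool)
    return contact.sum(axis=1)
```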

  22. CN as a classification problem • The number of contacts, depending on the definition, can be either an integer or a continuous value • To treat this problem (and some others mentioned later) as a classification problem, we need to discretise the output • Unsupervised discretisation methods are applied: uniform-length (UL) and uniform-frequency (UF) discretisation
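A sketch of the two unsupervised discretisation schemes, assuming numeric CN values (function names are illustrative):

```python
import numpy as np

def uniform_length_bins(values, n_bins):
    """Uniform-length (UL) discretisation: split the value range
    into n_bins equal-width intervals."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

def uniform_frequency_bins(values, n_bins):
    """Uniform-frequency (UF) discretisation: choose cut points so each
    bin receives (approximately) the same number of examples."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

cn = np.array([0, 1, 1, 2, 3, 3, 4, 7, 9, 12])  # toy CN values
print(uniform_length_bins(cn, 3))     # [0 0 0 0 0 0 1 1 2 2]: sparse high bins
print(uniform_frequency_bins(cn, 3))  # [0 0 0 1 1 1 2 2 2 2]: balanced bins
```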

  23. SVM vs GAssist for CN prediction • Classification in 2, 3 and 5 states • For a dataset of 1050 proteins and ~234,000 instances

  24. Example of a rule set generated by GAssist for CN prediction • All AA types associated with the central residue are hydrophobic (the core of a protein) • D and E consistently do not appear in the predicates; they are negatively charged residues (the surface of a protein)

  25. Other predictions • Other kinds of residue structural aspects can be predicted • Solvent accessibility: the amount of surface of each residue that is exposed to solvent • Recursive Convex Hull: a metric that models a protein as an onion and assigns each residue to a layer; formally, each layer is a convex hull of points • These features (and others) are predicted in a similar way as done for SS or CN

  26. PSP datasets are good ML benchmarks • These problems can be modelled in many ways: • Regression or classification problems • Low/high number of classes • Balanced/unbalanced classes • Adjustable number of attributes • Ideal benchmarks! • http://www.infobiotic.net/PSPbenchmarks/

  27. Contact Map prediction • Predicting, given two residues of a chain, whether these two residues are in contact or not • This problem can be represented by a binary matrix: 1 = contact, 0 = no contact • Plotting this matrix reveals many characteristics of the protein structure, such as helices and sheets
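Continuing the distance-based sketch from the CN slide, the binary contact map is simply a thresholded distance matrix (again assuming one representative coordinate per residue):

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Binary contact map: M[i, j] = 1 if residues i and j are within
    `threshold` angstroms of each other, 0 otherwise."""
    coords = np.asarray(coords)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (dists < threshold).astype(np.uint8)

# Plotting the matrix, e.g. plt.imshow(contact_map(ca_coords), cmap="gray_r"),
# shows helices as thick bands hugging the diagonal and sheets as bands
# running parallel or perpendicular to it.
```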

  28. Steps for CM prediction (Nottingham method) • Prediction of: • Secondary structure (using PSIPRED) • Solvent Accessibility • Recursive Convex Hull • Coordination Number • Integration of all these predictions plus other sources of information • Final CM prediction, using BioHEL [Bacardit et al., 09]

  29. Prediction of RCH, SA and CN • We selected a set of 2811 protein chains from PDB-REPRDB with: • A resolution of less than 2Å • Less than 30% sequence identity • No chain breaks or non-standard residues • 90% of this set was used for training (~490,000 residues) • 10% for test

  30. Prediction of RCH, SA and CN • All three features were predicted based on a window of ±4 residues around the target • Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information • Each residue is characterised by a vector of 180 values (9 window positions × 20 PSSM values) • The domain of all three features was partitioned into 5 states

  31. Characterisation of the contact map problem • Three types of input information were used: • (1) Detailed information from three windows of residues, centred around the two target residues (2x) and around the middle point between them • (2) Information about the connecting segment between the two target residues • (3) Global protein information

  32. Contact Map dataset • The set of 2811 proteins was randomly halved • Moreover, all proteins with more than 350 amino acids were discarded • Still, the resulting training set contained more than 15.2 million instances and 631 attributes • Less than 2% of the instances are actual contacts • The dataset takes 36GB of disk space

  33. Samples and ensembles • 50 samples of 300K examples are generated from the training set, with a 2:1 ratio of non-contacts to contacts • BioHEL is run 25 times on each sample • Prediction is done by a consensus of the resulting 1250 rule sets • The confidence of a prediction is computed from the distribution of votes in the ensemble • The whole training process takes about 289 CPU days (~5.5h/rule set)
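A minimal sketch of the consensus step, with each rule set stubbed as a function from instance to class (the real system aggregates 1250 BioHEL rule sets; the names here are illustrative):

```python
from collections import Counter

def consensus_predict(rule_sets, instance):
    """Combine the votes of an ensemble of classifiers. The predicted
    class is the majority vote; the fraction of votes it received
    serves as a confidence score for the prediction."""
    votes = Counter(rs(instance) for rs in rule_sets)
    label, count = votes.most_common(1)[0]
    return label, count / len(rule_sets)

# Toy usage with stub "rule sets" (functions from instance -> class)
ensemble = [lambda x: "contact"] * 900 + [lambda x: "non-contact"] * 350
print(consensus_predict(ensemble, instance=None))  # ('contact', 0.72)
```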

  34. Contact Map prediction in CASP • CASP = Critical Assessment of Techniques for Protein Structure Prediction • A biennial community-wide experiment to assess the state of the art in PSP • Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction • The assessors then rank the predictions for each protein and look at the top L/x ones, where L is the length of the protein and x = {5, 10} • From these L/x top-ranked contacts, two measures are computed: • Accuracy: TP/(TP+FP) • Xd: the difference between the distribution of predicted distances and a random distribution
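The accuracy measure can be sketched as follows, assuming predictions arrive as residue-index pairs already sorted by decreasing confidence (function and variable names are illustrative):

```python
def top_lx_accuracy(ranked_pairs, true_contacts, length, x=5):
    """CASP-style contact accuracy: take the L/x highest-confidence
    predicted pairs (L = protein length) and compute TP / (TP + FP)
    against the set of true contacts."""
    top = ranked_pairs[: length // x]  # already sorted by confidence
    tp = sum(1 for pair in top if pair in true_contacts)
    return tp / len(top) if top else 0.0

# Toy usage: predictions as (i, j) pairs sorted by decreasing confidence
preds = [(3, 40), (5, 22), (10, 55), (2, 30), (7, 44)]
truth = {(3, 40), (10, 55)}
print(top_lx_accuracy(preds, truth, length=25, x=5))  # top 5 pairs -> 0.4
```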

  35. Accuracy Results • Accuracy for groups that predicted a common subset of targets (Ezkurdia et al., Proteins 2009; 77(Suppl 9):196-209)

  36. Xd results (Ezkurdia et al., Proteins 2009; 77(Suppl 9):196-209)

  37. Understanding the rule sets • Each rule set has on average 135 rules • We have a total of 168,470 rules • It is impossible to read all of them individually, but we can extract useful statistics • For instance, how often was each attribute used in the rules?

  38. Distribution of frequency of use of attributes • All 631 attributes are actually used (min frequency=429) • However, some of them are used much more frequently than others
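Such statistics are straightforward to extract once each rule is reduced to the set of attributes it tests. A hypothetical sketch (the rule-set representation assumed here is not BioHEL's actual format):

```python
from collections import Counter

def attribute_usage(rule_sets):
    """Count how often each attribute appears in the predicates of the
    learned rules. `rule_sets` is assumed to be a list of rule sets,
    each rule represented by the set of attribute names it tests."""
    counts = Counter()
    for rules in rule_sets:
        for attributes_in_rule in rules:
            counts.update(attributes_in_rule)
    return counts

# counts.most_common(10) would give the "top 10 attributes" table,
# and min(counts.values()) the minimum frequency of use.
```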

  39. Top 10 attributes • The four kinds of residue predictions (SS, SA, RCH and CN) are highly ranked

  40. DIMENSIONALITY REDUCTION FOR PROTEIN DATASETS

  41. Motivation • PSP is a very costly process • As an example, in CASP8 one of the best PSP methods, Rosetta@Home, could dedicate up to 10⁴ computing years to predict a single protein's 3D structure • One possible way to alleviate this computational cost is to simplify the representation used to model the proteins

  42. Target for reduction: the primary sequence • The primary sequence of a protein is a usual target for such simplification • It is composed of a rather high-cardinality alphabet of 20 symbols, which share commonalities between them • One example of reduction widely used in the community is the hydrophobic-polar (HP) alphabet, which reduces these 20 symbols to just two • The HP representation is usually too simple: too much information is lost in the reduction process [Stout et al., 06] • Can we automatically generate these reduced alphabets and tailor them to the specific problem at hand?

  43. Alphabet Reduction protocol • We will use an automated, information theory-driven method to optimise alphabet reduction policies for PSP datasets • The optimisation process will use the Extended Compact Genetic Algorithm (ECGA) to find a reduction policy • ECGA will be guided by a fitness function based on the Mutual Information (MI) metric • Two PSP datasets will be used as testbeds: • Coordination Number (CN) prediction • Relative Solvent Accessibility (SA) prediction • We will verify the optimised reduction policies with BioHEL [Bacardit & Krasnogor, 06], an evolutionary computation-based rule learning system

  44. Alphabet Reduction protocol (diagram: ECGA, guided by Mutual Information, searches for a reduction policy mapping the dataset of cardinality 20 down to cardinality N; BioHEL then learns an ensemble of rule sets on the reduced dataset, and its accuracy on the test set validates the policy)

  45. Extended Compact Genetic Algorithm (ECGA) • ECGA belongs to a class of Evolutionary Algorithms called Estimation of Distribution Algorithms (EDAs) • Instead of using crossover and mutation to generate new individuals, these methods compute a probabilistic model of the structure of the problem from the population and then sample new individuals according to this model
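To show the model-build/sample loop that all EDAs share, here is a minimal sketch of UMDA, the simplest EDA, which assumes independent variables. ECGA differs in that it partitions the variables into linked groups and models each group's joint distribution, but the overall loop is the same: select, estimate a model, sample a new population.

```python
import random

def umda(fitness, n_vars, pop_size=100, n_select=50, generations=30):
    """Minimal Estimation of Distribution Algorithm (UMDA) over binary
    strings: no crossover or mutation, just model-building and sampling."""
    pop = [[random.randint(0, 1) for _ in range(n_vars)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:n_select]
        # Model: independent per-variable probability of a 1
        probs = [sum(ind[i] for ind in elite) / n_select
                 for i in range(n_vars)]
        # Sample a new population from the model
        pop = [[1 if random.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
    return max(pop, key=fitness)

print(umda(fitness=sum, n_vars=20))  # OneMax: typically converges to all 1s
```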

  46. Alphabet reduction strategies • Three strategies were evaluated • They represent progressive levels of sophistication • Mutual Information (MI) • Robust Mutual Information (RMI) • Dual Robust Mutual Information (DualRMI)

  47. MI strategy • There are 21 symbols in the alphabet (20 AA + end of chain) • Each symbol is assigned to a group in the chromosome used by ECGA

  48. MI strategy • Objective function for the MI strategy: Mutual Information • Mutual Information is a measure that quantifies the interrelationship between two discrete variables • X is the reduced representation of the window of residues around the target • Y is the two-state definition of CN or SA
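The slide's formula image did not survive the transcript; for discrete variables, the standard definition of mutual information is:

I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}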

  49. MI strategy • Steps of the objective function computation for the MI strategy: • The reduction mappings are extracted from the chromosome • The instances of the training set are transformed into the lower-cardinality alphabet • The mutual information between the class attribute and the string formed by concatenating the input attributes is computed • This MI is assigned as the result of the evaluation function
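A sketch of this objective for a candidate reduction policy, represented here as a hypothetical dict mapping each alphabet symbol to a group id (all names are illustrative):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) for two discrete variables given as parallel lists."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def evaluate_policy(policy, windows, classes):
    """Objective for a candidate reduction policy (dict: symbol -> group
    id): transform every window into the reduced alphabet, concatenate
    its attributes into a string, then score by MI against the class."""
    reduced = ["".join(str(policy[a]) for a in w) for w in windows]
    return mutual_information(reduced, classes)
```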

  50. MI strategy • Problem of the MI strategy: • Mutual Information needs redundancy in order to be a good estimator • That is, each possible pattern in X and Y should be well represented in the dataset • Patterns in Y are always well represented; what happens with patterns in X in our dataset? • Our sample, despite having almost 260,000 residues, is too small
