Machine Learning
Objectives of Data Analysis • Classification • Pattern recognition • Diagnostics • Trends • Outliers • Quality control • Discrimination • Regression • Data comparisons
Pattern Matching • Searching for patterns in data • In general, patterns are easy to spot or to program for • But when a particular region has 55% GC content while neighboring regions have only 45%, it is easy to miss the difference • Even when differences are noted, it is difficult to set the boundaries between regions • Different regions may have biological significance, e.g., one region may represent a coding region • GC content may be of interest, but what about content that is significant yet not recognized as such? • Probabilistic methods may work • In particular, machine learning
Machine Learning • Machine learning model • A black box with inputs and outputs • The black box is adjusted by parameters • Simplest form: a yes or no output • The model parameters start out random • The outputs will then be random too, but the parameters can be trained (adjusted) to fit the data • Typically, the known data set is divided into a training set and a test set (see the sketch below) • Models • Hidden Markov Model • Neural Network • Probabilistic models • SVM (Support Vector Machine)
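To make the training/test split concrete, here is a minimal sketch in Python; the labeled (sequence, label) pairs are hypothetical placeholders, not data from the slides.

import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = examples[:]                       # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

# Hypothetical labeled data: (sequence window, class label) pairs
data = [("GCGCGTACGC", 1), ("ATTATAATTA", 0), ("GGCGCCGTGC", 1), ("TTAAATATTT", 0)]
train, test = train_test_split(data)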
HMM of Loop/Helical • A letter depends on preceding letters AND on a hidden state • In the helical/loop problem there are two hidden states: 0 for loop, 1 for helical • rB0, rB1: probabilities that the first residue is loop or helical • r11, r10: probabilities of remaining in the helical state or of switching to loop • etc.
Hidden Markov Model (HMM) • Emission probabilities are set equal to the amino acid frequencies • e0(a) = pl(a), e1(a) = ph(a) • AAs occur independently as long as the model stays in state 0 or 1 (zero-order) • Transitions between hidden states are modeled as first-order • The values of the transition probabilities (r11, r10, …) control the relative frequency and relative lengths of the regions • If r01 is very small, it is difficult to initiate a new helical region, etc. • e.g. • Sequence xi: G H M E S S L L K Q T I N S W H L N • Path πi: B 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 E
Hidden Markov Model (HMM) • The path variables πi describe the hidden states • Likelihood of the path πi in the example: • L = [rB0 e0(G)] [r00 e0(H)] [r01 e1(M)] … [r00 e0(N)] r0E • The model can be used to determine the most likely positions of helices and loops within the sequence (the decoding problem) • Two ways of doing this • Viterbi • Find the most probable path through the model, i.e., find the sequence of hidden states with the highest L • This gives a straightforward prediction that each site is either helix or loop (see the sketch below) • Forward/Backward • Consider all possible paths through the model, weighted according to their likelihood, and calculate the probability that each site is in each of the hidden states
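A minimal Viterbi decoder for the two-state helix/loop HMM is sketched below in Python; the transition and emission values are illustrative assumptions, not the parameters used in the slides.

import math

states = ["loop", "helix"]                      # 0 = loop, 1 = helix
start = {"loop": 0.6, "helix": 0.4}             # rB0, rB1 (assumed values)
trans = {"loop":  {"loop": 0.9, "helix": 0.1},  # r00, r01
         "helix": {"loop": 0.2, "helix": 0.8}}  # r10, r11
HYDROPHOBIC = set("AVILMFWC")

def emit(state, aa):
    """Zero-order emissions: helices favor hydrophobic residues (illustrative values)."""
    p_hydro = 0.7 if state == "helix" else 0.4
    return p_hydro / 8 if aa in HYDROPHOBIC else (1 - p_hydro) / 12

def viterbi(seq):
    """Return the most probable hidden-state path for a sequence (log-space recursion)."""
    v = [{s: math.log(start[s]) + math.log(emit(s, seq[0])) for s in states}]
    back = []
    for aa in seq[1:]:
        scores, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            scores[s] = v[-1][best] + math.log(trans[best][s]) + math.log(emit(s, aa))
            ptr[s] = best
        v.append(scores)
        back.append(ptr)
    path = [max(states, key=lambda s: v[-1][s])]  # trace back from the best final state
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("GHMESSLLKQTINSWHLN"))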
Profile HMM • Profile technique • Position-specific scores are used to describe aligned families of protein sequences • The drawback is the reliance on ad hoc scoring schemes • Profile HMMs were developed to capture the information in an alignment • Example alignment fragment (columns 2 and 3 are labeled match columns; '.' = no residue, lowercase = inserted residue, '-' = deletion):
W H . . E n
W H . . Y .
W - . . E .
S H . . E .
T H e . Y .
W H e r E .
Neural Networks • Simulate the human nervous system • Neurons and synapses • A neuron outputs a real number between 0 and 1 • Feedforward network • Typically 10-20 residues are input • Usually used in supervised learning
Single Neuron • The connection from input i to neuron j has a positive/negative weight wij • Total input: xj = ∑i wij yi • Output: yj = g(xj) • A sigmoid function: g(xj) = 1/[1 + exp(-xj)] • A single output neuron with multiple inputs is called a perceptron
Perceptron Example • Training data (y1, y2) → y: (0, ½) → 1; (1, 1) → 1; (1, ½) → 0; (0, 0) → 0 • Visualize: a separating line such as y2 = ¼ + ½ y1 works • Equivalent decision rule: -¼ - ½ y1 + y2 > 0, i.e., w0 = -¼, w1 = -½, w2 = 1 • The four data points require: ½ w2 + w0 > 0; w1 + w2 + w0 > 0; w1 + ½ w2 + w0 < 0; w0 < 0 (a sketch checking these weights follows)
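As a quick check of the weights above, a minimal sketch in Python (using a hard threshold rather than the sigmoid):

def perceptron(y1, y2, w0=-0.25, w1=-0.5, w2=1.0):
    """Hard-threshold perceptron: output 1 when w0 + w1*y1 + w2*y2 > 0, else 0."""
    return 1 if w0 + w1 * y1 + w2 * y2 > 0 else 0

# The four training points from the slide and their target outputs.
examples = [((0, 0.5), 1), ((1, 1), 1), ((1, 0.5), 0), ((0, 0), 0)]
for (y1, y2), target in examples:
    assert perceptron(y1, y2) == target
print("all four points classified correctly")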
Learning Algorithm • Backpropagation • Errors at the output layer propagate back toward the input layer • Weights are adjusted • Based on the gradient descent method (a single-neuron sketch is below)
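A minimal sketch of the gradient-descent update for a single sigmoid output neuron (the output-layer step of backpropagation); the learning rate, epoch count, and training data are illustrative.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_neuron(data, n_inputs, lr=0.5, epochs=1000):
    """Delta-rule training of one sigmoid neuron on (inputs, target) pairs."""
    w, b = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for inputs, target in data:
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, inputs)) + b)
            delta = (y - target) * y * (1 - y)   # gradient of squared error w.r.t. net input
            w = [wi - lr * delta * xi for wi, xi in zip(w, inputs)]
            b -= lr * delta
    return w, b

# Illustrative data: the perceptron example from the previous slide.
data = [((0, 0.5), 1), ((1, 1), 1), ((1, 0.5), 0), ((0, 0), 0)]
weights, bias = train_neuron(data, n_inputs=2)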
Neural Network • Supervised training is required • It is hard to tell how individual weights contribute to the output • Unsupervised learning systems • The networks learn to form their own classifications of the training data without external help • To do this we have to assume that class membership is broadly defined by the input patterns sharing common features, and that the network will be able to identify those features across the range of input patterns
SOM (Self-Organizing Map) • A class of unsupervised systems based on competitive learning, in which the output neurons compete amongst themselves to be activated, so that only one is activated at any one time • This activated neuron is called a winner-takes-all neuron, or simply the winning neuron • Such competition can be implemented by lateral inhibition connections (negative feedback paths) between the neurons • The result is that the neurons are forced to organize themselves (see the sketch below)
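A minimal sketch of competitive learning on a 1-D self-organizing map; the data points, map size, and learning schedule are illustrative assumptions.

import random

def train_som(data, n_neurons=10, dim=2, lr=0.3, radius=2, epochs=50, seed=0):
    """1-D SOM: the neuron whose weight vector is closest to the input wins,
    and the winner plus its map neighbors are pulled toward the input."""
    rng = random.Random(seed)
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for _ in range(epochs):
        for x in data:
            # Competition: the winning neuron has the smallest distance to x.
            winner = min(range(n_neurons),
                         key=lambda j: sum((wj - xj) ** 2 for wj, xj in zip(weights[j], x)))
            # Adaptation: update the winner and nearby neurons on the map.
            for j in range(max(0, winner - radius), min(n_neurons, winner + radius + 1)):
                weights[j] = [wj + lr * (xj - wj) for wj, xj in zip(weights[j], x)]
    return weights

# Hypothetical 2-D data points in [0, 1].
data = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.85, 0.8), (0.5, 0.5)]
som_weights = train_som(data)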
Probabilistic Models • Likelihood ratios • Example: predict helices and loops in a protein • Known information: helices have a high content of hydrophobic residues • ph and pl: frequencies of an AA being in a helix or a loop • Lh and Ll: likelihoods that a sequence of N AAs is in a helix or a loop • Lh = ∏i=1..N ph(xi), Ll = ∏i=1..N pl(xi) • Rather than the likelihoods themselves, their ratio carries the useful information • Lh/Ll: is the sequence more likely to be a helical or a loop region? • S = ln(Lh/Ll) = ∑i=1..N ln(ph(xi)/pl(xi)): positive for a helical region • Partition a sequence into N-AA segments (N = 300) and score each segment (see the sketch below)
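A minimal sketch of the log-odds segment score; the helix/loop residue frequencies and the test sequence are illustrative assumptions.

import math

HYDROPHOBIC = set("AVILMFWC")

def log_odds(aa):
    """ln(ph/pl) for one residue; helices favor hydrophobic residues (assumed frequencies)."""
    ph, pl = (0.7 / 8, 0.4 / 8) if aa in HYDROPHOBIC else (0.3 / 12, 0.6 / 12)
    return math.log(ph / pl)

def segment_scores(seq, n=10):
    """S = sum ln(ph/pl) over consecutive n-residue segments; positive suggests helix."""
    return [sum(log_odds(aa) for aa in seq[i:i + n]) for i in range(0, len(seq) - n + 1, n)]

print(segment_scores("MVLSAADKGNVKAAWGKVGGHAAEYGAEAL", n=10))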
Prior and Posterior Probs. • Two hypotheses (helix or loop) • The sequence is described by models 0 and 1 • Models 0 and 1 are defined by ph and pl • Generalize to K hypotheses: models Mk (k = 0, 1, 2, …) • Given a test dataset D, what is the probability that D is described by each of the models? • Known information: prior probabilities Pprior(Mk) for each model, from other information sources • Compute the likelihood of D according to each model: L(D|Mk) • Of interest is not the probability of D arising from Mk, but the probability of D being described by Mk • Namely, Ppost(Mk|D) ∝ L(D|Mk) Pprior(Mk): the posterior probability • Ppost(Mk|D) = L(D|Mk) Pprior(Mk) / ∑i L(D|Mi) Pprior(Mi) • => Bayesian probability
Bayesian Prob. • www.cs.uml.edu/~kim/580/review_baysian.pdf • Bayes' Rule • p(x|y) = p(y|x) p(x) / p(y), where p(y) = Σx p(y|x) p(x) • Example experiment • Stimulant (ST): present / not present (the hypothesis x, with a prior probability) • Extracellular signal (SI): high / medium / low (the observation y) • Conditional probabilities p(SI|ST):
Stimulant        High   Med   Low
present          0.6    0.3   0.1
not present      0.1    0.2   0.7
• Inference • What is the probability of ST being present when SI is high? • Need the prior probability (present = 0.4) • P(ST=p|SI=high) = p(SI=h|ST=p) p(ST=p) / Σx p(SI=h|ST=x) p(ST=x) = 0.8 (checked in the sketch below)
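A minimal check of the calculation, using the table and prior above:

p_signal_given_st = {                           # p(SI | ST), the table from the slide
    "present":     {"high": 0.6, "med": 0.3, "low": 0.1},
    "not present": {"high": 0.1, "med": 0.2, "low": 0.7},
}
p_st = {"present": 0.4, "not present": 0.6}     # prior p(ST)

def posterior(st, si):
    """p(ST = st | SI = si) = p(si|st) p(st) / sum_x p(si|x) p(x)."""
    evidence = sum(p_signal_given_st[x][si] * p_st[x] for x in p_st)
    return p_signal_given_st[st][si] * p_st[st] / evidence

print(posterior("present", "high"))             # 0.24 / 0.30 = 0.8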
Model Parameter Set • Discrete data • θSI = p(SI|ST):
Stimulant        High   Med   Low
present          0.6    0.3   0.1
not present      0.1    0.2   0.7
• Continuous data – e.g., Gaussian
Bayesian Network • Multiple variables • Stimulant (ST), signal (SI), inhibitor (IN) of the signal, G protein-coupled receptor binding (RE), a G protein (GP), and the cellular response (CR) • Express relationships • ST may or may not generate a signal • The concentration of the signal may affect the level of the inhibitor • Whether the signal binds the receptor depends on the concentrations of both the signal and the inhibitor • GP should become active if the receptor is bound • An active GP initiates a cascade of reactions that causes the cellular response
Conditional Independence • CI holds if p(a,b|c) = p(a|c) p(b|c) • Three cases, illustrated by the regulation of three genes x, y, and z • Serial (x → y → z) • If the expression level of y is unknown, the level of x affects (helps infer) that of z • If the level of y is known, the level of z is conditionally independent of x
Conditional Independence • Diverging (x ← y → z) • If the expression level of y is unknown, the level of x affects that of z (they are co-regulated: if x is highly expressed, the likely level of y can be inferred, which in turn influences the expected expression level of z) • If the level of y is known, the level of z is conditionally independent of x
Conditional Independence • Converging (x → y ← z) • If the expression level of y is unknown, the level of x does not help to infer that of z (x and z are independent) • If the level of y is known, the level of x does help to infer that of z • y depends on both x and z, so • p(x,z|y) ≠ p(x|y) p(z|y) • Knowing y and x helps to infer the value of z, so x and z are no longer independent • p(z|x,y) = p(z) p(y|x,z) / Σz p(z) p(y|x,z) ≠ p(z)
Joint Prob. Distribution • A BN with n variables (nodes) x = {x1, x2, …, xn} and model parameters θ = {θ1, θ2, …, θn} (θi: the set of parameters describing the distribution of xi) • Each node xi has a set of parents pa(xi) • p(x1, x2, …, xn|θ) = Πi p(xi|pa(xi), θi)
Prob. of GP active, given ST is present • Given the states of some variables, the states of other variables can be inferred • P(GP=active|ST=present) = Σx Σy Σz P(GP=active|RE=x) P(RE=x|IN=y, SI=z) P(IN=y|SI=z) P(SI=z|ST=present) = 0.5048
Prob. of ST present, given signal is high • Posterior probability • P(ST=present|SI=high) = P(SI=h|ST=p) P(ST=p) / [P(SI=h|ST=p) P(ST=p) + P(SI=h|ST=n) P(ST=n)] = 0.6×0.4 / [0.6×0.4 + 0.1×0.6] = 0.8
Bayesian Prob. • Basic principles • We make inferences using posterior probabilities • The model with the highest posterior probability is chosen; the larger it is relative to the others, the more confident we can be • Special case: two models • Two prior probs.: Pprior0, Pprior1 • Pposti = Li Ppriori / (L0 Pprior0 + L1 Pprior1) • Log-odds score: • S΄ = ln(L1 Pprior1 / L0 Pprior0) = ln(L1/L0) + ln(Pprior1/Pprior0) = S + ln(Pprior1/Pprior0) • The difference between S΄ and S is simply an additive constant, so the ranking will be identical whether we use S΄ or S • Warning: if Pprior1 is small, S has to be high to make S΄ positive • When Pprior0 = Pprior1, S΄ = S • Ppost1 = 1/(1 + L0 Pprior0 / L1 Pprior1) = 1/(1 + exp(-S΄)) • S΄ = 0 → Ppost1 = 1/2; S΄ large and positive → Ppost1 ≈ 1; S΄ large and negative → Ppost1 ≈ 0 (see the sketch below)
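A minimal sketch of the two-model posterior computed through the log-odds score; the likelihoods and priors are illustrative numbers.

import math

def posterior_model1(l0, l1, prior0, prior1):
    """Posterior probability of model 1, via S' = ln(L1/L0) + ln(prior1/prior0)
    and Ppost1 = 1 / (1 + exp(-S'))."""
    s_prime = math.log(l1 / l0) + math.log(prior1 / prior0)
    return 1.0 / (1.0 + math.exp(-s_prime))

# Model 1 fits the data 10x better (S = ln 10) but has a small prior.
print(posterior_model1(l0=1e-6, l1=1e-5, prior0=0.9, prior1=0.1))   # ≈ 0.53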
Profile Model • How do we estimate parameters from data when the parameters are continuous-valued? • In the helical/loop example, how do we determine ph and pl? • In the same example, suppose AA a is observed na times in a sequence of ntot AAs • The likelihood of this occurring is L = ∏a=1..20 pa^na • The simplest way of choosing pa is to use na/ntot (indeed, this maximizes L) • Generalize to a probabilistic model for a profile • Want to develop a position-specific scoring system • K aligned sequences of length N (ungapped) • The ML frequencies at site i are pia = nia/K • Let xi be the AA at site i in a new sequence; its likelihood is Lprofile = ∏i=1..N pi,xi • Need a model to compare with: one that is not position-specific, L0 = ∏i=1..N pxi • S = ln(Lprofile/L0) = ∑i=1..N ln(pi,xi/pxi) (see the sketch below)
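A minimal sketch of the position-specific score S; the profile columns and background frequencies are hypothetical, over a reduced alphabet, just to illustrate the formula.

import math

def profile_score(seq, site_freqs, background):
    """S = sum_i ln(p_{i,x_i} / p_{x_i}); site_freqs[i][a] is the position-specific
    frequency of AA a at site i, background[a] the position-independent frequency."""
    return sum(math.log(site_freqs[i][aa] / background[aa]) for i, aa in enumerate(seq))

site_freqs = [{"W": 0.7, "S": 0.2, "T": 0.1},
              {"H": 0.8, "W": 0.1, "S": 0.1},
              {"E": 0.6, "Y": 0.3, "W": 0.05, "H": 0.05}]
background = {"W": 0.25, "S": 0.15, "T": 0.15, "H": 0.2, "E": 0.15, "Y": 0.1}
print(profile_score("WHE", site_freqs, background))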
Bayesian Analysis • Beyond introducing prior probabilities, • the more important aspect of a Bayesian analysis here is how to estimate pia • When building a profile model, the frequencies at each site need to be known • What if an AA a does not occur at all at a given site? • Lprofile is then set to 0 for any sequence containing a at that site, which is too stringent • A Bayesian model handles this by using pseudo-counts • To the K aligned sequences, imagine adding A additional sequences • The average number of additional occurrences of AA a is A·pa (pa is the position-independent AA probability) • pia = (nia + A·pa)/(K + A) • When K is small, the estimated frequency is influenced by the prior • As more data accumulate, the observed data dominate • In a Bayesian model, the prior is important when the sample size is small (see the sketch below)
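A minimal sketch of the pseudo-count estimate; the pseudo-count weight A = 20 and the example numbers are illustrative.

def pseudocount_freq(n_ia, k, p_a, a_pseudo=20):
    """Bayesian estimate p_ia = (n_ia + A*p_a) / (K + A)."""
    return (n_ia + a_pseudo * p_a) / (k + a_pseudo)

# An AA never observed at a site still gets a small nonzero frequency:
print(pseudocount_freq(n_ia=0, k=6, p_a=0.05))     # (0 + 1) / 26 ≈ 0.038
# With much more data, the observed counts dominate the prior:
print(pseudocount_freq(n_ia=0, k=600, p_a=0.05))   # (0 + 1) / 620 ≈ 0.0016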
Support Vector Machine • W. Noble, "What is a support vector machine?", Nature Biotechnology, Dec. 2006 (http://www.fml.tuebingen.mpg.de/raetsch/lectures/ismb09tutorial/images/WhatIsASVM.pdf) • http://www.dtreg.com/svm.htm • An SVM is similar to an NN • An SVM with a sigmoid kernel function is equivalent to a two-layer perceptron • SVMs are used in • Credit card fraud detection • Handwriting recognition • Classification of microarray gene expression profiles
Cancer Gene Classification • Affymetrix microarray • Contains probes for 6,817 human genes • For a bone marrow sample, it returns 6,817 values, each representing the mRNA level of an individual gene • Trained on 27 acute lymphoblastic leukemia (ALL) and 11 acute myeloid leukemia (AML) samples • For simplicity, assume two gene probes • The gene expression values can then be plotted in two dimensions
Maximum Margin Hyperplane • How do we tell which separating line is better? • From statistical learning theory • The margin is defined as the distance from a separating hyperplane to the nearest expression vector • Choose the separating hyperplane that maximizes the margin (see the sketch below)
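A minimal sketch using scikit-learn (assumed to be available); the two-probe expression values and class labels below are synthetic stand-ins for the ALL/AML data, not the Affymetrix measurements.

from sklearn.svm import SVC

X = [[2.1, 0.6], [1.8, 0.4], [2.4, 0.9], [0.5, 2.0], [0.7, 2.3], [0.3, 1.8]]  # two gene probes
y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

clf = SVC(kernel="linear", C=1.0)   # linear kernel -> maximum-margin separating hyperplane
clf.fit(X, y)
print(clf.predict([[2.0, 0.5], [0.6, 2.1]]))   # expected: ['ALL' 'AML']
print(clf.support_vectors_)                    # the expression vectors that define the margin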
Separating Hyperplane and Kernel Function • Hyperplane • Data given as m-dimensional vectors, when mapped by a kernel function into an (m+i)-dimensional space, may be better separated
Soft Margin and Overfitting • Expect some errors – allow a soft margin • Too flexible a boundary may overfit the training data
ROC Curve • An evaluation tool for a classifier • Accuracy or error rates can be plotted • The ROC (Receiver Operating Characteristic) curve is popular • Want to predict p(ositive) or n(egative) • From a test (experiment) • Actually p, predicted p – true positive (TP) • Actually n, predicted p – false positive (FP, false alarm) • Actually p, predicted n – false negative (FN) • Actually n, predicted n – true negative (TN) • ROC plots • TP rate vs. FP rate • TP rate – sensitivity of the model/test/experiment • FP rate – (1 - specificity) (see the sketch below)
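A minimal sketch of how the (FP rate, TP rate) points of an ROC curve are computed by sweeping a decision threshold; the classifier scores and labels are hypothetical.

def roc_points(scores, labels):
    """Return (FP rate, TP rate) points; labels are 1 for actual positives, 0 for negatives."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / neg, tp / pos))     # (1 - specificity, sensitivity)
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
print(roc_points(scores, labels))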