
Machine Learning CPBS7711 Oct 2, 2014

Machine Learning CPBS7711, Oct 2, 2014. Sonia Leach, PhD, Assistant Professor, Center for Genes, Environment, and Health, National Jewish Health. sonia.leach@gmail.com


Presentation Transcript


  1. Machine Learning CPBS7711 Oct 2, 2014 Sonia Leach, PhD, Assistant Professor, Center for Genes, Environment, and Health, National Jewish Health, sonia.leach@gmail.com Center for Genes, Environment, and Health

  2. Someone once said "Artificial Intelligence = Search", so Machine Learning = ? Induction of new knowledge from experience and the ability to improve? Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics. We might say the defining question of Computer Science is "How can we build machines that solve problems, and which problems are inherently tractable/intractable?" The question that largely defines Statistics is "What can be inferred from data plus a set of modeling assumptions, with what reliability?" The defining question for Machine Learning builds on both, but it is a distinct question. Whereas Computer Science has focused primarily on how to manually program computers, Machine Learning focuses on the question of how to get computers to program themselves (from experience plus some initial structure). Whereas Statistics has focused primarily on what conclusions can be inferred from data, Machine Learning incorporates additional questions about what computational architectures and algorithms can be used to most effectively capture, store, index, retrieve and merge these data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability. We say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E. - Tom Mitchell http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf Also interesting discussion of differences among AI, ML, Data Mining, Stats: http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai Center for Genes, Environment, and Health

  3. Machine Learning • From Wikipedia: • 7.1 Decision tree learning • 7.2 Association rule learning • 7.3 Artificial neural networks • 7.4 Inductive logic programming • 7.5 Support vector machines • 7.6 Clustering • 7.7 Bayesian networks • 7.8 Reinforcement learning • 7.9 Representation learning • 7.10 Similarity and metric learning • 7.11 Sparse dictionary learning • From Alpaydin, Introduction to Machine Learning: • Supervised Learning • Bayesian Decision Theory • Parametric Methods • Multivariate Methods • Dimensionality Reduction • Clustering • Nonparametric Methods • Decision Trees • Linear Discrimination • Multilayer Perceptrons • Local Models • Kernel Machines • Bayesian Estimation • Hidden Markov Models • Graphical Models • Combining Multiple Learners • Reinforcement Learning http://www.realtechsupport.org/UB/MRIII/papers/MachineLearning/Alppaydin_MachineLearning_2010.pdf Center for Genes, Environment, and Health

  4. Machine Learning (what I will cover) • Unsupervised • Dimensionality Reduction • PCA • Clustering • k-Means, SOM, Hierarchical • Association Set Mining • Probabilistic Graphical Models • HMMs, Bayes Nets • Supervised • k-Nearest Neighbor • Neural Nets • Decision Trees/Random Forests • SVMs • Naïve Bayes • Issues • Regression/Classification • Feature selection/reduction • Missing data • Boosting/bagging/jackknife • Cross validation, generalization • Model selection Connections to other lectures: Miller (HMM), Pollock (HMM), Leach (HMM), Lozupone (PCA, Feature Importance Scores, Clustering), Kechris (Regression), [Hunter (Knowledge-Based Analysis), Cohen (BioNLP), Phang (Expr Analysis) ….] R: http://cran.r-project.org/web/views/MachineLearning.html Center for Genes, Environment, and Health

  5. Unsupervised Learning Center for Genes, Environment, and Health

  6. Dimensionality Reduction: Principal Components Analysis (PCA) • Motivation: instead of considering all variables, use a small number of linear combinations of those variables with minimal information loss http://blog.peltarion.com/2006/06/20/the-talented-drhebb-part-2-pca/ • (Figure: 2D data with first principal component P1.) What if we could only choose one of the variables to represent the data? Choose the y-axis, since it explains more variance in the data; the amount of variance explained by P1 is greater still than the amount explained by any single variable. Center for Genes, Environment, and Health

  7. Principal Components Analysis (PCA) • If X = (x1, x2, …, xp) is a random vector (mean vector μ, covariance matrix Σ), then the principal component transformation is X → Y = Γ′(X−μ) s.t. Γ is orthogonal, Γ′ΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0. • Linear orthogonal transform of the original data to a new coordinate system • each component is a linear combination of the original variables • coefficients of the variables in the linear combination = Loadings • data transformed to the new coordinates = Scores • components ordered by percentage of variance explained along the new axis • number of components = minimum dimension of the input data matrix • set of orthogonal vectors is not unique, not scale-invariant (covariance vs correlation), computed by eigenvalue decomposition (as above and R princomp) or singular value decomposition (SVD) (R prcomp) Center for Genes, Environment, and Health Adapted from S-plus Guide to Statistics
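  For concreteness, a minimal R sketch of the eigendecomposition view described above, using R's built-in USArrests data (which also appears later in this deck); the data choice is illustrative only:
  X  <- USArrests                                  # any numeric data matrix will do
  pc <- princomp(X)                                # PCA via eigendecomposition of the covariance matrix
  e  <- eigen(cov(X) * (nrow(X) - 1) / nrow(X))    # princomp divides by n rather than n-1
  max(abs(e$values - pc$sdev^2))                   # eigenvalues match the component variances (up to rounding)
  # e$vectors matches pc$loadings up to the sign of each column; prcomp() uses SVD instead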

  8. Principal Components Analysis (PCA) • If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γ′(X−μ) s.t. Γ is orthogonal, Γ′ΣΓ = Λ is diagonal, λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
  X:
      diffgeom complex algebra reals stats
  1         36      58      43    36    37
  2         62      54      50    46    52
  3         31      42      41    40    29
  4         76      78      69    66    81
  5         46      56      52    56    40
  6         12      42      38    38    28
  7         39      46      51    54    41
  8         30      51      54    52    32
  9         22      32      43    28    22
  10         9      40      47    30    24
  What if we could only choose two dimensions? Center for Genes, Environment, and Health Adapted from S-plus Guide to Statistics

  9. Principal Components Analysis (PCA) • If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γ′(X−μ) s.t. Γ is orthogonal, Γ′ΣΓ = Λ is diagonal, λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
  X: the same 10×5 data matrix as on the previous slide.
  Component Importance:
                          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
  Standard deviation      30.142  7.179  5.786  4.098  3.084
  Proportion of Variance   0.890  0.050  0.032  0.016  0.009
  Cumulative Proportion    0.890  0.941  0.974  0.990  1.000
  Γ (loadings):
              Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
  diffgeom     0.638  0.599 -0.407 -0.112 -0.237
  complex      0.372 -0.230  0.593 -0.595 -0.320
  algebra      0.240 -0.371         0.645 -0.624
  reals        0.333 -0.671 -0.557 -0.234  0.271
  statistics   0.535         0.414  0.404  0.615
  (blank entries are small loadings suppressed by R's print method)
  Y (scores):
             Comp.1     Comp.2    Comp.3     Comp.4     Comp.5
  [1,]   -2.292745   5.827588  8.966977 -7.1630488 -2.2195936
  [2,]   25.846460  13.457048 -3.257987  0.5344066  0.4777994
  [3,]  -14.856875   4.337867 -4.057297 -2.5308172  1.4998247
  [4,]   70.434116  -3.286077  6.423473  3.9571310  0.8815369
  [5,]   13.768664  -4.392701 -6.058773 -4.7551497 -2.2951908
  [6,]  -28.899236  -4.611347  4.338621 -2.2710490  6.7118075
  [7,]    5.216449  -4.536616 -7.625423  2.2093319  3.2618335
  [8,]   -3.432334 -11.115805 -3.553422 -0.9908949 -4.1604420
  [9,]  -31.579207   8.354892 -2.497369  5.6986938 -1.9742069
  [10,] -34.205292  -4.034848  7.321199  5.3113963 -2.1833687
  EXAMPLE IN R:
  X = read.table('pca.input', sep=" ", header=TRUE)
  pc = princomp(X)
  mu = pc$center
  Gamma = pc$loadings
  Y = pc$scores
  XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
  propOfVar = pc$sdev^2 / sum(pc$sdev^2)
  eigenVals = pc$sdev^2
  Center for Genes, Environment, and Health Adapted from S-plus Guide to Statistics

  10. Principal Components Analysis (PCA) (same X, component importance, Γ loadings, and Y scores as on the previous slide)
  EXAMPLE IN R (continued):
  X = read.table('pca.input', sep=" ", header=TRUE)
  pc = princomp(X)
  mu = pc$center
  Gamma = pc$loadings
  Y = pc$scores
  XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
  propOfVar = pc$sdev^2 / sum(pc$sdev^2)
  eigenVals = pc$sdev^2
  ## Verify Y = (X-mu)*Gamma
  unique(Y - as.matrix(XminusMu) %*% Gamma)
  ## Verify X represented by Comp. i == Y[,i]
  par(mfrow=c(2,1), pty="s"); biplot(pc)
  plot(Y[,1], Y[,2], col="white"); text(Y[,1], Y[,2], 1:10)
  Biplot arrows for the original variables: length = proportion of variance explained in the 2 components; direction = relative loadings in the 2 components, e.g. diffgeom largest (++,++), algebra smallest (+,-) Center for Genes, Environment, and Health Adapted from S-plus Guide to Statistics

  11. Principal Components Analysis (PCA) • If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γ′(X−μ) s.t. Γ is orthogonal, Γ′ΣΓ = Λ is diagonal, λ1 ≥ λ2 ≥ … ≥ λp ≥ 0. • X: the same 10×5 data matrix as on slide 8. What if we could only choose two dimensions? Center for Genes, Environment, and Health Adapted from S-plus Guide to Statistics

  12. Clustering • Partitioning • Must specify the number of clusters • K-Means, Self-Organizing Maps (SOM/Kohonen Net) • Hierarchical Clustering • Do not need to specify the number of clusters • Need to specify a distance metric and a linkage method • Other approaches • Fuzzy clustering (probabilistic membership) • Spectral Clustering (using eigenvalue decomposition) Center for Genes, Environment, and Health

  13. Clustering http://apandre.wordpress.com/visible-data/cluster-analysis/ Center for Genes, Environment, and Health

  14. R package: mlbench: Machine Learning Benchmark Problems http://stackoverflow.com/questions/4722290/generating-synthetic-datasets Center for Genes, Environment, and Health

  15. k-Means • Initialize: select the initial k Centroids • REPEAT • Form k clusters by assigning each point to the ‘closest’ Centroid • Recompute the Centroid for each cluster • UNTIL the Centroids don’t change or all changes are below a predefined threshold • Initial Centroids may be random vectors, randomly selected data vectors, the first k vectors, etc., or computed from a random first assignment • ‘closest’ typically defined by Euclidean distance (Voronoi diagram) • Prone to local optima, so typically do N random restarts and take the best (minimum sum of squared Euclidean distances to centroids), as in the sketch below • In practice, favors separated spherical clusters Center for Genes, Environment, and Health Images from wikipedia
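  A minimal k-Means sketch in base R; the toy data and settings are illustrative only:
  set.seed(1)
  X  <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 4), ncol = 2))   # two separated spherical clusters
  km <- kmeans(X, centers = 2, nstart = 25)             # nstart = 25 random restarts; best solution kept
  km$centers                                            # final centroids
  km$tot.withinss                                       # sum of squared Euclidean distances to assigned centroids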

  16. k-Means (figure: iterations 0 through 5 of the algorithm; images from Wikipedia) Center for Genes, Environment, and Health http://en.wikipedia.org/wiki/K-means_clustering

  17. Self-Organizing Maps (SOM) • Similar to k-Means, goal to assign data to map node (e.g. Centroid in k-Means) with ‘closest’ weight vector to data space vector (minimize distE(x,w)) • Difference: map nodes constrained by neighborhood relationships, whereas k-Means Centroids freely move • Must input initial topology, map ‘stretches’ to cover nD data in 2D, similar data assigned to map neighbors Image from wikipedia Center for Genes, Environment, and Health

  18. Self-Organizing Maps (SOM) • 1. Initialization – choose random values for the initial weight vectors wj. • 2. Sampling – draw a sample training input vector x from the input space. • 3. Matching – find the winning neuron I(x) with the weight vector closest to the input vector (i.e., minimum Euclidean distance). • 4. Updating – apply the weight update Δwji = η(t) Tj,I(x)(t) (xi − wji), where η(t) = learning rate at time t* and Tj,I(x)(t) = neighborhood function at time t (see the sketch below). • 5. Continuation – keep returning to step 2 until the feature map stops changing. http://www.sciencedirect.com/science/article/pii/S0014579399005244 * Informal intro to simulated annealing, gradient descent… Center for Genes, Environment, and Health
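  A toy sketch of the update rule in step 4, written directly in R rather than with a SOM package; the 1-D map, decay schedules and constants are all assumptions for illustration:
  set.seed(1)
  X <- matrix(rnorm(500 * 2), ncol = 2)             # toy 2-D inputs
  M <- 10                                           # 1-D chain of M map nodes
  W <- X[sample(nrow(X), M), ]                      # 1. initialize weight vectors
  for (t in 1:2000) {
    x     <- X[sample(nrow(X), 1), ]                # 2. sample a training input
    win   <- which.min(colSums((t(W) - x)^2))       # 3. winning node = closest weight vector
    eta   <- 0.05 * exp(-t / 1000)                  #    learning rate eta(t), decaying
    sigma <- max(1, (M / 2) * exp(-t / 1000))       #    neighborhood radius, decaying
    Tj    <- exp(-((1:M - win)^2) / (2 * sigma^2))  #    neighborhood function T_{j,I(x)}(t)
    W     <- W + eta * Tj * sweep(-W, 2, x, "+")    # 4. w_ji <- w_ji + eta(t) T(t) (x_i - w_ji)
  }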

  19. Self-Organizing Maps (SOM) http://www.sciencedirect.com/science/article/pii/S0014579399005244 Center for Genes, Environment, and Health

  20. Hierarchical Clustering • Divisive – (top down) start with all points in 1 cluster, successively sub-divide the ‘farthest’ points until the full tree is built • Agglomerative – (bottom up) start with each point in its own cluster (singleton), merge the ‘closest’ pair of clusters at each step until the root • Requires a metric to define ‘closest’ – distance is no longer between points, but between clusters • The linkage strategy deciding which clusters to merge is often based on pairwise point comparisons • Dendrogram shows the order of splits Center for Genes, Environment, and Health

  21. Distance Metrics • Euclidean • distance in Euclidean space • Pearson Correlation • linear relationships • Spearman Correlation • monotonic relationships • Mutual Information • non-linear relationships • Polyserial Correlation • correlation of continuous vs ordinal variables (polychoric if ordinal vs ordinal) • Hamming Distance, Jaccard, Dice (binary variables) • Dice is like Jaccard but counts matches twice (2*Matches); both are good when a 0 gives no information Center for Genes, Environment, and Health

  22. Distance Metrics • Euclidean vs Pearson (linear) vs Spearman (monotonic) • Values below are each metric computed between the named series and series A–F; note that Pearson is invariant to slope and Pearson = 0 if the relationship is non-linear
               A     B     C     D     E     F
  A Pearson    1     1    -1   0.8     0     0
  A Spearman   1     1    -1     1     0     0
  A EucDist    0     8     9     6    17    19
  B EucDist    8     0     1     6    22    23
  C Pearson   -1    -1     1  -0.7     0     0
  E Pearson    0     0     0   0.3     1  0.85
  E Spearman   0     0     0     0     1  0.91
  Center for Genes, Environment, and Health

  23. Linkage Methods • Single Linkage: argmin over clusters S,T of min s∈S, t∈T dist(s,t) • Complete Linkage: argmin over S,T of max s∈S, t∈T dist(s,t) • Average Linkage (a.k.a. group average): argmin over S,T of the average over s∈S, t∈T of dist(s,t) • Centroid Linkage: min dist(centroid(S), centroid(T)) (people often mistake this for Average Linkage after the Eisen et al. 1998 TreeView paper!) • Ward’s Linkage (optimizes the same criterion as k-Means) • UPGMA (Unweighted Pair Group Method with Arithmetic Mean) from the Lozupone lecture – assumes a constant rate of evolution; average linkage, Euclidean distance (see the R sketch below) Center for Genes, Environment, and Health
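  A minimal hierarchical-clustering sketch in R using the built-in USArrests data (the same data shown on the following slides); the metric and linkage arguments are the choices that matter:
  d  <- dist(scale(USArrests), method = "euclidean")   # distance metric between points
  hc <- hclust(d, method = "average")                  # linkage: "single", "complete", "average", "centroid", "ward.D2"
  plot(hc)                                             # dendrogram shows the order of merges
  cutree(hc, k = 4)                                    # cut the tree into 4 clusters
  # a common alternative metric for expression data: d <- as.dist(1 - cor(t(X)))  (Pearson-correlation distance)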

  24. R package: mlbench: Machine Learning Benchmark Problems http://stackoverflow.com/questions/4722290/generating-synthetic-datasets Center for Genes, Environment, and Health

  25. (R's built-in USArrests data, first rows, and the loadings of its first two principal components)
              Murder Assault UrbanPop Rape
  Alabama       13.2     236       58 21.2
  Alaska        10.0     263       48 44.5
  Arizona        8.1     294       80 31.0
  Arkansas       8.8     190       50 19.5
  California     9.0     276       91 40.6
  Colorado       7.9     204       78 38.7
           Comp.1 Comp.2
  Murder    -0.53   0.41
  Assault   -0.58   0.18
  UrbanPop  -0.27  -0.87
  Rape      -0.54  -0.16
  Center for Genes, Environment, and Health

  26. (Same USArrests data and first-two-component loadings as on the previous slide.) Center for Genes, Environment, and Health

  27. Center for Genes, Environment, and Health

  28. Choosing the Number of Clusters • Rule of thumb: k ≈ √(n/2) • Elbow or knee method (bend in a plot of the metric vs K; see the sketch below) • k-Means likes spherical clusters, so minimize within-cluster variation W(K) (SSE: sum of distances of all points to their cluster mean), or maximize between-cluster variation B(K) (distance between clusters), or both: CH(K) = [B(K)/(K−1)] / [W(K)/(n−K)] (Calinski & Harabasz 1974) • Gap Statistic: calculate SSE, then randomize the dataset and calculate SSErand, n times; gap = log(mean SSErand / SSE) (Tibshirani, Walther & Hastie 2001) • Hierarchical – plot the distance chosen at each merge (okay for single, complete linkage) See also http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf for a long list of indices, the NbClust R package: http://cedric.cnam.fr/fichiers/art_2579.pdf and http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf Center for Genes, Environment, and Health
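  An elbow-method sketch in R (again on the scaled USArrests data); the range of K is arbitrary:
  wss <- sapply(1:10, function(k) kmeans(scale(USArrests), centers = k, nstart = 20)$tot.withinss)
  plot(1:10, wss, type = "b", xlab = "K", ylab = "W(K): total within-cluster SSE")  # look for the bend
  # the gap statistic is available as clusGap() in the cluster package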

  29. Association Set Mining • Also known as Market Basket Analysis: {milk, eggs} → {butter} • Support of itemset X: supp(X) = # transactions containing itemset X • Confidence of rule: conf(X → Y) = supp(X & Y) / supp(X) • Lift of rule (performance over assuming independence): lift(X → Y) = supp(X & Y) / (supp(X) * supp(Y)) • Want rules with max supp, conf, lift • Other measures found at: http://michael.hahsler.net/research/association_rules/measures.html Center for Genes, Environment, and Health

  30. Association Set Mining • Tables of data converted to transactions by creating binary variables for all categories for all variables (must discretize continuous, missing data okay) { {gender_M=Y, age_child=Y, height_20-29=Y, race_WH=Y, diag_depr=Y}, {gender_M=Y, age_senior=Y, height_60-69=Y, race_BL=Y, diag_copd=Y}, {age_adult=Y, height_50-59=Y,race_AS=Y, diag_obes=Y}, {gender_F=Y, age_adol=Y, height_50-59=Y, race_BL=Y} } Center for Genes, Environment, and Health

  31. Association Set Mining – Example in R: arules pkg, apriori algorithm
      lhs                                   rhs            support confidence  lift
   1  {Class=2nd, Age=Child}             => {Survived=Yes}   0.011      1.000 3.097
   2  {Class=2nd, Sex=Female, Age=Child} => {Survived=Yes}   0.006      1.000 3.096
   3  {Class=1st, Sex=Female}            => {Survived=Yes}   0.064      0.972 3.010
   4  {Class=1st, Sex=Female, Age=Adult} => {Survived=Yes}   0.064      0.972 3.010
   …
   12 {Sex=Female, Survived=Yes}         => {Age=Adult}      0.143      0.918 0.966
   27 {Class=2nd}                        => {Age=Adult}      0.118      0.915 0.963
  Note that rule 2 is subsumed by rule 1, which has better lift (and support) – redundant rules can be removed Center for Genes, Environment, and Health
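  A sketch of how output like the above can be produced with the arules package; the Titanic expansion trick and the support/confidence thresholds are assumptions, not taken from the slide:
  library(arules)
  tita  <- as.data.frame(Titanic)                                                # base-R 4-way table of counts
  trans <- as(tita[rep(seq_len(nrow(tita)), tita$Freq), 1:4], "transactions")    # one row per passenger
  rules <- apriori(trans, parameter = list(supp = 0.005, conf = 0.8))
  inspect(head(sort(rules, by = "lift"), 5))                                     # top rules by lift
  # is.redundant(rules) flags rules subsumed by simpler rules with at least the same confidence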

  32. Probabilistic Graphical Models • (Figure: a taxonomy of models over time, by observability and utility: the Markov Process (MP) with states Xt−1, Xt; adding partial observability (observations Ot−1, Ot) gives the Hidden Markov Model (HMM); adding utility gives the Markov Decision Process (MDP); adding both gives the Partially Observable Markov Decision Process (POMDP).) Center for Genes, Environment, and Health

  33. Hidden Markov Model (HMM) • Finite set of N states X • Finite set of M observations O • Parameter set λ = (A, B, π) • Initial state distribution πi = Pr(X1 = i) • Transition probability aij = Pr(Xt = j | Xt−1 = i) • Emission probability bik = Pr(Ot = k | Xt = i) • Given an observation sequence O = O1, O2, …, On, how do we compute Pr(O | λ)? • Example: N=3, M=2, π = (0.25, 0.55, 0.2), A = 3×3 transition matrix over states 1–3, B = 3×2 emission matrix over obs1, obs2 (values shown in the slide figure) Center for Genes, Environment, and Health

  34. Example: N=3, M=2, π = (0.25, 0.55, 0.2), A and B as on the previous slide • πi = Pr(X1 = i) • aij = Pr(Xt = j | Xt−1 = i) • bik = Pr(Ot = k | Xt = i) • Probability of O is a sum over all state sequences: Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ) = ∑all X πx1 bx1,o1 ax1,x2 bx2,o2 … axT−1,xT bxT,oT • At each t there are N states to reach, so there are N^T possible state sequences and about 2T multiplications per sequence, i.e. O(2T·N^T) operations • So 3 states and a length-10 sequence = 1,180,980 operations, and length 20 ≈ 1e11! • Efficient dynamic programming algorithm: Forward algorithm (Baum & Welch), O(N²T) – see the sketch below Center for Genes, Environment, and Health
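  A sketch of the forward algorithm in R; π is from the slide, but the A and B values are made up because the slide shows them only as a figure:
  forward <- function(init, A, B, obs) {
    N <- length(init); Tn <- length(obs)
    alpha <- matrix(0, Tn, N)
    alpha[1, ] <- init * B[, obs[1]]                      # alpha_1(i) = pi_i * b_i(o_1)
    for (t in 2:Tn)
      alpha[t, ] <- (alpha[t - 1, ] %*% A) * B[, obs[t]]  # sum_i alpha_{t-1}(i) a_ij, times b_j(o_t)
    sum(alpha[Tn, ])                                      # Pr(O | lambda), in O(N^2 T) time
  }
  init <- c(0.25, 0.55, 0.20)                             # pi from the slide
  A <- matrix(c(0.7, 0.2, 0.1,
                0.3, 0.5, 0.2,
                0.2, 0.3, 0.5), nrow = 3, byrow = TRUE)   # assumed transition probabilities
  B <- matrix(c(0.9, 0.1,
                0.5, 0.5,
                0.2, 0.8), nrow = 3, byrow = TRUE)        # assumed emission probabilities (obs1, obs2)
  forward(init, A, B, obs = c(1, 2, 2, 1))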

  35. Applications in Bioinformatics • DNA – motif matching, gene matching, multiple sequence alignment • Amino Acids – domain matching, fold recognition • Microarrays/Whole Genome Sequencing – assign copy number • ChIP-chip/seq – distinct chromatin states Center for Genes, Environment, and Health

  36. Bayesian Networks • Given a set of random variables, the joint probability distribution can be represented by: • Structure: Directed Acyclic Graph (DAG) • variables are nodes, absence of arcs captures conditional independencies • Parameters: Local Conditional Probability Distributions (CPDs) • conditional probability of a variable given the values of its parents in the graph • Joint probability factors into a product of local CPDs: Pr(X1, X2, …, Xn) = ∏i=1..n Pr(Xi | Parents(Xi)) Center for Genes, Environment, and Health

  37. Bayesian Networks • Generally can think of directed arcs as ‘causal’ (be careful!) • If the sprinkler is on OR it is raining, then the grass will be wet: Pr(W|S,R) • If we observe wet grass, we can determine whether it is because of the sprinkler or the rain: Pr(R|W) and Pr(S|W) • Bayes rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y) • Note S and R compete to explain W: this model says sprinkler usage is (a priori) independent of rain, but if we know the grass is wet and it is raining, then it is less likely that the sprinkler being on is the explanation for W • Pr(S|W,R) < Pr(S|W): “explaining away” (see the sketch below) http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html Center for Genes, Environment, and Health
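  A brute-force sketch of “explaining away” by enumerating the sprinkler network’s joint distribution in R; the CPT numbers are assumed for illustration and are not from the slide:
  pS <- 0.3; pR <- 0.2                                        # priors; S and R independent a priori (assumed)
  pW <- function(s, r) c(0.0, 0.8, 0.9, 0.99)[1 + s + 2 * r]  # Pr(W=1 | S, R), assumed values
  joint <- expand.grid(S = 0:1, R = 0:1)
  joint$p <- with(joint, ifelse(S == 1, pS, 1 - pS) *
                         ifelse(R == 1, pR, 1 - pR) * pW(S, R))   # Pr(S, R, W=1)
  pS_given_W  <- sum(joint$p[joint$S == 1]) / sum(joint$p)
  pS_given_WR <- joint$p[joint$S == 1 & joint$R == 1] / sum(joint$p[joint$R == 1])
  c(pS_given_W, pS_given_WR)                                  # Pr(S|W,R) < Pr(S|W): rain explains away the wet grass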

  38. Applications in Bioinformatics • Gene regulatory networks (Friedman et al., 2000, PMID: 11108481) • Determining regulators with PRMs (Segal et al., 2002, RECOMB) • Predicting clinical outcomes using expression data (Gevaert et al., 2006, PMID: 16873470) • Hanalyzer – edge scores (Leach et al., 2009, PMID: 19325874) • Gene function prediction (Troyanskaya et al., 2003, PMID: 12826619) Center for Genes, Environment, and Health

  39. Supervised Learning Center for Genes, Environment, and Health

  40. Supervised Learning • Given examples (x,y) of input features x and output variable y, learn function f(x)=y • Regression (continuous response) vs Classification (discrete response) • Feature selection vs Feature (Dimensionality) Reduction • Cross validation (Leave-One-Out vs N-Fold) • Generalization (Training set error vs Test set error) • Model Selection (AIC, BIC) • Boosting/bagging/jackknife • Missing data and Imputation • Curse of dimensionality Center for Genes, Environment, and Health
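  A minimal N-fold cross-validation sketch in R for the point above; the simulated data and logistic-regression learner are placeholders for any classifier:
  set.seed(1)
  dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
  dat$y <- rbinom(200, 1, plogis(dat$x1 - dat$x2))            # toy binary outcome
  k <- 5
  folds <- sample(rep(1:k, length.out = nrow(dat)))           # random fold assignment
  err <- sapply(1:k, function(i) {
    fit  <- glm(y ~ x1 + x2, data = dat[folds != i, ], family = binomial)  # train on k-1 folds
    pred <- predict(fit, newdata = dat[folds == i, ], type = "response") > 0.5
    mean(pred != (dat$y[folds == i] == 1))                    # error on the held-out fold
  })
  mean(err)                                                   # cross-validated estimate of generalization error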

  41. Supervised Learning • Boosting (weak learners on different subsets) • Train H1 on a random data split; sample among H1’s predictions so the next data set used to train H2 is half wrong, half right under H1. Train H3 on cases where both H1 and H2 are wrong. Return the majority vote of H1, H2, H3 (Adaboost weights examples, weighted vote) • Bagging (bootstrap aggregating) • Train multiple models on random with-replacement (bootstrap) samples of the input data, average the predictions • Jackknife (vs bootstrap) – disjoint subsets of data • Model Selection: balance goodness of fit (likelihood L) with complexity of the model (number of parameters k) for n samples • Bayesian information criterion (BIC): minimize k ln(n) − 2 ln(L) • Akaike information criterion (AIC): minimize 2k − 2 ln(L) (weaker penalty, though arguably better theoretical grounding, than BIC) • Curse of dimensionality – the greater the dimension D, the sparser the data samples are in covering the space, so more and more data is needed to learn properly Center for Genes, Environment, and Health

  42. Decision Boundaries https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks Center for Genes, Environment, and Health

  43. k-Nearest Neighbors • Store a database of (x,y) pairs; classify a new example by majority vote of its k nearest neighbors (regression if you assign the (weighted) mean y in the neighborhood) • No training needed, non-parametric, sensitive to local structure in the data, the most frequent class tends to dominate • Curse of dimensionality if many variables: any query becomes nearly equidistant to all points – reduce features by PCA • Allows complicated boundaries between classes • (Figure: for the green query point, k=3 assigns red, k=5 assigns blue.) Center for Genes, Environment, and Health
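  A k-NN sketch with the class package; iris is used purely as a stand-in dataset:
  library(class)
  set.seed(1)
  tr   <- sample(nrow(iris), 100)                             # random training split
  pred <- knn(train = iris[tr, 1:4], test = iris[-tr, 1:4],
              cl = iris$Species[tr], k = 5)                   # majority vote of the 5 nearest neighbors
  mean(pred == iris$Species[-tr])                             # accuracy on the held-out examples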

  44. Neural Network: Linear Perceptron • Learning (Backpropagation): • Initialize wt, choose learning rate η • 1) Calculate prediction y*j,t = f[wt · xj] • 2) Update weights wt+1 = wt + η(yj − y*j,t)xj • Repeat 1 & 2 until (yj − y*j,t) < threshold • Can be generalized to multi-class • Optimal only if the data are linearly separable • (Figure: step activation function vs a smooth one.) Center for Genes, Environment, and Health
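  A toy R loop implementing the update rule above (step activation; the simulated data are linearly separable by construction):
  set.seed(1)
  X <- cbind(1, matrix(rnorm(100 * 2), ncol = 2))             # prepend a 1 so w includes a bias term
  y <- ifelse(X[, 2] + X[, 3] > 0, 1, 0)                      # linearly separable labels
  w <- rep(0, 3); eta <- 0.1                                  # weights and learning rate
  for (epoch in 1:50) {
    for (j in 1:nrow(X)) {
      yhat <- as.numeric(X[j, ] %*% w > 0)                    # 1) prediction with step activation
      w    <- w + eta * (y[j] - yhat) * X[j, ]                # 2) w <- w + eta (y_j - y*_j) x_j
    }
  }
  mean(as.numeric(X %*% w > 0) == y)                          # training accuracy (1 once converged)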

  45. Neural Network: Multi-Layer Perceptron • Smooth activation function (sigmoid, tanh) instead of a step • Can also have multiple hidden layers • Can learn when the data are not linearly separable • Learn as before, but with backpropagation of errors from the output layer • (Figure: input layer, hidden layer, output layer.) Center for Genes, Environment, and Health

  46. Decision Tree • Node is the attribute tested, branch is the outcome, leaf is the (majority) class (probability) • Discrete: X = xi?  Real-valued: X < value? • Greedy algorithm chooses the best attribute to split upon: • pi = fraction of items labeled i in the set • Gini impurity: IG(p) = ∑i≠j pi pj (prob. an item labeled i is chosen × prob. i is mistakenly assigned class j) • Information gain (entropy): IE(p) = −∑i pi log2 pi • Real-valued response: SSE • EASY TO INTERPRET!!! Can overfit, needs a large tree for XOR, biased in favor of attributes with more levels => ensembles • (Figure: example tree splitting on BIOPSY+, Rx SIDE EFFECT, BREATH>90%, BREATH<30%, with Died/Alive counts at the leaves.) Center for Genes, Environment, and Health
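  A decision-tree sketch with the rpart package (iris again as a stand-in dataset); rpart's default split criterion for classification is Gini impurity:
  library(rpart)
  fit <- rpart(Species ~ ., data = iris, method = "class")    # greedy recursive partitioning
  print(fit)                                                  # attribute tested at each node, class at each leaf
  plot(fit); text(fit)                                        # the tree is easy to read and interpret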

  47. Random Forest • Classifier consisting of an ensemble of decision trees {h(x, Θk)} where Θk is some i.i.d. random vector, and each tree casts a vote for the class of x (Breiman 2001) • 1. Bagging – Θk is a random selection of N samples (with replacement) used to grow the tree • 2. Dietterich 98: Θk is a random split chosen among the n best splits • 3. Ho 98: Θk is a random subset of the features used to grow the tree (√ of the number of features) • 4. Adaboost-like: Θk is random weights on the examples • 4 better than {2,3} better than 1 on generalization error • Out-of-bag estimates: internal estimates of generalization error, classifier strength and correlation between trees Center for Genes, Environment, and Health

  48. Random Forest • Most popular implementation of {h(x, Θk)}: bagging (random subset of samples w/ replacement) + random subset of features • If the set of features is small, the trees are more correlated, so new features can be made as random linear combinations of the original features • Out-of-bag classifier for a specific {x,y} = aggregate over the trees that didn’t use {x,y} as training data (removes the need to set aside test data) • Out-of-bag estimate is the error rate of the out-of-bag classifier on the training set (can also estimate OOB strength and correlation) • Can estimate variable importance from OOB estimates (see the sketch below) • For the m-th variable, permute its values and compare the misclassification rate of the OOB classifiers on the ‘noised-up’ data with OOB on the real data; a large increase implies the m-th variable is important Center for Genes, Environment, and Health
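  A random-forest sketch with the randomForest package (iris as a stand-in); it reports the OOB error and permutation importance described above:
  library(randomForest)
  set.seed(1)
  rf <- randomForest(Species ~ ., data = iris, ntree = 500,
                     mtry = floor(sqrt(4)), importance = TRUE)  # bagging + random feature subsets (sqrt of 4 features)
  rf$err.rate[500, "OOB"]                                       # out-of-bag error estimate (no separate test set needed)
  importance(rf)                                                # permutation-based variable importance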

  49. Support Vector Machine (SVM) • Support vectors are the points that lie closest to the decision surface; maximize the ‘margin’ of the hyperplane separating the examples (the solution changes if the SVs are removed) • Kernel function – maps non-linearly separable data to a transformed space where the transformed data are linearly separable • Advantages: non-probabilistic, optimization rather than greedy search, not affected by local minima, theoretical guarantees of performance, escapes the curse of dimensionality Center for Genes, Environment, and Health

  50. Support Vector Machine (SVM) • Distance between H and H1 is 1/||w||, so to maximize the margin we need to minimize ||w|| = sqrt(∑i wi²) s.t. no points lie between H1 & H2: xi·w + b ≥ +1 when yi = +1, xi·w + b ≤ −1 when yi = −1 (equivalently yi(xi·w + b) ≥ 1) • Quadratic program (constrained optimization, solved via (the dual of) the Lagrangian): max L = ∑i αi − ½ ∑i,j αi αj yi yj xi·xj s.t. w = ∑i αi yi xi and ∑i αi yi = 0 • If not linearly separable, use a transformation to a space where the data are linearly separable, via kernels, i.e. Φ(xi) instead of xi • If the L1-norm is used (not the L2 above), the weights give variable importance http://books.nips.cc/papers/files/nips16/NIPS2003_AA07.pdf http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf Center for Genes, Environment, and Health
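  An SVM sketch with the e1071 package (iris as a stand-in); the radial kernel plays the role of the transformation Φ discussed above:
  library(e1071)
  fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
  table(predict(fit), iris$Species)                           # confusion matrix on the training data
  nrow(fit$SV)                                                # number of support vectors; removing them changes the solution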
