
Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology

Explore and analyze large quantities of data to extract meaningful patterns and discover previously unknown information. This book discusses the importance of data mining in both commercial and scientific fields, with a focus on applications in biology and biomedical informatics.


Presentation Transcript


  1. Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology Vipin Kumar William Norris Professor and Head, Department of Computer Science kumar@cs.umn.edu www.cs.umn.edu/~kumar

  2. What is Data Mining? Many definitions: • Non-trivial extraction of implicit, previously unknown, and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

  3. Why Data Mining? Commercial Viewpoint • Lots of data is being collected and warehoused • Web data: Yahoo! collects 10 GB/hour • Purchases at department/grocery stores: Walmart records ~20 million transactions per day • Bank/credit card transactions • Computers have become cheaper and more powerful • Competitive pressure is strong: provide better, customized services for an edge (e.g., in Customer Relationship Management)

  4. Why Data Mining? Scientific Viewpoint • Data collected and stored at enormous speeds (GB/hour) • Remote sensors on a satellite: NASA EOSDIS archives over 1 petabyte of earth science data per year • Telescopes scanning the skies: sky survey data • Gene expression data • Scientific simulations: terabytes of data generated in a few hours • Data mining may help scientists in automated analysis of massive data sets and in hypothesis formation

  5. Data Mining for Biomedical Informatics • Recent technological advances are helping to generate large amounts of both medical and genomic data • High-throughput experiments/techniques: gene and protein sequences, gene-expression data, biological networks and phylogenetic profiles • Electronic medical records: an IBM–Mayo Clinic partnership has created a database of 5 million patients • Single Nucleotide Polymorphisms (SNPs) • Data mining offers a potential solution for the analysis of large-scale data: automated analysis of patient histories for customized treatment, prediction of the functions of anonymous genes, identification of putative binding sites in protein structures for drug/chemical discovery • [Figure: protein interaction network]

  6. Origins of Data Mining • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems • Traditional techniques may be unsuitable due to the enormity of the data, its high dimensionality, and its heterogeneous, distributed nature • [Figure: data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]

  7. Data Mining Tasks… • Predictive modeling • Clustering • Association rules • Anomaly detection

  8. How can data mining help biologists? • Data mining is particularly effective if the pattern/format of the final knowledge is presumed, as is common in biology: • Protein complexes (clustering and association patterns) • Gene regulatory networks (predictive models) • Protein structure/function (predictive models) • Motifs (association patterns) • We will look at two examples: clustering of ESTs and microarray data, and identifying protein functional modules from protein complexes

  9. Predictive Modeling: Classification • Find a model for the class attribute as a function of the values of the other attributes • Example: a decision tree model for predicting credit worthiness: if Employed = No, predict Not Worthy; if Employed = Yes, check Education: Graduate → Worthy; High School/Undergrad → Worthy only if number of years employed > 7

  10. General Approach for Building a Classification Model • A training set of records (quantitative and categorical attributes plus a class label) is fed to a learning algorithm, which learns a classifier (model) • The model is then applied to a test set for evaluation

  11. Classification Techniques • Base Classifiers • Decision Tree based Methods • Rule-based Methods • Nearest-neighbor • Neural Networks • Naïve Bayes and Bayesian Belief Networks • Support Vector Machines • Ensemble Classifiers • Boosting, Bagging, Random Forests

  12. Examples of Classification Task • Predicting tumor cells as benign or malignant • Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil • Classifying credit card transactions as legitimate or fraudulent • Categorizing news stories as finance, weather, entertainment, sports, etc. • Identifying intruders in cyberspace

  13. Classification for Protein Function Prediction • Features could be motifs, structure, domains, etc. • Important features (motifs, conserved domains, etc.) can be extracted using automated feature extraction techniques and domain knowledge • A classification model can then be built on the extracted features and trained using several annotated proteins • Two proteins sharing significantly common features have a greater probability of having the same function • Several supervised and unsupervised schemes can be applied for this purpose

  14. Constructing a Decision Tree • Key computation: the class counts under each candidate split • Split on Employed: Employed = Yes → Worthy: 4, Not Worthy: 3; Employed = No → Worthy: 0, Not Worthy: 3 • Split on Education: Graduate → Worthy: 2, Not Worthy: 2; High School/Undergrad → Worthy: 2, Not Worthy: 4

  15. Constructing a Decision Tree • [Figure: the training records partitioned into Employed = Yes and Employed = No subsets]
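
The split on slide 14 can be scored with an impurity measure. Below is a minimal sketch (plain Python, not from the slides; the function names are mine) of the weighted Gini index, using the class counts shown there:

```python
# Gini impurity of a node from its class counts, and the weighted Gini
# of a candidate split (lower is better).

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(partitions):
    """Weighted average Gini over the child nodes of a split."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Employed: Yes -> (Worthy 4, Not 3), No -> (Worthy 0, Not 3)
print(split_gini([(4, 3), (0, 3)]))   # ~0.343
# Education: Graduate -> (2, 2), High School/Undergrad -> (2, 4)
print(split_gini([(2, 2), (2, 4)]))   # ~0.467, so Employed is the better split
```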

  16. Design Issues of Decision Tree Induction • How should training records be split? • Method for specifying test condition • depending on attribute types • Measure for evaluating the goodness of a test condition • How should the splitting procedure stop? • Stop splitting if all the records belong to the same class or have identical attribute values • Early termination

  17. Rule-based Classifier (Example) • Classify records using a collection of "if…then…" rules, e.g., for the vertebrate data set: r1: (Gives Birth = no) ∧ (Aerial Creature = yes) → Bird; …; r3: (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammal

  18. Application of Rule-Based Classifier • A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule • The rule r1 covers a hawk => Bird • The rule r3 covers the grizzly bear => Mammal

  19. Ordered Rule Set • Rules are rank-ordered according to their priority • An ordered rule set is known as a decision list • When a test record is presented to the classifier, it is assigned the class label of the highest-ranked rule it triggers • If none of the rules fire, it is assigned to the default class (see the sketch below)
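
A minimal decision-list sketch (plain Python, not from the slides; the rule conditions are hypothetical, echoing the vertebrate example above):

```python
# Decision list: rules are tried in priority order; the first rule whose
# condition covers the record assigns the class, else the default class.

def make_decision_list(rules, default):
    """rules: list of (condition, label) pairs, highest priority first."""
    def classify(record):
        for condition, label in rules:
            if condition(record):          # rule covers the record
                return label
        return default                     # no rule fired
    return classify

rules = [
    (lambda r: r["gives_birth"] == "no" and r["aerial"] == "yes", "Bird"),
    (lambda r: r["gives_birth"] == "yes" and r["blood"] == "warm", "Mammal"),
]
classify = make_decision_list(rules, default="Unknown")
print(classify({"gives_birth": "yes", "aerial": "no", "blood": "warm"}))  # Mammal
```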

  20. Nearest Neighbor Classifiers • Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck • Compute the distance from the test record to the training records and choose the k "nearest" records

  21. Nearest-Neighbor Classifiers • Requires three things • The set of stored records • Distance metric to compute distance between records • The value of k, the number of nearest neighbors to retrieve • To classify an unknown record: • Compute distance to other training records • Identify k nearest neighbors • Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
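
A minimal k-nearest-neighbor sketch built from those three ingredients (plain Python; Euclidean distance and majority vote are the usual defaults, and the toy data is made up):

```python
import math
from collections import Counter

def knn_classify(test, training, k):
    """training: list of (feature_vector, label) pairs; majority vote of k nearest."""
    def dist(a, b):                               # Euclidean distance metric
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(training, key=lambda rec: dist(rec[0], test))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B"), ((4.8, 5.1), "B")]
print(knn_classify((1.1, 1.0), data, k=3))        # "A"
```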

  22. Nearest Neighbor Classification… • Choosing the value of k: • If k is too small, sensitive to noise points • If k is too large, neighborhood may include points from other classes

  23. Bayes Classifier • A probabilistic framework for solving classification problems • Conditional probability: P(Y | X) = P(X, Y) / P(X) • Bayes theorem: P(Y | X) = P(X | Y) P(Y) / P(X)

  24. Using Bayes Theorem for Classification • Approach: • compute posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem • Maximum a-posteriori: Choose Y that maximizes P(Y | X1, X2, …, Xd) • Equivalent to choosing value of Y that maximizes P(X1, X2, …, Xd|Y) P(Y) • How to estimate P(X1, X2, …, Xd | Y )?

  25. Naïve Bayes Classifier • Assume independence among attributes Xi when the class is given: P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj) • Can estimate P(Xi | Yj) for all Xi and Yj from data • A new point is classified to Yj if P(Yj) ∏i P(Xi | Yj) is maximal
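
A minimal naïve Bayes sketch for categorical attributes (plain Python, not from the slides; it estimates P(Y) and P(Xi | Y) by simple counting, with no smoothing of zero counts):

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Estimate P(Y) and P(Xi | Y) from categorical training data."""
    prior = Counter(labels)
    cond = defaultdict(Counter)            # (attribute index, class) -> value counts
    for x, y in zip(records, labels):
        for i, v in enumerate(x):
            cond[(i, y)][v] += 1
    n = len(labels)

    def classify(x):
        best, best_score = None, -1.0
        for y, ny in prior.items():
            score = ny / n                 # P(Y)
            for i, v in enumerate(x):      # times the product of P(Xi | Y)
                score *= cond[(i, y)][v] / ny
            if score > best_score:
                best, best_score = y, score
        return best
    return classify

X = [("yes", "grad"), ("yes", "hs"), ("no", "grad"), ("no", "hs")]
y = ["worthy", "worthy", "not", "not"]
print(train_naive_bayes(X, y)(("yes", "grad")))    # "worthy"
```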

  26. Artificial Neural Networks (ANN) • Training an ANN means learning the weights of the neurons • [Figure: common activation functions]

  27. Perceptron • Single-layer network: contains only input and output nodes • Activation function: g = sign(w · x) • Learning: initialize the weights (w0, w1, …, wd); repeat: for each training example (xi, yi), compute f(w, xi) and update the weights, w ← w + λ (yi − f(w, xi)) xi; until stopping condition is met
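
A minimal perceptron sketch following the loop above (plain Python; the learning rate and the AND-function toy data are my choices):

```python
def sign(z):
    return 1 if z >= 0 else -1

def train_perceptron(examples, lr=0.1, epochs=100):
    """examples: list of (x, y) pairs with y in {-1, +1}; returns weights."""
    w = [0.0] * (len(examples[0][0]) + 1)          # w[0] is the bias weight
    for _ in range(epochs):
        converged = True
        for x, y in examples:
            xb = (1.0,) + tuple(x)                 # prepend the bias input
            f = sign(sum(wi * xi for wi, xi in zip(w, xb)))
            if f != y:                             # update only on mistakes
                w = [wi + lr * (y - f) * xi for wi, xi in zip(w, xb)]
                converged = False
        if converged:                              # stopping condition met
            break
    return w

# AND is linearly separable, so the algorithm converges on it
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
print(train_perceptron(data))
```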

  28. Example of Perceptron Learning

  29. Perceptron Learning Rule • Since f(w, x) is a linear combination of the input variables, the decision boundary is linear • For nonlinearly separable problems, the perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly; hence, more powerful learning techniques are needed

  30. Characteristics of ANN • Multilayer ANNs are universal approximators • Can handle redundant attributes because the weights are learned automatically • Gradient descent may converge to a local minimum • Model building can be very time consuming, but testing can be very fast

  31. Support Vector Machines • Find a linear hyperplane (decision boundary) that will separate the data

  32. Support Vector Machines • Which hyperplane is better, B1 or B2? How do you define "better"?

  33. Support Vector Machines • Find the hyperplane that maximizes the margin => B1 is better than B2

  34. Support Vector Machines
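
A maximum-margin sketch using scikit-learn (an assumption on my part; the slides name no library). A linear-kernel SVC with a large C approximates the hard-margin hyperplane on separable data:

```python
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],       # class 0
     [5, 5], [6, 5], [5, 6]]       # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # the hyperplane w . x + b = 0
print(clf.predict([[2, 2], [5, 4]]))
```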

  35. Evaluating Classifiers • Confusion matrix:

                        Predicted = +           Predicted = −
      Actual = +        a (TP, true positive)   b (FN, false negative)
      Actual = −        c (FP, false positive)  d (TN, true negative)

  36. Accuracy • Most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

  37. Problem with Accuracy • Consider a 2-class problem • Number of Class 0 examples = 9990 • Number of Class 1 examples = 10 • If a model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9% • This is misleading because the model does not detect any Class 1 example • Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)

  38. Alternative Measures • Precision p = a / (a + c) • Recall (TPR) r = a / (a + b) • F-measure F = 2rp / (r + p) = 2a / (2a + b + c)
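
A minimal sketch computing these measures from confusion-matrix counts (plain Python; the counts in the example call are made up):

```python
def precision_recall_f1(tp, fn, fp, tn):
    p = tp / (tp + fp)             # precision: a / (a + c)
    r = tp / (tp + fn)             # recall (TPR): a / (a + b)
    f1 = 2 * p * r / (p + r)       # F-measure: harmonic mean of p and r
    return p, r, f1

print(precision_recall_f1(tp=40, fn=10, fp=10, tn=9940))   # (0.8, 0.8, 0.8)
```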

  39. ROC (Receiver Operating Characteristic) • A graphical approach for displaying the trade-off between detection rate and false alarm rate • Developed in the 1950s in signal detection theory to analyze noisy signals • An ROC curve plots TPR against FPR • The performance of a model is represented as a point on the ROC curve • Changing the threshold parameter of the classifier changes the location of the point

  40. ROC Curve • Points (TPR, FPR): (0, 0) = declare everything to be the negative class; (1, 1) = declare everything to be the positive class; (1, 0) = ideal • Diagonal line: random guessing • Below the diagonal line: prediction is opposite of the true class

  41. ROC (Receiver Operating Characteristic) • To draw an ROC curve, the classifier must produce continuous-valued output • The outputs are used to rank test records, from the record most likely to be positive to the record least likely to be positive • Many classifiers produce only discrete outputs (i.e., the predicted class); decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, and SVMs can all be adapted to produce continuous-valued scores
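
A minimal sketch of the ranking procedure (plain Python, not from the slides): sort the test records by score and sweep the threshold, emitting one (FPR, TPR) point per record:

```python
def roc_points(scores, labels):
    """scores: continuous classifier outputs; labels: 1 = positive, 0 = negative."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)   # most likely positive first
    tp = fp = 0
    points = [(0.0, 0.0)]              # threshold above every score
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points                      # ends at (1, 1): everything positive

print(roc_points([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 0]))
```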

  42. Methods for Classifier Evaluation • Holdout: reserve k% for training and (100−k)% for testing • Random subsampling: repeated holdout • Cross-validation: partition data into k disjoint subsets; k-fold: train on k−1 partitions, test on the remaining one; leave-one-out: k = n • Bootstrap: sampling with replacement • .632 bootstrap: acc_boot = (1/b) Σi [0.632 · εi + 0.368 · acc_s], where εi is the accuracy on the i-th bootstrap sample and acc_s is the accuracy on the full training set
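
A minimal k-fold cross-validation index generator (plain Python, not from the slides; the round-robin fold assignment is one simple choice, and data is usually shuffled first in practice):

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    folds = [list(range(i, n, k)) for i in range(k)]     # round-robin partition
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Train on k-1 partitions, test on the remaining one; k = n gives leave-one-out
for train, test in k_fold_indices(n=6, k=3):
    print(train, test)
```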

  43. Methods for Comparing Classifiers • Given two models: • Model M1: accuracy = 85%, tested on 30 instances • Model M2: accuracy = 75%, tested on 5000 instances • Can we say M1 is better than M2? • How much confidence can we place on accuracy of M1 and M2? • Can the difference in performance measure be explained as a result of random fluctuations in the test set?
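
One common way to answer these questions is a binomial confidence interval on each accuracy. A rough sketch (plain Python, normal approximation; z = 1.96 gives roughly 95% confidence):

```python
import math

def acc_confidence_interval(acc, n, z=1.96):
    """Approximate confidence interval for an accuracy measured on n test instances."""
    half = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half

print(acc_confidence_interval(0.85, 30))     # M1: roughly (0.72, 0.98) -- wide
print(acc_confidence_interval(0.75, 5000))   # M2: roughly (0.74, 0.76) -- narrow
```

Because the two intervals overlap, the apparent 10-point gap could well be a random fluctuation of M1's small test set.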

  44. Data Mining Book For further details and sample chapters see www.cs.umn.edu/~kumar/dmbook
