
Learning with Ambiguously Labeled Training Data


Presentation Transcript


  1. Learning with Ambiguously Labeled Training Data Kshitij Judah Ph.D. student Advisor: Prof. Alan Fern Qualifier Oral Presentation EECS, OSU

  2. Outline • Introduction and Motivation • Problem Definition • Learning with Partial Labels • Semi-Supervised Learning • Multiple-Instance Learning • Conclusion

  3. Introduction and Motivation • Supervised Learning [Figure: supervised-learning pipeline in which the training data D and a loss function feed a learning algorithm that outputs f, which then labels a test point (from CS534 Machine Learning slides)]

  4. Introduction and Motivation • Obtaining labeled training data is difficult • Reasons: • Requires substantial human effort (Information Retrieval) • Requires expensive tests (Medical Diagnosis, Remote Sensing) • Disagreement among experts (Information Retrieval, Remote Sensing)

  5. Introduction and Motivation • Reasons: • Specifying a single correct class is not possible, but pointing out incorrect classes is possible (Medical Diagnosis) • Labeling is not possible at the instance level (Drug Activity Prediction) • Objective: Present the space of learning problems and algorithms that deal with ambiguously labeled training data

  6. Space of Learning Problems [Figure: taxonomy tree of learning problems, split into Direct Supervision (Labels) vs. Indirect Supervision (No Labels) and, under direct supervision, Deterministic vs. Probabilistic Labels; leaves include Supervised Learning, Semi-Supervised Learning, Learning with Partial Labels, Multiple-Instance Learning, Reinforcement Learning, Session-based Learning, and Unsupervised Learning]

  7. Problem Definition • Let X = X_1 × … × X_n denote an instance space, with X_i being the domain of the ith feature • An instance x is a vector (x_1, …, x_n) • Y denotes a discrete-valued class variable with domain {y_1, …, y_m}, the set of all classes • The training data D is a set of examples (x_i, c_i), where c_i is a vector of length m such that the kth entry is 1 if y_k is a possible class of x_i, otherwise 0 • In case of an unambiguously labeled x_i, exactly one entry in c_i is 1

  8. Problem Definition • In case of ambiguous labeling, more than one entry can be 1 • The learning task is to select, using D, an optimal hypothesis h from a hypothesis space H • h is used to classify a new instance x • Optimality refers to expected classification accuracy

  9. Learning with Partial Labels

  10. Problem Definition • Each training instance x_i has a set C_i of candidate class labels associated with it • C_i is assumed to contain the true label of x_i • Different from multiple-label learning, because here each instance belongs to exactly one class • Applications: medical diagnosis, bioinformatics, information retrieval

  11. Taxonomy of Algorithms for Learning with Partial Labels • Probabilistic Generative Approaches • Mixture Models + EM • Probabilistic Discriminative Approaches • Logistic Regression for Partial Labels

  12. Probabilistic Generative Approaches • Learn the joint probability distribution P(X, Y) • P(X, Y) can be used either to model the data or to perform classification • P(X, Y) is learned using P(X, Y) = P(X | Y) P(Y), i.e., from the class-conditional distribution and the class prior • Classification of x is performed using Bayes rule, P(Y | x) = P(x | Y) P(Y) / P(x), and assigning x to the class that maximizes P(Y | x)
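As a concrete illustration of classification by Bayes rule, here is a minimal sketch of a generative classifier with 1-D Gaussian class-conditionals; the model choice, data, and function names are illustrative assumptions, not from the slides:

```python
import numpy as np

# Illustrative sketch: fit P(x | y) as a 1-D Gaussian per class and P(y)
# from class frequencies, then classify with Bayes rule argmax_y P(x|y)P(y).

def fit(X, y):
    """Estimate per-class mean, std, and prior from labeled data."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(), Xc.std() + 1e-9, len(Xc) / len(X))
    return params

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def predict(params, x):
    """Assign x to the class maximizing P(x | y) P(y)."""
    scores = {c: gaussian_pdf(x, mu, s) * prior
              for c, (mu, s, prior) in params.items()}
    return max(scores, key=scores.get)

X = np.array([0.9, 1.1, 1.0, 4.0, 4.2, 3.8])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit(X, y)
print(predict(model, 1.2))   # lands near the class-0 cluster
print(predict(model, 3.9))   # lands near the class-1 cluster
```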

  13. Probabilistic Generative Approaches • Some examples of generative models are • Gaussian Distribution • Mixture Models e.g. Gaussian mixture model • Directed Graphical Models e.g. Naïve Bayes, HMMs • Undirected Graphical Models e.g. Markov Networks • Applications: Robotics, speech recognition, computer vision, forecasting, time series prediction etc.

  14. Mixture Models • In a mixture model, the probability density function of x is given by: p(x | Θ) = Σ_{k=1}^{K} α_k p_k(x | θ_k), where the mixing weights α_k are non-negative and sum to 1 • Model parameters Θ = (α_1, …, α_K, θ_1, …, θ_K) are often learned using maximum likelihood estimation (MLE)

  15. Mixture Models • In the MLE approach, assuming the instances in D are independently and identically distributed (i.i.d.), the joint density of D is given by p(D | Θ) = ∏_{i=1}^{N} p(x_i | Θ) • The goal is to find the parameters Θ* that maximize the log of the above quantity: Θ* = argmax_Θ Σ_{i=1}^{N} log p(x_i | Θ)

  16. Expectation-Maximization Algorithm (EM) • Difficult to optimize directly: a log of sums • Incomplete-data likelihood function: the generating component is unknown • Convert to a complete-data likelihood by introducing a random variable z_i which tells, for each x_i, which component generated it • If we observe z, the complete-data log-likelihood log p(D, z | Θ) becomes a sum of per-component terms that is easy to optimize

  17. Expectation-Maximization Algorithm (EM) • We do not, however, observe z directly; it is a hidden variable • Use EM to handle hidden variables • EM is an iterative algorithm with two basic steps: • E-Step: Given the current estimate Θ^old of the model parameters and the data D, compute the expected complete-data log-likelihood Q(Θ, Θ^old) = E[log p(D, z | Θ) | D, Θ^old], where the expectation is with respect to the marginal distribution of z (i.e., of each z_i)

  18. Expectation-Maximization Algorithm (EM) • M-Step: Maximize the quantity Q(Θ, Θ^old) computed in the E-step: Θ^new = argmax_Θ Q(Θ, Θ^old) • Repeat until convergence.

  19. Mixture Models + EM for Partial Labels • C_i is the set of possible mixture components that could have generated the sample x_i • EM is modified so that z_i is restricted to take values from C_i • Modified E-Step: P(z_i = k | x_i, Θ) ∝ α_k p_k(x_i | θ_k) if k ∈ C_i, and 0 otherwise
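The modified E-step can be sketched as follows, using 1-D Gaussian components on toy data (the variable names and data are illustrative assumptions): responsibilities are computed as usual, zeroed for components outside each example's candidate set, and renormalized.

```python
import numpy as np

# Toy sketch of EM for partial labels with 1-D Gaussian components.
# `candidates` is an (N, K) 0/1 mask: candidates[i, k] = 1 iff k is in C_i.

def em_partial(X, candidates, n_comp, n_iter=50):
    mu = np.linspace(X.min(), X.max(), n_comp)
    sigma = np.ones(n_comp)
    alpha = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: unnormalized responsibilities (constant factor cancels),
        # restricted to each example's candidate components
        dens = alpha * np.exp(-0.5 * ((X[:, None] - mu) / sigma) ** 2) / sigma
        dens = dens * candidates                  # zero out k not in C_i
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: standard weighted Gaussian updates
        Nk = resp.sum(axis=0)
        mu = (resp * X[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-6
        alpha = Nk / len(X)
    return mu, sigma, alpha

X = np.array([0.0, 0.2, -0.1, 5.0, 5.2, 4.9, 2.5])
# first three samples can only come from component 0, the next three only
# from component 1, and the last one is ambiguous (either component)
C = np.array([[1, 0]] * 3 + [[0, 1]] * 3 + [[1, 1]], dtype=float)
mu, sigma, alpha = em_partial(X, C, n_comp=2)
print(np.round(mu, 1))
```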

  20. Pros and Cons of Generative Models • Pros • Clear probabilistic framework • Effective when model assumptions are correct • Cons • Complex, with many parameters • Difficult to specify and verify their correctness • Unlabeled data can hurt performance if the model is incorrect • Parameter estimation is difficult • EM can get stuck in local maxima and is computationally expensive • No guarantee of better classification

  21. Discriminative Approaches • Discriminative approaches attempt to directly learn the mapping required for classification • Simpler than generative models • More focused toward classification or regression tasks • Two types: • Probabilistic: learn the posterior class probability P(Y | X) • Non-probabilistic: learn a classifier f: X → Y directly • Applications: text classification, information retrieval, machine perception, time series prediction, bioinformatics etc.

  22. Logistic Regression (From CS534 Machine Learning slides) • Learns the conditional class distribution P(Y | X) • For the two-class case Y ∈ {0, 1}, the probabilities are given as: P(Y = 1 | x) = 1 / (1 + exp(−(w·x + b))) and P(Y = 0 | x) = 1 − P(Y = 1 | x) • For the 0-1 loss function LR is a linear classifier • For the multi-class case, we learn weight parameters for each class
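The two-class model can be sketched in a few lines; the gradient-ascent fit and the toy data are illustrative assumptions, not from the slides:

```python
import numpy as np

# Minimal sketch of two-class logistic regression:
# P(Y=1 | x) = 1 / (1 + exp(-(w.x + b))), P(Y=0 | x) = 1 - P(Y=1 | x).
# Weights are fit by gradient ascent on the log-likelihood.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.5, n_iter=2000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        # gradient of the average log-likelihood
        w += lr * X.T @ (y - p) / len(X)
        b += lr * (y - p).mean()
    return w, b

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_lr(X, y)
p = sigmoid(X @ w + b)
print(np.round(p, 2))   # increases with x; decision boundary near x = 1.5
```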

  23. Maximum Entropy Logistic Regression • Learns P(Y | X) such that it has maximum entropy while being consistent with the partial labels in D • For an ambiguously labeled instance, P(Y | x) is pushed close to a uniform distribution over the classes in the candidate set and zero everywhere else • Presence of unlabeled data results in a P(Y | X) with low discrimination capacity • Disadvantages: • Does not use unlabeled data to enhance learning • Instead uses unlabeled data in an unfavorable way that decreases the discrimination capacity of P(Y | X)

  24. Minimum Commitment Logistic Regression • Learns P(Y | X) such that the classes belonging to the candidate set are predicted with high total probability • No preference is made among distributions that satisfy the above property • Presence of unlabeled data has no effect on the discrimination capacity of P(Y | X) • Disadvantage: • Does not use unlabeled data to enhance learning

  25. Self-Consistent Logistic Regression • Learns P(Y | X) such that the entropy of the predicted distribution is minimized • Learning is driven by the predictions produced by the learner itself • Advantage: • Can make use of unlabeled data to enhance learning • Disadvantage: • If the initial estimate is incorrect, the algorithm will go down the wrong path

  26. Semi-Supervised Learning

  27. Problem Definition • Semi-supervised learning (SSL) is a special case of learning with partial labels • D = D_L ∪ D_U, where D_L is the labeled and D_U the unlabeled data • For each x_i in D_L, C_i = {y_i}, the true class label of x_i • For each x_i in D_U, C_i = Y, the set of all classes • Usually |D_U| ≫ |D_L|, because unlabeled data is easily available compared to labeled data, e.g., in information retrieval • SSL approaches make use of D_U in conjunction with D_L to enhance learning • SSL has been applied to many domains like text classification, image classification, audio classification etc.

  28. Taxonomy of Algorithms for Semi-Supervised Learning • Probabilistic Generative Approaches • Mixture Models + EM • Discriminative Approaches • Semi-supervised support vector machines (S3VMs) • Self Training • Co-Training

  29. Mixture Models + EM for SSL • Nigam et al. proposed a probabilistic generative framework for document classification • Assumptions: (1) Documents are generated by a mixture model (2) One-to-one correspondence between mixture components and classes • The probability of a document x is given as: P(x | Θ) = Σ_j P(c_j | Θ) P(x | c_j, Θ) • P(x | c_j, Θ) was modeled using Naïve Bayes

  30. Mixture Models + EM for SSL • Θ is estimated using MLE by maximizing the log likelihood of D = D_L ∪ D_U: l(Θ | D) = Σ_{x_i ∈ D_L} log P(y_i | Θ) P(x_i | y_i, Θ) + Σ_{x_i ∈ D_U} log Σ_j P(c_j | Θ) P(x_i | c_j, Θ) • EM is used to maximize l(Θ | D), treating the class labels of D_U as hidden variables • Applications: Text classification, remote sensing, face orientation discrimination, pose estimation, audio classification

  31. Mixture Models + EM for SSL • Problem: Unlabeled data decreased performance • Reason: Incorrect modeling assumption • One mixture component per class: Class includes multiple sub-topics • Solutions: • Decrease effect of unlabeled data • Correct modeling assumptions: Multiple mixture components per class

  32. SVMs [Figure: linearly separable data, Class +1 vs. Class -1, with the maximum-margin separating hyperplane and the margin marked]

  33. SVMs • Idea: Do the following things: A. Maximize the margin B. Classify points correctly C. Keep data points outside the margin • SVM optimization problem: min_{w,b} (1/2)||w||² + C Σ_i ξ_i subject to y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

  34. SVMs • A. is achieved by minimizing ||w||² • B. and C. are achieved using the hinge loss max(0, 1 − y_i(w·x_i + b)) • C controls the trade-off between training error and generalization error (From the SSL tutorial by Xiaojin Zhu during ICML 2007)

  35. S3VMs • Also known as Transductive SVMs (From the SSL tutorial by Xiaojin Zhu during ICML 2007)

  36. S3VMs • Idea: Do the following things: A. Maximize the margin B. Correctly classify the labeled data C. Keep data points (labeled and unlabeled) outside the margin • S3VM optimization problem: min_{w,b} (1/2)||w||² + C_1 Σ_{x_i ∈ D_L} max(0, 1 − y_i(w·x_i + b)) + C_2 Σ_{x_j ∈ D_U} max(0, 1 − |w·x_j + b|)

  37. S3VMs • A. is achieved by minimizing ||w||² • B. is achieved using the hinge loss • C. is achieved for labeled data using the hinge loss and for unlabeled data using the hat loss max(0, 1 − |w·x_j + b|) (From the SSL tutorial by Xiaojin Zhu during ICML 2007)
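The two losses can be sketched directly in terms of the classifier output f(x) = w·x + b (a toy illustration, not from the tutorial):

```python
import numpy as np

# Sketch of the two losses in the S3VM objective: hinge loss for labeled
# points and "hat" loss for unlabeled points.

def hinge(y, fx):
    """Labeled: penalize margin violations, i.e. y * f(x) < 1."""
    return np.maximum(0.0, 1.0 - y * fx)

def hat(fx):
    """Unlabeled: penalize points inside the margin, |f(x)| < 1,
    regardless of which side of the boundary they fall on."""
    return np.maximum(0.0, 1.0 - np.abs(fx))

fx = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(hinge(+1, fx))   # [3.  1.5 1.  0.5 0. ]
print(hat(fx))         # [0.  0.5 1.  0.5 0. ]
```

Note how the hat loss is symmetric around f(x) = 0: an unlabeled point is penalized for being near the boundary, which is exactly what pushes the S3VM decision boundary into low-density regions.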

  38. S3VMs • Pros • Same mathematical foundations as SVMs and hence applicable to tasks where SVMs are applicable • Cons • Optimization is difficult; finding the exact solution is NP-hard • Applications: text classification, bioinformatics, biological entity recognition, image retrieval etc.

  39. Self Training • A learner uses its own predictions to teach itself • Self-training algorithm: • Train a classifier h using D_L • Use h to predict labels of D_U • Add the most confidently labeled instances in D_U to D_L • Repeat until a stopping criterion is met
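The loop above can be sketched as follows, using a nearest-centroid base classifier purely for illustration (any classifier that exposes a confidence score works; all names and data here are assumptions):

```python
import numpy as np

# Toy self-training sketch. Each round, the most confident predictions on
# the unlabeled pool are promoted to the labeled set and the model is refit.

def centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_conf(cents, X):
    """Predicted class plus a confidence: the gap between the distances
    to the two nearest class centroids."""
    classes = sorted(cents)
    d = np.stack([np.linalg.norm(X - cents[c], axis=1) for c in classes], axis=1)
    pred = np.array(classes)[d.argmin(axis=1)]
    conf = np.sort(d, axis=1)[:, 1] - d.min(axis=1)
    return pred, conf

def self_train(XL, yL, XU, per_round=2, rounds=3):
    XL, yL, XU = XL.copy(), yL.copy(), XU.copy()
    for _ in range(rounds):
        if len(XU) == 0:
            break
        pred, conf = predict_conf(centroids(XL, yL), XU)
        top = np.argsort(-conf)[:per_round]       # most confident instances
        XL = np.vstack([XL, XU[top]])
        yL = np.concatenate([yL, pred[top]])
        XU = np.delete(XU, top, axis=0)
    return XL, yL

XL = np.array([[0.0, 0.0], [5.0, 5.0]])
yL = np.array([0, 1])
XU = np.array([[0.2, 0.1], [4.8, 5.1], [0.4, 0.3], [5.2, 4.9], [2.6, 2.4]])
XL2, yL2 = self_train(XL, yL, XU)
print(yL2)
```

Note that the ambiguous point near (2.6, 2.4) is labeled last, when the classifier has already been reinforced by the easy points; this ordering is exactly what makes self-training work when it works, and what lets early mistakes compound when it does not.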

  40. Self Training • Pros • One of the simplest SSL algorithms • Can be used with any classifier • Cons • Errors in initial predictions can build up and affect subsequent learning • A possible solution: identify and remove mislabeled examples from the self-labeled training set during each iteration, e.g., the SETRED algorithm [Ming Li and Zhi-Hua Zhou] • Applications: object detection in images, word sense disambiguation, semantic role labeling, named entity recognition

  41. Co-Training • Idea: Two classifiers h1 and h2 collaborate with each other during learning • h1 and h2 use disjoint subsets X1 and X2 of the feature set • X1 and X2 are each assumed to be: • Sufficient for the classification task at hand • Conditionally independent of the other given the class

  42. Co-Training • Example: Web page classification • X1 can be the text present on the web page • X2 can be the anchor text on pages that link to this page • The two feature sets are sufficient for classification and conditionally independent given the class of the web page, because each is generated by a different person who knows the class

  43. Co-Training • Co-training algorithm: • Use the X1 view of D_L to train h1 and the X2 view to train h2 • Use h1 to predict labels of D_U • Use h2 to predict labels of D_U • Add h1's most confidently labeled instances to h2's training set • Add h2's most confidently labeled instances to h1's training set • Repeat until a stopping criterion is met
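A toy sketch of this loop, using one feature column per view and a nearest-centroid rule as the base classifier (all names and data are illustrative assumptions):

```python
import numpy as np

# Toy co-training sketch. Two "views" are two disjoint feature columns;
# each round, each classifier's single most confident prediction on the
# unlabeled pool is added to the OTHER classifier's training data.

def fit_view(x, y):
    return {c: x[y == c].mean() for c in np.unique(y)}

def predict_view(cents, x):
    classes = sorted(cents)
    d = np.stack([np.abs(x - cents[c]) for c in classes], axis=1)
    pred = np.array(classes)[d.argmin(axis=1)]
    conf = np.abs(d[:, 0] - d[:, 1])     # distance gap = confidence
    return pred, conf

def co_train(X, y, XU, rounds=2):
    L1 = (X[:, 0].copy(), y.copy())      # per-view labeled pools
    L2 = (X[:, 1].copy(), y.copy())
    XU = XU.copy()
    for _ in range(rounds):
        if len(XU) == 0:
            break
        p1, c1 = predict_view(fit_view(*L1), XU[:, 0])
        p2, c2 = predict_view(fit_view(*L2), XU[:, 1])
        i, j = c1.argmax(), c2.argmax()
        # h1 teaches h2 and vice versa
        L2 = (np.append(L2[0], XU[i, 1]), np.append(L2[1], p1[i]))
        L1 = (np.append(L1[0], XU[j, 0]), np.append(L1[1], p2[j]))
        XU = np.delete(XU, list({i, j}), axis=0)
    return L1, L2

X = np.array([[0.0, 0.1], [5.0, 5.2]])
y = np.array([0, 1])
XU = np.array([[0.3, 0.2], [4.9, 5.0], [2.4, 2.6]])
L1, L2 = co_train(X, y, XU)
print(L1[1], L2[1])
```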

  44. Co-Training • Pros • Effective when the required feature split exists • Can be used with any classifier • Cons • What if a feature split does not exist? Answer: (1) Use a random feature split (2) Use the full feature set • How to select confidently labeled instances? Answer: Use multiple classifiers instead of two and do majority voting, e.g., tri-training uses three classifiers [Zhou and Li]; democratic co-learning uses multiple classifiers [Zhou and Goldman] • Applications: web page classification, named entity recognition, semantic role labeling

  45. Multiple-Instance Learning

  46. Problem Definition • A set of objects O_1, …, O_n • Each object O_i has multiple variants called instances • B_i = {x_{i1}, …, x_{im}}, the set of instances of O_i, is called a bag • A bag is positive if at least one of its instances is positive, else negative • Goal: learn a classifier f that maps a bag to a class label

  47. Why is MIL an ambiguous-label problem? • To learn f over bags, we learn a classifier g over instances • g maps an instance to a class • D does not contain labels at the instance level

  48. Taxonomy of Algorithms for Multiple-Instance Learning • Probabilistic Generative Approaches • Diverse Density Algorithm (DD) • EM Diverse Density (EM-DD) • Probabilistic Discriminative Approaches • Multiple-instance logistic regression • Non-probabilistic Discriminative Approaches • Axis-Parallel Rectangles (APR) • Multiple-instance SVM (MISVM)

  49. Diverse Density Algorithm (DD) [Figure: 2-D scatter plot of instances from several bags (digits marking bag membership), with two candidate points A and B marked]

  50. Diverse Density Algorithm (DD) • Diverse density is a probabilistic quantity: DD(t) = ∏_i P(t | B_i^+) ∏_j P(t | B_j^−), which is high at points t that are close to instances from many different positive bags and far from all instances of negative bags • Using gradient ascent, the point of maximum diverse density can be located • Cons: Computationally expensive • Applications: person identification, stock selection, natural scene classification
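The noisy-or form of diverse density (following Maron and Lozano-Pérez; a hedged reconstruction, since the slide's exact equation was lost in extraction) can be sketched as:

```python
import numpy as np

# Sketch of noisy-or diverse density for a candidate concept point t.
# A point scores high when it is close to at least one instance in EVERY
# positive bag and far from ALL instances of the negative bags.
# P(x in concept t) is modeled as exp(-||x - t||^2).

def dd(t, pos_bags, neg_bags):
    score = 1.0
    for bag in pos_bags:                  # P(t | B_i^+), noisy-or over bag
        d2 = ((bag - t) ** 2).sum(axis=1)
        score *= 1.0 - np.prod(1.0 - np.exp(-d2))
    for bag in neg_bags:                  # P(t | B_j^-)
        d2 = ((bag - t) ** 2).sum(axis=1)
        score *= np.prod(1.0 - np.exp(-d2))
    return score

pos = [np.array([[0.0, 0.0], [3.0, 3.0]]),   # each positive bag has one
       np.array([[0.1, 0.1], [4.0, 1.0]])]   # instance near the origin
neg = [np.array([[3.0, 3.0], [4.0, 1.0]])]
print(dd(np.zeros(2), pos, neg) > dd(np.array([3.0, 3.0]), pos, neg))  # True
```

The origin scores higher because every positive bag has an instance there while no negative instance is nearby; this is the quantity the DD algorithm climbs with gradient ascent.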
