
Machine Learning basics



  1. Machine Learning basics David Kauchak, CS457 Fall 2011

  2. Admin • Assignment 4 • How’d it go? • How much time? • Assignment 5 • Last assignment! • “Written” part due Friday • Rest, due next Friday • Read article for discussion on Thursday

  3. Final project ideas • spelling correction • part of speech tagger • text chunker • dialogue generation • pronoun resolution • compare word similarity measures (more than the ones we’re looking at for assign. 5) • word sense disambiguation • machine translation • compare sentence alignment techniques • information retrieval • information extraction • question answering • summarization • speech recognition

  4. Final project ideas • pick a text classification task • evaluate different machine learning methods • implement a machine learning method • analyze different feature categories • n-gram language modeling • implement and compare other smoothing techniques • implement alternative models • parsing • PCFG-based language modeling • lexicalized PCFG (with smoothing) • true n-best list generation • parse output reranking • implement another parsing approach and compare • parsing non-traditional domains (e.g. twitter) • EM • word-alignment for text-to-text translation • grammar induction

  5. Word similarity • Four general categories • Character-based • turned vs. truned • cognates (night, nacht, nicht, natt, nat, noc, noch) • Semantic web-based (e.g. WordNet) • Dictionary-based • Distributional similarity-based • similar words occur in similar contexts

  6. Corpus-based approaches [table: words — aardvark, …, beagle, …, dog — each paired with a corpus blurb in which it occurs]

  7. Corpus-based The Beagle is a breed of small to medium-sized dog. A member of the Hound Group, it is similar in appearance to the Foxhound but smaller, with shorter leg … Beagles are intelligent, and are popular as pets because of their size, even temper, and lack of inherited health problems. Dogs of similar size and purpose to the modern Beagle can be traced in Ancient Greece[2] back to around the 5th century BC. From medieval times, beagle was used as a generic description for the smaller hounds, though these dogs differed considerably from the modern breed. In the 1840s, a standard Beagle type was beginning to develop: the distinction between the North Country Beagle and Southern

  8. Corpus-based: feature extraction • We’d like to utilize our vector-based approach • How could we create a vector from these occurrences? • collect word counts from all documents containing the word • collect word counts from all sentences containing the word • collect word counts from all words within X words of the word • collect word counts from words in specific relationships: subject-object, etc. The Beagle is a breed of small to medium-sized dog. A member of the Hound Group, it is similar in appearance to the Foxhound but smaller, with shorter leg

  9. Word-context co-occurrence vectors The Beagle is a breed of small to medium-sized dog. A member of the Hound Group, it is similar in appearance to the Foxhound but smaller, with shorter leg … Beagles are intelligent, and are popular as pets because of their size, even temper, and lack of inherited health problems. Dogs of similar size and purpose to the modern Beagle can be traced in Ancient Greece[2] back to around the 5th century BC. From medieval times, beagle was used as a generic description for the smaller hounds, though these dogs differed considerably from the modern breed. In the 1840s, a standard Beagle type was beginning to develop: the distinction between the North Country Beagle and Southern

  10. Word-context co-occurrence vectors The Beagle is a breed … Beagles are intelligent, and … to the modern Beagle can be traced … From medieval times, beagle was used as … 1840s, a standard Beagle type was beginning → the: 2, is: 1, a: 2, breed: 1, are: 1, intelligent: 1, and: 1, to: 1, modern: 1, … Often do some preprocessing like lowercasing and removing stop words
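
The window-based extraction above can be sketched in a few lines of Python (a minimal illustration, assuming whitespace tokenization; the function name, window size, and punctuation handling are ours, not from the slides):

```python
from collections import Counter

def context_vector(target, sentences, window=2, stopwords=()):
    """Count every word within `window` positions of each occurrence of target."""
    counts = Counter()
    for sentence in sentences:
        tokens = [t.strip('.,"').lower() for t in sentence.split()]  # crude preprocessing
        for i, tok in enumerate(tokens):
            if tok == target:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i and tokens[j] not in stopwords:
                        counts[tokens[j]] += 1
    return counts

docs = ["The Beagle is a breed of small to medium-sized dog.",
        "Beagles are intelligent, and are popular as pets."]
print(context_vector("beagle", docs))  # Counter({'the': 1, 'is': 1, 'a': 1})
```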

  11. Corpus-based similarity sim(dog, beagle) = sim(context_vector(dog),context_vector(beagle)) the: 5 is: 1 a: 4 breeds: 2 are: 1 intelligent: 5 … the: 2 is: 1 a: 2 breed: 1 are: 1 intelligent: 1 and: 1 to: 1 modern: 1 …
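
One common way to instantiate sim() here is cosine similarity over the sparse count vectors (a sketch of one standard choice; the course may use a different measure):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts/Counters)."""
    dot = sum(c * v.get(f, 0) for f, c in u.items())
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

dog = Counter({'the': 5, 'is': 1, 'a': 4, 'breeds': 2, 'are': 1, 'intelligent': 5})
beagle = Counter({'the': 2, 'is': 1, 'a': 2, 'breed': 1, 'are': 1, 'intelligent': 1})
print(round(cosine(dog, beagle), 3))  # ≈ 0.85: largely overlapping contexts
```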

  12. Another feature weighting • TFIDF weighting takes into account the general importance of a feature • For distributional similarity, we have the feature (fi), but we also have the word itself (w) that we can use for information sim(context_vector(dog),context_vector(beagle)) the: 5 is: 1 a: 4 breeds: 2 are: 1 intelligent: 5 … the: 2 is: 1 a: 2 breed: 1 are: 1 intelligent: 1 and: 1 to: 1 modern: 1 …

  13. Another feature weighting Feature weighting ideas given this additional information? sim(context_vector(dog),context_vector(beagle)) the: 5 is: 1 a: 4 breeds: 2 are: 1 intelligent: 5 … the: 2 is: 1 a: 2 breed: 1 are: 1 intelligent: 1 and: 1 to: 1 modern: 1 …

  14. Another feature weighting • count how likely feature fi and word w are to occur together • incorporates co-occurrence • but also incorporates how often w and fi occur in other instances sim(context_vector(dog),context_vector(beagle)) the: 5 is: 1 a: 4 breeds: 2 are: 1 intelligent: 5 … the: 2 is: 1 a: 2 breed: 1 are: 1 intelligent: 1 and: 1 to: 1 modern: 1 …

  15. Mutual information • A bit more probability: I(X; Y) = Σx Σy p(x,y) log [ p(x,y) / (p(x) p(y)) ] • When will this be high and when will this be low?

  16. Mutual information • A bit more probability: I(X; Y) = Σx Σy p(x,y) log [ p(x,y) / (p(x) p(y)) ] • if x and y are independent (i.e. one occurring doesn’t impact the other occurring), then p(x,y) = p(x)p(y), every log term is log 1 = 0, and the sum is 0 • if they’re dependent, then p(x,y) = p(x)p(y|x) = p(y)p(x|y), and inside the log we get p(y|x)/p(y) (i.e. how much more likely are we to see y given that x has a particular value), or vice versa p(x|y)/p(x)

  17. Point-wise mutual information • Mutual information: how related are two variables (i.e. summed over all possible values/events) • Point-wise mutual information: how related are two particular events/values: PMI(x, y) = log [ p(x,y) / (p(x) p(y)) ]

  18. PMI weighting • Mutual information is often used for feature selection in many problem areas • PMI weighting weights co-occurrences based on their association (i.e. high PMI) • in context_vector(beagle) — the: 2, is: 1, a: 2, breed: 1, are: 1, intelligent: 1, and: 1, to: 1, modern: 1, … — a ubiquitous feature like ‘the’ would likely be weighted lower, while a distinctive feature like ‘breed’ would likely be weighted higher
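
A sketch of PMI reweighting under one plausible counting convention (the count definitions and normalization here are our assumptions; conventions vary across systems):

```python
import math

def pmi_weight(context_counts, feature_totals, word_total, grand_total):
    """Replace raw co-occurrence counts with PMI(word, feature).

    PMI(w, f) = log [ p(w, f) / (p(w) p(f)) ], probabilities estimated from
    counts: context_counts maps feature -> count with this word, feature_totals
    maps feature -> corpus count, word_total / grand_total are overall counts.
    """
    weighted = {}
    for f, n in context_counts.items():
        p_wf = n / grand_total
        p_w = word_total / grand_total
        p_f = feature_totals[f] / grand_total
        weighted[f] = math.log(p_wf / (p_w * p_f))
    return weighted

# invented numbers: 'the' is common everywhere, 'breed' is distinctive
print(pmi_weight({'the': 2, 'breed': 1}, {'the': 5000, 'breed': 10},
                 word_total=20, grand_total=10000))
# 'the' gets a negative weight (≈ -1.61); 'breed' a strongly positive one (≈ 3.91)
```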

  19. The mind-reading game How good are you at guessing random numbers? Repeat 100 times: Computer guesses whether you’ll type 0/1 You type 0 or 1 http://seed.ucsd.edu/~mindreader/ [written by Y. Freund and R. Schapire]

  20. The mind-reading game The computer is right much more than half the time…

  21. The mind-reading game The computer is right much more than half the time… Strategy: computer predicts next keystroke based on the last few (maintains weights on different patterns) There are patterns everywhere… even in “randomness”!
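
The actual game (by Freund and Schapire) maintains weights on different patterns; the simpler frequency-based sketch below, with an assumed pattern length k, illustrates the "predict the next keystroke from the last few" idea:

```python
from collections import defaultdict, Counter

def mind_reader(keystrokes, k=3):
    """Guess each 0/1 keystroke from what followed the last k keys so far."""
    followers = defaultdict(Counter)   # pattern -> counts of the next key
    recent, correct = "", 0
    for key in keystrokes:
        seen = followers[recent]
        guess = seen.most_common(1)[0][0] if seen else "0"
        correct += (guess == key)
        followers[recent][key] += 1    # learn from the actual keystroke
        recent = (recent + key)[-k:]
    return correct / len(keystrokes)

print(mind_reader("0101101011011010110"))  # typically well above 0.5 on human "random" input
```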

  22. Why machine learning? • Lots of data • Hand-written rules just don’t do it • Performance is much better than what people can do • Why not just study machine learning? • Domain knowledge/expertise is still very important • What types of features to use • What models are important

  23. Machine learning problems • Lots of different types of problems • What data is available: • Supervised, unsupervised, semi-supervised, reinforcement learning • How are we getting the data: • online vs. offline learning • Type of model: • generative vs. discriminative • parametric vs. non-parametric • SVM, NB, decision tree, k-means • What are we trying to predict: • classification vs. regression

  24. Unsupervised learning Unsupervised learning: given data, but no labels

  25. Unsupervised learning • Much easier to get our hands on unlabeled data • Examples: • learn clusters/groups without any label • learn grammar probabilities without trees • learn HMM probabilities without labels • Because there is no label, we can often get odd results • an unsupervised grammar learned this way often has little relation to a linguistically motivated grammar • may cluster bananas/apples or green/red/yellow

  26. Supervised learning Supervised learning: given labeled data [images grouped into BANANAS and APPLES]

  27. Supervised learning • Given labeled examples, learn to label unlabeled examples [image: a new fruit — APPLE or BANANA?] Supervised learning: learn to classify unlabeled examples

  28. Supervised learning: training • Labeled data: (data, label) pairs, e.g. labels 0/1 • train a predictive model on the labeled data → model

  29. Supervised learning: testing/classifying • Unlabeled data → model → predicted labels (e.g. 1, 0, 0, 1, 0)

  30. Feature based learning Training or learning phase: raw data + labels → extract features → labeled feature vectors (f1, f2, f3, …, fm) → train a predictive model → classifier

  31. Feature based learning Testing or classification phase: raw data → extract features → feature vectors (f1, f2, f3, …, fm) → classifier → predicted labels

  32. Feature examples Raw data Features?

  33. Feature examples Raw data: Clinton said banana repeatedly last week on tv, “banana, banana, banana” Features: (1, 1, 1, 0, 0, 1, 0, 0, …) over a vocabulary including clinton, said, california, across, tv, wrong, capital, banana, … — occurrence of words

  34. Feature examples Raw data: Clinton said banana repeatedly last week on tv, “banana, banana, banana” Features: (4, 1, 1, 0, 0, 1, 0, 0, …) over a vocabulary including clinton, said, california, across, tv, wrong, capital, banana, … — frequency of word occurrence

  35. Feature examples Raw data: Clinton said banana repeatedly last week on tv, “banana, banana, banana” Features: (1, 1, 1, 0, 0, 1, 0, 0, …) over bigrams such as clinton said, said banana, california schools, across the, tv banana, wrong way, capital city, banana repeatedly, … — occurrence of bigrams
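
A sketch of these three feature extractors (the tokenizer and example vocabularies are illustrative, and the slide's vector values are schematic, so the outputs below reflect this code rather than the slides):

```python
from collections import Counter

def tokenize(text):
    return [t.strip('.,"').lower() for t in text.split() if t.strip('.,"')]

def word_features(text, vocab):
    """Occurrence (binary) and frequency feature vectors over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [min(counts[w], 1) for w in vocab], [counts[w] for w in vocab]

def bigram_features(text, bigram_vocab):
    """Binary occurrence features over a fixed bigram vocabulary."""
    toks = tokenize(text)
    present = set(zip(toks, toks[1:]))
    return [1 if bg in present else 0 for bg in bigram_vocab]

text = 'Clinton said banana repeatedly last week on tv, "banana, banana, banana"'
vocab = ['clinton', 'said', 'california', 'across', 'tv', 'wrong', 'capital', 'banana']
occ, freq = word_features(text, vocab)
print(occ)   # [1, 1, 0, 0, 1, 0, 0, 1]
print(freq)  # [1, 1, 0, 0, 1, 0, 0, 4]
print(bigram_features(text, [('clinton', 'said'), ('said', 'banana')]))  # [1, 1]
```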

  36. Lots of other features • POS: occurrence, counts, sequence • Constituents • Whether ‘V1agra’ occurred 15 times • Whether ‘banana’ occurred more times than ‘apple’ • If the document has a number in it • … • Features are very important, but we’re going to focus on the models today

  37. Power of feature-based methods • General purpose: in any domain where we can represent a data point as a set of features, we can use the method

  38. The feature space [plot: documents as points in a feature space with axes f1 and f2, falling into Government, Science, and Arts regions]

  39. The feature space [plot: Spam vs. not-Spam examples as points in a feature space with axes f1, f2, f3]

  40. Feature space • Each example (f1, f2, f3, …, fm) is a point in an m-dimensional space • How big will m be for us?

  41. Bayesian Classification • We represent a data item based on its features • Training: for each label/class, learn a probability distribution based on the features, e.g. one distribution for label a and one for label b

  42. Bayesian Classification • We represent a data item based on its features • Classifying: given a new example, classify it as the label with the largest conditional probability

  43. Bayes rule for classification p(Label | Data) = p(Data | Label) p(Label) / p(Data) — the conditional (posterior) probability on the left, the prior probability p(Label) on the right • Why not model p(Label | Data) directly?
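
In code, classification via Bayes rule is an argmax in which the constant p(Data) can be dropped (a sketch; `prior` and `log_likelihood` are placeholder names for models learned from data):

```python
import math

def classify(features, prior, log_likelihood):
    """Return the label maximizing p(label) * p(features | label).

    p(Data) is the same for every label, so Bayes rule lets us ignore it.
    `prior` maps label -> p(label); log_likelihood(features, label) returns
    log p(features | label).
    """
    return max(prior, key=lambda lab: math.log(prior[lab]) + log_likelihood(features, lab))
```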

  44. Bayesian classifiers • Bayes rule gives two models to learn for each label/class: p(Data | Label), with different distributions for different labels, and the prior p(Label)

  45. The Naive Bayes Classifier • Conditional independence assumption: features are independent of each other given the class: p(f1, …, fm | label) = p(f1 | label) ⋯ p(fm | label) • assume binary features for now • [diagram: class node spam with feature nodes buy, enlargement, the, now, viagra]

  46. Estimating parameters • p(‘v1agra’ | spam) • p(‘the’ | spam) • p(‘enlargement’ | not-spam) • … • For us: estimate p(fi | label) for every feature/label pair from the training data

  47. Maximum likelihood estimates • p(label) = (number of items with label) / (total number of items) • p(fi | label) = (number of items with label that have feature fi) / (number of items with label)
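
These maximum likelihood counts translate directly into code (a minimal sketch with set-of-words features and no smoothing, which a real system would add; the example data is invented):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Maximum likelihood estimates from (set-of-features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)          # label -> feature -> count
    for features, label in examples:
        label_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
    total = sum(label_counts.values())
    prior = {lab: n / total for lab, n in label_counts.items()}
    cond = {lab: {f: c / label_counts[lab] for f, c in fc.items()}
            for lab, fc in feature_counts.items()}
    return prior, cond

examples = [({'buy', 'viagra', 'now'}, 'spam'),
            ({'the', 'meeting', 'now'}, 'not-spam'),
            ({'buy', 'enlargement'}, 'spam')]
prior, cond = train_naive_bayes(examples)
print(prior['spam'])        # 2/3: two of the three items are spam
print(cond['spam']['buy'])  # 1.0: 'buy' appears in both spam items
```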

  48. Naïve Bayes Text Classification • Features: word occurring in a document (though others could be used…) • Does the Naïve Bayes assumption hold? • Are word occurrences independent given the label? • Lots of text classification problems • sentiment analysis: positive vs. negative reviews • category classification • spam

  49. Naive Bayes on spam email http://www.cnbc.cmu.edu/~jp/research/email.paper.pdf

  50. Linear models • A linear model predicts the label based on a weighted, linear combination of the features • For two classes, a linear model can be viewed as a plane (hyperplane) in the feature space [plot: a hyperplane separating two classes in (f1, f2, f3) space]
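
For two classes, prediction reduces to checking which side of the hyperplane w·f + b = 0 a point falls on (a minimal sketch; the weights are arbitrary, since the slides don't cover training yet):

```python
def predict(features, weights, bias=0.0):
    """Two-class linear model: sign of the weighted feature combination."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if score >= 0 else 0

print(predict([1.0, 0.0, 3.0], weights=[0.5, -1.2, 0.1]))  # score 0.8 -> class 1
```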
