
Word representations: A simple and general method for semi-supervised learning


Presentation Transcript


  1. Word representations: A simple and general method for semi-supervised learning. Joseph Turian, with Lev Ratinov and Yoshua Bengio. Goodies: http://metaoptimize.com/projects/wordreprs/

  2. Supervised training: sup data → sup model

  3. Supervised training: sup data → sup model. Semi-sup training?

  4. Supervised training: sup data → sup model. Semi-sup training?

  5. Semi-sup training? Sup data + more feats → sup model

  6. Per task: sup data + more feats → sup model (sup task 1); sup data + more feats → sup model (sup task 2)

  7. Joint semi-sup: unsup data + sup data → semi-sup model

  8. Unsup pretraining: unsup data → unsup model; semi-sup fine-tuning: unsup model + sup data → semi-sup model

  9. Unsup training: unsup data → unsup model → unsup feats

  10. Unsup training: unsup data → unsup feats; sup training: unsup feats + sup data → semi-sup model

  11. Unsup training: unsup data → unsup feats, reused across sup task 1, sup task 2, sup task 3

  12. What unsupervised featuresare most useful in NLP?

  13. Natural language processing • Words, words, words • Words, words, words • Words, words, words

  14. How do we handle words? • Not very well

  15. “One-hot” word representation • |V| = |vocabulary|, e.g. 50K for PTB2 • Diagram: one-hot word -1, word 0, word +1 (3*|V| dims) → (3*|V|) x m weights → m units → Pr dist over labels

  16. One-hot word representation • 85% of vocab words occur as only 10% of corpus tokens • Bad estimate of Pr(label | rare word) • Diagram: one-hot word 0 (|V| dims) → |V| x m weights → m units
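
A minimal Python sketch (illustrative, not the authors' code) of this one-hot input: each word in a 3-word window becomes a |V|-dimensional indicator vector and the window is their concatenation; the toy vocabulary is an assumption.

    import numpy as np

    vocab = {"the": 0, "cat": 1, "sat": 2}            # toy vocab; real |V| is ~50K
    V = len(vocab)

    def one_hot(word):
        v = np.zeros(V)
        v[vocab[word]] = 1.0                          # single 1 at the word's index
        return v

    # Concatenate one-hot vectors for word -1, word 0, word +1: the sparse
    # 3*|V|-dim input consumed by the (3*|V|) x m weight matrix on the slide.
    window = ["the", "cat", "sat"]
    x = np.concatenate([one_hot(w) for w in window])
    print(x.shape)                                    # (3*V,)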

  17. Approach

  18. Approach • Manual feature engineering

  19. Approach • Manual feature engineering

  20. Approach • Induce word reprs over large corpus, unsupervised • Use word reprs as word features for supervised task

  21. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs

  22. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs

  23. Distributional representations • Matrix F with W rows (W = size of vocab) and C context columns • e.g. F(w, v) = Pr(v follows word w), or F(w, v) = Pr(v occurs in same doc as w)

  24. Distributional representations • Reduce F with a map g: g(F) = f, taking each C-dim row down to d dims, C >> d • e.g. g = LSI/LSA, LDA, PCA, ICA, random transform
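
A minimal sketch (an assumption, not the paper's pipeline) of a distributional representation: count how often v follows w to fill F, then apply truncated SVD as one choice of g to get d-dimensional rows.

    import numpy as np

    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]   # toy corpus
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}

    # F[w, v] counts how often v follows w (one possible context feature).
    F = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for w, v in zip(sent, sent[1:]):
            F[idx[w], idx[v]] += 1.0

    d = 2                                    # d << C
    U, S, Vt = np.linalg.svd(F, full_matrices=False)
    f = U[:, :d] * S[:d]                     # LSA-style d-dim word reprs
    print(f.shape)                           # (|vocab|, d)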

  25. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs

  26. Class-based word repr • |C| classes, hard clustering • Diagram: one-hot word 0 plus its class (|V|+|C| dims) → (|V|+|C|) x m weights → m units

  27. Class-based word repr • Hard vs. soft clustering • Hierarchical vs. flat clustering

  28. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Brown (hard, hierarchical) clustering • HMM (soft, flat) clustering • Distributed word reprs

  29. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Brown (hard, hierarchical) clustering • HMM (soft, flat) clustering • Distributed word reprs

  30. Brown clustering • Hard, hierarchical class-based LM • Brown et al. (1992) • Greedy technique for maximizing bigram mutual information • Merge words by contextual similarity

  31. Brown clustering • cluster(chairman) = '0010' • 2-prefix(cluster(chairman)) = '00' • (image from Terry Koo)

  32. Brown clustering • Hard, hierarchical class-based LM • 1000 classes • Use cluster prefixes of length 4, 6, 10, 20
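
A minimal sketch (illustrative) of turning a Brown cluster bit-string into prefix features; shorter prefixes correspond to coarser clusters higher in the hierarchy.

    def brown_prefix_features(bitstring, lengths=(4, 6, 10, 20)):
        # A prefix longer than the path simply falls back to the full bit-string.
        return {f"brown-pre{n}": bitstring[:n] for n in lengths}

    print(brown_prefix_features("1010001100"))
    # {'brown-pre4': '1010', 'brown-pre6': '101000',
    #  'brown-pre10': '1010001100', 'brown-pre20': '1010001100'}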

  33. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Brown (hard, hierarchical) clustering • HMM (soft, flat) clustering • Distributed word reprs

  34. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs

  35. Distributed word repr • k- (low) dimensional, dense representation • “word embedding” matrix E of size |V| x k • Diagram: word 0 → k-dim embedding → k x m weights → m units

  36. Sequence labeling w/ embeddings • “word embedding” matrix E of size |V| x k, tied weights • Diagram: word -1, word 0, word +1 → concatenated embeddings (3*k dims) → (3*k) x m weights → m units
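
A minimal sketch (sizes are illustrative assumptions) of the dense input above: look up word -1, word 0, word +1 in one tied embedding matrix E and concatenate before the (3*k) x m layer.

    import numpy as np

    V, k, m = 1000, 50, 20                   # illustrative sizes
    rng = np.random.default_rng(0)
    E = rng.normal(size=(V, k))              # "word embedding" matrix, |V| x k

    window_ids = [17, 3, 842]                # indices of word -1, word 0, word +1
    x = E[window_ids].reshape(-1)            # concatenated 3*k-dim dense input
    W = rng.normal(size=(3 * k, m))          # (3*k) x m weights
    h = np.tanh(x @ W)                       # m hidden units
    print(h.shape)                           # (20,)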

  37. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs • Collobert + Weston (2008) • HLBL embeddings (Mnih + Hinton, 2007)

  38. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs • Collobert + Weston (2008) • HLBL embeddings (Mnih + Hinton, 2007)

  39. Collobert + Weston 2008 • Ranking criterion: score(w1 w2 w3 w4 w5) > μ + score(corrupted window) • Diagram: 5 words x 50-dim embeddings (50*5) → 100 hidden units → 1 output score
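
A minimal sketch (not C&W's implementation; the sizes and the middle-word corruption rule are stated assumptions) of the ranking criterion: a real window must out-score a corrupted window by a margin μ.

    import numpy as np

    def score(window_embs, W1, w2):
        x = np.concatenate(window_embs)      # 5 words * 50 dims = 250-dim input
        return float(np.tanh(x @ W1) @ w2)   # 100 hidden units -> 1 scalar score

    def ranking_loss(score_real, score_corrupt, mu=1.0):
        # Zero once score_real > mu + score_corrupt.
        return max(0.0, mu + score_corrupt - score_real)

    rng = np.random.default_rng(0)
    W1, w2 = rng.normal(size=(250, 100)), rng.normal(size=100)
    real = [rng.normal(size=50) for _ in range(5)]
    corrupt = list(real)
    corrupt[2] = rng.normal(size=50)         # replace the middle word
    print(ranking_loss(score(real, W1, w2), score(corrupt, W1, w2)))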

  40. 50-dim embeddings: Collobert + Weston (2008) t-SNE vis by van der Maaten + Hinton (2008)

  41. Less sparse word reprs? • Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs • Collobert + Weston (2008) • HLBL embeddings (Mnih + Hinton, 2007)

  42. Log-bilinear language model (LBL) • Linear prediction of w5 from the embeddings of w1 w2 w3 w4

  43. HLBL • HLBL = hierarchical (fast) training of LBL • Mnih + Hinton (2009)
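
A minimal sketch (illustrative, not the HLBL code) of the LBL prediction step: predict w5's embedding as a linear function of the embeddings of w1..w4, then score every vocabulary word by similarity to that prediction; HLBL speeds up this per-word scoring with a hierarchy.

    import numpy as np

    V, k = 1000, 50
    rng = np.random.default_rng(0)
    E = rng.normal(size=(V, k))                          # word embeddings
    C = [rng.normal(size=(k, k)) for _ in range(4)]      # one matrix per context position

    def predict_next_embedding(context_ids):
        # Linear prediction of w5's embedding from w1..w4.
        return sum(E[i] @ C[p] for p, i in enumerate(context_ids))

    r_hat = predict_next_embedding([5, 17, 42, 7])
    logits = E @ r_hat                                   # similarity to every word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(probs.argmax())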

  44. Approach • Induce word reprs over large corpus, unsupervised • Brown: 3 days • HLBL: 1 week, 100 epochs • C&W: 4 weeks, 50 epochs • Use word reprs as word features for supervised task

  45. Unsupervised corpus • RCV1 newswire • 40M tokens (vocab = all 270K types)

  46. Supervised Tasks • Chunking (CoNLL, 2000) • CRF (Sha + Pereira, 2003) • Named entity recognition (NER) • Averaged perceptron (linear classifier) • Based upon Ratinov + Roth (2009)

  47. Unsupervised word reprs as features • Word = “the” • Embedding = [-0.2, …, 1.6] • Brown cluster = 1010001100 • (cluster 4-prefix = 1010, cluster 6-prefix = 101000, …)

  48. Unsupervised word reprs as features • Orig X = {pos-2=“DT”: 1, word-2=“the”: 1, ...} • X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1} • X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …, word-2-dim49: 1.6, ...}
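
A minimal sketch (the helper and feature-name scheme are illustrative) of building these feature dictionaries: keep the sparse one-hot features and append cluster-prefix and embedding-dimension features for each position.

    def add_word_repr_features(feats, rel_pos, word, brown_cluster, embedding):
        feats[f"word{rel_pos}={word}"] = 1.0
        for n in (4, 6):                                 # cluster prefix features
            feats[f"class{rel_pos}-pre{n}={brown_cluster[:n]}"] = 1.0
        for d, val in enumerate(embedding):              # dense embedding-dimension features
            feats[f"word{rel_pos}-dim{d:02d}"] = val
        return feats

    X = {"pos-2=DT": 1.0}
    add_word_repr_features(X, -2, "the", "1010001100", [-0.2, 1.6])
    print(X)
    # {'pos-2=DT': 1.0, 'word-2=the': 1.0, 'class-2-pre4=1010': 1.0,
    #  'class-2-pre6=101000': 1.0, 'word-2-dim00': -0.2, 'word-2-dim01': 1.6}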

  49. Embeddings: Normalization • E ← σ * E / stddev(E), where σ is a scaling hyperparameter
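
A minimal sketch of this normalization; σ is a scaling hyperparameter, and the 0.1 default here is only an example value.

    import numpy as np

    def normalize_embeddings(E, sigma=0.1):
        # Rescale so the embedding matrix's overall standard deviation equals sigma.
        return sigma * E / E.std()

    E = np.random.default_rng(0).normal(size=(1000, 50))
    print(normalize_embeddings(E).std())     # ~0.1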

  50. Embeddings: Normalization (Chunking)
