LING 696B: Graph-based methods and Supervised learning


Presentation Transcript


  1. LING 696B: Graph-based methods and Supervised learning

  2. Road map • Types of learning problems: • Unsupervised: clustering, dimension reduction -- Generative models • Supervised: classification (today) -- Discriminative models • Methodology: • Parametric: stronger assumptions about the distribution (blobs, mixture model) • Non-parametric: weaker assumptions (neural nets, spectral clustering, Isomap)

  3. Puzzle from several weeks ago • How do people learn categories from distributions? (Liberman et al., 1952)

  4. Graph-based non-parametric methods • “Learn locally, think globally” • Local learning: each point learns its neighbors, which produces a graph that reveals the underlying structure • The graph is then used to reveal global structure in the data • Isomap: geodesic distance through shortest paths • Spectral clustering: connected components from the graph spectrum (see demo)
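
A minimal sketch of the local-learning step, not taken from the course demo: build a k-nearest-neighbor graph with numpy. The choice of k, the Euclidean metric, and the toy circle data are all assumptions for illustration.

```python
# Sketch of "learn locally": connect each point to its k nearest neighbors.
# k, the Euclidean metric, and the toy data are illustrative assumptions.
import numpy as np

def knn_graph(X, k=5):
    """Symmetric 0/1 adjacency matrix of the k-nearest-neighbor graph."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(d2[i])[1:k + 1]   # skip index 0 (the point itself)
        W[i, neighbors] = 1.0
    return np.maximum(W, W.T)                    # keep an edge if either endpoint chose it

# Toy manifold: noisy points on a circle (a 1-D curve embedded in 2-D)
theta = np.random.rand(200) * 2 * np.pi
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.randn(200, 2)
W = knn_graph(X, k=5)
```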

  5. Clustering as a graph partitioning problem • Normalized-cut problem: split the graph into two parts A and B so that • Neither part is too small • The edges being cut don’t carry too much weight • Intuitively: minimize the total weight on edges from A to B, normalized by the total weight on edges within A (and likewise for B)

  6. Normalized cut through spectral embedding • The exact normalized-cut solution is NP-hard (explodes for large graphs) • A “soft” version is solvable: look for coordinates x1, …, xN for the nodes that minimize Σij Wij (xi - xj)^2, where W is the neighborhood (weight) matrix • Strongly connected nodes stay nearby, weakly connected nodes stay far away • Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- spectral embedding
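
A sketch of the embedding step, assuming the weight matrix W from the previous sketch (or any symmetric weight matrix). It uses the unnormalized Laplacian for simplicity; the normalized-cut relaxation proper uses a degree-normalized variant.

```python
# Spectral embedding sketch: coordinates come from the bottom eigenvectors of
# the graph Laplacian L = D - W (unnormalized here, as a simplification).
import numpy as np

def spectral_embedding(W, dim=2):
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]             # skip the trivial constant eigenvector

def two_way_cut(W):
    """Approximate two-way partition: sign of the second-smallest
    eigenvector (the Fiedler vector)."""
    fiedler = spectral_embedding(W, dim=1)[:, 0]
    return fiedler > 0                    # boolean cluster labels
```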

  7. Is this relevant to how people learn categories? • Maye & Gerken: learning a bi-modal distribution on a curve (living in an abstract manifold) from /d/ to /(s)t/ • Mixture model: transform the signal, and approximate with two “dynamic blobs” • Can people learn categories from arbitrary manifolds following a “local learning” strategy? • Simple case: start from a uniform distribution (see demo)
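
As a hedged illustration of the “two dynamic blobs” idea (not Maye & Gerken's actual stimuli or model), here is a two-component Gaussian mixture fitted to a synthetic bimodal continuum; scikit-learn and the VOT-like numbers are assumptions.

```python
# Illustrative only: fit two "blobs" to a synthetic bimodal 1-D continuum.
# The VOT-like values and scikit-learn are assumptions, not the original demo.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
cue = np.r_[rng.normal(20, 5, 200), rng.normal(60, 5, 200)].reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(cue)
print(gmm.means_.ravel())     # roughly the two modes (~20 and ~60)
labels = gmm.predict(cue)     # hard category assignment for each token
```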

  8. Local learning from graphs • Can people learn categories from arbitrary manifolds following a “local learning” strategy? • Most likely not • What constrains the kinds of manifolds that people can learn? • What are the reasonable metrics people use? • How does neighborhood size affect this type of learning? • What about learning from non-uniform distributions?

  9. Switching gears • Supervised learning: learning a function from input-output pairs • Arguably, something that people also do • Example: the perceptron • Learns a function f(x) = sign(<w, x> + b) • Also called a “classifier”: a machine with a yes/no output
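
A minimal perceptron sketch for the function above; the learning rate, epoch count, and labels in {-1, +1} are illustrative assumptions.

```python
# Perceptron sketch for f(x) = sign(<w, x> + b), trained from labeled pairs
# (x, y) with y in {-1, +1}. Learning rate and epochs are arbitrary choices.
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:    # misclassified: nudge toward the example
                w += lr * yi * xi
                b += lr * yi
    return w, b

def classify(x, w, b):
    return np.sign(w @ x + b)             # the yes/no output
```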

  10. Speech perception as a classification problem • Speech perception is viewed as a bottom-up procedure involving many decisions • E.g. sonorant/consonant, voiced/voiceless • See Peter’s presentation • A long-standing effort to build machines that do the same • Stevens’ view of distinctive features

  11. Knowledge-based speech recognition • Mainstream method: • Front end: uniform signal representation • Back end: hidden Markov models • Knowledge-based method: • Front end: sound-specific features based on acoustic knowledge • Back end: a series of decisions on how lower-level knowledge is integrated

  12. The conceptual framework from Liu (1996) and others • [Flow-chart figure; the intermediate steps are bypassed in Stevens (2002)] • Each step is hard work

  13. Implications of the flow-chart architecture • Requires accurate low-level decisions • Mistakes can build up very quickly • Thought experiment: “linguistic” speech recognition through a sequence of distinctive-feature classifiers • Hand-crafted decision rules are often not robust or flexible • Hence the need for good statistical classifiers

  14. An unlikely marriage • Recent years have seen several sophisticated classification machines • Example: the support vector machine (Vapnik) -- today’s topic • Interest has been moving from neural nets to these new machines • Many have proposed integrating the new classifiers as a back end • Niyogi and Burges paper: building feature detectors with SVMs

  15. Generalization in classification • Experiment: you are learning a line that separates two classes

  16. Generalization in classification • Question: Where does the yellow dot belong?

  17. Generalization in classification • Question: Where does the yellow dot belong?

  18. Margin and linear classifiers • We tend to draw the line that gives the most “room” between the two clouds -- the margin

  19. Margin • Margin needs to be defined on “border” points

  20. Margin • Margin needs to be defined on “border” points

  21. Justification for maximum margin • Hopefully, they generalize well

  22. Justification for maximum margin • Hopefully, they generalize well

  23. Support vectors in the separable case • The data points that lie exactly at the maximal margin from the separating line

  24. Formalizing maximum margin -- optimization for SVM • Needs constrained optimization • f(x) = sign(<w, x> + b) is the same as sign(<Cw, x> + Cb) for any C > 0, so the scale of w must be pinned down • Two strategies for setting up the constrained optimization problem: • Limit the length of w, and maximize the margin • Fix the margin, and minimize the length of w

  25. SVM optimization (see demo) • A constrained quadratic programming problem • It can be shown (via the Lagrange multiplier method) that, with the margin fixed by the constraints, the solution looks like w = Σi αi yi xi, where the yi are the labels -- a linear combination of the training data!
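
A sketch of this point using scikit-learn's SVC (an assumption; the course demo may have used other software): for a linear kernel, the learned weight vector is exactly the dual-weighted sum of the support vectors.

```python
# Check that the linear-SVM solution is a linear combination of training data:
# coef_ == dual_coef_ @ support_vectors_ (dual_coef_ holds alpha_i * y_i).
# Data and parameters are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.r_[rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))]
y = np.r_[-np.ones(50), np.ones(50)]

clf = SVC(kernel="linear", C=10.0).fit(X, y)
w_from_duals = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_duals, clf.coef_))   # True
print(len(clf.support_))                      # only a handful of points matter
```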

  26. SVM applied to non-separable data • What happens when data is not separable? • The optimization problem has no solution (recall the XOR problem) • See demo

  27. Extension to non-separable data through new variables • Allow data points to “encroach” on the separating line (see demo) • New objective = original objective + a tolerance term: minimize (1/2)||w||^2 + C Σi ξi, where ξi measures how far point i encroaches
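
A small illustration of the tolerance trade-off (again with scikit-learn's SVC as an assumption): the parameter C weights the slack term, so a small C tolerates more encroaching points than a large C.

```python
# Soft-margin sketch: on overlapping classes, a small C (cheap slack) keeps
# many points inside the margin; a large C (expensive slack) keeps few.
# Data and the two C values are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.r_[rng.normal(-1, 1.5, (100, 2)), rng.normal(1, 1.5, (100, 2))]
y = np.r_[-np.ones(100), np.ones(100)]

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_))   # typically far fewer support vectors as C grows
```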

  28. When things become wild: non-linear extensions • The majority of “real world” problems are not separable • This can be due to some deep underlying law, e.g. the XOR data • Non-linearity in neural nets: • Hidden layers • Non-linear activations • SVMs introduced a now-popular way of making non-linear machines -- kernels

  29. Kernel methods • Model-fitting problems are ill-posed without constraining the hypothesis space • Avoid committing to a fixed space: non-parametric methods using kernels • Idea: let the space grow with the data • How? Associate each data point with a little function, e.g. a blob, and take the space to be linear combinations of these • Connection to neural nets

  30. Kernel extension of SVM • Recall the linear solution: w = Σi αi yi xi • Substituting this into f: f(x) = sign(Σi αi yi <x, xi> + b) -- what matters is only the dot product • So use a general kernel function K(x, xi) in place of <x, xi>
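
A sketch of that substitution (scikit-learn's SVC is an assumption): the decision function can be reproduced from the dual coefficients using only kernel evaluations K(x, xi) against the support vectors, with no explicit feature-space coordinates.

```python
# Rebuild f(x) = sign(sum_i alpha_i y_i K(x, x_i) + b) by hand from a fitted
# SVC: only kernel values against the support vectors are needed.
import numpy as np
from sklearn.svm import SVC

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = np.sign(X[:, 0] * X[:, 1])                    # XOR-like labels

clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

def f(x):
    k = np.array([rbf(x, sv) for sv in clf.support_vectors_])
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x_new = np.array([1.0, 1.0])
print(np.sign(f(x_new)), clf.predict(x_new.reshape(1, -1))[0])   # same answer
```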

  31. Kernel extension of SVM • This is very much like replacing linear with non-linear nodes in a neural net • Radial Basis Network: each K(x, xi) is a Gaussian centered at xi -- a small blob • “Seeing” the non-linearity -- a theorem: K(x, xi) = <Φ(x), Φ(xi)>, i.e. the kernel is still a dot product, except that it works in an infinite-dimensional space of “features”

  32. This is not a fairy tale • Hopefully, by throwing the data into infinite dimensions, it becomes separable • How can things work in infinite dimensions? • The infinite-dimensional space is implicit • Only the support vectors act as “anchors” for the separating plane in feature space • All the computation is done in finite dimensions, by searching over the support vectors and their weights • As a result, we can do lots of things with SVM by playing with kernels (see demo)
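
In the same spirit as the demo (though not the demo itself), a hedged comparison of kernels on XOR-style data: the linear kernel stays near chance while polynomial and RBF kernels fit it almost perfectly. Kernels, parameters, and data are illustrative choices.

```python
# "Playing with kernels" on XOR-pattern data (label = sign of x1 * x2):
# a linear SVM stays near chance; degree-2 polynomial and RBF kernels succeed.
# All parameters and data here are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sign(X[:, 0] * X[:, 1])

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=2, C=10.0).fit(X, y)   # degree only used by "poly"
    print(kernel, round(clf.score(X, y), 2))                # training accuracy
```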

  33. Reflections • How likely this is a human learning model?

  34. Reflections • How likely this is a human learning model? • Are all learning problems reducible to classification?

  35. Reflections • How likely this is a human learning model? • Are all learning problems reducible to classification? • What learning models are appropriate for speech?
