LING 696B: Graph-based methods and Supervised learning
Road map • Types of learning problems: • Unsupervised: clustering, dimension reduction -- Generative models • Supervised: classification (today) -- Discriminative models • Methodology: • Parametric: stronger assumptions about the distribution (blobs, mixture model) • Non-parametric: weaker assumptions (neural nets, spectral clustering, Isomap)
Puzzle from several weeks ago • How do people learn categories from distributions? Liberman et al. (1952)
Graph-based non-parametric methods • “Learn locally, think globally” • Local learning produces a graph that reveals the underlying structure • Learning who the neighbors are • The graph is then used to reveal global structure in the data • Isomap: geodesic distance through shortest paths • Spectral clustering: connected components from the graph spectrum (see demo)
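A minimal sketch of the “learn locally, think globally” recipe, assuming NumPy/SciPy (this is an illustrative stand-in, not the course demo): build a k-nearest-neighbor graph, then read global structure off it, here Isomap-style geodesic distances via shortest paths.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def knn_graph(X, k=5):
    """Local learning: connect each point to its k nearest neighbors."""
    D = cdist(X, X)                       # pairwise Euclidean distances
    W = np.full_like(D, np.inf)           # inf = no edge (dense csgraph convention)
    for i in range(len(X)):
        nbrs = np.argsort(D[i])[1:k + 1]  # skip the point itself
        W[i, nbrs] = D[i, nbrs]
        W[nbrs, i] = D[nbrs, i]           # keep the graph symmetric
    np.fill_diagonal(W, 0.0)
    return W

# Toy data on a curved 1-D manifold embedded in 2-D (a slowly growing spiral)
t = np.linspace(0, 3 * np.pi, 100)
X = np.column_stack([np.cos(t), np.sin(t)]) * (1 + 0.1 * t[:, None])

W = knn_graph(X, k=5)
# Global structure: geodesic distance = shortest path through the graph (Isomap)
geodesic = shortest_path(W, method="D", directed=False)
print(geodesic[0, -1] > np.linalg.norm(X[0] - X[-1]))  # geodesic exceeds the straight-line distance
```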
Clustering as a graph partitioning problem • Normalized-cut problem: split the graph into two parts A and B, so that • Each part is not too small • The edges being cut don't carry too much weight • Ncut(A, B) = cut(A, B)/assoc(A) + cut(A, B)/assoc(B), where cut(A, B) is the total weight on edges from A to B and assoc(A) is the total weight on edges within A
Normalized cut through spectral embedding • Exact solution of the normalized cut is NP-hard (explodes for large graphs) • A “soft” version is solvable: look for coordinates x1, …, xN for the nodes that minimize Σij Wij (xi − xj)², where W is the neighborhood (adjacency) matrix • Strongly connected nodes stay nearby, weakly connected nodes stay far away • Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- spectral embedding
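A minimal sketch of the spectral embedding, assuming NumPy and the unnormalized graph Laplacian for simplicity (the exact normalized-cut relaxation uses the generalized problem L v = λ D v, which this sketch does not solve):

```python
import numpy as np

def spectral_embedding_2way(W):
    """Embed graph nodes on a line and split at zero (soft two-way cut)."""
    D = np.diag(W.sum(axis=1))        # degree matrix
    L = D - W                         # graph Laplacian; x'Lx = (1/2) sum_ij W_ij (x_i - x_j)^2
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]           # second-smallest eigenvector = embedding coordinates
    return fiedler, (fiedler > 0).astype(int)

# Two tightly connected groups of 5 nodes joined by one weak edge
W = np.zeros((10, 10))
W[:5, :5] = 1.0
W[5:, 5:] = 1.0
np.fill_diagonal(W, 0.0)
W[4, 5] = W[5, 4] = 0.1               # the weak edge between the groups

coords, labels = spectral_embedding_2way(W)
print(labels)                          # the weak edge is the one that gets cut
```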
Is this relevant to how people learn categories? • Maye & Gerken: learning a bi-modal distribution on a curve (living in an abstract manifold) from /d/ to /(s)t/ • Mixture model: transform the signal, and approximate with two “dynamic blobs” • Can people learn categories from arbitrary manifolds following a “local learning” strategy? • Simple case: start from a uniform distribution (see demo)
Local learning from graphs • Can people learn categories from arbitrary manifolds following a “local learning” strategy? • Most likely not • What constrains the kinds of manifolds that people can learn? • What are the reasonable metrics people use? • How does neighborhood size affect this type of learning? • What about learning from non-uniform distributions?
Switching gears • Supervised learning: learning a function from input-output pairs • Arguably, something that people also do • Example: the perceptron • Learning a function f(x) = sign(<w, x> + b) • Also called a “classifier”: a machine with yes/no output
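A minimal sketch of the perceptron learning rule on toy data, assuming NumPy and labels in {−1, +1} (an illustrative implementation, not the course's):

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """Learn f(x) = sign(<w, x> + b) by correcting one misclassified point at a time."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:     # wrong (or on the boundary): update
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        if errors == 0:                    # converged: every point classified correctly
            break
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.7, (25, 2)), rng.normal(2, 0.7, (25, 2))])
y = np.array([-1] * 25 + [1] * 25)
w, b = perceptron(X, y)
print(np.all(np.sign(X @ w + b) == y))    # True on separable data
```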
Speech perception as a classification problem • Speech perception is viewed as a bottom-up procedure involving many decisions • E.g. sonorant/consonant, voiced/voiceless • See Peter’s presentation • A long-standing effort to build machines that do the same • Stevens’ view of distinctive features
Knowledge-based speech recognition • Mainstream method: • Front end: uniform signal representation • Back end: hidden Markov models • Knowledge based: • Front end: sound-specific features based on acoustic knowledge • Back end: a series of decisions on how lower level knowledge is integrated
The conceptual framework from Liu (1996) and others • Each step is hard work • [Flow-chart figure: a pipeline of low-level decisions; some of the steps are bypassed in Stevens (2002)]
Implications of flow-chart architecture • Requires accurate low-level decisions • Mistakes can build up very quickly • Thought experiment: “linguistic” speech recognition through a sequence of distinctive feature classifiers • Hand-crafted decision rules often not robust/flexible • The need for good statistical classifiers
An unlikely marriage • Recent years have seen several sophisticated classification machines • Example: support vector machine by Vapnik (today) • Interest moving from neural nets to these new machines • Many have proposed to integrate the new classifiers as a back-end • Niyogi and Burges paper: building feature detectors with SVM
Generalization in classification • Experiment: you are learning a line that separates two classes
Generalization in classification • Question: Where does the yellow dot belong?
Margin and linear classifiers • We tend to draw a line that gives the most “room” between the two clouds -- this room is the margin
Margin • Margin needs to be defined on “border” points
Justification for maximum margin • Hopefully, maximum-margin classifiers generalize well
Support vectors in the separable case • The data points that lie exactly at the maximal margin from the separating line
Formalizing maximum margin -- the optimization problem behind SVM • Needs constrained optimization: f(x) = sign(<w, x> + b) is the same as sign(<Cw, x> + Cb) for any C > 0, so w and b are only determined up to scale • Two strategies for choosing a constrained optimization problem: • Limit the length of w, and maximize the margin • Fix the margin, and minimize the length of w
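As a concrete instance of the second strategy (fix the margin at 1, minimize the length of w), the standard hard-margin primal can be written as below; this is the textbook formulation, not necessarily the exact form used in the course demo.

```latex
% Hard-margin SVM primal (standard textbook form; margin fixed at 1)
\min_{w,\,b}\ \ \frac{1}{2}\,\lVert w \rVert^{2}
\qquad \text{subject to} \qquad
y_i\,\bigl(\langle w, x_i\rangle + b\bigr) \;\ge\; 1,
\qquad i = 1, \dots, N .
```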
SVM optimization (see demo) • A constrained quadratic programming problem • It can be shown (through the Lagrange multiplier method) that the solution looks like w = Σi αi yi xi with the constraints yi(<w, xi> + b) ≥ 1, where yi is the label and 1 is the fixed margin -- a linear combination of the training data!
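A minimal sketch that checks this on toy data, assuming scikit-learn and NumPy (which the course demo may or may not use): the learned weight vector is recovered as a weighted sum of the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two separable Gaussian clouds with labels -1 / +1
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e3)   # large C approximates the hard margin
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector,
# so w = sum_i alpha_i * y_i * x_i is a linear combination of training data
w_from_duals = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_duals, clf.coef_))   # True: same separating direction
```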
SVM applied to non-separable data • What happens when data is not separable? • The optimization problem has no solution (recall the XOR problem) • See demo
Extension to non-separable data through new variables • Allow the data points to “encroach” on the separating line (see demo) • Minimize ½||w||² (the original objective) + C Σi ξi (the tolerance terms), subject to yi(<w, xi> + b) ≥ 1 − ξi, ξi ≥ 0
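A minimal sketch of this trade-off, assuming scikit-learn, whose C parameter weights the Σξi term: a smaller C makes slack cheaper, so more points are allowed inside the margin and more of them end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two overlapping clouds: not linearly separable
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C = cheaper slack: more margin violations, more support vectors
    print(C, clf.n_support_.sum())
```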
When things become wild: non-linear extensions • The majority of “real world” problems are not separable • This can be due to some deep underlying laws, e.g. the XOR data • Non-linearity from neural nets: • Hidden layers • Non-linear activations • SVMs launched a more trendy way of making non-linear machines -- kernels
Kernel methods • Model-fitting problems are ill-posed without constraining the hypothesis space • Avoid committing to a fixed space: a non-parametric method using kernels • Idea: let the space grow with the data • How? Associate each data point with a little function, e.g. a blob, and take the space to be linear combinations of these • Connection to neural nets
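A minimal sketch of the “one little function per data point” idea, assuming Gaussian blobs as the little functions and a plain least-squares fit; this is a radial-basis-function regression used purely as an illustration, not the SVM itself.

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma=1.0):
    """Phi[i, j] = Gaussian blob centered at data point j, evaluated at x_i."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (30, 1))            # 1-D inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

# The hypothesis space is spanned by one blob per training point,
# so it grows with the data; fit the combination weights by least squares.
Phi = rbf_design_matrix(X, centers=X, sigma=0.5)
weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)

f_hat = Phi @ weights                      # fitted values at the training inputs
print(float(np.mean((f_hat - y) ** 2)))    # small training error
```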
Kernel extension of SVM • Recall the linear solution: w = Σi αi yi xi • Substituting this into f: f(x) = sign(Σi αi yi <x, xi> + b) -- what matters is the dot product • Use a general kernel function K(x, xi) in place of <x, xi>: f(x) = sign(Σi αi yi K(x, xi) + b)
Kernel extension of SVM • This is very much like replacing linear nodes with non-linear nodes in a neural net • Radial Basis Network: each K(x, xi) is a Gaussian centered at xi -- a small blob • “Seeing” the non-linearity: a theorem (Mercer's) guarantees K(x, x') = <φ(x), φ(x')> for some feature map φ, i.e. the kernel is still a dot product, except that it works in a (possibly infinite-dimensional) space of “features”
This is not a fairy tale • Hopefully, by throwing the data into infinite dimensions, they become separable • How can things work in infinite dimensions? • The infinite dimension is only implicit • Only the support vectors act as “anchors” for the separating plane in feature space • All the computation is done in finite dimensions by searching over the support vectors and their weights • As a result, we can do lots of things with SVM just by playing with kernels (see demo)
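A minimal sketch of “playing with kernels”, assuming scikit-learn (not the course demo): the same SVM machinery separates XOR-like data once a Gaussian (RBF) kernel replaces the linear one.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# XOR-like data: label is +1 when the two coordinates have the same sign
X = rng.uniform(-1, 1, (200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, C=10.0)
    clf.fit(X, y)
    # The linear kernel cannot separate XOR; the RBF kernel (implicitly an
    # infinite-dimensional feature space) handles it easily.
    print(kernel, clf.score(X, y))
```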
Reflections • How likely this is a human learning model?
Reflections • How likely this is a human learning model? • Are all learning problems reducible to classification?
Reflections • How likely this is a human learning model? • Are all learning problems reducible to classification? • What learning models are appropriate for speech?