Lecture 5 Machine Learning
5.1 Introduction
5.2 Supervised Learning
5.3 Parametric Methods
5.4 Clustering
5.5 Nonparametric Methods
5.6 Decision Trees
Why “Learn”?
• Machine learning is programming computers to optimize a performance criterion using example data or past experience.
• There is no need to “learn” to calculate payroll.
• Learning is used when:
  • Human expertise does not exist (navigating on Mars)
  • Humans are unable to explain their expertise (speech recognition)
  • The solution changes over time (routing on a computer network)
  • The solution needs to be adapted to particular cases (user biometrics)
What We Talk About When We Talk About “Learning”
• Learning general models from data of particular examples
• Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
• Example in retail, from customer transactions to consumer behavior: people who bought “Da Vinci Code” also bought “The Five People You Meet in Heaven” (www.amazon.com)
• Build a model that is a good and useful approximation to the data.
Data Mining
• Retail: Market basket analysis, Customer relationship management (CRM)
• Finance: Credit scoring, fraud detection
• Manufacturing: Optimization, troubleshooting
• Medicine: Medical diagnosis
• Telecommunications: Quality-of-service optimization
• Bioinformatics: Motifs, alignment
• Web mining: Search engines
• ...
What is Machine Learning?
• Optimize a performance criterion using example data or past experience.
• Role of statistics: inference from a sample
• Role of computer science: efficient algorithms to
  • solve the optimization problem
  • represent and evaluate the model for inference
Applications
• Association
• Supervised Learning
  • Classification
  • Regression
• Unsupervised Learning
• Reinforcement Learning
Learning Associations
• Basket analysis: P(Y | X), the probability that somebody who buys X also buys Y, where X and Y are products/services.
• Example: P(chips | beer) = 0.7
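As a minimal sketch, the conditional probability P(Y | X) can be estimated from a list of market baskets by counting; the transactions and product names below are made up for illustration:

```python
def conditional_support(transactions, x, y):
    """Estimate P(y | x): among baskets containing x, the fraction that also contain y."""
    with_x = [basket for basket in transactions if x in basket]
    if not with_x:
        return 0.0
    with_x_and_y = [basket for basket in with_x if y in basket]
    return len(with_x_and_y) / len(with_x)

# Hypothetical transaction data, for illustration only
transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"bread", "milk"},
]
print(conditional_support(transactions, "beer", "chips"))  # 2/3 of beer baskets also contain chips
```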
Classification
• Example: credit scoring, differentiating between low-risk and high-risk customers from their income and savings
• Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
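A minimal sketch of this rule as code; the threshold values THETA1 and THETA2 are illustrative assumptions, not values learned from data:

```python
THETA1 = 30_000   # income threshold (assumed, illustrative)
THETA2 = 10_000   # savings threshold (assumed, illustrative)

def credit_risk(income, savings):
    """Discriminant: IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"

print(credit_risk(45_000, 12_000))  # low-risk
print(credit_risk(45_000, 5_000))   # high-risk
```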
Classification: Applications
• Also known as pattern recognition
• Face recognition: pose, lighting, occlusion (glasses, beard), make-up, hair style
• Character recognition: different handwriting styles
• Speech recognition: temporal dependency; use of a dictionary or the syntax of the language
• Sensor fusion: combine multiple modalities, e.g., visual (lip image) and acoustic for speech
• Medical diagnosis: from symptoms to illnesses
• ...
Face Recognition
• [Figure: training examples of a person and test images]
• Source: AT&T Laboratories, Cambridge UK, http://www.uk.research.att.com/facedatabase.html
Regression
• Example: price of a used car
• x: car attributes, y: price
• y = g(x | θ), where g(·) is the model and θ its parameters
• Linear model: y = w x + w0
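A minimal sketch of fitting the linear model y = w x + w0 by least squares; the (attribute, price) pairs below are made up for illustration:

```python
import numpy as np

# Hypothetical (attribute, price) pairs for illustration, e.g., car age vs. price
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([18.0, 15.5, 13.2, 9.8, 6.1])

# Least-squares fit of y = w*x + w0 using the design matrix [x, 1]
A = np.column_stack([x, np.ones_like(x)])
(w, w0), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"y = {w:.2f} * x + {w0:.2f}")
print("prediction for x = 4:", w * 4 + w0)
```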
Regression Applications
• Navigating a car: angle of the steering wheel (CMU NavLab)
• Kinematics of a robot arm: α1 = g1(x, y), α2 = g2(x, y)
• Response surface design
Supervised Learning: Uses
• Prediction of future cases: use the rule to predict the output for future inputs
• Knowledge extraction: the rule is easy to understand
• Compression: the rule is simpler than the data it explains
• Outlier detection: exceptions that are not covered by the rule, e.g., fraud
Unsupervised Learning
• Learning “what normally happens”
• No output
• Clustering: grouping similar instances
• Example applications:
  • Customer segmentation in CRM
  • Image compression: color quantization
  • Bioinformatics: learning motifs
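A minimal k-means sketch to illustrate clustering as grouping similar instances; the number of clusters and the toy data are assumptions made for illustration, not part of the lecture:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means: assign each point to its nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances of every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two illustrative 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(3.0, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```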
Reinforcement Learning
• Learning a policy: a sequence of outputs
• No supervised output, but delayed reward
• Credit assignment problem
• Game playing
• Robot in a maze
• Multiple agents, partial observability, ...
Resources: Datasets
• UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
• UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
• Statlib: http://lib.stat.cmu.edu/
• Delve: http://www.cs.utoronto.ca/~delve/
Resources: Journals
• Journal of Machine Learning Research: www.jmlr.org
• Machine Learning
• Neural Computation
• Neural Networks
• IEEE Transactions on Neural Networks
• IEEE Transactions on Pattern Analysis and Machine Intelligence
• Annals of Statistics
• Journal of the American Statistical Association
• ...
Resources: Conferences
• International Conference on Machine Learning (ICML) — ICML05: http://icml.ais.fraunhofer.de/
• European Conference on Machine Learning (ECML) — ECML05: http://ecmlpkdd05.liacc.up.pt/
• Neural Information Processing Systems (NIPS) — NIPS05: http://nips.cc/
• Uncertainty in Artificial Intelligence (UAI) — UAI05: http://www.cs.toronto.edu/uai2005/
• Computational Learning Theory (COLT) — COLT05: http://learningtheory.org/colt2005/
• International Joint Conference on Artificial Intelligence (IJCAI) — IJCAI05: http://ijcai05.csd.abdn.ac.uk/
• International Conference on Artificial Neural Networks (ICANN, Europe) — ICANN05: http://www.ibspan.waw.pl/ICANN-2005/
• ...
Learning a Class from Examples
• Class C of a “family car”
• Prediction: is car x a family car?
• Knowledge extraction: what do people expect from a family car?
• Output: positive (+) and negative (–) examples
• Input representation: x1: price, x2: engine power
Hypothesis Class H
• A hypothesis h ∈ H: an axis-aligned rectangle in the (price, engine power) space
• Error of h on the training set X: E(h | X) = Σt 1(h(xt) ≠ rt)
S, G, and the Version Space
• Most specific hypothesis S: the tightest rectangle that contains all positive examples and no negative ones
• Most general hypothesis G: the largest such rectangle
• Any h ∈ H between S and G is consistent with the training set; together these hypotheses make up the version space (Mitchell, 1997)
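A minimal sketch of the most specific hypothesis S for the rectangle example: S is simply the bounding box of the positive examples. The (price, engine power) values below are made-up illustrative data:

```python
import numpy as np

# Hypothetical labeled examples: columns are (price, engine power), r = 1 for a family car
X = np.array([[15.0, 100.0], [18.0, 120.0], [22.0, 150.0],  # positives
              [8.0,  60.0],  [35.0, 300.0]])                # negatives
r = np.array([1, 1, 1, 0, 0])

# Most specific hypothesis S: tightest axis-aligned rectangle around the positives
pos = X[r == 1]
p1, e1 = pos.min(axis=0)
p2, e2 = pos.max(axis=0)

def h_S(x):
    """Predict 1 iff x falls inside the rectangle S."""
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

# Empirical error E(h | X): number of misclassified training examples
E = sum(h_S(x) != label for x, label in zip(X, r))
print("S:", (p1, p2, e1, e2), "training error:", E)
```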
VC Dimension
• N points can be labeled in 2^N ways as +/–
• H shatters the N points if, for each of these labelings, there exists an h ∈ H consistent with it; VC(H) is the largest such N
• An axis-aligned rectangle shatters at most 4 points, so its VC dimension is 4!
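A brute-force sketch that checks whether a given point set is shattered by axis-aligned rectangles, by enumerating all 2^N labelings; the four-point “diamond” configuration below is an illustrative choice:

```python
from itertools import product
import numpy as np

def rectangle_consistent(points, labels):
    """Can some axis-aligned rectangle contain exactly the positive points?"""
    pos = points[labels == 1]
    if len(pos) == 0:
        return True  # a degenerate (empty) rectangle labels everything negative
    lo, hi = pos.min(axis=0), pos.max(axis=0)
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    # consistent iff no negative point falls inside the bounding box of the positives
    return not np.any(inside & (labels == 0))

def shattered(points):
    n = len(points)
    return all(rectangle_consistent(points, np.array(lab))
               for lab in product([0, 1], repeat=n))

# Four points in a diamond configuration (illustrative) are shattered by rectangles
pts4 = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])
print(shattered(pts4))   # True
```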
Probably Approximately Correct (PAC) Learning
• How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
• The region of error between h = S and the true class C is covered by four strips; require each strip to have probability at most ε/4
• Pr that a random instance misses one strip: 1 − ε/4
• Pr that all N instances miss one strip: (1 − ε/4)^N
• Pr that the N instances miss any of the 4 strips: at most 4(1 − ε/4)^N
• Require 4(1 − ε/4)^N ≤ δ; using (1 − x) ≤ exp(−x), this gives 4 exp(−εN/4) ≤ δ, i.e., N ≥ (4/ε) log(4/δ)
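A small sketch evaluating this sample-complexity bound; the values of ε and δ are illustrative:

```python
import math

def pac_sample_size(eps, delta):
    """N >= (4/eps) * log(4/delta), from the axis-aligned rectangle argument above."""
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

# Illustrative values: 5% error with 95% confidence
print(pac_sample_size(eps=0.05, delta=0.05))  # 351
```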
Noise and Model Complexity
Use the simpler model because it is:
• Simpler to use (lower computational complexity)
• Easier to train (lower space complexity)
• Easier to explain (more interpretable)
• More likely to generalize well (lower variance; Occam’s razor)
Multiple Classes, Ci, i = 1, ..., K
• Training set X = {xt, rt}, where rit = 1 if xt ∈ Ci and 0 otherwise
• Train K two-class hypotheses hi(x), i = 1, ..., K, where hi(xt) = 1 if xt ∈ Ci and 0 otherwise (each class against the rest)
Model Selection & Generalization • Learning is an ill-posed problem; data is not sufficient to find a unique solution • The need for inductive bias,assumptions about H • Generalization:How well a model performs on new data • Overfitting: H more complex than C or f • Underfitting: H less complex than C or f
Triple Trade-Off
• There is a trade-off between three factors (Dietterich, 2003):
  • Complexity of the hypothesis class, c(H)
  • Training set size, N
  • Generalization error, E, on new data
• As N increases, E decreases
• As c(H) increases, E first decreases and then increases
Cross-Validation
• To estimate generalization error, we need data unseen during training. We split the data into:
  • Training set (50%)
  • Validation set (25%)
  • Test (publication) set (25%)
• Resampling when there is little data
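A minimal sketch of such a 50/25/25 split using a random permutation; the array sizes and features are arbitrary illustrative choices:

```python
import numpy as np

def train_val_test_split(X, y, seed=0):
    """Shuffle the data and split it 50% / 25% / 25% into train, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = len(X) // 2
    n_val = len(X) // 4
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# Illustrative data: 100 examples with 3 features
X = np.random.randn(100, 3)
y = np.random.randint(0, 2, size=100)
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 50 25 25
```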
Dimensions of a Supervised Learner
• Model: g(x | θ)
• Loss function: E(θ | X) = Σt L(rt, g(xt | θ))
• Optimization procedure: θ* = arg minθ E(θ | X)
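A small sketch that makes the three dimensions concrete for a linear model with squared-error loss, minimized by gradient descent; the simulated data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

# Illustrative data drawn around the line r = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
r = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=50)

def g(x, theta):
    """Model: g(x | theta) = w * x + w0."""
    w, w0 = theta
    return w * x + w0

def E(theta, x, r):
    """Loss: mean squared error over the sample (a squared-error choice of L)."""
    return np.mean((r - g(x, theta)) ** 2)

def fit(x, r, lr=0.01, n_iter=5000):
    """Optimization procedure: gradient descent on E to approximate theta* = argmin E(theta | X)."""
    theta = np.zeros(2)
    for _ in range(n_iter):
        err = g(x, theta) - r
        grad = 2.0 * np.array([np.mean(err * x), np.mean(err)])
        theta = theta - lr * grad
    return theta

theta_star = fit(x, r)
print(theta_star, E(theta_star, x, r))   # close to w = 2, w0 = 1
```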
Parametric Estimation
• Sample X = {xt}t, where xt ~ p(x)
• Parametric estimation: assume a form for p(x | θ) and estimate θ, its sufficient statistics, using X
• e.g., N(μ, σ²), where θ = {μ, σ²}
Maximum Likelihood Estimation
• Likelihood of θ given the sample X: l(θ | X) = p(X | θ) = ∏t p(xt | θ)
• Log likelihood: L(θ | X) = log l(θ | X) = Σt log p(xt | θ)
• Maximum likelihood estimator (MLE): θ* = argmaxθ L(θ | X)
Examples: Bernoulli / Multinomial
• Bernoulli: two states, failure/success, x ∈ {0, 1}
  P(x) = po^x (1 − po)^(1−x)
  L(po | X) = log ∏t po^(xt) (1 − po)^(1−xt)
  MLE: po = Σt xt / N
• Multinomial: K > 2 states, xi ∈ {0, 1}
  P(x1, x2, ..., xK) = ∏i pi^(xi)
  L(p1, p2, ..., pK | X) = log ∏t ∏i pi^(xit)
  MLE: pi = Σt xit / N
Gaussian (Normal) Distribution
• p(x) = N(μ, σ²): p(x) = (1 / (√(2π) σ)) exp[−(x − μ)² / (2σ²)]
• MLE for μ and σ²:
  m = Σt xt / N
  s² = Σt (xt − m)² / N
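A small sketch of the Gaussian MLE on simulated data; the true parameters used to generate the sample are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # sample from N(mu=2, sigma=1.5), illustrative

# MLE: sample mean and the divide-by-N sample variance
m = np.sum(x) / len(x)
s2 = np.sum((x - m) ** 2) / len(x)

print(f"m = {m:.3f}, s^2 = {s2:.3f}")   # close to mu = 2 and sigma^2 = 2.25
```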
Bias and Variance
• Unknown parameter θ; estimator di = d(Xi) on sample Xi
• Bias: bθ(d) = E[d] − θ
• Variance: E[(d − E[d])²]
• Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
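A Monte Carlo sketch estimating the bias and variance of the divide-by-N variance estimator s² above, which is known to be biased; the sample size and number of repetitions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0        # theta: variance of N(0, 2^2), illustrative
N, reps = 10, 20000   # small samples make the bias visible

# Draw many samples X_i and compute the estimator d(X_i) = (1/N) sum (x - mean)^2 on each
estimates = np.array([
    np.var(rng.normal(0.0, 2.0, size=N))   # np.var divides by N (the MLE) by default
    for _ in range(reps)
])

bias = estimates.mean() - true_var           # E[d] - theta, approx -true_var/N = -0.4 here
variance = estimates.var()                   # E[(d - E[d])^2]
mse = np.mean((estimates - true_var) ** 2)   # approx bias^2 + variance

print(f"bias = {bias:.3f}, variance = {variance:.3f}, mse = {mse:.3f}")
```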
Bayes’ Estimator
• Treat θ as a random variable with prior p(θ)
• Bayes’ rule: p(θ | X) = p(X | θ) p(θ) / p(X)
• Full: p(x | X) = ∫ p(x | θ) p(θ | X) dθ
• Maximum a posteriori (MAP): θMAP = argmaxθ p(θ | X)
• Maximum likelihood (ML): θML = argmaxθ p(X | θ)
• Bayes’: θBayes = E[θ | X] = ∫ θ p(θ | X) dθ
Bayes’ Estimator: Example
• xt ~ N(θ, σo²) and prior θ ~ N(μ, σ²)
• θML = m (the sample mean)
• θMAP = θBayes = [ (N/σo²) / (N/σo² + 1/σ²) ] m + [ (1/σ²) / (N/σo² + 1/σ²) ] μ, a precision-weighted average of the sample mean m and the prior mean μ
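A small sketch computing θML and the shared θMAP/θBayes value for this normal-prior case; the sample and the prior parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma0 = 1.0                 # known noise std of x_t ~ N(theta, sigma0^2), illustrative
mu, sigma = 0.0, 2.0         # prior theta ~ N(mu, sigma^2), illustrative
x = rng.normal(3.0, sigma0, size=5)   # small sample, so the prior still matters

N = len(x)
m = x.mean()
theta_ml = m

# Posterior mean: precision-weighted average of the sample mean and the prior mean
w_data = (N / sigma0**2) / (N / sigma0**2 + 1 / sigma**2)
theta_bayes = w_data * m + (1 - w_data) * mu   # equals theta_MAP, since the posterior is Gaussian

print(f"theta_ML = {theta_ml:.3f}, theta_MAP = theta_Bayes = {theta_bayes:.3f}")
```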
Given the sample X = {xt, rt}, the ML estimates are
  P(Ci) = Σt rit / N
  mi = Σt xt rit / Σt rit
  si² = Σt (xt − mi)² rit / Σt rit
The discriminant becomes (dropping constant terms)
  gi(x) = −log si − (x − mi)² / (2si²) + log P(Ci)
• Equal variances: a single boundary halfway between the means
• Different variances: two boundaries
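A minimal sketch of this one-dimensional Gaussian classifier: estimate the priors, means, and variances per class, then classify with the discriminant above. The two-class data is simulated with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative two-class 1-D data: C0 ~ N(0, 1), C1 ~ N(3, 2^2), unequal priors
x = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(3.0, 2.0, 40)])
r = np.concatenate([np.zeros(60, dtype=int), np.ones(40, dtype=int)])

K = 2
prior = np.array([np.mean(r == i) for i in range(K)])                 # P(C_i)
m = np.array([x[r == i].mean() for i in range(K)])                    # m_i
s2 = np.array([np.mean((x[r == i] - m[i]) ** 2) for i in range(K)])   # s_i^2

def g(xq, i):
    """Discriminant g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)."""
    return -0.5 * np.log(s2[i]) - (xq - m[i]) ** 2 / (2 * s2[i]) + np.log(prior[i])

def classify(xq):
    return int(np.argmax([g(xq, i) for i in range(K)]))

print([classify(v) for v in (-1.0, 1.2, 2.5, 5.0)])
```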