
A Brief Tour of Machine Learning


Presentation Transcript


  1. A Brief Tour of Machine Learning David Lindsay

  2. What is Machine Learning? • Very multidisciplinary field – statistics, mathematics, artificial intelligence, psychology, philosophy, cognitive science… • In a nutshell – developing algorithms that learn from data • Historically – flourished from advances in computing in the early 1960s, with a resurgence in the late 1990s

  3. Main areas in Machine Learning • #1 Supervised learning – assumes a teacher exists to label/annotate data • #2 Unsupervised learning – no need for a teacher; try to learn relationships automatically • #3 Reinforcement learning – biologically plausible; try to learn from reward/punishment stimuli/feedback

  4. Supervised Learning – learning with a teacher

  5. More about Supervised Learning • Perhaps the most well-studied area of machine learning – lots of nice theory adapted from statistics/mathematics. • We assume the existence of a training set and a test set. • Main sub-areas of research are: • Pattern recognition (discrete labels) • Regression (continuous labels) • Time series analysis (temporal dependence in data) • The i.i.d. (independent and identically distributed) assumption is commonly made

  6. The formalisation of data • How do we formally describe our data? • Label – the property of the object that we want to predict in the future using our training data, e.g. in cancer screening the labels could be Y = {normal, benign, malignant} • Object – commonly represented as a feature vector that describes the object; the individual features can be real, discrete, symbolic… e.g. patient symptoms: temperature, sex, eye colour…
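
To make this formalisation concrete, here is a minimal sketch in Python/NumPy; the numeric encoding of the symbolic features is an assumption for illustration.

```python
# A hypothetical (object, label) pair from the slide's screening example.
import numpy as np

# Object: a feature vector describing a patient.
# Symbolic features are encoded as integers (sex: 0 = female, 1 = male;
# eye colour: 0 = blue, 1 = green, 2 = brown) - an assumed encoding.
x = np.array([38.2, 1, 2])   # temperature, sex, eye colour

# Label: the property we want to predict, from Y = {normal, benign, malignant}.
y = "malignant"

example = (x, y)             # one information pair (object + label)
```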

  7. The formalisation of data (continued) • What are training and test data? [Figure: a training set of images marked x with labels y = 7, 6, 1, 7, alongside new test images whose labels (shown as ?) are either not known or withheld from the learner] • We learn from the training data, and try to predict new unseen test data. • More formally, we have a set of n training and test examples (information pairs – object + label) drawn from some unknown probability distribution P(X, Y).
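
A brief sketch of the training/test setup in code, assuming scikit-learn's train_test_split; the synthetic feature vectors stand in for the slide's images.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 objects, 3 features each
y = rng.integers(0, 2, size=100)       # labels supplied by the "teacher"

# Training examples keep their labels; test labels are withheld from the
# learner and used only to score its predictions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
```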

  8. More about Pattern Recognition Lots of algorithms/techniques – the main contenders: • Support Vector Machines (SVM) • Nearest Neighbours • Decision Trees • Neural Networks • Multivariate Statistics • Bayesian algorithms • Logic programming

  9. The mighty SVM algorithm [Figure: two classes of points (■ and ☺) separated by a linear decision boundary] • Very popular technique – lots of followers, relatively new • Very simple technique – related to the Perceptron; it is a linear classifier (separates data into half spaces). • Concept – keep the classifier simple and don't overfit the data, so the classifier generalises well on new test data (Occam's razor) • Concept – if the data are not linearly separable, use a kernel Φ to map into a higher-dimensional feature space where the data may be separable
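
A minimal sketch of the two SVM concepts above using scikit-learn's SVC; the moons dataset and the RBF kernel choice are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable

# The RBF kernel implicitly maps the data into a higher-dimensional
# feature space, where a separating hyperplane may exist (kernel trick).
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.score(X, y))                 # training accuracy of the learnt classifier
```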

  10. Hot topics in SVMs • Kernel design – central to the application to data, e.g. when the objects are text documents and the features are words, incorporate domain knowledge about grammar. • Applying the kernel technique to other learning algorithms, e.g. Neural Networks

  11. The trusty old Nearest Neighbour algorithm • Born in the 1960s – probably the simplest of all algorithms to understand. • Decision rule – classify new test examples by finding the closest neighbouring example in the training set and predicting the same label as that closest example. • Lots of theory justifying its convergence properties. • Very lazy technique, and not very fast – it has to search the training set for each test example.
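
The decision rule above is simple enough to sketch from scratch; the toy data below is illustrative.

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_test):
    # Distance from the test object to every training example (Euclidean space)
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Predict the same label as the closest training example
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["A", "A", "B"])
print(nearest_neighbour_predict(X_train, y_train, np.array([4.0, 4.5])))  # -> B
```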

  12. Problems with Nearest Neighbours • Examples are viewed in Euclidean space, so the algorithm can be very sensitive to feature scaling. • Finding computationally efficient ways to search for the Nearest Neighbour example.

  13. Decision Trees • Many different varieties – C4.5, CART, ID3… • Algorithms build classification rules using a tree of if-then statements. • Constructs the tree using Minimum Description Length (MDL) principles (tries to make the tree as simple as possible) [Figure: example tree – IF temperature > 65 the patient has fever; then IF dehydrated = yes the patient has flu, otherwise pneumonia]
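
A hedged sketch of the if-then idea in code: first the slide's example tree written by hand (the branch structure is one reading of the garbled diagram), then a comparable tree learnt with scikit-learn from made-up data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# The slide's example tree as if-then statements (structure assumed):
def diagnose(temperature, dehydrated):
    if temperature > 65:               # fever branch
        return "flu" if dehydrated else "pneumonia"
    return "no fever"                  # assumed fallback, not in the slide

# Learning such rules from made-up data: [temperature, dehydrated]
X = [[70, 1], [70, 0], [60, 0], [72, 1], [58, 1]]
y = ["flu", "pneumonia", "no fever", "flu", "no fever"]
tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["temperature", "dehydrated"]))
```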

  14. Benefits/Issues with Decision Trees • Instability – minor changes to the training data can make huge changes to the decision tree • The user can visualise/interpret the hypothesis directly, and can find interesting classification rules • Problems with continuous real-valued attributes – they must be discretised. • Large AI following, and widely used in industry

  15. Mystical Neural Networks • Very flexible; learning is a gradient descent process (back propagation) • Training neural networks involves a lot of design choices: • what network structure, how many hidden layers… • how to encode the data (values must be in [0, 1]) • use momentum to speed up convergence • use weight decay to keep the network simple

  16. Training a neural network [Figure: a network with an input layer (menopausal status, ultrasound score, CA125), a hidden layer and an output layer, connected by weights w1, w2…; insets show the sigmoid function (outputs between 0 and 1) and the error surface E(w)] • The aim in training the neural network is to find the weight vector w that minimises the error E(w) on the training set – a gradient descent problem • The learnt hypothesis is represented by the weights that interconnect each neuron
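
A minimal sketch of this gradient descent loop for a single sigmoid neuron; the synthetic data, learning rate and iteration count are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes activations into (0, 1)

rng = np.random.default_rng(0)
X = rng.random((50, 3))                # 3 input features, encoded into [0, 1]
y = (X.sum(axis=1) > 1.5).astype(float)
w = np.zeros(3)                        # the weight vector to be learnt

for _ in range(1000):                  # gradient descent on the error E(w)
    p = sigmoid(X @ w)                 # forward pass through the neuron
    grad = X.T @ (p - y) / len(y)      # gradient of the cross-entropy error
    w -= 0.5 * grad                    # step downhill; 0.5 is an assumed rate
print(w)
```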

  17. Interesting applications • Bioinformatics: • genetic/protein code analysis • microarray analysis • gene regulatory pathways • WWW: • classifying text/html documents • filtering images • filtering emails

  18. Bayesian Algorithms • Try to model interrelationships between variables probabilistically. • Can encode expert/domain knowledge directly into the classifier as prior belief in certain events. • Use the basic axioms of probability theory to extract probabilistic estimates

  19. Bayesian algorithms in practice • Lots of different algorithms – Relevance Vector Machine (RVM), Naïve Bayes, Simple Bayes, Bayesian Belief Networks (BBN)… • Has a large following – especially Microsoft Research [Figure: a Bayesian belief network showing that causal links between features can be modelled, with nodes such as Weather = sunny, Temperature < 65 and Humidity > 100 linked to the decisions Play Tennis and Play Monopoly]
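
A brief sketch of Naïve Bayes, one of the algorithms listed above, assuming scikit-learn's BernoulliNB; the binary weather-style features echo the slide's example network.

```python
from sklearn.naive_bayes import BernoulliNB

# Binary features: [weather = sunny, temperature < 65, humidity > 100]
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = ["tennis", "tennis", "monopoly", "monopoly"]

clf = BernoulliNB().fit(X, y)
print(clf.predict([[1, 0, 0]]))        # -> ['tennis']
print(clf.predict_proba([[1, 0, 0]]))  # probabilistic estimates for each class
```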

  20. Issues with Bayesian algorithms • Tractability – to find solutions we need numerical approximations or must take computational shortcuts • Can model causal relationships between variables • Need lots of data to estimate probabilities using observed training data frequencies

  21. Very important side problems • Feature Selection/Extraction – using Principal Component Analysis, Wavelets, Canonical Correlation, Factor Analysis, Independent Component Analysis • Imputation – what to do with missing features? • Visualisation – make the hypothesis human readable/interpretable • Meta learning – how to add functionality to existing algorithms, or combine the predictions of many classifiers (Boosting, Bagging, Confidence and Probability Machines)
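
As a sketch of feature extraction, here is Principal Component Analysis via scikit-learn on synthetic data; the dimensions chosen are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))         # 100 objects, 10 raw features

pca = PCA(n_components=3)              # extract the 3 strongest components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # -> (100, 3)
```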

  22. Very important side problems (continued) • How to incorporate domain knowledge into a learner • Trade-off between complexity (accuracy on training) vs. generalisation (accuracy on test) • Pre-processing of data – normalising, standardising, discretising. • How to test – leave-one-out, cross validation, stratification, online, offline…
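
One of the testing schemes above, k-fold cross validation, sketched with scikit-learn; the iris dataset and k = 5 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# 5-fold cross validation: train on 4 folds, test on the held-out fold, rotate
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores.mean())                   # estimated accuracy on unseen data
```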

  23. Unsupervised Learning – learning without a teacher

  24. An introduction to Unsupervised Learning • No need for a teacher/supervisor • Mainly clustering – trying to group objects into sensible clusters • Novelty detection – finding strange examples in data [Figures: examples grouped into clusters; novelty detection flagging strange examples]

  25. Algorithms available • For clustering: EM algorithm, K-Means, Self Organising Maps (SOM) • For novelty detection: 1-Class SVM, support vector regression, Neural Networks
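
A minimal clustering sketch with K-Means from the list above, assuming scikit-learn; the blob data is synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # unlabelled objects
km = KMeans(n_clusters=3, n_init=10).fit(X)  # group the objects into 3 clusters
print(km.cluster_centers_)             # the learnt cluster centres
```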

  26. Issues and Applications • Very useful for extracting information from data. • Used in medicine to identify disease subtypes. • Used to cluster web documents automatically. • Used to identify customer target groups in business. • Not much publicly available data to test algorithms with.

  27. Reinforcement Learning – learning inspired by nature

  28. An introduction • Most biologically plausible – feedback given through reward/punishment stimuli • A field with a lot of theory but in need of real-life applications (other than playing Backgammon) • But also encompasses the large field of Evolutionary Computing • Applications are more open ended • Getting closer to what the public consider AI.

  29. Traditional Reinforcement Learning • Techniques use dynamic programming to search for an optimal strategy • Algorithms search to maximise their reward. • Q-Learning (Chris Watkins, next door) is the most well-known technique. • The only successful applications are to games and toy problems. • A lack of real-life applications. • Very few researchers in this field.
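
A from-scratch sketch of tabular Q-Learning on a toy chain problem, in the spirit of the "games and toy problems" above; the environment, rates and episode count are all illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 5, 2             # 5-state chain; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))    # the table of action values to learn
alpha, gamma = 0.1, 0.9                # learning rate and discount (assumed)
rng = np.random.default_rng(0)

for _ in range(200):                   # episodes; behaviour policy is random,
    s = 0                              # which is fine as Q-learning is off-policy
    while s != n_states - 1:           # reward of 1 for reaching the rightmost state
        a = int(rng.integers(n_actions))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
print(Q)                               # right-moving actions get higher values
```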

  30. Evolutionary Computing • Inspired by the process of biological evolution. • Essentially an optimisation technique – the problem is encoded as a chromosome. • We find new/better solutions to the problem by sexual reproduction and mutation. [Figure: reproducing chromosomes – this will encourage mutation]
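
A compact sketch of a genetic algorithm in this spirit: selection, crossover (sexual reproduction) and random mutation on a bit-string chromosome; the toy fitness function, rates and population size are assumptions.

```python
import random

random.seed(0)
CHROM_LEN = 20

def fitness(chromosome):
    return sum(chromosome)             # toy objective: maximise the number of 1-bits

# Initial population of random bit-string chromosomes
pop = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(30)]

for _ in range(100):                   # generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                 # selection: keep the fittest solutions
    children = []
    while len(children) < 20:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, CHROM_LEN)
        child = a[:cut] + b[cut:]      # crossover: sexual reproduction
        for i in range(CHROM_LEN):     # mutation: random copying errors
            if random.random() < 0.01:
                child[i] = 1 - child[i]
        children.append(child)
    pop = parents + children

print(fitness(max(pop, key=fitness)))  # best fitness found (at most 20)
```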

  31. Techniques available in Evolutionary Computing • Lower level optimisers: • Evolutionary Programming, Evolutionary Algorithms • Genetic Programming, Genetic Algorithms • Evolutionary Strategy • Simulated Annealing • Higher level optimisers: • TABU search • Multi-objective optimisation [Figure: a Pareto front of optimal solutions plotted against Objective 1 and Objective 2 – which one should we pick?]

  32. Issues in Evolutionary Computing • How to encode the problem is very important • Setting mutation/crossover rates is very ad hoc • Very computation- and memory-intensive • Not much theory can be developed – frowned upon by machine learning theorists
