
Artificial Intelligence Lecture 2

Presentation Transcript


  1. Artificial Intelligence Lecture 2
  Dr. Bo Yuan, Professor
  Department of Computer Science and Engineering
  Shanghai Jiaotong University
  boyuan@sjtu.edu.cn

  2. Review of Lecture One
  • Overview of AI
    • Knowledge-based rules in logics (expert systems, automata, …): Symbolism in logics
    • Kernel-based heuristics (neural networks, SVM, …): Connectionism for nonlinearity
    • Learning and inference (Bayesian, Markovian, …): To sparsely sample for convergence
    • Interactive and stochastic computing (uncertainty, heterogeneity): To overcome the limits of the Turing Machine
  • Course Content
    • Focus mainly on learning and inference
    • Discuss current problems and research efforts
    • Perception and behavior (vision, robotics, NLP, bionics, …) not included
  • Exam
    • Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)
    • Course materials

  3. Today’s Content
  • Overview of machine learning
  • Linear regression
  • Gradient descent
  • Least-squares fit
  • Stochastic gradient descent
  • The normal equation
  • Applications

  4. Basic Terminologies
  • x = input variables/features
  • y = output variables/target variables
  • (x, y) = a training example; the i-th training example is (x^(i), y^(i))
  • m = number of training examples (i = 1, …, m)
  • n = number of input variables/features (j = 0, …, n)
  • h(x) = hypothesis/function/model that outputs the predictive value for a given input x
  • θ = parameters/weights, which parameterize the mapping from x to its predictive value
  • We define x0 = 1 (the intercept term), so the hypothesis can be written in matrix form: h_θ(x) = Σ_{j=0..n} θ_j x_j = θ^T x
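A minimal sketch (not from the slides) of this hypothesis in NumPy; the feature and parameter values below are illustrative only:

```python
# Hypothesis h_theta(x) = theta^T x, with x0 = 1 prepended so theta[0] is the intercept.
import numpy as np

def hypothesis(theta, x):
    """Linear hypothesis: h_theta(x) = theta^T x, where x already includes x0 = 1."""
    return theta @ x

x_raw = np.array([2.0, 3.0])          # n = 2 input features
x = np.concatenate(([1.0], x_raw))    # prepend the intercept term x0 = 1
theta = np.array([0.5, 1.0, -2.0])    # one parameter per feature, including the intercept
print(hypothesis(theta, x))           # 0.5 + 1.0*2.0 - 2.0*3.0 = -3.5
```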

  5. Gradient Descent
  Using the matrix X to stack the training samples, the hypothesis is h_θ(x) = θ^T x.
  The cost function is defined as:
  J(θ) = (1/2) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2
  Gradient descent is based on the partial derivatives of J with respect to θ:
  ∂J/∂θ_j = Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i)
  The (batch) algorithm is therefore:
  Loop {
    θ_j := θ_j − α Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
  }
  There is an alternative way to iterate, called stochastic gradient descent, which updates θ using one training example at a time:
  For i = 1 to m {
    θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
  }
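A minimal sketch of both update rules, assuming a NumPy design matrix X whose first column is the intercept x0 = 1; the learning rate alpha, the iteration counts, and the toy data are illustrative choices, not values from the slides:

```python
# Batch and stochastic gradient descent for linear regression,
# minimizing J(theta) = (1/2) * sum_i (theta^T x_i - y_i)^2.
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)      # partial derivatives of J w.r.t. theta
        theta -= alpha * grad             # update every theta_j using all m examples
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=50):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):    # one training example at a time
            err = X[i] @ theta - y[i]
            theta -= alpha * err * X[i]       # update after each example
    return theta

# Toy data: y = 1 + 2*x, with x0 = 1 as the first column of X.
X = np.column_stack([np.ones(100), np.linspace(0, 1, 100)])
y = 1 + 2 * X[:, 1]
print(batch_gradient_descent(X, y))        # ~ [1, 2]
print(stochastic_gradient_descent(X, y))   # ~ [1, 2]
```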

  6. Normal Equation
  An explicit, closed-form way to obtain θ directly.

  7. The Optimization Problem by the Normal Equation
  We set the derivatives of J(θ) to zero and obtain the normal equations:
  X^T X θ = X^T y,  so that  θ = (X^T X)^(−1) X^T y
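A minimal sketch of the normal equation in NumPy; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability, and the toy data are illustrative only:

```python
# Closed-form least squares via the normal equation: theta = (X^T X)^{-1} X^T y.
import numpy as np

def normal_equation(X, y):
    # Solve X^T X theta = X^T y directly rather than inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])  # x0 = 1 column
y = np.array([2.9, 5.1, 7.0, 9.2, 10.8])   # roughly y = 1 + 2x
print(normal_equation(X, y))               # close to [1, 2]
```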

  8. Today’s Content
  • Linear regression
  • Locally weighted regression (an adaptive method)
  • Probabilistic interpretation
  • Maximum Likelihood Estimation vs. Least Squares (Gaussian distribution)
  • Classification by logistic regression
  • LMS updating
  • A perceptron-based learning algorithm

  9. Linear Regression
  • Number of features
    • Over-fitting and under-fitting issues
    • Feature selection problem (to be covered later)
    • Adaptivity issue
  • Some definitions:
    • Parametric learning (fixed set of parameters θ, with n being constant)
    • Non-parametric learning (the number of parameters grows with m, e.g., linearly)
  • Locally weighted regression (Loess/Lowess regression) is non-parametric
    • A bell-shaped weighting (not a Gaussian)
    • Every prediction requires the entire training set to fit a local model for the given query input (computational complexity); see the sketch below
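A minimal sketch of locally weighted regression under the standard exponential bell-shaped weight w^(i) = exp(−(x^(i) − x)^2 / (2τ^2)) (it resembles a Gaussian bump but is not used as a probability density); the bandwidth tau and the toy data are illustrative choices:

```python
# Locally weighted (Loess-style) regression: for each query point, solve a
# weighted least-squares fit using bell-shaped weights centered on the query.
import numpy as np

def locally_weighted_predict(X, y, x_query, tau=0.5):
    # Bell-shaped weights: w_i = exp(-||x_i - x_q||^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Noisy nonlinear data; the local linear fit adapts to each query's neighbourhood.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
X = np.column_stack([np.ones_like(x), x])        # x0 = 1 intercept column
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
print(locally_weighted_predict(X, y, np.array([1.0, 2.0])))   # near sin(2) ~ 0.91
```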

  10. Extension of Linear Regression
  • Linear additive (straight line): x1 = 1, x2 = x
  • Polynomial: x1 = 1, x2 = x, …, xn = x^(n−1)
  • Chebyshev orthogonal polynomial: x1 = 1, x2 = x, …, xn = 2x·x(n−1) − x(n−2)
  • Fourier trigonometric polynomial: x1 = 0.5, followed by sines and cosines of different frequencies of x
  • Pairwise interaction: the linear terms plus the products x_k1 · x_k2 (k1, k2 = 1, …, N)
  • …
  The central question underlying these representations is whether or not the optimization problem for θ remains convex.
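A minimal sketch of building design matrices for two of the basis expansions above; the helper names polynomial_features and chebyshev_features are hypothetical, not from the slides:

```python
# Basis expansions keep the model linear in theta, so fitting theta stays convex
# even though the features themselves are nonlinear functions of x.
import numpy as np

def polynomial_features(x, n):
    """Columns 1, x, x^2, ..., x^(n-1)."""
    return np.column_stack([x ** k for k in range(n)])

def chebyshev_features(x, n):
    """Columns follow the recurrence T_k = 2x*T_(k-1) - T_(k-2), with T_0 = 1, T_1 = x."""
    cols = [np.ones_like(x), x]
    for _ in range(2, n):
        cols.append(2 * x * cols[-1] - cols[-2])
    return np.column_stack(cols[:n])

x = np.linspace(-1, 1, 5)
print(polynomial_features(x, 4).shape)   # (5, 4)
print(chebyshev_features(x, 4).shape)    # (5, 4)
```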

  11. Probabilistic Interpretation
  • Why ordinary least squares (OLS)? Why not other power terms?
  • Assume y^(i) = θ^T x^(i) + ε^(i), where the ε^(i) are random noise terms with ε^(i) ~ N(0, σ^2)
  • The PDF of the Gaussian is p(ε) = (1 / (√(2π) σ)) exp(−ε^2 / (2σ^2))
  • This implies that p(y^(i) | x^(i); θ) = (1 / (√(2π) σ)) exp(−(y^(i) − θ^T x^(i))^2 / (2σ^2))
  • Why Gaussian for the random noise? The central limit theorem.
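A minimal sketch of this generative assumption, sampling y = θ^T x + ε with ε ~ N(0, σ^2); the parameter values are illustrative only:

```python
# Generate data under the Gaussian-noise model and evaluate p(y | x; theta) for one point.
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([1.0, 2.0])
sigma = 0.3

X = np.column_stack([np.ones(1000), rng.uniform(0, 1, 1000)])  # x0 = 1 intercept column
eps = rng.normal(0.0, sigma, size=1000)                        # Gaussian noise epsilon
y = X @ theta_true + eps

def gaussian_pdf(y_i, x_i, theta, sigma):
    # Density of one observation under the model, p(y | x; theta)
    return np.exp(-(y_i - theta @ x_i) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(gaussian_pdf(y[0], X[0], theta_true, sigma))
```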

  12. Maximum Likelihood (updated)
  • Consider that the training data are stochastic
  • Assume the (x^(i), y^(i)) are i.i.d. (independently and identically distributed)
  • The likelihood L(θ) is the probability of the data y given X, parameterized by θ:
    L(θ) = Π_{i=1..m} p(y^(i) | x^(i); θ)
  • What is Maximum Likelihood Estimation (MLE)? Choose the parameters θ that maximize L(θ), so as to make the training data set as probable as possible
  • The likelihood L(θ) is a function of the parameters; the probability is of the data

  13. The Equivalence of MLE and OLS
  Taking the log-likelihood ℓ(θ) = log L(θ) = m log(1 / (√(2π) σ)) − (1/σ^2) · (1/2) Σ_{i=1..m} (y^(i) − θ^T x^(i))^2,
  maximizing ℓ(θ) is the same as minimizing (1/2) Σ_{i=1..m} (y^(i) − θ^T x^(i))^2 = J(θ)!?
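A minimal numerical check of this equivalence, assuming the Gaussian noise model above: the log-likelihood is a constant minus J(θ)/σ^2, so whichever θ has the smaller J(θ) also has the larger log-likelihood. Data and parameter values are illustrative only:

```python
# Compare the least-squares cost J(theta) with the Gaussian log-likelihood l(theta).
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(0, 1, 50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.2, 50)
sigma = 0.2

def J(theta):
    r = X @ theta - y
    return 0.5 * r @ r                     # least-squares cost

def log_likelihood(theta):
    m = len(y)
    # l(theta) = -m*log(sqrt(2*pi)*sigma) - J(theta)/sigma^2
    return -m * np.log(np.sqrt(2 * np.pi) * sigma) - J(theta) / sigma ** 2

theta_a = np.array([1.0, 2.0])
theta_b = np.array([0.5, 1.5])
# The theta with the smaller J also has the larger log-likelihood: prints True True.
print(J(theta_a) < J(theta_b), log_likelihood(theta_a) > log_likelihood(theta_b))
```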

  14. Sigmoid (Logistic) Function
  g(z) = 1 / (1 + e^(−z))
  Other functions that increase smoothly from 0 to 1 could also be used, but for a couple of good reasons (which we will see next time with Generalized Linear Models) the choice of the logistic function is a natural one.
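A minimal sketch of the logistic function itself:

```python
# The logistic (sigmoid) function g(z) = 1 / (1 + e^{-z}) maps any real z into (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~[0.0067, 0.5, 0.9933]
```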

  15. Recall (note the positive sign rather than the negative one)
  Let's work with just one training example (x, y) and derive the gradient ascent rule:
  θ_j := θ_j + α (y − h_θ(x)) x_j
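A minimal sketch of gradient ascent on the logistic-regression log-likelihood, applying the per-example update with the positive sign; the hyperparameters and toy data are illustrative choices, not values from the slides:

```python
# Gradient ascent for logistic regression: theta_j := theta_j + alpha*(y - h_theta(x))*x_j.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, epochs=200):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(len(y)):                 # one training example at a time
            h = sigmoid(theta @ X[i])
            theta += alpha * (y[i] - h) * X[i]  # note the positive sign (ascent)
    return theta

# Toy separable data: label is 1 when the feature exceeds 0.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
X = np.column_stack([np.ones_like(x), x])       # x0 = 1 intercept column
y = (x > 0).astype(float)
theta = logistic_gradient_ascent(X, y)
print(sigmoid(theta @ np.array([1.0, 0.8])))    # close to 1 for a clearly positive example
```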

  16. One Useful Property of the Logistic Function
  g′(z) = g(z) (1 − g(z))
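A minimal numerical check of this identity, comparing the analytic form g(z)(1 − g(z)) against a central finite difference:

```python
# Verify g'(z) = g(z) * (1 - g(z)) numerically at a sample point.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
analytic = sigmoid(z) * (1 - sigmoid(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central finite difference
print(analytic, numeric)   # the two values agree to high precision
```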

  17. Identical to Least Squares Again?
  The resulting update rule, θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i), looks identical to the LMS rule for least squares, even though h_θ(x) is now the nonlinear logistic function.
