
Mini-course on Artificial Neural Networks and Bayesian Networks

Mini-course on Artificial Neural Networks and Bayesian Networks. Michal Rosen-Zvi. An intensive course at the Multidisciplinary Brain Research Center, Bar-Ilan University, May 2004 (academic year 5764).


Presentation Transcript


1. Mini-course on Artificial Neural Networks and Bayesian Networks. Michal Rosen-Zvi. The Multidisciplinary Brain Research Center, Bar-Ilan University, May 2004

2. Section 1: Introduction

3. Networks (1) • Networks serve as a visual way of displaying relationships. • Social networks are examples of 'flat' networks, where the only information is the relation between entities.

4. Example: collaboration network
1. Analyzing Cortical Activity Using Hidden Markov Models. Itay Gat, Naftali Tishby, and Moshe Abeles. Network: Computation in Neural Systems, August 1997.
2. Cortical Activity Flips Among Quasi-Stationary States. Moshe Abeles, Hagai Bergman, Itay Gat, Isaac Meilijson, Eyal Seidemann, Naftali Tishby, Eilon Vaadia. Proceedings of the National Academy of Sciences (PNAS), 1995.
3. Rigorous Learning Curve Bounds from Statistical Mechanics. David Haussler, Michael Kearns, H. Sebastian Seung, and Naftali Tishby. Full version in Machine Learning (1997).
4. H. S. Seung, Haim Sompolinsky, Naftali Tishby: Learning Curves in Large Neural Networks. COLT 1991: 112-127.
5. Yann LeCun, Ido Kanter, Sara A. Solla: Second Order Properties of Error Surfaces. NIPS 1990: 918-924.
6. Esther Levin, Naftali Tishby, Sara A. Solla: A Statistical Approach to Learning and Generalization in Layered Neural Networks. COLT 1989: 245-260.
7. Litvak V, Sompolinsky H, Segev I, and Abeles M (2003): On the Transmission of Rate Code in Long Feedforward Networks with Excitatory-Inhibitory Balance. Journal of Neuroscience, 23(7):3006-3015.
8. Senn, W., Segev, I., and Tsodyks, M. (1998): Reading Neural Synchrony with Depressing Synapses. Neural Computation 10: 815-819.
9. Tsodyks, M., I. Mit'kov, H. Sompolinsky (1993): Pattern of Synchrony in Inhomogeneous Networks of Oscillators with Pulse Interactions. Phys. Rev. Lett.
10. Memory Capacity of Balanced Networks (Yuval Aviel, David Horn and Moshe Abeles).
11. The Role of Inhibition in an Associative Memory Model of the Olfactory Bulb (Ofer Hendin, David Horn and Misha Tsodyks).
12. Information Bottleneck for Gaussian Variables. Gal Chechik, Amir Globerson, Naftali Tishby and Yair Weiss. Submitted to NIPS 2003.

5. Networks (2) • Artificial Neural Networks represent rules (deterministic relations) between input and output.

6. Networks (3) • Bayesian Networks represent probabilistic relations (conditional independencies and dependencies) between variables.

7. Outline • Introduction/Motivation • Artificial Neural Networks • The perceptron, multilayered feed-forward NN and recurrent NN • On-line (supervised) learning • Unsupervised learning and PCA • Classification • Capacity of networks • Bayesian networks (BN) • Bayes rule and the BN semantics • Classification using generative models • Applications: vision, text

8. Motivation • Research on ANNs is inspired by neurons in the brain and (partially) driven by the need for models of reasoning in the brain. • Scientists are challenged to use machines more effectively for tasks traditionally solved by humans (examples: driving a car, matching scientific referees to papers, and many others).

9. Questions • How can a network learn? • What will be the learning rate? • What are the limitations on the network capacity? • How can networks be used to classify data with no labels (unsupervised learning)? • What are the relations and differences between learning in ANN and learning in BN? • How can network models explain high-level reasoning?

10. History of (modern) ANNs and BNs. Timeline labels from the slide, in chronological order: McCulloch and Pitts model, Hebbian learning rule, perceptron, Minsky and Papert's book, Hopfield network, statistical physics, Pearl's book.

11. Section 2: On-line Learning. Based on slides from Michael Biehl's summer course.

12. Section 2.1: The Perceptron

13. The Perceptron. Input: ξ. Adaptive weights: J. Output: S.
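A minimal sketch of this device in Python/NumPy (the course points to matlab code that is not part of the transcript; all names here are illustrative, and the binary output S = sign(J·ξ) anticipates the next slide):

```python
import numpy as np

def perceptron_output(J, xi):
    """Perceptron output S = sign(J . xi) for input xi and adaptive weights J."""
    return np.sign(J @ xi)

rng = np.random.default_rng(0)
N = 100                                 # input dimension
J = rng.standard_normal(N)              # adaptive weights
xi = rng.choice([-1.0, 1.0], size=N)    # one random binary input
print(perceptron_output(J, xi))         # prints +1.0 or -1.0
```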

  14. תשס״דבר־ אילןאוניברסיטתהמוחלחקרברשתות המרכזהרבתחומימרוכזקורס W Perceptron: binary output Implements a linearly separable classification of inputs Milestones: Perceptron convergence theorem, Rosenblatt (1958) Capacity, winder (1963) Cover(1965) Statistical Physics of perceptron weights, Gardner (1988) How does this device learn? Mini-course on ANN and BN, The Multidisciplinary Brain Research center, Bar-Ilan University, May 2004

15. Learning a linearly separable rule from reliable examples • Unknown rule: S_T(ξ) = sign(B·ξ) = ±1 defines the correct classification. It is parameterized through a teacher perceptron with weights B ∈ R^N (B·B = 1). • Only available information: example data D = {ξ^μ, S_T(ξ^μ) = sign(B·ξ^μ)} for μ = 1…P.
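To make the setup concrete, a hedged Python/NumPy sketch of the teacher scenario (dimension, sample count, and variable names are my choices, not the course's):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 100, 500

# Teacher perceptron: random weights normalized so that B . B = 1
B = rng.standard_normal(N)
B /= np.linalg.norm(B)

# Example data D: P isotropic random inputs and their teacher labels
xi = rng.standard_normal((P, N))   # rows are the inputs xi^mu
S_T = np.sign(xi @ B)              # correct classifications, +/-1
```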

16. Learning a linearly… (cont.) • Training: finding the student weights J. • J parameterizes a hypothesis S_S(ξ) = sign(J·ξ). • Supervised learning is based on the student performance with respect to the training data D. • Binary error measure: ε_T^μ(J) = 1 if S_S(ξ^μ) ≠ S_T(ξ^μ), and ε_T^μ(J) = 0 if S_S(ξ^μ) = S_T(ξ^μ).

17. Off-line learning • Guided by the minimization of a cost function H(J), e.g., the training error H(J) = Σ_μ ε_T^μ(J). Equilibrium statistical mechanics treatment: • Energy H of N degrees of freedom • Ensemble of systems in thermal equilibrium at a formal temperature • Disorder average over random examples (replicas); assumes a distribution over the inputs • Macroscopic description, order parameters • Typical properties of large systems, P = αN

18. On-line training • Single presentation of an uncorrelated (new) example {ξ^μ, S_T(ξ^μ)} • Update of the student weights: J(μ) = J(μ−1) + ΔJ(μ) • Learning dynamics in discrete time μ = 1, 2, …

19. On-line training, statistical physics approach • Consider a sequence of independent, random inputs ξ^μ • Thermodynamic limit N → ∞ • Disorder average over the latest example; self-averaging properties • Continuous time limit α = μ/N

20. Generalization. Performance of the student (after training) with respect to an arbitrary, new input ξ. • In practice: empirical mean of the error measure over a set of test inputs. • In the theoretical analysis: average over the (assumed) probability density of inputs. Generalization error: ε_g(J) = ⟨ε(J, ξ)⟩_ξ.

21. Generalization (cont.) The simplest model distribution: isotropic density P(ξ), with ξ uncorrelated with B and J. Consider vectors of independent identically distributed (iid) components ξ_j with ⟨ξ_j⟩ = 0 and ⟨ξ_j²⟩ = 1.

  22. תשס״דבר־ אילןאוניברסיטתהמוחלחקרברשתות המרכזהרבתחומימרוכזקורס B J  Geometric argument Projection of data into (B, J)-plane yields isotropic density of inputs g=/ ST()=SS() For |B|=1 Mini-course on ANN and BN, The Multidisciplinary Brain Research center, Bar-Ilan University, May 2004

23. Overlap parameters. Sufficient to quantify the success of learning: R = B·J, Q = J·J. Random guessing: R = 0, ε_g = 1/2. Perfect generalization: R/√Q = 1, ε_g = 0.

24. Derivation for large N. Given B, J, and an uncorrelated random input with ⟨ξ_i⟩ = 0 and ⟨ξ_i ξ_j⟩ = δ_ij, consider the student/teacher fields, which are sums of (many) independent random quantities: x = J·ξ = Σ_i J_i ξ_i, y = B·ξ = Σ_i B_i ξ_i.

25. Central Limit Theorem. The joint density of (x, y) is, for N → ∞, a two-dimensional Gaussian, fully specified by the first and second moments: ⟨x⟩ = Σ_i J_i⟨ξ_i⟩ = 0, ⟨y⟩ = Σ_i B_i⟨ξ_i⟩ = 0, ⟨x²⟩ = Σ_ij J_i J_j⟨ξ_i ξ_j⟩ = Σ_i J_i² = Q, ⟨y²⟩ = Σ_ij B_i B_j⟨ξ_i ξ_j⟩ = Σ_i B_i² = 1, ⟨xy⟩ = Σ_ij J_i B_j⟨ξ_i ξ_j⟩ = Σ_i J_i B_i = R.
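These moments can be checked numerically; the following sketch (my construction, with arbitrary sizes) draws many random inputs and compares the empirical field moments with Q, 1, and R:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 500, 20000

B = rng.standard_normal(N); B /= np.linalg.norm(B)   # so <y^2> -> 1
J = rng.standard_normal(N)                           # arbitrary student

Q, R = J @ J, B @ J

xi = rng.choice([-1.0, 1.0], size=(P, N))  # binary inputs: <xi_i> = 0, <xi_i^2> = 1
x, y = xi @ J, xi @ B                      # student and teacher fields

print(np.mean(x ** 2), "vs Q =", Q)
print(np.mean(y ** 2), "vs 1")
print(np.mean(x * y), "vs R =", R)
```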

26. Central Limit Theorem (cont.) Details of the input density are irrelevant. Some possible examples: binary, ξ_i = ±1 with equal probability; uniform; Gaussian.

27. Generalization error. The isotropic distribution is also assumed to describe the statistics of the example data inputs. Exercise: derive the generalization error as a function of R and Q (use the mathematical notes).
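For reference, the result the exercise leads to (a standard property of jointly Gaussian fields with correlation coefficient ρ = R/√Q; stated here as a hint, not as the course's official solution):

```latex
\epsilon_g \;=\; \left\langle \Theta(-x\,y) \right\rangle
          \;=\; \Pr\!\left[\operatorname{sign}(x) \neq \operatorname{sign}(y)\right]
          \;=\; \frac{1}{\pi}\arccos\!\left(\frac{R}{\sqrt{Q}}\right).
```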

28. Assumptions about the data • No spatial correlations • No distinguished directions in the input space • No temporal correlations • No correlations with the rule • Single presentation without repetitions. Consequences: • The average over the data can be performed step by step • The actual choice of B is irrelevant; it is not necessary to average over the teacher.

29. Hebbian learning (revisited), Hebb 1949 • Off-line interpretation (Vallet 1989): choice of the student weights given D = {ξ^μ, S_T^μ}, μ = 1…P: J(P) = (1/N) Σ_μ ξ^μ S_T^μ • Equivalent on-line interpretation: dynamics upon single presentation of examples, J(μ) = J(μ−1) + (1/N) ξ^μ S_T^μ
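A short Python/NumPy sketch showing that the two interpretations produce the same weights (the setup variables are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 100, 1000
B = rng.standard_normal(N); B /= np.linalg.norm(B)
xi = rng.standard_normal((P, N))
S_T = np.sign(xi @ B)

# Off-line Hebb: J(P) = (1/N) sum_mu xi^mu S_T^mu
J_off = (xi * S_T[:, None]).sum(axis=0) / N

# On-line Hebb: one pass, single presentation of each example
J_on = np.zeros(N)
for mu in range(P):
    J_on += xi[mu] * S_T[mu] / N

print(np.allclose(J_off, J_on))  # True: the two sums are identical
```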

30. Hebb: on-line. From microscopic to macroscopic: recursions for the overlaps. Exercise: derive the update equations of R and Q.

31. Hebb: on-line (cont.) Average over the latest example: the random input ξ^μ enters only through the fields x and y. The random input ξ^μ and J(μ−1), B are statistically independent. The Central Limit Theorem applies and yields the joint density of the fields.

32. Hebb: on-line (cont.) Exercise: derive the update equations of R and Q as a function of α (use the mathematical notes).

33. Hebb: on-line (cont.) Continuous time limit: N → ∞, α = μ/N, dα = 1/N. Initial conditions (tabula rasa): R(0) = Q(0) = 0. What are the mean values after training with αN examples? (See the simulation sketch below.)
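The matlab code referenced by the slide is not included in the transcript; a Python/NumPy stand-in is sketched below. It runs the on-line Hebb dynamics from tabula rasa and records R and Q as functions of α, next to the standard closed-form results R(α) = √(2/π)·α and Q(α) = (2/π)α² + α, which are the target of the exercises on the following slides:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1000
alpha_max = 10.0

B = rng.standard_normal(N); B /= np.linalg.norm(B)   # teacher, B.B = 1
J = np.zeros(N)                                      # tabula rasa: R(0) = Q(0) = 0

rows = []
for mu in range(1, int(alpha_max * N) + 1):
    xi = rng.standard_normal(N)        # fresh, uncorrelated example
    J += xi * np.sign(B @ xi) / N      # on-line Hebb step
    if mu % N == 0:                    # sample once per unit of alpha
        a = mu / N
        rows.append((a, B @ J, np.sqrt(2 / np.pi) * a,
                     J @ J, (2 / np.pi) * a ** 2 + a))

for a, R_sim, R_th, Q_sim, Q_th in rows:
    print(f"alpha={a:4.1f}  R={R_sim:.3f} (theory {R_th:.3f})"
          f"  Q={Q_sim:.3f} (theory {Q_th:.3f})")
```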

34. Hebb: on-line mean values. The order parameters Q and R are self-averaging for infinite N. Self-averaging property of a quantity A(J): the observation of a value of A different from its mean occurs with vanishing probability.

35. Learning curve: α dependence of the order parameters. Exercise: solve the differential equations for R and Q. Exercise: find the function ε_g(α).
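For reference, the solutions these exercises are after, under the assumptions of slide 28 (the standard result for on-line Hebbian learning; it is not stated explicitly in the transcript):

```latex
\frac{dR}{d\alpha} = \sqrt{\frac{2}{\pi}}, \qquad
\frac{dQ}{d\alpha} = 2\sqrt{\frac{2}{\pi}}\,R + 1
\;\;\Longrightarrow\;\;
R(\alpha) = \sqrt{\frac{2}{\pi}}\,\alpha, \qquad
Q(\alpha) = \frac{2}{\pi}\,\alpha^{2} + \alpha,
\qquad
\epsilon_g(\alpha) = \frac{1}{\pi}\arccos\frac{R}{\sqrt{Q}}
                   = \frac{1}{\pi}\arccos\sqrt{\frac{2\alpha}{2\alpha+\pi}}.
```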

36. Learning curve: α dependence of the order parameters. The normalized overlap R/√Q between the two vectors B and J provides the angle between them: cos θ = R/√Q.

37. Learning curve: α dependence of the order parameters. Exercise: find the asymptotic behavior of ε_g(α).

38. Asymptotic expansion. (The original slide says to draw this with matlab; see the plotting sketch below.)
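A Python/matplotlib stand-in for the missing plot, assuming the Hebbian learning curve above; the asymptote ε_g ≈ 1/√(2πα) follows from expanding the arccos around 1:

```python
import numpy as np
import matplotlib.pyplot as plt

alpha = np.linspace(0.1, 100, 1000)
eps_g = np.arccos(np.sqrt(2 * alpha / (2 * alpha + np.pi))) / np.pi
asymptote = 1 / np.sqrt(2 * np.pi * alpha)

plt.loglog(alpha, eps_g, label='eps_g(alpha), on-line Hebb')
plt.loglog(alpha, asymptote, '--', label='1/sqrt(2*pi*alpha) asymptote')
plt.xlabel('alpha = P/N')
plt.ylabel('generalization error')
plt.legend()
plt.show()
```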

39. Questions: • What other learning algorithms can be used for efficient learning? • Which training algorithm provides the best learning, i.e., the fastest asymptotic decrease?

40. Modified Hebbian learning. The training algorithm is defined by a modulation function f: J(μ) = J(μ−1) + (1/N) f(·) ξ^μ S_T^μ. Restriction: f may depend only on available quantities: f(J(μ−1), ξ^μ, S_T^μ).

41. Perceptron, Rosenblatt 1959 • If the classification is correct, don't change the weights. • If the classification is incorrect: • if the right class for the μ-th example is +1, J(μ)·ξ^μ increases; • if the right class for the μ-th example is −1, J(μ)·ξ^μ decreases.

42. Perceptron • Only informative points are used (mistake driven). • The solution is a linear combination of the training points. • Converges only for linearly separable data. Exercise: derive the update equations of R and Q as a function of ξ, J, B and α.
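A sketch of the mistake-driven update in the modulation-function form of slide 40, with f = 1 on errors and f = 0 otherwise (Python/NumPy; the learning-rate convention is my choice):

```python
import numpy as np

def perceptron_step(J, xi, S_T):
    """One Rosenblatt step: update only when the example is misclassified."""
    if np.sign(J @ xi) != S_T:      # mistake driven: f = 1 only on errors
        J = J + xi * S_T / len(J)   # pushes J.xi toward the correct sign
    return J

rng = np.random.default_rng(5)
N = 200
B = rng.standard_normal(N); B /= np.linalg.norm(B)   # teacher
J = np.zeros(N)
for mu in range(50 * N):                             # train up to alpha = 50
    xi = rng.standard_normal(N)
    J = perceptron_step(J, xi, np.sign(B @ xi))

print(np.arccos(B @ J / np.linalg.norm(J)) / np.pi)  # eps_g = theta / pi
```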

43. On-line dynamics. Biehl and Riegler 1994.

44. Questions: • Find the asymptotic behavior (by simulations and/or analytically) of the generalization error for the perceptron algorithm and the Hebb algorithm; which one is better? • Which training algorithm provides the best learning, i.e., the fastest asymptotic decrease?

45. Learning Curve - Hebb and Perceptron

46. Section 2.2: On-line by gradient descent

47. Introduction. Commonly used in practical applications: a multilayered neural network with continuous activation functions, whose output is a differentiable function of the adaptive parameters. Can be used for fitting a function to data.

  48. תשס״דבר־ אילןאוניברסיטתהמוחלחקרברשתות המרכזהרבתחומימרוכזקורס y  Linear perceptron and linear regression (1D) x=J Using a quadratic loss function and gradient descent for finding the best curve to fit a data set [see ◘, off-line] Mini-course on ANN and BN, The Multidisciplinary Brain Research center, Bar-Ilan University, May 2004

49. Simple case: 'linear perceptron'. Teacher: S_T(ξ) = y = B·ξ. Student: S_S(ξ) = x = J·ξ. Training and performance evaluation are based on the quadratic error ε = (x − y)²/2. Consider the training dynamics. Exercise: derive the update equations of R and Q as a function of α.

50. 'Linear perceptron' (cont.) Some exercises: • Write matlab code for the linear perceptron, teacher-student scenario (a Python stand-in is sketched below). • Show that … • Investigate the role of the learning rate η. • Find the asymptotic decrease to zero error.
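The slide asks for matlab; an equivalent Python/NumPy sketch of the on-line teacher-student scenario with quadratic loss is given below (learning rate, sizes, and the printed diagnostic are my choices). For isotropic inputs the generalization error is ε_g = ⟨(x − y)²⟩/2 = (Q − 2R + 1)/2, which the loop tracks:

```python
import numpy as np

rng = np.random.default_rng(6)
N, alpha_max, eta = 500, 20.0, 0.5

B = rng.standard_normal(N); B /= np.linalg.norm(B)   # teacher, B.B = 1
J = np.zeros(N)                                      # student

for mu in range(1, int(alpha_max * N) + 1):
    xi = rng.standard_normal(N)
    x, y = J @ xi, B @ xi              # student and teacher fields
    J -= eta * (x - y) * xi / N        # gradient step on (x - y)^2 / 2
    if mu % (2 * N) == 0:
        R, Q = B @ J, J @ J
        print(f"alpha = {mu/N:5.1f}   eps_g = {0.5 * (Q - 2*R + 1):.5f}")
```

Rerunning with different η illustrates the role of the learning rate: a small η converges slowly, while too large an η makes the dynamics unstable.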
