Information Theoretic Learning: Finding structure in data
Jose Principe and Sudhir Rao, University of Florida
principe@cnel.ufl.edu, www.cnel.ufl.edu
Outline • Structure • Connection to Learning • Learning Structure – the old view • A new framework • Applications
Structure
• Structure: patterns / regularities; interdependence between subsystems
• No structure: amorphous / chaos; white noise
Types of Learning
• Supervised learning: data plus a desired signal (teacher)
• Reinforcement learning: data plus rewards/punishments
• Unsupervised learning: only the data
Unsupervised Learning
• What can be done with only the data?
First principles and examples:
• Preserve maximum information: auto-associative memory, ART, PCA, Linsker’s “infomax” rule, …
• Extract independent features: Barlow’s minimum redundancy principle, ICA, etc.
• Learn the probability distribution: Gaussian mixture models, EM algorithm, parametric density estimation
Connection to Self Organization
“If cell 1 is one of the cells providing input to cell 2, and if cell 1’s activity tends to be “high” whenever cell 2’s activity is “high”, then the future contributions that the firing of cell 1 makes to the firing of cell 2 should increase.” - Donald Hebb, 1949, neuropsychologist
[Diagram: cells A, B and C with + and - connections; increase $w_B$ in proportion to the activity of B and C]
What is the purpose?
“Does the Hebb-type algorithm cause a developing perceptual network to optimize some property that is deeply connected with the mature network’s functioning as an information processing system?” - Linsker, 1988
Linsker’s Infomax principle
• Linear network: inputs X1, X2, …, XL with weights w1, …, wL and additive noise produce the output Y.
• Under Gaussian assumptions and uncorrelated noise, the rate for a linear network has a closed form.
• Maximize rate = maximize Shannon rate I(X,Y), which leads to a Hebbian rule!
Minimum Entropy Coding
• Barlow’s redundancy principle: code the M stimuli (Stimulus 1 … Stimulus M) into N features (Feature 1 … Feature N).
• Independent features mean no redundancy: ICA!
• This converts an M-dimensional problem into N one-dimensional problems.
• Only N conditional probabilities P(V|Feature i) are required for an event V, instead of the $2^M$ conditional probabilities P(V|stimuli).
Summary 1
• Unsupervised learning (example: PCA): discovering structure in data
• Global objective function (example: Infomax): extracting the desired signal from the data itself
• Self-organizing rule (example: Hebbian rule): revealing the structure through interaction of the data points
Questions
• Can we go beyond these preprocessing stages?
• Can we create a global cost function that extracts “goal oriented structures” from the data?
• Can we derive a self-organizing principle from such a cost function?
A big YES!
What is Information Theoretic Learning?
ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence.
The centerpiece is a non-parametric estimator for entropy that:
• Does not require an explicit estimation of the pdf
• Uses the Parzen window method, which is known to be consistent and efficient
• Is smooth
• Is readily integrated in conventional gradient descent learning
• Provides a link to kernel learning and SVMs
• Allows an extension to random processes
ITL is a different way of thinking about data quantification
Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g. similarity, Hebb’s postulate of learning) into second-order statistical equivalents.
ITL replaces second-order moments with a geometric statistical interpretation of data in probability spaces:
• Variance by entropy
• Correlation by correntropy
• Mean square error (MSE) by minimum error entropy (MEE)
• Distances in data space by distances in probability spaces
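Information Theoretic Learning: Entropy
[Figure: two example pdfs $f_X(x)$ plotted for x from -5 to 5]
• Not all random variables (r.v.) are equally random!
• Entropy quantifies the degree of uncertainty in a r.v. Claude Shannon defined entropy as $H_S(X) = -\sum_x p_X(x)\log p_X(x)$ for a discrete r.v., and $H_S(X) = -\int f_X(x)\log f_X(x)\,dx$ for a r.v. with pdf $f_X$.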
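Information Theoretic Learning: Renyi’s Entropy
• Renyi’s entropy of order α: $H_\alpha(X) = \frac{1}{1-\alpha}\log\int f_X^\alpha(x)\,dx$
• Norm of the pdf: $V_\alpha = \int f_X^\alpha(x)\,dx$ is the α-norm of the pdf raised to the power α, so $H_\alpha(X) = \frac{1}{1-\alpha}\log V_\alpha$
• Renyi’s entropy equals Shannon’s as $\alpha \to 1$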
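The limit α → 1 can be checked numerically. The sketch below is only an illustration and uses a discrete distribution rather than a pdf; the function names and the example probabilities are assumptions made for this example, not part of the original slides.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    if alpha <= 0.0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(pi ** alpha for pi in p if pi > 0.0)) / (1.0 - alpha)


# Hypothetical example distribution.
p = [0.5, 0.25, 0.125, 0.125]
h_shannon = shannon_entropy(p)
h_near_one = renyi_entropy(p, 1.0 + 1e-6)  # order very close to 1
h_quadratic = renyi_entropy(p, 2.0)        # order-2 (quadratic) entropy

print(h_shannon, h_near_one, h_quadratic)
```

For this distribution the order-2 value is smaller than the Shannon value, which is consistent with Renyi’s entropy being non-increasing in α.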
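Information Theoretic Learning: Parzen windowing
[Figure: Parzen estimates of $f_X(x)$ for x from -5 to 5, with N = 10 and N = 1000 samples, together with the kernel function]
• Given only samples $\{x_1,\dots,x_N\}$ drawn from a distribution: $\hat f_X(x) = \frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x - x_i)$, where $\kappa_\sigma$ is the kernel function with size σ
• Convergence: as $N \to \infty$, $\hat f_X(x) \to f_X(x) * \kappa_\sigma(x)$, the true pdf convolved with the kernel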
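A minimal sketch of a Parzen window estimate, assuming one-dimensional data and a Gaussian kernel. The sample values, the kernel size and the function names are illustrative assumptions. The estimate is evaluated on a grid and integrated numerically to confirm that it behaves like a pdf.

```python
import math


def gaussian_kernel(u, sigma):
    """Gaussian kernel of size sigma evaluated at u."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def parzen_estimate(x, samples, sigma):
    """Parzen window estimate of the pdf at x: average of kernels centred on the samples."""
    return sum(gaussian_kernel(x - xi, sigma) for xi in samples) / len(samples)


# Hypothetical samples and kernel size.
samples = [-1.2, -0.4, 0.0, 0.3, 1.1]
sigma = 0.5

# Evaluate on a grid from -6 to 6 and integrate with a Riemann sum.
step = 0.01
grid = [-6.0 + step * k for k in range(1201)]
density = [parzen_estimate(x, samples, sigma) for x in grid]
area = step * sum(density)

print(area)
```

The kernel size trades smoothness against resolution: a larger sigma gives a smoother estimate, a smaller one follows the individual samples more closely.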
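Information Theoretic Learning: Renyi’s Quadratic Entropy
• Order-2 entropy and Gaussian kernels: $H_2(X) = -\log\int f_X^2(x)\,dx$, estimated from the samples as $\hat H_2(X) = -\log V_2(X)$ with $V_2(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)$
• Pairwise interactions between samples: $O(N^2)$
• The information potential $V_2(X)$ provides a potential field over the space of the samples, parameterized by the kernel size σ
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.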
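The pairwise sum can be sketched in a few lines. This is an illustrative one-dimensional implementation, assuming a Gaussian Parzen kernel of size sigma (so that the pairwise kernel has size sigma·√2); the datasets and function names are assumptions made for the example.

```python
import math


def gaussian_kernel(u, s):
    """Gaussian kernel of size s evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def information_potential(samples, sigma):
    """V_2(X): average of Gaussian kernels over all sample pairs, O(N^2)."""
    s = sigma * math.sqrt(2.0)  # two Parzen kernels of size sigma convolve to size sigma*sqrt(2)
    n = len(samples)
    total = sum(gaussian_kernel(xj - xi, s) for xi in samples for xj in samples)
    return total / (n * n)


def quadratic_entropy(samples, sigma):
    """Estimate of Renyi's quadratic entropy: -log V_2(X)."""
    return -math.log(information_potential(samples, sigma))


# Hypothetical datasets: the same shape at two different scales.
tight = [0.0, 0.1, -0.1, 0.05]
spread = [0.0, 2.0, -2.0, 1.0]

print(quadratic_entropy(tight, 0.5), quadratic_entropy(spread, 0.5))
```

The tightly packed dataset has the larger information potential and therefore the smaller entropy estimate.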
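Information Theoretic Learning: Information Force
[Diagram: two samples $x_i$ and $x_j$ interacting]
• In adaptation, samples become information particles that interact through information forces.
• Information potential: $V_2(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)$
• Information force: $F(x_i) = \frac{\partial V_2(X)}{\partial x_i}$
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000. Erdogmus, Principe, Hild, Natural Computing, 2002.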
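A sketch of the information force as the exact derivative of the information potential with respect to one sample, assuming one-dimensional data and a Gaussian kernel. The constant factor comes from differentiating the double sum directly and may differ from other published conventions; the function names and data are assumptions for illustration.

```python
import math


def gaussian_kernel(u, s):
    """Gaussian kernel of size s evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def information_potential(samples, sigma):
    """V_2(X): average of Gaussian kernels of size sigma*sqrt(2) over all pairs."""
    s = sigma * math.sqrt(2.0)
    n = len(samples)
    return sum(gaussian_kernel(xj - xi, s) for xi in samples for xj in samples) / (n * n)


def information_force(samples, sigma, i):
    """Derivative of V_2(X) with respect to samples[i]."""
    s = sigma * math.sqrt(2.0)
    n = len(samples)
    xi = samples[i]
    # Each pair appears twice in the double sum, which cancels the 2 in s^2 = 2*sigma^2.
    total = sum(gaussian_kernel(xj - xi, s) * (xj - xi) for xj in samples)
    return total / (n * n * sigma ** 2)


# Hypothetical particles.
points = [0.0, 0.7, 1.5]
forces = [information_force(points, 0.5, i) for i in range(len(points))]
print(forces)
```

Because every pairwise term is antisymmetric, the forces over the whole dataset sum to zero, and each particle is pulled toward its neighbours when the potential is being maximized.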
What will happen if we allow the particles to move under the influence of these forces?
[Figure: information force within a dataset arising due to H(X)]
Information Theoretic Learning: Backpropagation of Information Forces
[Block diagram labels: Input, Adaptive System, Output, Desired, IT Criterion, Information Forces, Adjoint Network, Weight Updates]
• Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.
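Information Theoretic Learning: Quadratic divergence measures
• Kullback-Leibler divergence: $D_{KL}(f\|g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx$
• Renyi’s divergence: $D_\alpha(f\|g) = \frac{1}{\alpha-1}\log\int f(x)\left(\frac{f(x)}{g(x)}\right)^{\alpha-1}dx$
• Euclidean distance: $D_{ED}(f,g) = \int \left(f(x)-g(x)\right)^2 dx$
• Cauchy-Schwarz distance: $D_{CS}(f,g) = -\log\frac{\left(\int f(x)g(x)\,dx\right)^2}{\int f^2(x)\,dx\,\int g^2(x)\,dx}$
• Mutual information is a special case (divergence between the joint and the product of marginals)
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.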
Information Theoretic Learning Unifying criterion for learning from samples
Training ADALINE sample by sample: stochastic information gradient (SIG)
• Theorem: the expected value of the stochastic information gradient (SIG) is the gradient of Shannon’s entropy estimated from the samples using Parzen windowing.
• For the Gaussian kernel and M = 1 (one past sample in the Parzen window), with error $e_k = d_k - w^T x_k$: $\frac{\partial \hat H_S(e_k)}{\partial w} = -\frac{1}{\sigma^2}(e_k - e_{k-1})(x_k - x_{k-1})$
• The form is the same as for LMS, except that entropy learning works with differences in samples.
• The SIG works implicitly with the L1 norm of the error.
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
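A minimal sketch of sample-by-sample ADALINE training with an LMS-like update on sample differences, assuming a Gaussian kernel, a window of one past sample, and error $e_k = d_k - w^T x_k$ (so the entropy of the error is minimized by stepping along $(e_k - e_{k-1})(x_k - x_{k-1})/\sigma^2$). The function name, step size, and the synthetic system with a constant offset are assumptions made for illustration.

```python
import random


def train_adaline_sig(inputs, desired, step=0.1, sigma=1.0):
    """Adapt ADALINE weights with a SIG-style update (Gaussian kernel, one past sample)."""
    w = [0.0] * len(inputs[0])
    prev_x, prev_d = None, None
    for x, d in zip(inputs, desired):
        if prev_x is not None:
            # Both errors are evaluated with the current weights.
            e_now = d - sum(wi * xi for wi, xi in zip(w, x))
            e_prev = prev_d - sum(wi * xi for wi, xi in zip(w, prev_x))
            gain = step * (e_now - e_prev) / sigma ** 2
            # Same form as LMS, but on differences of consecutive samples.
            w = [wi + gain * (xi - pxi) for wi, xi, pxi in zip(w, x, prev_x)]
        prev_x, prev_d = x, d
    return w


# Hypothetical noiseless system with a constant offset in the desired signal.
rng = random.Random(0)
true_w = [2.0, -1.0]
inputs = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(1000)]
desired = [sum(tw * xi for tw, xi in zip(true_w, x)) + 0.5 for x in inputs]

w = train_adaline_sig(inputs, desired)
print(w)
```

Because the update only sees differences, the constant offset in the desired signal has no effect on the weights; entropy does not depend on the mean of the error, so a bias would have to be set separately.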
SIG Hebbian updates
• In a linear network $y_k = w^T x_k$, the Hebbian update is $\Delta w_k = \eta\, y_k x_k$.
• The update maximizing Shannon output entropy with the SIG becomes $\Delta w_k = \frac{\eta}{\sigma^2}(y_k - y_{k-1})(x_k - x_{k-1})$.
• Which is more powerful and biologically plausible?
• Experiment: 50 samples of a 2D distribution were generated, where the x axis is uniform, the y axis is Gaussian, and the sample covariance matrix is the identity. Hebbian updates converged to arbitrary directions, but the SIG consistently found the 90 degree direction!
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
ITL - Applications
• System identification
• Feature extraction
• Clustering
• Blind source separation
The ITL section of www.cnel.ufl.edu has examples and Matlab code.
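Renyi’s cross entropy
• Let X and Y be two r.v.s with iid samples $\{x_i\}_{i=1}^{N}$ and $\{y_j\}_{j=1}^{M}$. Then Renyi’s cross entropy is given by $H_2(X;Y) = -\log\int f_X(z)\,f_Y(z)\,dz$
• Using the Parzen estimate for the pdfs gives $\hat H_2(X;Y) = -\log\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} G_{\sigma\sqrt{2}}(x_i - y_j)$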
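“Cross” information potential and “cross” information force
• Cross information potential: $V(X;Y) = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} G_{\sigma\sqrt{2}}(x_i - y_j)$
• Cross information force, the force between particles of two datasets: $F(x_i;Y) = \frac{\partial V(X;Y)}{\partial x_i}$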
Cross information force between two datasets arising due to H(X;Y)
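Cauchy-Schwarz Divergence
• A measure of similarity between two datasets: $D_{CS}(X;Y) = \log\frac{\int f_X^2(z)\,dz\,\int f_Y^2(z)\,dz}{\left(\int f_X(z)f_Y(z)\,dz\right)^2}$
• $D_{CS}(X;Y) = 0$ if and only if the two datasets have the same probability density function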
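A sketch of a sample-based Cauchy-Schwarz divergence, assuming one-dimensional data, Gaussian Parzen kernels of size sigma for both datasets, and plug-in estimates of the three integrals as (cross) information potentials. The function names and the datasets are assumptions made for this example.

```python
import math


def gaussian_kernel(u, s):
    """Gaussian kernel of size s evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def cross_information_potential(xs, ys, sigma):
    """V(X;Y): average kernel value over all pairs (x_i, y_j)."""
    s = sigma * math.sqrt(2.0)
    return sum(gaussian_kernel(x - y, s) for x in xs for y in ys) / (len(xs) * len(ys))


def cauchy_schwarz_divergence(xs, ys, sigma):
    """D_CS(X;Y) = log( V(X) V(Y) / V(X;Y)^2 ), estimated from samples."""
    v_x = cross_information_potential(xs, xs, sigma)
    v_y = cross_information_potential(ys, ys, sigma)
    v_xy = cross_information_potential(xs, ys, sigma)
    return math.log(v_x * v_y / (v_xy * v_xy))


# Hypothetical datasets: b is a shifted copy of a.
a = [-0.5, 0.0, 0.4]
b = [2.5, 3.0, 3.4]

print(cauchy_schwarz_divergence(a, a, 0.5), cauchy_schwarz_divergence(a, b, 0.5))
```

The divergence of a dataset with itself is zero, and it grows as the two datasets move apart.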
A New ITL Framework: Information Theoretic Mean Shift
STATEMENT
Consider a dataset $X_o$ with iid samples. We wish to find a new dataset $X$ which captures “interesting structures” of the original dataset $X_o$.
FORMULATION
Cost = redundancy reduction term + similarity measure term (weighted combination)
Information Theoretic Mean Shift
• Form 1: $J(X) = \min_X\; H(X) + \lambda\, D_{CS}(X;X_o)$
• This cost looks like a reaction-diffusion equation: the entropy term implements diffusion, and the Cauchy-Schwarz term implements attraction to the original data.
Analogy
• The weighting parameter λ squeezes the information flow through a bottleneck, extracting different levels of structure in the data.
• We can also visualize λ as a slope parameter. The previous methods used only λ = 1 or λ = 0.
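Self organizing rule
• Since $D_{CS}(X;X_o) = 2H(X;X_o) - H(X) - H(X_o)$, the cost function can be rewritten as $J(X) = (1-\lambda)H(X) + 2\lambda H(X;X_o) - \lambda H(X_o)$
• Differentiating with respect to $x_k$, k = 1, 2, …, N, setting the result to zero and rearranging gives a fixed-point update for each $x_k$!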
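The fixed-point iteration can be sketched as follows for one-dimensional data. The weights c1 and c2 are obtained here by differentiating $-(1-\lambda)\log V(X) - 2\lambda\log V(X;X_o)$ with Gaussian kernels; this is a derivation made for this sketch, so the constants may differ from the authors’ published form. The function names, the data, and the restriction to 0 ≤ λ ≤ 1 (which keeps all weights non-negative) are assumptions.

```python
import math


def gaussian_kernel(u, s):
    """Gaussian kernel of size s evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def cross_potential(xs, ys, s):
    """Average kernel value over all pairs; cross_potential(xs, xs, s) is V(X)."""
    return sum(gaussian_kernel(x - y, s) for x in xs for y in ys) / (len(xs) * len(ys))


def itms_step(xs, x0, lam, s):
    """One fixed-point update of every point in xs (assumes 0 <= lam <= 1)."""
    c1 = (1.0 - lam) / (len(xs) * cross_potential(xs, xs, s))  # entropy (diffusion) term
    c2 = lam / (len(x0) * cross_potential(xs, x0, s))          # attraction to the original data
    updated = []
    for xk in xs:
        num, den = 0.0, 0.0
        for xj in xs:
            g = c1 * gaussian_kernel(xk - xj, s)
            num += g * xj
            den += g
        for xo in x0:
            g = c2 * gaussian_kernel(xk - xo, s)
            num += g * xo
            den += g
        updated.append(num / den)
    return updated


def itms(x0, lam, s, iterations=50):
    """Run the fixed-point iteration starting from X = X_o."""
    xs = list(x0)
    for _ in range(iterations):
        xs = itms_step(xs, x0, lam, s)
    return xs


# Hypothetical data: two small groups of points; s is the kernel size used in the potentials.
x0 = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
modes = itms(x0, lam=1.0, s=0.5)
blend = itms(x0, lam=0.5, s=0.5)
print(modes)
print(blend)
```

With λ = 1 only the attraction to the original data remains and the points settle on the modes of the density estimate; with λ = 0 only the entropy term remains and the dataset is repeatedly blurred. For λ > 1 the weight c1 becomes negative, so the denominator needs extra care that this sketch does not provide.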
An Example Crescent shaped Dataset
Summary 2
Starting with the data:
• λ = 0: single point
• λ = 1: modes
• λ → ∞: back to data
Applications - Clustering
• Statement: segment data into different groups such that samples belonging to the same group are “closer” to each other than samples of different groups.
• The idea: mode-finding ability → clustering
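Mean Shift – a review
• Modes are stationary points of the equation $x = m(x) = \frac{\sum_{i} G_\sigma(x - x_{oi})\,x_{oi}}{\sum_{i} G_\sigma(x - x_{oi})}$, where $x_{oi}$ are the samples of the original dataset $X_o$.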
Two variants: GBMS and GMS
• Gaussian Blurring Mean Shift (GBMS): single dataset X; initialize X = Xo
• Gaussian Mean Shift (GMS): two datasets X and Xo; initialize X = Xo
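The two variants can be sketched side by side for one-dimensional data. In this illustration GMS moves a copy of the points over the fixed original data, while GBMS replaces the dataset itself at every iteration. The labelling helper, the tolerance, the iteration counts and the data are assumptions added here to show how mode finding turns into clustering.

```python
import math


def mean_shift(x, data, sigma):
    """Gaussian-weighted mean of data around x (kernel normalization cancels)."""
    weights = [math.exp(-0.5 * ((x - xi) / sigma) ** 2) for xi in data]
    return sum(w * xi for w, xi in zip(weights, data)) / sum(weights)


def gms(x0, sigma, iterations=100):
    """Gaussian Mean Shift: X moves, the original data Xo stays fixed."""
    xs = list(x0)
    for _ in range(iterations):
        xs = [mean_shift(x, x0, sigma) for x in xs]
    return xs


def gbms(x0, sigma, iterations=5):
    """Gaussian Blurring Mean Shift: the dataset itself is blurred each iteration."""
    xs = list(x0)
    for _ in range(iterations):
        xs = [mean_shift(x, xs, sigma) for x in xs]
    return xs


def label_by_mode(points, tol=1e-3):
    """Give the same label to points that ended up within tol of each other."""
    modes, labels = [], []
    for p in points:
        for idx, m in enumerate(modes):
            if abs(p - m) < tol:
                labels.append(idx)
                break
        else:
            modes.append(p)
            labels.append(len(modes) - 1)
    return labels


# Hypothetical data: two small groups of points.
x0 = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
gms_labels = label_by_mode(gms(x0, 0.5))
gbms_labels = label_by_mode(gbms(x0, 0.5))
print(gms_labels, gbms_labels)
```

GBMS needs a stopping rule: run for long enough, the blurred dataset keeps contracting and eventually merges into a single point, whereas GMS stops at the modes.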
Connection to ITMS
• λ = 1: GMS
• λ = 0: GBMS
Applications - Clustering
[Figure: 10 random Gaussian clusters and their pdf plot]
GBMS result GMS result
GBMS GMS
Applications - Principal Curves
• Nonlinear extension of PCA.
• “Self-consistent” smooth curves which pass through the “middle” of a d-dimensional probability distribution or data cloud.
A new definition (Erdogmus et al.)
A point $x$ is an element of the d-dimensional principal set, denoted by $P^d$, iff the gradient $g(x)$ of the pdf is orthogonal to at least (n-d) eigenvectors of the Hessian $H(x)$ and $p(x)$ is a strict local maximum in the subspace spanned by these eigenvectors.
PC continued…
• $P^0$ is a 0-dimensional principal set corresponding to the modes of the data, $P^1$ is the 1-dimensional principal curve, $P^2$ is a 2-dimensional principal surface, and so on …
• Hierarchical structure: $P^0 \subset P^1 \subset P^2 \subset \dots$
• ITMS satisfies this definition (experimentally).
• ITMS gives the principal curve for an appropriate choice of λ.
Denoising Chain of Ring Dataset
Applications - Vector Quantization
• Limiting case of ITMS (λ → ∞).
• Dcs(X;Xo) can be seen as a distortion measure between X and Xo.
• Initialize X with far fewer points than Xo.
Comparison ITVQ LBG