Elementary Concepts of Neural Networks Preliminaries of artificial neural network computation
Learning Behavioral improvement through increased information about the environment.
An experiment in learning • Pigeons as art experts (Watanabe et al. 1995) • Experiment: • Pigeon in Skinner box • Present paintings of two different artists (e.g. Chagall / Van Gogh) • Reward when presented a particular artist (e.g. Van Gogh)
Pigeons as art experts • Pigeons were able to discriminate between Van Gogh and Chagall with 95% accuracy (when presented with pictures they had been trained on) • Discrimination was still 85% successful for previously unseen paintings by the same artists • Pigeons do not simply memorise the pictures • They can extract and recognise patterns (the ‘style’) • They generalise from what they have already seen to make predictions • This is what neural networks (biological and artificial) are good at (unlike conventional computers)
What are Neural Networks? • Models of the brain and nervous system • Highly parallel • Learning • Very simple principles • Very complex behaviours • Applications • as biological models • as powerful problem solvers
Goals of neural computation • To understand how the brain actually works • To understand a new style of computation inspired by neurons and their adaptive connections • Very different style from sequential computation • should be good for things the brain is good at • should be bad for things the brain is bad at • To solve practical problems by using novel learning algorithms • Learning algorithms can be very useful even if they have nothing to do with how the brain works
A typical cortical neuron • Gross physical structure: there is one axon that branches, and a dendritic tree that collects input from other neurons. Axons typically contact dendritic trees at synapses; a spike of activity in the axon causes charge to be injected into the post-synaptic neuron. • Spike generation: there is an axon hillock that generates outgoing spikes whenever enough charge has flowed in.
Brain vs. Network [figure: a biological brain neuron shown alongside an artificial neural network]
Synapses • When a spike travels along an axon and arrives at a synapse it causes vesicles of transmitter chemical to be released • The transmitter molecules diffuse across the synaptic cleft and bind to receptor molecules in the membrane of the post-synaptic neuron, thus changing their shape • The effectiveness of the synapse can be changed • Synapses are slow, but they have advantages over RAM: they are massively parallel and they adapt using locally available signals (but how?)
How the brain works • Each neuron receives inputs from other neurons; some neurons also connect to receptors • Neurons use spikes to communicate, and the timing of spikes is important • The effect of each input line on the neuron is controlled by a synaptic weight, which can be positive or negative • The synaptic weights adapt so that the whole network learns to perform useful computations: recognizing objects, understanding language, making plans, controlling the body
Idealized neurons • To model things we have to idealize them (e.g. atoms) • Idealization removes complicated details that are not essential for understanding the main principles • It allows us to apply mathematics and to make analogies to other, familiar systems • Once we understand the basic principles, it's easy to add complexity to make the model more faithful • It is often worth understanding models that are known to be wrong (but we mustn't forget that they are wrong!) • E.g. neurons that communicate real values rather than discrete spikes of activity
Binary threshold neurons • McCulloch-Pitts (1943): influenced Von Neumann! • First compute a weighted sum of the inputs from other neurons: $z = \sum_i x_i w_i$ • Then send out a fixed-size spike of activity if the weighted sum exceeds a threshold: $y = \begin{cases} 1 & \text{if } z \geq \theta \\ 0 & \text{otherwise} \end{cases}$ • Maybe each spike is like the truth value of a proposition and each neuron combines truth values to compute the truth value of another proposition!
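A minimal sketch of a binary threshold (McCulloch-Pitts) unit in NumPy; the weights, inputs, and threshold value are illustrative choices, not taken from the slides:

```python
import numpy as np

def binary_threshold_neuron(x, w, theta):
    """McCulloch-Pitts unit: output 1 iff the weighted input sum reaches the threshold."""
    z = np.dot(w, x)            # weighted sum of the inputs
    return 1 if z >= theta else 0

# Illustrative example: with these weights and threshold the unit acts like logical AND
w = np.array([1.0, 1.0])
theta = 1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, binary_threshold_neuron(np.array(x, dtype=float), w, theta))
```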
Linear neurons • These are simple but computationally limited • If we can make them learn we may get insight into more complicated neurons • $y = b + \sum_i x_i w_i$, where $y$ is the output, $b$ the bias, $x_i$ the $i$-th input, $w_i$ the weight on the $i$-th input, and $i$ indexes the input connections
Linear threshold neurons • These have a confusing name • They compute a linear weighted sum of their inputs: $z = b + \sum_i x_i w_i$ • The output is a non-linear function of the total input: $y = \begin{cases} z & \text{if } z \geq \theta \\ 0 & \text{otherwise} \end{cases}$
Sigmoid neurons • These give a real-valued output that is a smooth and bounded function of their total input • Typically they use the logistic function: $y = \frac{1}{1 + e^{-z}}$ • They have nice derivatives which make learning easy (see lecture 4) • Local (radial) basis functions are also used
Non-linear neurons with smooth derivatives • For backpropagation, we need neurons that have well-behaved derivatives • Typically they use the logistic function $y = \frac{1}{1 + e^{-z}}$, whose derivative is simply $\frac{dy}{dz} = y(1-y)$ • The output is a smooth function of the inputs and the weights
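A small sketch of the logistic unit and its derivative, using the standard identity dy/dz = y(1 − y); the notation and test values here are mine, not the slides':

```python
import numpy as np

def logistic(z):
    """Smooth, bounded output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_derivative(z):
    """dy/dz = y * (1 - y): cheap to compute once y is known."""
    y = logistic(z)
    return y * (1.0 - y)

z = np.linspace(-5, 5, 11)
print(logistic(z))
print(logistic_derivative(z))   # peaks at 0.25 when z = 0
```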
Types of connectivity • Feedforward networks: these compute a series of transformations; typically the first layer is the input and the last layer is the output • Recurrent networks: these have directed cycles in their connection graph, can have complicated dynamics, and are more biologically realistic [figure: input units, hidden units, and output units in a layered feedforward arrangement]
Types of learning task • Supervised learning • Learn to predict output when given input vector • Who provides the correct answer? • Reinforcement learning • Learn action to maximize payoff • Not much information in a payoff signal • Payoff is often delayed • Unsupervised learning • Create an internal representation of the input e.g. form clusters; extract features • How do we know if a representation is good?
Single layer feed-forward [figure: an input layer of source nodes connected directly to an output layer of neurons]
Multi-layer feed-forward [figure: a 3-4-2 network with an input layer of three source nodes, a hidden layer of four neurons, and an output layer of two neurons]
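A forward pass through a 3-4-2 feed-forward network like the one pictured here could look as follows; the random weights, zero biases, and logistic units are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output-layer weights and biases

def forward(x):
    h = logistic(W1 @ x + b1)   # hidden activities
    y = logistic(W2 @ h + b2)   # output activities
    return y

print(forward(np.array([0.5, -1.0, 2.0])))
```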
Recurrent networks [figure: a recurrent network with a hidden neuron; unit-delay elements ($z^{-1}$) feed the system outputs back as inputs alongside the external input]
The Neuron [figure: input values $x_1, \dots, x_m$ with synaptic weights $w_1, \dots, w_m$ feed a summing function; together with the bias $b$ this produces the local field $v$, which passes through the activation function to give the output $y$]
The Neuron • The neuron is the basic information processing unit of a NN. It consists of: • A set of links, describing the neuron inputs, with weights $w_1, w_2, \dots, w_m$ • An adder function (linear combiner) for computing the weighted sum of the (real-valued) inputs: $u = \sum_{j=1}^{m} w_j x_j$ • An activation function (squashing function) for limiting the amplitude of the neuron output: $y = \varphi(u + b)$
Bias as extra input [figure: the bias is treated as an extra input signal $x_0 = +1$ with synaptic weight $w_0 = b$, so the local field becomes $v = \sum_{j=0}^{m} w_j x_j$; the summing function, activation function, and output $y$ are unchanged]
Neuron Models • The choice of the activation function $\varphi$ determines the neuron model • Step function: $\varphi(v) = \begin{cases} 1 & \text{if } v \geq c \\ 0 & \text{otherwise} \end{cases}$ • Ramp function: linear between its lower and upper saturation values, constant outside • Sigmoid function: $\varphi(v) = \frac{1}{1 + e^{-v}}$ • Gaussian function (Radial Basis Functions): $\varphi(v) = \exp\!\left(-\frac{(v-\mu)^2}{2\sigma^2}\right)$
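The four model families might be written as follows; the specific parameter choices (ramp limits, threshold, Gaussian width) are illustrative, not taken from the slide:

```python
import numpy as np

def step(v, c=0.0):
    """Step function: jumps from 0 to 1 at threshold c."""
    return np.where(v >= c, 1.0, 0.0)

def ramp(v, lo=0.0, hi=1.0):
    """Ramp function: linear between lo and hi, saturating outside."""
    return np.clip(v, lo, hi)

def sigmoid(v):
    """Logistic squashing function."""
    return 1.0 / (1.0 + np.exp(-v))

def gaussian(v, mu=0.0, sigma=1.0):
    """Radial basis function: peaks at mu, decays with distance from it."""
    return np.exp(-((v - mu) ** 2) / (2.0 * sigma ** 2))

v = np.linspace(-3, 3, 7)
for f in (step, ramp, sigmoid, gaussian):
    print(f.__name__, f(v))
```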
Perceptron: Single Neuron Model • The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation function $\varphi$, the sign function: $y = \varphi\!\left(\sum_{i=1}^{n} w_i x_i + b\right)$, where $\varphi(v) = \begin{cases} +1 & \text{if } v \geq 0 \\ -1 & \text{otherwise} \end{cases}$
Perceptron's geometric view • The equation $w_1 x_1 + w_2 x_2 + w_0 = 0$ describes a (hyper-)plane in the input space of real-valued m-dimensional vectors. This plane is the decision boundary: it splits the input space into two regions, each describing one class. Points with $w_1 x_1 + w_2 x_2 + w_0 \geq 0$ are assigned to class C1, the remaining points to class C2.
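A sketch of the perceptron learning rule on 2-D points, recovering a decision boundary of the form w1·x1 + w2·x2 + w0 = 0 as above; the toy data set and learning rate are my own illustrative choices:

```python
import numpy as np

def sign(v):
    return 1 if v >= 0 else -1

# Toy linearly separable data: class +1 roughly above the line x1 + x2 = 1, class -1 below
X = np.array([[2.0, 1.0], [1.5, 2.0], [0.0, 0.2], [-1.0, 0.5]])
t = np.array([1, 1, -1, -1])

w = np.zeros(2)      # w1, w2
w0 = 0.0             # bias
eta = 0.1            # learning rate

for epoch in range(20):
    for x, target in zip(X, t):
        y = sign(w @ x + w0)
        # Perceptron rule: weights only change when the prediction is wrong
        w  += eta * (target - y) * x
        w0 += eta * (target - y)

print("decision boundary:", w[0], "* x1 +", w[1], "* x2 +", w0, "= 0")
```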
Learning with hidden units • Networks without hidden units are very limited in the input-output mappings they can model • More layers of linear units are still linear • We need multiple layers of adaptive non-linear hidden units. This gives us a universal approximator. But how can we train such nets? • We need an efficient way of adapting all the weights, and this is hard
Learning by perturbing weights • Randomly perturb one weight and see if it improves performance; if so, save the change • This is very inefficient: we need to do multiple forward passes on a representative set of training data just to change one weight • Towards the end of learning, large weight perturbations will nearly always make things worse • We could instead randomly perturb all the weights in parallel and correlate the performance gain with the weight changes • This is not any better, because we need lots of trials to “see” the effect of changing one weight through the noise created by all the others
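To make the inefficiency concrete, here is a sketch of single-weight perturbation on a tiny linear model; the model, data, and perturbation size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
t = X @ true_w                      # targets generated by a hidden linear rule

w = np.zeros(3)

def error(w):
    return np.mean((X @ w - t) ** 2)

# Perturb one randomly chosen weight at a time; keep the change only if the error drops
for step in range(2000):
    i = rng.integers(3)
    delta = rng.normal(scale=0.05)
    old = error(w)
    w[i] += delta
    if error(w) >= old:             # no improvement: undo the change
        w[i] -= delta

print(w)   # slowly approaches true_w, at the cost of many full error evaluations
```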
The idea behind backpropagation • We don't know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity • Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities • Each hidden activity can affect many output units and can therefore have many separate effects on the error; these effects must be combined • We can compute error derivatives for all the hidden units efficiently • Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit
Sketch of backpropagation (the δ-rule): let's derive it ...
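A bare-bones version of the derivation in code: backpropagation (the delta rule) for one hidden layer of logistic units with a squared-error output, trained on a single case. The network sizes, training data, and learning rate are illustrative assumptions, not the slide's example:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# A 2-3-1 network
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, target = np.array([0.5, -0.3]), np.array([1.0])
eta = 0.5

for step in range(1000):
    # Forward pass
    h = logistic(W1 @ x + b1)
    y = logistic(W2 @ h + b2)

    # Backward pass: error derivatives w.r.t. activities, then w.r.t. weights
    delta_out = (y - target) * y * (1 - y)          # dE/dz at the output
    delta_hid = (W2.T @ delta_out) * h * (1 - h)    # combine effects on all outputs

    W2 -= eta * np.outer(delta_out, h); b2 -= eta * delta_out
    W1 -= eta * np.outer(delta_hid, x); b1 -= eta * delta_hid

print(y)   # approaches the target
```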
Ways to use weight derivatives • How often to update: after each training case? after a full sweep through the training data? • How much to update: use a fixed learning rate? adapt the learning rate? don't use steepest descent?
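The two update schedules differ only in when the derivatives are applied. A sketch with a fixed learning rate and a hypothetical squared-error gradient on toy data:

```python
import numpy as np

def grad(w, x, t):
    """Gradient of squared error for one linear training case (illustrative model)."""
    return 2 * (w @ x - t) * x

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
T = np.array([1.0, -1.0, 0.0])
eta = 0.1

# Online (per-case) updates: apply the derivative after each training case
w_online = np.zeros(2)
for x, t in zip(X, T):
    w_online -= eta * grad(w_online, x, t)

# Batch updates: accumulate derivatives over a full sweep, then apply once
w_batch = np.zeros(2)
g = sum(grad(w_batch, x, t) for x, t in zip(X, T))
w_batch -= eta * g

print(w_online, w_batch)
```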
Overfitting • The training data contains information about the regularities in the mapping from input to output. But it also contains noise • The target values may be unreliable. • There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen. • When we fit the model, it cannot tell which regularities are real and which are caused by sampling error. • So it fits both kinds of regularity. • If the model is very flexible it can model the sampling error really well. This is a disaster.
Simple overfitting example • Which model do you believe? The complicated model fits the data better, but it is not realistic! • A model is convincing when it fits a lot of data surprisingly well; it is not surprising that a complicated model can fit a small amount of data • Ockham's Razor
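A hedged stand-in for the picture on this slide: fit both a straight line and a degree-5 polynomial to six noisy points. The flexible model threads the training points almost exactly but swings much more wildly in between; the data and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 6)
y = 2 * x + rng.normal(scale=0.1, size=6)   # roughly linear data plus noise

simple = np.polyfit(x, y, 1)      # 2 parameters
complex_ = np.polyfit(x, y, 5)    # 6 parameters: can pass through every point

x_test = np.linspace(0, 1, 50)
print("simple model range: ", np.ptp(np.polyval(simple, x_test)))
print("complex model range:", np.ptp(np.polyval(complex_, x_test)))
```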
Neural Network Training as a Mathematical Programming Problem
Key characteristics • NNs are versatile and “general” models • They require little, if any, insight • They are usually impossible to interpret • Is this yet another multivariate parameter estimation approach? • Well ... it depends on how they are used ... • The basic concept behind NN modeling is to identify complex emergent behavior by combining simple elements • NNs should not be viewed as exercises in optimization • People either love or hate NNs!!!
Some thoughts ... • How do we interpret (artificial) NNs? • Nature shows 3 key characteristics: • highly robust (recover memory with partial knowledge ... see next page) • highly adaptable (connections created and/or bypassed) • complexity emerging from simplicity • These could be the result of MASSIVE PARALLELISM: on the order of $10^{12}$ neurons and $10^{14}$ synapses • Can we really build models like that?
Applications • Too many ... anytime you look for an I/O relation and you lack fundamental understanding and first-principles (or even “gray”) models • optimization: Hopfield networks • classification: FFNN/BP • dimensionality reduction: autoassociative NNs • visualization: SOM • modeling: recurrent NNs • cognitive sciences: a very legitimate domain ...
Dimensionality reduction • PCA, nonlinear PCA ...
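A linear PCA reduction via the SVD, as a baseline for the (nonlinear, autoassociative-NN) approaches mentioned above; the data shape and number of retained components are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))            # 200 samples, 10 original dimensions
Xc = X - X.mean(axis=0)                   # centre the data

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
scores = Xc @ Vt[:k].T                    # project onto the top-k principal directions
X_hat = scores @ Vt[:k] + X.mean(axis=0)  # reconstruction from the reduced representation

print(scores.shape, np.mean((X - X_hat) ** 2))
```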
Recurrent Networks • [figure: a recurrent network with 5 nodes; inputs $x_1, \dots, x_5$ feed nodes 1-5, outputs $z_4$ and $z_5$ are taken from nodes 4 and 5, and feedback connections link the nodes]
Memories as Attractors • Attractor network [Hopfield, 1982]: store memories as dynamical attractors • A memory can be recovered from a partial pattern, i.e. using only partial information
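A small Hopfield-style attractor network with Hebbian storage and asynchronous updates; the stored pattern and the amount of corruption are illustrative assumptions, not values from the slide:

```python
import numpy as np

rng = np.random.default_rng(4)
pattern = np.sign(rng.normal(size=25))      # one ±1 memory over 25 units

# Hebbian storage: outer product of the pattern with itself, no self-connections
W = np.outer(pattern, pattern)
np.fill_diagonal(W, 0)

# Start from a partial / corrupted version of the memory
state = pattern.copy()
state[:8] *= -1                             # flip 8 of the 25 units

# Asynchronous updates until the state settles into the attractor
for sweep in range(5):
    for i in rng.permutation(25):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("recovered:", np.array_equal(state, pattern))
```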
Stochastic Optimization Basic preliminaries of Simulated Annealing and Genetic Algorithms
Optimization, dynamical systems and iterative maps: let's talk about that ...
Why stochastic algorithms • Like poker ... 15 minutes to learn, a lifetime to master ... • Deceptively easy to grasp and implement ... which means that the implementations can become tricky • It's straightforward to incorporate domain-specific knowledge • They will always produce something • They are insensitive to minor details such as differentiability, scaling, bad modeling, etc. • They have physical analogues, making them attractive to physical scientists
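A sketch of simulated annealing, one of the two methods named in the section title, on a one-dimensional multimodal function; the objective, cooling schedule, and step size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

def cost(x):
    """A bumpy objective with many local minima (illustrative)."""
    return x ** 2 + 10 * np.sin(3 * x)

x = 5.0                      # arbitrary starting point
T = 5.0                      # initial temperature
for step in range(5000):
    x_new = x + rng.normal(scale=0.5)
    dE = cost(x_new) - cost(x)
    # Always accept improvements; accept uphill moves with probability exp(-dE / T)
    if dE < 0 or rng.random() < np.exp(-dE / T):
        x = x_new
    T *= 0.999               # slow geometric cooling

print(x, cost(x))
```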