Back Propagation. Amir Ali Hooshmandan Mehran Najafi Mohamad Ali Honarpisheh. Contents. What is it? History Architecture Activation Function Learning Algorithm EBP Heuristics How Long to Train Virtues AND Limitations of BP About Initialization Accelerating training An Application
Back Propagation Amir Ali Hooshmandan Mehran Najafi Mohamad Ali Honarpisheh
Contents • What is it? • History • Architecture • Activation Function • Learning Algorithm • EBP Heuristics • How Long to Train • Virtues AND Limitations of BP • About Initialization • Accelerating training • An Application • Different Problems Require Different Learning Rate Adaptive Methods
What is it • A supervised learning algorithm • Based on the error-correction learning rule • Generalizes the adaptive filtering algorithm
History • 1986 • Rumelhart • Paper: Why are “what” and “where” processed by separate cortical visual systems? • Book: Parallel Distributed Processing: Explorations in the Microstructure of Cognition • Parker • Optimal algorithms for adaptive networks: second order back propagation, second order direct propagation • 1974 & 1969
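Architecture [Figure: layered feed-forward network. Input units Xi (index i) connect through weights Vi,j to hidden units Zj (index j), which connect through weights Wj,k to output units Zk (index k).]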
Activation Function Characteristics: output of neuron J
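As a minimal sketch of an activation function suited to back-propagation, the block below assumes the logistic sigmoid (the slide text does not name a function at this point, and the function names are illustrative). It shows the neuron output and the derivative that the learning rule needs.

```python
import math

def logistic(v: float) -> float:
    """Logistic sigmoid of the local field v: smooth, monotonic, bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def logistic_prime(v: float) -> float:
    """Derivative written through the output: f'(v) = f(v) * (1 - f(v))."""
    y = logistic(v)
    return y * (1.0 - y)
```

Expressing the derivative through the output is convenient because the output is already available after the forward pass. Note that `math.exp` can overflow for very large negative `v`; this sketch does not guard against that.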
Learning Algorithm • Error signal: ek(n) = dk(n) - yk(n) • Energy in this error • Total energy of the output of the net • Energy of the error produced by output neuron j in epoch n
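The energy terms listed on this slide can be sketched as follows, assuming the usual convention that the instantaneous energy is half the sum of squared error signals (the factor 1/2 and the function names are assumptions, since the slide formulas were not preserved).

```python
def error_energy(desired, actual):
    """Instantaneous error energy E(n) = 1/2 * sum_k e_k(n)^2, with e_k = d_k - y_k."""
    return 0.5 * sum((d - y) ** 2 for d, y in zip(desired, actual))

def average_energy(pairs):
    """Mean of E(n) over all (desired, actual) output pairs presented in one epoch."""
    return sum(error_energy(d, y) for d, y in pairs) / len(pairs)
```

For example, `error_energy([1.0, 0.0], [0.5, 0.5])` sums two squared errors of 0.25 and halves the result.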
Learning Algorithm (cont’d) • Purpose: minimizing E(n) • ek(n) = dk(n) - yk(n) • Local field
Learning Algorithm (cont’d) • Chain rule in derivation • Local gradient
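Chain Rule • Computing Δvi,j for non-output layers • Problem: a hidden neuron has no desired response of its own, so its error is not directly available; it shares responsibility for the errors of all the output neurons it feeds • Find another way to compute δj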
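A sketch of the standard chain-rule argument for the hidden-layer local gradient follows. The notation is an assumption where the slides do not define it: E(n) = ½ Σk ek(n)², φ is the activation function, v is a neuron's local field, yj is the output of hidden neuron j, and wjk is the weight from hidden neuron j to output neuron k.

```latex
\begin{align*}
\delta_k(n) &= e_k(n)\,\varphi'\big(v_k(n)\big)
  && \text{(output neuron } k\text{)}\\
\frac{\partial E(n)}{\partial y_j(n)}
  &= -\sum_k e_k(n)\,\varphi'\big(v_k(n)\big)\,w_{jk}(n)
   = -\sum_k \delta_k(n)\,w_{jk}(n)\\
\delta_j(n) &= -\frac{\partial E(n)}{\partial y_j(n)}\,\varphi'\big(v_j(n)\big)
   = \varphi'\big(v_j(n)\big)\sum_k \delta_k(n)\,w_{jk}(n)
  && \text{(hidden neuron } j\text{)}
\end{align*}
```

The hidden neuron's local gradient is therefore a weighted sum of the local gradients of the layer above, scaled by the derivative of its own activation.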
Computing Weight correction (Weight correction) = (Learning rate parameter) * (local gradient) * (input signal from the previous-layer neuron)
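The weight-correction rule above translates almost directly into code. The sketch below is illustrative: the function names are hypothetical, and the output-neuron gradient assumes a logistic activation, whose derivative is y(1 - y).

```python
def weight_correction(learning_rate, local_gradient, input_signal):
    """Delta rule: correction = learning rate * local gradient * input signal."""
    return learning_rate * local_gradient * input_signal

def output_local_gradient(desired, output):
    """Local gradient of an output neuron, assuming a logistic activation."""
    return (desired - output) * output * (1.0 - output)
```

For an output neuron with desired response 1.0 and actual output 0.5, the local gradient is 0.5 * 0.25 = 0.125, which `weight_correction` then scales by the learning rate and the incoming signal.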
Training Algorithm Step 0: Initialize weights (set to random values with zero mean and variance one). Step 1: While the stopping condition is false, do Steps 2-9. Step 2: For each training pair, do Steps 3-8. Feed forward: Step 3: Each input unit (Xi, i=1,…,n) receives input signal xi and broadcasts this signal to all units in the layer above (the hidden units).
Training Algorithm (cont’d) Step 4: Each hidden unit (Zj, j=1,…,p) sums its weighted input signals, z_inj = v0j + Σi xi vij, applies its activation function to compute its output signal, zj = f(z_inj), and sends this signal to all units in the layer above (the output units).
Training Algorithm (cont’d) Step 5: Each output unit (Yk, k=1,…,m) sums its weighted input signals, y_ink = w0k + Σj zj wjk, and applies its activation function to compute its output signal, yk = f(y_ink).
Training Algorithm (cont’d) Backpropagation of error: Step 6: Each output unit (Yk, k=1,…,m) receives a target pattern tk corresponding to the input training pattern, computes its error information term, δk = (tk - yk) f'(y_ink), calculates its weight correction term (used to update wjk later), Δwjk = α δk zj, calculates its bias correction term (used to update w0k later), Δw0k = α δk, and sends δk to the units in the layer below. Here α is the learning rate.
Training Algorithm (cont’d) Step 7: Each hidden unit (Zj, j=1,…,p) sums its delta inputs from the units in the layer above, δ_inj = Σk δk wjk, multiplies by the derivative of its activation function to calculate its error information term, δj = δ_inj f'(z_inj), calculates its weight correction term (used to update vij later), Δvij = α δj xi, and calculates its bias correction term (used to update v0j later), Δv0j = α δj, where α is the learning rate.
Training Algorithm (cont’d) Update weights and biases: Step 8: Each output unit (Yk, k=1,…,m) updates its bias and weights (j=0,…,p): wjk(new) = wjk(old) + Δwjk, where Δwjk is the weight correction term from Step 6. Each hidden unit (Zj, j=1,…,p) updates its bias and weights (i=0,…,n): vij(new) = vij(old) + Δvij, where Δvij is the weight correction term from Step 7. Step 9: Test stopping condition.
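Steps 0 to 9 can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a logistic activation, one hidden layer, uniform initial weights in (-0.5, 0.5) instead of the variance-one initialization of Step 0, and a fixed epoch budget as the stopping condition. All names (`train`, `forward`, `total_error`) are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, V, W):
    """Steps 3-5. V[j] = [v0j, v1j, ..., vnj]; W[k] = [w0k, w1k, ..., wpk]."""
    z = [sigmoid(vj[0] + sum(v * xi for v, xi in zip(vj[1:], x))) for vj in V]
    y = [sigmoid(wk[0] + sum(w * zj for w, zj in zip(wk[1:], z))) for wk in W]
    return z, y

def total_error(patterns, V, W):
    """Sum over all patterns of 1/2 * sum_k (t_k - y_k)^2."""
    return sum(0.5 * sum((t - yk) ** 2 for t, yk in zip(target, forward(x, V, W)[1]))
               for x, target in patterns)

def train(patterns, n_hidden=4, lr=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    # Step 0: small random weights (uniform range is an assumption of this sketch)
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    initial = total_error(patterns, V, W)
    for _ in range(epochs):                  # Steps 1 and 9: fixed epoch budget
        for x, target in patterns:           # Step 2
            z, y = forward(x, V, W)          # Steps 3-5
            # Step 6: error information terms of the output units
            delta_out = [(t - yk) * yk * (1 - yk) for t, yk in zip(target, y)]
            # Step 7: hidden error terms, computed with the not-yet-updated W
            delta_hid = [zj * (1 - zj) * sum(dk * W[k][j + 1]
                                             for k, dk in enumerate(delta_out))
                         for j, zj in enumerate(z)]
            # Step 8: update biases (index 0) and weights
            for k, dk in enumerate(delta_out):
                W[k][0] += lr * dk
                for j, zj in enumerate(z):
                    W[k][j + 1] += lr * dk * zj
            for j, dj in enumerate(delta_hid):
                V[j][0] += lr * dj
                for i, xi in enumerate(x):
                    V[j][i + 1] += lr * dj * xi
    return V, W, initial, total_error(patterns, V, W)

# Usage: XOR with binary inputs and targets
xor = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
V, W, before, after = train(xor)
print("error before:", before, "error after:", after)
```

Weights are updated after every training pair, as in Step 2; a batch variant would accumulate the corrections over the whole epoch before applying them.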
EBP Heuristics • Number Of Hidden Layers : • Theoretical & Simulation results showed that there is no need to have more than two hidden layers. • One OR Two hidden layers…? • Chester ( 1990 ) : “Why two hidden layers are better than one” • Gallant ( 1990 ) : “Never try a multilayer model for fitting data until you have first tried a single-layer model” Both architectures are theoretically able to approximate any continuous function to the desired degree of accuracy.
EBP Heuristics (cont’d) • Number of hidden layers… : • It is difficult to say which topology is better; the criteria include : • Size of NN • Learning Time • Implementability in hardware • Accuracy • Trying a NN with one hidden layer first seems appropriate.
EBP Heuristics (cont’d) • Every adjustable network parameter of the cost function should have its own individual learning rate (LR) parameter. • Every LR parameter should be allowed to vary from one iteration to the next. • Increase the LR of a weight whose derivative has kept the same sign for several iterations. • Decrease the LR of a weight whose derivative alternates in sign.
How Long to Train • The aim is to balance generalization against memorization (minimizing the cost function on the training data is not necessarily a good idea). • Hecht-Nielsen (1990) : use two disjoint sets during training • Training Set • Training-Testing Set • As long as the error for the training-testing set decreases, training continues. • When that error begins to increase, the net is starting to memorize.
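The stopping rule above can be sketched as a small helper. It assumes the training-testing error has been recorded once per epoch in a list; the function name is hypothetical, and the rule is applied literally (stop at the first increase).

```python
def stopping_epoch(testing_errors):
    """Index of the last epoch before the training-testing error first increases.

    Returns the final index if the error never increases.
    """
    for n in range(1, len(testing_errors)):
        if testing_errors[n] > testing_errors[n - 1]:
            return n - 1
    return len(testing_errors) - 1
```

For the error sequence `[0.9, 0.5, 0.3, 0.35, 0.2]` the helper returns 2: the weights from epoch 2 would be kept, even though the error later drops again, because the rule stops at the first increase.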
Virtues AND Limitations of BP • Connectionism • Biological objections • Real neurons are either excitatory or inhibitory, a distinction an MLP does not make • No global communication in an MLP • No backward propagation in real neurons • Useful in parallel hardware implementation • Fault tolerance
Virtues… (cont’d) • Computational efficiency • The computational complexity of an algorithm is measured in terms of multiplications, additions… • A learning algorithm is said to be computationally efficient when its complexity is polynomial… • The BP algorithm is computationally efficient. • In an MLP with a total of W weights, its complexity is linear in W
Virtues… (cont’d) • Convergence • Saarinen (1992) : local convergence rates of the BP algorithm are linear • Causes of slow convergence : the error surface is too flat or too curved, or the negative gradient points in the wrong direction (away from the minimum) • Local Minima • BP learning is basically a hill-climbing technique. • Presence of local minima (isolated valleys)
About Init… (cont) • Other issues : • Initialization of the output layer (OL) weights should not result in small weights. Why? • If the OL weights are small, then so is the contribution of the hidden layer (HL) neurons to the output error, and consequently the effect of the HL weights is not visible enough. • If the OL weights are too small, the deltas (for the HLs) also become very small, which in turn leads to small initial changes in the HL weights.
About Init… (cont) • Initialization with random numbers is very important for avoiding the effects of symmetry in the network. All the HL neurons should start with weights that are guaranteed to differ. • If they have similar (or, even worse, the same) weights, they will perform similarly (the same) on all data pairs by changing weights in similar (the same) directions. • This makes the whole learning process unnecessarily long (or learning will be the same for all neurons, and there will practically be no learning).
Nguyen-Widrow Initialization… • Two-layer NNs have been proven capable of approximating arbitrary functions… • How does this work? • A method for speeding up the training process…
Behavior of hidden nodes… • For simplicity, a two-layer network with one input x is trained with the BP algorithm to approximate a function of one variable, d(x). • Output is in the form of : • Sigmoid function ( tanh ) : • Approximately linear with slope 1 for x between -1 and 1 • Saturates to -1 or 1 as x becomes large in magnitude • Each term in the above sum is a linear function of x over a small interval
Size of each interval is determined by wi • Location of interval is determined by wbi • Network learns to implement desired function by building piece-wise linear approximations • Pieces are summed to form the complete approximation • (Random Initialization)
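The piece-wise linear picture can be sketched as follows. The output form y(x) = Σi vi tanh(wi x + wbi) + vb is an assumption (the slide formula was not preserved, and the names `v` and `vb` for the output-layer weights and bias are hypothetical); `wi` and `wbi` follow the slide.

```python
import math

def network_output(x, v, w, wb, vb=0.0):
    """Assumed form: y(x) = sum_i v[i] * tanh(w[i] * x + wb[i]) + vb, single input x."""
    return sum(vi * math.tanh(wi * x + wbi) for vi, wi, wbi in zip(v, w, wb)) + vb

def linear_interval(wi, wbi):
    """Interval of x where -1 < wi*x + wbi < 1, i.e. where the unit is roughly linear.

    Assumes wi != 0. The interval has length 2/|wi| and is centred at -wbi/wi.
    """
    lo, hi = (-1.0 - wbi) / wi, (1.0 - wbi) / wi
    return (min(lo, hi), max(lo, hi))
```

A larger `wi` narrows the interval, while `wbi` shifts its centre, which matches the two statements about size and location on the slide.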
Improving Learning Speed • Main Idea : • Divide the desired region into small intervals • Set the weights so that each hidden node is assigned its own interval at the start of training. • Training is as before…
Improving … (cont) • Desired region : (-1,1) , has length 2 • H hidden units • So, each hidden unit is responsible for an interval of length 2/H • Sigmoid(wi x + wbi ) is approximately linear over : • Which has length 2/wi , therefore : • It is preferable to have the intervals overlap : • For wbi :
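A minimal sketch of this initialization for the one-input case follows. The slide formulas were not preserved, so the specifics are assumptions taken from the commonly cited form of the Nguyen-Widrow rule: weight magnitude 0.7·H (the factor 0.7 makes each interval slightly longer than 2/H, so neighbouring intervals overlap) and biases drawn uniformly from (-|wi|, |wi|), which spreads the interval centres -wbi/wi over (-1, 1). The random sign and the function name are also assumptions.

```python
import random

def nguyen_widrow_1d(n_hidden, overlap=0.7, seed=None):
    """Initial hidden weights and biases for a single-input network on (-1, 1)."""
    rng = random.Random(seed)
    magnitude = overlap * n_hidden            # |wi| = 0.7 * H (assumed factor)
    weights = [magnitude * rng.choice((-1.0, 1.0)) for _ in range(n_hidden)]
    biases = [rng.uniform(-magnitude, magnitude) for _ in range(n_hidden)]
    return weights, biases

# Usage: 10 hidden units, each roughly linear over an interval of length 2 / 7
w, wb = nguyen_widrow_1d(10, seed=0)
```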
[Figure: training curves for a net initialized as discussed, compared with a net whose weights are initialized to random values between -0.5 and 0.5] • Improvement is best when a large number of hidden units is used with a complicated desired response. • Training time decreased from 2 days to 4 hours for the Truck-Backer-Upper
Momentum • Direction of the weight change : a combination of the current gradient and the previous gradient. • Advantage : reduces the influence of outliers • Does not adjust the LR directly. • Momentum parameter : in the range from 0 to 1
Momentum (cont’d) • Allows large weight adjustments as long as the correction is in the same direction… • Forms an exponentially weighted sum : • BP Vs MOM : XOR function with bipolar representation
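A minimal sketch of a momentum update for a single weight is shown below, assuming the common form Δw(t) = -η·g(t) + μ·Δw(t-1), where g is the error derivative for that weight. The parameter names `lr` and `mu` and the function name are illustrative.

```python
def momentum_step(weight, gradient, prev_change, lr=0.1, mu=0.9):
    """One update: the new change blends the current gradient with the previous change."""
    change = -lr * gradient + mu * prev_change
    return weight + change, change

# Usage: with a constant gradient, successive changes grow toward -lr / (1 - mu)
w, dw = 1.0, 0.0
for _ in range(5):
    w, dw = momentum_step(w, gradient=1.0, prev_change=dw, lr=0.1, mu=0.5)
```

Unrolling the recursion gives the exponentially weighted sum of past gradients mentioned on the slide: each earlier gradient contributes with weight μ raised to its age.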
Delta-Bar-Delta • Allows each weight to have its own learning rate • Lets learning rates vary with time • Two heuristics are used to determine appropriate changes : • If the weight change is in the same direction for several time steps, the LR for that weight should be increased. • If the direction of the weight change alternates, the LR should be decreased. • Note : these heuristics won't always improve the performance.
DBD (cont’d) • The DBD rule consists of : • Weight update rule • LR update rule • The DBD rule changes the weights as follows : it uses information from the current and past derivatives to form “delta-bar”
DBD (cont’d) • The 1st heuristic is implemented by increasing the LR by a constant amount : • The 2nd heuristic is implemented by decreasing the LR by a proportion of its current value : • The LR increases linearly and decreases exponentially.
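The two rules can be sketched for a single weight as follows. The slide formulas were not preserved, so the exact form is an assumption based on the usual statement of delta-bar-delta: `bar` is an exponential average of past derivatives, the LR grows by a constant `kappa` when `bar` and the current derivative agree in sign, and shrinks by the fraction `gamma` when they disagree. The parameter names and default values are illustrative.

```python
def dbd_step(w, lr, grad, bar, kappa=0.01, gamma=0.1, beta=0.7):
    """One delta-bar-delta update for a single weight.

    grad is the current error derivative; bar is the delta-bar from the previous step.
    """
    agreement = bar * grad
    if agreement > 0:            # same direction: increase LR by a constant
        lr += kappa
    elif agreement < 0:          # alternating direction: decrease LR proportionally
        lr *= (1.0 - gamma)
    w -= lr * grad               # weight update with this weight's own LR
    bar = (1.0 - beta) * grad + beta * bar   # update the averaged derivative
    return w, lr, bar
```

Repeated agreement adds `kappa` each step (linear growth), while repeated disagreement multiplies the LR by `1 - gamma` each step (exponential decay).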
DBD (cont’d) Results for XOR …
Computer Network Intrusion Detection Via Neural Networks Methods
Goals • Show that neural network (NN) techniques can be used to detect intruders logging on to a computer network. • Compare the performance of four neural network methods in intrusion detection. • The NN techniques used are 1. gradient descent back propagation (BP) 2. gradient descent BP with momentum 3. variable learning rate gradient descent BP 4. conjugate gradient BP (CGP)
Background on Intrusion Detection Systems (IDS) • Information assurance is a field that deals with protecting information on computers or computer networks from being compromised. • Intrusion detection : detecting unauthorized users who access the information on those computers. • Current intrusion detection techniques cannot detect novel attacks. • The relevance of NNs to intrusion detection becomes apparent when one views the intrusion detection problem as a pattern classification problem.
Pattern Classification Problem • By building profiles of authorized computer users, one can train the NN to classify incoming computer traffic as authorized or unauthorized. • The task of intrusion detection is to construct a model that captures a set of user attributes and determines whether that set of attributes belongs to an authorized user or to an intruder.
Problem Definition • The attribute set consists of the unique characteristics of the user logging onto a computer network : • Authorized user and Intruder • The problem can be stated as: • Where: x = input vector consisting of a user’s attributes, y = {authorized user, intruder}. We want to map the input set x to an output y.
Solving the Intrusion Detection Problem Using Back Propagation • Multilayer Perceptron with Two Hidden Layers • The error of our model is: e = d - y, where d = desired output and y = actual output
Continue • Activation functions : sigmoidal • Users in the UNIX OS environment can be profiled via four attributes : command, host, time, and execution time • For simplicity in testing the back propagation methods, we decided to generate a user profile data file without profile drift
Continue • The generated data used here was organized into two parts. Training Data : 90% Authorized traffic 10% Intrusion traffic Testing Data: 98% Authorized traffic 2% Intrusion traffic