Glide Algorithm with Tunneling: A Fast, Reliably Convergent Algorithm for Neural Network Training
Vitit Kantabutra & Batsukh Tsendjav, Computer Science Program, College of Engineering, Idaho State University, Pocatello, ID 83209
Elena Zheleva, Dept. of CS/EE, The University of Vermont, Burlington, VT 05405
New Algorithm for Neural Network Training
• Convergence of training algorithms is one of the most important issues in the neural network field today
• We solve the problem for some well-known difficult-to-train networks:
  • Parity-4 – 100% fast convergence
  • 2-Spiral – same
  • Character recognition – same
Our Glide Algorithms
• Our first "Glide Algorithm" was a simple modification of gradient descent.
• When the gradient is small, move a constant distance instead of a distance equal to a constant times the gradient (a minimal sketch of this step rule follows this slide).
• The idea was that flat regions are seemingly "safe," enabling us to go a relatively long distance ("glide") without missing the solution.
• Originally we even thought of going a longer distance when the gradient is smaller!
• We simply didn't believe the conventional wisdom of taking longer steps on steep slopes.
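Below is a minimal sketch of the original Glide step rule described above. The function name, flatness threshold, and glide length are illustrative assumptions, not the authors' code; only the rule itself (a fixed-length step when the gradient is small, an ordinary gradient-descent step otherwise) comes from the slide.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One weight update of the original Glide Algorithm (hypothetical helper).
// In flat regions (small gradient) move a constant distance along -grad;
// otherwise take a plain gradient-descent step of size eta * grad.
void glideStep(std::vector<double>& w, const std::vector<double>& grad,
               double eta, double glideLength, double flatThreshold) {
    double norm = 0.0;
    for (double g : grad) norm += g * g;
    norm = std::sqrt(norm);

    if (norm > 0.0 && norm < flatThreshold) {
        // Flat region: "glide" a fixed distance, regardless of how small
        // the gradient is.
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] -= glideLength * grad[i] / norm;
    } else {
        // Steep region: ordinary gradient descent.
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] -= eta * grad[i];
    }
}
```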
Hairpin Observation – problem with the original Glide Algorithm
• Our original Glide Algorithm did converge significantly faster than plain gradient descent
• But it didn't converge as reliably as plain gradient descent!
• What seemed to be wrong?
• We weren't right about flat regions always being safe!!
• We experimented by running plain gradient descent and observing its flat-region behavior
  • Flat regions are indeed often safe
  • But sometimes gradient descent makes a sharp "hairpin" turn!!
  • This sometimes derailed our first Glide Algorithm
Second Glide Algorithm: "Glide Algorithm with Tunneling"
• In flat regions, we still try to go far
  • But we check the error at the tentative destination
  • Don't go so far if the error increases much (see the sketch after this slide)
  • Can easily afford the time
  • But even if the error increases a little, go anyway to "stir things up"
• Also has a mechanism for battling zigzagging
  • Direction of motion is the average of 2 or 4 gradient-descent moves
  • Seems better than momentum
• Also has "tunneling"
  • Essentially a very local line search, but fancier
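The sketch below illustrates the tentative-step check described above: try a long step, back off if the error rises too much, but accept a small increase to "stir things up." The function names, the "small increase" tolerance, and the shrink factor are assumptions made for illustration; only the accept/back-off logic follows the slide.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Weights = std::vector<double>;
using ErrorFn = std::function<double(const Weights&)>;

// Hypothetical helper: attempt a long step along `direction`, shortening it
// until the error at the destination is no more than slightly worse than
// the current error.
Weights tryGlideStep(const Weights& w, const std::vector<double>& direction,
                     double stepLength, const ErrorFn& error,
                     double smallIncrease, double shrink) {
    const double e0 = error(w);
    double len = stepLength;
    for (;;) {
        Weights candidate = w;
        for (std::size_t i = 0; i < w.size(); ++i)
            candidate[i] += len * direction[i];
        const double e1 = error(candidate);
        // Accept if the error drops or rises only a little ("stir things up"),
        // or if the step has already become negligibly short.
        if (e1 <= e0 * (1.0 + smallIncrease) || len < 1e-8)
            return candidate;
        len *= shrink;   // don't go so far: shorten the step and retry
    }
}
```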
Reducing the zigzagging problem
• Direction of the next move is usually determined by averaging 2 or 4 (or 6, 8, etc.) gradient-descent moves (see the sketch below)
[Figure: gradient descent zigzagging despite momentum!!]
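A hedged sketch of the direction-averaging idea above; the helper names and the choice of returning the averaged move are illustrative assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Weights = std::vector<double>;
using GradFn = std::function<std::vector<double>(const Weights&)>;

// Take `numMoves` trial gradient-descent moves (2, 4, 6, ...) and return
// their average as the direction of the next actual move, which damps the
// zigzagging seen with plain gradient descent.
std::vector<double> averagedDirection(Weights w, const GradFn& gradient,
                                      double eta, int numMoves) {
    const Weights start = w;
    for (int k = 0; k < numMoves; ++k) {
        const std::vector<double> g = gradient(w);
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] -= eta * g[i];                 // one trial gradient-descent move
    }
    std::vector<double> dir(w.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        dir[i] = (w[i] - start[i]) / numMoves;  // average of the trial moves
    return dir;
}
```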
Importance of Tunneling
• Serves to set the weights at the "bottom of the gutter" (a sketch follows this slide)
[Figure: error vs. distance along the move, showing the gutter bottom]
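One way to read "tunneling" (described earlier as a very local line search, "but fancier") is sketched below: probe a few points along the direction of motion around the tentative destination and keep the lowest-error one, so the weights settle at the bottom of the gutter. The sample count, probe radius, and names are assumptions; the authors' actual procedure is fancier.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Weights = std::vector<double>;
using ErrorFn = std::function<double(const Weights&)>;

// Hypothetical tunneling step: a very local sampled line search around the
// tentative destination `dest`, along the direction of motion.
Weights tunnel(const Weights& dest, const std::vector<double>& direction,
               double radius, int samples, const ErrorFn& error) {
    Weights best = dest;
    double bestErr = error(dest);
    for (int k = -samples; k <= samples; ++k) {
        const double t = radius * k / samples;   // signed offset along the move
        Weights probe = dest;
        for (std::size_t i = 0; i < dest.size(); ++i)
            probe[i] += t * direction[i];
        const double e = error(probe);
        if (e < bestErr) { bestErr = e; best = probe; }  // keep the gutter bottom
    }
    return best;
}
```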
A Few Experimental Results
[Plot: Parity-4 with 4 hidden neurons. X = run number; Y = running time (sec) until convergence. Odd runs start from random weights; even runs start from the previous run's weights. The CPU-time curve for gradient descent on the odd runs, with m = 0.9, did not converge.]
Some GAT Data
[Charts: GAT results on Voting Records and on Parity 4]
Testing Parity 4
• Network Information
  • One hidden layer
  • 4 inputs
  • 6 hidden neurons
  • 1 output neuron
  • Fully connected between layers
• Machine Used
  • Windows XP
  • AMD Athlon 2.0 GHz processor
  • 1 GB memory
Testing Parity 4
• Parity 4 (Even)
• Number of Instances: 16
  • True (=1)
  • False (=-1)
• Number of Attributes: 4
  • True (=1)
  • False (=-1)
Testing Parity 4
• Patterns Used (X1–X4 are inputs)

   #   X1   X2   X3   X4   Out
   0   -1   -1   -1   -1     1
   1   -1   -1   -1    1    -1
   2   -1   -1    1   -1    -1
   3   -1   -1    1    1     1
   4   -1    1   -1   -1    -1
   5   -1    1   -1    1     1
   6   -1    1    1   -1     1
   7   -1    1    1    1    -1
   8    1   -1   -1   -1    -1
   9    1   -1   -1    1     1
  10    1   -1    1   -1     1
  11    1   -1    1    1    -1
  12    1    1   -1   -1     1
  13    1    1   -1    1    -1
  14    1    1    1   -1    -1
  15    1    1    1    1     1
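The full pattern set above can be generated programmatically; a small sketch (assumed to match the bipolar even-parity encoding used in the table, with X1 as the most significant bit) is shown below.

```cpp
#include <cstdio>

// Print the 16 bipolar parity-4 (even) patterns: output +1 when the number
// of +1 inputs is even, -1 otherwise.
int main() {
    for (int p = 0; p < 16; ++p) {
        int x[4], ones = 0;
        for (int i = 0; i < 4; ++i) {
            const int bit = (p >> (3 - i)) & 1;   // X1 is the most significant bit
            x[i] = bit ? 1 : -1;
            ones += bit;
        }
        const int out = (ones % 2 == 0) ? 1 : -1; // even-parity target
        std::printf("%2d  %2d %2d %2d %2d  -> %2d\n",
                    p, x[0], x[1], x[2], x[3], out);
    }
    return 0;
}
```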
Testing Parity 4 – Statistics

                      GAT       Grad
# of Tests             35         35
Total Seconds         788       1488
Total Minutes       13.13      24.81
Iterations, Mean    28,599    211,081
Iterations, St Dev  26,388     96,424
Seconds, Mean           23         43
Seconds, St Dev         28         19
Testing on Voting Records
• Network Information
  • One hidden layer
  • 16 inputs
  • 16 hidden neurons
  • 1 output neuron
  • Fully connected between layers
• Machine Used
  • Windows XP
  • AMD Athlon 2.0 GHz processor
  • 1 GB memory
Testing on Voting Records
• 1984 United States Congressional Voting Records Database (taken from the UCI Machine Learning Repository – http://www.ics.uci.edu/~mlearn/)
• Number of Instances: 435
  • 267 Democrats (=1)
  • 168 Republicans (=-1)
• Number of Attributes: 16 + class name = 17
  • Yes vote (1)
  • No vote (-1)
  • Abstained (0)
Testing on Voting Records
• 1. Class Name: 2 (democrat, republican)
• 2. handicapped-infants: 2 (y,n)
• 3. water-project-cost-sharing: 2 (y,n)
• 4. adoption-of-the-budget-resolution: 2 (y,n)
• 5. physician-fee-freeze: 2 (y,n)
• 6. el-salvador-aid: 2 (y,n)
• 7. religious-groups-in-schools: 2 (y,n)
• 8. anti-satellite-test-ban: 2 (y,n)
• 9. aid-to-nicaraguan-contras: 2 (y,n)
• 10. mx-missile: 2 (y,n)
• 11. immigration: 2 (y,n)
• 12. synfuels-corporation-cutback: 2 (y,n)
• 13. education-spending: 2 (y,n)
• 14. superfund-right-to-sue: 2 (y,n)
• 15. crime: 2 (y,n)
• 16. duty-free-exports: 2 (y,n)
• 17. export-administration-act-south-africa: 2 (y,n)
Testing on Voting Records – Statistics

                      GAT        Grad
# of Tests             20          20
Total Seconds        4338       32603
Total Minutes       72.31      543.39
Total Hours          1.21        9.06
Iterations, Mean    4,636     107,303
Iterations, St Dev  4,386      31,949
Minutes, Mean        3.62       27.17
Minutes, St Dev      3.52        8.11
Two-Spiral Problem
• Very hard problem
• Glide algorithm
  • Combined with gradient descent for quicker initial error reduction
  • Number of epochs required for convergence varies widely
  • Average: 30,453 epochs
• Gradient descent
  • Often did not converge
Tuning Insensitivity of the Glide-Tunnel Algorithm!!
[Charts: runs with random parameters – odd runs and even runs]
Glide algorithm tested on a character recognition problem
• The network was built to recognize the digits 0 through 9
• The algorithm was implemented in C++
• The Glide Algorithm outperformed the regular gradient descent method in the test runs
Small Neural Network
• The network was 48-24-10
• Bipolar inputs
• Trained on 200 training patterns
  • 20 samples for each digit
• Trained and tested on printed characters
• After training, the recognition rate on test patterns was 70% on average
  • Not enough training patterns
Network Structure
• 6×8 pixel resolution
  • 48 bipolar inputs (1/-1)
• Hidden layer
  • 24 neurons
  • tanh(x) activation function
• Output layer
  • 10 neurons
  • tanh(x) activation function
(A forward-pass sketch of this network follows this slide.)
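A minimal sketch of a forward pass through the 48-24-10 tanh network described above. The layer sizes and activation come from the slide; the weight initialization, bias handling, and helper names are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// One fully connected layer with a bias per neuron and tanh activation.
Vec layer(const Vec& in, const Mat& w, const Vec& bias) {
    Vec out(bias.size());
    for (std::size_t j = 0; j < bias.size(); ++j) {
        double s = bias[j];
        for (std::size_t i = 0; i < in.size(); ++i) s += w[j][i] * in[i];
        out[j] = std::tanh(s);
    }
    return out;
}

// Small random weights (assumed initialization, for illustration only).
Mat randomMatrix(std::size_t rows, std::size_t cols) {
    Mat m(rows, Vec(cols));
    for (Vec& row : m)
        for (double& v : row)
            v = (std::rand() / static_cast<double>(RAND_MAX) - 0.5) * 0.2;
    return m;
}

int main() {
    Vec pixels(48, -1.0);                  // one 6x8 bipolar input pattern
    Mat w1 = randomMatrix(24, 48);         // hidden layer: 24 neurons
    Mat w2 = randomMatrix(10, 24);         // output layer: 10 neurons (digits 0-9)
    Vec b1(24, 0.0), b2(10, 0.0);
    Vec hidden = layer(pixels, w1, b1);
    Vec scores = layer(hidden, w2, b2);    // index of the largest score = predicted digit
    return scores.size() == 10 ? 0 : 1;
}
```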
Experimental results
• 60 official runs of the Glide Algorithm
  • All but 4 runs converged under 5,000 epochs
  • Average run time was 47 sec
• Parameters used
  • Eta = 0.005 (learning rate)
  • Lambda = 1 (steepness parameter)
Experimental results
• 20 runs of the regular gradient descent algorithm
  • None of the runs had converged after 20,000 epochs
  • Average run time was 3.7 min
• Higher-order methods exist
  • Not stable
  • Not very efficient when the error surface is flat
Conclusion
• The new Glide Algorithm has been shown to perform very well in flat regions
• With tunneling, the algorithm is very stable, converging on all test runs for the different test problems
• It converges more reliably than gradient descent and, presumably, than second-order methods
• Some individual steps are computationally expensive but worth the CPU time, because overall performance is far superior to regular gradient descent