Searching in the Right Space Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning Laboratory Department of Computer Science University of Massachusetts Amherst Barto@cs.umass.edu
Computational Reinforcement Learning
Computational RL sits at the intersection of Artificial Intelligence (machine learning), Control Theory and Operations Research, Psychology, Neuroscience, and Artificial Neural Networks.
“Reinforcement learning (RL) bears a tortuous relationship with historical and contemporary ideas in classical and instrumental conditioning.” —Dayan, 2001
The Plan
• High-level intro to RL
• Part I: The personal odyssey
• Part II: The modern view
• Part III: Intrinsically Motivated RL
The View from Machine Learning
• Unsupervised Learning: recode data based on some given principle
• Supervised Learning: “learning from examples,” “learning with a teacher”; related to Classical (or Pavlovian) Conditioning
• Reinforcement Learning: “learning with a critic”; related to Instrumental (or Thorndikian) Conditioning
Classical Conditioning (Pavlov, 1927)
• Tone (CS: Conditioned Stimulus)
• Food (US: Unconditioned Stimulus)
• Salivation (UR: Unconditioned Response)
• Anticipatory salivation (CR: Conditioned Response)
Edward L. Thorndike (1874–1949): Learning by “Trial-and-Error”
[Figure: a Thorndike puzzle box]
Trial-and-Error = Error Correction
Artificial Neural Network: learns from a set of examples via error correction
“Least-Mean-Square” (LMS) Learning Rule (“delta rule”, Adaline; Widrow and Hoff, 1960)
An input pattern (x_1, ..., x_n) is weighted by (w_1, ..., w_n) to produce the actual output V = Σ_i w_i x_i, which is compared with the desired output z; the weights are then adjusted by
Δw_i = α (z − V) x_i
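A minimal runnable sketch of the LMS rule above; the example target weights and learning rate are my own, purely for illustration:

```python
import numpy as np

def lms_update(w, x, z, alpha=0.1):
    """One LMS ("delta rule") step: nudge weights toward the desired output."""
    V = w @ x                          # actual output for this input pattern
    return w + alpha * (z - V) * x     # Δw_i = α (z − V) x_i

# Example: learn a 3-input linear target from random patterns.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical target weights
w = np.zeros(3)
for _ in range(500):
    x = rng.normal(size=3)
    w = lms_update(w, x, z=w_true @ x)
print(w)  # approaches w_true
```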
Trial-and-Error?
“The boss continually seeks a better worker by trial and error experimentation with the structure of the worker. Adaptation is a multidimensional performance feedback process. The ‘error’ signal in the feedback control sense is the gradient of the mean square error with respect to the adjustment.”
Widrow and Hoff, “Adaptive Switching Circuits,” 1960 IRE WESCON Convention Record
MENACE (Michie, 1961): “Matchbox Educable Noughts and Crosses Engine”
[Figure: the array of matchboxes, one per noughts-and-crosses position]
Essence of RL (for me at least!): Search + Memory
• Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection, . . .
• Memory: remember what worked best for each situation and start from there next time
RL is about caching search results (so you don’t have to keep searching!)
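A toy sketch of this search-plus-memory idea, entirely my own illustration: cache the best action found so far for each situation, and let occasional exploration keep the search alive.

```python
import random

best = {}  # situation -> (action, score): memory of cached search results

def act(situation, actions, evaluate, explore=0.1):
    """Trial-and-error search that restarts from the best cached result."""
    if situation in best and random.random() > explore:
        action = best[situation][0]        # exploit what worked before
    else:
        action = random.choice(actions)    # keep searching (take chances)
    score = evaluate(situation, action)
    if situation not in best or score > best[situation][1]:
        best[situation] = (action, score)  # remember what worked best
    return action
```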
Generate-and-Test
• The Generator should be smart:
  • Generate lots of things that are likely to be good based on prior knowledge and prior experience
  • But also take chances . . .
• The Tester should be smart too:
  • Evaluate based on real criteria, not convenient surrogates
  • But be able to recognize partial success
The Plan
• High-level intro to RL
• Part I: The personal odyssey
• Part II: The modern view
• Part III: Intrinsically Motivated RL
Key Players
• Harry Klopf
• Rich Sutton
• Me
Arbib, Kilmer, and Spinelli, “Neural Models and Memory,” in Neural Mechanisms of Learning and Memory, Rosenzweig and Bennett (eds.), 1974
A. Harry Klopf, “Brain Function and Adaptive Systems: A Heterostatic Theory,” Air Force Cambridge Research Laboratories Technical Report, 3 March 1972
“…it is a theory which assumes that living adaptive systems seek, as their primary goal, a maximal condition (heterostasis), rather than assuming that the primary goal is a steady-state condition (homeostasis). It is further assumed that the heterostatic nature of animals, including man, derives from the heterostatic nature of neurons. The postulate that the neuron is a heterostat (that is, a maximizer) is a generalization of a more specific postulate, namely, that the neuron is a hedonist.”
Klopf’s theory (very briefly!)
• Inspiration: the nervous system is a society of self-interested agents.
  • Nervous Systems = Social Systems
  • Neuron = Man
  • Man = Hedonist
  • Neuron = Hedonist
  • Depolarization = Pleasure
  • Hyperpolarization = Pain
• A neuronal model:
  • A neuron “decides” when to fire by comparing a spatial and temporal summation of weighted inputs with a threshold.
  • A neuron is in a condition of heterostasis from time t to t + Δt if it maximizes the amount of depolarization and minimizes the amount of hyperpolarization over this interval.
  • Two ways to adapt weights to do this:
    • Push excitatory weights to upper limits; zero out inhibitory weights
    • Make the neuron control its input.
Heterostatic Adaptation
• When a neuron fires, all of its synapses that were active during the summation of potentials leading to the response become eligible to undergo changes in their transmittances.
• The transmittance of an eligible excitatory synapse increases if the generation of an action potential is followed by further depolarization for a limited time after the response.
• The transmittance of an eligible inhibitory synapse increases if the generation of an action potential is followed by further hyperpolarization for a limited time after the response.
• Add a mechanism that prevents synapses that participate in the reinforcement from undergoing changes due to that reinforcement (“zerosetting”).
(A loose code sketch of these rules follows below.)
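A highly simplified rendering of the excitatory rules above in code; this is my own loose sketch, not Klopf's actual equations, and all parameters are illustrative:

```python
import numpy as np

class HedonisticNeuron:
    """Loose, illustrative sketch of a Klopf-style heterostatic neuron."""
    def __init__(self, n, lr=0.05, decay=0.7):
        self.w = np.full(n, 0.5)    # excitatory synaptic transmittances
        self.elig = np.zeros(n)     # per-synapse eligibility traces
        self.lr, self.decay = lr, decay

    def fire(self, x, threshold=1.0):
        """Sum weighted inputs; synapses active at firing become eligible."""
        depol = float(self.w @ x)
        if depol > threshold:
            self.elig = x.astype(float)   # eligibility marks the active synapses
        else:
            self.elig *= self.decay       # eligibility fades with time
        return depol

    def adapt(self, depol_before, depol_after):
        """Strengthen eligible synapses if firing is followed by more depolarization."""
        if depol_after > depol_before:
            self.w += self.lr * self.elig

neuron = HedonisticNeuron(3)
x = np.array([1.0, 1.0, 0.0])
d0 = neuron.fire(x, threshold=0.9)   # fires; synapses 0 and 1 become eligible
d1 = 1.4                             # suppose depolarization then increases
neuron.adapt(d0, d1)
print(neuron.w)                      # -> [0.55, 0.55, 0.5]: eligible weights grew
```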
Key Components of Klopf’s Theory
• Eligibility
• Closed-loop control by neurons
• Extremization (e.g., maximization) as the goal, instead of zeroing something
• “Generalized Reinforcement”: reinforcement is not delivered by a specialized channel
The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence, A. Harry Klopf, Hemisphere Publishing Corporation, 1982
Eligibility Traces (Klopf, 1972)
[Figure: Klopf's eligibility curve, a histogram of the lengths of feedback pathways in which the neuron is embedded; it peaks at an optimal ISI of about 400 ms and falls to zero after roughly 4 s, the same shape as the reinforcement-effectiveness curve in conditioning.]
Later Simplified Eligibility Traces
[Figure: over time, repeated visits to state s build up an accumulating trace, which sums the increments, versus a replacing trace, which resets to a fixed value on each visit; see the code sketch below.]
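A minimal sketch of the two simplified trace types; the decay parameters and their names are my own conventions:

```python
import numpy as np

def update_traces(e, visited, lam=0.9, gamma=0.95, replacing=False):
    """Decay all eligibility traces, then bump the visited state's trace."""
    e = gamma * lam * e          # exponential decay on every time step
    if replacing:
        e[visited] = 1.0         # replacing trace: reset to 1 on each visit
    else:
        e[visited] += 1.0        # accumulating trace: increments add up
    return e

e = np.zeros(5)
for s in [2, 2, 2, 4]:           # repeated visits to state 2, then one to state 4
    e = update_traces(e, s)
print(e)  # state 2's trace exceeds 1 because its increments accumulated
```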
Rich Sutton
• BA Psychology, Stanford, 1978
• As an undergrad, discovered Klopf’s 1972 tech report
• Two unpublished undergraduate reports:
  • “Learning Theory Support for a Single Channel Theory of the Brain,” 1978
  • “A Unified Theory of Expectation in Classical and Instrumental Conditioning,” 1978 (?)
• Rich’s first paper: “Single Channel Theory: A Neuronal Theory of Learning,” Brain Theory Newsletter, 1978
Sutton’s Theory
• Aj(t): level of activation of mode j at time t
• Vij(t): sign and magnitude of the association from mode i to mode j at time t
• Eij(t): eligibility of Vij for undergoing changes at time t; proportional to the average of the product Ai(t)Aj(t) over some small past time interval (or an average of the logical AND)
• Pj(t): expected level of activation of mode j at time t (a prediction of the level of activation of mode j)
• Cij: a constant depending on the particular association being changed
What exactly is Pj?
• Based on recent activation of the mode: the higher the activation within the last few seconds, the higher the level expected for the present . . .
• Pj(t) is proportional to the average of the activation level over some small time interval (a few seconds or less) before t.
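The update equation itself did not survive on these slides; combining the variable definitions above with the "difference term" noted two slides later, a plausible reconstruction (my notation and form, not a verbatim quote of Sutton, 1978) is:

```latex
\Delta V_{ij}(t) = C_{ij}\,\bigl[A_j(t) - P_j(t)\bigr]\,E_{ij}(t)
```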
Sutton’s Theory
• Contingent Principle: based on the reinforcement a neuron receives after it fires, and on the synapses involved in the firing, the neuron modifies its synapses so that they cause it to fire when firing increases the neuron’s expected subsequent reinforcement.
  • The basis of Instrumental, or Thorndikian, conditioning
• Predictive Principle: if a synapse’s activity predicts (frequently precedes) the arrival of reinforcement at the neuron, then that activity comes to have an effect on the neuron similar to that of the reinforcement itself.
  • The basis of Classical, or Pavlovian, conditioning
Sutton’s Theory
• Main addition to Klopf’s theory: the difference term, a temporal difference term
• Showed the relationship to the Rescorla–Wagner model (1972) of Classical Conditioning
  • Blocking
  • Overshadowing
• Sutton’s model was a real-time model of both classical and instrumental conditioning
• Emphasized conditioned reinforcement
Rescorla–Wagner Model, 1972
“Organisms only learn when events violate their expectations.”
• ΔVA: change in associative strength of CS A
• α: parameter related to CS intensity
• λ: parameter related to US intensity
• ΣV: sum of associative strengths of all CSs present (“composite expectation”)
Update: ΔVA = α(λ − ΣV). A “trial-level” model.
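A small sketch showing how the Rescorla–Wagner update produces blocking; the α and λ values and the stimulus names are arbitrary choices of mine:

```python
def rw_trial(V, present, alpha=0.3, lam=1.0):
    """One Rescorla-Wagner trial: ΔV_A = α(λ − ΣV) for each CS present."""
    error = lam - sum(V[cs] for cs in present)   # expectation violation
    for cs in present:
        V[cs] += alpha * error
    return V

V = {"light": 0.0, "tone": 0.0}
for _ in range(50):                  # Phase 1: light alone predicts the US
    rw_trial(V, ["light"])
for _ in range(50):                  # Phase 2: light + tone compound
    rw_trial(V, ["light", "tone"])
print(V)  # tone's strength stays near 0: the light "blocks" learning about it
```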
Conditioned Reinforcement
• Stimuli associated with reinforcement take on reinforcing properties themselves
• Follows immediately from the predictive principle: “By the predictive principle we propose that the neurons of the brain are learning to have predictors of stimuli have the same effect on them as the stimuli themselves” (Sutton, 1978)
• “In principle this chaining can go back for any length …” (Sutton, 1978)
• Equated Pavlovian conditioned reinforcement with instrumental higher-order conditioning
Where was I coming from?
• Studied at the University of Michigan: at the time a hotbed of genetic algorithm activity due to John Holland’s influence (PhD in 1975)
• Holland talked a lot about the exploration/exploitation tradeoff
• But I studied dynamic system theory: the relationship between state-space and input/output representations of systems, convolution and harmonic analysis, and finally cellular automata
• Fascinated by how simple local rules can generate complex global behavior:
  • Dynamic systems
  • Cellular automata
  • Self-organization
  • Neural networks
  • Evolution
  • Learning
Sutton and Barto, 1981
• “Toward a Modern Theory of Adaptive Networks: Expectation and Prediction,” Psych Review 88, 1981
• Drew on Rich’s earlier work, but clarified the math and simplified the eligibility term to be non-contingent: just a trace of x instead of xy (see the sketch below)
• Emphasized the anticipatory nature of the CR
• Related it to “Adaptive System Theory”:
  • Other neural models (Hebb; Widrow & Hoff’s LMS; Uttley’s “Informon”; Anderson’s associative memory networks)
• Pointed out the relationship between the Rescorla–Wagner model and the Adaline, or LMS, algorithm
• Studied algorithm stability
• Reviewed possible neural mechanisms: e.g., eligibility = intracellular Ca ion concentration
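Combining the trace-of-x eligibility above with the y(t) − y(t−1) difference that a later slide says the rule tries to zero out, the SB-model update plausibly takes a form like the following sketch; α, β, and the exact trace recursion are my assumptions:

```latex
\Delta w_i(t) = \alpha\,\bigl[y(t) - y(t-1)\bigr]\,\bar{x}_i(t),
\qquad
\bar{x}_i(t+1) = \beta\,\bar{x}_i(t) + x_i(t)
```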
Temporal Primacy Overrides Blocking in the SB Model
[Figure: our simulation alongside the data of Kehoe, Schreurs, and Graham, 1987]
Adaline Learning Rule (LMS rule; Widrow and Hoff, 1960)
[Figure: the Adaline unit, with the input pattern driving the unit and a target-output input directing the weight changes.]
“Rescorla–Wagner Unit”
[Figure: a unit with US → UR and CS → CR pathways, a vector of “associative strengths,” and their sum as the “composite expectation.”]
Important Notes
• The “target output” of LMS corresponds to the US input of the Rescorla–Wagner model
• In both cases this input is specialized: it does not directly activate the unit but only directs learning
• The SB model is different: the US input both activates the unit and directs learning
• Hence, the SB model can do secondary reinforcement
• The SB model stayed with Klopf’s idea of “generalized reinforcement”
A Major Problem: US Offset
E.g., if a CS has the same time course as the US, the weights change until the US is cancelled out.
[Figure: US, CS, and the final (cancelled) result.]
Why? Because the rule is trying to zero out y(t) − y(t−1).
Associative Memory Networks Kohonen et al. 1976, 1977; Anderson et al. 1977
Associative Search Network (Barto, Sutton, & Brouwer, 1981)
Problem of context transitions: add a predictor (a “one-step-ahead LMS predictor”).
Relation to Klopf/Sutton Theory
• Did not include generalized reinforcement, since z(t) is a specialized reward input
• An associative version of the ALOPEX algorithm of Harth & Tzanakou (and later Unnikrishnan)
“Landmark Learning” Barto & Sutton 1981 An illustration of associative search
“Landmark Learning” swap E and W landmarks
Note: Diffuse Reward Signal
[Figure: a single scalar reward broadcast to units y1, y2, y3, which all receive the same inputs x1, x2, x3.]
Units can learn different things despite receiving identical inputs . . .
. . . provided there is variability
• The ASN just used noisy units to introduce variability (sketched below)
• Variability drives the search
• It needs an element of “blindness,” as in “blind variation”: i.e., the outcome is not completely known beforehand
• BUT it does not have to be random
• IMPORTANT POINT: blind variation does not have to be random, or dumb
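A toy sketch of the diffuse-reward idea, my own illustration loosely in the spirit of the ASN rather than the published algorithm: several noisy units receive identical inputs and one broadcast scalar reward, yet their weights diverge because each unit's private noise drove a different action. The running reward baseline loosely echoes the ASN's "one-step-ahead LMS predictor"; all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_inputs, lr = 3, 4, 0.05
W = np.zeros((n_units, n_inputs))
x = np.array([1.0, 0.5, -0.5, 1.0])      # the SAME input goes to every unit
target = np.array([1.0, -1.0, 1.0])      # hypothetical "good" action pattern
baseline = 0.0                           # crude predictor of the reward

for _ in range(2000):
    noise = rng.normal(0.0, 1.0, n_units)           # each unit's private variability
    y = np.where(W @ x + noise > 0, 1.0, -1.0)      # noisy binary actions
    r = float(y @ target) / n_units                 # ONE scalar reward, broadcast to all
    W += lr * (r - baseline) * np.outer(y, x)       # reinforce each unit's own action
    baseline += 0.1 * (r - baseline)                # track expected reward

print(np.sign(W @ x))  # units typically settle on different outputs despite identical inputs
```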
Pole Balancing (Barto, Sutton, & Anderson, 1984)
• Widrow & Smith, 1964, “Pattern Recognizing Control Systems”
• Michie & Chambers, 1968, “BOXES: An Experiment in Adaptive Control”
MENACE (Michie, 1961): “Matchbox Educable Noughts and Crosses Engine”
[Figure: the array of matchboxes, one per noughts-and-crosses position]