Life: play and win in 20 trillion moves (Stuart Russell)

White to play and win in 2 moves.
Life • 100 years x 365 days x 24 hrs x 3600 seconds x 640 muscles x 10/second = 20 trillion actions • (Not to mention choosing brain activities!) • The world has a very large, partially observable state space and an unknown model • So, being intelligent is “provably” hard • How on earth do we manage?
Hierarchical structure!! (among other things) • Deeply nested behaviors give modularity • E.g., tongue controls for [t] are independent of everything else given the decision to say “tongue” • High-level decisions give scale • E.g., “go to IJCAI 13” ≈ 3,000,000,000 actions • Look further ahead, reduce computation exponentially [NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10, IJCAI 11]
Reminder: MDPs • Markov decision process (MDP) defined by • Set of states S • Set of actions A • Transition model P(s’ | s,a) • Markovian, i.e., depends only on s_t and not on the history • Reward function R(s,a,s’) • Discount factor γ • Optimal policy π* maximizes the value function V^π = Σ_t γ^t R(s_t, π(s_t), s_{t+1}) • Q(s,a) = Σ_{s’} P(s’ | s,a) [R(s,a,s’) + γ V(s’)]
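As a concrete illustration of the one-step backup Q(s,a) = Σ_{s’} P(s’|s,a)[R(s,a,s’) + γV(s’)], here is a minimal tabular sketch in Python. It is not from the talk; the toy states, the dictionaries P and R, and the helper q_value are hypothetical.

    # Minimal tabular MDP sketch (illustrative only).
    # P[s][a] is a list of (s_next, prob); R[(s, a, s_next)] is the reward.
    GAMMA = 0.9

    P = {'home':  {'fly': [('sfo', 0.9), ('home', 0.1)]},
         'sfo':   {'fly': [('ijcai', 1.0)]},
         'ijcai': {}}
    R = {('home', 'fly', 'sfo'): -1.0, ('home', 'fly', 'home'): -1.0,
         ('sfo', 'fly', 'ijcai'): 10.0}

    def q_value(s, a, V):
        """One-step backup: Q(s,a) = sum over s' of P(s'|s,a) [R(s,a,s') + gamma V(s')]."""
        return sum(p * (R.get((s, a, s2), 0.0) + GAMMA * V[s2])
                   for s2, p in P[s][a])

    V = {'home': 0.0, 'sfo': 0.0, 'ijcai': 0.0}
    print(q_value('home', 'fly', V))   # 0.9*(-1.0) + 0.1*(-1.0) = -1.0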
Reminder: policy invariance • MDP M with reward function R(s,a,s’) • Take any potential function Φ(s); then • Any policy that is optimal for M is optimal for any version M’ with reward function R’(s,a,s’) = R(s,a,s’) + γΦ(s’) − Φ(s) • So we can provide hints to speed up RL by shaping the reward function
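A minimal sketch of how such shaping might be wired into a learner; the potential phi (negative Manhattan distance to a fixed goal) and the wrapper shaped_reward are hypothetical names, and the environment reward r is assumed given.

    GAMMA = 0.9

    def phi(s):
        """Hypothetical shaping potential: negative Manhattan distance to a fixed goal."""
        goal = (10, 10)
        return -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))

    def shaped_reward(r, s, s_next):
        """R'(s,a,s') = R(s,a,s') + gamma*Phi(s') - Phi(s); optimal policies are unchanged."""
        return r + GAMMA * phi(s_next) - phi(s)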
Reminder: Bellman equation • Bellman equation for a fixed policy π: V^π(s) = Σ_{s’} P(s’|s,π(s)) [R(s,π(s),s’) + γ V^π(s’)] • Bellman equation for the optimal value function: V(s) = max_a Σ_{s’} P(s’|s,a) [R(s,a,s’) + γ V(s’)]
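The optimal-value equation can be solved by repeatedly applying the backup until it stops changing; a minimal value-iteration sketch, using the same hypothetical P and R table layout as the MDP sketch above.

    # Value iteration: repeat V(s) <- max_a sum_s' P(s'|s,a)[R(s,a,s') + gamma V(s')].
    GAMMA, TOL = 0.9, 1e-6

    def value_iteration(P, R, states):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                if not P[s]:          # terminal state: no actions available
                    continue
                best = max(sum(p * (R.get((s, a, s2), 0.0) + GAMMA * V[s2])
                               for s2, p in outcomes)
                           for a, outcomes in P[s].items())
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < TOL:
                return V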
Reminder: Reinforcement learning • Agent observes transitions and rewards: • Doing a in s leads to s’ with reward r • Temporal-difference (TD) learning: • V(s) ← (1-α) V(s) + α [r + γ V(s’)] • Q-learning: • Q(s,a) ← (1-α) Q(s,a) + α [r + γ max_{a’} Q(s’,a’)] • SARSA (after observing the next action a’): • Q(s,a) ← (1-α) Q(s,a) + α [r + γ Q(s’,a’)]
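The three updates side by side, as they might look in code; a sketch with hypothetical tables V (keyed by state) and Q (keyed by state-action pairs).

    ALPHA, GAMMA = 0.1, 0.9

    def td_update(V, s, r, s2):
        V[s] = (1 - ALPHA) * V[s] + ALPHA * (r + GAMMA * V[s2])

    def q_learning_update(Q, s, a, r, s2, actions):
        target = r + GAMMA * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target

    def sarsa_update(Q, s, a, r, s2, a2):
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * Q[(s2, a2)])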
Reminder: RL issues • Credit assignment problem: which of the many earlier actions deserves credit for a delayed reward? • Generalization problem: the state space is far too large to learn a value for each state separately
Abstract states • First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc. Oops – lost the Markov property! Action outcome depends on the low-level state, which depends on history
Abstract actions • Forestier & Varaiya, 1978: supervisory control • An abstract action is a fixed policy for a region of the state space • “Choice states” are where the policy exits its region • The decision process on choice states is semi-Markov
Semi-Markov decision processes • Add a random duration N for action executions: P(s’,N | s,a) • Modified Bellman equation: V^π(s) = Σ_{s’,N} P(s’,N | s,π(s)) [R(s,π(s),s’,N) + γ^N V^π(s’)] • Modified Q-learning with abstract actions: • For transitions within an abstract action π_a: r_c ← r_c + γ_c R(s,π_a(s),s’) ; γ_c ← γ_c γ • For transitions that terminate an action begun at s_0: Q(s_0,π_a) ← (1-α) Q(s_0,π_a) + α [r_c + γ_c max_{a’} Q(s’,π_{a’})]
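A sketch of that bookkeeping in code. The names are hypothetical; the caller is assumed to have run the abstract action from s_0 to its terminating state s_end and recorded the per-step rewards in order.

    ALPHA, GAMMA = 0.1, 0.9

    def smdp_q_update(Q, s0, option, rewards, s_end, options):
        """SMDP Q-learning: accumulate the discounted reward r_c and the discount
        gamma_c over the abstract action's execution, then update Q(s0, option)."""
        r_c, gamma_c = 0.0, 1.0
        for r in rewards:                       # rewards observed while the option ran
            r_c += gamma_c * r
            gamma_c *= GAMMA
        target = r_c + gamma_c * max(Q[(s_end, o)] for o in options)
        Q[(s0, option)] = (1 - ALPHA) * Q[(s0, option)] + ALPHA * target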
Hierarchical abstract machines • Hierarchy of NDFAs; states come in 4 types: • Action states execute a physical action • Call states invoke a lower-level NDFA • Choice points nondeterministically select next state • Stop states return control to invoking NDFA • NDFA internal transitions dictated by percept
Running example • Peasants can move, pick up, and drop off resources • Penalty for collision • Cost of living each step • Reward for dropping off resources • Goal: gather 10 gold + 10 wood • ≈ (3L)^n states s (for n peasants on a map with L locations) • 7^n primitive actions a
Joint SMDP • Physical states S, machine states Θ, joint state space Ω = S x Θ • Joint MDP: • Transition model for physical state (from physical MDP) and machine state (as defined by HAM) • A(ω) is fixed except when θ is at a choice point • Reduction to SMDP: • States are just those where θ is at a choice point
Hierarchical optimality • Theorem: the optimal solution to the SMDP is hierarchically optimal for the original MDP, i.e., maximizes value among all behaviors consistent with the HAM
Partial programs • HAMs (and Dietterich’s MAXQ hierarchies and Sutton’s Options) are all partially constrained descriptions of behaviors with internal state • A partial program is like an ordinary program in any programming language, with a nondeterministic choice operator added to allow for ignorance about what to do • HAMs, MAXQs, Options are trivially expressible as partial programs • Standard MDP: loop forever choose an action
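A sketch of “an ordinary program plus a nondeterministic choose operator” in Python; purely illustrative, with a hypothetical Agent runtime standing in for the ALisp machinery described on the following slides. Choices fall back to random until a learned completion fills them in.

    import random

    class Agent:
        """Hypothetical partial-program runtime: choose() marks a choice point
        whose resolution is left to the learning algorithm."""
        def __init__(self):
            self.completion = {}                 # choice label -> learned choice

        def choose(self, label, options):
            return self.completion.get(label, random.choice(options))

    def standard_mdp_program(agent, env, actions, steps=100):
        """The trivial partial program: loop forever, choose an action."""
        for _ in range(steps):
            a = agent.choose('act', actions)
            env.step(a)                          # env is assumed to expose step()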
RL and partial programs • [Diagram: the learning algorithm receives states and rewards (s, r) from the environment, emits actions a as constrained by the partial program, and learns a completion of the partial program] • Hierarchically optimal for all terminating programs
Single-threaded ALisp program

Program state θ includes: program counter, call stack, global variables

(defun top ()
  (loop do
    (choose 'top-choice
      (gather-gold)
      (gather-wood))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Q-functions • Represent completions using a Q-function • Joint state ω = [s,θ] = env state + program state • MDP + partial program = SMDP over {ω} • Q^π(ω,u) = “expected total reward if I make choice u in ω and follow completion π thereafter” • Modified Q-learning [AR 02] finds the optimal completion
Internal state • Availability of internal state (e.g., goal stack) can greatly simplify value functions and policies • E.g., while navigating to location (x,y), moving towards (x,y) is a good idea • Natural local shaping potential (distance from destination) impossible to express in external terms
Temporal Q-decomposition • [Diagram: a trajectory with call stacks Top / GatherGold / Nav(Mine2) and Top / GatherWood / Nav(Forest1), with the value of a choice split into Q_r, Q_c, Q_e segments] • Temporal decomposition Q = Q_r + Q_c + Q_e, where • Q_r(ω,u) = reward while doing u (may be many steps) • Q_c(ω,u) = reward in current subroutine after doing u • Q_e(ω,u) = reward after current subroutine
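A sketch of how a learner might store this decomposition; the three defaultdict tables are a hypothetical representation, not the ALisp implementation itself.

    from collections import defaultdict

    # Three Q-components, each keyed by (omega, u): choice state and choice.
    Qr = defaultdict(float)   # reward while executing the chosen subroutine
    Qc = defaultdict(float)   # reward in the current subroutine afterwards
    Qe = defaultdict(float)   # reward after the current subroutine exits

    def q(omega, u):
        """Total Q is the sum of the temporal components."""
        return Qr[(omega, u)] + Qc[(omega, u)] + Qe[(omega, u)]

    def greedy_choice(omega, choices):
        return max(choices, key=lambda u: q(omega, u))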
State abstraction • Temporal decomposition => state abstraction • E.g., while navigating, Q_c is independent of gold reserves • In general, local Q-components can depend on few variables => fast learning
Handling multiple effectors • Multithreaded agent programs • [Diagram: threads such as Get-gold, Get-wood, (Defend-Base), each controlling some of the effectors] • Threads = tasks • Each effector assigned to a thread • Threads can be created/destroyed • Effectors can be reassigned • Effectors can be created/destroyed
An example: single-threaded vs. concurrent ALisp

Single-threaded ALisp (for comparison):

(defun top ()
  (loop do
    (choose 'top-choice
      (gather-gold)
      (gather-wood))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

Concurrent ALisp:

(defun top ()
  (loop do
    (until (my-effectors) (choose 'dummy))
    (setf peas (first (my-effectors)))
    (choose 'top-choice
      (spawn gather-wood peas)
      (spawn gather-gold peas))))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Concurrent ALisp semantics • [Diagram, environment timesteps 26-27: Peasant 1’s thread runs gather-wood and nav(Wood1) while Peasant 2’s thread runs gather-gold and nav(Gold2); at any instant each thread is running, paused, making a joint choice, or waiting for the joint action]
Concurrent Alisp semantics • Threads execute independently until they hit a choice or action • Wait until all threads are at a choice or action • If all effectors have been assigned an action, do that joint action in environment • Otherwise, make joint choice
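A sketch of that synchronization rule as a coordinator loop. The Thread objects with run_until_blocked, pending_action, pending_choice and effector attributes are hypothetical, standing in for the concurrent ALisp scheduler (and assume one effector per thread for simplicity).

    def coordinator_step(threads, env, resolve_joint_choice):
        """Run each thread until it blocks at an action or a choice point, then
        either execute the joint action or make the joint choice."""
        for t in threads:
            t.run_until_blocked()

        if all(t.pending_action is not None for t in threads):
            joint_action = {t.effector: t.pending_action for t in threads}
            env.step(joint_action)               # one environment timestep
            for t in threads:
                t.pending_action = None
        else:
            choosing = [t for t in threads if t.pending_choice is not None]
            resolve_joint_choice(choosing)       # e.g., argmax of the joint Q-function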
Q-functions • To complete the partial program, at each choice state ω we need to specify choices for all choosing threads • So Q(ω,u) as before, except u is a joint choice • Suitable SMDP Q-learning gives the optimal completion • [Figure: an example Q-function]
Problems with concurrent activities • Temporal decomposition of Q-function lost • No credit assignment among threads • Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly • Peasant 2 thinks he’s done very well!! • Significantly slows learning as number of peasants increases
Threadwise decomposition • Idea: decompose the reward among threads (Russell & Zimdars, 2003) • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants • Q_j^π(ω,u) = “expected total reward received by thread j if we make joint choice u and then do π” • Threadwise Q-decomposition Q = Q_1 + … + Q_n • Recursively distributed SARSA => global optimality
Learning threadwise decomposition • [Diagram, action state: the scalar reward r is decomposed into per-thread rewards r1, r2, r3 for the Peasant 1, 2, 3 threads while the joint action a is executed]
Learning threadwise decomposition (continued) • [Diagram, choice state ω: each peasant thread runs its own Q-update, producing Q1(ω,·), Q2(ω,·), Q3(ω,·); the components are summed to Q(ω,·) and the joint choice u is the argmax]
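A sketch of this scheme in code: each thread keeps its own Q-table and is updated with its own reward stream, while the joint choice maximizes the summed components. The table layout and the SARSA-style update are a hypothetical rendering of the recursively distributed SARSA idea.

    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 0.9
    N_THREADS = 3
    Qs = [defaultdict(float) for _ in range(N_THREADS)]   # one Q-table per thread

    def joint_choice(omega, choices):
        """Pick the joint choice u maximizing the summed threadwise components."""
        return max(choices, key=lambda u: sum(Qj[(omega, u)] for Qj in Qs))

    def threadwise_sarsa_update(omega, u, rewards, omega2, u2):
        """Update each thread's component with that thread's own reward r_j."""
        for Qj, r_j in zip(Qs, rewards):
            target = r_j + GAMMA * Qj[(omega2, u2)]
            Qj[(omega, u)] = (1 - ALPHA) * Qj[(omega, u)] + ALPHA * target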
Threadwise and temporal decomposition • [Diagram: per-thread call stacks for the Top thread and the Peasant 1-3 threads] • Q_j = Q_{j,r} + Q_{j,c} + Q_{j,e} where • Q_{j,r}(ω,u) = expected reward gained by thread j while doing u • Q_{j,c}(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine • Q_{j,e}(ω,u) = expected reward gained by thread j after the current subroutine
Resource gathering with 15 peasants • [Learning curves: reward of learnt policy vs. number of learning steps (×1000), comparing Flat, Undecomposed, Threadwise, and Threadwise + Temporal learners]
Summary of Part I • Structure in behavior seems essential for scaling up • Partial programs • Provide natural structural constraints on policies • Decompose value functions into simple components • Include internal state (e.g., “goals”) that further simplifies value functions, shaping rewards • Concurrency • Simplifies description of multieffector behavior • Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)
Outline • Hierarchical RL with abstract actions • Hierarchical RL with partial programs • Efficient hierarchical planning
Abstract Lookahead • [Figures: a two-move chess lookahead tree (Kc3, Rxa1, Kxa1, Qa4, c2, …) and a two-step lookahead tree over the characters ‘a’, ‘b’ (aa, ab, ba, bb)] • k-step lookahead >> 1-step lookahead • E.g., chess • k-step lookahead is no use if the steps are too small • E.g., the first k characters of a NIPS paper • Abstract plans (high-level executions of partial programs) are shorter • Much shorter plans for given goals => exponential savings • Can look ahead much further into the future
High-level actions • [Diagram: refinement tree in which [Act] refines to [GoSFO, Act] or [GoOAK, Act], and further to plans such as [WalkToBART, BARTtoSFO, Act]] • Start with a restricted, classical HTN form of partial program • A high-level action (HLA) • Has a set of possible immediate refinements into sequences of actions, primitive or high-level • Each refinement may have a precondition on its use • Executable plan space = all primitive refinements of Act
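A sketch of how HLAs and their refinements might be represented; the dataclasses and the applicable_refinements helper are hypothetical, with preconditions as simple callables on the state.

    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    @dataclass
    class Refinement:
        steps: List[object]                      # each step: a primitive action name or an HLA
        precondition: Optional[Callable] = None  # state -> bool

    @dataclass
    class HLA:
        name: str
        refinements: List[Refinement] = field(default_factory=list)

    def applicable_refinements(hla, state):
        """Refinements whose preconditions hold in the current state."""
        return [r for r in hla.refinements
                if r.precondition is None or r.precondition(state)]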
Lookahead with HLAs • To do offline or online planning with HLAs, need a model • See classical HTN planning literature – not! • Drew McDermott (AI Magazine, 2000): • The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions
Downward refinement • Downward refinement property (DRP) • Every apparently successful high-level plan has a successful primitive refinement • “Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking • Bacchus and Yang, 1991: It is naïve to expect DRP to hold in general • MRW, 2007: Theorem: If assertions about HLAs are true, then DRP always holds • Problem: how to say true things about effects of HLAs?
Angelic semantics for HLAs • [Figures: state space with start state s0 and goal g, showing the reachable sets of HLAs h1 and h2; since the reachable set of the sequence h1 h2 contains g, h1 h2 is a solution] • Begin with the atomic state-space view: start in s0, goal state g • Central idea is the reachable set of an HLA from each state • When extended to sequences of actions, this allows proving that a plan can or cannot possibly reach the goal • May seem related to nondeterminism • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature
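A sketch of the reachable-set idea for a sequence of HLAs, under the assumption (a hypothetical representation) that reach[h] maps a state to the exact set of states some primitive refinement of h can reach.

    def plan_reachable_set(start, plan, reach):
        """Reachable set of a plan: image of the start state under the composed
        reachable-set maps of its HLAs."""
        current = {start}
        for h in plan:
            nxt = set()
            for s in current:
                nxt |= reach[h](s)
            current = nxt
        return current

    def provably_achieves(start, plan, reach, goal_states):
        """Angelic check: with exact reachable sets, the plan has a successful
        primitive refinement iff its reachable set intersects the goal."""
        return bool(plan_reachable_set(start, plan, reach) & goal_states)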