Meeting 9 - RL Wrap Up and B&W Tournament • Course logistic notes • Reinforcement Learning Wrap Up (30 minutes) • Temporal Difference Learning and Variants • Examples • Continuous state-action spaces • Tournament time – move to 5110
Course Logistics • Assignment 1: • Tournament code due several minutes ago • Final code and paper due Thursday; the extension is because the McGill open house last weekend prevented some of you from accessing the lab • Tournament after a short conclusion of RL
Value Estimation and Greedy Policy Improvement: Exercise • The exercise was posted in elaborated form on Friday 27 January – See the course website. • The exercise is due on Tuesday 7 February at 16h00. • This will count as one quiz toward your grade.
Value and Action-Value Functions • Value V(s) of a state under the policy: • Action-Value Q(s,a): take any action a and follow the policy thereafter
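For reference, the standard definitions (a sketch in the usual Sutton-Barto notation, with discount factor γ; the exact formulas used in lecture may differ slightly):

V^{\pi}(s) = \mathbb{E}_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\Big|\; s_t = s \Big]

Q^{\pi}(s,a) = \mathbb{E}_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\Big|\; s_t = s,\ a_t = a \Big]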
Generalized Policy Iteration (for policy improvement) • Iterative cycle of value estimation and policy improvement (a compact statement follows below):
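The cycle can be summarized compactly in standard notation (a sketch, not a reproduction of the slide's diagram): evaluation drives the value estimate toward Q^π, and improvement makes the policy greedy with respect to the current estimate:

\pi'(s) = \arg\max_{a} Q^{\pi}(s,a)

Alternating the two steps drives the policy and value estimates toward the optimal π* and Q* in the tabular case.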
Value Estimation via Temporal Difference Learning • Idea: use the sampling idea of Monte Carlo, but instead of adjusting V(st) to better match the observed return Rt, use a revised estimate of the return • The value function update formula becomes the TD rule sketched below • Note: if V(st+1) were a perfect estimate, this would be a DP update • This value estimation method is called Temporal-Difference (TD) learning
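Written out in the usual notation (a sketch; step size α and discount γ assumed), the revised return estimate and the resulting TD(0) update are:

R_t \approx r_{t+1} + \gamma V(s_{t+1})

V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]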
TD Learning: Advantages • No model of the environment (rewards, probabilities) is needed • TD only needs experience with the environment. • On-line, incremental learning • Both TD and MC converge. TD converges faster • Learn before knowing final outcome • Less memory and peak computation required
TD for learning action values ( “Q-Learning”) • A simple ansatz gives a TD version of action-value learning • Bellman equation for Qπ, the corresponding dynamic programming update, and the TD update for Q(s,a) are written out below
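In standard notation these read roughly as follows (a sketch; the last line is the usual off-policy Q-learning rule, in which a max over next actions replaces the sampled on-policy action):

Q^{\pi}(s,a) = \mathbb{E}_{\pi}\big[ r_{t+1} + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s,\ a_t = a \big]

Q(s,a) \leftarrow \mathbb{E}_{\pi}\big[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) \mid s_t = s,\ a_t = a \big]

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]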
On-Policy SARSA learning • Policy improvement: at each time step, choose an action that is mostly greedy (e.g. ε-greedy) with respect to Q • After getting the reward, seeing st+1, and choosing at+1, update the action values according to the action-value TD formula (a code sketch follows below)
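A minimal tabular sketch of this loop in Python. Assumptions (none of them from the assignment code): a hypothetical environment with reset() and step(action) returning (next_state, reward, done), a list of discrete actions, and illustrative parameter names alpha, gamma, epsilon.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    # Mostly greedy: explore with probability epsilon, otherwise take argmax_a Q(s, a)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    # One episode of on-policy SARSA; Q is a defaultdict(float) keyed by (state, action)
    state = env.reset()
    action = epsilon_greedy(Q, state, actions, epsilon)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(Q, next_state, actions, epsilon)
        # The target uses the action actually chosen at s_{t+1}: this is what makes SARSA on-policy
        target = reward if done else reward + gamma * Q[(next_state, next_action)]
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state, action = next_state, next_action

# Usage sketch: Q = defaultdict(float), then call sarsa_episode(env, Q, actions) repeatedly;
# the ε-greedy policy improves as Q improves.
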
Grid World Example • Goal – prey runs about at random • Agent – predator chasing prey
Grid World Example • Pursuit before learning
Grid World Example • Upon further learning trials
Grid World Example • Learned pursuit task
Grid World Example • State space and learned value function
Grid World Example • State sequence st • Action sequence at • Reward sequence rt • Value sequence V(st) • Delta (TD error) sequence δt = rt+1 + γV(st+1) − V(st)
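As an illustration, the delta sequence can be computed from a recorded trajectory in a few lines of Python (the list layout and the value table V are assumptions for this sketch, not the lab's actual data format):

def td_errors(states, rewards, V, gamma=0.9):
    # states: [s_0, ..., s_T]; rewards: [r_1, ..., r_T], where r_{t+1} is received on entering s_{t+1}
    deltas = []
    for t in range(len(states) - 1):
        delta = rewards[t] + gamma * V[states[t + 1]] - V[states[t]]
        deltas.append(delta)
    return deltas
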
Continuous (or large) state spaces • Previously implied table implementation: Value[state] • Impractical for continuous or large (multi-dimensional) state spaces • Examples: Chess (~10^43), Robot manipulator (continuous) • B&W state space size? • Both storage problems and generalization problems • Generalization problem: Agent will not have a chance to explore most states of a sufficiently large state space
State Space Generalization: Approaches • Quantize continuous state spaces • Circumvents the generalization problem: forces a small number of states • Quantization of continuous state variables, perhaps coarse • Example: angle in the cart-pole problem quantized to (-90, -30, 0, 30, 90)° • Tilings: impose overlapping grids • More general approach: function approximation • Simple case: represent V(s) with a set of weights wi and basis functions fi(s) • V(s) = w1 f1(s) + w2 f2(s) + ... + wn fn(s) • More refined methods of function approximation (e.g. neural networks) in coming weeks • Added benefit: V(s) generalizes to predict the value of states not yet visited (by interpolating between visited states, for example) • A code sketch of the linear case follows below
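A minimal sketch of the linear case combined with a TD(0) weight update, assuming a hypothetical features(s) function that returns the basis-function vector f(s) as a NumPy array (names and step sizes here are illustrative):

import numpy as np

def v_hat(w, f_s):
    # Linear approximation: V(s) = w1*f1(s) + ... + wn*fn(s)
    return float(np.dot(w, f_s))

def td_update(w, s, r, s_next, features, alpha=0.01, gamma=0.9):
    # One TD(0) step on the weight vector; 'features' maps a state to its feature vector f(s)
    f_s, f_next = features(s), features(s_next)
    delta = r + gamma * v_hat(w, f_next) - v_hat(w, f_s)
    # For a linear approximator, the gradient of V(s) with respect to w is just f(s)
    return w + alpha * delta * f_s

Because states with similar features share the same weights, an update at one state also shifts the predicted value of nearby, unvisited states, which is the generalization benefit noted above.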
Robot that learns to stand • No static solution to the stand-up problem: the robot must use momentum to stand up • RL used: • Two-layer hierarchy • TD learning of action values Q(s,a) applied to plan a sequence of sub-goals that may lead to success • Continuous version of TD learning applied to learn to achieve the sub-goals • Robot trained for several hundred iterations in simulation plus roughly 100 trials on the physical robot • Details: • J. Morimoto, K. Doya, “Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning”, Robotics and Autonomous Systems 36 (2001)
Robot that learns to stand • After several hundred attempts in simulation
Robot that learns to stand • After ~100 additional trials on the physical robot
Wrap up • Next time: • Brief overview of planning in AI • End of 1st section of course devoted to problem solving, search, and RL
Wrap up • Required readings • Russell and Norvig • Chapter 11: Planning • Acknowledgements: • Rich Sutton, Doina Precup, RL materials used here • Kenji Doya, Standing robot materials
Inaugural B&W Computer Tournament • Number of competitors? • Duration of typical game? • t ≈ 50 (total) moves × 10 sec / move ≈ 8 minutes • Stage 1: • Round-robin play • 3 games against randomly selected opponents • t ≈ 35 minutes • Top 8 agents advance. Scoring: Draw = 0, Win = +1, Loss = -1. • Stage 2: • Single-elimination seeded bracket play: (((1,8),(4,5)),((2,7),(3,6))) • Top four competitors receive a bonus (drawn agents will be dealt with fairly) • Draws: game drawn after 50 (total) moves, or by referee decision