
Applying Online Search Techniques to Reinforcement Learning


Presentation Transcript


  1. Applying Online Search Techniques to Reinforcement Learning. Scott Davies, Andrew Ng, and Andrew Moore, Carnegie Mellon University

  2. The Agony of Continuous State Spaces • Learning useful value functions for continuous-state optimal control problems can be difficult • Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably • Accurate value functions can be very expensive to compute even in relatively low-dimensional spaces with perfectly accurate state transition models

  3. Combining Value Functions With Online Search • Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent’s current position to compensate for value function inaccuracies • We examine two different types of search: • “Local” searches in which the agent performs a finite-depth look-ahead search • “Global” searches in which the agent searches for trajectories all the way to goal states

  4. Typical One-Step "Search" Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes R_T + γV(x_T), where R_T is the reward accumulated along T, γ is the discount factor, and x_T is the state at the end of T. This takes O(|A|) time, where A is the set of possible actions. Given a perfect V(x), this would lead to optimal behavior.
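
A minimal sketch of this one-step selection in Python, assuming a hypothetical deterministic model(x, a) that returns the next state and one-step reward, and an approximate value function V (neither is specified in the slides):

# One-step greedy action selection: simulate each of the |A| actions with the
# model and pick the one maximizing R_T + gamma * V(x_T).
def one_step_greedy(x, actions, model, V, gamma=0.99):
    best_action, best_score = None, float("-inf")
    for a in actions:                        # O(|A|) model evaluations
        x_next, r = model(x, a)              # hypothetical model signature
        score = r + gamma * V(x_next)
        if score > best_score:
            best_action, best_score = a, score
    return best_action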

  5. Local Search • An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes R_T + γ^d V(x_T). • Computational expense: O(|A|^d). • To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times. • Computational expense: considerably cheaper than a full d-step search if s << d.
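
A rough Python sketch of this limited-switch d-step search, under the same hypothetical model(x, a) -> (next_state, reward) assumption as above; it returns the value R_T + γ^d V(x_T) of the best trajectory found together with its first action:

def limited_switch_search(x, d, s, actions, model, V, gamma=0.99, prev_a=None):
    # Enumerate d-step trajectories from x with at most s remaining action
    # switches; discounting compounds through the recursion, so the returned
    # value equals R_T + gamma^d * V(x_T) for the best trajectory.
    if d == 0:
        return V(x), None
    best_val, best_first = float("-inf"), None
    for a in actions:
        switch = prev_a is not None and a != prev_a
        if switch and s == 0:
            continue                         # no switches left: keep prev_a
        x_next, r = model(x, a)
        tail, _ = limited_switch_search(x_next, d - 1, s - int(switch),
                                        actions, model, V, gamma, prev_a=a)
        val = r + gamma * tail
        if val > best_val:
            best_val, best_first = val, a
    return best_val, best_first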

  6. Local Search: Example (figure: hill-car phase-space plot, axes position and velocity) • Two-dimensional state space (position + velocity) • Car must back up to take a "running start" to make it up the hill • Search over 20-step trajectories with at most one switch in actions

  7. Using Local Search Online Repeat: • From the current state, consider all possible d-step trajectories T in which the action is changed at most s times • Perform the first action of the trajectory that maximizes R_T + γ^d V(x_T). Let B denote the "parallel backup operator" such that (BV)(x) = max_a [R(x, a) + γ V(x'_a)], where x'_a is the state reached by taking action a from x. If s = d-1, Local Search is formally equivalent to behaving greedily with respect to the new value function B^(d-1)V. Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
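
Read this way, B is the one-step Bellman backup applied through the model (the slide's own equation for B did not survive extraction, so the form above is a reconstruction). A sketch, again using the hypothetical deterministic model(x, a) from the earlier snippets:

def parallel_backup(V, actions, model, gamma=0.99):
    # Return the function BV, where (BV)(x) = max_a [R(x, a) + gamma * V(x'_a)].
    # Composing it d-1 times gives B^(d-1)V; one naive evaluation of that costs
    # O(|A|^(d-1)) model calls, which is why it is only ever applied online.
    def BV(x):
        return max(r + gamma * V(x_next)
                   for x_next, r in (model(x, a) for a in actions))
    return BV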

  8. Uninformed Global Search • Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal? • Problem: combinatorial explosion. • Possible solution: • Break state space into partitions, e.g. a uniform grid. (Can be represented sparsely.) • Use previously discussed local search procedure to find trajectories between partitions • Prune all but least-cost trajectory entering any given partition
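
The sparse uniform partition can be as simple as hashing each continuous state to the integer coordinates of its grid cell. A small illustrative sketch (the cell size and the dictionary layout are assumptions, not the authors' implementation):

import math

def grid_cell(x, cell_size):
    # Map a continuous state vector to the integer coordinates of its grid
    # cell; these tuples index a sparse dictionary, so only cells that are
    # actually reached ever take up memory.
    return tuple(int(math.floor(xi / cell_size)) for xi in x)

# cell -> (cost of cheapest trajectory entering it, representative state,
#          trajectory); all other trajectories entering the cell are pruned.
best_entry = {}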

  9. Uninformed Global Search • Problems: • Still computationally expensive • Even with fine partitioning of state space, pruning the wrong trajectories can cause search to fail

  10. Informed Global Search • Use approximate value function V to guide the selection of which points to search from next • Reasonably accurate V will cause search to stay along optimal path to goal: dramatic reduction in search time • V can help choose effective points within each partition from which to search, thereby improving solution quality • Uninformed Global Search is the same as "Informed" Global Search with V(x) = 0

  11. Informed Global Search Algorithm • Let x0 be current state, and g(x0) be the grid element containing x0 • Set g(x0)’s “representative state” to x0, and add g(x0) to priority queue P with priority V(x0) • Until goal state found or P empty: • Remove grid element g from top of P. Let x denote g’s “representative state.” • SEARCH-FROM(g, x) • If goal found, execute trajectory; otherwise signal failure

  12. Informed Global Search Algorithm, cont’d SEARCH-FROM(g, x): • Starting from x, perform "local search" as described earlier, but prune the search wherever it reaches a different grid element g' ≠ g. • Each time another grid element g' is reached at a state x': • If g' was previously SEARCHED-FROM, do nothing. • If g' was never previously reached, add g' to P with priority R_T(x0…x') + γ^|T| V(x'), where T is the trajectory from x0 to x'. Set the "representative state" of g' to x'. Record the trajectory from x to x'. • If g' was previously reached but its previous priority is lower than R_T(x0…x') + γ^|T| V(x'), update g''s priority to R_T(x0…x') + γ^|T| V(x'), set its "representative state" to x', and record the trajectory from x to x'.
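
Slides 11 and 12 together amount to a best-first (A*-like) loop over grid cells. A compact Python sketch under the same assumptions as the earlier snippets; grid_cell(x), is_goal(x), and local_search_from(x) are hypothetical helpers (the partition map, a goal test, and the slide-5 limited-switch search restricted to x's cell, yielding for each exiting trajectory the state reached, its discounted reward, and its length), and the per-cell trajectory bookkeeping of slide 12 is omitted:

import heapq

def informed_global_search(x0, V, grid_cell, local_search_from, is_goal,
                           gamma=0.99):
    # Best-first search over grid cells, using V as the heuristic part of
    # each cell's priority R_T + gamma^|T| * V(x).
    g0 = grid_cell(x0)
    info = {g0: (V(x0), x0, 0.0, 0)}      # cell -> (priority, rep. state,
                                          #          reward so far, steps so far)
    searched, heap = set(), [(-V(x0), g0)]   # heapq is a min-heap, so negate
    while heap:
        _, g = heapq.heappop(heap)
        if g in searched:
            continue                      # stale queue entry
        searched.add(g)
        _, x, r_sofar, n_sofar = info[g]
        if is_goal(x):
            return x                      # goal found; execute recorded trajectory
        for x_new, r, n in local_search_from(x):
            g_new = grid_cell(x_new)
            if g_new == g or g_new in searched:
                continue
            r_total = r_sofar + gamma ** n_sofar * r
            prio = r_total + gamma ** (n_sofar + n) * V(x_new)
            if g_new not in info or prio > info[g_new][0]:
                info[g_new] = (prio, x_new, r_total, n_sofar + n)
                heapq.heappush(heap, (-prio, g_new))
    return None                           # priority queue exhausted: signal failure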

  13. Informed Global Search Examples (figures: Hill-car search trees with a 7×7 simplex-interpolated V and a 13×13 simplex-interpolated V)

  14. Informed Global Search as A* • Informed Global Search is essentially an A* search using the value function V as a search heuristic • Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal. • Uninformed global search effectively uses the trivially optimistic heuristic V(x) = 0. Might we expect better solution quality with uninformed search than with a crude, non-optimistic approximate value function V? • Not necessarily! A crude, non-optimistic approximate value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree

  15. Hill-car • Car on steep hill • State variables: position and velocity (2-d) • Actions: accelerate forward or backward • Goal: park near top • Random start states • Cost: total time to goal

  16. Acrobot • Two-link planar robot acting in a vertical plane under gravity • Actuated joint at the elbow; unactuated shoulder • Two angular positions & their velocities (4-d) • Goal: raise tip at least one link’s height above the shoulder • Two actions: full torque clockwise / counterclockwise • Random starting positions • Cost: total time to goal (figure: two-link acrobot with joint angles θ1 and θ2 and the goal height marked)

  17. Move-Cart-Pole • Upright pole attached to cart by unactuated joint • State: horizontal position of cart, angle of pole, and associated velocities (4-d) • Actions: accelerate left or right • Goal configuration: cart moved, pole balanced • Start with random x; θ = 0 • Per-step cost quadratic in distance from goal configuration • Big penalty if pole falls over (figure: cart-pole with pole angle θ, cart position x, and the goal configuration marked)
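
An illustrative per-step cost of this form, taking the distance here to be just the cart's offset from the goal position (the weighting and the fall penalty are assumptions, not the values used in the paper):

def step_cost(cart_x, goal_x, pole_fell, fall_penalty=1000.0):
    # Quadratic in the cart's distance from the goal position, plus a large
    # penalty if the pole has fallen over.
    return (cart_x - goal_x) ** 2 + (fall_penalty if pole_fell else 0.0)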

  18. Planar Slider • Puck sliding on bumpy 2-d surface • Two spatial variables & their velocities (4-d) • Actions: accelerate NW, NE, SW, or SE • Goal in NW corner • Random start states • Cost: total time to goal

  19. Local Search Experiments: Move-Cart-Pole • CPU time and solution cost vs. search depth d • No limits imposed on number of action switches (s = d) • Value function: 13^4 simplex-interpolation grid

  20. Local Search Experiments: Hill-car • CPU time and solution cost vs. search depth d • Max. number of action switches fixed at 2 (s = 2) • Value function: 7^2 simplex-interpolated grid

  21. Comparative experiments: Hill-Car • Local search: d=6, s=2 • Global searches: • Local search between grid elements: d=20, s=1 • 50^2 search grid resolution • 7^2 simplex-interpolated value function

  22. Hill-Car results, cont’d • Uninformed Global Search prunes the wrong trajectories • Increase the search grid to 100^2 so this doesn’t happen: • Uninformed search is then near-optimal • Informed search isn’t: the crude value function is not optimistic (figure: failed search trajectory)

  23. Comparative Results: Four-d domains • All value functions: 13^4 simplex interpolations • All local searches between global-search grid elements: depth 20, with at most 1 action switch (d=20, s=1) • Acrobot: • Local Search: depth 4; no action-switch restriction (d=4, s=4) • Global: 50^4 search grid • Move-Cart-Pole: same as Acrobot • Slider: • Local Search: depth 10; max. 1 action switch (d=10, s=1) • Global: 20^4 search grid

  24. Acrobot • Local search significantly improves solution quality, but increases CPU time by an order of magnitude • Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning • Informed global search finds much better solutions in relatively little time: the value function drastically reduces the search, and better pruning leads to better solutions • #LS: number of local searches performed to find paths between elements of the global search grid

  25. Move-Cart-Pole • No search: pole often falls, incurring large penalties; overall poor solution quality • Local search improves things a bit • Uninformed search finds better solutions than informed • Few grid cells in which pruning is required • Value function not optimistic, so informed-search solutions are suboptimal • Informed search nonetheless reduces costs by an order of magnitude with no increase in required CPU time

  26. Planar Slider • Local search almost useless, and incurs massive CPU expense • Uninformed search decreases solution cost by 50%, but at even greater CPU expense • Informed search decreases solution cost by a factor of 4, at no increase in CPU time

  27. Using Search with Learned Models • Toy Example: Hill-Car • 7^2 simplex-interpolated value function • One nearest-neighbor function approximator per possible action, used to learn dx/dt • States sufficiently far from their nearest neighbor are optimistically assumed to be absorbing, to encourage exploration • Average costs over first few hundred trials: • No search: 212 • Local search: 127 • Informed global search: 155
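
A sketch of this kind of learned model: one nearest-neighbor approximator per action for dx/dt, with states far from all stored data optimistically treated as absorbing (the distance threshold, the Euler integration step, and the class layout are illustrative assumptions, not the authors' implementation):

import numpy as np

class NearestNeighborModel:
    # One nearest-neighbor dynamics approximator per action; queries far from
    # all stored data are optimistically treated as absorbing, which encourages
    # the planner to explore those regions.
    def __init__(self, actions, max_dist=0.1):
        self.data = {a: [] for a in actions}   # action -> [(state, dx_dt), ...]
        self.max_dist = max_dist

    def update(self, x, a, dx_dt):
        self.data[a].append((np.asarray(x, float), np.asarray(dx_dt, float)))

    def predict(self, x, a, dt=0.05):
        # Return (next_state, absorbing) using the nearest stored neighbor.
        x = np.asarray(x, float)
        if not self.data[a]:
            return x, True                     # no data yet: treat as absorbing
        states = np.array([s for s, _ in self.data[a]])
        i = int(np.argmin(np.linalg.norm(states - x, axis=1)))
        if np.linalg.norm(states[i] - x) > self.max_dist:
            return x, True                     # far from data: optimistic absorb
        return x + dt * self.data[a][i][1], False   # Euler step with learned dx/dt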

  28. Using Search with Learned Models • Problems do arise when using learned models: • Inaccuracies in models may cause global searches to fail. Not clear then if failure should be blamed on model inaccuracies or on insufficiently fine state space partitioning • Trajectories found will be inaccurate • Need adaptive closed-loop controller • Fortunately, we will get new data with which to increase the accuracy of our model • Model approximators must be fast and accurate

  29. Avenues for Future Research • Extensions to nondeterministic systems? • Higher-dimensional problems • Better function approximators for model learning • Variable-resolution search grids • Optimistic value function generation?
