
Markov Decision Processes & Reinforcement Learning



  1. Markov Decision Processes & Reinforcement Learning • Megan Smith, Lehigh University, Fall 2006

  2. Outline • Stochastic Process • Markov Property • Markov Chain • Markov Decision Process • Reinforcement Learning • RL Techniques • Example Applications

  3. Stochastic Process • Quick definition: A Random Process • Often viewed as a collection of indexed random variables • Useful to us: Set of states with probabilities of being in those states indexed over time • We’ll deal with discrete stochastic processes http://en.wikipedia.org/wiki/Image:AAMarkov.jpg

  4. Stochastic Process Example • Classic: Random Walk • Start at state X0 at time t0 • At time ti, move a step Zi where P(Zi = -1) = p and P(Zi = 1) = 1 - p • At time ti, state Xi = X0 + Z1 +…+ Zi http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
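
The walk above is easy to simulate; a minimal Python sketch (the choice of p = 0.5, the horizon, and the seed are illustrative assumptions, not from the slides):

```python
import random

def random_walk(x0=0, p=0.5, n_steps=20, seed=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with P(Z_i = -1) = p, P(Z_i = +1) = 1 - p."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n_steps):
        x += -1 if rng.random() < p else 1   # take one step left or right
        path.append(x)
    return path

print(random_walk())   # the sequence of states visited by one walk
```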

  5. Markov Property • Also thought of as the “memoryless” property • A stochastic process is said to have the Markov property if the probability of state Xn+1 having any given value depends only upon state Xn • Very much depends on description of states

  6. Markov Property Example • Checkers: • Current State: The current configuration of the board • Contains all information needed for transition to next state • Thus, each configuration can be said to have the Markov property

  7. Markov Chain • Discrete-time stochastic process with the Markov property • Industry Example: Google’s PageRank algorithm • Probability distribution representing likelihood of random linking ending up on a page http://en.wikipedia.org/wiki/PageRank
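
A tiny sketch of that idea: repeatedly push a distribution over pages through the chain's transition matrix until it stops changing; the stationary distribution is the PageRank-style score (the three-page link structure is invented for illustration, and the damping factor real PageRank adds is omitted):

```python
def stationary_distribution(P, n_iter=100):
    """P[i][j] = probability a random surfer on page i follows a link to page j."""
    n = len(P)
    dist = [1.0 / n] * n                     # start uniform over pages
    for _ in range(n_iter):                  # repeatedly apply the chain
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

# Toy three-page web: each row is one page's outgoing-link probabilities.
P = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
print(stationary_distribution(P))            # long-run visit probability of each page
```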

  8. Markov Decision Process (MDP) • Discrete time stochastic control process • Extension of Markov chains • Differences: • Addition of actions (choice) • Addition of rewards (motivation) • If the actions are fixed, an MDP reduces to a Markov chain

  9. Description of MDPs • Tuple (S, A, P(.,.), R(.)) • S -> state space • A -> action space • Pa(s, s’) = Pr(st+1 = s’ | st = s, at = a) • R(s) = immediate reward at state s • Goal is to maximize some cumulative function of the rewards • Finite MDPs have finite state and action spaces
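
A minimal sketch of that tuple as a Python data structure, reused by the later sketches (the two-state example and every number in it are invented for illustration; they are not the recycling robot's actual parameters):

```python
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    states: list        # S: finite state space
    actions: list       # A: finite action space
    P: dict             # P[(s, a, s')] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
    R: dict             # R[s] = immediate reward at state s
    gamma: float = 0.9  # discount factor

# Hypothetical two-state, two-action MDP in the spirit of the recycling robot.
mdp = FiniteMDP(
    states=["high", "low"],
    actions=["search", "wait"],
    P={("high", "search", "high"): 0.7, ("high", "search", "low"): 0.3,
       ("high", "wait", "high"): 1.0,
       ("low", "search", "low"): 0.6, ("low", "search", "high"): 0.4,
       ("low", "wait", "low"): 1.0},
    R={"high": 1.0, "low": -1.0},
)
```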

  10. Simple MDP Example • Recycling MDP Robot • Can search for trashcan, wait for someone to bring a trashcan, or go home and recharge battery • Has two energy levels – high and low • Searching runs down battery, waiting does not, and a depleted battery has a very low reward news.bbc.co.uk

  11. Transition Probabilities

  12. Transition Graph • Figure: transition graph for the recycling robot, drawn with state nodes and action nodes

  13. Solution to an MDP = Policy π • Gives the action to take from a given state, regardless of history • Solution methods keep two arrays indexed by state: V, the value function (the expected discounted sum of rewards from following the policy), and π, the action to be taken in each state (the policy) • Solving alternates two basic steps: a value update, V(s) := R(s) + γ ∑s' Pπ(s)(s,s') V(s'), and a policy update, π(s) := argmaxa ∑s' Pa(s,s') V(s')
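
A sketch of the value-update step for a fixed policy, iterated to convergence (it assumes the FiniteMDP structure sketched earlier; the stopping tolerance is an arbitrary choice):

```python
def evaluate_policy(mdp, policy, tol=1e-6):
    """Iterate V(s) := R(s) + gamma * sum_s' P_{pi(s)}(s, s') * V(s') until it stops moving."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            a = policy[s]                       # action the policy takes in s
            v_new = mdp.R[s] + mdp.gamma * sum(
                mdp.P.get((s, a, s2), 0.0) * V[s2] for s2 in mdp.states)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# e.g. evaluate_policy(mdp, {"high": "search", "low": "wait"}) with the toy MDP above
```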

  14. Variants • Value Iteration • Policy Iteration • Modified Policy Iteration • Prioritized Sweeping

  15. Value Iteration • V(s) = R(s) + γ maxa ∑s' Pa(s,s') V(s') • Figure: value estimates for a small grid of states, shown converging over successive sweeps of value iteration
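
A sketch of value iteration with the same structure: apply the max-over-actions update from this slide until the values converge, then read off a greedy policy:

```python
def value_iteration(mdp, tol=1e-6):
    """Iterate V(s) := R(s) + gamma * max_a sum_s' P_a(s, s') * V(s')."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v_new = mdp.R[s] + mdp.gamma * max(
                sum(mdp.P.get((s, a, s2), 0.0) * V[s2] for s2 in mdp.states)
                for a in mdp.actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Greedy policy extracted from the converged value function.
    policy = {s: max(mdp.actions,
                     key=lambda a: sum(mdp.P.get((s, a, s2), 0.0) * V[s2]
                                       for s2 in mdp.states))
              for s in mdp.states}
    return V, policy
```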

  16. Why So Interesting? • If the transition probabilities are known, this becomes a straightforward computational problem, however… • If the transition probabilities are unknown, then this is a problem for reinforcement learning.

  17. Typical Agent • In reinforcement learning (RL), the agent observes a state and takes an action. • Afterward, the agent receives a reward.

  18. Mission: Optimize Reward • Rewards are calculated in the environment • Used to teach the agent how to reach a goal state • Must signal what we ultimately want achieved, not necessarily subgoals • May be discounted over time • In general, seek to maximize the expected return

  19. Value Functions • Vπ is the state-value function for policy π (how good is it to be in this state?) • Vπ is the unique solution to its Bellman equation, which expresses the relationship between the value of a state and the values of its successor states • Bellman equation (with the reward convention used earlier): Vπ(s) = R(s) + γ ∑a π(s,a) ∑s' Pa(s,s') Vπ(s')

  20. Another Value Function • Qπ, the action-value function for policy π, defines the value of taking action a in state s under policy π • Expected return starting from s, taking action a, and thereafter following policy π • Figure: backup diagrams for (a) Vπ and (b) Qπ

  21. Dynamic Programming • Classically, a collection of algorithms used to compute optimal policies given a perfect model of environment as an MDP • The classical view is not so useful in practice since we rarely have a perfect environment model • Provides foundation for other methods • Not practical for large problems

  22. DP Continued… • Use value functions to organize and structure the search for good policies. • Turn Bellman equations into update policies. • Iterative policy evaluation using full backups

  23. Policy Improvement • When should we change the policy? • If picking a new action a from state s and thereafter following the current policy π yields an expected return at least as large as Vπ(s), i.e. Qπ(s, a) >= Vπ(s), then the policy π’ that always picks a from s satisfies Vπ’ >= Vπ, so it is at least as good overall • Results from the policy improvement theorem

  24. Policy Iteration • Continue improving the policy π and recalculating Vπ until the policy stops changing • A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations
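
A sketch of policy iteration built from the pieces above: evaluate the current policy, improve it greedily, and stop once no state's action changes (evaluate_policy is the hypothetical helper sketched earlier):

```python
def policy_iteration(mdp):
    """Alternate full policy evaluation with greedy policy improvement until stable."""
    policy = {s: mdp.actions[0] for s in mdp.states}        # arbitrary initial policy
    while True:
        V = evaluate_policy(mdp, policy)                    # policy evaluation
        stable = True
        for s in mdp.states:                                # policy improvement
            best = max(mdp.actions,
                       key=lambda a: sum(mdp.P.get((s, a, s2), 0.0) * V[s2]
                                         for s2 in mdp.states))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                          # finitely many policies => terminates
            return policy, V
```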

  25. Remember Value Iteration? Used to truncate policy iteration by combining one sweep of policy evaluation and one of policy improvement in each of its sweeps.

  26. Monte Carlo Methods • Requires only episodic experience – on-line or simulated • Based on averaging sample returns • Value estimates and policies only changed at the end of each episode, not on a step-by-step basis

  27. Policy Evaluation • Compute average returns as the episode runs • Two methods: first-visit and every-visit • First-visit is most widely studied First-visit MC method
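
A sketch of first-visit MC evaluation; the episode format, a list of (state, reward) pairs where the reward is the one received on leaving that state, is a simplifying assumption:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V(s) as the average return following the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:                      # episode = [(state, reward), ...]
        # Return following each time step, computed backward through the episode.
        G = [0.0] * (len(episode) + 1)
        for t in reversed(range(len(episode))):
            G[t] = episode[t][1] + gamma * G[t + 1]
        # Record the return only at each state's first visit.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(G[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```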

  28. Estimation of Action Values • State values are not enough without a model – we need action values as well • Qπ(s, a) = expected return when starting in state s, taking action a, and thereafter following policy π • Exploration vs. Exploitation • Exploring starts

  29. Example Monte Carlo Algorithm First-visit Monte Carlo assuming exploring starts

  30. Another MC Algorithm On-line, first-visit, ε-greedy MC without exploring starts
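
A small sketch of the ε-greedy action selection such an algorithm relies on (representing Q as a dict keyed by (state, action) pairs is an assumption):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon explore uniformly; otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(actions)                             # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit current estimates
```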

  31. Temporal-Difference Learning • Central and novel to reinforcement learning • Combines Monte Carlo and DP methods • Can learn from experience w/o a model – like MC • Updates estimates based on other learned estimates (bootstraps) – like DP

  32. TD(0) • Simplest TD method • Uses sample backup from single successor state or state-action pair instead of full backup of DP methods
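
A sketch of the TD(0) update: move V(s) a step toward the sampled one-step target r + γV(s') (the step size α is an illustrative choice):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V.setdefault(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)   # sample backup from a single successor
    V[s] += alpha * (target - V[s])           # nudge the estimate toward the target
    return V
```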

  33. SARSA – On-policy Control • Quintuple of events (st, at, rt+1, st+1, at+1) • Continually estimate Qπ while changing π
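
A sketch of the SARSA update over that quintuple; the target uses the action a' the policy actually takes next, which is what makes it on-policy:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    Q.setdefault((s, a), 0.0)
    target = r + gamma * Q.get((s_next, a_next), 0.0)   # next action chosen by the same policy
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```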

  34. Q-Learning – Off-policy Control • Learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of policy being followed
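
A sketch of the Q-learning update; unlike SARSA, the target backs up the greedy max over next actions, independent of the behavior policy:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-learning: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    Q.setdefault((s, a), 0.0)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # greedy backup
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```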

  35. Case Study • Job-shop Scheduling • Temporal and resource constraints • Find constraint-satisfying schedules of short duration • In its general form, the problem is NP-complete

  36. NASA Space Shuttle Payload Processing Problem (SSPPP) • Schedule tasks required for installation and testing of shuttle cargo bay payloads • Typical: 2-6 shuttle missions, each requiring 34-164 tasks • Zhang and Dietterich (1995, 1996; Zhang, 1996) • First successful instance of RL applied in plan-space • states = complete plans • actions = plan modifications

  37. SSPPP – continued… • States were an entire schedule • Two types of actions: • REASSIGN-POOL operators – reassigns a resource to a different pool • MOVE operators – moves task to first earlier or later time with satisfied resource constraints • Small negative reward for each step • Resource dilation factor (RDF) formula for rewarding final schedule’s duration

  38. Even More SSPPP… • Used TD(λ) to learn the value function • Actions selected by a decreasing ε-greedy policy with one-step lookahead • Function approximation used multilayer neural networks • Training generally took 10,000 episodes • Each resulting network represented a different scheduling algorithm – not a schedule for a specific instance!

  39. RL and CBR • Example: CBR used to store various policies and RL used to learn and modify those policies • Ashwin Ram and Juan Carlos Santamaría, 1993 • Autonomous Robotic Control • Job-shop scheduling: RL used to repair schedules, CBR used to determine which repair to make • Similar methods can be used for IDSS

  40. References • Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998 • Stochastic Processes, www.hanoivn.net • http://en.wikipedia.org/wiki/PageRank • http://en.wikipedia.org/wiki/Markov_decision_process • Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995
