
Reinforcement Learning

Presentation Transcript


  1. Reinforcement Learning Mitchell, Ch. 13 (see also Barto & Sutton book on-line)

  2. Rationale • Learning from experience • Adaptive control • Examples not explicitly labeled, delayed feedback • Problem of credit assignment – which action(s) led to payoff? • trade off short-term thinking (immediate reward) against long-term consequences

  3. Agent Model • Transition function T: S×A → S (environment) • Reward function R: S×A → ℝ (payoff) • Stochastic but Markov • Policy = decision function π: S → A • “rationality” – maximize long-term expected reward • Discounted long-term reward (convergent series): V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0}^{∞} γ^i r_{t+i}, with 0 ≤ γ < 1 • Alternatives: finite time horizon Σ_{i=0}^{h} r_{t+i}, uniform weights (average reward) lim_{h→∞} (1/h) Σ_{i=0}^{h} r_{t+i}
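To make the discounted-reward criterion concrete, here is a minimal sketch (not from the slides; the function name and example rewards are illustrative) that computes the discounted return of a finite reward sequence:

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over a reward sequence; converges for 0 <= gamma < 1 and bounded rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A constant reward of 1 per step approaches 1 / (1 - gamma) as the horizon grows:
print(discounted_return([1.0] * 1000, gamma=0.9))   # ~10.0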

  4. [Figure: agent interacting with the environment via reward function R and transition function T]

  5. Markov Decision Processes (MDPs) • if R and T (= P) are known, solve for the value function V^π(s) • policy evaluation • Bellman equations • dynamic programming (|S| linear equations in |S| unknowns)
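To illustrate policy evaluation by dynamic programming (a sketch, not from the slides; the array layout and names are assumptions), the Bellman equations V^π(s) = R(s, π(s)) + γ Σ_{s′} P(s′ | s, π(s)) V^π(s′) give |S| linear equations in the |S| unknowns V^π(s), which can be solved directly:

import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9):
    """Solve the Bellman equations for V^pi exactly.
    P[s, a, s'] = transition probability, R[s, a] = expected reward, policy[s] = chosen action index."""
    n_states = R.shape[0]
    P_pi = P[np.arange(n_states), policy]      # P_pi[s, s'] = P(s' | s, pi(s))
    R_pi = R[np.arange(n_states), policy]      # R_pi[s]     = R(s, pi(s))
    # (I - gamma * P_pi) V = R_pi  =>  V = (I - gamma * P_pi)^{-1} R_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)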

  6. MDPs • finding optimal policies • Value iteration – update V(s) iteratively until the greedy policy π(s) = argmax_a [R(s, a) + γ Σ_{s′} T(s, a, s′) V(s′)] stops changing • Policy iteration – alternate between choosing π greedily and re-evaluating V^π over all states • Monte Carlo sampling: run sampled episodes under π and average the observed returns to estimate V(s)
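A minimal value-iteration sketch under the same assumed array layout (illustrative only); it returns both the converged values and the greedy policy they induce:

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ] to convergence.
    P[s, a, s'] and R[s, a] as in the policy-evaluation sketch above."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * P @ V                  # Q[s, a] = R(s,a) + gamma * E[V(s')]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # optimal values and greedy policy
        V = V_new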

  7. Q-learning: model-free • Q-function: reformulate the value function in terms of both state S and action A, so the optimal policy can be computed without knowing R and T (= δ)
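In the deterministic setting of Mitchell Ch. 13, the Q-function and the policy it induces are commonly written as follows (using the slides' r, δ, and γ):

\[
Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a)), \qquad
\pi^*(s) = \arg\max_a Q(s, a), \qquad
V^*(s) = \max_a Q(s, a)
\]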

  8. Q-learning algorithm
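The slide body is not reproduced in the transcript; below is a minimal tabular sketch of deterministic Q-learning in the spirit of Mitchell Ch. 13. The env interface (reset, actions, step) is an assumed placeholder, and action selection is left uniformly random here (selection strategies are the subject of slide 10).

import random
from collections import defaultdict

def q_learning(env, n_episodes=1000, gamma=0.9):
    """Tabular Q-learning with the deterministic-case update Q(s,a) <- r + gamma * max_a' Q(s',a').
    `env` is an assumed interface: reset() -> s, actions(s) -> list of actions, step(a) -> (s', r, done)."""
    Q = defaultdict(float)                           # Q[(s, a)] table, initialized to 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(env.actions(s))        # exploration policy; see slide 10 for alternatives
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in env.actions(s_next))
            Q[(s, a)] = r + gamma * best_next        # training rule for the deterministic case
            s = s_next
    return Q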

  9. Convergence • Theorem: Q̂ converges to Q* if each state-action pair is visited infinitely often (assuming bounded rewards, |r| < ∞, and discount 0 ≤ γ < 1) • Proof idea: over each interval in which every (s, a) pair is visited, the magnitude of the largest error in the Q̂ table decreases by at least a factor of γ
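The key step of that argument, for the deterministic case, can be written out. Let Δ_n = max_{s,a} |Q̂_n(s,a) − Q*(s,a)| be the largest table error after n such intervals; then for any entry updated from successor state s′,

\[
\bigl|\hat{Q}_{n+1}(s,a) - Q^*(s,a)\bigr|
  = \gamma \,\Bigl|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q^*(s',a')\Bigr|
  \le \gamma \max_{a'} \bigl|\hat{Q}_n(s',a') - Q^*(s',a')\bigr|
  \le \gamma\, \Delta_n ,
\]

so Δ_{n+1} ≤ γ Δ_n, which goes to 0 as n → ∞.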

  10. Training • “on-policy” • exploitation vs. exploration • will the relevant parts of the state space be explored if the agent sticks to its current (sub-optimal) policy? • ε-greedy policies: choose the action with the max Q value most of the time, and a random action a fraction ε of the time • “off-policy” • learn from simulations or stored traces • SARSA: database of training examples ⟨s, a, r, s′, a′⟩ (itself an on-policy update rule) • Actor-critic
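A minimal sketch of ε-greedy action selection (illustrative names; it plugs into the Q table from the Q-learning sketch above):

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore); otherwise pick argmax_a Q[(s, a)] (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Annealing ε toward zero over training shifts the balance from exploration early on toward exploitation later.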

  11. Non-deterministic case
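The slide body is not included in the transcript. For the non-deterministic (stochastic) case, Mitchell Ch. 13 replaces the deterministic training rule with a decaying weighted average; a standard form, with a learning rate that decays with the visit count, is:

\[
\hat{Q}_n(s,a) \leftarrow (1-\alpha_n)\,\hat{Q}_{n-1}(s,a)
  + \alpha_n \Bigl[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \Bigr],
\qquad \alpha_n = \frac{1}{1 + \mathit{visits}_n(s,a)} .
\]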

  12. Temporal Difference Learning
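The slide body is not reproduced here. The basic one-step temporal-difference update, TD(0) from Sutton & Barto, adjusts the value estimate toward a bootstrapped one-step target; TD(λ) blends lookaheads of different depths in the same spirit:

\[
V(s_t) \leftarrow V(s_t) + \alpha \bigl[\, r_t + \gamma V(s_{t+1}) - V(s_t) \,\bigr]
\]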

  13. Convergence is not the problem • representation of a large Q table is the problem (domains with many states or continuous actions) • how to represent large Q tables? • neural network • function approximation • basis functions • hierarchical decomposition of the state space
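As one illustration of function approximation with basis functions (a sketch under assumed names, not the slides' implementation), Q(s, a) can be represented as a linear combination w · φ(s, a) of features, with the weights adjusted by a semi-gradient Q-learning step; a neural network plays the same role by replacing the linear form with a learned nonlinear function:

import numpy as np

def semi_gradient_q_update(w, phi, s, a, r, s_next, actions, gamma=0.9, alpha=0.01, done=False):
    """Linear approximation Q(s, a) ~= w . phi(s, a); phi is an assumed feature (basis-function) map.
    Performs one semi-gradient Q-learning step and returns the updated weight vector."""
    q_sa = w @ phi(s, a)
    target = r if done else r + gamma * max(w @ phi(s_next, b) for b in actions)
    td_error = target - q_sa
    return w + alpha * td_error * phi(s, a)    # gradient of w . phi(s, a) w.r.t. w is phi(s, a)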
