
Decision Making Under Uncertainty Lec #9: Approximate Value Function


Presentation Transcript


  1. Decision Making Under Uncertainty Lec #9: Approximate Value Function UIUC CS 598: Section EA Professor: Eyal Amir Spring Semester 2006 Some slides by Jeremy Wyatt (U Birmingham), Ron Parr (Duke), Eduardo Alonso, and Craig Boutilier (Toronto)

  2. So Far • MDPs, RL • Transition matrix (n × n) • Reward vector (n entries) • Value function – table (n entries) • Policy – table (n entries) • Time to solve: linear programming with O(n) variables • Problem: the number of states (n) is too large in many systems

  3. Today: Approximate Values • Motivation: • Representation of the value function is smaller • Generalization: fewer state-action pairs must be observed • Some spaces (e.g., continuous) must be approximated • Function approximation • Use a parametrized representation of the value function

  4. Overview: Approximate Reinforcement Learning • Consider V^(·; θ): X → ℝ, where θ denotes the vector of parameters of the approximator • Want to learn θ so that V^ approximates V* well • Gradient descent: find the direction Δθ in which V^(·; θ) can change to improve performance the most • Use a step size α: V^(s; θnew) = V^(s; θ) + αΔ

  5. Overview: Approximate Reinforcement Learning • In RL, at each time step, the learning system uses a function approximator to select an action a = π^(s; θ) • Should adjust θ so as to produce the maximum possible value for each s • Would like π^ to solve the parametric global optimization problem π^(s; θ) = argmaxa∈A Q(s,a), ∀s∈S

  6. Supervised Learning and Reinforcement Learning • SL: s and a = π(s) are given • Regression uses the direction Δ = a – π^(s; θ) • SL: we can check whether Δ = 0 and decide if the correct mapping is given by π^ at s • RL: this direction is not available • RL: cannot draw conclusions about correctness without exploring the values Q(s,a) for all a

  7. How to measure the performance of FA methods? • Regression SL minimizes the mean-squared error (MSE) over some distribution P of the inputs • Our value-prediction problem: inputs = states, target function = V, so MSE(θt) = Σs∈S P(s) [V(s) − Vt(s)]² • P is important because it is usually not possible to reduce the error to zero at all states.
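
As a concrete illustration of this objective, here is a minimal NumPy sketch of the P-weighted squared error; the three-state distribution and the value arrays are made-up numbers for illustration, not anything from the lecture.

```python
import numpy as np

def weighted_mse(P, V_true, V_approx):
    """P-weighted mean-squared error over states.

    P, V_true and V_approx are length-n arrays indexed by state, mirroring
    MSE(θt) = Σs P(s) [V(s) − Vt(s)]² from the slide.
    """
    return float(np.sum(P * (V_true - V_approx) ** 2))

# Hypothetical 3-state example: the large error at the rarely visited
# state 2 contributes little to the weighted objective.
P = np.array([0.5, 0.45, 0.05])
print(weighted_mse(P, np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 0.0])))
```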

  8. More on MSE • Convergence results are available for the on-policy distribution, i.e., the frequency with which states are encountered while interacting with the environment • Question: why minimize MSE? • Our goal is to make predictions that aid in finding a better policy; the best predictions for that purpose are not necessarily the best for minimizing MSE • There is no better alternative at the moment

  9. Gradient-Descent for Reinforcement Learning • θt is a parameter vector of real-valued components • Vt(s) is a smooth, differentiable function of θt for all s∈S • Minimize the error on the observed example <st, V(st)>: adjust θt by a small amount in the direction that reduces the error the most on that example: θt+1 = θt + α [V(st) − Vt(st)] ∇θt Vt(st) • where ∇θt Vt(st) denotes the gradient (vector of partial derivatives) of Vt(st) with respect to θt
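
A minimal sketch of this update for a generic differentiable approximator; the function names v_fn and grad_fn and the quadratic feature map in the example are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def gd_value_update(theta, s, v_target, v_fn, grad_fn, alpha=0.01):
    """One gradient-descent step: θ ← θ + α [target − Vt(s)] ∇θ Vt(s).

    v_fn(s, theta) returns the current estimate Vt(s);
    grad_fn(s, theta) returns the gradient of Vt(s) with respect to theta.
    """
    error = v_target - v_fn(s, theta)
    return theta + alpha * error * grad_fn(s, theta)

# Example with a linear approximator V(s) = theta · phi(s),
# where phi is a hypothetical feature map.
phi = lambda s: np.array([1.0, s, s ** 2])
v_fn = lambda s, th: th @ phi(s)
grad_fn = lambda s, th: phi(s)      # the gradient of a linear form is just phi(s)
theta = np.zeros(3)
theta = gd_value_update(theta, s=2.0, v_target=5.0, v_fn=v_fn, grad_fn=grad_fn)
```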

  10. Value Prediction • If vt, the t-th training example, is not the true value V(st) but some approximation of it: θt+1 = θt + α [vt − Vt(st)] ∇θt Vt(st) • Similarly, for λ-returns: θt+1 = θt + α [Rλt − Vt(st)] ∇θt Vt(st), or θt+1 = θt + α δt et, where δt is the TD (Bellman) error and et = γλ et−1 + ∇θt Vt(st) is the eligibility trace

  11. Gradient-Descent TD(λ) for V
Initialize θ arbitrarily and e = 0
Repeat (for each episode):
  s ← initial state of episode
  Repeat (for each step of the episode):
    a ← action given by π for s
    Take action a, observe reward r and next state s'
    δ ← r + γ V(s') − V(s)
    e ← γλ e + ∇θ V(s)
    θ ← θ + α δ e
    s ← s'
  Until s is terminal
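
Below is a runnable sketch of this loop for a linear approximator V(s) = θ·φ(s), so that ∇θ V(s) = φ(s). The env interface (reset()/step()), the policy function, and the feature map phi are assumed placeholders rather than anything specified in the slides.

```python
import numpy as np

def td_lambda(env, policy, phi, num_features, episodes=100,
              alpha=0.05, gamma=0.99, lam=0.9):
    """Gradient-descent TD(λ) policy evaluation with a linear value function.

    Assumes a hypothetical env with reset() -> s and step(a) -> (s', r, done),
    a policy(s) -> a, and a feature map phi(s) -> array of length num_features.
    """
    theta = np.zeros(num_features)
    for _ in range(episodes):
        s = env.reset()
        e = np.zeros(num_features)              # eligibility trace
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v       # TD error δ
            e = gamma * lam * e + phi(s)         # trace update (∇θ V(s) = φ(s))
            theta += alpha * delta * e
            s = s_next
    return theta
```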

  12. Approximation Models • Most widely used (and simple) gradient-descent approximators: • NN trained with the error-backpropagation algorithm • Linear form • Factored value functions (your presentation) [GPK] • First-order value functions (your presentation) [BRP] • Restricted policies (your presentation) [KMN]

  14. Function Approximation • Common approach to solving MDPs • find a functional form f(θ) for the VF that is tractable • e.g., not exponential in the number of variables • attempt to find parameters θ s.t. f(θ) offers the "best fit" to the "true" VF • Example: • use a neural net to approximate the VF • inputs: state features; output: value or Q-value • generate samples of the "true" VF to train the NN • e.g., use the dynamics to sample transitions and train on Bellman backups (bootstrapping on the current approximation given by the NN)
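
A hedged sketch of the neural-net example: fitted value iteration that repeatedly regresses a small network onto Bellman-backup targets, bootstrapping on the current approximation. The model-access functions (transitions, reward, featurize) and the use of scikit-learn's MLPRegressor are my assumptions for illustration; the lecture does not prescribe a particular library or interface.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fitted_value_iteration(sample_states, transitions, reward, featurize,
                           actions, gamma=0.95, iterations=20):
    """Approximate VI: repeatedly fit a neural net to Bellman-backup targets.

    transitions(s, a) -> list of (prob, s') pairs and reward(s, a) are
    hypothetical model-access functions; featurize(s) -> feature vector.
    """
    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)
    X = np.array([featurize(s) for s in sample_states])
    net.fit(X, np.zeros(len(sample_states)))        # start from V0 = 0
    for _ in range(iterations):
        # Bellman backup, bootstrapping on the current network approximation.
        targets = np.array([
            max(reward(s, a) + gamma * sum(p * net.predict([featurize(s2)])[0]
                                           for p, s2 in transitions(s, a))
                for a in actions)
            for s in sample_states
        ])
        net.fit(X, targets)                          # re-fit to the new targets
    return net
```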

  15. Linear Function Approximation • Assume a set of basis functions B = { b1 ... bk } • each bi : S → ℝ, generally compactly representable • A linear approximator is a linear combination of these basis functions; for some weight vector w: V(s) = Σi wi bi(s) • Several questions: • what is the best weight vector w? • what is a "good" basis set B? • what does this buy us computationally?

  16. Flexibility of Linear Decomposition • Assume each basis function is compact • e.g., refers only to a few variables: b1(X,Y), b2(W,Z), b3(A) • Then the VF is compact: • V(X,Y,W,Z,A) = w1 b1(X,Y) + w2 b2(W,Z) + w3 b3(A) • For a given representation size (10 parameters), we get more value flexibility (32 distinct values) compared to a piecewise-constant representation
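
A small sketch of the parameter-count vs. distinct-value comparison above, assuming (my assumption) that X, Y, W, Z and A are binary variables; the basis tables are chosen arbitrarily so that no sums collide.

```python
import itertools

# Factored VF over binary variables: V(X,Y,W,Z,A) = b1(X,Y) + b2(W,Z) + b3(A)
# (weights folded into the tables). Each basis table is tiny:
# 4 + 4 + 2 = 10 parameters in total.
b1 = {(x, y): 1.0 + 2 * x + y for x in (0, 1) for y in (0, 1)}    # values 1..4
b2 = {(w, z): 4.0 * (2 * w + z) for w in (0, 1) for z in (0, 1)}  # values 0,4,8,12
b3 = {a: 16.0 * a for a in (0, 1)}                                # values 0,16
print(len(b1) + len(b2) + len(b3))   # 10 parameters

# Yet the sum takes 4 * 4 * 2 = 32 distinct values over the 2^5 states,
# whereas a piecewise-constant representation with 10 parameters could
# express at most 10 distinct values.
values = {b1[x, y] + b2[w, z] + b3[a]
          for x, y, w, z, a in itertools.product((0, 1), repeat=5)}
print(len(values))                   # 32
```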

  17. Interpretation of Averagers • Averagers interpolate: the value at a query point x is a weighted average of the values y1, y2, y3, y4 at the surrounding grid vertices [figure: grid-interpolation diagram]

  18. Linear methods? • So, if we can find good basis sets (ones that allow a good fit), the VF can be more compact • Several have been proposed: coarse coding, tile coding, radial basis functions, factored bases (project onto a few variables), feature selection, first-order

  19. Linear Approx: Components • Assume basis set B = { b1 ... bk } • each bi : S → ℝ • we view each bi as an n-vector • let A be the n × k matrix [ b1 ... bk ] • Linear VF: V(s) = Σi wi bi(s) • Equivalently: V = Aw • so our approximation of V must lie in the subspace spanned by B • let B be that subspace
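
A short sketch of this matrix view: stack the basis functions as columns of the n × k matrix A and obtain the linear VF as V = Aw. The one-dimensional state list and the polynomial basis are illustrative choices of mine.

```python
import numpy as np

def basis_matrix(states, basis):
    """Build the n × k matrix A with A[j, i] = b_i(states[j])."""
    return np.array([[b(s) for b in basis] for s in states])

# Hypothetical 1-D state space with a tiny basis set B = {1, s, s²}.
states = [0.0, 0.5, 1.0, 1.5, 2.0]
basis = [lambda s: 1.0, lambda s: s, lambda s: s ** 2]
A = basis_matrix(states, basis)        # n × k
w = np.array([0.3, -1.0, 0.25])
V = A @ w                              # linear VF: V = Aw, lies in span(B)
print(V)
```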

  20. Policy Evaluation • The approximate state-value function is given by Vt(s) = Σi wi bi(s) • The gradient of the approximate value function with respect to the weights is, in this case, simply ∂Vt(s)/∂wi = bi(s)

  21. Approximate Value Iteration • We might compute an approximate V by: • Let V0 = Aw0 for some weight vector w0 • Perform Bellman backups to produce V1 = Aw1; V2 = Aw2; V3 = Aw3; etc. • Unfortunately, even if V0 lies in the subspace spanned by B, L*(V0) = L*(Aw0) will generally not • So we need to find the best approximation to L*(Aw0) in B before we can proceed

  22. Projection • We wish to find a projection of our VF estimates into B minimizing some error criterion • We'll use the max norm • Given V lying outside B, we want a w s.t. ||Aw – V||∞ is minimal

  23. Projection as Linear Program • Finding a w that minimizes ||Aw – V||∞ can be accomplished with a simple LP • The number of variables is small (k+1), but the number of constraints is large (2 per state) • this defeats the purpose of function approximation • but let's ignore that for the moment
Vars: w1, ..., wk, φ
Minimize: φ
S.T.  φ ≥ V(s) – Aw(s), ∀s
      φ ≥ Aw(s) – V(s), ∀s
φ measures the max-norm difference between V and the "best fit"
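
This LP drops directly into a generic solver. Below is a minimal sketch using scipy.optimize.linprog, with the decision vector laid out as [w1, ..., wk, φ]; the 4-state example at the end is made up.

```python
import numpy as np
from scipy.optimize import linprog

def max_norm_projection(A, V):
    """Find w minimizing ||Aw − V||∞ via the LP on the slide.

    Decision variables are [w1..wk, φ]; minimize φ subject to
    φ ≥ V(s) − (Aw)(s) and φ ≥ (Aw)(s) − V(s) for every state s.
    """
    n, k = A.shape
    c = np.zeros(k + 1)
    c[-1] = 1.0                                    # objective: minimize φ
    ones = np.ones((n, 1))
    A_ub = np.vstack([np.hstack([-A, -ones]),      # V − Aw ≤ φ
                      np.hstack([A, -ones])])      # Aw − V ≤ φ
    b_ub = np.concatenate([-V, V])
    bounds = [(None, None)] * k + [(0, None)]      # weights free, φ ≥ 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:k], res.x[-1]                    # best-fit w and max-norm error φ

# Hypothetical example: project a 4-state VF onto a 2-basis-function subspace.
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
V = np.array([0.0, 1.0, 4.0, 9.0])
w, phi = max_norm_projection(A, V)
```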

  24. Approximate Value Iteration • Run value iteration, but after each Bellman backup, project the result back into the subspace B • Choose an arbitrary w0 and let V0 = Aw0 • Then iterate: • Compute Ṽt = L*(Awt−1) • Let Vt = Awt be the projection of Ṽt into B • The error at each step is given by φ • the final error and convergence are not assured • There is an analog for policy iteration as well
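
Putting the pieces together, here is a sketch of that loop: Bellman backup, then max-norm projection back into the subspace. It reuses max_norm_projection from the previous sketch, and the model inputs (a reward vector R[a] and transition matrix P[a] per action) are my assumed representation.

```python
import numpy as np

def approx_value_iteration(A, R, P, gamma=0.95, iters=50):
    """Approximate VI: Bellman backup, then project back into span(B) = col(A).

    R[a] is an n-vector of rewards and P[a] an n × n transition matrix for
    action a; uses max_norm_projection (defined in the previous sketch)
    as the projection step.
    """
    n, k = A.shape
    w = np.zeros(k)
    phi = 0.0
    for _ in range(iters):
        V = A @ w                                        # current approximation in B
        backup = np.max([R[a] + gamma * P[a] @ V         # Bellman backup L*(Aw)
                         for a in range(len(P))], axis=0)
        w, phi = max_norm_projection(A, backup)          # project back into B
    return w, phi                                        # weights and last projection error
```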

  25. Convergence Guarantees • Monte Carlo: converges to the minimal MSE: ||Q^MC – Qπ||π = minw ||Q^w – Qπ||π ≡ ε • TD(0): converges close to the minimal MSE: ||Q^TD – Qπ||π ≤ ε/(1–γ) [TV] • DP may diverge: there exist counterexamples [B, TV]

  26. VFA Convergence results • Linear TD(λ) converges if we visit states using the on-policy distribution • Off-policy linear TD(λ) and linear Q-learning are known to diverge in some cases • Q-learning and value iteration used with some averagers (including k-nearest-neighbour and decision trees) have almost-sure convergence if particular exploration policies are used • A special case of policy iteration with Sarsa-style updates and linear function approximation converges • Residual algorithms are guaranteed to converge, but only very slowly

  27. VFA TD-Gammon • TD(λ) learning and a backprop net with one hidden layer • 1,500,000 training games (self-play) • Equivalent in skill to the top dozen human players • Backgammon has ~10^20 states, so it can't be solved using DP

  28. Homework • Read about POMDPs: [Littman ’96] Ch.6-7

  29. THE END

  30. FA, is it the answer? • Not all FA methods are suited for use in RL. For example, NNs and statistical methods assume a static training set over which multiple passes are made. In RL, however, it is important that learning be able to occur on-line, while interacting with the environment (or a model of the environment) • In particular, Q-learning seems to lose its convergence properties when integrated with FA

  31. A linear, gradient-descent version of Watkins's Q(λ)
Initialize θ arbitrarily and e = 0
Repeat (for each episode):
  s ← initial state of episode
  For all a ∈ A(s):
    F(a) ← set of features present in s, a
    Qa ← Σi∈F(a) θ(i)
  Repeat (for each step of episode):
    With probability 1 – ε:
      a ← argmaxa Qa
      e ← γλ e
    else:
      a ← a random action ∈ A(s)
      e ← 0
    For all i ∈ F(a): e(i) ← e(i) + 1
    Take action a, observe reward r and next state s'
    δ ← r – Qa
    For all a ∈ A(s'):
      F(a) ← set of features present in s', a
      Qa ← Σi∈F(a) θ(i)
    a' ← argmaxa Qa
    δ ← δ + γ Qa'
    θ ← θ + α δ e
  until s' is terminal
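
For reference, a Python sketch mirroring this pseudocode with binary features and on-demand Q computation; the env interface, the features(s, a) index-set function, and the action list are assumed placeholders, and the traces are accumulating (e(i) ← e(i) + 1) as on the slide.

```python
import random
import numpy as np

def watkins_q_lambda(env, features, num_features, actions, episodes=200,
                     alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    """Linear, gradient-descent Watkins's Q(λ) with binary features.

    features(s, a) returns the indices of the active binary features for
    the pair (s, a); env supplies reset() -> s and step(a) -> (s', r, done).
    """
    theta = np.zeros(num_features)

    def q(s, a):                                     # Qa = Σ_{i∈F(a)} θ(i)
        return sum(theta[i] for i in features(s, a))

    for _ in range(episodes):
        e = np.zeros(num_features)                   # eligibility traces
        s = env.reset()
        done = False
        while not done:
            if random.random() < 1 - epsilon:        # greedy action: decay traces
                a = max(actions, key=lambda act: q(s, act))
                e *= gamma * lam
            else:                                    # exploratory action: cut traces
                a = random.choice(actions)
                e[:] = 0.0
            for i in features(s, a):                 # accumulate trace on active features
                e[i] += 1.0
            s_next, r, done = env.step(a)
            delta = r - q(s, a)                      # δ ← r − Qa
            if not done:
                delta += gamma * max(q(s_next, act) for act in actions)
            theta += alpha * delta * e               # θ ← θ + α δ e
            s = s_next
    return theta
```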
