
Reinforcement Learning Part 2



Presentation Transcript


  1. Reinforcement Learning Part 2 Temporal Difference Learning, Actor-Critics, and the brain

  2. Outline of Lecture • Review • Temporal Difference learning • An efficient way to estimate the value function • Actor-Critic Methods • A way to learn if you can estimate the value function • The relationship between TD(λ) and the brain.

  3. (REVIEW) Example: GridWorld [Figure: a grid of states showing the current state, the available actions, a terminal state, and an (optional) initial or start state.]

  4. (REVIEW) Markov Decision Process (MDP) • M = (S, A, P, R, d0, γ) • S is the set of possible states. • A is the set of possible actions. • P describes the transition dynamics. • We use t to denote the (integer) time step. • R describes the rewards. • d0 is the distribution over the states at time t = 0. • γ is a real-valued discount parameter in the interval [0,1].
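As a concrete point of reference (an illustrative sketch, not from the slides), the tuple M = (S, A, P, R, d0, γ) could be held in a small Python container; the field names below are assumed, chosen to mirror the components listed above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    """Illustrative container mirroring M = (S, A, P, R, d0, gamma)."""
    states: List[int]                            # S: the set of possible states
    actions: List[int]                           # A: the set of possible actions
    P: Dict[Tuple[int, int], Dict[int, float]]   # P[(s, a)][s'] = transition probability
    R: Callable[[int, int, int], float]          # R(s, a, s') = reward
    d0: Dict[int, float]                         # distribution over states at t = 0
    gamma: float                                 # discount parameter in [0, 1]
```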

  5. (REVIEW) Episodes • An episode is one run of an MDP, starting at t=0, and running until a terminal state is reached.

  6. (REVIEW) Trajectory • If you use a policy, π, on an MDP, M, for one episode, you get a trajectory.

  7. (REVIEW) Discounted Return
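For reference, the discounted return from time t is conventionally defined as

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},

using the common convention that r_{t+1} is the reward received after acting at time t (the slide's own indexing may differ slightly).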

  8. (REVIEW) Value Function
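For reference, the value of a state under a policy π is the expected discounted return obtained by starting in that state and following π thereafter:

V^\pi(s) = \mathbb{E}\left[ G_t \mid s_t = s, \pi \right].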

  9. (REVIEW) Objective
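For reference, the objective is the expected discounted return of an episode, which can be written in terms of the start-state distribution and the value function:

J(\pi) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid \pi \right] = \mathbb{E}_{s_0 \sim d_0}\left[ V^\pi(s_0) \right],

and the goal is to find a policy (or policy parameters θ) that maximizes J.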

  10. (REVIEW) Softmax Policy • One policy parameter per state-action pair.
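With one preference parameter θ_{s,a} per state-action pair, the softmax policy is usually written as

\pi(a \mid s; \theta) = \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})},

so increasing θ_{s,a} increases the probability of choosing action a in state s.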

  11. (REVIEW) Parameterized Gaussian Policy • Let φ(s) be a vector of features associated with the state s.
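A common form of this parameterization (assumed here; the slide's exact equation is not shown) makes the mean of the action distribution linear in the features:

\pi(a \mid s; \theta) = \mathcal{N}\left(a;\; \theta^\top \phi(s),\; \sigma^2\right),

i.e., the action is drawn from a Gaussian whose mean is θᵀφ(s) and whose standard deviation σ is fixed or separately parameterized.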

  12. (REVIEW) Solving an MDP using Local Search • Given policy parameters, we can estimate how good they are by generating many trajectories using them and then averaging the returns. • Use hill-climbing, simulated annealing, a genetic algorithm, or any other local search method to find a θ that maximizes J.
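A minimal Python sketch of this idea, assuming a hypothetical helper run_episode(theta) that simulates one episode with the parameterized policy and returns its discounted return (that helper is not defined on the slides):

```python
import numpy as np

def estimate_J(theta, run_episode, n_episodes=100):
    """Estimate J(theta) by averaging the discounted returns of many episodes.
    run_episode(theta) is an assumed helper: run one episode, return its return."""
    return np.mean([run_episode(theta) for _ in range(n_episodes)])

def hill_climb(theta, run_episode, iterations=200, step=0.1):
    """Simple stochastic hill climbing over the policy parameters theta."""
    best_J = estimate_J(theta, run_episode)
    for _ in range(iterations):
        candidate = theta + step * np.random.randn(*theta.shape)  # random perturbation
        candidate_J = estimate_J(candidate, run_episode)
        if candidate_J > best_J:                                  # keep only improvements
            theta, best_J = candidate, candidate_J
    return theta
```

Simulated annealing or a genetic algorithm would replace the accept-only-improvements rule, but the evaluation step (average the returns over many trajectories) stays the same.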

  13. Temporal Difference Learning • Temporal difference learning (TD) is an algorithm for estimating the value function. • Let V(s) be our estimate of the true value of state s. • We can initialize it randomly or to zero. • If we take action a in state s, go to state s', and receive a reward of r, how can we update V(s)?

  14. TD-Error • The temporal difference error (TD error) is also known as the Bellman error or the reward prediction error.
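For reference, after taking an action in state s, receiving reward r, and arriving in state s', the TD error compares "reality" (the observed reward plus the discounted value estimate of the next state) with the "expectation" (the current value estimate of s):

\delta = r + \gamma V(s') - V(s).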

  15. TD-Error • A positive TD error means that reality (the reward we received plus the discounted value estimate of the next state) was better than our expectation (the current value estimate of the state we were in), so we should increase V(s). • A negative TD error means that reality was worse than our expectation, so we should decrease V(s).

  16. TD Algorithm
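A minimal tabular TD(0) sketch in Python. The environment interface (env.reset() returning a start state, env.step(a) returning (next_state, reward, done)) and the policy function are assumptions for illustration, not from the slides; the update moves V(s) toward r + γV(s') by a step size α times the TD error.

```python
from collections import defaultdict

def td0(env, policy, alpha=0.1, gamma=0.99, n_episodes=1000):
    """Tabular TD(0): estimate the value function of a fixed policy.
    Assumes a hypothetical episodic interface: env.reset() -> s0,
    env.step(a) -> (s_next, r, done), policy(s) -> a."""
    V = defaultdict(float)                       # value estimates, initialized to zero
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            delta = target - V[s]                # TD error
            V[s] += alpha * delta                # move V(s) toward the target
            s = s_next
    return V
```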

  17. TD(λ) Say that the TD error at time t+4 is positive: we would increase our estimate of the value of s_{t+4}. But if we had already updated the value of s_{t+3} before then, that update used the old, too-low estimate of s_{t+4}, so we underestimated the value of s_{t+3} as well!

  18. TD(λ) • Idea: If we observe a positive TD-error (things worked out better than expected), then we could increase the value of many of the recent states. • Note: There are many ways of viewing TD(λ).

  19. TD(λ) • Allows observed rewards to update value estimates for many states immediately.

  20. TD(λ) • Each state has an eligibility trace, which tracks how much a positive TD error should increase its value and a negative TD error should decrease it. • As time passes, the eligibility of a state decays. • When a state occurs, its eligibility is set to 1 (it is very responsible for the TD error).

  21. TD(λ) • Let e_t(s) be the eligibility of state s at time t. • When state s occurs, set e_t(s) = 1. • Otherwise, decay it: e_t(s) = γλ e_{t-1}(s), where λ is a parameter between 0 and 1. [Figure: eligibility of a state decaying over the time since it occurred, shown with γλ = 0.8.]

  22. TD(λ) Notice that λ = 0 results in TD.
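A sketch of tabular TD(λ) with (replacing) eligibility traces, using the same assumed environment interface as the TD(0) sketch above. A state's trace is set to 1 when it occurs and decays by γλ each step, so each TD error updates all recently visited states in proportion to their traces; with λ = 0 this reduces to the TD(0) sketch.

```python
from collections import defaultdict

def td_lambda(env, policy, alpha=0.1, gamma=0.99, lam=0.9, n_episodes=1000):
    """Tabular TD(lambda) with replacing eligibility traces.
    Same assumed env/policy interface as the TD(0) sketch."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)                   # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            delta = target - V[s]                # TD error at this step
            e[s] = 1.0                           # the current state is fully eligible
            for state in list(e.keys()):
                V[state] += alpha * delta * e[state]   # credit recent states
                e[state] *= gamma * lam                # decay their eligibility
            s = s_next
    return V
```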

  23. Actor-Critic

  24. Actor-Critic

  25. Actor-Critic • Use TD(λ) to estimate the value function. • Use a parameterized policy to select actions. • When the TD error is positive, increase the probability of the action that was chosen. • When the TD error is negative, decrease the probability of the action that was chosen.

  26. Actor-Critic with Softmax Action Selection • Remember the softmax policy (slide 10): one parameter θ_{s,a} per state-action pair, with π(a|s) proportional to exp(θ_{s,a}).

  27. Actor-Critic with Softmax Action Selection
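A sketch of a tabular actor-critic with softmax action selection, again under the assumed environment interface from the earlier sketches. The critic is a TD(0) value estimate; the actor keeps one preference per state-action pair and uses the softmax log-probability gradient, so a positive TD error raises the probability of the chosen action and a negative TD error lowers it. The step sizes alpha (critic) and beta (actor) are illustrative.

```python
import numpy as np
from collections import defaultdict

def softmax(prefs):
    z = np.exp(prefs - prefs.max())              # subtract the max for numerical stability
    return z / z.sum()

def actor_critic(env, n_actions, alpha=0.1, beta=0.01, gamma=0.99, n_episodes=1000):
    """Tabular actor-critic: TD(0) critic plus a softmax actor.
    Assumes the same hypothetical env interface as the earlier sketches."""
    V = defaultdict(float)                               # critic: state-value estimates
    theta = defaultdict(lambda: np.zeros(n_actions))     # actor: preferences theta[s][a]
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            probs = softmax(theta[s])
            a = np.random.choice(n_actions, p=probs)     # sample an action from the policy
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            delta = target - V[s]                        # TD error
            V[s] += alpha * delta                        # critic update
            grad_log_pi = -probs                         # gradient of log pi(a|s) w.r.t. theta[s]
            grad_log_pi[a] += 1.0
            theta[s] += beta * delta * grad_log_pi       # actor update: raises P(a|s) when delta > 0
            s = s_next
    return theta, V
```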

  28. Other Actor-Critics • We could use something like an artificial neural network to represent the policy. • We could add eligibility traces to the policy update. • We could make sure that the policy updates result in the policy following the gradient of J(θ) • Called “Policy Gradient” • We could make sure that the policy updates result in the policy following the natural gradient of J(θ). • Called “Natural Policy Gradient” • My research involves adding safety constraints, safety guarantees, and convergence guarantees to natural policy gradient methods.

  29. TD and the Brain • Preliminaries • RL and TD are not meant to model the brain. • There are many details that we will not discuss. • We will ignore controversies and discuss only the leading hypotheses.

  30. TD and the Brain • “If the problems that animals face are well modeled as [MDPs/POMDPs]—as we think they are—it would be surprising if effective algorithms bore no relationship to the methods that have evolved enabling animals to deal with the problems they face over their lifetimes.” -Andy Barto

  31. Dopamine • Dopamine is a neurotransmitter.

  32. Dopamine • Dopamine is manufactured in the ventral tegmental area (VTA) and the substantia nigra and broadcast to several parts of the brain. • Evidence suggests that dopamine might encode the TD error.

  33. Dopamine
  • Prefrontal cortex: planning complex cognitive behavior, decision making
  • Striatum: motor control (Parkinson's)
  • Nucleus accumbens: pleasure, laughter, reward, reinforcement learning, fear, aggression, addiction
  • Hippocampus: long-term memory, short-term memory, spatial memory and navigation

  34. Gero Miesenboeck's TED Talk

  35. Return Midterms • Next time: • TD(λ) with function approximation • More!
