
An Overview of MAXQ Hierarchical Reinforcement Learning


Presentation Transcript


  1. An Overview of MAXQ Hierarchical Reinforcement Learning Thomas G. Dietterich from Oregon State Univ. Presenter: ZhiWei

  2. Motivation • Traditional reinforcement learning algorithms treat the state space of the Markov Decision Process as a single “flat” search space. • Drawback of this approach: it does not scale to tasks that have a complex, hierarchical structure, e.g., robot soccer or air traffic control. • To overcome this problem, i.e., to make reinforcement learning hierarchical, we need to introduce mechanisms for abstraction and sharing. This paper describes an initial effort in this direction.

  3. A learning example

  4. A learning example (cont’d) • Task: the taxi starts in a randomly-chosen cell and the passenger is at one of the four special locations (R, G, B, Y). The passenger has a desired destination, and the job of the taxi is to go to the passenger, pick him/her up, go to the passenger’s destination, and drop him/her off. • Six available primitive actions: North, South, East, West, Pickup, and Putdown. • Reward: each action receives -1; when the passenger is put down at the destination, the agent receives +20; when the taxi attempts to pick up a non-existent passenger or put down the passenger at a wrong place, it receives -10; running into a wall has no effect but still incurs the usual reward of -1.
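A minimal Python sketch of this reward structure (the state fields and function name below are illustrative assumptions, not code from the paper):

```python
from dataclasses import dataclass

@dataclass
class TaxiState:
    taxi_loc: tuple          # (row, col) of the taxi
    passenger_loc: tuple     # one of the special locations R, G, B, Y
    passenger_in_taxi: bool
    destination: tuple       # desired destination of the passenger

def reward(state: TaxiState, action: str) -> int:
    """Immediate reward for taking `action` in `state`, as described above."""
    if action == "Putdown":
        if state.passenger_in_taxi and state.taxi_loc == state.destination:
            return +20       # successful delivery
        return -10           # putting the passenger down at a wrong place
    if action == "Pickup":
        if not state.passenger_in_taxi and state.taxi_loc == state.passenger_loc:
            return -1        # a legal pickup still costs one step
        return -10           # attempting to pick up a non-existent passenger
    return -1                # North/South/East/West; hitting a wall changes nothing
```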

  5. Q-learning algorithm • For any MDP, there exist one or more optimal policies. All these policies share the same optimal value function, which satisfies the Bellman equation: • Q function:
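The equations on this slide appear as images in the original deck; in standard notation they are the Bellman optimality equation and the corresponding Q function:

```latex
V^*(s) = \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \Big],
\qquad
Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^*(s',a').
```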

  6. Q-learning algorithm (cont’d) • Value function example:

  7. Q-learning algorithm (cont’d) • Learning Process:
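The learning process on this slide is the standard tabular Q-learning update; a minimal sketch, with illustrative hyperparameters (states are assumed hashable):

```python
import random
from collections import defaultdict

ACTIONS = ["North", "South", "East", "West", "Pickup", "Putdown"]
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # illustrative values, not from the paper

Q = defaultdict(float)                   # Q[(state, action)] -> estimated return

def choose_action(state):
    """Epsilon-greedy selection over the primitive actions."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, r, next_state):
    """One tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
```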

  8. Hierarchical Q-learning • Action a is generally simple, e.g., one of the available primitive actions (normal Q-learning). • Could action a also be complex, e.g., a subroutine that takes many primitive actions and then exits? • Yes! The learning algorithm still works (hierarchical Q-learning).
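When action a is such a subroutine, the update takes the semi-Markov (SMDP) form: with N the number of primitive steps a takes and r the discounted reward accumulated during its execution (a standard SMDP Q-learning update, written out here for reference),

```latex
Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \Big[ r + \gamma^{N} \max_{a'} Q(s',a') \Big].
```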

  9. Hierarchical Q-learning (cont’d) • Assumption: some hierarchical structure is given.
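For the taxi task, the given hierarchy is along these lines (a sketch; the binding of Navigate's target t is simplified):

```python
# Composite tasks and their child actions for the taxi task (sketch).
# Navigate is parameterised by a target location t in {R, G, B, Y}.
HIERARCHY = {
    "Root":     ["Get", "Put"],
    "Get":      ["Navigate(t)", "Pickup"],    # t = passenger's source location
    "Put":      ["Navigate(t)", "Putdown"],   # t = passenger's destination
    "Navigate": ["North", "South", "East", "West"],
}
```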

  10. HSMQ Alg. (Task Decomposition)
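A rough Python sketch of the HSMQ idea: each subtask is executed recursively and its Q values are updated with an SMDP-style rule (the `env.step` interface, termination predicates, and hyperparameters are assumptions, not the paper's pseudocode):

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1     # illustrative hyperparameters
Q = defaultdict(float)                     # Q[(task, state, action)]

def hsmq(env, state, task, children, is_primitive, terminated):
    """Execute `task` from `state`; return (discounted in-task reward, steps taken, final state)."""
    total, steps = 0.0, 0
    while not terminated(task, state):
        acts = children[task]
        a = (random.choice(acts) if random.random() < EPSILON
             else max(acts, key=lambda x: Q[(task, state, x)]))
        if is_primitive(a):
            next_state, r = env.step(state, a)          # assumed environment interface
            n = 1
        else:
            r, n, next_state = hsmq(env, state, a, children, is_primitive, terminated)
        # SMDP Q-learning update for this (task, state, action) triple
        best_next = max(Q[(task, next_state, x)] for x in acts)
        Q[(task, state, a)] += ALPHA * (r + GAMMA ** n * best_next - Q[(task, state, a)])
        total += (GAMMA ** steps) * r
        steps += n
        state = next_state
    return total, steps, state
```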

  11. MAXQ Alg. (Value Fun. Decomposition) • Want to obtain some sharing (compactness) in the representation of the value function. • Re-write Q(p, s, a) as Q(p, s, a) = V(a, s) + C(p, s, a), where V(a, s) is the expected total reward received while executing action a, and C(p, s, a) is the expected total reward of completing parent task p after a has returned.
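Spelled out, following the definitions above (the primitive case of V is the expected immediate reward):

```latex
Q(p, s, a) = V(a, s) + C(p, s, a),
\qquad
V(a, s) =
\begin{cases}
  \max_{a'} Q(a, s, a') & \text{if } a \text{ is composite,} \\
  \mathbb{E}\!\left[\, r \mid s, a \,\right] & \text{if } a \text{ is primitive.}
\end{cases}
```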

  12. MAXQ Alg. (cont’d) • An example

  13. MAXQ Alg. (cont’d)

  14. MAXQ Alg. (cont’d)

  15. State Abstraction Three fundamental forms: • Irrelevant variables E.g., the passenger location is irrelevant for the navigate and put subtasks and thus can be ignored. • Funnel abstraction A funnel action is an action that causes a large number of initial states to be mapped into a small number of resulting states. E.g., the navigate(t) action maps any state into a state where the taxi is at location t. This means the completion cost is independent of the starting location of the taxi; it is the same for all initial locations.

  16. State Abstraction (cont’d) • Structure constraints - E.g., if a task is terminated in a state s, then there is no need to represent its completion cost in that state. - Also, in some states, the termination predicate of the child task implies the termination predicate of the parent task. • Effect - Reduces the amount of memory needed to represent the Q-function: 14,000 Q values required for flat Q-learning; 3,000 for HSMQ (with the irrelevant-variable abstraction); 632 for C() and V() in MAXQ. - Learning is faster.

  17. State Abstraction (cont’d)

  18. Limitations • Recursively optimal policies are not necessarily optimal overall. • Model-free Q-learning is used. Model-based algorithms (that is, algorithms that try to learn P(s’|s,a) and R(s’|s,a)) are generally much more efficient because they remember past experience rather than having to re-experience it.
