
Reinforcement Learning in the Multi-Robot Domain


Presentation Transcript


  1. Reinforcement Learning in the Multi-Robot Domain

  2. Source • Maja J. Mataric, “Reinforcement Learning in the Multi-Robot Domain,” Autonomous Robots 4(1):73–83, 1997

  3. Introduction • Mataric describes a method for real-time learning in an autonomous agent • Reinforcement Learning (RL) is used: the agent learns from rewards and punishments • The method is experimentally validated on a group of 4 robots learning a foraging task

  4. Why? • A successful learning algorithm would allow autonomous agents to exhibit complex behaviors with little (or no) extra programming.

  5. Two Main Challenges • The state space is prohibitively large • Building a predictive model is very slow, so it may be more efficient to learn a policy directly • Structuring and assigning reinforcement is difficult • The environment does not provide a direct source of immediate reinforcement • Credit for an outcome is delayed

  6. Addressing Challenges • Since the state space is prohibitively large, it is reduced using behaviors and conditions • Behaviors (e.g., homing, wall-following) abstract away the low-level controllers • Conditions (e.g., have-puck?, at-home?) abstract away the details of the state space

  7. Addressing Challenges • Assigning reinforcement is difficult because an event that induces reinforcement may be due to past actions, such as attempts to reach a goal or reactions to another robot • To address this, Mataric uses shaped reinforcement in the form of heterogeneous reward functions and progress estimators

  8. Reward Functions • Heterogeneous reward functions combine multi-modal feedback from external (sensory) and internal (state) modalities • Each behavior has an associated goal which provides a reinforcement signal • More sub-goals lead to more frequent reinforcement, which leads to faster convergence

  9. Reward Functions • Progress estimators (PEs) provide positive or negative reinforcement with respect to the current goal • PEs decrease sensitivity to noise: noise-induced events are not consistently supported • PEs encourage exploration: non-productive behaviors are terminated • PEs decrease fortuitous rewards: over time, less reward is given to fortuitous successes

  10. The Learning Task • The learning task consists of finding a mapping from conditions • Have-puck? • At-home? • Near-intruder? • Night-time? • to behaviors • Safe-wandering • Dispersion • Resting • Homing
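To make the mapping concrete, here is a minimal Python sketch of the condition and behavior spaces listed on this slide. The condition and behavior names are taken from the slides; the example policy entries are illustrative guesses, not the hand-coded or learned policy from the paper.

```python
from enum import Enum, auto
from typing import NamedTuple

class Behavior(Enum):
    """The four high-level behaviors available to each robot."""
    SAFE_WANDERING = auto()
    DISPERSION = auto()
    RESTING = auto()
    HOMING = auto()

class Condition(NamedTuple):
    """The four binary condition predicates.

    Abstracting the raw sensor space into these predicates leaves only
    2**4 = 16 distinct conditions to be mapped onto 4 behaviors.
    """
    have_puck: bool
    at_home: bool
    near_intruder: bool
    night_time: bool

# A policy is simply a lookup table from conditions to behaviors.
# These entries are illustrative only, not Mataric's policy.
example_policy: dict[Condition, Behavior] = {
    Condition(have_puck=True,  at_home=False, near_intruder=False, night_time=False): Behavior.HOMING,
    Condition(have_puck=False, at_home=False, near_intruder=True,  night_time=False): Behavior.DISPERSION,
    Condition(have_puck=False, at_home=True,  near_intruder=False, night_time=True):  Behavior.RESTING,
}
```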

  11. Learning Algorithm • The matrix A(c,b) is a normalized sum of the reinforcement R received for each (condition, behavior) pair over time t: A(c,b) = Σ_t R(c,t) • Learning is continuous
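A minimal sketch of the A(c,b) table, assuming it is stored per (condition, behavior) pair and normalized by the number of reinforcement events received; the slide only says the sum is normalized, so that normalization detail is an assumption.

```python
from collections import defaultdict

class AMatrix:
    """Accumulates reinforcement per (condition, behavior) pair over time."""

    def __init__(self):
        self._sum = defaultdict(float)   # total reinforcement per (c, b)
        self._count = defaultdict(int)   # number of reinforcement events per (c, b)

    def reinforce(self, condition, behavior, r: float) -> None:
        """Add one reinforcement value R for the (c, b) pair just executed."""
        self._sum[(condition, behavior)] += r
        self._count[(condition, behavior)] += 1

    def value(self, condition, behavior) -> float:
        """Normalized accumulated reinforcement A(c, b)."""
        n = self._count[(condition, behavior)]
        return self._sum[(condition, behavior)] / n if n else 0.0

    def best_behavior(self, condition, behaviors):
        """The behavior with the highest A(c, b) under the current condition."""
        return max(behaviors, key=lambda b: self.value(condition, b))
```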

  12. Immediate Reinforcement • Positive • E_p: grasped-puck • E_gd: dropped-puck-at-home • E_gw: woke-up-at-home • Negative • E_bd: dropped-puck-away-from-home • E_bw: woke-up-away-from-home • The events are merged into one heterogeneous reinforcement function R_E(c)
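A sketch of the event-driven component R_E of the heterogeneous reward. The event names come from this slide; the +1/-1 magnitudes are assumptions, since the slides only give the signs.

```python
# Signed reinforcement for each event; signs are from the slide, magnitudes are assumed.
EVENT_REWARD = {
    "grasped-puck": +1.0,                 # E_p
    "dropped-puck-at-home": +1.0,         # E_gd
    "woke-up-at-home": +1.0,              # E_gw
    "dropped-puck-away-from-home": -1.0,  # E_bd
    "woke-up-away-from-home": -1.0,       # E_bw
}

def immediate_reinforcement(event: str) -> float:
    """R_E: immediate reinforcement triggered by an external or internal event."""
    return EVENT_REWARD.get(event, 0.0)   # unlisted events contribute nothing
```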

  13. Progress Estimators • R_I(c,t) – minimizing interference • Positive for increasing distance from other robots • Negative for decreasing distance • R_H(c,t) – homing (with puck) • Positive for getting nearer to home • Negative for getting farther from home
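A sketch of the two progress estimators, assuming each compares the current distance reading with the previous one. Only the signs of the feedback are given on the slide, so the unit magnitudes (and the choice of "distance to the nearest other robot" for R_I) are assumptions.

```python
def interference_progress(prev_dist_to_robot: float, dist_to_robot: float) -> float:
    """R_I: rewards increasing distance to the nearest other robot, punishes decreasing."""
    if dist_to_robot > prev_dist_to_robot:
        return +1.0
    if dist_to_robot < prev_dist_to_robot:
        return -1.0
    return 0.0

def homing_progress(prev_dist_to_home: float, dist_to_home: float, have_puck: bool) -> float:
    """R_H: while carrying a puck, rewards getting nearer to home, punishes moving away."""
    if not have_puck:
        return 0.0
    if dist_to_home < prev_dist_to_home:
        return +1.0
    if dist_to_home > prev_dist_to_home:
        return -1.0
    return 0.0
```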

  14. Control Algorithm • Behavior selection is induced by events • Events are triggered • Externally • Internally • By progress estimators

  15. Control Algorithm • When an event is detected, the following control sequence is executed • The current (c,b) pair is reinforced • The current behavior is terminated • A new behavior is selected: an untried behavior if one exists, otherwise the “best” behavior
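The control sequence on this slide can be sketched as a single event handler. It reuses the AMatrix sketch from slide 11; the `tried` bookkeeping (which behaviors have already been attempted under each condition) is an assumed mechanism for "choose an untried behavior", not something the slides spell out.

```python
import random

def on_event(a_matrix, condition, current_behavior, reward, behaviors, tried):
    """One pass of the event-driven control sequence.

    a_matrix: the AMatrix sketch from slide 11
    tried: dict mapping each condition to the set of behaviors already tried under it
    """
    # 1. Reinforce the (condition, behavior) pair that was just running.
    a_matrix.reinforce(condition, current_behavior, reward)

    # 2. The current behavior is terminated; record that it has been tried.
    tried.setdefault(condition, set()).add(current_behavior)

    # 3. Select the next behavior: prefer an untried one, otherwise the "best" one.
    untried = [b for b in behaviors if b not in tried[condition]]
    if untried:
        return random.choice(untried)
    return a_matrix.best_behavior(condition, behaviors)
```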

  16. Experimental Results • Three approaches are compared • A monolithic single-goal reward (puck delivery to home) • The heterogeneous reward function • The heterogeneous reward function with two progress estimators

  17. Experimental Results • The hand-coded base policy

  18. Experimental Results • Percent of the correct policy learned after 15 minutes for each of the three approaches: monolithic, heterogeneous, and heterogeneous with progress estimators

  19. Evaluation • Monolithic: does not provide enough feedback • Heterogeneous reward: certain behaviors are pursued too long, and behaviors with delayed reward (homing) are ignored • Heterogeneous with progress estimators: eliminates thrashing and minimizes the impact of fortuitous rewards

  20. Conclusions • Mataric’s method of heterogeneous reward functions with progress estimators uses domain knowledge to improve learning performance

  21. Critique • Multi-robot learning? • The techniques converge to a hand-crafted policy – what is the optimal policy?

  22. Questions?
