
Execution-Time Communication Decisions for Coordination of Multi-Agent Teams


Presentation Transcript


  1. Execution-Time Communication Decisions for Coordination of Multi-Agent Teams Maayan Roth Thesis Defense Carnegie Mellon University September 4, 2007

  2. Cooperative Multi-Agent Teams Operating Under Uncertainty and Partial Observability • Cooperative teams • Agents work together to achieve team reward • No individual motivations • Uncertainty • Actions have stochastic outcomes • Partial observability • Agents don’t always know world state

  3. Coordinating When Communication is a Limited Resource • Tight coordination • One agent’s best action choice depends on the action choices of its teammates • We wish to Avoid Coordination Errors • Limited communication • Communication costs • Limited bandwidth

  4. Thesis Question “How can we effectively use communication to enable the coordination of cooperative multi-agent teams making sequential decisions under uncertainty and partial observability?”

  5. Multi-Agent Sequential Decision Making

  6. Thesis Statement “Reasoning about communication decisions at execution-time provides a more tractable means for coordinating teams of agents operating under uncertainty and partial observability.”

  7. Thesis Contributions • Algorithms that: • Guarantee agents will Avoid Coordination Errors (ACE) during decentralized execution • Answer the questions of when and what agents should communicate

  8. Outline • Dec-POMDP model • Impact of communication on complexity • Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB) • ACE-PJB-Comm: When should agents communicate? • Selective ACE-PJB-Comm: What should agents communicate? • Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP) • Future directions

  9. Dec-POMDP Model • Decentralized Partially Observable Markov Decision Process • Multi-agent extension of single-agent POMDP model • Sequential decision-making in domains where: • Uncertainty in outcome of actions • Partial observability - uncertainty about world state

  10. Dec-POMDP Model • M = ⟨m, S, {A_i}, T, {Ω_i}, O, R⟩ • m is the number of agents • S is the set of possible world states • {A_i} defines the set of joint actions ⟨a_1, …, a_m⟩, where a_i ∈ A_i • T defines transition probabilities over joint actions • {Ω_i} defines the set of joint observations ⟨ω_1, …, ω_m⟩, where ω_i ∈ Ω_i • O defines observation probabilities over joint actions and joint observations • R is the team reward function
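
The tuple above can be captured directly in code. The following is a minimal illustrative sketch, not anything from the thesis; the names DecPOMDP, JointAction, and JointObservation are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

JointAction = Tuple[str, ...]       # one action per agent, e.g. ("Listen", "Listen")
JointObservation = Tuple[str, ...]  # one observation per agent, e.g. ("HL", "HR")

@dataclass
class DecPOMDP:
    """Container mirroring the tuple <m, S, {A_i}, T, {Omega_i}, O, R>."""
    num_agents: int                                   # m
    states: List[str]                                 # S
    actions: List[List[str]]                          # A_i for each agent i
    observations: List[List[str]]                     # Omega_i for each agent i
    T: Callable[[str, JointAction, str], float]       # T(s, a, s') = P(s' | s, a)
    O: Callable[[str, JointAction, JointObservation], float]  # O(s', a, w) = P(w | s', a)
    R: Callable[[str, JointAction], float]            # team reward R(s, a)
```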

  11. Dec-POMDP Complexity • Goal – compute a policy which, for each agent, maps its local observation history to an action • For all m ≥ 2, a Dec-POMDP with m agents is NEXP-complete • Agents must reason about the possible actions and observations of their teammates

  12. Impact of Communication on Complexity [Pynadath and Tambe, 2002] • If communication is free: • Dec-POMDP reducible to single-agent POMDP • Optimal communication policy is to communicate at every time step • When communication has any cost, Dec-POMDP is still intractable (NEXP-complete) • Agents must reason about value of information

  13. Classifying Communication Heuristics • AND- vs. OR-communication [Emery-Montemerlo, 2005] • AND-communication does not replace domain-level actions • OR-communication does replace domain-level actions • Initiating communication [Xuan et al., 2001] • Tell - Agent decides to tell local information to teammates • Query - Agent asks a teammate for information • Sync - All agents broadcast all information simultaneously

  14. Classifying Communication Heuristics • Does the algorithm consider communication cost? • Is the algorithm applicable to: • General Dec-POMDP domains • General Dec-MDP domains • Restricted domains • Are the agents guaranteed to Avoid Coordination Errors?

  15. Related Work • [Slide table comparing prior approaches along the dimensions above: AND vs. OR communication, Tell / Query / Sync initiation, whether communication cost is considered, whether the approach applies to unrestricted domains, and whether agents are guaranteed to Avoid Coordination Errors (ACE)]

  16. Overall Approach • Recall that if communication is free, you can treat a Dec-POMDP like a single-agent POMDP 1) At plan-time, pretend communication is free – generate a centralized policy for the team 2) At execution-time, use communication to enable decentralized execution of this policy while Avoiding Coordination Errors

  17. Outline • Dec-POMDP, Dec-MDP models • Impact of communication on complexity • Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB) • ACE-PJB-Comm: When should agents communicate? • Selective ACE-PJB-Comm: What should agents communicate? • Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP) • Future directions

  18. Tiger Domain: (States, Actions) • Two-agent tiger problem [Nair et al., 2003] • States: S = {SL, SR} – the tiger is either behind the left door or behind the right door • Individual actions: a_i ∈ {OpenL, OpenR, Listen} – each robot can open the left door, open the right door, or listen

  19. Tiger Domain: (Observations) • Individual observations: ω_i ∈ {HL, HR} – each robot can hear the tiger behind the left door or hear the tiger behind the right door • Observations are noisy and independent

  20. Tiger Domain: (Reward) • Coordination problem – agents must act together for maximum reward • Listen has a small cost (-1 per agent) • Both agents opening the door with the tiger leads to a medium negative reward (-50) • Maximum reward (+20) when both agents open the door with the treasure • Minimum reward (-100) when only one agent opens the door with the tiger
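
As a concrete illustration of this reward structure, here is a hedged sketch of a 2-agent reward function using only the values stated on the slide; the name tiger_reward and the handling of action combinations the slide does not mention are assumptions of this example, not the thesis definition.

```python
def tiger_reward(state: str, joint_action: tuple) -> float:
    """Team reward for the 2-agent tiger domain, using only the values on the slide.
    state is "SL" or "SR"; joint_action is a pair from {"OpenL", "OpenR", "Listen"}."""
    tiger_door = "OpenL" if state == "SL" else "OpenR"
    a1, a2 = joint_action
    if a1 == a2 == "Listen":
        return -2.0                 # -1 per listening agent
    if a1 == a2 == tiger_door:
        return -50.0                # both agents open the tiger door together
    if a1 == a2:
        return 20.0                 # both agents open the treasure door
    if tiger_door in (a1, a2):
        return -100.0               # only one agent opens the tiger door
    return 0.0                      # remaining mixed cases are not specified on the slide (assumption)
```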

  21. Coordination Errors • Example: after the observation history HL, HL, HL, …, agent 1 chooses a1 = OpenR while agent 2 chooses a2 = OpenL • Reward(⟨OpenR, OpenL⟩) = -100, whereas Reward(⟨OpenL, OpenL⟩) ≥ -50 • Agents Avoid Coordination Errors when each agent’s action is a best response to its teammates’ actions

  22. Avoid Coordination Errors by Reasoning Over Possible Joint Beliefs (ACE-PJB) • Centralized POMDP policy maps joint beliefs to joint actions • Joint belief (bt) – distribution over world states • Individual agents can’t compute the joint belief • Don’t know what their teammates have observed or what action they selected • Simplifying assumption: • What if agents knew the joint action at each timestep? • Agents would only have to reason about possible observations • How can this be assured?
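
The joint belief bt mentioned above is maintained by the standard Bayesian POMDP belief update. A minimal sketch, assuming the hypothetical DecPOMDP container from the earlier example and beliefs stored as dicts from state to probability:

```python
def update_joint_belief(model, belief, joint_action, joint_obs):
    """Bayes filter: b'(s') is proportional to O(s', a, w) * sum_s T(s, a, s') * b(s)."""
    new_belief = {}
    for s_next in model.states:
        predicted = sum(model.T(s, joint_action, s_next) * belief[s] for s in model.states)
        new_belief[s_next] = model.O(s_next, joint_action, joint_obs) * predicted
    total = sum(new_belief.values())
    # Normalize; if the observation has zero probability under the model, keep the old belief.
    return {s: p / total for s, p in new_belief.items()} if total > 0 else dict(belief)
```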

  23. Ensuring Action Synchronization • Agents are only allowed to choose actions based on information known to all team members • At the start of execution, agents know: b0 – the initial distribution over world states; a0 – the optimal joint action given b0, based on the centralized policy • At each timestep, each agent computes Lt, the distribution of possible joint beliefs: Lt = {⟨bt, pt, ωt⟩}, where ωt is the observation history that led to bt and pt is the likelihood of observing ωt
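
Below is a sketch of how each agent could grow Lt by one step, branching on every possible joint observation. It reuses the update_joint_belief helper from the previous example; leaves are (belief, probability, observation history) triples, matching the ⟨bt, pt, ωt⟩ entries on the slide.

```python
from itertools import product

def expand_possible_beliefs(model, leaves, joint_action):
    """Expand L_t to L_{t+1}: every leaf spawns one child per possible joint observation."""
    new_leaves = []
    for belief, prob, history in leaves:
        for joint_obs in product(*model.observations):
            # Likelihood of this joint observation under the current belief and joint action.
            p_obs = sum(
                belief[s] * model.T(s, joint_action, s2) * model.O(s2, joint_action, joint_obs)
                for s in model.states for s2 in model.states
            )
            if p_obs > 0:
                child = update_joint_belief(model, belief, joint_action, joint_obs)
                new_leaves.append((child, prob * p_obs, history + [joint_obs]))
    return new_leaves
```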

  24. Possible Joint Beliefs • Root L0: b: P(SL) = 0.5, p(b) = 1.0; joint action a = ⟨Listen, Listen⟩ • Branching on the possible joint observations ⟨HL,HL⟩, ⟨HL,HR⟩, ⟨HR,HL⟩, ⟨HR,HR⟩ gives four leaves in L1, with b: P(SL) = 0.8 (p(b) = 0.29), 0.5 (p(b) = 0.21), 0.5 (p(b) = 0.21), and 0.2 (p(b) = 0.29) • How should agents select actions over joint beliefs?

  25. Q-POMDP Heuristic • Select joint action that maximizes expected reward over possible joint beliefs • Q-MDP [Littman et al., 1995] • approximate solution to large POMDP using underlying MDP • Q-POMDP [Roth et al., 2005] • approximate solution to Dec-POMDP using underlying single-agent POMDP
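
In code, the Q-POMDP selection rule is a short maximization over the leaves of Lt. This is an illustrative sketch; q_value(belief, joint_action) is assumed to come from the centralized single-agent POMDP solution, and the leaf format matches the earlier examples.

```python
def q_pomdp(leaves, joint_actions, q_value):
    """Q-POMDP: argmax over joint actions of expected Q-value across possible joint beliefs."""
    def expected_value(joint_action):
        return sum(prob * q_value(belief, joint_action) for belief, prob, _ in leaves)
    return max(joint_actions, key=expected_value)
```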

  26. Q-POMDP Heuristic • Choose the joint action by computing expected reward over all leaves of the belief tree (root b: P(SL) = 0.5, p(b) = 1.0; leaves for ⟨HL,HL⟩, ⟨HL,HR⟩, ⟨HR,HL⟩, ⟨HR,HR⟩ with b: P(SL) = 0.8, 0.5, 0.5, 0.2 and p(b) = 0.29, 0.21, 0.21, 0.29) • Agents will independently select the same joint action, guaranteeing that they avoid coordination errors… but the action choice is very conservative (always ⟨Listen, Listen⟩) • ACE-PJB-Comm: communication adds local observations to the joint belief

  27. ACE-PJB-Comm Example • Agent 1 has observed HL and has communicated nothing • L1 contains the joint beliefs reached by ⟨HL,HL⟩, ⟨HL,HR⟩, ⟨HR,HL⟩, ⟨HR,HR⟩ • aNC = Q-POMDP(L1) = ⟨Listen, Listen⟩ • L* = the leaves consistent with agent 1’s observation (the circled nodes on the slide) • aC = Q-POMDP(L*) = ⟨Listen, Listen⟩ • The actions agree, so don’t communicate

  28. ACE-PJB-Comm Example • Agent 1 has now observed {HL, HL} • After a second a = ⟨Listen, Listen⟩, L2 contains the joint beliefs for all two-step joint observation histories ⟨HL,HL⟩⟨HL,HL⟩, ⟨HL,HL⟩⟨HL,HR⟩, ⟨HL,HL⟩⟨HR,HL⟩, ⟨HL,HL⟩⟨HR,HR⟩, ⟨HL,HR⟩⟨HL,HL⟩, ⟨HL,HR⟩⟨HL,HR⟩, ⟨HL,HR⟩⟨HR,HL⟩, ⟨HL,HR⟩⟨HR,HR⟩, … • aNC = Q-POMDP(L2) = ⟨Listen, Listen⟩ • L* = the leaves consistent with agent 1’s observations (the circled nodes on the slide) • aC = Q-POMDP(L*) = ⟨OpenR, OpenR⟩ • V(aC) - V(aNC) > ε, so agent 1 communicates

  29. ACE-PJB-Comm Example • Agent 1 communicates its observations ⟨HL, HL⟩ • With the communicated observations incorporated, Q-POMDP(L2) = ⟨OpenR, OpenR⟩ • The agents open the right door!
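
Putting the example together, here is a hedged sketch of the when-to-communicate test: an agent communicates only if the action the team would take with its observations revealed is worth more than ε over the action taken without communication. It assumes the q_pomdp helper and the (belief, probability, history) leaf format from the earlier sketches; the function name and argument list are illustrative.

```python
def should_communicate(leaves, my_obs_history, agent_index, joint_actions, q_value, epsilon):
    """ACE-PJB-Comm test for one agent: compare a_C (with communication) to a_NC (without)."""
    a_nc = q_pomdp(leaves, joint_actions, q_value)    # action chosen from common knowledge only
    # L*: possible joint beliefs consistent with this agent's actual observation history.
    l_star = [
        (b, p, hist) for (b, p, hist) in leaves
        if [obs[agent_index] for obs in hist] == my_obs_history
    ]
    a_c = q_pomdp(l_star, joint_actions, q_value)     # action if this agent revealed its observations
    total = sum(p for _, p, _ in l_star)
    def value(joint_action):
        return sum(p * q_value(b, joint_action) for b, p, _ in l_star) / total
    return value(a_c) - value(a_nc) > epsilon
```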

  30. ACE-PJB-Comm Results • 20,000 trials in the 2-Agent Tiger Domain, 6 timesteps per trial • Agents communicate 49.7% fewer observations and send 93.3% fewer messages using ACE-PJB-Comm • There is a difference in expected reward because ACE-PJB-Comm is slightly pessimistic about the outcome of communication

  31. Additional Challenges • Number of possible joint beliefs grows exponentially • Use particle filter to model distribution of possible joint beliefs • ACE-PJB-Comm answers the question of when agents should communicate • Doesn’t deal with what to communicate • Agents communicate all observations that they haven’t previously communicated

  32. Selective ACE-PJB-Comm[Roth et al., 2006] • Answers what agents should communicate • Chooses most valuable subset of observations • Hill-climbing heuristic to choose observations that “push” teams towards aC • aC - joint action that would be chosen if agent communicated all observations • See details in thesis document
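
The slide only names the hill-climbing idea and defers the details to the thesis document, so the following is an illustrative guess at its shape rather than the thesis algorithm: greedily add the single uncommunicated observation whose disclosure best pushes the team’s Q-POMDP choice toward aC, stopping once aC would be selected. It reuses q_pomdp and the leaf format from the earlier sketches; everything else is an assumption.

```python
def select_observations(leaves, my_obs, agent_index, joint_actions, q_value, a_c):
    """Greedy (hill-climbing) selection of a subset of this agent's observations to send."""
    chosen = []                                   # list of (timestep, observation) pairs
    candidates = list(enumerate(my_obs))
    while candidates:
        # Keep only the leaves consistent with the observations selected so far.
        pruned = [(b, p, h) for (b, p, h) in leaves
                  if all(h[t][agent_index] == o for t, o in chosen)]
        if q_pomdp(pruned, joint_actions, q_value) == a_c:
            return chosen                         # teammates would already choose a_C
        def score(candidate):
            t, o = candidate
            subset = [(b, p, h) for (b, p, h) in pruned if h[t][agent_index] == o]
            mass = sum(p for _, p, _ in subset)
            return sum(p * q_value(b, a_c) for b, p, _ in subset) / mass if mass > 0 else float("-inf")
        best = max(candidates, key=score)         # observation that most favors a_C
        chosen.append(best)
        candidates.remove(best)
    return chosen
```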

  33. Selective ACE-PJB-Comm Results • 2-Agent Tiger domain: • Communicates 28.7% fewer observations • Same expected reward • Slightly more messages

  34. Outline • Dec-POMDP, Dec-MDP models • Impact of communication on complexity • Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB) • ACE-PJB-Comm: When should agents communicate? • Selective ACE-PJB-Comm: What should agents communicate? • Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP) • Future directions

  35. Dec-MDP • State is collectively observable • One agent can’t identify full state on its own • Union of team observations uniquely identifies state • Underlying problem is an MDP, not a POMDP • Dec-MDP has same complexity as Dec-POMDP • NEXP-Complete

  36. Acting Independently • ACE-PJB requires agents to know joint action at every timestep • Claim: In many multi-agent domains, agents can act independently for long periods of time, only needing to coordinate infrequently

  37. Meeting-Under-Uncertainty Domain • Agents must move to a goal location and signal simultaneously • Reward: +20 – both agents signal at the goal; -50 – both agents signal at another location; -100 – only one agent signals; -1 – agents move north, south, east, west, or stop

  38. Factored Representations • Represent relationships among state variables instead of relationships among states • S = ⟨X0, Y0, X1, Y1⟩ • Each agent observes its own position
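
For instance, the joint state can be stored as an assignment to the four variables rather than as one of the enumerated grid configurations. The values below are purely illustrative.

```python
# Factored joint state: one entry per state variable.
joint_state = {"X0": 1, "Y0": 2, "X1": 0, "Y1": 2}

# Each agent directly observes only its own position.
agent0_view = {var: joint_state[var] for var in ("X0", "Y0")}
agent1_view = {var: joint_state[var] for var in ("X1", "Y1")}
```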

  39. Factored Representations • A Dynamic Decision Network models how the state variables evolve over time • The slide shows the DDN for at = ⟨East, *⟩

  40. Tree-structured Policies • Decision tree that branches over state variables • A tree-structured joint policy has joint actions at the leaves
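
A minimal sketch of such a tree, using a node class invented for this example: internal nodes test one state variable, and leaves store an action (a joint action in the joint policy, an individual action once the individual policies are extracted).

```python
class PolicyNode:
    """Node of a tree-structured policy."""
    def __init__(self, variable=None, children=None, action=None):
        self.variable = variable        # state variable tested here, e.g. "X0" (None at a leaf)
        self.children = children or {}  # maps a value of `variable` to a child PolicyNode
        self.action = action            # action stored at a leaf (None at internal nodes)
```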

  41. Approach[Roth et al., 2007] • Generate tree-structured joint policies for underlying centralized MDP • Use this joint policy to generate a tree-structured individual policy for each agent* • Execute individual policies * See details in thesis document

  42. Context-specific Independence Claim: In many multi-agent domains, one agent’s individual policy will have large sections where it is independent of variables that its teammates observe.

  43. Individual Policies • One agent’s individual policy may depend on state features it doesn’t observe

  44. Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP) • Robot traverses policy tree according to its observations • If it reaches a leaf, its action is independent of its teammates’ observations • If it reaches a state variable that it does not observe directly, it must ask a teammate for the current value of that variable • The amount of communication needed to execute a particular policy corresponds to the amount of context-specific independence in that domain
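
Below is a sketch of one action choice under this scheme, using the PolicyNode class from the earlier example; local_state holds the variables this agent observes, and query_teammate is a hypothetical callback standing in for asking a teammate for a variable’s value.

```python
def ace_ifp_step(root, local_state, query_teammate):
    """Walk the individual factored policy; communicate only at unobserved variables."""
    node = root
    while node.action is None:                       # still at an internal node
        if node.variable in local_state:
            value = local_state[node.variable]       # observed locally: no communication needed
        else:
            value = query_teammate(node.variable)    # ask a teammate for this variable's value
        node = node.children[value]
    return node.action                               # leaf: action independent of teammates' observations
```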

  45. Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP) • Benefits: • Agents can act independently without reasoning about the possible observations or actions of their teammates • Policy directs agents about when, what, and with whom to communicate • Drawback: • In domains with little independence, agents may need to communicate a lot

  46. Experimental Results • In 3x3 domain, executing factored policy required less than half as many messages as full communication, with same reward • Communication usage decreases relative to full communication as domain size increases

  47. Factored Dec-POMDPs • [Hansen and Feng, 2000] looked at factored POMDPs • ADD (algebraic decision diagram) representations of transition, observation, and reward functions • Policy is a finite-state controller • Nodes are actions • Transitions depend on conjunctions of state variable assignments • To extend to Dec-POMDPs, make each individual policy a finite-state controller over individual actions • Somehow combine nodes with the same action • Communicate to enable transitions between action nodes

  48. Future Directions • Considering communication cost in ACE-IFP • All children of a particular variable may have similar values • Worst-case cost of mis-coordination? • Modeling teammate variables requires reasoning about possible teammate actions • Extending factoring to Dec-POMDPs

  49. Future Directions • Knowledge persistence • Modeling teammates’ variables • Can we identify “necessary conditions”? • e.g. “Tell me when you reach the goal” instead of repeatedly asking “Are you here yet?”

  50. Contributions • Decentralized execution of centralized policies • Guarantee that agents will Avoid Coordination Errors • Make effective use of limited communication resources • When should agents communicate? • What should agents communicate? • Demonstrate significant communication savings in experimental domains
