Game-Theoretic Multi-Agent Learning By: Mostafa Sahraei-Ardakani
Research Groups: • Stanford University (Yoav Shoham) • Rutgers University (Michael Littman) • University of Michigan (Michael Wellman) • University of Alberta (Michael Bowling) • University of British Columbia (Kevin Leyton-Brown) • McGill University (Shie Mannor) • Brown University (Amy Greenwald) • Carnegie Mellon University
Basic Definitions • Markov Decision Process (MDP) • Stage Games • Repeated Games: a stage game played repeatedly • Stochastic Games (Markov Games): a generalization of repeated games and MDPs
Definitions from the SG Point of View • Repeated Game: a stochastic game with only one stage (state) • MDP: a stochastic game with only one agent • Thus an SG generalizes both RGs and MDPs and combines the properties of both
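To make these relationships concrete, here is a minimal data-structure sketch of a stochastic game (my own illustration, not from the slides); a repeated game is the special case with a single state, and an MDP the special case with a single agent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A stochastic (Markov) game: states, per-agent action sets, a joint
# transition function, and one reward function per agent.
@dataclass
class StochasticGame:
    states: List[str]
    actions: List[List[str]]  # actions[i] = action set of agent i
    # transition[(state, joint_action)] = {next_state: probability}
    transition: Dict[Tuple[str, Tuple[str, ...]], Dict[str, float]]
    # rewards[i][(state, joint_action)] = reward to agent i
    rewards: List[Dict[Tuple[str, Tuple[str, ...]], float]]

# Special cases:
#  - one state -> repeated game (the same stage game is played over and over)
#  - one agent -> ordinary MDP
```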
What is the question? • What exact question(s) is MAL addressing? • What is the yardstick? • Which information is available? • Game rules • Play observability • Rivals' actions • Rivals' strategies • Learning and/or teaching • Rock-Paper-Scissors • Repeated Prisoners' Dilemma
Engineering Application • Distributed Controllers • Simplifies design of independent controllers • Equilibrium or Global Optimum? • Problem of Exploitation of Learning
Model-Based Approaches • Of interest to game theorists • Start with some model of the opponent's strategy • Compute and play the best response • Observe the opponent's play and update the model of her strategy • Go to step 2 • Example: Fictitious Play (1951) • Compute rivals' mixed strategies from the empirical history of play • Play the best response
Fictitious Play (FP) • Assumes opponents play stationary strategies • When there are multiple best responses, each is chosen with positive probability Convergence guarantees: • Games that are iterated-dominance solvable (strict Nash equilibrium) • Cooperative games • In zero-sum games the empirical distribution of play converges to the unique mixed-strategy Nash equilibrium Note: smooth FP can play mixed strategies
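A minimal self-play sketch of classical fictitious play on a bimatrix game (the function name, payoff matrices, and tie-breaking rule are illustrative assumptions, not from the slides):

```python
import numpy as np

def fictitious_play(A, B, iterations=2000):
    """Two-player fictitious play on payoff matrices A (row) and B (column).

    Each player best-responds to the empirical frequency of the opponent's
    past actions; ties are broken arbitrarily here (smooth FP would instead
    play a mixed best response).
    """
    n, m = A.shape
    row_counts = np.ones(n)  # pseudo-counts of the row player's past actions
    col_counts = np.ones(m)  # pseudo-counts of the column player's past actions
    for _ in range(iterations):
        col_freq = col_counts / col_counts.sum()  # empirical model of the opponent
        row_freq = row_counts / row_counts.sum()
        best_row = np.argmax(A @ col_freq)        # best response to the model
        best_col = np.argmax(row_freq @ B)
        row_counts[best_row] += 1
        col_counts[best_col] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# Matching pennies (zero-sum): empirical frequencies approach (1/2, 1/2).
A = np.array([[1, -1], [-1, 1]])
print(fictitious_play(A, -A))
```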
Incremental Gradient Ascent Learners (IGA) • Incrementally climbs in the mixed-strategy space • For 2-player, 2-action general-sum games • Guarantees convergence to a Nash equilibrium, or guarantees convergence of the average payoff to one sustained by some Nash equilibrium
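A toy sketch of the IGA update for a 2-player, 2-action game (the payoff matrices, starting point, and constant step size are illustrative; the formal IGA result concerns the infinitesimal-step limit, where the strategies may still cycle but the average payoff converges):

```python
import numpy as np

def iga(R, C, eta=0.01, steps=5000):
    """Gradient ascent in the mixed-strategy space of a 2x2 general-sum game.

    R, C: 2x2 payoff matrices for the row and column player.
    alpha, beta: each player's probability of playing its first action.
    Each player takes a small step along the gradient of its own expected
    payoff, then is projected back onto [0, 1].
    """
    alpha, beta = 0.7, 0.3  # arbitrary starting strategies
    for _ in range(steps):
        # d/d alpha of the row player's expected payoff
        d_alpha = beta * (R[0, 0] - R[1, 0]) + (1 - beta) * (R[0, 1] - R[1, 1])
        # d/d beta of the column player's expected payoff
        d_beta = alpha * (C[0, 0] - C[0, 1]) + (1 - alpha) * (C[1, 0] - C[1, 1])
        alpha = np.clip(alpha + eta * d_alpha, 0.0, 1.0)
        beta = np.clip(beta + eta * d_beta, 0.0, 1.0)
    return alpha, beta

# Prisoners' Dilemma (action 0 = cooperate): both gradients point toward defect,
# so play converges to the (defect, defect) Nash equilibrium.
R = np.array([[3, 0], [5, 1]])
C = np.array([[3, 5], [0, 1]])
print(iga(R, C))
```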
AWESOME! • Adapt When Everybody is Stationary, Otherwise Move to Equilibrium • Is not RL • Converges to a NE in self-play • Plays in epochs, testing two hypotheses: • APPE: all players playing the (precomputed) equilibrium • APS: all players stationary • When the opponents appear stationary, it adapts and plays the best response
Model-Free Approaches • Reinforcement learning • Avoid building an explicit model of the opponent's strategy • Learn how well one's own possible actions fare • Mostly studied in computer science / AI
Single-Agent Q-Learning • With other learning agents present, the environment is no longer stationary • Therefore, convergence is not guaranteed
Bellman's Heritage • Single-agent Q-learning converges to the optimal value function V* • Simple extension to the multi-agent SG setting: Q-values updated without regard to opponents' actions • Justified only if the opponents' choice of actions is stationary
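For reference, a minimal tabular Q-learning update, the backup that the naive multi-agent extension reuses while ignoring the other agents (illustrative code, not from the slides):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    Applied naively in a stochastic game, this update ignores the other
    agents' actions and treats them as part of a fixed environment; that is
    only justified when their behaviour is stationary.
    """
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
q_update(Q, s=0, a="left", r=1.0, s_next=1, actions=["left", "right"])
print(Q[(0, "left")])  # 0.1 * (1.0 + 0.95 * 0 - 0) = 0.1
```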
Bellman's Heritage • Cure: define Q-values as a function of all agents' joint actions • Problem: how to update V? • Maximin Q-learning • Problem: motivated only for zero-sum SGs
Minimax Learning • For zero-sum games, or for conservative play
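A sketch of the maximin value that minimax-Q uses in place of the plain max operator, computed as a linear program with SciPy (the helper name and example game are my own illustration):

```python
import numpy as np
from scipy.optimize import linprog

def maximin_value(Q_s):
    """Value of a zero-sum stage game Q_s[a, o] (our action a, opponent action o).

    Solves  max_pi min_o sum_a pi(a) * Q_s[a, o]  as a linear program;
    the optimum is the V(s) backed up by minimax-Q.
    """
    n, m = Q_s.shape
    # Variables: pi(0..n-1) and the game value v; linprog minimizes, so use -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every opponent action o:  v - sum_a pi(a) Q_s[a, o] <= 0.
    A_ub = np.hstack([-Q_s.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])  # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]  # (maximin policy, V(s))

# Rock-paper-scissors: the maximin policy is uniform and the value is 0.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(maximin_value(rps))
```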
Nash Q-Learning for GSSG (general-sum stochastic games) • Max operator (Q-learning) vs. Nash operator (Nash-Q)
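The slide's equations are not reproduced in this text; written out in standard form (my own transcription), the two backups differ only in the operator applied to the stage-game Q-values:

```latex
% Ordinary Q-learning backs up with a max over the agent's own actions:
\[
V(s) \;=\; \max_{a} Q(s, a)
\]
% Nash-Q instead backs up agent i's value with a Nash equilibrium of the
% stage game defined by all agents' Q-values at state s:
\[
V_i(s) \;=\; \mathrm{Nash}_i\bigl(Q_1(s,\cdot),\dots,Q_n(s,\cdot)\bigr),
\qquad
Q_i(s,\mathbf{a}) \;\leftarrow\; (1-\alpha)\,Q_i(s,\mathbf{a})
 \;+\; \alpha\bigl(r_i + \gamma\, V_i(s')\bigr)
\]
```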
Friend or Foe Q-Learning • Adversarial Equilibrium • Coordination Equilibrium
Friend or Foe Q-Learning (2) • Opponent considered as a friend • Opponent considered as a foe
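The corresponding value backups, written out (standard friend-or-foe Q-learning forms, my own transcription, with a our action and o the opponent's):

```latex
% Friend: the opponent is assumed to help maximize our value
% (coordination equilibrium), so back up with a max over joint actions:
\[
V_i(s) \;=\; \max_{a \in A,\; o \in O} Q_i(s, a, o)
\]
% Foe: the opponent is assumed to minimize our value (adversarial
% equilibrium), so back up with the maximin value of the stage game:
\[
V_i(s) \;=\; \max_{\pi \in \Delta(A)} \;\min_{o \in O} \;\sum_{a \in A} \pi(a)\, Q_i(s, a, o)
\]
```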
Friend or Foe Q-Learning (3) • The Opponent may act differently! • Results on two common grid games
Correlated Q-Learning • What is a correlated equilibrium? • Example • Benefits over mixed-strategy Nash • Convex polytope, so solvable by linear programming • Better outcomes and denial • Independent action selection with a shared signal
Correlated Q-Learning (2) • Need not be well-defined, like the Nash value function • Generalizes the aforementioned value functions
Correlated Q-Learning (3) • Utilitarian: maximize the sum of all agents' values • Egalitarian: maximize the minimum value • Republican: maximize the maximum value • Libertarian: maximize each agent's own value
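A sketch of computing the utilitarian correlated equilibrium of a stage game by linear programming, following the "convex polytope / linear programming" bullet above (the function name, example game, and SciPy usage are my own illustration; the other three variants only swap the objective):

```python
import numpy as np
from scipy.optimize import linprog

def utilitarian_ce(Q1, Q2):
    """Utilitarian correlated equilibrium of a 2-player stage game.

    Q1[a1, a2], Q2[a1, a2]: stage-game Q-values (payoffs) for players 1 and 2.
    The variables are joint-action probabilities p(a1, a2); the incentive
    constraints are linear, so the set of CEs is a convex polytope and the
    utilitarian CE (maximum total payoff) is found by linear programming.
    """
    n, m = Q1.shape
    num_vars = n * m
    idx = lambda a1, a2: a1 * m + a2  # flatten a joint action to a variable index

    A_ub, b_ub = [], []
    # Player 1: no incentive to deviate from a recommended a1 to any a1_dev.
    for a1 in range(n):
        for a1_dev in range(n):
            if a1_dev == a1:
                continue
            row = np.zeros(num_vars)
            for a2 in range(m):
                row[idx(a1, a2)] = Q1[a1_dev, a2] - Q1[a1, a2]  # gain from deviating
            A_ub.append(row)  # require expected gain <= 0
            b_ub.append(0.0)
    # Player 2: symmetric constraints over column deviations.
    for a2 in range(m):
        for a2_dev in range(m):
            if a2_dev == a2:
                continue
            row = np.zeros(num_vars)
            for a1 in range(n):
                row[idx(a1, a2)] = Q2[a1, a2_dev] - Q2[a1, a2]
            A_ub.append(row)
            b_ub.append(0.0)

    # Probabilities sum to 1; maximize total payoff (linprog minimizes, so negate).
    A_eq = np.ones((1, num_vars))
    b_eq = np.array([1.0])
    c = -(Q1 + Q2).reshape(-1)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * num_vars)
    return res.x.reshape(n, m)

# "Chicken": the utilitarian CE mixes the cooperative and the asymmetric outcomes.
Q1 = np.array([[6, 2], [7, 0]])
Q2 = np.array([[6, 7], [2, 0]])
print(np.round(utilitarian_ce(Q1, Q2), 3))
```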
Platform for MARL • MALT: http://www.cs.ubc.ca/~kevinlb/malt • GAMUT (Stanford)
New Approach: Time-order Policy Update • Make the Environment stationary • How to observe rivals’ actions? • Keep the MAX operator! • No direct focus on equilibria