
Top level learning



  1. Top level learning: Pass selection using TPOT-RL

  2. DT receiver choice function
  • The DT is trained off-line in an artificial situation
  • The DT is used in a heuristic, hand-coded function that limits the potential receivers to those that are at least as close to the opponent's goal as the passer
  • The passer always passes to the potential receiver with the highest confidence of success (the max of the passer's and receiver's confidence estimates)
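
As an illustration, here is a minimal Python sketch of this receiver-choice heuristic. It assumes a trained DT exposed through a `dt_confidence` helper and a `dist_to_opponent_goal` helper for the geometry; both names are hypothetical, not from the original system.

```python
# Illustrative sketch of the hand-coded, DT-based receiver choice (slide 2).
# `dt_confidence` and `dist_to_opponent_goal` are hypothetical helpers.

def choose_receiver(passer, teammates, dt_confidence, dist_to_opponent_goal):
    # Heuristic filter: keep only receivers at least as close to the
    # opponent's goal as the passer.
    candidates = [t for t in teammates
                  if dist_to_opponent_goal(t) <= dist_to_opponent_goal(passer)]
    if not candidates:
        return None  # no acceptable receiver in this situation
    # Always pass to the candidate with the highest confidence of success.
    return max(candidates, key=lambda t: dt_confidence(passer, t))
```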

  3. Requirements in "reality"
  • The best pass may be to a receiver farther away from the goal than the passer
  • The receiver that is most likely to successfully receive the pass may not be the one that will subsequently act most favorably for the team

  4. backward pass situation

  5. Pass Selection - a team behavior
  • Learn how to act strategically as part of a team
  • Requires understanding of the long-term effects of local decisions
  • Given the behaviors and abilities of teammates and opponents
  • Measured by the team's long-term success in a real game
  • -> Must be trained on-line against an opponent

  6. ML algorithm characteristics for pass selection
  • On-line
  • Capable of dealing with a large state space despite limited training
  • Capable of learning based on long-term, delayed reward
  • Capable of dealing with shifting concepts
  • Works in a team-partitioned scenario
  • Capable of dealing with opaque transitions

  7. TPOT-RL succeeds by:
  • Partitioning the value function among multiple agents
  • Training agents simultaneously with a gradually decreasing exploration rate
  • Using action-dependent features to aggressively generalize the state space
  • Gathering long-term, discounted reward directly from the environment

  8. TPOT-RL: policy mapping (S -> A)
  • State generalization
  • Value function learning
  • Action selection

  9. State generalization I
  • Mapping the state space to a feature vector: f : S -> V
  • Using an action-dependent feature function: e : S x A -> U
  • Partitioning the state space among agents: P : S -> M

  10. State generalization II
  • |M| >= m ... m = number of agents in the team
  • A = {a_0, ..., a_{n-1}}
  • f(s) = <e(s, a_0), ..., e(s, a_{n-1}), P(s)>
  • V = U^|A| x M
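
A minimal sketch, assuming e and P are available as plain functions, of how the feature vector f(s) defined above could be assembled; the code layout is illustrative rather than the original implementation.

```python
# Sketch of TPOT-RL state generalization (slides 9-10):
# f(s) = <e(s, a_0), ..., e(s, a_{n-1}), P(s)>.

def f(s, actions, e, P):
    # e(s, a) maps a (state, action) pair into the small feature set U;
    # P(s) maps the state to this agent's partition in M.
    return tuple(e(s, a) for a in actions) + (P(s),)
```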

  11. Value function learning I
  • Value function Q(f(s), a_i), with Q : V x A -> ℝ
  • Q depends on e(s, a_i) and is independent of e(s, a_j) for all j ≠ i
  • The Q-table therefore has |U|^1 * |M| * |A| entries

  12. Value function learning II
  • f(s) = v
  • Q(v, a) = Q(v, a) + α * (r - Q(v, a))
  • r is derived from observable environmental characteristics
  • Reward function R : S^t_lim -> ℝ
  • The range of R is [-Q_max, Q_max]
  • Keep track of the action taken a_i and the feature vector v at that time
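
The update rule above, written as a short sketch; storing the Q-values sparsely in a dict keyed by (e(s, a), a) is an assumption made for illustration.

```python
# Sketch of the TPOT-RL value update (slide 12):
# Q(v, a) <- Q(v, a) + α * (r - Q(v, a)).

def update_q(q_table, feature_value, action, reward, alpha=0.02):
    key = (feature_value, action)        # feature_value = e(s, action)
    old = q_table.get(key, 0.0)          # Q-tables start out empty (all zeros)
    q_table[key] = old + alpha * (reward - old)
```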

  13. Action selection
  • Exploration vs. exploitation
  • Reduce the number of free variables with an action filter
  • W ⊆ U: if e(s, a) ∉ W -> a should not be a potential action in s
  • B(s) = {a ∈ A | e(s, a) ∈ W}
  • What if B(s) = {}? (possible when W ⊂ U)
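
A sketch of the action filter B(s). Falling back to the full action set when B(s) is empty is only one possible answer to the question in the last bullet, not necessarily the choice made in the original work.

```python
# Sketch of the action filter from slide 13:
# B(s) = {a in A | e(s, a) in W}.

def filtered_actions(s, actions, e, W):
    B = [a for a in actions if e(s, a) in W]
    # When W is a proper subset of U, B may be empty; falling back to the
    # full action set is one possible (assumed) way to handle that case.
    return B if B else list(actions)
```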

  14. TPOT-RL applied to simulated robotic soccer
  • 8 possible actions in A (see the action space figure)
  • Extends the definition of … (Section 6)
  • The input for L3 is the DT from L2, which is used to define e

  15. action space

  16. State generalization using a learned feature I
  • M = the team's set of positions (|M| = 11)
  • P(s) = the player's current position
  • Define e using the DT (C = 0.734)
  • W = {Success}
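
One plausible reading of how e is derived from the DT, assuming the two values of U come from thresholding the DT's predicted success confidence at C = 0.734; `dt_confidence` is again a hypothetical helper.

```python
# Sketch: turning the learned DT into an action-dependent feature e with
# U = {"Success", "Failure"} by thresholding its confidence at C = 0.734.
# `dt_confidence(s, a)` is a hypothetical helper wrapping the trained DT.

def make_e(dt_confidence, C=0.734):
    def e(s, a):
        return "Success" if dt_confidence(s, a) >= C else "Failure"
    return e
```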

  17. State generalization using a learned feature II
  • |U| = 2
  • V = U^8 x {PlayerPositions}, so |V| = |U|^|A| * |M| = 2^8 * 11 = 2816
  • Total number of Q-values: |U| * |M| * |A| = 2 * 11 * 8 = 176
  • With action filtering (W), each agent learns |W| * |A| = 8 Q-values
  • 10 training examples per 10-minute game

  18. Value function learning via intermediate reinforcement I
  • R_g: if a goal is scored, r = Q_max / t, with t <= t_lim
  • R_i: record t and x_t; 3 conditions fix the reward:
    • the ball goes out of bounds at t + t_0 (t_0 < t_lim)
    • the ball returns to the agent at t + t_r (t_r < t_lim)
    • the ball is still in bounds at t + t_lim

  19. R_i: Case 1
  • The reward r is based on the value r_0
  • t_lim = 30 seconds (300 simulator cycles)
  • Q_max = 100
  • … = 10

  20. reward function

  21. R_i: Cases 2 & 3
  • r is based on the average x-position of the ball
  • x_og = x-coordinate of the opponent's goal
  • x_lg = x-coordinate of the learner's goal
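
The exact shape of the reward is given in the reward-function figure (slide 20). The sketch below only illustrates the general idea of scaling the ball's average x-position between the learner's goal and the opponent's goal into the range [-Q_max, Q_max]; the linear mapping is an assumption, not the published formula.

```python
# Hedged sketch for R_i cases 2 & 3 (slide 21): reward based on the ball's
# average x-position, scaled between the learner's goal (x_lg) and the
# opponent's goal (x_og).  The linear scaling below is an assumption.

Q_MAX = 100.0

def intermediate_reward(avg_x, x_og, x_lg, q_max=Q_MAX):
    # Map avg_x in [x_lg, x_og] linearly onto [-q_max, q_max].
    frac = (avg_x - x_lg) / (x_og - x_lg)
    return q_max * (2.0 * frac - 1.0)
```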

  22. Value function learning via intermediate reinforcement II
  • After taking a_i and receiving r, update Q:
  • Q(e(s, a_i), a_i) = (1 - α) * Q(e(s, a_i), a_i) + α * r
  • α = 0.02

  23. Action selection for multiagent training
  • Multiple agents are learning concurrently -> the domain is non-stationary
  • To deal with this:
    • each agent stays in the same state partition throughout training
    • the exploration rate is very high at first, then gradually decreases

  24. State partitioning
  • Distribute training into |M| partitions, each with a lookup table of size |A| * |U|
  • After training, each agent can be given the trained policy for all partitions
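
A sketch of how the |M| per-partition lookup tables could be merged into one policy after training so that every agent can carry all partitions; the dict layout is an assumption for illustration.

```python
# Sketch of state partitioning (slide 24): each agent trains the table of
# its own partition; afterwards the |M| tables are merged into one policy.

def merge_partition_tables(tables_by_partition):
    # tables_by_partition: {partition: {(feature_value, action): q_value}}
    full_policy = {}
    for partition, table in tables_by_partition.items():
        for (feature_value, action), q in table.items():
            full_policy[(partition, feature_value, action)] = q
    return full_policy
```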

  25. Exploration rate
  • Early exploitation runs the risk of ignoring the best possible actions
  • When in state s, choose:
    • the action with the highest Q-value, with probability p (i.e. a_i such that for all j, Q(f(s), a_i) >= Q(f(s), a_j))
    • a random action, with probability (1 - p)
  • p increases gradually from 0 to 0.99
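
The selection rule above as a short sketch, reusing the sparse Q-table keyed by (e(s, a), a) from the earlier value-update sketch.

```python
import random

# Sketch of the exploration/exploitation rule (slide 25): with probability p
# pick the candidate with the highest Q-value, otherwise pick at random.

def select_action(q_table, s, candidates, e, p):
    if random.random() < p:
        return max(candidates,
                   key=lambda a: q_table.get((e(s, a), a), 0.0))
    return random.choice(candidates)
```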

  26. Results I
  • Agents start out acting randomly, with empty Q-tables
  • For all v ∈ V and a ∈ A: Q(v, a) = 0
  • The probability of acting randomly decreases linearly over periods of 40 games:
    • to 0.5 in game 40
    • to 0.1 in game 80
    • to 0.01 in game 120
  • The learning agents use R_i
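
The stated schedule as a small piecewise-linear helper; the starting value of 1.0 and holding 0.01 after game 120 are assumptions implied by "agents start out acting randomly" and the 160-game run.

```python
# Piecewise-linear schedule for the probability of acting randomly
# (slide 26): 1.0 at the start, 0.5 by game 40, 0.1 by game 80,
# 0.01 by game 120, then held constant (assumed) for the rest of the run.

BREAKPOINTS = [(0, 1.0), (40, 0.5), (80, 0.1), (120, 0.01)]

def random_action_prob(game):
    for (g0, p0), (g1, p1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if game <= g1:
            return p0 + (p1 - p0) * (game - g0) / (g1 - g0)
    return 0.01
```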

  27. Results II

  28. Results II Statistics
  • 160 10-minute games
  • |U| = 1
  • Each agent:
    • got 1490 action-reinforcement pairs -> reinforcement about 9.3 times per game
    • tried each action 186.3 times on average -> each action only about once per game

  29. Results III

  30. Results IV

  31. Results IV Statistics
  • Actions predicted to succeed vs. actions selected
    • 3 of the 8 "attack" actions (37.5%): 6437 / 9967 = 64.6%
  • Action filtering:
    • 39.6% of action options were filtered out
    • 10400 action opportunities with B(s) ≠ {}

  32. Results V

  33. Domain characteristics for TPOT-RL:
  • There are multiple agents organized in a team.
  • There are opaque state transitions.
  • There are too many states and/or not enough training examples for traditional RL.
  • The target concept is non-stationary.
  • There is long-range reward available.
  • There are action-dependent features available.

  34. Examples of such domains:
  • Simulated robotic soccer
  • Network packet-routing
  • Information networks
  • Distributed logistics
