
N-Grams and the Last-Good-Reply Policy in MCTS: Monte-Carlo Methods for Computation and Optimization

This presentation explores the use of N-Grams and the Last-Good-Reply policy in Monte-Carlo Tree Search (MCTS) for turn-based zero-sum games. It discusses the problems with traditional minimax trees and pruning, and introduces a probabilistic alternative based on MCTS. The selection, expansion, simulation, and backpropagation phases of MCTS are covered, along with simulation strategies and implementation details. Results of applying these techniques to different games are also presented.



Presentation Transcript


  1. N-Grams and the Last-Good-Reply Policy in MCTS. Monte-Carlo Methods for Computation and Optimization, Spring 2015. Based on "N-Grams and the Last-Good-Reply Policy Applied in General Game Playing" (Mandy J. W. Tak, Mark H. M. Winands, Yngvi Björnsson), IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, No. 2, June 2012. Presentation by Ayal Shwartz

  2. Games: • Turn-Based Zero-sum games • Worst-Case scenario • Minimax trees • Alpha-Beta pruning • Problems with minimax trees and pruning • Probabilistic approach • Monte-Carlo Tree Search • Selection and Expansion • Simulation and Backpropagation • Upper confidence bounds • Simulation strategies: • Scoring • Move-Average sampling • N-Grams • Last-Good-Reply policy • Some fine details • Results

  3. Games • A game is structured playing, usually undertaken for enjoyment, and sometimes used as an educational tool. • Key components of games are goals, rules, challenge, and interaction (source: Wikipedia). • In a game, we have players, each with their own reward function. • Usually the only reward that interests us is the one at the end of the game – when no one can play anymore, and thus the rewards are set. • Each player wants to maximize their own reward, as defined by the game.

  4. Turn-Based Zero-Sum Games • Many games are turn-based, meaning every player has his own turn to play, and cannot play during someone else’s turn. • Many games of interest are zero-sum games, meaning the sum of the rewards is 0 (or some constant). • This means that if one player has a positive reward, some other player must have a negative reward. • In two-player games, this means that if your opponent wins, you lose!

  5. Worst-Case scenario • We assume that our opponents will always choose the move that guarantees them maximal reward in the worst-case scenario. • And we should do the same. • In other words, we should choose the moves for which the minimal reward we can guarantee will be maximal

  6. Minimax trees • In our turn, we want to choose the move that gives us the highest worst-case reward. • In his turn, our opponent wants to choose the move that gives him the highest worst-case reward. • Since this is a zero-sum game, maximal reward for us is minimal reward for the opponent, and minimal reward for us is maximal reward for the opponent. (Diagram: example minimax tree with backed-up node values -2, -4, 3, 2.)
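As a rough illustration (not from the slides), the worst-case value of a state can be computed recursively; the `state` object with `is_terminal()`, `reward()`, `legal_moves()` and `apply()` is a hypothetical game interface assumed for the sketch:

```python
def minimax(state, maximizing):
    """Return the worst-case-optimal value of `state`.

    `state` is a hypothetical game interface exposing is_terminal(),
    reward(), legal_moves() and apply(move).
    """
    if state.is_terminal():
        return state.reward()                 # reward from our point of view
    values = (minimax(state.apply(m), not maximizing)
              for m in state.legal_moves())
    # We maximize on our turn; the opponent minimizes our reward.
    return max(values) if maximizing else min(values)
```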

  7. Pruning • Pruning saves us some computations. • Since our opponent will always pick the lowest reward, there is no point in looking at the brother of the leaf with reward -4, since our opponent won’t pick any reward greater than -4. • We already have a way to guarantee at least -2 reward. • This is called Alpha pruning. • Similar reasoning can give us Beta pruning. (Diagram: the minimax tree from the previous slide with the pruned branch marked.)
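A hedged sketch of the same search with alpha and beta cut-offs, using the same hypothetical game interface as the minimax sketch above:

```python
def alphabeta(state, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Minimax with alpha (the value we can already guarantee) and
    beta (the value the opponent can already guarantee) cut-offs."""
    if state.is_terminal():
        return state.reward()
    if maximizing:
        value = float("-inf")
        for m in state.legal_moves():
            value = max(value, alphabeta(state.apply(m), alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # opponent already has a better option: prune
                break
        return value
    else:
        value = float("inf")
        for m in state.legal_moves():
            value = min(value, alphabeta(state.apply(m), alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:          # we already have a better option: prune
                break
        return value
```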

  8. Problems with minimax trees and pruning • In almost any game of interest, there are too many possible states. We cannot consider all possible ways to continue the game from a given state, unless there aren’t many ways left to end the game from that state. • Not enough memory. • Not enough time to perform the necessary computations. ⇒ We are limited in how deep we can develop the tree.

  9. Problems with pruning (2) • Solution: Use a heuristic reward function for the states. • This enables us to look only a fixed number of steps ahead, and choose the step that would give us the best worst-case-scenario reward within that horizon. • Problem: This requires prior knowledge of the game.

  10. Problems with pruning (3) • In General Game Playing (GGP), the objective is to be able to play any game. • Given a description of the game (in Game Description Language), we would like to be able to play the game as soon as we have the game’s description (with some time allowed for pre-processing). ⇒ We won’t have time to come up with a heuristic reward function.

  11. Monte-Carlo Tree Search • Rather than look at all the options, we can probabilistically sample moves, and simulate a game from the chosen moves until it ends. • Based on the results of the simulations, we decide which moves are best. • This (potentially) saves us some time and memory. • Hopefully, we will get a good estimate of the (true) victory probability. • Note that the victory probability also depends on how we select the next move we will play. • This also allows us to interrupt the tree search at any time and come up with an action (though we cannot make the same guarantees made in minimax trees).

  12. Monte-Carlo Tree Search (2) • If we bias ourselves towards good moves in the simulation, we might get a different probability distribution over the states in which the game ends. • We are more likely to win if we bias our choice of moves to ones that give us a better victory probability • (assuming we also select the next move to play appropriately)

  13. Selection and Expansion • Selection: • Starting at the root (representing the current state), traverse the tree until you reach a leaf. • Expansion: • If the leaf doesn’t represent a state in which the game ends, select the leaf, and create one (or more) children for the leaf. • Each child represents a state reachable from its parent using an action that can be performed from the parent state. (Diagram: partial search tree with win/visit counts at each node, e.g. 3/4, 1/2.)
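These two phases might be sketched as follows, again assuming the hypothetical game interface used above; the `Node` class and function names are illustrative, not the paper's implementation:

```python
import random

class Node:
    """One node of the search tree: a game state plus win/visit statistics."""
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.wins = 0.0      # accumulated reward from simulations through this node
        self.visits = 0      # number of simulations through this node

def select(root, tree_policy):
    """Selection: walk from the root to a leaf, choosing children
    with a tree policy (for example UCT, shown on a later slide)."""
    node = root
    while node.children:
        node = tree_policy(node)
    return node

def expand(leaf):
    """Expansion: add a child for every legal move, unless the leaf is terminal."""
    if not leaf.state.is_terminal():
        for m in leaf.state.legal_moves():
            leaf.children.append(Node(leaf.state.apply(m), parent=leaf, move=m))
    return random.choice(leaf.children) if leaf.children else leaf
```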

  14. Simulation and Backpropagation • Simulation: • From the newly created node, simulate a game. That is, sample moves for each player, create a child node representing the resulting state, and repeat from the new node until the end of the game is reached. • Backpropagation: • Propagate the result of the simulation (victory/defeat, or reward) back to the node created in the expansion phase, and update its parents. (Diagram: the tree from the previous slide after a playout ending 1:0, with updated win/visit counts, e.g. 5/7 and 4/6.)
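Continuing the sketch above, the remaining two phases could look roughly like this; the simulation policy is a stand-in for the strategies discussed on later slides:

```python
def simulate(node, policy):
    """Simulation: play out a game from `node`, sampling each move with
    a simulation policy (uniform random, MAST, NST, ...)."""
    state = node.state
    while not state.is_terminal():
        state = state.apply(policy(state))
    return state.reward()            # e.g. 1 for a victory, 0 for a defeat

def backpropagate(node, reward):
    """Backpropagation: add the playout result to the expanded node
    and to all of its ancestors up to the root."""
    while node is not None:
        node.visits += 1
        node.wins += reward
        node = node.parent
```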

  15. Upper confidence bound • In the selection phase, we would like to balance two elements: exploration and exploitation. • We should explore new moves. • Shorter paths to victory are preferable, as they contain less uncertainty. • Perhaps we will find a path with a higher reward. • We should exploit promising moves. • We don’t have enough time or memory to explore too many options. • There may not be any better options which we are likely to reach.

  16. Upper confidence bound (2) • UCT (Upper Confidence bound applied to Trees) can be used to balance exploration and exploitation: \( a^{*} = \operatorname{argmax}_{a \in A(s)} \left\{ Q(s,a) + C \sqrt{\frac{\ln N(s)}{N(s,a)}} \right\} \) Where: • \(A(s)\) is the collection of all actions we can take from state \(s\). • \(Q(s,a)\) is the average reward when taking action \(a\) from \(s\). • \(N(s)\) is the number of visits to state \(s\), and \(N(s,a)\) is the number of times action \(a\) has been taken from state \(s\).
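A small sketch of how this formula might drive the selection phase in the earlier MCTS sketch; the exploration constant `c = 1.4` is an arbitrary common default, not the paper's tuned value:

```python
import math

def uct_value(node, child, c=1.4):
    """UCT score of `child` under `node`: average reward plus an
    exploration bonus that shrinks as the child is visited more."""
    if child.visits == 0:
        return float("inf")          # unvisited actions are explored first
    exploitation = child.wins / child.visits
    exploration = c * math.sqrt(math.log(node.visits) / child.visits)
    return exploitation + exploration

def uct_select(node, c=1.4):
    """Pick the child maximizing the UCT value (used as the tree policy)."""
    return max(node.children, key=lambda ch: uct_value(node, ch, c))
```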

  17. Upper confidence bound (3) • Trade-off between exploration and exploitation: • Every time an action is taken from a certain state, the chance of this action being taken again decreases, thus increasing the importance of its reward and decreasing the importance of exploring it again. • The chance to choose any other action increases slightly, to promote exploration of other actions.

  18. Upper confidence bound (4) • The UCT gives us a measure of how important it is to expand certain nodes. • Note that if we haven’t explored a certain action \(a\) from a state \(s\), then \(N(s,a)\) is 0, and the UCT value is not defined. • In such a case, we will explore by default. • In other words, if we’ve decided that it is important to explore a certain state which has unexplored actions, it is important to explore them. • If there are many such actions, we use the same selection strategy we use in the simulation stage.

  19. Simulation strategies • The simulation strategy is a key component of MCTS. • A basic simulation strategy would be to select random moves. • This, however, might not be such a good idea: we can generally assume that our opponent will favour moves which give him a good chance at victory. • We would also like to choose good moves when we actually play the game.

  20. Scoring • If we have a “score” for each move, we can use that score when considering which move to select next. • We will consider a simple scoring function: Average reward returned when the move (or series of moves) we’re considering was used. • For convenience, let’s assume our rewards are 1 (victory) or 0 (defeat). • Can easily be generalized to games with different rewards/non-zero-sum games.

  21. Scoring (2) • How do we use the score? • Gibbs measure: let \(a_1, \dots, a_n\) be all the legal moves from a certain state. Let \(P(a_i)\) be the probability of selecting \(a_i\) as the next move, then: \( P(a_i) = \frac{e^{Q(a_i)/\tau}}{\sum_{j=1}^{n} e^{Q(a_j)/\tau}} \) • \(Q(a_i)\) is the score for move \(a_i\). • \(\tau\) is a parameter which controls exploration. • \(Q(a_i)\) is the maximal reward if move \(a_i\) hasn’t been given a score yet. • \(\epsilon\)-Greedy: with probability \(1-\epsilon\), select the move with the highest score; otherwise, sample a random move uniformly.
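Both selection rules can be sketched in a few lines of Python; the `score` dictionary and the `tau`, `eps` and `default` values are illustrative placeholders, not the paper's settings:

```python
import math
import random

def gibbs_choice(moves, score, tau=1.0, default=1.0):
    """Sample a move with probability proportional to exp(score/tau).
    `tau` controls exploration; unscored moves get `default`
    (the maximal reward) so they are still tried."""
    weights = [math.exp(score.get(m, default) / tau) for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]

def epsilon_greedy_choice(moves, score, eps=0.1, default=1.0):
    """With probability 1 - eps pick the best-scored move,
    otherwise pick a uniformly random one."""
    if random.random() < eps:
        return random.choice(moves)
    return max(moves, key=lambda m: score.get(m, default))
```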

  22. Move-Average Sampling • The Move-Average Sampling Technique (MAST) uses the average reward from play-outs (simulations) in which a move was used to calculate the move’s score, in order to determine whether it is a good move or not. (Diagram: example playout ending 0:1, with per-move win/loss tallies, e.g. 3 victories / 3 defeats, 2 victories / 8 defeats.)
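A minimal sketch of how a MAST table might be maintained; names and structure are illustrative, not the paper's code:

```python
from collections import defaultdict

class MAST:
    """Move-Average Sampling Technique: keep the average playout reward
    of every move, regardless of where in the game it was played."""
    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, playout_moves, reward):
        """Called after each simulation with the moves played in it."""
        for move in playout_moves:
            self.total[move] += reward
            self.count[move] += 1

    def score(self, move, default=1.0):
        """Average reward of `move`; unseen moves get the maximal reward."""
        if self.count[move] == 0:
            return default
        return self.total[move] / self.count[move]
```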

  23. N-Grams • N-Grams add context to the moves: we look at up to (N-1) moves back when deciding which move to select next. • We give a score to each legal move based on the N-Gram formed when adding it to the previous moves. • If a certain series of moves hasn’t been played enough times, we ignore it, as we do not have enough information on it yet. • We average the scores of all the relevant N-Grams (of the different lengths) that we have in memory. • (If a certain N-Gram hasn’t been played at least k times, we ignore it. In the paper, k was chosen with manual tuning.)
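A simplified sketch of an N-Gram table: it pools the moves of all players into one table, which the paper does not do, and the `n` and `k` defaults are placeholders rather than the paper's tuned values:

```python
from collections import defaultdict

class NGramTable:
    """Average playout reward of move sequences of length 1..n, used to
    score a candidate move given the last n-1 moves played."""
    def __init__(self, n=3, k=7):
        self.n, self.k = n, k               # k: visit threshold (placeholder;
        self.total = defaultdict(float)     #    tuned manually in the paper)
        self.count = defaultdict(int)

    def update(self, playout_moves, reward):
        """Record every subsequence of length 1..n that occurred in the playout."""
        for i in range(len(playout_moves)):
            for length in range(1, self.n + 1):
                if i + 1 >= length:
                    seq = tuple(playout_moves[i + 1 - length:i + 1])
                    self.total[seq] += reward
                    self.count[seq] += 1

    def score(self, history, move, default=1.0):
        """Average the scores of all sufficiently visited N-Grams that end
        in `move`, given the recent `history` of moves."""
        scores = []
        for length in range(1, self.n + 1):
            if len(history) < length - 1:
                continue
            seq = tuple(history[len(history) - (length - 1):]) + (move,)
            if self.count[seq] == 0:
                continue
            if length > 1 and self.count[seq] < self.k:
                continue                    # not enough data on this longer sequence
            scores.append(self.total[seq] / self.count[seq])
        return sum(scores) / len(scores) if scores else default
```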

  24. N-Grams (2) (Diagram: reward lookup table – average reward when a series of moves was used.)

  25. Last-Good-Reply Policy • During a game, we can consider moves to be “replies” to the previous move (or series of moves). • If a reply to a series of moves was successful, we should try it again when we are able to. • Like in N-Grams, we can look back a few steps and try to consider the context. • If we lost in one of our simulations, all of the “replies” which appeared in the simulation are removed from memory, since they are no longer considered good replies. • If we won in the simulation, all the replies that appeared in the simulation will be added, since they seem to be good replies.
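A minimal sketch of a last-good-reply table keyed on the single previous move; the paper also uses two-move contexts and keeps separate tables per player, which this sketch omits:

```python
class LastGoodReply:
    """Remember the last successful reply to a previous move; forget
    replies that appeared in lost playouts."""
    def __init__(self):
        self.reply = {}                        # previous move -> stored reply

    def update(self, playout_moves, won):
        """playout_moves: the alternating sequence of moves of one simulation."""
        for prev, move in zip(playout_moves, playout_moves[1:]):
            if won:
                self.reply[prev] = move        # store/overwrite the good reply
            else:
                self.reply.pop(prev, None)     # forget replies from a lost game

    def suggest(self, prev):
        """Return the stored reply to `prev`, or None if there is none."""
        return self.reply.get(prev)
```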

  26. Some fine details • Expansion when a node has unvisited moves: The UCT cannot be applied when we have actions for which we have no score. In such a case, we will use one of the simulation strategies to select an action from the set of unexplored actions that can be taken. • Last-Good-Reply: If we cannot use a reply (if it isn’t a legal move), we use a fall-back strategy – either N-Grams or MAST in this case. Basic MCTS algorithms might randomly select moves with uniform probabilities.
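Putting the pieces together, the fall-back chain for the simulation policy might look roughly like this, reusing the illustrative helpers sketched above (`LastGoodReply`, `NGramTable`, `epsilon_greedy_choice`):

```python
def simulation_move(state, history, lgr, ngrams, eps=0.1):
    """Rough sketch of a simulation policy with a fall-back chain:
    try the stored last-good-reply first; if it is absent or illegal,
    fall back to N-Gram scoring via epsilon-greedy selection."""
    moves = state.legal_moves()
    reply = lgr.suggest(history[-1]) if history else None
    if reply in moves:
        return reply
    scores = {m: ngrams.score(history, m) for m in moves}
    return epsilon_greedy_choice(moves, scores, eps)
```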

  27. Some fine details (2) • Non-zero-sum games: When the game is not a zero-sum game, we cannot assume that our opponent will choose the move which is worst for us. Instead, when we simulate the opponent’s move, we assume he tries to maximize his own reward. • Games with more than two players: When there are several players, an N-Gram is made of a sequence of N-1 consecutive moves made by the other players, with the move made by the current player appended to the end of this sequence.

  28. Some fine details (3) • Concurrent-move games: We can look at a turn-based game as a concurrent-move game, with a special action called noop (no operation), which each player must play until a certain condition is met (signalling that it is his turn). Note that even in concurrent-move games, the N-Grams look back on previous turns, not on multiple moves in the same turn, though they are calculated from the perspective of each player.

  29. Some fine details (4) • And finally, how do we select the move when we actually need to play? • Option 1: Select the move which leads to the highest reward we’ve seen so far. • This is the strategy employed by CadiaPlayer – a player who won multiple General Game Playing competitions, using MCTS with MAST and Gibbs measure. • Option 2: Select the move which has the highest average reward. • Both have their advantages and disadvantages.

  30. Some fine details (5) • It would be preferable to win in as few moves as possible, since fewer moves mean less uncertainty. • A “discount factor” can be introduced: the reward for an endgame that takes t steps would be multiplied by \(\gamma^{t}\), for some discount factor \(\gamma < 1\). • CadiaPlayer uses such a discount factor.

  31. Results (best of all values tested, when compared to Gibbs)

  32. Results (2) • \(\epsilon\)-Greedy MAST against MAST using the Gibbs measure. It seems that in the case of Connect5, exploration is more important than exploitation, as opposed to Checkers.
