Monte-Carlo Methods
Learning methods averaging complete episodic returns
Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]
Slides prepared by Georgios Chalkiadakis
Differences with DP/TD
• Differences with DP methods:
  • Real RL: complete transition model not necessary
  • They sample experience; can be used for direct learning
  • They do not bootstrap
  • No evaluation of successor states
• Differences with TD methods:
  • Well, they do not bootstrap
  • They average complete episodic returns
Overview and Advantages
• Learn from experience – sample episodes
  • Sample sequences of states, actions, rewards
  • Either on-line, or from simulated (model-based) interactions with the environment
  • But no complete model required
• Advantages
  • Provably learn the optimal policy without a model
  • Can be used with sample / easy-to-produce models
  • Can easily focus on interesting state regions
  • More robust wrt Markov property violations
Policy Evaluation
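The policy-evaluation algorithm shown on this slide is not reproduced in the text, so here is a minimal first-visit MC prediction sketch in Python. It assumes each episode is a list of (state, reward) pairs generated by following the policy being evaluated, with reward r_{t+1} stored next to state s_t; these data-format choices and names are illustrative assumptions, not the slide's exact pseudocode.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns that follow the first visit to s.

    `episodes`: iterable of episodes, each a list of (state, reward) pairs
    produced by following the evaluated policy (an assumed data format).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for episode in episodes:
        # Record the time step of the first visit to each state.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Walk the episode backwards, accumulating the discounted return.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:  # only the first visit to s contributes
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```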
Action-value functions required
• Without a model, we need Q-value estimates
• MC methods now average returns following visits to state-action pairs
• All such pairs "need" to be visited!
  • …sufficient exploration required
  • Randomize episode starts ("exploring starts")
  • …or behave using a stochastic (e.g. ε-greedy) policy
  • …thus "Monte-Carlo"
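Averaging returns per state-action pair can be kept incremental rather than storing all returns; this is a small sketch of that bookkeeping, assuming `Q` and `N` are dictionaries keyed by (state, action) (an implementation choice, not something fixed by the slides).

```python
def update_q(Q, N, state, action, G):
    """Incrementally average episodic returns G for one (state, action) pair.

    Q holds the running mean return and N the number of visits counted so far;
    the running-mean update is equivalent to averaging all returns seen.
    """
    key = (state, action)
    N[key] = N.get(key, 0) + 1
    old = Q.get(key, 0.0)
    Q[key] = old + (G - old) / N[key]
    return Q[key]
```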
Monte-Carlo Control (to generate optimal policy)
• For now, assume "exploring starts"
• Does "policy iteration" work?
• Yes!
  • Evaluation of each policy is over multiple episodes
  • Improvement makes the policy greedy wrt the current Q-value function
Monte-Carlo Control (to generate optimal policy)
• Why? $\pi_{k+1}$ is greedy wrt $Q^{\pi_k}$
• Then, the policy-improvement theorem applies because, for all $s$:
  $Q^{\pi_k}(s, \pi_{k+1}(s)) = \max_a Q^{\pi_k}(s, a) \ge Q^{\pi_k}(s, \pi_k(s)) = V^{\pi_k}(s)$
• So $\pi_{k+1}$ is uniformly better than $\pi_k$
• Thus $V^{\pi_{k+1}}(s) \ge V^{\pi_k}(s)$ for all $s$
A Monte-Carlo control algorithm
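The algorithm figure from this slide is not reproduced here; below is a minimal Python sketch of Monte-Carlo control with exploring starts in that spirit. The `env.states`, `env.actions(s)`, and `env.generate_episode(policy, s0, a0)` helpers (returning a list of (state, action, reward) triples) are hypothetical conveniences, not part of the slides or the book's pseudocode.

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(env, num_episodes, gamma=1.0):
    """Monte-Carlo control with exploring starts (a sketch under assumed env API)."""
    Q = defaultdict(float)
    N = defaultdict(int)
    # Start from an arbitrary deterministic policy.
    policy = {s: random.choice(env.actions(s)) for s in env.states}

    for _ in range(num_episodes):
        # Exploring starts: every (state, action) pair can begin an episode.
        s0 = random.choice(env.states)
        a0 = random.choice(env.actions(s0))
        episode = env.generate_episode(policy, s0, a0)  # [(s, a, r), ...]

        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                # Evaluation step: average returns for the first visit.
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
                # Improvement step: make the policy greedy wrt current Q.
                policy[s] = max(env.actions(s), key=lambda b: Q[(s, b)])
    return policy, Q
```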
What about ε-greedy policies?
ε-greedy Exploration
• If not "greedy", select with probability $\frac{\varepsilon}{|A(s)|}$
• Otherwise, select the greedy action, with probability $1 - \varepsilon + \frac{\varepsilon}{|A(s)|}$
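These selection probabilities can be realized with a tiny helper; `Q` is assumed to be a mapping from (state, action) to value (e.g. a defaultdict), which is an illustrative choice rather than anything prescribed by the slides.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick an action under an epsilon-greedy policy derived from Q.

    With probability epsilon we pick uniformly over A(s), so each non-greedy
    action ends up with probability epsilon/|A(s)| and the greedy action with
    probability 1 - epsilon + epsilon/|A(s)|.
    """
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit
```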
Yes, policy iteration works
• See the details in the book
• ε-soft on-policy algorithm (a code sketch follows below):
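Since the slide's algorithm figure is not included, here is a minimal sketch of on-policy first-visit MC control for ε-soft policies, reusing the epsilon_greedy_action helper above. The `env.actions(s)` and `env.generate_episode(policy)` interface is the same set of hypothetical assumptions as before, not the book's exact pseudocode.

```python
from collections import defaultdict

def on_policy_mc_control(env, num_episodes, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control with an epsilon-soft behavior policy (sketch)."""
    Q = defaultdict(float)
    N = defaultdict(int)

    def policy(state):
        # Behave epsilon-greedily wrt the current Q estimates.
        return epsilon_greedy_action(Q, state, env.actions(state), epsilon)

    for _ in range(num_episodes):
        episode = env.generate_episode(policy)  # [(s, a, r), ...], assumed helper

        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```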
…and you can have off-policy learning as well…
• Why?
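The answer is developed on later slides; as an illustration only, off-policy MC evaluation is usually done with importance sampling (the standard mechanism in Sutton & Barto's treatment, not spelled out on this slide). A minimal sketch of the ordinary importance-sampling estimator, assuming tabular action probabilities `pi[s][a]` for the target policy and `b[s][a]` for the behavior policy, and episodes that all start from the same state (illustrative assumptions):

```python
def off_policy_mc_value(episodes, pi, b, gamma=1.0):
    """Estimate V_pi(s0) from episodes generated by the behavior policy b,
    using ordinary importance sampling.

    `pi[s][a]` and `b[s][a]` give action probabilities; each episode is a
    list of (state, action, reward) triples starting from the same state.
    """
    total, count = 0.0, 0
    for episode in episodes:
        # Importance-sampling ratio for the whole episode.
        rho = 1.0
        for (s, a, _) in episode:
            rho *= pi[s][a] / b[s][a]
        # Discounted return of the episode.
        G = 0.0
        for t in reversed(range(len(episode))):
            _, _, r = episode[t]
            G = gamma * G + r
        total += rho * G
        count += 1
    return total / count if count else 0.0
```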