
Q-Learning and Dynamic Treatment Regimes



  1. Q-Learning and Dynamic Treatment Regimes. S.A. Murphy, Univ. of Michigan. IMS/Bernoulli: July, 2004

  2. Outline • Dynamic Treatment Regimes • Optimal Q-functions and Q-learning • The Problem & Goal • Finite Sample Bounds • Outline of Proof • Shortcomings and Open Problems

  3. Dynamic Treatment Regimes • Multi-stage decision problems: repeated decisions are made over time on each patient. • Used in the management of addictions, mental illnesses, HIV infection and cancer.

  4. k Decisions. $X_t$: observations made prior to the t-th decision; $A_t$: action at the t-th decision; histories $\bar{X}_t = (X_1,\dots,X_t)$, $\bar{A}_t = (A_1,\dots,A_t)$. Primary outcome: $Y$.

  5. A dynamic treatment regime is a vector of decision rules, one per decision: $d = (d_1,\dots,d_k)$, where $d_t$ maps the history $(\bar{x}_t, \bar{a}_{t-1})$ to an action. If the regime is implemented, then $A_t = d_t(\bar{X}_t, \bar{A}_{t-1})$ for each $t$.

  6. Goal: Estimate the decision rules that maximize the mean of $Y$, that is, maximize $E_d[Y]$ over regimes $d$. Data: a data set of n finite horizon trajectories, each with randomized actions; $p_t(a_t \mid \bar{x}_t, \bar{a}_{t-1})$ are the randomization probabilities.

  7. Optimal Q-functions and Q-learning: Definition: $E_d$ denotes expectation when the actions are chosen according to the regime $d$.
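
Since the data were generated with randomized actions (slide 6) and the randomization probabilities are bounded below (slide 16), $E_d$ can be rewritten in terms of the unsubscripted expectation $E$ by a standard change of measure. The identity below is a standard weighting formula from this literature, included for orientation; it is not claimed to be the slide's own display:

```latex
% Change of measure relating E_d to E when actions are randomized with
% probabilities p_t; requires p_t > 0 on the actions the regime selects.
\[
E_d[Y] \;=\; E\!\left[\, Y \prod_{t=1}^{k}
  \frac{\mathbf{1}\{A_t = d_t(\bar{X}_t, \bar{A}_{t-1})\}}
       {p_t(A_t \mid \bar{X}_t, \bar{A}_{t-1})} \right].
\]
```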

  8. Q-functions: The Q-functions for the optimal regime, $Q_1^*,\dots,Q_k^*$, are given recursively for $t = k, k-1, \dots, 1$ by $Q_k^*(\bar{x}_k, \bar{a}_k) = E[\,Y \mid \bar{X}_k = \bar{x}_k,\ \bar{A}_k = \bar{a}_k\,]$ and, for $t < k$, $Q_t^*(\bar{x}_t, \bar{a}_t) = E[\,\max_{a_{t+1}} Q_{t+1}^*(\bar{X}_{t+1}, \bar{A}_t, a_{t+1}) \mid \bar{X}_t = \bar{x}_t,\ \bar{A}_t = \bar{a}_t\,]$.

  9. Q-functions: The optimal regime is given by $d_t^*(\bar{x}_t, \bar{a}_{t-1}) = \arg\max_{a_t} Q_t^*(\bar{x}_t, \bar{a}_{t-1}, a_t)$, for $t = 1,\dots,k$.
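
Combining slides 8 and 9, the optimal value can be written in terms of $Q_1^*$ alone; this is standard dynamic programming, stated here in the notation above as a reading aid rather than as a reproduction of a slide:

```latex
% The optimal mean outcome: X_1 precedes all actions, so its
% distribution is the same under every regime.
\[
E_{d^*}[Y] \;=\; E\!\left[\, \max_{a_1} Q_1^*(X_1, a_1) \right].
\]
```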

  10. Q-learning: Given a model $Q_t(\cdot\,; \theta_t)$ for the Q-functions, minimize over $\theta_k$ the empirical mean squared error $E_n\big[\big(Y - Q_k(\bar{X}_k, \bar{A}_k; \theta_k)\big)^2\big]$ (here $E_n$ denotes the average over the n trajectories). Set $\hat{\theta}_k$ equal to the minimizer.

  11. Q-learning: For each $t = k-1,\dots,1$, minimize over $\theta_t$ $E_n\big[\big(\max_{a_{t+1}} Q_{t+1}(\bar{X}_{t+1}, \bar{A}_t, a_{t+1}; \hat{\theta}_{t+1}) - Q_t(\bar{X}_t, \bar{A}_t; \theta_t)\big)^2\big]$, and set $\hat{\theta}_t$ equal to the minimizer, and so on.

  12. Q-Learning: The estimated regime is given by $\hat{d}_t(\bar{x}_t, \bar{a}_{t-1}) = \arg\max_{a_t} Q_t(\bar{x}_t, \bar{a}_{t-1}, a_t; \hat{\theta}_t)$, for $t = 1,\dots,k$.
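
To make slides 10-12 concrete, here is a minimal sketch of Q-learning for $k = 2$ decisions with binary actions and linear working models for the Q-functions. Everything in it is illustrative: the simulated trajectories, the helper names (design, fit, qhat, regime), and the simplification that each working model uses only the most recent observation rather than the full history.

```python
# Minimal Q-learning sketch for k = 2 decisions (slides 10-12).
# Illustrative assumptions: binary actions in {0, 1}, linear working
# models theta'(1, x, a, a*x), and each stage's model using only the
# most recent observation instead of the full history.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated trajectories (X1, A1, X2, A2, Y) with randomized actions,
# mirroring the data description on slide 6 (p_t = 1/2 throughout).
X1 = rng.normal(size=n)
A1 = rng.integers(0, 2, size=n)
X2 = 0.5 * X1 + rng.normal(size=n)
A2 = rng.integers(0, 2, size=n)
Y = X2 + A2 * (1.0 - X2) + 0.5 * A1 * X1 + rng.normal(size=n)

def design(x, a):
    # Features of the linear working model Q(x, a; theta).
    return np.column_stack([np.ones_like(x), x, a, a * x])

def fit(Z, target):
    # One regression step of slides 10-11: minimize the empirical
    # mean squared error E_n[(target - Q)^2] over theta.
    theta, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return theta

def qhat(theta, x, a):
    # Fitted Q-function values Q(x, a; theta-hat).
    return design(x, a) @ theta

# Stage 2 (slide 10): regress Y on (X2, A2).
theta2 = fit(design(X2, A2), Y)

# Plug-in target for stage 1 (slide 11): max over a2 of the fitted
# stage-2 Q-function at each trajectory's X2.
V2 = np.maximum(qhat(theta2, X2, np.zeros(n)), qhat(theta2, X2, np.ones(n)))

# Stage 1 (slide 11): regress the plug-in target on (X1, A1).
theta1 = fit(design(X1, A1), V2)

def regime(theta, x):
    # Estimated decision rule (slide 12): argmax over a in {0, 1}.
    q0 = qhat(theta, x, np.zeros_like(x))
    q1 = qhat(theta, x, np.ones_like(x))
    return (q1 > q0).astype(int)

xgrid = np.array([-1.0, 0.0, 1.0])
print("stage-1 rule at x =", xgrid, "->", regime(theta1, xgrid))
print("stage-2 rule at x =", xgrid, "->", regime(theta2, xgrid))
```

The backward order is the point: the stage-2 fit supplies the target $\max_{a_2} \hat{Q}_2$ for the stage-1 regression, exactly the recursion on slide 11, and the estimated regime is read off by taking argmaxes of the fitted Q-functions.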

  13. The Problem & Goal: Most learning (e.g. estimation) methods utilize a model for all or parts of the multivariate distribution of the data. The model implicitly constrains the class of possible decision rules in the dynamic treatment regime: call this constrained class $\mathcal{D}$. $X$ is a vector with many components (high dimensional), thus the model is likely incorrect; view the class of modeled Q-functions $\mathcal{Q}$ and $\mathcal{D}$ as approximation classes.

  14. Goal: Given a learning method and approximation classes, assess the ability of the learning method to produce the best decision rules in the class. Ideally, construct an upper bound for $E_{d^*}[Y] - E_{\hat{d}}[Y]$, where $\hat{d}$ is the estimator of the regime, $d^*$ is the optimal regime, and $E_d$ denotes expectation when the actions are chosen according to the rule $d$.

  15. Goal: Given a learning method, model and approximation class, construct a finite sample upper bound for $E_{d^*}[Y] - E_{\hat{d}}[Y]$. This upper bound should be composed of quantities that are minimized in the learning method. The learning method here is Q-learning.
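
Read together with slides 18-21, the bound being sought splits into an approximation term (how far the class of Q-functions and induced rules is from optimal) and an estimation term (finite sample error of the fitted Q-functions). Schematically, with the precise terms and constants deferred to those slides and the paper:

```latex
% Schematic decomposition of the target bound; the specific
% expressions are those of slides 18-21, not reproduced here.
\[
0 \;\le\; E_{d^*}[Y] - E_{\hat{d}}[Y]
  \;\le\; \underbrace{\mathrm{Err}_{\text{approx}}(\mathcal{Q})}_{\text{slide 18}}
  \;+\; \underbrace{\mathrm{Err}_{\text{est}}(n,\delta)}_{\text{slides 20-21, w.p. } \ge 1-\delta}.
\]
```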

  16. Finite Sample Bounds: Primary Assumptions: (1) the randomization probabilities are bounded away from zero, $p_t(a_t \mid \bar{x}_t, \bar{a}_{t-1}) \ge L^{-1}$ for some $L > 1$; (2) the number of possible actions is finite.

  17. Definition: where E, without a subscript, denotes expectation when the actions are randomized.

  18. Results: Approximation Error: The minimum is over Q-functions in the approximation class.

  19. Define the space of functions generated by the Q-function models. The estimation error involves the complexity of this space.

  20. Estimation Error: For any $\epsilon > 0$, the estimation error is at most $\epsilon$ with probability at least $1 - \delta$, for n satisfying a lower bound that depends on the complexity of the space above, on $\epsilon$, and on $\log(1/\delta)$.

  21. If the space is finite, then n needs only to satisfy a bound that grows with the logarithm of the cardinality of the space; that is, the required sample size is logarithmic in the size of the class.
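
The finite-class statement on slide 21 is the standard Hoeffding-plus-union-bound calculation. The generic version below, for an arbitrary loss $\ell_Q$ bounded in $[0, B]$ and a finite class $\mathcal{Q}$, shows the shape of the sample size requirement; the slide's own constants and its specific loss are in the paper:

```latex
% Generic finite-class uniform deviation bound: with probability at
% least 1 - delta,
\[
\sup_{Q \in \mathcal{Q}}
  \bigl|\, E_n[\ell_Q] - E[\ell_Q] \,\bigr|
  \;\le\; B \sqrt{\frac{\log(2|\mathcal{Q}|/\delta)}{2n}},
\]
% so uniform deviations of size at most epsilon need only
\[
n \;\ge\; \frac{B^2}{2\,\epsilon^2}\,\log\frac{2\,|\mathcal{Q}|}{\delta}.
\]
```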

  22. Outline of Proof: The Q-functions for a fixed regime $d$ are given by $Q_k^d(\bar{x}_k, \bar{a}_k) = E[\,Y \mid \bar{X}_k = \bar{x}_k,\ \bar{A}_k = \bar{a}_k\,]$ and, for $t = k-1,\dots,1$, $Q_t^d(\bar{x}_t, \bar{a}_t) = E[\,Q_{t+1}^d(\bar{X}_{t+1}, \bar{A}_t, d_{t+1}(\bar{X}_{t+1}, \bar{A}_t)) \mid \bar{X}_t = \bar{x}_t,\ \bar{A}_t = \bar{a}_t\,]$.

  23. Proof Outline (1)

  24. Proof Outline (2) It turns out that also

  25. Proof Outline (3)

  26. Shortcomings and Open Problems

  27. Recall Estimation Error: For any $\epsilon > 0$, the estimation error is at most $\epsilon$ with probability at least $1 - \delta$, for n satisfying a lower bound that depends on the complexity of the space, on $\epsilon$, and on $\log(1/\delta)$.

  28. Open Problems • Is there a learning method that can learn the best decision rule in an approximation class given a data set of n finite horizon trajectories? • Sieve estimators or regularized estimators? • Dealing with high dimensional X: feature extraction, feature selection.

  29. This seminar can be found at: http://www.stat.lsa.umich.edu/~samurphy/seminars/ims_bernoulli_0704.ppt The paper can be found at: http://www.stat.lsa.umich.edu/~samurphy/papers/Qlearning.pdf samurphy@umich.edu

  30. Recall Proof Outline (2) It turns out that also

  31. Recall Proof Outline (1)
