World models and basis functions


Presentation Transcript


  1. World models and basis functions Nisheeth 25th January 2019

  2. Levels of analysis

  3. RL in the brain
  • What is the problem? Reinforcement → learning preferences for actions that lead to desirable outcomes
  • How is it solved?
    • MDPs provide a general mathematical structure for solving decision problems under uncertainty
    • RL was developed as a set of online learning algorithms for solving MDPs
    • A critical component of model-free RL algorithms is the temporal difference (TD) signal
  • Hypothesis: the brain implements model-free RL
  • Implementation: spiking rates of dopaminergic neurons in the basal ganglia and ventral striatum behave as if they encode this TD signal (a minimal sketch of the signal follows below)
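As a minimal sketch of what the TD signal is, assuming a tabular TD(0) learner (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular TD(0) step. delta is the temporal difference signal:
    the reward prediction error dopaminergic firing appears to track."""
    delta = r + gamma * V[s_next] - V[s]  # TD error
    V[s] += alpha * delta                 # move the estimate toward the TD target
    return delta

# toy usage: a 5-state chain, reward arrives on entering the last state
V = np.zeros(5)
print(td0_update(V, s=3, r=1.0, s_next=4))  # 1.0: a large positive TD error
```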

  4. Implication?
  • Model-free learning: learn the mapping from action sequences to rewarding outcomes
  • Don't care about the physics of the world that leads to different outcomes
  • Is this a realistic model of how human and non-human animals learn?

  5. Learning maps of the world

  6. Cognitive maps in rats and men

  7. Rats learned a spatial model
  • Rats behave as if they had some sense of p(s'|s,a) (see the counting sketch below)
  • This was not explicitly trained; it generalized from previous experience
  • The corresponding paper is recommended reading
  • So is Tolman's biography: http://psychclassics.yorku.ca/Tolman/Maps/maps.htm
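A model-based learner can estimate p(s'|s,a) by simply counting observed transitions. A minimal maximum-likelihood sketch, with illustrative state and action names:

```python
from collections import defaultdict

class TransitionModel:
    """Maximum-likelihood estimate of p(s'|s,a) from experienced transitions."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1  # tally each observed transition

    def p(self, s_next, s, a):
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0

# usage: after experience in the maze, query routes that were never rewarded
model = TransitionModel()
model.observe("junction", "turn_left", "food_arm")
print(model.p("food_arm", "junction", "turn_left"))  # 1.0
```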

  8. The model-free vs model-based debate
  • Model-free learning → learn stimulus-response mappings = habits
  • What about goal-based decision-making? Do animals not learn the physics of the world in making decisions?
  • Model-based learning → learn what to do based on the way the world is currently set up = thoughtful responding?
  • People have argued for two systems: thinking fast and slow (Balleine & O'Doherty, 2010)

  9. A contemporary experiment
  • The Daw task (Daw et al., 2011) is a two-stage Markov decision task
  • It differentiates model-based and model-free RL accounts empirically (a simulation sketch of the task structure follows below)
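To make the task concrete, here is a minimal simulation sketch of its transition structure, assuming the commonly cited 70/30 common/rare transition split; the slowly drifting reward probabilities of the real task are reduced to a fixed vector here, and all names are illustrative:

```python
import random

def two_step_trial(choice1, reward_prob):
    """One trial: a first-stage choice (0 or 1) leads to its 'common'
    second-stage state 70% of the time and to the other state 30% of
    the time; reward_prob holds each second-stage state's payoff rate."""
    common = random.random() < 0.7
    state2 = choice1 if common else 1 - choice1
    reward = 1 if random.random() < reward_prob[state2] else 0
    return state2, common, reward

# a model-free learner repeats rewarded first-stage choices regardless of
# transition type; a model-based learner credits common vs. rare transitions
state2, common, reward = two_step_trial(0, reward_prob=[0.6, 0.3])
```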

  10. Predictions meet data
  • Behavior appears to be a mix of both strategies
  • What does this mean? This remains an active area of research

  11. Some hunches
  [figure: behavior after moderate vs. extensive training (Holland, 2004; Killcross & Coutureau, 2003)]

  12. Current consensus
  • In moderately trained tasks, people behave as if they are using model-based RL
  • In highly trained tasks, people behave as if they are using model-free RL
  • Nuance:
    • Repetitive training on a small set of examples favors model-free strategies
    • Limited training on a larger set of examples favors model-based strategies (Fulvio, Green & Schrater, 2014)

  13. Big ticket application
  • How to practically shift behavior from habitual to goal-directed in the digital space
  • The reverse shift is understood pretty well by social media designers

  14. The social media habituation cycle
  [diagram: the habituation cycle linking State and Reward]

  15. Designed based on cognitive psychology principles

  16. Competing claims
  • First World kids are miserable! (Twenge, Joiner, Rogers & Martin, 2017) https://journals.sagepub.com/doi/full/10.1177/2167702617723376
  • Not true! (Orben & Przybylski, 2019) https://www.nature.com/articles/s41562-018-0506-1

  17. Big ticket application
  • How to change computer interfaces from promoting habitual to promoting thoughtful engagement
  • This depends on being able to measure habitual vs. thoughtful behavior online

  18. The vocabulary of world maps: basis functions

  19. The state space problem in model-free RL
  • The number of states quickly becomes too large, even for trivial applications: tic-tac-toe alone has 765 unique board states (after accounting for symmetry)
  • Learning becomes too dependent on the right choice of exploration parameters
  • Explore-exploit tradeoffs become harder to solve

  20. Solution approach
  • Cluster states
  • Design features to stand in for important situation elements; for tic-tac-toe:
    • Close to win
    • Close to loss
    • Fork opportunity
    • Block fork
    • Center
    • Corner
    • Empty side
  (a feature-extractor sketch follows below)
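A minimal sketch of such a hand-designed feature map φ(s) for tic-tac-toe; the helper names are illustrative, and the two fork features are omitted for brevity:

```python
def lines(board):
    """All 8 win lines of a 3x3 board, given as a 9-char string of 'X'/'O'/' '."""
    idx = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]
    return [[board[i] for i in line] for line in idx]

def features(board, me="X", opp="O"):
    """phi(s): a few of the slide's features as binary indicators."""
    phi = {
        "close_to_win":  any(l.count(me) == 2 and l.count(" ") == 1 for l in lines(board)),
        "close_to_loss": any(l.count(opp) == 2 and l.count(" ") == 1 for l in lines(board)),
        "center":        board[4] == me,
        "corner":        any(board[i] == me for i in (0, 2, 6, 8)),
        "empty_side":    any(board[i] == " " for i in (1, 3, 5, 7)),
    }
    return [float(v) for v in phi.values()]

# usage: X one move from the top row, O one move from the middle row
print(features("XX OO    "))  # [1.0, 1.0, 0.0, 1.0, 1.0]
```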

  21. Value function approximation
  • RL methods have traditionally approximated the state value function using linear basis functions: V(s) ≈ θᵀφ(s) = Σ_k θ_k φ_k(s)
  • θ is a K-dimensional parameter vector, where K is the number of features in the feature map φ
  • Implicit assumption: all features contribute independently to the evaluation (a short sketch follows below)
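Concretely, the approximation is a single dot product; a minimal sketch with illustrative names:

```python
import numpy as np

def v_hat(theta, phi_s):
    """Linear value approximation: V(s) ~ sum_k theta_k * phi_k(s).
    Each feature contributes independently through its own weight."""
    return float(np.dot(theta, phi_s))

theta = np.array([0.8, -0.9, 0.1, 0.2, 0.0])  # K = 5 learned weights
phi_s = np.array([1.0, 1.0, 0.0, 1.0, 1.0])   # e.g. the tic-tac-toe features above
print(v_hat(theta, phi_s))  # ~0.1
```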

  22. Function approximation in Q-learning
  • Approximate the Q table with linear basis functions: Q(s,a) ≈ θᵀφ(s,a)
  • Update the weights: θ ← θ + α δ φ(s,a)
  • Where δ is the TD term: δ = r + γ max_a' Q(s',a') − Q(s,a)
  • More details in the next class (a worked sketch follows below)
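A minimal sketch of these updates, assuming semi-gradient Q-learning over the linear features above (names are illustrative):

```python
import numpy as np

def q_update(theta, phi_sa, r, phi_next_all, alpha=0.1, gamma=0.95):
    """Semi-gradient Q-learning with Q(s,a) ~ theta . phi(s,a).
    phi_next_all holds phi(s',a') for every action a' available in s'."""
    q_next = max(np.dot(theta, p) for p in phi_next_all) if phi_next_all else 0.0
    delta = r + gamma * q_next - np.dot(theta, phi_sa)  # the TD term
    theta = theta + alpha * delta * phi_sa              # weight update
    return theta, delta

# usage with made-up 3-dimensional features
theta = np.zeros(3)
theta, delta = q_update(theta, np.array([1.0, 0.0, 1.0]), r=1.0,
                        phi_next_all=[np.array([0.0, 1.0, 0.0])])
print(theta, delta)  # [0.1 0.  0.1] 1.0
```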
