
REINFORCEMENT LEARNING


Presentation Transcript


  1. REINFORCEMENT LEARNING Group 11 Ashish Meena 04005006 Rohitashwa Bhotica 04005010 Hansraj Choudhary 04d05005 Piyush Kedia 04d05009

  2. Outline • Introduction • Learning Models • Motivation • Reinforcement Learning Framework • Q-Learning Algorithm • Applications • Summary

  3. Introduction • Machine Learning • Construction of programs that automatically improve with experience. • Types of Learning • Supervised Learning • Unsupervised Learning • Reinforcement Learning

  4. Supervised Learning - Training data: (X,Y). (features, label) - Predict Y, minimizing some loss. - Regression, Classification. • Example • Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient. (Logistic Regression)
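For concreteness, a minimal sketch of how such a supervised model might be fit with scikit-learn is shown below; the feature set (age, cholesterol, blood pressure) and the data are hypothetical, chosen only to illustrate the (features, label) setup.

```python
# A minimal sketch of the second-heart-attack example using scikit-learn.
# The feature names and data below are hypothetical, purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [age, cholesterol, systolic blood pressure]; label 1 = second heart attack.
X = np.array([[62, 240, 150],
              [55, 200, 130],
              [70, 280, 160],
              [48, 190, 120]])
y = np.array([1, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)                               # learn from labelled (features, label) pairs
print(model.predict([[60, 250, 145]]))        # predicted class for a new patient
print(model.predict_proba([[60, 250, 145]]))  # predicted probability of each class
```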

  5. Unsupervised Learning • Training data: X. (features only) - Find “similar” points in high-dimensional X-space. - Clustering. • Example • From the DNA micro-array data, determine which genes are most “similar” in terms of their expression profiles. (Clustering)
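A corresponding clustering sketch, again with a tiny synthetic expression matrix standing in for real micro-array data:

```python
# A minimal clustering sketch in the spirit of the gene-expression example.
# The expression matrix below is synthetic; real micro-array data would have
# thousands of genes and many more experimental conditions.
import numpy as np
from sklearn.cluster import KMeans

# Rows = genes, columns = expression levels across experimental conditions.
expression = np.array([[0.1, 0.2, 0.9, 1.0],
                       [0.0, 0.1, 1.0, 0.9],
                       [0.9, 1.0, 0.1, 0.0],
                       [1.0, 0.8, 0.2, 0.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(expression)
print(kmeans.labels_)  # genes with similar expression profiles share a cluster label
```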

  6. Reinforcement Learning • Training data: (S, A, R). (State-Action-Reward) • Develop an optimal policy (sequence of decision rules) for the learner so as to maximize its long-term reward.

  7. The Reinforcement Learning Framework • [Diagram: the Agent observes the State from the Environment, its Policy selects an Action, and the Environment returns a Reward and the next State.]

  8. Motivation • Supervised and unsupervised learning fail in many situations. • Example • Flight control systems • Given the set of all sensor readings at a given time, decide how the flight controls should function. • Supervised learning fails here because the correct labels (control outputs) for many sensor readings are unknown. • By performing trial-and-error interactions with the environment, reinforcement learning is capable of solving such problems.

  9. Learning by Interaction • Learning to ride a bicycle • Actions: • Turn handle bars RIGHT. • Turn handle bars LEFT. • Rewards: • Positive if the bicycle is perfectly balanced. • Negative if the angle between the bicycle and the ground decreases.

  10. Inspiration from Natural Learning • Dopamine • A neurohormone occurring in a wide variety of animals, including both vertebrates and invertebrates • Regulates and controls behavior by inducing pleasurable effects equivalent to rewards • Motivates us to proactively perform certain activities

  11. Markov Decision Processes • A discrete set of environment states S. • A discrete set of agent actions A. • At each discrete time step t the agent observes state st ∈ S and chooses action at ∈ A • Receives reward rt • State changes to st+1 • Markov Assumptions • st+1 = δ(st, at) • rt = r(st, at) • δ and r are usually unknown to the learner • δ and r may also be nondeterministic
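A minimal sketch of the resulting agent-environment loop is given below. The state set, the random stand-ins for δ and r, and the random policy are illustrative assumptions; in a real problem δ and r belong to the environment and are usually unknown to the learner.

```python
# A sketch of the agent-environment interaction loop for the MDP described above.
import random

states = ["s0", "s1", "s2"]
actions = ["left", "right"]

def delta(s, a):                  # transition function: s_{t+1} = delta(s_t, a_t)
    return random.choice(states)  # random stand-in for the (unknown) dynamics

def r(s, a):                      # reward function: r_t = r(s_t, a_t)
    return 1.0 if s == "s2" else 0.0

def policy(s):                    # the agent's (here: random) policy
    return random.choice(actions)

s = "s0"
for t in range(10):
    a = policy(s)                 # agent observes s_t and chooses a_t in A
    reward = r(s, a)              # receives reward r_t
    s = delta(s, a)               # state changes to s_{t+1}
    print(t, a, reward, s)
```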

  12. Simple deterministic example

  13. Model-free vs. Model-based Methods • Model-free Methods • Eliminate the need to know the model • Learn a policy without estimating the model. • Example: Q-Learning • Model-based Methods • The model consists of the Transition Probability Function and the Reward Function. • Interactively estimate the model and calculate the policy. • Example: Dyna

  14. Precise Goal • Maximization of long-term reward. • An optimal policy has to be learnt in order to do this. • Long-term reward may have different interpretations according to how the future is taken into account, i.e. different models of optimal behaviour.

  15. Models of Optimal Behaviour • Three main models • Finite Horizon Model • Optimize the reward for the next h steps, where rt represents the reward for the (t+1)-th action performed • h = 1 represents a greedy strategy

  16. Models of Optimal Behaviour (2) • Infinite Horizon Discounted Model • Models future rewards as being less valuable than rewards in the present • Has an infinite horizon, i.e. considers infinitely many actions into the future • Makes use of a discount factor γ between 0 and 1 which represents the importance of the future • This is the model used in the Q-Learning algorithm

  17. Models of Optimal Behaviour (3) • Average-Reward Model • The reward per action is considered • The infinite future is taken into account • Rewards in the future are equally valuable • The three criteria are written out formally below
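Written out formally (following the Kaelbling, Littman, and Moore survey listed in the references, with r_t the reward received t steps into the future and γ the discount factor), the three criteria are:

```latex
% Finite Horizon Model: optimize the expected reward over the next h steps.
E\!\left[\sum_{t=0}^{h-1} r_{t}\right]

% Infinite Horizon Discounted Model: geometrically discount future rewards
% with a discount factor 0 <= gamma < 1.
E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t}\right]

% Average-Reward Model: reward per action over an infinite horizon.
\lim_{h \to \infty} \frac{1}{h}\, E\!\left[\sum_{t=0}^{h-1} r_{t}\right]
```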

  18. The Learning Task • Execute actions, observe results, and learn a policy π : S → A maximizing the discounted return Vπ(st) = rt + γ rt+1 + γ² rt+2 + … for all states s ∈ S • The infinite horizon discounted model is being used here

  19. Value Function • For each possible policy π the agent can adopt, the value function is Vπ(st) = Σi≥0 γ^i rt+i, the discounted cumulative reward obtained by following π from state st • In terms of the value function, the learning task can be reformulated: learn the optimal policy π* such that π* = argmaxπ Vπ(s) for all states s

  20. Learning the Value Function • The value function of the optimal policy is written as V*(s) = maxπ Vπ(s) • A one-step look-ahead search can be performed from any state to determine the optimal policy using π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))] • Learning V* in this way is only possible when the agent has perfect knowledge of both δ and r • Thus the need to define the Q function arises

  21. Example V* Values

  22. The Q Function • The evaluation function Q(s, a) is defined as Q(s, a) = r(s, a) + γ V*(δ(s, a)) • If the Q function is learnt, the optimal action can be selected without knowing δ and r, since π*(s) = argmaxa Q(s, a)

  23. Learning the Q Function • Notice the relationship between Q and V* • Rewriting the value of Q in terms of Q itself gives a recursive definition (both relations are written out below)
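The two relations referred to on this slide, in their standard deterministic form (as in Mitchell, 1997), are:

```latex
% Relationship between Q and the optimal value function:
V^{*}(s) = \max_{a'} Q(s, a')

% Substituting this into the definition of Q gives the recursive form
% that the Q-learning training rule approximates:
Q(s, a) = r(s, a) + \gamma \max_{a'} Q\bigl(\delta(s, a), a'\bigr)
```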

  24. Learning the Q-value • FOR each (s, a) DO • Initialize the table entry Q(s, a) ← 0 • Observe the current state s • WHILE (true) DO • Select an action a and execute it • Receive the immediate reward r • Observe the new state s' • Update the table entry using the training rule Q(s, a) ← r + γ maxa' Q(s', a') • Move: record the transition and set s ← s'
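A minimal tabular Q-learning sketch corresponding to the algorithm above is given below. The toy chain environment (five states with a single rewarded goal) is an assumption made purely for illustration; only the update rule and the episode structure come from the slides.

```python
# Tabular Q-learning on a toy deterministic chain environment.
# States 0..4; action 0 = left, 1 = right; entering state 4 (the goal) pays 100.
import random
from collections import defaultdict

GAMMA = 0.9
N_STATES, GOAL = 5, 4

def step(s, a):
    """Deterministic transition delta(s, a) and reward r(s, a)."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 100.0 if s_next == GOAL else 0.0
    return s_next, reward

Q = defaultdict(float)                  # table entries Q(s, a), initialised to 0

for episode in range(200):              # a series of episodes...
    s = random.randrange(N_STATES - 1)  # ...each with a random (non-goal) start state
    while s != GOAL:                    # the goal is the absorbing state
        a = random.randrange(2)         # exploratory (random) action selection
        s_next, reward = step(s, a)
        # Training rule: Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = reward + GAMMA * max(Q[(s_next, b)] for b in range(2))
        s = s_next                      # record the transition from s to s'

greedy = [max(range(2), key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)   # learned greedy policy; should prefer moving right (action 1) everywhere
```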

  25. Q Learning Illustration • Consider that an optimal strategy is being learnt for the given set of states and rewards for actions • The reward for all actions is 0 unless the action moves the agent into the goal state G, in which case it is 100 • [Diagram: a grid of states with arrows labelled by immediate rewards; arrows into G carry reward 100, all others 0.]

  26. Q Learning Illustration (2) • The actual values of V* and Q, taking γ = 0.9, follow from the definitions (a worked reconstruction is given below)
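As a reconstruction of the kind of numbers shown on the slide, assuming the usual grid-world layout in which the only non-zero reward is the 100 received on entering G:

```latex
% One move from the goal: the action entering G earns the only non-zero reward.
V^{*}(s_{1}) = 100

% Two moves away: immediate reward 0, then the discounted value of s_{1}.
V^{*}(s_{2}) = 0 + \gamma \, V^{*}(s_{1}) = 0.9 \times 100 = 90

% Three moves away:
V^{*}(s_{3}) = 0 + \gamma \, V^{*}(s_{2}) = 0.9 \times 90 = 81

% A corresponding Q-value, e.g. for the action from s_{3} that lands in s_{2}:
Q(s_{3}, a) = r(s_{3}, a) + \gamma \, V^{*}(s_{2}) = 0 + 0.9 \times 90 = 81
```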

  27. Learning Step • Since an absorbing state (the goal) exists, learning will consist of a series of episodes, each starting from a random state and ending when the goal is reached

  28. Uncertain State Transitions • A state transition probability function P(s' | s, a) is used • It denotes the probability of moving from state s to state s' when action a is performed • The Q function is modified accordingly (see below)
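Following Mitchell's treatment of nondeterministic MDPs (Mitchell, 1997, listed in the references), the modified definition and the corresponding training rule are:

```latex
% Q for uncertain transitions: expected immediate reward plus the discounted,
% probability-weighted value of the best action in each possible successor state.
Q(s, a) = E\!\left[r(s, a)\right]
        + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s', a')

% Training rule with a decaying learning rate \alpha_n:
\hat{Q}_{n}(s, a) \leftarrow (1 - \alpha_{n})\, \hat{Q}_{n-1}(s, a)
  + \alpha_{n} \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right],
\qquad \alpha_{n} = \frac{1}{1 + \mathit{visits}_{n}(s, a)}
```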

  29. Applications • Cell Phone Channel Allocation • Cobot: A Social Reinforcement Learning Agent • Car Simulation Using Reinforcement Learning • Network Packet Routing • Elevator Scheduling • Use of RL to improve the performance of natural language question answering systems • Reinforcement Learning Methods for Military Applications

  30. Cell Phone Channel Allocation • Learns channel allocations for cell phones • Channels are limited • Allocations affect adjacent cells • Want to minimize dropped and blocked calls

  31. Cell Phone Channel Allocation (cont.) • States • Occupied and unoccupied channels for each cell • Availability: number of free channels for the cell • Actions • Call arrival: evaluate the possible free channels and assign the one that has the highest value • Call termination: free the channel, consider reassigning each ongoing call to the just-released channel, and perform the reassignment (if any) with the highest value • Rewards and Values • Reward is the number of ongoing calls

  32. Performance of FA, BDCL, and RL • FA = fixed assignment method • BDCL = borrowing with directional channel locking • [Performance comparison chart not reproduced in the transcript]

  33. Cobot: A Social Reinforcement Learning Agent • Cobot is a software agent • Applies RL in a complex, human, online, chat-based social environment • Its goal is to interact with other members and to become a real part of the social fabric • It takes certain actions on its own initiative • Any user can reward or punish it • Cobot keeps an incremental database of “social statistics” • e.g. how frequently and in what ways users interacted with one another, and it provides summaries of these statistics as a service

  34. Cobot (cont.) • States • One state space corresponding to each user • Each state space contains a number of features holding statistics about that particular user • Actions • Null Action: choose to remain silent for this time period • Topic Starters: introduce a conversational topic • Social Commentary: make a comment describing the current social state of the LivingRoom, such as “It sure is quiet” or “Everyone here is friendly.”

  35. Cobot (cont.) • The RL Reward Function • Reward verb, e.g. hug • Punish verb, e.g. spank • These verbs give a numerical (positive and negative, respectively) training signal to Cobot

  36. Car Simulation: Using Reinforcement Learning • The drivers do not know the track information beforehand • The agent takes appropriate actions to avoid bumping into the wall by learning from past experience and the given rewards • States • The state of the car is represented by two variables: the distance of the car to the left wall of the track, and the car's velocity towards the right wall

  37. Car Simulation (cont.) • Action • Turn left or right so that the car keeps going in the right direction • Reward • The car is given a negative reward if it bumps into the wall or goes backwards • The reward is positive if the car is going in the correct direction
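A toy environment matching this description might look like the sketch below. The track width, speed changes, and reward magnitudes are assumptions made purely for illustration (forward/backward motion along the track is omitted), and the continuous state would need to be discretised before it could be fed to the tabular Q-learner sketched earlier.

```python
# Toy "car between two walls" environment in the spirit of the simulation above.
# Track width, speed changes, and reward values are illustrative assumptions.
class CarTrack:
    WIDTH = 10.0                          # distance between the left and right walls

    def __init__(self):
        self.dist_left = self.WIDTH / 2   # distance of the car to the left wall
        self.velocity = 0.0               # velocity towards the right wall

    def state(self):
        return (self.dist_left, self.velocity)

    def step(self, action):
        """action is 'left' or 'right'; returns (next_state, reward)."""
        self.velocity += -1.0 if action == "left" else 1.0
        self.dist_left += self.velocity
        if self.dist_left <= 0.0 or self.dist_left >= self.WIDTH:
            return self.state(), -10.0    # bumped into a wall: negative reward
        return self.state(), 1.0          # still on the track: positive reward

env = CarTrack()
print(env.step("right"))                  # ((6.0, 1.0), 1.0)
```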

  38. Conclusion • The basic reinforcement learning model consists of: • a set of environment states S • a set of actions A and • a set of scalar "rewards" • The learner takes actions in an environment so as to maximize its long-term reward • Any problem domain that can be cast as a Markov decision process can potentially benefit from this technique • Unlike supervised learning, reinforcement learning systems do not require explicit input-output pairs for training

  39. References • Bill Smart: Reinforcement Learning: A User's Guide. Department of Computer Science and Engineering, Washington University in St. Louis, ICAC 2005. http://www.cse.wustl.edu/~wds/ • Tom Mitchell: Machine Learning. McGraw Hill, 1997. • Harmon, M., Harmon, S.: Reinforcement Learning: A Tutorial. Wright State University, 2000. • Rich Sutton: Reinforcement Learning: Tutorial. AT&T Labs. http://www.cs.ualberta.ca/~sutton/Talks/RL-Tutorial/RL-Tutorial.ppt • Kaelbling, L.P., Littman, M.L., and Moore, A.W.: “Reinforcement learning: A survey”. Journal of Artificial Intelligence Research, 4, 1996. • http://en.wikipedia.org/wiki/Reinforcement_learning • Isbell, C., Shelton, C., Kearns, M., Singh, S., and Stone, P.: “A Social Reinforcement Learning Agent”. In Proceedings of the Fifth International Conference on Autonomous Agents (AGENTS), pages 377-384, 2001. Winner of Best Paper Award.

  40. Questions & Comments
