Reinforcement Learning • Svetlana Lockwood • Washington State University • CptS 540, Fall 2010
Background • Dates back to the early days of cybernetics • Goal: to program agents by reward and punishment without needing to specify how the task is to be achieved • BUT • In general, an RL agent faces a Markov Decision Problem (MDP) • Formidable computational obstacles
Introduction • Informal definition: an agent must learn behavior through trial-and-error interactions with a dynamic environment • Two main approaches: • Search in the space of behaviors • this is the approach mainly taken in genetic algorithms and genetic programming • Use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world
Reinforcement Learning (RL) • Differs from supervised learning: • no input/output pairs • the agent is told the immediate reward and the subsequent state, but is not told which action would have been best for its long-term interests • to act optimally, the agent must gather experience about possible system states, actions, transitions and rewards • Another important difference: • system evaluation is often concurrent with learning • does not require a predefined model of state-action transitions
Formal Overview • [Figure: agent-environment loop; the agent (Spirit) takes action a in the environment (Mars), which returns state s and reward r] • Function R defines the reward r; function I defines how the agent sees the world, i.e. full or partial observability
Formal Definition • Formally, the model consists of • a discrete set of environment states, S • a discrete set of agent actions, A • a set of scalar reinforcement signals (r), typically {0, 1} but possibly real numbers • The agent's job is to find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement • The environment is generally non-deterministic, i.e. taking the same action in the same state at different times may lead to different next states
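To make the pieces concrete, here is a minimal Python sketch of this model: a toy environment supplying states from S and scalar rewards r, and a policy mapping states to actions. The GridEnv class, its methods, and the random policy are illustrative inventions, not part of the original slides.

# Minimal sketch of the agent-environment loop described above.
# GridEnv and random_policy are illustrative placeholders.
import random

class GridEnv:
    """Toy environment: states 0..4, reaching state 4 pays reward 1."""
    def __init__(self):
        self.states = range(5)          # S: discrete set of states
        self.actions = [-1, +1]         # A: discrete set of actions
        self.state = 0

    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0   # scalar reinforcement signal
        return self.state, reward

def random_policy(state):
    """A policy pi maps states to actions; here it simply acts randomly."""
    return random.choice([-1, +1])

env = GridEnv()
s = env.state
for t in range(20):
    a = random_policy(s)       # agent picks an action
    s, r = env.step(a)         # environment returns next state and reward
    print(t, a, s, r)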
Models of Optimal Behavior • Three major models: • finite-horizon model: optimize the expected reward over the next h steps, E[Σ_{t=0}^{h} r_t] • infinite-horizon discounted model: takes long-term rewards into account, but discounts them geometrically with 0 < γ < 1; mathematically tractable: E[Σ_{t=0}^{∞} γ^t r_t] • average-reward model: optimize the long-run average reward, lim_{h→∞} E[(1/h) Σ_{t=0}^{h-1} r_t]
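The three criteria can be illustrated on a short reward sequence; the numbers below are made up purely for demonstration.

# Computing the three optimality measures for a sample reward sequence.
rewards = [0, 1, 0, 0, 1, 1, 0, 1]
gamma = 0.9   # discount factor, 0 < gamma < 1

finite_horizon = sum(rewards[:5])                               # first h = 5 steps
discounted = sum(gamma**t * r for t, r in enumerate(rewards))   # geometric discounting
average = sum(rewards) / len(rewards)                           # average-reward model

print(finite_horizon, round(discounted, 3), round(average, 3))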
Exploitation versus Exploration: The Single-State Case • k-armed bandit problem: the agent is in a room with a collection of k gambling machines and is permitted a fixed number of pulls, h. Any arm may be pulled, with payoff 1 or 0 according to some unknown probability distribution. There is no penalty for pulling an arm; the only cost is in wasting a pull. What should the agent's strategy be? • Approaches: • Dynamic-programming approach • Gittins allocation indices • Learning automata
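A toy simulator of this setup; the class name and the particular payoff probabilities are illustrative assumptions, not taken from the slides.

# Minimal k-armed Bernoulli bandit simulator.
import random

class BernoulliBandit:
    def __init__(self, payoff_probs):
        self.payoff_probs = payoff_probs       # unknown to the agent

    def pull(self, arm):
        """Pull an arm; payoff is 1 with that arm's probability, else 0."""
        return 1 if random.random() < self.payoff_probs[arm] else 0

bandit = BernoulliBandit([0.2, 0.5, 0.8])      # k = 3 arms
print([bandit.pull(arm) for arm in (0, 1, 2)])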
Dynamic-Programming Approach • If the agent will live for h steps, we can use Bayesian reasoning; this requires a prior joint probability distribution over the arms' payoffs • Belief state {n1, w1, …, nk, wk}: arm i has been pulled ni times, with wi payoffs of 1 • V*(n1, w1, …, nk, wk): the maximum expected remaining reward • If no pulls remain (all h pulls have been used), the remaining reward is 0; this is the basis for a recursive definition • If we know the value of all belief states with t pulls remaining, we can compute the value of any belief state with t+1 pulls remaining • Expense: linear in |S| × |A|, thus exponential in the horizon
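A rough Python sketch of this recursion, assuming independent uniform (Beta(1,1)) priors on each arm's payoff probability; the slides only say that some prior distribution is required, so the prior and function names here are assumptions.

# Recursive value of a bandit belief state under an optimal strategy.
from functools import lru_cache

@lru_cache(maxsize=None)
def v_star(belief, pulls_left):
    """belief = ((n1, w1), ..., (nk, wk)); returns max expected remaining reward."""
    if pulls_left == 0:                 # base case: no pulls remain, reward is 0
        return 0.0
    best = 0.0
    for i, (n, w) in enumerate(belief):
        p = (w + 1) / (n + 2)           # posterior mean payoff of arm i (uniform prior)
        win  = belief[:i] + ((n + 1, w + 1),) + belief[i + 1:]
        lose = belief[:i] + ((n + 1, w),)     + belief[i + 1:]
        value = p * (1 + v_star(win, pulls_left - 1)) + \
                (1 - p) * v_star(lose, pulls_left - 1)
        best = max(best, value)
    return best

# Value of a fresh 2-armed bandit with 3 pulls remaining:
print(v_star(((0, 0), (0, 0)), 3))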
Other approaches to the k-armed bandit • Greedy strategy: always choose the arm with the highest estimated expected payoff • Randomized strategy: with probability p take a random action, otherwise take the action with the best estimated expected reward; start with a large p to encourage initial exploration, then slowly decrease it
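A sketch of the randomized strategy (commonly called epsilon-greedy) with a decaying exploration probability p; the payoff probabilities and decay schedule are illustrative assumptions.

# Randomized (epsilon-greedy) bandit strategy with decaying exploration.
import random

payoff_probs = [0.2, 0.5, 0.8]          # unknown to the agent
pulls = [0] * len(payoff_probs)         # times each arm was pulled
wins  = [0] * len(payoff_probs)         # payoffs of 1 observed per arm

p = 1.0                                 # exploration probability, decays over time
for t in range(1000):
    if random.random() < p:
        arm = random.randrange(len(payoff_probs))            # explore: random arm
    else:
        estimates = [wins[i] / pulls[i] if pulls[i] else 0.0
                     for i in range(len(payoff_probs))]
        arm = estimates.index(max(estimates))                 # exploit best estimate
    reward = 1 if random.random() < payoff_probs[arm] else 0
    pulls[arm] += 1
    wins[arm] += reward
    p = max(0.05, p * 0.995)            # slowly decrease p

print(pulls, wins)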
RL: general case • In the general case of the reinforcement learning problem with multiple states • the agent's actions determine not only its immediate reward, but also the next state of the environment. • Such environments can be thought of as networks of k-bandit problems.
Q-Learning • A value-iteration-style approach • Goal: to learn a state-action value function Q(s, a), the expected return of taking action a in state s; the agent then acts by choosing the action with the highest Q-value • Update rule: Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)] • Complexity: quadratic in |S| and linear in |A|
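A minimal tabular Q-learning sketch on a toy chain environment; the environment, learning rate alpha, discount gamma, and exploration schedule are illustrative assumptions, not from the slides.

# Tabular Q-learning on a 5-state chain; state 4 is the goal.
import random
from collections import defaultdict

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

Q = defaultdict(float)                  # Q[(state, action)] -> estimated return

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + action))
    reward = 1 if next_state == GOAL else 0
    return next_state, reward

for episode in range(300):
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection with random tie-breaking
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            best = max(Q[(s, act)] for act in ACTIONS)
            a = random.choice([act for act in ACTIONS if Q[(s, act)] == best])
        s_next, r = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = r + gamma * max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

print({k: round(v, 2) for k, v in sorted(Q.items())})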
Applications • An active area of research • We have only scratched the surface here • Some applications include: • Military • Robotics • Visual image and speech processing • Etc.
"David is 11 years old. He weighs 60 pounds. He is 4 feet, 6 inches tall. He has brown hair. His love is real. But he is not." (closing slide: tagline of the Steven Spielberg film Artificial Intelligence)