UAV Route Planning in Delay Tolerant Networks
Daniel Henkel, Timothy X Brown
University of Colorado, Boulder
Infotech @ Aerospace ‘07, May 8, 2007
Familiar: Dial-A-Ride
Dial-A-Ride: curb-to-curb, shared-ride transportation service
• Receive calls
• Pick up and drop off passengers
• Minimize overall transit time
Even for a single bus, the optimal route is not trivial!
In context: Dial-A-UAV
Complication: effectively unlimited data at the sensors and potentially two-way traffic; the traffic is delay tolerant!
Related talk tomorrow, 8am: Sensor Data Collection
(Figure: UAV ferrying data between Sensor-1 through Sensor-6 and a Monitoring Station.)
• Sparsely distributed sensors with limited radios
• TSP solution is not optimal
• Our approach: queueing theory and MDP theory
TSP’s Problem
Traveling Salesman solution: one cycle visits every node.
• Problem: far-away nodes with little data to send
• Better: visit them less often
(Figure: UAV serving nodes A and B from a hub, with visit probabilities pA, pB, distances dA, dB, and flow rates fA, fB.)
New: the cycle is defined by the visit frequencies pi
Queueing Approach
Goal: minimize average delay.
Idea: express the delay in terms of the pi, then minimize over the set {pi}.
• pi as a probability distribution over the nodes
• Expected service time of any packet
• Inter-service time: exponential distribution with mean Ti/pi
• Weighted delay: see the expression sketched below
(Figure: UAV serving nodes A, B, C, D from a hub, with visit probabilities pi, distances di, and flow rates fi.)
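The weighted-delay expression itself is not shown in the text above. A plausible reconstruction, assuming fi is the traffic rate of node i, Ti the hub-to-node-i round-trip time, and that a packet at node i waits one inter-service time Ti/pi on average, weights each node's delay by its share of the total traffic:

```latex
% Hedged reconstruction of the weighted-delay objective (not copied from the slide).
\bar{D}(p_1,\dots,p_N) \;=\; \sum_{i=1}^{N} \frac{f_i}{\sum_{j} f_j}\cdot\frac{T_i}{p_i},
\qquad \text{subject to}\quad \sum_{i=1}^{N} p_i = 1 .
```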
Solution and Algorithm
Probability of choosing node i for the next visit: pi (one consistent reconstruction is sketched below)
Implementation: deterministic algorithm
1. Set ci = 0 for all i
2. Repeat ci = ci + pi (for all i) while max{ci} < 1
3. k = argmaxi {ci}
4. Visit node k; set ck = ck − 1
5. Go to step 2
Performance improvement over TSP!
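Minimizing the weighted-delay objective above under the constraint Σ pi = 1 (e.g. by a Lagrange-multiplier argument) suggests pi proportional to sqrt(fi·Ti); this square-root form is a reconstruction consistent with the slide, not a quote from it. The Python sketch below implements both that visit distribution and the deterministic credit-counter schedule from steps 1-5; the flow_rates and travel_times inputs are hypothetical:

```python
import math

def visit_probabilities(flow_rates, travel_times):
    """Hedged reconstruction: p_i proportional to sqrt(f_i * T_i), which
    minimizes sum_i (f_i/F) * T_i/p_i subject to sum_i p_i = 1."""
    weights = [math.sqrt(f * t) for f, t in zip(flow_rates, travel_times)]
    total = sum(weights)
    return [w / total for w in weights]

def visit_schedule(probabilities, num_visits):
    """Deterministic algorithm from the slide: accumulate credits c_i += p_i
    until some c_i reaches 1, visit that node, then subtract 1 from its credit."""
    credits = [0.0] * len(probabilities)
    schedule = []
    for _ in range(num_visits):
        while max(credits) < 1.0:                              # step 2
            credits = [c + p for c, p in zip(credits, probabilities)]
        k = max(range(len(credits)), key=credits.__getitem__)  # step 3
        schedule.append(k)                                     # step 4: visit node k
        credits[k] -= 1.0                                      # c_k = c_k - 1
    return schedule

# Example with two nodes A and B (hypothetical numbers):
p = visit_probabilities(flow_rates=[4.0, 1.0], travel_times=[1.0, 2.0])
print(p)                      # node A gets a larger visit probability
print(visit_schedule(p, 10))  # node A is visited more often than node B
```

With these example numbers node A is visited roughly 59% of the time rather than the 50% a plain two-node cycle would give, which illustrates the kind of improvement over TSP the slide claims.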
Unknown Environment
• What is RL (reinforcement learning)?
• Learning what to do without prior training
• Given: a high-level goal; NOT given: how to reach it
• Improving actions on the go
• Distinguishing features:
• Interaction with the environment
• Trial-and-error search
• Concept of rewards and punishments
• Example: training a dog
The agent learns a model of the environment.
The Framework
Agent
• Performs actions
Environment
• Gives rise to rewards
• Puts the agent in situations called states
Elements of RL
Policy, Reward, Value, Model of Environment
• Policy: what to do (depending on the state)
• Reward: what is good
• Value: what is good because it predicts reward
• Model: what follows what
Source: Sutton, Barto, Reinforcement Learning – An Introduction, MIT Press, 1998
UA Path Planning - Simple
Goal: minimize average delay -> find pA and pB
• Service traffic from nodes A and B to the hub H
• Goal: minimize average packet delay
• State: traffic waiting at the nodes, (tA, tB)
• Actions: fly to A; fly to B
• Reward: number of packets delivered
• Optimal policy: how often to visit A and B; this depends on the flow rates and distances
(Figure: UAV serving nodes A and B from the hub, with visit probabilities pA, pB, distances dA, dB, and flow rates fA, fB.)
A minimal code sketch of this formulation follows below.
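As a concrete illustration of this formulation (not code from the talk), the sketch below models the two-node problem as a small simulated environment. The arrival rates, travel times, and queue cap are hypothetical placeholders:

```python
import random

class TwoNodeUAVEnv:
    """Toy environment for the slide's formulation: state = packets waiting
    at (A, B), actions = {fly to A, fly to B}, reward = packets delivered.
    Arrival rates, travel times, and the queue cap are hypothetical."""

    def __init__(self, arrival_rates=(0.8, 0.2), travel_times=(1, 2), max_queue=10):
        self.arrival_rates = arrival_rates  # expected packets per time unit at A, B
        self.travel_times = travel_times    # round trip: hub -> node -> hub
        self.max_queue = max_queue
        self.state = (0, 0)                 # (t_A, t_B): traffic waiting at each node

    def step(self, action):
        """action: 0 = fly to A, 1 = fly to B. Returns (next_state, reward)."""
        queues = list(self.state)
        elapsed = self.travel_times[action]
        # Traffic arrives at both nodes while the UAV is in transit.
        for i, rate in enumerate(self.arrival_rates):
            arrivals = sum(random.random() < rate for _ in range(elapsed))
            queues[i] = min(queues[i] + arrivals, self.max_queue)
        # The visited node's queue is picked up and delivered to the hub.
        reward = queues[action]
        queues[action] = 0
        self.state = tuple(queues)
        return self.state, reward
```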
MDP
• If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP).
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
• the state and action sets
• one-step “dynamics” defined by transition probabilities: $P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$
• reward expectation: $R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$
RL approach to solving MDPs
• Policy: mapping from the set of states to the set of actions, π : S → A
• Sum of rewards (:= return) from this time onwards: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
• Value function (of a state): expected return when starting in s and following policy π. For an MDP, $V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \}$
Bellman Equation for Policy π
• Evaluating E{·} and assuming a deterministic policy π gives the solution $V^\pi(s) = \sum_{s'} P^{\pi(s)}_{ss'} \left[ R^{\pi(s)}_{ss'} + \gamma V^\pi(s') \right]$
• Action-value function: the value of taking action a in state s. For an MDP, $Q^\pi(s,a) = E_\pi\{ R_t \mid s_t = s, a_t = a \} = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
Optimality
• Since V and Q are real-valued, they induce a partial ordering on policies: $\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all states s.
• Concept of V* and Q*: $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s,a) = \max_\pi Q^\pi(s,a)$
• Concept of π*: a policy that maximizes Qπ(s,a) for all states s, i.e. $\pi^*(s) = \arg\max_a Q^*(s,a)$.
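One standard way to make V* and Q* concrete, following the Sutton and Barto notation already used above (these equations are not reproduced on the slide), is through the Bellman optimality equations:

```latex
% Bellman optimality equations in the P^a_{ss'}, R^a_{ss'} notation above.
V^*(s)   = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]
Q^*(s,a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') \right]
```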
Reinforcement Learning - Methods
• To find π*, all methods try to evaluate the V/Q value functions
• Different approaches:
• Dynamic programming
• Policy evaluation, improvement, and iteration
• Monte Carlo methods
• Decisions are based on averaging sample returns
• Temporal difference methods (!!); a tabular Q-learning sketch follows below
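Temporal difference learning is the approach the deck emphasizes; as a hedged illustration (not the authors' implementation), the sketch below runs tabular Q-learning against an environment such as the TwoNodeUAVEnv sketched earlier. The step count and learning parameters are arbitrary placeholders:

```python
from collections import defaultdict
import random

def q_learning(env, actions=(0, 1), steps=20000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning, a temporal-difference method.
    `env` is any object with a `state` attribute and a `step(action)` method
    returning (next_state, reward), e.g. the TwoNodeUAVEnv sketch above."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    s = env.state
    for _ in range(steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r = env.step(a)
        # TD update toward the one-step bootstrapped target
        target = r + gamma * max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q

# Hypothetical usage with the earlier environment sketch:
# env = TwoNodeUAVEnv()
# Q = q_learning(env)
# greedy_policy = {s: max((0, 1), key=lambda a: Q[(s, a)]) for (s, _) in Q}
```

The learned greedy policy should visit node A more often than node B when A's flow rate is higher, matching the intuition from the queueing analysis earlier in the deck.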