Cross-Layer Scheduling in Cloud Systems Hilfi Alkaff, Indranil Gupta, Luke Leslie Department of Computer Science University of Illinois at Urbana-Champaign Distributed Protocols Research Group: http://dprg.cs.uiuc.edu
Inside a Datacenter: Networks Connecting Servers • Structured networks: Tree, Fat Tree [Leiserson 85], Clos [Dally 04], VL2 [Greenberg 09] • Unstructured networks and/or routing: Jellyfish [Singla 12]
SDN • Software-Defined Networking • For any end-host pair, multiple routes are available • The SDN controller helps choose one of these routes • Configures switches accordingly • Which route is the “best”?
SDNs and Applications • Which route is the “best”? • Our approach: the best network routes should be decided based on the application that is using the network • To minimize interference (and thus congestion) and to optimize bandwidth use • Today: SDN routes are selected in an application-agnostic way • But the application itself can help, by placing tasks at servers • Today: applications schedule tasks in a network-agnostic way, leading to poor bandwidth utilization • The SDN controller and the application scheduler should coordinate with each other • This is our cross-layer scheduling approach
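A minimal sketch of the cross-layer idea, assuming a simple bottleneck-bandwidth utility; the class names, the Route record, and the query interface below are illustrative placeholders, not the actual Hadoop/Storm/SDN code:

```python
from collections import namedtuple

# Hypothetical route record: hop count and currently available bandwidth.
Route = namedtuple("Route", ["hops", "available_bw"])

class RoutingLevel:
    """SDN side: knows the candidate routes for each server pair."""
    def __init__(self, route_table):
        self.route_table = route_table            # (src, dst) -> [Route, ...]

    def best_route_bw(self, src, dst):
        # Bandwidth of the best route currently available between two servers.
        return max(r.available_bw for r in self.route_table[(src, dst)])

class ApplicationLevel:
    """Application side: scores a candidate task placement by asking the
    routing level, instead of placing tasks network-agnostically."""
    def __init__(self, routing):
        self.routing = routing

    def placement_utility(self, flows):
        # Utility = bandwidth of the most bottlenecked flow in the placement.
        return min(self.routing.best_route_bw(s, d) for s, d in flows)

# Toy usage with made-up numbers.
table = {("A", "B"): [Route(2, 80), Route(3, 100)],
         ("B", "C"): [Route(2, 60)]}
app = ApplicationLevel(RoutingLevel(table))
print(app.placement_utility([("A", "B"), ("B", "C")]))   # -> 60
```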
Applications: Short Real-Time Analytics Jobs • Stream processing: Storm • Batch processing: MapReduce, Hadoop
Tasks [Figure: task graphs in Hadoop and Storm]
Tasks and Flows [Figure: tasks in Hadoop and Storm, with the network flows between them]
Challenges • Two large state spaces to explore 1. Set of possible routes for each end-to-end flow • Large numbers of flows and possible routes 2. Set of possible task-to-server placements • Large numbers of servers and tasks
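A rough, illustrative calculation (numbers are made up, not from the paper) of why these state spaces are too large to search exhaustively:

```python
# Back-of-the-envelope size of the two state spaces: each of T tasks may land
# on any of N servers, and each of F flows may use any of k candidate routes.
N_SERVERS, N_TASKS = 100, 20
K_ROUTES, N_FLOWS = 10, 19            # e.g. a 20-task pipeline has 19 flows

placements = N_SERVERS ** N_TASKS     # 100^20 = 10^40 possible placements
routings = K_ROUTES ** N_FLOWS        # 10^19 possible route assignments

print(f"placements ~ 10^{len(str(placements)) - 1}")
print(f"routings   ~ 10^{len(str(routings)) - 1}")
# Exhaustive search over either space is hopeless, motivating a heuristic.
```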
Our Strategy • To explore the state space, use simulated annealing • At the application-level scheduler • And separately at the routing (SDN) level • Simulated Annealing • A probabilistic approach • Avoids getting stuck in local optima by jumping away with some non-zero probability • The probability of jumping away decreases quickly over time (analogous to the annealing of steel)
Pre-computation • For all pairs of servers, pre-compute the k shortest paths • Store them in a hash table, indexed by server pair • Compact storage by merging overlapping routes (for a server pair) into a tree • Small in size and quick to compute • 1000 servers, k=10 • 50 M entries • After compaction, 6 MB • 3 minutes to generate
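A small sketch of the pre-computation step, assuming a NetworkX graph of the topology and using its shortest_simple_paths generator as a stand-in for the paper's k-shortest-path computation (the tree-based compaction of overlapping routes is omitted):

```python
import itertools
import networkx as nx   # assumed available; used here only for illustration

def precompute_k_shortest(graph, k=10):
    """Build the per-server-pair route table described above: a dict keyed by
    (src, dst) holding up to k shortest simple paths."""
    table = {}
    for src in graph.nodes:
        for dst in graph.nodes:
            if src == dst:
                continue
            paths = nx.shortest_simple_paths(graph, src, dst)   # Yen-style generator
            table[(src, dst)] = list(itertools.islice(paths, k))
    return table

# Toy usage on a 6-switch ring (hypothetical topology).
g = nx.cycle_graph(6)
routes = precompute_k_shortest(g, k=3)
print(routes[(0, 3)])   # the two simple paths between nodes 0 and 3
```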
When a Job Arrives • Don’t change the allocations or routes of existing jobs • Non-intrusive • Reduces the state space to explore • Simulated annealing is run offline, and the resultant schedule is used to place the new job’s tasks and flows • The primary simulated annealing (SA) runs at the application level • It calls the routing-level SA
Simulated Annealing Steps • Start from an arbitrary state • Tasks assigned to servers, and routes assigned to flows • Generate next-state S’ (at the application level) • De-allocate one task • Prefer tasks that affect the computation more, e.g., closer to the beginning or end of the topology • Allocate this task to a random server • Call the routing-level SA
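A hypothetical sketch of the application-level move described above; task names, servers, and the per-task importance weights are placeholders, and the real system's bias toward beginning/end tasks is not spelled out here:

```python
import random

def next_app_state(placement, servers, task_weights):
    """One application-level move: pick a task to de-allocate, biased toward
    tasks that matter more, then re-place it on a random server."""
    tasks = list(placement)
    weights = [task_weights[t] for t in tasks]
    victim = random.choices(tasks, weights=weights, k=1)[0]
    new_placement = dict(placement)
    new_placement[victim] = random.choice(servers)
    return new_placement

# Toy usage: three tasks, with the pipeline's endpoints weighted higher.
placement = {"src": "s1", "mid": "s2", "sink": "s3"}
print(next_app_state(placement, ["s1", "s2", "s3", "s4"],
                     {"src": 3, "mid": 1, "sink": 3}))
```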
Simulated Annealing Steps (2) … • Call the routing-level SA • (At the routing level) • De-path one route • Select a random server pair • Remove its worst path • Prefer a higher number of hops, and break ties by lower bandwidth • Allocate path: change this route to a better path • Prefer a lower number of hops, and break ties by higher bandwidth
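A small sketch of the two routing-level selection rules, assuming each candidate path is summarized as a (hops, available bandwidth) pair:

```python
def worst_path(paths):
    """'De-path' rule: drop the path with the most hops, breaking ties by
    lower available bandwidth. Each path is a (hops, available_bw) tuple."""
    return max(paths, key=lambda p: (p[0], -p[1]))

def best_path(paths):
    """'Allocate path' rule: prefer fewer hops, break ties by higher bandwidth."""
    return min(paths, key=lambda p: (p[0], -p[1]))

candidates = [(4, 100), (2, 40), (2, 90)]   # made-up (hops, bandwidth) pairs
print(worst_path(candidates))               # (4, 100): most hops, removed first
print(best_path(candidates))                # (2, 90): fewest hops, then highest bw
```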
Simulated Annealing Steps (3) • After generating next-state S’ • Calculate utility(S’) • The utility function considers all jobs in the cluster (not just the new job) • The utility function accounts for bottlenecked paths from source tasks to sink tasks • If utility(S’) > utility(current state) • Transition from the current state to S’ • If utility(S’) ≤ utility(current state) • Transition with probability e^((utility(S’) − utility(current state)) / t) • Non-zero probability of transitioning even if S’ is a worse state • The probability decreases over time as the temperature t is lowered • Wait until convergence • Re-run the entire simulated annealing 5 times, and take the best result
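A compact sketch of the acceptance rule and cooling loop described above; the next_state and utility hooks stand in for the application- and routing-level moves from the previous slides:

```python
import math
import random

def accept(curr_utility, new_utility, temperature):
    """Acceptance rule from the slide: always move to a better state; move to a
    worse (or equal) one with probability exp((u(S') - u(S)) / t)."""
    if new_utility > curr_utility:
        return True
    return random.random() < math.exp((new_utility - curr_utility) / temperature)

def anneal(initial_state, next_state, utility, t0=1.0, cooling=0.95, steps=200):
    """Illustrative cooling loop; the fixed geometric decay is an assumption."""
    state, temp = initial_state, t0
    for _ in range(steps):
        candidate = next_state(state)
        if accept(utility(state), utility(candidate), temp):
            state = candidate
        temp *= cooling                      # temperature decays over time
    return state

# Toy usage: maximize -(x - 3)^2 by random +/-1 steps; converges near x = 3.
random.seed(0)
print(anneal(0, lambda x: x + random.choice((-1, 1)), lambda x: -(x - 3) ** 2))
```

The actual utility function, convergence test, and cooling schedule in the system are richer than this fixed decay; the sketch only mirrors the acceptance rule stated on the slide.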
Experiments • Implemented in Apache Hadoop (YARN) • Implemented in Apache Storm • Deployment experiments on Emulab: up to 30 hosts • Emulated the network using ZeroMQ and Thrift • Emulated Fat-Tree and Jellyfish topologies • Larger-scale simulation experiments • Up to 1000 hosts
Experimental Settings • 10 hosts, 100 Mbps, 5 links per router, #links selected via scaling rules • 3 GHz, 2 GB RAM • Hadoop cluster workload • Facebook’s SWIM benchmark • Shuffle size ranges from 100 B to 10 GB • 1 job per second • Storm cluster workload: random tree topologies • Topologies constructed randomly, with each node’s number of children drawn from a Gaussian (mean = s.d. = 2) • 100 B tuples • Each source generates 1 MB – 100 MB of data • 10 jobs per minute • Each experimental run is 10 minutes
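A hypothetical generator for the random Storm tree topologies described above (the depth cap and the truncation of the child count at zero are assumptions made for illustration):

```python
import random

def random_task_tree(max_depth=4, mean_children=2, sd_children=2):
    """Generate a random tree whose per-node child count is drawn from a
    Gaussian (mean = sd = 2), truncated at zero."""
    tree, next_id = {0: []}, 1
    frontier, depth = [0], 0
    while frontier and depth < max_depth:
        next_frontier = []
        for node in frontier:
            n_children = max(0, round(random.gauss(mean_children, sd_children)))
            for _ in range(n_children):
                tree[node].append(next_id)
                tree[next_id] = []
                next_frontier.append(next_id)
                next_id += 1
        frontier, depth = next_frontier, depth + 1
    return tree

random.seed(1)
print(random_task_tree())   # adjacency map: node -> list of children
```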
Storm on Jellyfish Topology • App+Routing SA: 34.1% improvement in throughput at 30 hosts • Application-only SA: 21.2% • Routing-only SA: 23.2% • Performance improves with scale
Hadoop on Fat-Tree Topology • App+Routing SA: 26% improvement in throughput at 30 hosts • Application-only SA and Routing-only SA alone give smaller improvements than combining both • Performance improves with scale
Other Experimental Results • Similar results for other combinations • Hadoop on Jellyfish • App+Routing SA: 31.9% improvement in throughput at 30 hosts • Performance improves with scale • Application-only SA: 18.8% • Routing-only SA: 25.5% • Storm on Fat-Tree • App+Routing SA: 30% improvement in throughput at 30 hosts • Performance improves with scale • Application-only SA: 21.1% • Routing-only SA: 22.7%
Other Experimental Results (2) • Scheduling time is small • Time to schedule a new job in a 1000-server cluster • Fat-Tree: 0.48 s (Hadoop) to 0.53 s (Storm) • Jellyfish: 0.67 s (Hadoop) to 0.74 s (Storm) • No starvation • Worst-case degradation in completion time for any job is 20% in Hadoop, 30% in Storm • Outliers are large jobs (rare in real-time analytics with short jobs) • Fault recovery is fast • Upon failure, re-run simulated annealing once • Recovery occurs within 0.35 s to 0.4 s
Takeaways • Today: application schedulers and the SDN scheduler are disjoint • Leads to suboptimal placement and routing • Our approach: coordinated cross-layer scheduling • Explore small state spaces • Use simulated annealing • At 30 hosts, gives 26% to 34% improvement in throughput for Hadoop and Storm on both structured and unstructured networks • We expect other networks to fall between these two numbers • Overheads are small, and the improvement gets better with scale Distributed Protocols Research Group: http://dprg.cs.uiuc.edu
Ongoing/Future Work Our work opens the door: • Explore other heuristics, e.g., data affinity for tasks, congestion • Explore other non-SA approaches • Available bandwidth estimation • OpenFlow integration • Batching multiple jobs into scheduling Distributed Protocols Research Group: http://dprg.cs.uiuc.edu