The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica
Background • Clusters of commodity servers have become a major computing platform in industry and academia • Driven by data volumes outpacing the processing capabilities of single machines • Democratized by cloud computing
Background • Some have declared that “the datacenter is the new computer” • Claim: this new computer increasingly needs an operating system • Not necessarily a new host OS, but a common software layer that manages resources and provides shared services for the whole datacenter, like an OS does for one host
Why Datacenters Need an OS • Growing number of applications • Parallel processing systems: MapReduce, Dryad, Pregel, Percolator, Dremel, MR Online • Storage systems: GFS, BigTable, Dynamo, SCADS • Web apps and supporting services • Growing number of users • 200+ for Facebook’s Hadoop data warehouse, running near-interactive ad hoc queries
What Operating Systems Provide • Resource sharing across applications & users • Data sharing between programs • Programming abstractions (e.g. threads, IPC) • Debugging facilities (e.g. ptrace, gdb) Result: OSes enable a highly interoperable software ecosystem that we now take for granted
An Analogy • Today, a scientist analyzing data on a single machine can pipe it through a variety of tools, write new tools that interface with these through standard APIs, and trace across the stack • In the future, the scientist should be able to fire up a cloud on EC2 and do the same thing: • Intermix a variety of apps & programming models • Write new parallel programs that talk to these • Get a unified interface for managing the cluster • Debug and trace across all these components
Today’s Datacenter OS • Hadoop MapReduce as common execution and resource sharing platform • Hadoop InputFormat API for data sharing • Abstractions for productivity programmers, but not for system builders • Very challenging to debug across all the layers
Tomorrow’s Datacenter OS • Resource sharing: • Lower-level interfaces for fine-grained sharing (Mesos is a first step in this direction) • Optimization for a variety of metrics (e.g. energy) • Integration with network scheduling mechanisms (e.g. Seawall [NSDI ‘11], NOX, Orchestra)
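As a rough illustration of what "fine-grained sharing" could mean, the sketch below hands out individual CPU slots to whichever framework currently runs the fewest tasks, instead of statically partitioning nodes between frameworks. Everything here (`Node`, `Framework`, `allocate`, the fewest-running-tasks policy) is a hypothetical toy for this slide, not the Mesos API or its allocation policy.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    free_cpus: int  # number of one-CPU task slots still available


@dataclass
class Framework:
    name: str
    pending_tasks: int  # tasks waiting for a slot
    running: List[str] = field(default_factory=list)  # node name per running task


def allocate(nodes: List[Node], frameworks: List[Framework]) -> List[Tuple[str, str]]:
    """Assign free slots one task at a time (assumed policy: fewest running tasks first)."""
    placements = []
    for node in nodes:
        while node.free_cpus > 0:
            waiting = [f for f in frameworks if f.pending_tasks > 0]
            if not waiting:
                return placements  # nobody needs resources; leave the rest free
            chosen = min(waiting, key=lambda f: len(f.running))
            chosen.pending_tasks -= 1
            chosen.running.append(node.name)
            node.free_cpus -= 1
            placements.append((chosen.name, node.name))
    return placements


nodes = [Node("n1", 2), Node("n2", 2)]
frameworks = [Framework("mapreduce", 3), Framework("graph", 1)]
placements = allocate(nodes, frameworks)
for framework_name, node_name in placements:
    print(f"{framework_name} -> {node_name}")
```

In this toy run both frameworks end up with a task on `n1`, which is the point: sharing happens at the granularity of tasks within a node rather than whole machines.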
Tomorrow’s Datacenter OS • Data sharing: • Standard interfaces for cluster file systems, key-value stores, etc. • In-memory data sharing (e.g. Spark, DFS cache), and a unified system to manage this memory • Streaming data abstractions (analogous to pipes) • Lineage instead of replication for reliability (RDDs)
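The "lineage instead of replication" bullet can be illustrated with a minimal sketch: each derived dataset remembers the function and parent it was computed from, so a lost in-memory partition is rebuilt by re-running that function rather than being read from a replica. The `Dataset` class and its methods are hypothetical names chosen for this example and are not Spark's RDD API.

```python
from typing import Callable, Dict, List, Optional


class Dataset:
    """A partitioned dataset that records how each partition is derived."""

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[int]],
                 parent: Optional["Dataset"] = None):
        self.num_partitions = num_partitions
        self._compute = compute          # lineage: partition index -> records
        self.parent = parent             # dataset this one was derived from
        self._cache: Dict[int, List[int]] = {}  # in-memory partitions
        self.computations = 0            # how many times a partition was (re)built

    @staticmethod
    def from_source(partitions: List[List[int]]) -> "Dataset":
        # Stands in for stable storage such as a cluster file system.
        return Dataset(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn: Callable[[int], int]) -> "Dataset":
        parent = self
        return Dataset(self.num_partitions,
                       lambda i: [fn(x) for x in parent.get(i)],
                       parent)

    def get(self, i: int) -> List[int]:
        # Rebuild from lineage when the partition is not in memory.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
            self.computations += 1
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one cached partition.
        self._cache.pop(i, None)


source = Dataset.from_source([[1, 2], [3, 4]])
squared = source.map(lambda x: x * x)

before = squared.get(1)      # computed once and cached
squared.lose_partition(1)    # "failure": cached copy is gone
after = squared.get(1)       # recomputed from the parent via lineage
print(before, after, squared.computations)
```

Only the lost partition is recomputed, and no second copy of the derived data is ever stored; the cost is recomputation time after a failure.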
Tomorrow’s Datacenter OS • Programming abstractions: • Tools that can be used to build the next MapReduce / BigTable in a week (e.g. BOOM) • Efficient implementations of communication primitives (e.g. shuffle, broadcast) • New distributed programming models
Tomorrow’s Datacenter OS • Debugging facilities: • Tracing and debugging tools that work across the cluster software stack (e.g. X-Trace, Dapper) • Replay debugging that takes advantage of limited languages / computational models • Unified monitoring infrastructure and APIs
Putting it All Together • A successful datacenter OS might let users: • Build a Hadoop-like software stack in a week using the OS’s abstractions, while gaining other benefits (e.g. cross-stack replay debugging) • Share data efficiently between independently developed programming models and applications • Understand cluster behavior without having to log into individual nodes • Dynamically share the cluster with other users
Conclusion • Datacenters need an OS-like software stack for the same reasons single computers did: manageability, efficiency & programmability • An OS is already emerging in an ad-hoc way • Researchers can help by taking a long-term approach towards these problems
How Researchers Can Help • Focus on paradigms, not performance • Industry is tackling performance but lacks the luxury of taking a long-term view of abstractions • Explore clean-slate approaches • Likelier to have impact here than in a “real” OS because datacenter software changes quickly! • Bring cluster computing to non-experts • Much harder and more rewarding than serving big users