OceanStore: An Infrastructure for Global-Scale Persistent Storage. John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, Ben Zhao.
A few slides have been borrowed from the authors’ presentations.
Vision • What is OceanStore? • “a utility infrastructure to span the globe and provide continuous access to persistent information” (Source: Berkeley OceanStore Website)
Vision: data • all kinds of information • desktop, laptop, palmtop • cars, cellular phones, other devices • futuristic: embedded in environment
Vision: persistence • devices can be rebooted, lost, replaced • reliable, durable data (“deep archival” will last forever) • automatic maintenance
Vision: connectivity and availability • connectivity: even to the tiniest devices, possibly intermittent; variable bandwidth and latency • availability: uniform access, comparable to LAN-based networked storage; fault-tolerant, DoS-tolerant
Vision: scale • geographically distributed • 10^10 users • 10^14 files / objects
Questions about information • Where is persistent information stored? The 20th-century tie between location and content is outdated; in a world-scale system, locality is key • How is it protected? Can a disgruntled ISP employee sell your secrets? Can’t trust anyone (how paranoid are you?) • Can we make it indestructible? Want our data to survive “the big one”! Highly resistant to hackers (denial of service); wide-scale disaster recovery • Is it hard to manage? The worst failures are human-related; want automatic (introspective) diagnosis and repair
First Observation: Want Utility Infrastructure • Mark Weiser from Xerox: transparent computing is the ultimate goal; computers should disappear into the background • In the context of storage: • don’t want to worry about backup • don’t want to worry about obsolescence • need lots of resources to make data secure and highly available, BUT don’t want to own them • Outsourcing of storage is already becoming popular: pay a monthly fee and your “data is out there”
Utility-based Infrastructure • service provided by a confederation of companies • monthly fee paid to one service provider • companies buy and sell capacity from each other • [figure: providers labeled Canadian OceanStore, Sprint, AT&T, IBM, Pac Bell]
Target applications • email • group calendar, contacts • distributed design tools • Computer Supported Cooperative Work • digital libraries • distributed/shared repositories
Assumptions • untrusted infrastructure • a small number of servers may crash or leak information; most servers function correctly • a financially “responsible party” for the servers ensures integrity • but only clients are trusted with cleartext • nomadic data • data divorced from location, flows freely within the storage infrastructure • promiscuous caching: “anywhere, anytime” • location still important for performance • dynamic system tuning through introspection
System overview • persistent object • GUID: 160-bit SHA-1 hash • secure identification – globally unique and unforgeable • 2^80 unique objects before collisions (birthday paradox) • floating object replicas: independent of location • encrypted data • read • try fast probabilistic replica search (Bloom filter) • fall back to slower deterministic search (Tapestry) • write • update with predicates [as in Bayou – what is Bayou?] • creates new version
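The 2^80 figure on this slide follows from the birthday bound. A rough derivation, assuming GUIDs behave like uniformly random 160-bit values (the symbols N and k are introduced here for illustration and are not from the slides):

```latex
% N = number of possible GUIDs, k = number of objects stored
P_{\text{collision}}(k) \approx 1 - e^{-k(k-1)/(2N)}, \qquad N = 2^{160}
% The exponent reaches order 1 when k is about \sqrt{N}:
k \approx \sqrt{2^{160}} = 2^{80}
```

So a collision only becomes likely once roughly 2^80 objects exist, far beyond the 10^14 objects the system targets.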
What is Bayou? The Bayou system (Xerox PARC) is a platform of replicated, highly available, variable-consistency databases on which collaborative applications can be built. It caters to portable devices with intermittent connections.
System overview • application interface • sessions: sequence of reads/writes • session guarantees [Bayou] • consistency levels ranging from loose to ACID • active and archival forms • active: latest version, with update handle • archive: erasure-coded read-only version • dynamic optimization • object location • degree of replication
naming • self-certifying path names (Mazières) • object GUID = hash of owner key and human-readable name • create hierarchies using “directory” objects • read restriction • through client encryption of data • write restriction, access control • associate ACLs with objects, respected by servers
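A minimal sketch of the naming rule above, assuming the GUID is simply the SHA-1 hash of the owner's key followed by the readable name. The function name `object_guid`, the length prefix, and the sample key and path are illustrative assumptions, not the OceanStore encoding.

```python
import hashlib

def object_guid(owner_public_key: bytes, name: str) -> str:
    """Return a 160-bit GUID as 40 hex characters (illustrative encoding)."""
    h = hashlib.sha1()
    # Assumption: length-prefix the key so different (key, name) pairs
    # cannot produce the same concatenated byte string.
    h.update(len(owner_public_key).to_bytes(4, "big"))
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()

# Hypothetical owner key and path, for illustration only.
guid = object_guid(b"example-owner-public-key", "/home/alice/notes.txt")
print(guid, len(guid) * 4, "bits")
```

Because the GUID depends on the owner's key, anyone holding the key and the name can recompute and check it, which is what makes the name self-certifying.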
addressing • address an object by its GUID • message: GUID, random number, small predicate • route to closest GUID replica matching predicate • combines data location and routing: • no central name service to attack • save one round-trip for location discovery • routing • fast, probabilistic search algorithm • slow, deterministic search algorithm
routing • fast, probabilistic search algorithm • Bloom filter • probabilistic set membership test using a bit vector • each set element sets the bits selected by several hash functions • filter is the union (OR) of all elements’ bit vectors • attenuated Bloom filter • array of d Bloom filters • i-th Bloom filter is union of all <i-hop nodes • slow, deterministic algorithm • Tapestry
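The plain Bloom filter described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the OceanStore implementation: the class name, the 1024-bit size, the four hash functions, and the use of salted SHA-1 are all choices made here.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit vector

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting SHA-1 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # OR of the bit vectors, as used to summarize several nodes.
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged

local = BloomFilter()
local.add(b"guid-A")
neighbor = BloomFilter()
neighbor.add(b"guid-B")
summary = local.union(neighbor)
print(summary.might_contain(b"guid-A"), summary.might_contain(b"guid-B"))
```

An attenuated Bloom filter would keep an array of such filters, one per hop distance, built with `union` over the filters advertised by neighbors.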
addressing and routing • [figure: deterministic vs. probabilistic routing]
updates • Updates based on versioning and conflict resolution • i.e. no locking • update: actions with predicates • commit – apply action of first true predicate • abort – no true predicates • conflict resolution on encrypted data • possible predicates: • compare-version, compare-size, compare-block, search • possible actions: • replace-block, insert-block, delete-block, append
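The commit/abort rule above can be sketched as a list of (predicate, action) pairs evaluated in order. This is a simplified illustration: it works on plaintext blocks and mutates the object in place, whereas the slides describe predicates evaluated over encrypted data and a new version created per update. All names here (`DataObject`, `apply_update`, and the helper functions) are hypothetical.

```python
from typing import Callable, List, Tuple

class DataObject:
    def __init__(self, blocks):
        self.version = 1
        self.blocks = list(blocks)

Predicate = Callable[[DataObject], bool]
Action = Callable[[DataObject], None]

def compare_version(expected: int) -> Predicate:
    return lambda obj: obj.version == expected

def compare_size(expected: int) -> Predicate:
    return lambda obj: len(obj.blocks) == expected

def replace_block(index: int, data: bytes) -> Action:
    def act(obj: DataObject) -> None:
        obj.blocks[index] = data
    return act

def append(data: bytes) -> Action:
    def act(obj: DataObject) -> None:
        obj.blocks.append(data)
    return act

def apply_update(obj: DataObject, update: List[Tuple[Predicate, Action]]) -> bool:
    """Commit the action of the first true predicate; abort if none holds."""
    for predicate, action in update:
        if predicate(obj):
            action(obj)
            obj.version += 1  # a committed update yields a new version
            return True
    return False  # abort: no predicate was true

obj = DataObject([b"a", b"b"])
update = [
    (compare_version(1), replace_block(0, b"A")),
    (compare_size(2), append(b"c")),
]
first = apply_update(obj, update)   # version matches: replace block 0
second = apply_update(obj, update)  # version stale, size matches: append
third = apply_update(obj, update)   # neither predicate holds: abort
print(first, second, third, obj.version, obj.blocks)
```

No lock is taken at any point: a client whose view is stale simply sees its predicate fail and falls through to the next pair or aborts.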
archival • produced when objects idle • use erasure codes (redundant fragmentation) • simplest example: parity bit • need any (n-1) out of n fragments • interleaved Reed-Solomon codes, Tornado codes • fragmentation improves reliability • “deep archival storage” • sweeper processes ensure replication sustained over time • fragmentation improves performance
Erasure Codes • simple parity bits or, more generally, Reed-Solomon codes can be used to implement them
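The parity example from the archival slide can be sketched with XOR: one parity fragment is added to the data fragments, and any single missing fragment can be rebuilt from the remaining n-1. This shows only the simplest case, not the Reed-Solomon or Tornado codes named in the slides; the function names and equal-length fragments are assumptions made here.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_fragments):
    """Append one XOR parity fragment to equal-length data fragments."""
    parity = reduce(xor_bytes, data_fragments)
    return list(data_fragments) + [parity]

def recover(fragments):
    """Rebuild the single fragment marked as None from the other n-1."""
    missing = fragments.index(None)
    present = [f for f in fragments if f is not None]
    restored = list(fragments)
    # XOR of all surviving fragments equals the missing one.
    restored[missing] = reduce(xor_bytes, present)
    return restored

coded = encode([b"ocea", b"nsto", b"re!!"])  # 3 data fragments + 1 parity
damaged = list(coded)
damaged[1] = None                            # lose one fragment
print(recover(damaged) == coded)
```

Parity tolerates only one lost fragment; codes such as Reed-Solomon generalize this so that the object survives the loss of many fragments.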
Floating Replica and Deep Archival Coding • [figure: three floating replicas, each holding a full copy, a version list (Ver1: 0x34243, Ver2: 0x49873, Ver3: …) and conflict resolution logs; a floating replica is archived as erasure-coded fragments]
dynamic optimization (introspection) • observation modules • collect and summarize information • incrementally update system database • optimization modules • periodically process the observation database • cluster recognition: group related objects • replica management: maintain replica number and location • periodic migration: work-home-work-home… • maintenance: routing, dissemination, availability, durability