520 likes | 1.2k Views
Google Bigtable. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 2006. Google Scale. Lots of data
E N D
Google Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 2006
Google Scale • Lots of data • Copies of the web, satellite data, user data, email and USENET, Subversion backing store • Many incoming requests • No commercial system big enough • Couldn’t afford it if there was one • Might not have made appropriate design choices
Data model: a big map • <Row, Column,Timestamp> triple as key • <Row, Column,Timestamp> -> string • API has lookup, insert and delete operations based on the key • Does not support a relational model • No table-wide integrity constraints • No multirow transactions
Data Model Example: Zoo
Data Model Example: Zoo row keycol. keytimestamp
Data Model Example: Zoo row keycol. keytimestamp - (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs - (zebras, weight, 2006) --> 620 lbs
Data Model Example: Zoo row keycol. keytimestamp - (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs - (zebras, weight, 2006) --> 620 lbs Each key is sorted in Lexicographic order
Data Model Example: Zoo row keycol. keytimestamp - (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs - (zebras, weight, 2006) --> 620 lbs Timestamp ordering is defined as “most recent appears first”
Data Model Example: Web Indexing
Data Model Row
Data Model Columns
Data Model Cells
Data Model timestamps
Data Model Column family
Data Model Column family family:qualifier
Data Model Column family family:qualifier
Data Model - Timestamps • Used to store different version of data in a cell • New writes default to current time • Lookup options: • Return most recent K values • Return all values in the timestamp range
API Examples: Write/Modify atomic row modification No support for (RDBMS-style) multi-row transactions
API Examples: Read Return sets can be filtered using regular expressions: anchor: com.cnn.*
SSTable • The Google SSTable file format is used internally to store BigTable data • An SSTableprovides a persistent ordered immutable map from keys to values • Each SSTable contains a sequence of blocks • A block index (stored at the end of SSTable) is used to locate blocks • The index is loaded into memory when the SSTable is open • One disk access to get the block
SSTable Index (block ranges) 64KB Block 64KB Block 64KB Block …
Tablet • Contains some range of rows of the table • Built out of multiple SSTables Start:aardvark Tablet End:apple SSTable SSTable 64K block 64K block 64K block 64K block 64K block 64K block Index Index
Table • Multiple tablets make up the table • SSTables can be shared • Tablets do not overlap, SSTables can overlap Tablet Tablet apple boat aardvark apple SSTable SSTable SSTable SSTable
Organization • A Bigtable cluster stores tables • Each table consists of tablets • Initially each table consists of one tablet • As a table grows it is automatically split into multiple tablets • Tablets are assigned to tablet servers • Multiple tablets per server. Each tablet is 100-200 MB • Each tablet lives at only one server • Tablet server splits tablets that get too big
Organization • Master • Keeps track of the set of live tablet servers • Assigns tablets to table servers • Detects the addition and expiration of tablet servers • Balancing tablet-server load • Handles scheme changes such as table and column family creations • A Bigtable library is linked to every client.
Finding a tablet Similar to a B+ tree <table_id, end_row> -> location Location is <ip address, port> of a tablet server
Finding a tablet • A Bigtable library is linked to every client. • Client communicates directly with tablet server for reads/writes. • Client reads the Chubby file that points to the root tablet • This starts the location process • The client library caches tablet locations. • There are algorithms for prefetching • If there is no cached information, three network round-trips are needed
System Structure Bigtable client Bigtable Master Bigtable Cell metadata Bigtable client library Performs metadata ops+ Load balancing read/write Bigtable tablet server Bigtable tablet server Bigtable tablet server Open() Serves data Serves data Serves data Cluster scheduling system GFS Lock service Handles failover, monitoring Holds tablet data, logs Holds metadata Handles master-election
Tablet Server • When a tablet server starts, it creates and acquires and exclusive lock on, a uniquely-named file in a specific Chubby directory • Let’s call this servers directory • A tablet server stops serving its tablets if it loses its exclusive lock • This may happen if there is a network connection that causes the tablet server to lose its Chubby session
Tablet Server • A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists • If the file no longer exists then the tablet server will never be able to serve again • Kills itself • At some point it can restarted; it goes to a pool of unassigned tablet servers
Master Startup Operation • Upon start up the master needs to discover the current tablet assignment. • Grabs unique master lock in Chubby • Prevents server instantiations • Scans servers directory in Chubby for live servers • Communicates with every live tablet server • Discover all tablets • Scans METADATA table to learn the set of tablets • Unassigned tables are marked for assignment
Master Operation • Detect tablet serverfailures/resumption • Master periodically asks each tablet server for the status of its locks
Master Operation • Tablet server lost its lock or master cannot contact tablet server: • Master attempts to acquire exclusive lock on the server’s file in the servers directory • If master acquires the lock then the tablets assigned to the tablet server are assigned to others • Master deletes the server’s file in the servers directory • Assignment of tablets should be balanced • If master loses its Chubby session then it kills itself • An election can take place to find a new master
Master Operation • Handles schema changes such as table and column family creations. • Handles table creation, and merging of tablets • Tablet servers directly update metadata on tablet split, then notify master • lost notification may be detected lazily by master
Tablet Serving • Updates are committed to a commit log that stores redo records • Recently committed updates are stored in a buffer called memtable • Older updates are stored in a sequence of SSTables • To recover a tablet, a tablet server reads its metadata from the METADATA table • Contains list of SSTables and a set of redo points
Read Op Memtable Memory GFS Tablet Log SST SST SST SSTable Files Write Op Tablet Serving • Commit log stores the updates that are made to the data. • Recent updates are stored in memtable. • Older updates are stored in SStable files
Read Op Memtable Memory GFS Tablet Log SST SST SST SSTable Files Write Op Tablet Serving • Recovery process. • Reads/Writes that arrive at tablet server. • Is the request Well-formed? • Authorization. • Chubby holds the permission file. • If a mutation occurs it is wrote to commit log and finally a group commit is used.
Read Op Memtable Memory GFS Tablet Log SST SST SST SSTable Files Write Op Tablet Serving • Tablet recovery process • Read metadata containing SSTABLES and redo points • Redo points are pointers into any commit logs • Apply redo points
Compactions • Minor compaction – convert the memtable into an SSTable • Reduce memory usage • Reduce log traffic on restart • Merging compaction • Reduce number of SSTables • Good place to apply policy “keep only N versions” • Major compaction • Merging compaction that results in only one SSTable • No deletion records, only live data
Locality Groups • Group column families together into an SSTable • Avoid mingling data, e.g. page contents and page metadata • Can keep some groups all in memory • Can compress locality groups • Tablet movement • Major compaction (with concurrent updates) • Minor compaction (to catch up with updates) without any concurrent updates • Load on new server without requiring any recovery action
Lessons learned • Interesting point- only implement some of the requirements, since the last is probably not needed • Many types of failure possible • Big systems need proper systems-level monitoring • Value simple design