Zookeeper at Facebook

Presentation Transcript


  1. Zookeeper at Facebook
     Vishal Kathuria

  2. Agenda
     • Zookeeper use at Facebook
     • Project Zeus – Goals
     • Tao Design
     • Tao Workload simulator
     • Early results of Zookeeper testing
     • Zookeeper Improvements

  3. Use Cases Inside Facebook
     • HDFS
        • Location of the name node
        • Name node leader election
        • 75K temporary (permanent in future) clients
     • HBase
        • Mapping of regions to region servers; location of the ROOT node
        • Region server failure detection and failover
        • After UDBs move to HBase, ~100K permanent clients
     • Titan
        • Mapping of users to Prometheus web servers within a cell
        • Leader election of Prometheus web servers
        • Future: selection of the HBase geo-cell

  4. Use Cases (contd.)
     • Ads
        • Leader election
     • Scribe
        • Leader election of Scribe aggregators
     • Future customers
        • TAO
           • Sharding
        • MySQL
           • Leader election
        • Search
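
Several of the use cases above are leader election. The slides do not say how each team implements it; the sketch below shows the standard ZooKeeper recipe (each candidate creates an ephemeral sequential znode, the lowest sequence number leads, and every other candidate watches only its predecessor) against a hypothetical in-memory `FakeEnsemble`, not the real ZooKeeper client API. The names `current_leader`, `watch_target`, and the `/election/n_` path are illustrative.

```python
import itertools


class FakeEnsemble:
    """Hypothetical in-memory stand-in for one ZooKeeper election path.

    Models only ephemeral sequential znodes; it is not the ZooKeeper client API.
    """

    def __init__(self):
        self._seq = itertools.count()
        self._nodes = {}  # znode path -> owning session id

    def create_ephemeral_sequential(self, prefix, session):
        # ZooKeeper appends a zero-padded, 10-digit, increasing counter.
        path = f"{prefix}{next(self._seq):010d}"
        self._nodes[path] = session
        return path

    def children(self):
        # Same prefix + zero padding means string order == sequence order.
        return sorted(self._nodes)

    def owner(self, path):
        return self._nodes[path]

    def expire_session(self, session):
        # Ephemeral znodes are deleted when their owning session ends.
        self._nodes = {p: s for p, s in self._nodes.items() if s != session}


def current_leader(ens):
    """The candidate holding the lowest sequence number leads."""
    kids = ens.children()
    return ens.owner(kids[0]) if kids else None


def watch_target(ens, my_path):
    """A non-leader watches only its immediate predecessor (no herd effect)."""
    kids = ens.children()
    i = kids.index(my_path)
    return kids[i - 1] if i > 0 else None


ens = FakeEnsemble()
a = ens.create_ephemeral_sequential("/election/n_", "web-a")
b = ens.create_ephemeral_sequential("/election/n_", "web-b")
c = ens.create_ephemeral_sequential("/election/n_", "web-c")
print(current_leader(ens))  # web-a
ens.expire_session("web-a")
print(current_leader(ens))  # web-b
```

Expiring `web-a`'s session removes its znode, so `web-b` takes over without an explicit hand-off, while `web-c` keeps watching only `web-b`. Watching the predecessor rather than the leader is what keeps a leader change from waking every candidate at once.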

  5. Project Zeus
     • "Make Zookeeper awesome"
        • Zookeeper works at Facebook scale
        • Zookeeper is one of the most reliable services at Facebook
     • Solve pressing infrastructure problems using ZooKeeper
        • Shard Manager for Tao
        • Generic shard management capability in Tupperware
        • MySQL HA

  6. Caveats
     • Project is 5 weeks old
     • Initial sharing of ideas with the community
     • Ideas not yet vetted or proven through prototypes

  7. Tao Design
     • Shard map
        • Based on ranges instead of consistent hashing
        • Stored in ZooKeeper
        • Accessed by clients using Aether
        • Populated by Eos
        • Dynamically updated based on load information
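
A range-based shard map can be served with a binary search over sorted range boundaries. This is a minimal sketch assuming integer keys and contiguous ranges; `RangeShardMap`, `lookup`, and `split` are hypothetical names, and the actual Tao shard-map format and the Aether and Eos interfaces are not described in the slides.

```python
import bisect


class RangeShardMap:
    """Hypothetical range-based shard map over integer keys.

    Each (start, shard) entry owns keys in [start, next_start); the last
    entry owns everything from its start upward.
    """

    def __init__(self, entries):
        entries = sorted(entries)
        self._starts = [start for start, _ in entries]
        self._shards = [shard for _, shard in entries]

    def lookup(self, key):
        # Rightmost range whose start is <= key.
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            raise KeyError(f"key {key} is below the first range")
        return self._shards[i]

    def split(self, at, new_shard):
        """Carve [at, next_start) off an existing range, e.g. to shed load."""
        i = bisect.bisect_left(self._starts, at)
        if i < len(self._starts) and self._starts[i] == at:
            raise ValueError("a range already starts at this key")
        self._starts.insert(i, at)
        self._shards.insert(i, new_shard)


m = RangeShardMap([(0, "shard-0"), (1000, "shard-1"), (2000, "shard-2")])
print(m.lookup(1500))  # shard-1
m.split(1500, "shard-3")  # move the upper half of a hot range elsewhere
print(m.lookup(1500))  # shard-3
```

Shedding load from a hot range is a single boundary insertion that touches only that range, which is the kind of load-driven update the slide attributes to Eos.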

  8. Tao Projected Workload
     • Scale requirements for a single cluster
        • 24,000 web machines (read-only clients)
        • 6,000 Tao server machines (read/write clients)
     • About 20 clusters site-wide
     • Shard map is 2-3 MB of data
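
Combining the numbers on this slide gives a rough sense of scale. The last line assumes, as in the simulator described in slide 9, that every client in a cluster re-reads the full shard map after each change; that per-update fan-out is an estimate, not a measured figure.

```latex
\begin{aligned}
N_{\text{cluster}} &= 24{,}000 + 6{,}000 = 30{,}000 \text{ clients} \\
N_{\text{site}} &\approx 20 \times 30{,}000 = 600{,}000 \text{ clients} \\
B_{\text{update}} &\approx N_{\text{cluster}} \times (2 \text{ to } 3\ \text{MB}) = 60 \text{ to } 90\ \text{GB per shard-map change}
\end{aligned}
```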

  9. Tao Workload Simulator
     • Clients
        • Read the shard map of the local cluster after connecting
        • Put a watch on the shard map
        • Refresh the shard map after the watch fires
     • Follower servers
        • These servers are clients of the leader servers
        • Also read their own shard map
     • Leader servers
        • Read their own shard map and those of all their followers
     • Shard manager (Eos)
        • Periodically updates the shard map
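
The client behaviour above depends on ZooKeeper watches being one-shot: once a watch fires, the client must set it again on its next read. The sketch below models that with a hypothetical in-memory `FakeZNode`; `ShardMapClient` and its methods are illustrative names, not the simulator's actual code.

```python
class FakeZNode:
    """Hypothetical in-memory znode with ZooKeeper-style one-shot watches."""

    def __init__(self, data):
        self.data = data
        self.version = 0
        self._watchers = []

    def get(self, watcher=None):
        # Like a read that sets a watch: the watch fires at most once.
        if watcher is not None:
            self._watchers.append(watcher)
        return self.data, self.version

    def set(self, data):
        self.data = data
        self.version += 1
        # Detach the current watchers before firing, so re-registrations
        # made inside the callbacks apply only to the next change.
        fired, self._watchers = self._watchers, []
        for watcher in fired:
            watcher()


class ShardMapClient:
    """Simulated Tao client: read the map, watch it, re-read on change."""

    def __init__(self, znode):
        self.znode = znode
        self.reads = 0
        self.refresh()

    def refresh(self):
        # Every read re-registers the watch, since a fired watch is gone.
        self.shard_map, self.version = self.znode.get(watcher=self.refresh)
        self.reads += 1


node = FakeZNode("shard-map-v0")
clients = [ShardMapClient(node) for _ in range(3)]
node.set("shard-map-v1")
print([c.shard_map for c in clients])  # ['shard-map-v1', 'shard-map-v1', 'shard-map-v1']
print(sum(c.reads for c in clients))   # 6: one initial read + one refresh per client
```

Each `set` triggers one full re-read per watching client, so a single shard-map update costs as many reads as there are clients.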

  10. Hardware
     • 3-node ZooKeeper ensemble
        • 8 cores
        • 8 GB RAM
     • Clients: 20-node cluster
        • Web-class machines
        • 12 GB RAM

  11. Scenario - Steady State
     • Using the ZooKeeper-ensemble-per-cluster model
     • Assumptions
        • 40K connections
        • Small number of clients joining/leaving at any time
        • Rare updates to the shard map: once every 10 minutes
     • Result
        • Zookeeper worked well in this scenario

  12. Scenario - Cluster Power Up/Down
     • Cluster powering up
        • 25K clients simultaneously trying to connect
        • Slow response time: some clients took 560 s to connect and get data
     • Cluster powering down
        • 25K clients simultaneously disconnect
        • System temporarily unresponsive
        • The disconnect requests filled the ZooKeeper queues
        • The system would not accept any new connections or requests
        • After a short time, the disconnect requests were processed and the system became responsive again
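
The power-down behaviour can be pictured as a single worker draining a fixed-size queue: while a burst of disconnects occupies every slot, any new request is turned away until the burst is processed. This is a hypothetical model for illustration only; `BoundedPipeline`, the capacity of 1,000, and the drain step are assumptions, not ZooKeeper internals or measured values.

```python
from collections import deque


class BoundedPipeline:
    """Hypothetical single-worker request pipeline with a fixed-size queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.rejected = 0

    def submit(self, request):
        # Once the queue is full, new requests are refused outright.
        if len(self.queue) >= self.capacity:
            self.rejected += 1
            return False
        self.queue.append(request)
        return True

    def process(self, n):
        """The single worker drains up to n requests, strictly in FIFO order."""
        done = []
        for _ in range(min(n, len(self.queue))):
            done.append(self.queue.popleft())
        return done


p = BoundedPipeline(capacity=1000)  # capacity is an arbitrary assumption
for i in range(1000):
    p.submit(("disconnect", i))
print(p.submit(("connect", "new-client")))  # False: queue full of disconnects
p.process(1000)
print(p.submit(("connect", "new-client")))  # True: burst drained
```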

  13. Scenario – Zookeeper Node Failure
     • Rolling restart of ZooKeeper nodes
     • Startup/shutdown of the entire cluster
        • With active clients
        • Without active clients
     • Result
        • No corruption or system hangs noticed so far

  14. Zookeeper Design
     • Client connect/disconnect is a persisted update involving all nodes
     • The leader handles pings and connection timeouts for all connections
     • A single thread handles both connect requests and data requests
     • Zookeeper is implemented as a single-threaded pipeline
        • All reads are serialized
        • Low read throughput
        • Uses only 3 of the 8 cores at full load

  15. Zookeeper Improvement Ideas
     • Non-persisted sessions with local session tracking
        • Hacked together a prototype to test the potential
        • Initial test runs very encouraging
     • Dedicated connection-creation thread
        • Prototyped; test runs in progress
     • Multiple threads for deserializing incoming requests
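
One reading of "non-persisted sessions with local session tracking" is that each server keeps its own clients' sessions in memory and expires them from local ping times, instead of committing every connect and disconnect through all nodes. A minimal sketch under that assumption; `LocalSessionTracker` and its methods are hypothetical, and the prototype's actual design is not described in the slides.

```python
class LocalSessionTracker:
    """Hypothetical per-server tracker for non-persisted sessions.

    Creation, pings, and expiry stay on the server the client is connected
    to; nothing here is replicated. Time is passed in explicitly so the
    sketch is deterministic.
    """

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last_seen = {}  # session id -> time of last ping
        self._next_id = 1

    def create(self, now):
        sid = self._next_id
        self._next_id += 1
        self._last_seen[sid] = now
        return sid

    def touch(self, sid, now):
        """Record a ping; returns False for unknown or expired sessions."""
        if sid not in self._last_seen:
            return False
        self._last_seen[sid] = now
        return True

    def expire(self, now):
        """Drop sessions whose last ping is older than the timeout."""
        dead = [s for s, t in self._last_seen.items() if now - t > self.timeout_s]
        for s in dead:
            del self._last_seen[s]
        return dead

    def close(self, sid):
        # A disconnect is a local dictionary delete, not a quorum write.
        self._last_seen.pop(sid, None)


t = LocalSessionTracker(timeout_s=10)
a = t.create(now=0)
b = t.create(now=0)
t.touch(a, now=8)
print(t.expire(now=15))  # [2]: b missed its pings, a did not
```

Under this model, the 25K simultaneous disconnects of slide 12 become local deletes on whichever servers hold those sessions.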

  16. Zookeeper Improvement Ideas (contd.)
     • Dedicated parallel pipeline for read-only clients
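
A dedicated read-only pipeline could take roughly this shape: writes go through one worker so they stay ordered, while reads run on a separate pool and no longer queue behind writes. `SplitPipelineServer` and the pool size are hypothetical, and Python threads here only illustrate the routing, not real parallel throughput.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SplitPipelineServer:
    """Hypothetical server with separate write and read pipelines."""

    def __init__(self, read_workers=4):
        self._data = {}
        self._lock = threading.Lock()  # guards _data across both pipelines
        # One writer thread: writes are applied in submission order.
        self._writer = ThreadPoolExecutor(max_workers=1)
        # Read-only requests get their own pool and never wait on the writer queue.
        self._readers = ThreadPoolExecutor(max_workers=read_workers)

    def _apply_write(self, path, value):
        with self._lock:
            self._data[path] = value
        return path

    def _read(self, path):
        with self._lock:
            return self._data.get(path)

    def write(self, path, value):
        return self._writer.submit(self._apply_write, path, value)

    def read(self, path):
        return self._readers.submit(self._read, path)

    def shutdown(self):
        self._writer.shutdown(wait=True)
        self._readers.shutdown(wait=True)


s = SplitPipelineServer(read_workers=4)
s.write("/tao/shardmap", "v1").result()  # wait until the write is applied
reads = [s.read("/tao/shardmap") for _ in range(10)]
print({f.result() for f in reads})  # {'v1'}
s.shutdown()
```

A read submitted while a write is still queued may return the older value. Read-only clients would have to accept that staleness, much as ZooKeeper reads served from a server's local replica can already lag the leader.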
