
Cassandra - A Decentralized Structured Storage System

Cassandra - A Decentralized Structured Storage System. Presenter: 呂俐禎. Abstract. Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.


Presentation Transcript


  1. Cassandra - A Decentralized Structured Storage System Presenter: 呂俐禎

  2. Abstract • Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.

  3. Motivations • High Availability • High Write Throughput • High Scalability

  4. Data Model • A table is a multidimensional map indexed by a key. • Each Column has • Name • Value • Timestamp • Columns are grouped into Column Families. • 2 Types of Column Families • Simple • Super (nested Column Families)
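The map structure on this slide can be sketched as nested dictionaries. This is only an illustration: the names `make_table` and `insert` are hypothetical, and keeping the column version with the newest timestamp is an assumption about conflict resolution that the slide itself does not state.

```python
from collections import defaultdict


def make_table():
    # row key -> column family name -> column name -> (value, timestamp)
    return defaultdict(lambda: defaultdict(dict))


def insert(table, row_key, column_family, column, value, timestamp):
    """Store a column; an older timestamp never overwrites a newer one (assumption)."""
    columns = table[row_key][column_family]
    current = columns.get(column)
    if current is None or timestamp >= current[1]:
        columns[column] = (value, timestamp)


table = make_table()
insert(table, "user42", "MailList", "tid1", b"hello", 1)
insert(table, "user42", "MailList", "tid1", b"stale", 0)  # older write, ignored
insert(table, "user42", "MailList", "tid2", b"world", 2)
```

A Super column family would add one more dictionary level between the column family and its columns.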

  5. Data Model (figure) • Column Families are declared upfront. • ColumnFamily1: Name: MailList, Type: Simple, Sort: Name. Under a KEY, columns tid1, tid2, tid3, tid4, each with Value: <Binary> and TimeStamp: t1, t2, t3, t4. Columns are added and modified dynamically. • ColumnFamily2: Name: WordList, Type: Super, Sort: Time. SuperColumns named aloha and dude, each holding columns (C, V, T). SuperColumns and their columns are added and modified dynamically.

  6. System Architecture • Data is partitioned across a subset of nodes: Consistent Hashing • Data is replicated to multiple nodes for redundancy and performance: Quorum using a “preference list” of nodes • Node management: • Membership algorithm to know which nodes are up/down: “Accrual failure detection + Gossip” • Bootstrapping to add a node: manual operation + “Seed” nodes • Figure labels: Consistent Hash; NodeA, NodeB, NodeC, NodeD; Gossip

  7. Partitioning • Consistent Hashing • Nodes are logically structured in a Ring Topology. • The hashed value of the key associated with a data partition is used to assign it to a node in the ring. • Lightly loaded nodes move their position on the ring to relieve heavily loaded nodes.
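A minimal sketch of the ring lookup described on this slide. The `Ring` class and `token` function are hypothetical names, MD5 is an assumed hash function, and each node is given a single ring position derived from its name. The owner of a key is taken to be the first node at or after the key's position, wrapping around the ring.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is an assumption for this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # One position per node, kept sorted by token.
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key: str) -> str:
        """First node at or after the key's position, wrapping around the ring."""
        i = bisect.bisect_left(self._tokens, token(key))
        return self._entries[i % len(self._entries)][1]


ring = Ring(["NodeA", "NodeB", "NodeC", "NodeD"])
print(ring.owner("key1"))
```

The design point of consistent hashing is that adding or removing a node only moves the keys adjacent to that node on the ring; all other keys keep their owner.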

  8. Replication • Replication is used to achieve high availability. • Each data item is replicated at N nodes, where N is the replication factor. • The coordinator node replicates the key to an additional N-1 nodes. • Different Replication Policies • Rack Unaware • Rack Aware • Datacenter Aware
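As a rough illustration of the Rack Unaware policy, the sketch below picks the coordinator for a key and then the next N-1 nodes clockwise on the ring. The function name `replicas`, the integer tokens, and the example ring are all hypothetical; the Rack Aware and Datacenter Aware policies are not modeled.

```python
import bisect


def replicas(ring, key_token, n):
    """ring: list of (token, node) sorted by token.

    Returns the coordinator followed by its clockwise successors,
    n nodes in total (or fewer if the ring is smaller than n).
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    count = min(n, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


ring = [(0, "A"), (20, "B"), (40, "C"), (60, "D"), (80, "E")]
print(replicas(ring, 45, 3))  # coordinator D, then E, then wraps to A
```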

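  9. Partitioning and Replication (figure) • A ring over the range 0 to 1 with nodes A, B, C, D, E, F; keys are placed at h(key1) and h(key2); replication factor N=3. • * Figure taken from the slides of Avinash Lakshman and Prashant Malik (authors of the paper).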

  10. Membership • Cluster membership in Cassandra is based on Scuttlebutt, a very efficient anti-entropy Gossip-based mechanism. • Each node locally determines whether any other node in the system is up or down. • Gossip is used for node membership and to transmit system control state. • Node failure state is given by a variable Φ (phi), which expresses how likely it is that a node has failed (a suspicion level) instead of a simple binary value (up/down). • This type of failure detector is known as an Accrual Failure Detector.

  11. Gossip
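A simplified sketch of one gossip exchange between two nodes. This is not the full Scuttlebutt protocol: it only shows the anti-entropy idea that each node keeps a versioned entry per peer and that the entry with the higher version wins on merge. The function names and the `(version, status)` layout are assumptions made for illustration.

```python
def merge(local, remote):
    """Merge remote state into local; the higher version wins for each node."""
    for node, (version, status) in remote.items():
        if node not in local or version > local[node][0]:
            local[node] = (version, status)
    return local


def gossip_round(a, b):
    """Two-way exchange: afterwards both sides hold the newest known entries."""
    snapshot_a = dict(a)  # what a knew before hearing from b
    merge(a, b)
    merge(b, snapshot_a)


a = {"NodeA": (3, "up"), "NodeB": (1, "up")}
b = {"NodeB": (2, "leaving"), "NodeC": (5, "up")}
gossip_round(a, b)
```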

  12. Accrual Failure Detector • If a node is faulty, the suspicion level monotonically increases with time: Φ(t) → k as t → k, where k is a threshold variable (dependent on system load) at which a node is declared dead. • Φ is computed using the inter-arrival times of gossip messages from other nodes in the cluster.
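A minimal sketch of how Φ could be derived from gossip inter-arrival times. It assumes the inter-arrival times follow an exponential distribution, so the probability that the next message arrives later than `elapsed` is exp(-elapsed / mean), and Φ is defined as -log10 of that probability. The class name, the window size, and the method names are hypothetical.

```python
import math
from collections import deque


class AccrualFailureDetector:
    def __init__(self, window=1000):
        self.intervals = deque(maxlen=window)  # sliding window of inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        """Record the arrival time of a gossip message from the monitored node."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level; grows linearly with the silence under the exponential model."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))


detector = AccrualFailureDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
```

A node would then be treated as dead once `phi(now)` exceeds the chosen threshold.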

  13. Bootstrapping & Scaling the Cluster • When a node starts for the first time, it chooses a random token for its position in the ring. • The mapping is persisted to disk locally and also in Zookeeper. • The token information is then gossiped around the cluster. • An administrator uses a command line tool or a browser to initiate the addition and removal of nodes from a Cassandra instance.

  14. Local Persistence • Relies on the local file system for data persistence. • A write operation happens in 2 steps • Write to the commit log on the local disk of the node • Update the in-memory data structure. • Read operation • Looks up the in-memory data structure first, before looking up files on disk. • Uses a Bloom Filter (a summary of the keys in a file, kept in memory) to avoid looking up files that do not contain the key.
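The write and read steps above can be sketched as a toy store. Everything here is illustrative: `Store` and `BloomFilter` are hypothetical classes, the commit log is a Python list standing in for an append-only file, on-disk files are plain dictionaries, and the Bloom filter uses SHA-256 with assumed sizes.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p] for p in self._positions(key))


class Store:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory data structure
        self.files = []        # flushed files, oldest first: (bloom, data)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # step 1: commit log
        self.memtable[key] = value            # step 2: in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.files.append((bloom, dict(self.memtable)))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:              # memory first
            return self.memtable[key]
        for bloom, data in reversed(self.files):  # then files, newest first
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None


store = Store()
store.write("a", 1)
store.write("b", 2)  # memtable reaches its limit and is flushed
store.write("a", 3)  # newer value lives in memory
```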

  15. Read and write path (figure) • READ • Use cache data • Bloom filter to reduce SSTable access • Check SSTables in time-order • WRITE • No reads, seeks • Atomic within CF • Commit Log • Append-only • Dedicated disk • Synced to disk OR buffered • In-Memory Table • Dumped to SSTable when full • SSTable • Immutable • Compact job in background to merge files
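The background compaction job mentioned on this slide can be sketched as a merge of several immutable tables into one. The `compact` function and the `(value, timestamp)` layout are hypothetical, and keeping the entry with the newest timestamp is an assumption about how overlapping keys are resolved.

```python
def compact(sstables):
    """Merge tables of key -> (value, timestamp) into one table sorted by key.

    For a key present in several inputs, the entry with the newest
    timestamp is kept (assumption for this sketch).
    """
    merged = {}
    for table in sstables:
        for key, (value, timestamp) in table.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return dict(sorted(merged.items()))


older = {"b": ("x", 2), "a": ("old", 1)}
newer = {"a": ("new", 5)}
result = compact([older, newer])
```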

  16. Facebook Inbox Search • Cassandra was developed to address this problem. • Cassandra was tested on 50+ TB of user message data in a 150-node cluster. • The per-user index of all messages is searched in 2 ways. • Term search: search by a keyword • Interactions search: search by a user id
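A small sketch of the two indexes behind these searches, following the layout the paper describes: the user id is the row key, words (or the other user's id) act as super columns, and message ids are the columns inside them. The function `build_indexes` and the input tuple shape are hypothetical, and tokenizing by lowercased whitespace splitting is an assumption.

```python
from collections import defaultdict


def build_indexes(messages):
    """messages: iterable of (message_id, owner_id, other_user_id, text)."""
    term_index = defaultdict(lambda: defaultdict(set))         # owner -> word -> message ids
    interaction_index = defaultdict(lambda: defaultdict(set))  # owner -> other user -> message ids
    for message_id, owner, other_user, text in messages:
        for word in text.lower().split():
            term_index[owner][word].add(message_id)
        interaction_index[owner][other_user].add(message_id)
    return term_index, interaction_index


terms, interactions = build_indexes([
    ("m1", "alice", "bob", "Lunch today"),
    ("m2", "alice", "carol", "lunch tomorrow"),
])
```

A term search is then a lookup such as `terms["alice"]["lunch"]`, and an interactions search is `interactions["alice"]["bob"]`.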

  17. Comparison with MySQL • MySQL, > 50 GB data: average write ~300 ms, average read ~350 ms • Cassandra, > 50 GB data: average write 0.12 ms, average read 15 ms • Stats provided by the authors using Facebook data.

  18. Reference • Cassandra - A Decentralized Structured Storage System, Avinash Lakshman (Facebook) and Prashant Malik (Facebook) • http://en.wikipedia.org/wiki/Apache_Cassandra • http://wiki.apache.org/cassandra/FrontPage
