Ceph: A Scalable, High-Performance Distributed File System Derek Weitzel
In the Before… • Let's go back through some of the notable distributed file systems used in HPC
In the Before… • There were distributed filesystems like: • Lustre – RAID over storage boxes • Recovery time after a node failure was MASSIVE! (An entire server's contents had to be copied, one to one) • When functional, reading/writing EXTREMELY fast • Used heavily in HPC
In the Before… • There were distributed filesystems like: • NFS – Network File System • Does this really count as distributed? • Single large server • Full POSIX support, in kernel since…forever • Slow with even a moderate number of clients • Dead simple
In the Current… • There are distributed filesystems like: • Hadoop – Apache Project inspired by Google • Massive throughput • Throughput scales with attached HDs • Have seen VERY LARGE production clusters • Facebook, Yahoo… Nebraska • Doesn’t even pretend to be POSIX
In the Current… • There are distributed filesystems like: • GPFS (IBM) / Panasas – Proprietary file systems • Require a closed-source kernel driver • Not flexible with the newest kernels / OSs • Good: Strong support and large communities • Can be treated as a black box by administrators • HUGE installations (Panasas at LANL is HUGE!!!!)
Motivation • Ceph is an emerging technology for production clustered environments • Designed for: • Performance – Data striped over data servers • Reliability – No single point of failure • Scalability – Adaptable metadata cluster
Timeline • 2006 – Ceph paper written • 2007 – Sage Weil earned his PhD, largely for his work on Ceph • 2007 – 2010 Development continued, primarily for DreamHost • March 2010 – Linus merged the Ceph client into the mainline 2.6.34 kernel • No more patches needed for clients
Adding Ceph to Mainline Kernel • Huge development! • Significantly lowered cost to deploy Ceph • For production environments, it was a little too late – 2.6.32 was the stable kernel used in RHEL 6 (CentOS 6, SL 6, Oracle 6).
Let's talk about the paper. Then I'll show a quick demo
Ceph Overview • Decoupled data and metadata • IO directly with object servers • Dynamic distributed metadata management • Multiple metadata servers handling different directories (subtrees) • Reliable autonomic distributed storage • OSDs manage themselves by replicating and monitoring
Decoupled Data and Metadata • Increases performance by limiting interaction between clients and servers • Decoupling is common in distributed filesystems: Hadoop, Lustre, Panasas… • In contrast to other filesystems, Ceph uses a function to calculate block locations instead of storing them
Dynamic Distributed Metadata Management • Metadata is split among a cluster of servers • Distribution of metadata changes with the number of requests to even the load among metadata servers • Metadata servers can also quickly recover from failures by taking over a failed neighbor's data • Improves performance by leveling the metadata load
Reliable Autonomic Distributed Storage • Data storage servers act on events by themselves • Initiate replication and recovery on their own • Improves performance by offloading decision making to the many data servers • Improves reliability by removing central control of the cluster (a single point of failure)
Ceph Components • Some quick definitions before getting into the paper • MDS – Metadata Server • OSD – Object Storage Device (the data servers) • MON – Monitor (now fully implemented)
Ceph Components • Ordered: Clients, Metadata, Object Storage
Client Overview • Can be a FUSE mount • File system in userspace • Introduced so file systems can use a better interface than the Linux kernel VFS (virtual file system) • Can link directly to the Ceph library • Built into the newest OSs
Client Overview – File IO • 1. Asks the MDS for the inode information
Client Overview – File IO • 2. Responds with the inode information
Client Overview – File IO • 3. Client Calculates data location with CRUSH
Client Overview – File IO • 4. Client reads directly off storage nodes
Client Overview – File IO • Client asks the MDS for a small amount of information • Performance: Small bandwidth between client and MDS • Performance: Small cache (memory) needed on the MDS since the data is small • Client calculates the file location using a function • Reliability: Saves the MDS from keeping block locations • Function described in the data storage section
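Putting the four steps together, here is a minimal runnable sketch in Python. Everything in it (FakeMDS, locate, read_plan, the 4 MiB object size, the object naming scheme) is an illustrative assumption, not the real Ceph client API; locate is only a stand-in for CRUSH, which the paper covers in the data storage section.

import hashlib

class FakeMDS:
    """Stands in for steps 1-2: the MDS returns inode metadata only,
    never a list of block locations."""
    def lookup(self, path):
        return {"ino": int(hashlib.md5(path.encode()).hexdigest()[:8], 16),
                "object_size": 4 << 20}   # 4 MiB objects, an assumed default

def locate(obj_name, cluster_osds, replicas=3):
    """Placeholder for CRUSH (step 3): a deterministic hash of the object
    name and the cluster map -- no server is asked where the data lives."""
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return [cluster_osds[(h + i) % len(cluster_osds)] for i in range(replicas)]

def read_plan(path, offset, length, mds, cluster_osds):
    inode = mds.lookup(path)                          # one small MDS round trip
    size = inode["object_size"]
    plan = []
    for stripe in range(offset // size, (offset + length - 1) // size + 1):
        obj_name = f"{inode['ino']:x}.{stripe:08x}"   # assumed naming scheme
        # Step 4: the client would now read each object directly from its OSDs.
        plan.append((obj_name, locate(obj_name, cluster_osds)))
    return plan

print(read_plan("/home/derek/data.bin", 0, 10 << 20,
                FakeMDS(), [f"osd{i}" for i in range(8)]))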
Ceph Components • Ordered: Clients, Metadata, Object Storage
Client Overview – Namespace • Optimized for the common case, 'ls -l' • Directory listing immediately followed by a stat of each file • Reading a directory gives all inodes in the directory • Namespace covered in detail next!
$ ls -l
total 0
drwxr-xr-x 4 dweitzel swanson  63 Aug 15  2011 apache
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-api-java
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-common
drwxr-xr-x 7 dweitzel swanson 103 Jan 18 16:37 bestman2
drwxr-xr-x 6 dweitzel swanson  75 Jan 18 12:25 buildsys-macros
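To illustrate why bundling inodes with the listing matters, here is a hypothetical sketch (not the real MDS protocol): the readdir reply carries each entry's inode attributes, so the per-file stat() calls that 'ls -l' issues afterwards are answered from the client's cache with no extra network round trips.

from dataclasses import dataclass

@dataclass
class InodeAttrs:            # illustrative subset of inode attributes
    ino: int
    mode: int
    size: int

class ClientDirCache:
    """Caches the inode attributes that arrive bundled with a readdir reply."""
    def __init__(self):
        self._attrs = {}

    def readdir(self, mds_reply):
        # One round trip: remember every bundled inode, return just the names.
        for name, attrs in mds_reply:
            self._attrs[name] = attrs
        return [name for name, _ in mds_reply]

    def stat(self, name):
        # The stat that 'ls -l' does per file never touches the network.
        return self._attrs[name]

reply = [("apache", InodeAttrs(101, 0o755, 63)),
         ("bestman2", InodeAttrs(102, 0o755, 103))]
cache = ClientDirCache()
for name in cache.readdir(reply):
    print(name, oct(cache.stat(name).mode), cache.stat(name).size)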
Metadata Overview • Metadata servers (MDS) serve out the file system attributes and directory structure • Metadata is stored in the distributed filesystem beside the data • Compare this to Hadoop, where metadata is stored only on the head nodes • Updates are staged in a journal, flushed occasionally to the distributed file system
MDS Subtree Partitioning • In HPC applications, it is common to have 'hot' metadata that is needed by many clients • In order to be scalable, Ceph needs to distribute metadata requests among many servers • Each MDS monitors the frequency of queries using special counters • The MDSs compare counters with each other and split the directory tree to evenly spread the load
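A toy sketch of that counter idea, under assumed simplifications (the real MDS weighs its counters with time decay and migrates whole subtrees between MDS ranks): each MDS counts requests per top-level subtree, and a rebalance hands the hottest subtree of the busiest MDS to the least loaded one. All names and the 2x threshold are made up for illustration.

from collections import Counter

class ToyMDS:
    def __init__(self, name):
        self.name = name
        self.subtree_hits = Counter()      # requests per subtree since last rebalance

    def handle(self, path):
        subtree = "/" + path.strip("/").split("/")[0]
        self.subtree_hits[subtree] += 1    # the 'special counter' for this subtree

    def load(self):
        return sum(self.subtree_hits.values())

def rebalance(cluster):
    busiest = max(cluster, key=lambda m: m.load())
    idlest = min(cluster, key=lambda m: m.load())
    # Arbitrary threshold for the sketch: only move work for a 2x imbalance.
    if busiest is not idlest and busiest.load() > 2 * idlest.load():
        subtree, hits = busiest.subtree_hits.most_common(1)[0]
        del busiest.subtree_hits[subtree]
        idlest.subtree_hits[subtree] = hits
        print(f"migrating {subtree}: {busiest.name} -> {idlest.name}")

cluster = [ToyMDS("mds0"), ToyMDS("mds1")]
for _ in range(100):
    cluster[0].handle("/hot-job/output.dat")   # a directory hammered by many clients
for _ in range(10):
    cluster[0].handle("/home/derek")
rebalance(cluster)                             # hands /hot-job to mds1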
MDS Subtree Partitioning • Multiple MDS split the metadata • Clients will receive metadata partition data from the MDS during a request
MDS Subtree Partitioning • Busy directories (multiple creates or opens) will be hashed across multiple MDS’s
MDS Subtree Partitioning • Clients will read from a random replica • Updates go to the primary MDS for the subtree
Ceph Components • Ordered: Clients, Metadata, Object Storage
Data Placement • Need a way to evenly distribute data among storage devices (OSDs) • Increased performance from even data distribution • Increased resiliency: losing any one node minimally affects the status of the cluster when data is evenly distributed • Problem: Don't want to keep data locations in the metadata servers • Requires lots of memory when there are many data blocks
CRUSH • CRUSH is a pseudo-random function to find the location of data in a distributed filesystem • Summary: Take a little information and plug it into a globally known (hash-like) function to find where the data is stored • Input data is: • inode number – from the MDS • OSD Cluster Map (CRUSH map) – from the OSDs/Monitors
CRUSH • CRUSH maps a file to a list of servers that have the data
CRUSH • File to Object: Takes the inode (from MDS)
CRUSH • Object to Placement Group (PG): Object ID and number of PGs
Placement Group • Sets of OSDs that manage a subset of the objects • OSDs will belong to many Placement Groups • A Placement Group has R OSDs, where R is the number of replicas • An OSD will either be a Primary or a Replica • The Primary is in charge of accepting modification requests for the Placement Group • Clients write to the Primary, read from a random member of the Placement Group
CRUSH • PG to OSD: PG ID and Cluster Map (from OSD)
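The full chain from the last few slides, as a simplified runnable illustration. Real CRUSH walks a hierarchical cluster map and respects failure domains; this sketch just uses flat hashing to show how every party computes the same placement with no lookup table. The cluster map, PG count, and hash choices are assumptions for the example.

import hashlib

def file_to_object(ino, stripe_index):
    # File -> Object: the inode number (from the MDS) plus a stripe index.
    return f"{ino:x}.{stripe_index:08x}"

def object_to_pg(obj_name, pg_num):
    # Object -> PG: hash of the object name modulo the number of PGs.
    return int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % pg_num

def pg_to_osds(pg_id, cluster_map, replicas=3):
    # PG -> OSDs: a deterministic function of the PG id and the cluster map,
    # so clients and OSDs all agree on placement without asking anyone.
    h = int(hashlib.md5(f"pg-{pg_id}".encode()).hexdigest(), 16)
    return [cluster_map[(h + i) % len(cluster_map)] for i in range(replicas)]

cluster_map = [f"osd{i}" for i in range(8)]        # assumed tiny cluster
obj = file_to_object(ino=0x12345, stripe_index=0)
pg = object_to_pg(obj, pg_num=128)
print(f"{obj} -> pg {pg} -> {pg_to_osds(pg, cluster_map)}")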
CRUSH • Now we know where to write the data / read the data • Now how do we safely handle replication and node failures?
Replication • Replicates to nodes also in the Placement Group
Replication • Writes go to the placement group primary (found with the CRUSH function)
Replication • The primary OSD replicates to the other OSDs in the Placement Group
Replication • The update is committed only after the slowest (longest) replica write has been applied
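A toy sketch of this primary-copy scheme, with stand-in classes rather than Ceph's actual OSD code: the client writes only to the PG's primary, the primary fans the update out to the replicas, and the acknowledgement waits for the slowest replica to apply it.

class ToyOSD:
    def __init__(self, name):
        self.name = name
        self.store = {}                     # object name -> (version, data)

    def apply_update(self, obj, version, data):
        self.store[obj] = (version, data)
        return True                         # "applied" acknowledgement

class ToyPrimary(ToyOSD):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas            # the other OSDs in this PG

    def client_write(self, obj, version, data):
        self.apply_update(obj, version, data)
        # Forward to every replica; only ack the client once all have applied,
        # i.e. after the longest (slowest) replica write has finished.
        return all(r.apply_update(obj, version, data) for r in self.replicas)

pg = ToyPrimary("osd3", replicas=[ToyOSD("osd7"), ToyOSD("osd1")])
print("ack to client:", pg.client_write("12345.00000000", version=1, data=b"hello"))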
Failure Detection • Each autonomic OSD looks after the other nodes in its Placement Groups (possibly many!) • Monitors keep a cluster map (used in CRUSH) • Multiple monitors keep an eye on the cluster configuration and dole out cluster maps
Recovery & Updates • Recovery is handled entirely between OSDs • OSDs have two offline states, Down and Out • Down: the node could come back; its Primary role for each PG is handed off • Out: the node will not come back; its data is re-replicated
Recovery & Updates • Each object has a version number • When an OSD comes back up, it checks the version numbers of its Placement Groups to see if they are current • Then checks the version numbers of objects to see which need updating
Ceph Components • Ordered: Clients, Metadata, Object Storage (Physical)
Object Storage • The underlying filesystem can make or break a distributed one • Filesystems have different characteristics • Example: ReiserFS is good at small files • XFS is good at REALLY big files • Ceph keeps a lot of attributes on its inodes, so it needs a filesystem that can handle extended attributes (xattrs)
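As a concrete example of the extended-attribute support in question, this is a plain Linux xattr call (generic, not Ceph's own storage code); the attribute name and value here are made up. ext3/4, XFS, and Btrfs support user xattrs, which is one reason the choice of backend filesystem matters.

import os, tempfile

# Requires Linux and a filesystem with user xattr support (ext4, XFS, Btrfs, ...).
with tempfile.NamedTemporaryFile(dir="/var/tmp", delete=False) as f:
    path = f.name
os.setxattr(path, "user.demo.placement", b"pg=1.2f,replicas=3")   # hypothetical attribute
print(os.getxattr(path, "user.demo.placement"))
os.remove(path)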
Object Storage • Ceph can run on normal file systems, but slowly • XFS, ext3/4, … • Created its own filesystem to handle Ceph's special object requirements • EBOFS – Extent and B-Tree based Object File System
Object Storage • Important to note that development of EBOFS has ceased • Ceph can still run on any normal filesystem (I have it running on ext4) • Running on Btrfs is hugely recommended