Ceph: A Scalable, High-Performance Distributed File System Derek Weitzel
In the Before… • Let's go back through some of the notable distributed file systems used in HPC
In the Before… • There were distributed filesystems like: • Lustre – RAID over storage boxes • Recovery time after a node failure was MASSIVE! (An entire server's contents had to be copied, one to one) • When functional, reading/writing EXTREMELY fast • Used heavily in HPC
In the Before… • There were distributed filesystems like: • NFS – Network File System • Does this really count as distributed? • Single large server • Full POSIX support, in kernel since…forever • Slow with even a moderate number of clients • Dead simple
In the Current… • There are distributed filesystems like: • Hadoop – Apache Project inspired by Google • Massive throughput • Throughput scales with attached HDs • Have seen VERY LARGE production clusters • Facebook, Yahoo… Nebraska • Doesn’t even pretend to be POSIX
In the Current… • There are distributed filesystems like: • GPFS (IBM) / Panasas – Proprietary file systems • Require a closed-source kernel driver • Not flexible with the newest kernels / OSs • Good: Strong support and large communities • Can be treated as a black box by administrators • HUGE installations (Panasas at LANL is HUGE!!!!)
Motivation • Ceph is an emerging technology for production clustered environments • Designed for: • Performance – Data striped over data servers • Reliability – No single point of failure • Scalability – Adaptable metadata cluster
Timeline • 2006 – Ceph paper written • 2007 – Sage Weil earned his PhD, largely for his work on Ceph • 2007 – 2010 Development continued, primarily for DreamHost • March 2010 – Linus merged the Ceph client into the mainline 2.6.34 kernel • No more patches needed for clients
Adding Ceph to Mainline Kernel • Huge development! • Significantly lowered cost to deploy Ceph • For production environments, it was a little too late – 2.6.32 was the stable kernel used in RHEL 6 (CentOS 6, SL 6, Oracle 6).
Let's talk about the paper. Then I'll show a quick demo
Ceph Overview • Decoupled data and metadata • IO directly with object servers • Dynamic distributed metadata management • Multiple metadata servers handling different directories (subtrees) • Reliable autonomic distributed storage • OSDs manage themselves by replicating and monitoring
Decoupled Data and Metadata • Increases performance by limiting interaction between clients and servers • Decoupling is common in distributed filesystems: Hadoop, Lustre, Panasas… • In contrast to other filesystems, Ceph uses a function to calculate block locations instead of storing them
Dynamic Distributed Metadata Management • Metadata is split among a cluster of servers • Distribution of metadata changes with the number of requests to even the load among metadata servers • Metadata servers can also quickly recover from failures by taking over a failed neighbor's data • Improves performance by leveling the metadata load
Reliable Autonomic Distributed Storage • Data storage servers act on events by themselves • Initiate replication and recovery on their own • Improves performance by offloading decision making to the many data servers • Improves reliability by removing central control of the cluster (a single point of failure)
Ceph Components • Some quick definitions before getting into the paper • MDS – Metadata Server • OSD – Object Storage Device (the data servers) • MON – Monitor (now fully implemented)
Ceph Components • Ordered: Clients, Metadata, Object Storage
Client Overview • Can be a FUSE mount • File system in userspace • Introduced so file systems can use a better interface than the Linux kernel VFS (virtual file system) • Can link directly to the Ceph library • Built into the newest OSs
Client Overview – File IO • 1. Asks the MDS for the inode information
Client Overview – File IO • 2. Responds with the inode information
Client Overview – File IO • 3. Client Calculates data location with CRUSH
Client Overview – File IO • 4. Client reads directly off storage nodes
Client Overview – File IO • Client asks the MDS for a small amount of information • Performance: Small bandwidth between client and MDS • Performance: Small cache (memory) needed on the MDS since the data is small • Client calculates the file location using a function • Reliability: Saves the MDS from keeping block locations • Function described in the data storage section
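Putting the four steps together, here is a minimal runnable sketch in Python. Everything in it (FakeMDS, locate, read_plan, the 4 MiB object size, the object naming scheme) is an illustrative assumption, not the real Ceph client API; locate is only a stand-in for CRUSH, which the paper covers in the data storage section.

import hashlib

class FakeMDS:
    """Stands in for steps 1-2: the MDS returns inode metadata only,
    never a list of block locations."""
    def lookup(self, path):
        return {"ino": int(hashlib.md5(path.encode()).hexdigest()[:8], 16),
                "object_size": 4 << 20}   # 4 MiB objects, an assumed default

def locate(obj_name, cluster_osds, replicas=3):
    """Placeholder for CRUSH (step 3): a deterministic hash of the object
    name and the cluster map -- no server is asked where the data lives."""
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return [cluster_osds[(h + i) % len(cluster_osds)] for i in range(replicas)]

def read_plan(path, offset, length, mds, cluster_osds):
    inode = mds.lookup(path)                          # one small MDS round trip
    size = inode["object_size"]
    plan = []
    for stripe in range(offset // size, (offset + length - 1) // size + 1):
        obj_name = f"{inode['ino']:x}.{stripe:08x}"   # assumed naming scheme
        # Step 4: the client would now read each object directly from its OSDs.
        plan.append((obj_name, locate(obj_name, cluster_osds)))
    return plan

print(read_plan("/home/derek/data.bin", 0, 10 << 20,
                FakeMDS(), [f"osd{i}" for i in range(8)]))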
Ceph Components • Ordered: Clients, Metadata, Object Storage
Client Overview – Namespace • Optimized for the common case, 'ls -l' • Directory listing immediately followed by a stat of each file • Reading a directory gives all inodes in the directory • Namespace covered in detail next!
$ ls -l
total 0
drwxr-xr-x 4 dweitzel swanson  63 Aug 15  2011 apache
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-api-java
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-common
drwxr-xr-x 7 dweitzel swanson 103 Jan 18 16:37 bestman2
drwxr-xr-x 6 dweitzel swanson  75 Jan 18 12:25 buildsys-macros
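To illustrate why bundling inodes with the listing matters, here is a hypothetical sketch (not the real MDS protocol): the readdir reply carries each entry's inode attributes, so the per-file stat() calls that 'ls -l' issues afterwards are answered from the client's cache with no extra network round trips.

from dataclasses import dataclass

@dataclass
class InodeAttrs:            # illustrative subset of inode attributes
    ino: int
    mode: int
    size: int

class ClientDirCache:
    """Caches the inode attributes that arrive bundled with a readdir reply."""
    def __init__(self):
        self._attrs = {}

    def readdir(self, mds_reply):
        # One round trip: remember every bundled inode, return just the names.
        for name, attrs in mds_reply:
            self._attrs[name] = attrs
        return [name for name, _ in mds_reply]

    def stat(self, name):
        # The stat that 'ls -l' does per file never touches the network.
        return self._attrs[name]

reply = [("apache", InodeAttrs(101, 0o755, 63)),
         ("bestman2", InodeAttrs(102, 0o755, 103))]
cache = ClientDirCache()
for name in cache.readdir(reply):
    print(name, oct(cache.stat(name).mode), cache.stat(name).size)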
Metadata Overview • Metadata servers (MDS) serve out the file system attributes and directory structure • Metadata is stored in the distributed filesystem beside the data • Compare this to Hadoop, where metadata is stored only on the head nodes • Updates are staged in a journal, flushed occasionally to the distributed file system
MDS Subtree Partitioning • In HPC applications, it is common to have 'hot' metadata that is needed by many clients • In order to be scalable, Ceph needs to distribute metadata requests among many servers • Each MDS monitors the frequency of queries using special counters • The MDSs compare counters with each other and split the directory tree to evenly spread the load
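A toy sketch of that counter idea, under assumed simplifications (the real MDS weighs its counters with time decay and migrates whole subtrees between MDS ranks): each MDS counts requests per top-level subtree, and a rebalance hands the hottest subtree of the busiest MDS to the least loaded one. All names and the 2x threshold are made up for illustration.

from collections import Counter

class ToyMDS:
    def __init__(self, name):
        self.name = name
        self.subtree_hits = Counter()      # requests per subtree since last rebalance

    def handle(self, path):
        subtree = "/" + path.strip("/").split("/")[0]
        self.subtree_hits[subtree] += 1    # the 'special counter' for this subtree

    def load(self):
        return sum(self.subtree_hits.values())

def rebalance(cluster):
    busiest = max(cluster, key=lambda m: m.load())
    idlest = min(cluster, key=lambda m: m.load())
    # Arbitrary threshold for the sketch: only move work for a 2x imbalance.
    if busiest is not idlest and busiest.load() > 2 * idlest.load():
        subtree, hits = busiest.subtree_hits.most_common(1)[0]
        del busiest.subtree_hits[subtree]
        idlest.subtree_hits[subtree] = hits
        print(f"migrating {subtree}: {busiest.name} -> {idlest.name}")

cluster = [ToyMDS("mds0"), ToyMDS("mds1")]
for _ in range(100):
    cluster[0].handle("/hot-job/output.dat")   # a directory hammered by many clients
for _ in range(10):
    cluster[0].handle("/home/derek")
rebalance(cluster)                             # hands /hot-job to mds1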
MDS Subtree Partitioning • Multiple MDS split the metadata • Clients will receive metadata partition data from the MDS during a request
MDS Subtree Partitioning • Busy directories (multiple creates or opens) will be hashed across multiple MDS’s
MDS Subtree Partitioning • Clients will read from a random replica • Updates go to the primary MDS for the subtree
Ceph Components • Ordered: Clients, Metadata, Object Storage
Data Placement • Need a way to evenly distribute data among storage devices (OSDs) • Increased performance from even data distribution • Increased resiliency: losing any one node minimally affects the status of the cluster when data is evenly distributed • Problem: Don't want to keep data locations in the metadata servers • Requires lots of memory when there are many data blocks
CRUSH • CRUSH is a pseudo-random function to find the location of data in a distributed filesystem • Summary: Take a little information and plug it into a globally known (hash-like) function to find where the data is stored • Input data is: • inode number – from the MDS • OSD Cluster Map (CRUSH map) – from the OSDs/Monitors
CRUSH • CRUSH maps a file to a list of servers that have the data
CRUSH • File to Object: Takes the inode (from MDS)
CRUSH • Object to Placement Group (PG): Object ID and number of PGs
Placement Group • Sets of OSDs that manage a subset of the objects • OSDs will belong to many Placement Groups • A Placement Group has R OSDs, where R is the number of replicas • An OSD will either be a Primary or a Replica • The Primary is in charge of accepting modification requests for the Placement Group • Clients write to the Primary, read from a random member of the Placement Group
CRUSH • PG to OSD: PG ID and Cluster Map (from OSD)
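The full chain from the last few slides, as a simplified runnable illustration. Real CRUSH walks a hierarchical cluster map and respects failure domains; this sketch just uses flat hashing to show how every party computes the same placement with no lookup table. The cluster map, PG count, and hash choices are assumptions for the example.

import hashlib

def file_to_object(ino, stripe_index):
    # File -> Object: the inode number (from the MDS) plus a stripe index.
    return f"{ino:x}.{stripe_index:08x}"

def object_to_pg(obj_name, pg_num):
    # Object -> PG: hash of the object name modulo the number of PGs.
    return int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % pg_num

def pg_to_osds(pg_id, cluster_map, replicas=3):
    # PG -> OSDs: a deterministic function of the PG id and the cluster map,
    # so clients and OSDs all agree on placement without asking anyone.
    h = int(hashlib.md5(f"pg-{pg_id}".encode()).hexdigest(), 16)
    return [cluster_map[(h + i) % len(cluster_map)] for i in range(replicas)]

cluster_map = [f"osd{i}" for i in range(8)]        # assumed tiny cluster
obj = file_to_object(ino=0x12345, stripe_index=0)
pg = object_to_pg(obj, pg_num=128)
print(f"{obj} -> pg {pg} -> {pg_to_osds(pg, cluster_map)}")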
CRUSH • Now we know where to write the data / read the data • Now how do we safely handle replication and node failures?
Replication • Replicates to nodes also in the Placement Group
Replication • Writes go to the placement group primary (found with the CRUSH function)
Replication • The primary OSD replicates to the other OSDs in the Placement Group
Replication • The update is committed only after the slowest (longest) replica write has been applied
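A toy sketch of this primary-copy scheme, with stand-in classes rather than Ceph's actual OSD code: the client writes only to the PG's primary, the primary fans the update out to the replicas, and the acknowledgement waits for the slowest replica to apply it.

class ToyOSD:
    def __init__(self, name):
        self.name = name
        self.store = {}                     # object name -> (version, data)

    def apply_update(self, obj, version, data):
        self.store[obj] = (version, data)
        return True                         # "applied" acknowledgement

class ToyPrimary(ToyOSD):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas            # the other OSDs in this PG

    def client_write(self, obj, version, data):
        self.apply_update(obj, version, data)
        # Forward to every replica; only ack the client once all have applied,
        # i.e. after the longest (slowest) replica write has finished.
        return all(r.apply_update(obj, version, data) for r in self.replicas)

pg = ToyPrimary("osd3", replicas=[ToyOSD("osd7"), ToyOSD("osd1")])
print("ack to client:", pg.client_write("12345.00000000", version=1, data=b"hello"))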
Failure Detection • Each autonomic OSD looks after the other nodes in its Placement Groups (possibly many!) • Monitors keep a cluster map (used in CRUSH) • Multiple monitors keep an eye on the cluster configuration and dole out cluster maps
Recovery & Updates • Recovery is handled entirely between OSDs • OSDs have two offline states, Down and Out • Down: the node could come back; its Primary role for each PG is handed off • Out: the node will not come back; its data is re-replicated
Recovery & Updates • Each object has a version number • When an OSD comes back up, it checks the version numbers of its Placement Groups to see if they are current • Then checks the version numbers of objects to see which need updating
Ceph Components • Ordered: Clients, Metadata, Object Storage (Physical)
Object Storage • The underlying filesystem can make or break a distributed one • Filesystems have different characteristics • Example: ReiserFS is good at small files • XFS is good at REALLY big files • Ceph keeps a lot of attributes on its inodes, so it needs a filesystem that can handle extended attributes (xattrs)
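As a concrete example of the extended-attribute support in question, this is a plain Linux xattr call (generic, not Ceph's own storage code); the attribute name and value here are made up. ext3/4, XFS, and Btrfs support user xattrs, which is one reason the choice of backend filesystem matters.

import os, tempfile

# Requires Linux and a filesystem with user xattr support (ext4, XFS, Btrfs, ...).
with tempfile.NamedTemporaryFile(dir="/var/tmp", delete=False) as f:
    path = f.name
os.setxattr(path, "user.demo.placement", b"pg=1.2f,replicas=3")   # hypothetical attribute
print(os.getxattr(path, "user.demo.placement"))
os.remove(path)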
Object Storage • Ceph can run on normal file systems, but slowly • XFS, ext3/4, … • Created its own filesystem to handle Ceph's special object requirements • EBOFS – Extent and B-Tree based Object File System
Object Storage • Important to note that development of EBOFS has ceased • Ceph can still run on any normal filesystem (I have it running on ext4) • Running on Btrfs is hugely recommended