
Ceph: A Scalable, High-Performance Distributed File System


Presentation Transcript


  1. Ceph: A Scalable, High-Performance Distributed File System Derek Weitzel

  2. In the Before… • Let's go back through some of the notable distributed file systems used in HPC

  3. In the Before… • There were distributed filesystems like: • Lustre – RAID over storage boxes • Recovery time after a node failure was MASSIVE! (An entire server's contents had to be copied, one to one) • When functional, reading/writing is EXTREMELY fast • Used heavily in HPC

  4. In the Before… • There were distributed filesystems like: • NFS – Network File System • Does this really count as distributed? • Single large server • Full POSIX support, in kernel since…forever • Slow with even a moderate number of clients • Dead simple

  5. In the Current… • There are distributed filesystems like: • Hadoop – Apache project inspired by Google's GFS • Massive throughput • Throughput scales with attached hard drives • Have seen VERY LARGE production clusters • Facebook, Yahoo… Nebraska • Doesn't even pretend to be POSIX

  6. In the Current… • There are distributed filesystems like: • GPFS (IBM) / Panasas – Proprietary file systems • Require a closed-source kernel driver • Not flexible with the newest kernels / OSes • Good: good support and large communities • Can be treated as a black box by administrators • HUGE installations (Panasas at LANL is HUGE!!!!)

  7. Motivation • Ceph is an emerging technology for production clustered environments • Designed for: • Performance – Data striped over the data servers • Reliability – No single point of failure • Scalability – Adaptable metadata cluster

  8. Timeline • 2006 – Ceph paper written • 2007 – Sage Weil earned his PhD largely from work on Ceph • 2007 – 2010 Development continued, primarily for DreamHost • March 2010 – Linus merged the Ceph client into the mainline 2.6.34 kernel • No more patches needed for clients

  9. Adding Ceph to Mainline Kernel • Huge development! • Significantly lowered cost to deploy Ceph • For production environments, it was a little too late – 2.6.32 was the stable kernel used in RHEL 6 (CentOS 6, SL 6, Oracle 6).

  10. Let's talk about the paper, then I'll show a quick demo

  11. Ceph Overview • Decoupled data and metadata • I/O goes directly to the object servers • Dynamic distributed metadata management • Multiple metadata servers handle different directories (subtrees) • Reliable autonomic distributed storage • OSDs manage themselves by replicating and monitoring

  12. Decoupled Data and Metadata • Increases performance by limiting interaction between clients and servers • Decoupling is common in distributed filesystems: Hadoop, Lustre, Panasas… • In contrast to other filesystems, Ceph uses a function to calculate object locations rather than looking them up

  13. Dynamic Distributed Metadata Management • Metadata is split among a cluster of servers • The distribution of metadata changes with the number of requests to even the load among metadata servers • Metadata servers can also quickly recover from failures by taking over a failed neighbor's data • Improves performance by leveling the metadata load

  14. Reliable Autonomic Distributed Storage • Data storage servers act on events by themselves • They initiate replication and recovery on their own • Improves performance by offloading decision making to the many data servers • Improves reliability by removing central control of the cluster (a single point of failure)

  15. Ceph Components • Some quick definitions before getting into the paper • MDS – Metadata Server • OSD – Object Storage Device (the data servers) • MON – Monitor (now fully implemented)

  16. Ceph Components • Ordered: Clients, Metadata, Object Storage

  17. Ceph Components • Ordered: Clients, Metadata, Object Storage

  18. Client Overview • Can be a FUSE mount • FUSE: Filesystem in Userspace • Introduced so filesystems can use a better interface than the Linux kernel VFS (virtual file system) • Can link directly to the Ceph library • Built into the newest OSes.

  19. Client Overview – File IO • 1. Asks the MDS for the inode information

  20. Client Overview – File IO • 2. Responds with the inode information

  21. Client Overview – File IO • 3. Client Calculates data location with CRUSH

  22. Client Overview – File IO • 4. Client reads directly off storage nodes

  23. Client Overview – File IO • The client asks the MDS for only a small amount of information • Performance: Small bandwidth between client and MDS • Performance: Small cache (memory) suffices because so little data is returned • The client calculates the file location using a function • Reliability: Saves the MDS from keeping block locations • The function is described in the data storage section
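To make the four-step read path above concrete, here is a minimal, self-contained Python sketch of the idea (the FakeMDS class, the crush_like_locate helper, and the data layout are invented for illustration; this is not the real Ceph client API):

```python
import hashlib

class FakeMDS:
    """Illustrative MDS stand-in: returns only an inode number, never block locations."""
    def __init__(self, namespace):
        self.namespace = namespace                  # path -> inode number
    def lookup(self, path):
        return self.namespace[path]                 # small request, small reply

def crush_like_locate(ino, num_osds, replicas=2):
    """Stand-in for CRUSH: deterministically map an inode to a list of OSD ids."""
    h = int(hashlib.md5(str(ino).encode()).hexdigest(), 16)
    return [(h + i) % num_osds for i in range(replicas)]

def read_file(path, mds, osd_data, num_osds):
    ino = mds.lookup(path)                          # steps 1-2: one small MDS round trip
    osds = crush_like_locate(ino, num_osds)         # step 3: location computed locally
    return osd_data[osds[0]][ino]                   # step 4: read directly from an OSD

# Usage
mds = FakeMDS({"/home/derek/data.txt": 42})
osd_data = {i: {} for i in range(8)}
osd_data[crush_like_locate(42, 8)[0]][42] = b"hello ceph"
print(read_file("/home/derek/data.txt", mds, osd_data, 8))  # b'hello ceph'
```

The point of the sketch is the shape of the exchange: the MDS reply is tiny, and the client never has to ask anyone where the data lives.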

  24. Ceph Components • Ordered: Clients, Metadata, Object Storage

  25. Client Overview – Namespace • Optimized for the common case, 'ls -l' • A directory listing is immediately followed by a stat of each file • Reading a directory gives all inodes in the directory • Namespace covered in detail next!
$ ls -l
total 0
drwxr-xr-x 4 dweitzel swanson  63 Aug 15  2011 apache
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-api-java
drwxr-xr-x 5 dweitzel swanson  42 Jan 18 11:15 argus-pep-common
drwxr-xr-x 7 dweitzel swanson 103 Jan 18 16:37 bestman2
drwxr-xr-x 6 dweitzel swanson  75 Jan 18 12:25 buildsys-macros
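A small Python sketch of why this matters (the MDSDirectory class is hypothetical): if the directory reply already carries the inode attributes, every stat during an 'ls -l' can be answered from the client's cache with no further MDS round trips.

```python
class MDSDirectory:
    """Hypothetical directory object whose listing already includes inode attributes."""
    def __init__(self, entries):
        self.entries = entries                      # name -> inode attributes

    def readdir(self):
        # One reply carries the names *and* their inodes.
        return dict(self.entries)

def ls_l(mds_dir):
    cache = mds_dir.readdir()                       # single MDS request
    for name, inode in sorted(cache.items()):
        # Each stat() is served from the cached inode, not another MDS call.
        print(f"{inode['mode']} {inode['size']:>4} {name}")

ls_l(MDSDirectory({
    "apache":   {"mode": "drwxr-xr-x", "size": 63},
    "bestman2": {"mode": "drwxr-xr-x", "size": 103},
}))
```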

  26. Metadata Overview • Metadata servers (MDS) serve out the file system attributes and directory structure • Metadata is stored in the distributed filesystem beside the data • Compare this to Hadoop, where metadata is stored only on the head nodes • Updates are staged in a journal and occasionally flushed to the distributed file system

  27. MDS Subtree Partitioning • In HPC applications, it is common to have 'hot' metadata that is needed by many clients • In order to be scalable, Ceph needs to distribute metadata requests among many servers • Each MDS monitors the frequency of queries using special counters • The MDSs compare their counters with each other and split the directory tree to evenly spread the load

  28. MDS Subtree Partitioning • Multiple MDS split the metadata • Clients will receive metadata partition data from the MDS during a request

  29. MDS Subtree Partitioning • Busy directories (multiple creates or opens) will be hashed across multiple MDS’s
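A minimal sketch of the hashing idea for hot directories (the routing function, MDS count, and hot-directory set are all assumptions; the real MDS load balancer is far more involved):

```python
import hashlib

NUM_MDS = 4                                          # assumed cluster size

def hashed_subtree_owner(dirname):
    """Simplified subtree partitioning: the top path component picks the MDS."""
    top = dirname.strip("/").split("/")[0] or "/"
    return int(hashlib.sha1(top.encode()).hexdigest(), 16) % NUM_MDS

def mds_for(dirname, filename, hot_dirs):
    """Hot directories are hashed across all MDSs; everything else stays
    with the MDS that owns the enclosing subtree."""
    if dirname in hot_dirs:
        h = hashlib.sha1(f"{dirname}/{filename}".encode()).hexdigest()
        return int(h, 16) % NUM_MDS                  # spreads creates/opens across MDSs
    return hashed_subtree_owner(dirname)

# A burst of creates in one busy directory lands on several MDSs:
for f in ("part-0001", "part-0002", "part-0003"):
    print(f, "->", mds_for("/scratch/output", f, hot_dirs={"/scratch/output"}))
```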

  30. MDS Subtree Partitioning • Clients will read from a random replica • Updates go to the primary MDS for the subtree

  31. Ceph Components • Ordered: Clients, Metadata, Object Storage

  32. Data Placement • Need a way to evenly distribute data among storage devices (OSDs) • Increased performance from an even data distribution • Increased resiliency: With an even distribution, losing any one node only minimally affects the state of the cluster • Problem: We don't want to keep data locations in the metadata servers • That would require lots of memory when there are lots of data blocks

  33. CRUSH • CRUSH is a pseudo-random function for finding the location of data in a distributed filesystem • Summary: Take a little information and plug it into a globally known, hash-like function to find where the data is stored • The input data is: • inode number – from the MDS • OSD cluster map (CRUSH map) – from the OSDs/monitors

  34. CRUSH • CRUSH maps a file to a list of servers that have the data

  35. CRUSH • File to Object: Takes the inode (from MDS)

  36. CRUSH • Object to Placement Group (PG): Uses the object ID and the number of PGs

  37. Placement Group • A set of OSDs that manages a subset of the objects • Each OSD will be in many Placement Groups • Each Placement Group has R OSDs, where R is the number of replicas • Within a PG, an OSD is either the Primary or a Replica • The Primary is in charge of accepting modification requests for the Placement Group • Clients write to the Primary and read from a random member of the Placement Group

  38. CRUSH • PG to OSD: PG ID and Cluster Map (from OSD)
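To tie the last few slides together, here is a minimal Python sketch of the whole chain (file → objects → placement group → OSDs). The hash choices, stripe size, and cluster_map layout are assumptions for illustration, not Ceph's actual CRUSH algorithm:

```python
import hashlib

def file_to_objects(ino, size, stripe_unit=4 << 20):
    """A file is striped into objects identified by (inode number, stripe index)."""
    return [(ino, i) for i in range((size + stripe_unit - 1) // stripe_unit)]

def object_to_pg(obj, pg_num):
    """The object ID is hashed onto one of pg_num placement groups."""
    oid = f"{obj[0]}.{obj[1]:08x}"
    return int(hashlib.sha1(oid.encode()).hexdigest(), 16) % pg_num

def pg_to_osds(pg_id, cluster_map, replicas=3):
    """Pseudo-random but deterministic: same PG id + same map -> same OSD list."""
    ranked = sorted(cluster_map["osds"],
                    key=lambda o: hashlib.sha1(f"{pg_id}:{o}".encode()).hexdigest())
    return ranked[:replicas]                         # first OSD acts as the PG primary

cluster_map = {"osds": list(range(12))}              # assumed: 12 OSDs in the map
for obj in file_to_objects(ino=42, size=10 << 20):   # a 10 MB file -> 3 objects
    pg = object_to_pg(obj, pg_num=128)
    print(obj, "-> PG", pg, "-> OSDs", pg_to_osds(pg, cluster_map))
```

Any client holding the inode number and the current cluster map computes the same answer, which is exactly why the MDS never has to store block locations.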

  39. CRUSH • Now we know where to write the data / read the data • Now how do we safely handle replication and node failures?

  40. Replication • Data is replicated to the other nodes in the Placement Group

  41. Replication • Write to the placement group primary (found with the CRUSH function).

  42. Replication • Primary OSD replicates to other OSD’s in the Placement Group

  43. Replication • The update is committed only after the slowest replica has applied it
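A short, self-contained Python sketch of this primary-copy write path (the OSD/Primary classes and version counter are invented for illustration): the client talks only to the Primary, and the Primary acknowledges the write only after every replica has applied it.

```python
class OSD:
    def __init__(self, name):
        self.name, self.store = name, {}
    def apply(self, oid, data, version):
        self.store[oid] = (version, data)            # replica applies the update
        return True                                  # ack back to the primary

class Primary(OSD):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas, self.version = replicas, 0
    def client_write(self, oid, data):
        self.version += 1
        self.apply(oid, data, self.version)          # apply locally first
        # Forward to every replica and wait for all acks (the slowest one gates the commit).
        acks = [r.apply(oid, data, self.version) for r in self.replicas]
        return all(acks)                             # only now acknowledge the client

replicas = [OSD("osd.1"), OSD("osd.2")]
primary = Primary("osd.0", replicas)
assert primary.client_write("42.00000000", b"new contents")
```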

  44. Failure Detection • Each autonomic OSD looks after the other nodes in its Placement Groups (possibly many!) • Monitors keep a cluster map (used in CRUSH) • Multiple monitors keep an eye on the cluster configuration and dole out cluster maps.
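A sketch of this peer-based failure detection (the grace period, class names, and reporting path are assumptions, not Ceph's actual defaults): OSDs notice that a peer in one of their placement groups has gone quiet and report it, and the monitors then publish a new cluster map.

```python
import time

HEARTBEAT_GRACE = 20                                 # seconds; an assumed value

class Monitor:
    def __init__(self):
        self.cluster_map_epoch = 1
        self.down = set()
    def report_failure(self, osd_id):
        if osd_id not in self.down:
            self.down.add(osd_id)
            self.cluster_map_epoch += 1              # a new cluster map is then handed out

def check_peers(last_heartbeat, monitor, now=None):
    """last_heartbeat: osd id -> timestamp of the last heartbeat seen from that peer."""
    now = now or time.time()
    for osd_id, seen in last_heartbeat.items():
        if now - seen > HEARTBEAT_GRACE:
            monitor.report_failure(osd_id)           # peers notice first, not the monitors

mon = Monitor()
check_peers({"osd.3": time.time() - 60, "osd.4": time.time()}, mon)
print(mon.down, mon.cluster_map_epoch)               # {'osd.3'} 2
```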

  45. Recovery & Updates • Recovery is handled entirely between OSDs • OSDs have two offline states, Down and Out. • Down means the node could come back; the Primary role for its PGs is handed off • Out means the node will not come back; its data is re-replicated.

  46. Recovery & Updates • Each object has a version number • Upon coming back up, an OSD checks the version numbers of its Placement Groups to see if they are current • It then checks the version numbers of objects to see which need updates
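A minimal sketch of that version comparison during recovery (the data layout and function are invented for illustration): the returning OSD skips any placement group whose versions already match and refetches only the objects that are missing or stale.

```python
def recover(local_pgs, primary_pgs):
    """local_pgs / primary_pgs: pg_id -> {object_id: version}"""
    to_fetch = []
    for pg_id, primary_objs in primary_pgs.items():
        local_objs = local_pgs.get(pg_id, {})
        if local_objs == primary_objs:
            continue                                  # PG already current, nothing to do
        for oid, version in primary_objs.items():
            if local_objs.get(oid, -1) < version:
                to_fetch.append((pg_id, oid))         # object is missing or stale
    return to_fetch

print(recover(local_pgs={7: {"a": 3, "b": 5}},
              primary_pgs={7: {"a": 4, "b": 5, "c": 1}}))   # [(7, 'a'), (7, 'c')]
```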

  47. Ceph Components • Ordered: Clients, Metadata, Object Storage (Physical)

  48. Object Storage • The underlying filesystem can make or break a distributed one • Filesystems have different characteristics • Example: ReiserFS is good at small files • XFS is good at REALLY big files • Ceph keeps a lot of attributes on its inodes, so it needs a filesystem that can handle extended attributes (xattrs).
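To illustrate why xattr support matters, here is a tiny Linux-only Python example that stores per-object metadata as user.* extended attributes next to the object data, the way an OSD backend keeps object attributes on its local filesystem (the file name and attribute names are made up; the backing filesystem must support xattrs, e.g. ext4, XFS, or Btrfs):

```python
import os

obj_path = "./42.00000000_demo-object"               # made-up object file name
with open(obj_path, "wb") as f:
    f.write(b"object payload")

# Attach object metadata as extended attributes instead of a separate database.
os.setxattr(obj_path, "user.ceph.version", b"7")
os.setxattr(obj_path, "user.ceph.pg", b"0x2a")

print(os.getxattr(obj_path, "user.ceph.version"))    # b'7'
```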

  49. Object Storage • Ceph can run on normal file systems, but slowly • XFS, ext3/4, … • The authors created their own filesystem to handle Ceph's special object requirements • EBOFS – Extent and B-tree based Object File System

  50. Object Storage • Important to note that development of EBOFS has ceased • Ceph can still run on any normal filesystem (I have it running on ext4) • Running on Btrfs is hugely recommended
