
Parallel Filesystems for high-performance I/O: GPFS and Lustre

Vincenzo Vagnoni, INFN Bologna. CNAF Bologna, 21st March 2006.


Presentation Transcript


1. Parallel Filesystems for high-performance I/O: GPFS and Lustre
Vincenzo Vagnoni, INFN Bologna
CNAF Bologna, 21st March 2006

2. Parallel File Systems at the INFN Tier-1: early studies in 2005
• Evaluation and comparison of parallel file systems (GPFS, Lustre) for the implementation of a powerful disk I/O infrastructure for the Tier-1 at CNAF.
• A moderately high-end testbed was used for this study:
• 6 IBM xSeries 346 file servers connected via a FC SAN to 3 IBM FAStT 900 (DS4500) controllers providing a total of 24 TB.
• 500 computing nodes (temporarily allocated) acting as clients.
• Maximum available throughput from the servers to the client nodes in this study, with 6 Gigabit Ethernet cards (one per server): 6 Gb/s.
• PHASE 1: generic tests.
• PHASE 2: realistic physics analysis jobs reading data from the (not locally mounted) parallel file system.
• Dedicated tools for the tests (PHASE 1) and for monitoring have been developed:
• The benchmarking tool allows the user to start, stop and monitor the tests on all the clients from a single user interface.

3. Early Parallel File System Test-bed
[Diagram] 6 file system servers (IBM xSeries 346, dual Xeon, 2 GB RAM) attached through a Brocade Fibre Channel switch to 3 IBM FAStT 900 (DS4500) disk controllers; each controller hosts two 17x250 GB RAID-5 arrays of about 4 TB, each exposed as 2 LUNs, for a total of 24 TB and 12 LUNs. The servers are connected through a Gigabit Ethernet switch to 500 client nodes (dual Xeon, 2 GB RAM).

4. GPFS features
• Commercial product, initially developed by IBM for the SP systems and then ported to Linux.
• Stable, reliable, fault tolerant; indicated for the storage of critical data.
• Advanced command line interface for configuration and management.
• Easy to install, not invasive:
• Distributed as binaries in RPM packages.
• No patches to standard kernels are required (apart from small bug fixes on the server side, already included in newer kernels), just a few kernel modules for POSIX I/O to be compiled for the running kernel.
• Data and metadata striping.
• Possibility to have data and metadata redundancy:
• Expensive solution, since it requires replication of whole files; indicated for the storage of critical data.
• Data recovery for filesystem corruption available (fsck).
• Fault-tolerance features oriented to SAN, with internal health monitoring through a network heartbeat.
• In principle requires every machine in the cluster (clients and servers) to allow passwordless root authentication to every other machine (with rsh or ssh):
• If root privileges are obtained on one machine, all machines can be compromised.
• This is not a nice feature for security and looks like a quick and dirty choice made when porting the software to Linux.
• We implemented a workaround restricting the access of the clients to the servers by means of ssh forced-command wrappers.
• POSIX I/O.

5. GPFS tuning: lots of handles. The full dump of the GPFS configuration parameters used on the testbed follows (a minimal sketch for comparing such configurations across nodes is given below):
dataStructureDumpOnSGPanic 0 /tmp/mmfs assertOnStructureError 0 unmountOnDiskFail 0 writeAllDescReplicas 0 maxMBpS 150 prefetchThreads 72 worker1Threads 48 worker3Threads 8 maxFileCleaners 8 flushedDataTarget 32 flushedInodeTarget 32 maxBackgroundDeletionThreads 4 maxDataShipPoolSize 266240 prefetchTimeout 5 pctTokenMgrStorageUse 25 maxFcntlRangesPerFile 200 eventsExporterTcpPort 6668 eeWatchDogInterval 90 eeWatchDogHungThreadCutoff 60 priority 40 pindaemon 0 pinmaster stack:256K data:4096K watchdogtimeout 20 maxInodeDeallocHistory 50 maxInodeDeallocPrefetch 8 takeovertimeout 600 maxblocksize 1048576 writebehindThreshhold 524288 seqDiscardThreshhold 1048576 syncInterval 30 dmapiEnable 1 dmapiWorkerThreads 12 dmapiEventBuffers 64 dmapiEventTimeout -1 dmapiSessionFailureTimeout 0 dmapiMountTimeout 60 SANergyLease 60 SANergyLeaseLong 3600 mmapKprocs 3 IgnoreReplicaSpaceOnStat 0 enableTreeBasedQuotas 0 enableUIDremap 0 enableStatUIDremap 0 uidExpiration 36000 maxFeatureLevelAllowed 813 envVar maxSGDescIOBufSize 262144 lockTableVersion 1 readReplicaPolicy default opensslLibName libssl.so:libssl.so.0:libssl.so.4 cipherList EMPTY nsdBufSpace(% of PagePool) 30 nsdThreadsPerDisk 3 nsdMinWorkerThreads 8 nsdMaxWorkerThreads 32 tiebreakerdisks useDiskLease 1 leaseDuration 35 leaseRecoveryWait 35 leaseDMSTimeout -1 pingPeriod 2 totalPingTimeout 120 minMissedPingTimeout 8 maxMissedPingTimeout 60 wait4RVSD 0 comm_protocol TCP clusterType LC clusterName gpfs_flex.cr.cnaf.infn.it clusterId -8963710635260131391 uidDomain gpfs_flex.cr.cnaf.infn.it crashdump 0 dataStructureDump 1 /tmp/mmfs pagepool 1073741824 stealHistLen 4096 prefetchPct 20 hotListPct 10 maxBufferDescs -1 maxFilesToCache 1000 maxStatCache -1 statCacheDirPct 10 maxDiskAddrBuffs -1 sharedMemLimit 0 tokenMemLimit 268435456 maxAllocPctToCache 0 autoSgLoadBalance 0 multinode 1 tscTcpPort 1191 listenOnAllInterfaces 1 tscWorkerPool 10 tscPrimary 255.255.255.255 socketRcvBufferSize 0 socketSndBufferSize 0 asyncSocketNotify 0 allowRemoteConnections 0 allowDummyConnections 0
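As an illustration of how such a tuning dump can be kept consistent across the cluster, here is a minimal Python sketch that collects the GPFS configuration from a few nodes and reports parameters whose values differ. The host names, the mmlsconfig path and the assumption of one "parameter value" pair per output line are illustrative assumptions, not something stated on the slides.

```python
#!/usr/bin/env python
"""Minimal sketch: compare GPFS configuration parameters across nodes.
Assumes passwordless ssh to each node and that mmlsconfig prints roughly
one "parameter value" pair per line; both are assumptions."""
import subprocess

NODES = ["gpfs-server-01", "gpfs-server-02"]   # hypothetical host names

def read_config(node):
    # Run mmlsconfig remotely (standard GPFS binary path, assumed here)
    out = subprocess.run(["ssh", node, "/usr/lpp/mmfs/bin/mmlsconfig"],
                         capture_output=True, text=True, check=True).stdout
    cfg = {}
    for line in out.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            cfg[parts[0]] = parts[1].strip()
    return cfg

configs = {node: read_config(node) for node in NODES}
all_keys = set().union(*(configs[n].keys() for n in NODES))
for key in sorted(all_keys):
    values = {node: configs[node].get(key, "<missing>") for node in NODES}
    if len(set(values.values())) > 1:       # report only differing parameters
        print(key, values)
```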

6. Lustre features
• Commercial product, but open source and available as a public download for evaluation purposes.
• Aggressive commercial effort; the developers sell it as an "Intergalactic Filesystem" scalable to 10000+ nodes.
• Whether or not this is true, it is a nice product indeed.
• Stable and reliable.
• Advanced interface for configuration and management.
• Easy to install, but rather invasive:
• Distributed as binaries and sources in RPM packages.
• Requires its own Lustre patches to standard kernels on both the server and the client side, but binary distributions of patched kernels are made available.
• Possibility to have metadata redundancy and Metadata Server fault tolerance.
• Only plain striping of data across the servers is allowed at the moment:
• But CFS foresees including other mechanisms for data redundancy in the future.
• Data recovery for filesystem corruption available.
• POSIX I/O on the local mount point, but also the possibility of direct data access without a local mount point, through libc wrapping with a specific shared Lustre library to be preloaded by the application (LD_PRELOAD=liblustre.so), as sketched below:
• However, this is highly discouraged by CFS and only maintained on a best-effort basis.
• Can be used in a production environment.
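The liblustre access mode mentioned above can be pictured with the following minimal Python sketch, which launches an application with the Lustre library preloaded. The library path, the application name, the input path and the way liblustre resolves files without a local mount are all hypothetical placeholders; only the LD_PRELOAD=liblustre.so mechanism comes from the slide.

```python
#!/usr/bin/env python
"""Minimal sketch: run an application with liblustre preloaded so it can do
Lustre I/O without a locally mounted filesystem.  All paths are placeholders;
how liblustre maps file names to the Lustre namespace is not covered here."""
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/opt/lustre/lib/liblustre.so"   # hypothetical library path

# Hypothetical analysis binary and input file name.
subprocess.run(["/opt/exp/bin/analysis", "--input", "/lustre/data/file0001.dst"],
               env=env, check=True)
```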

7. Also some weak points...
• (In my opinion) the weakest point of the GPFS architecture (for our scale) at the moment is how the cluster configuration is deployed to the clients:
• Not really "clients", since every node in a GPFS cluster virtually has the same role as any other node.
• However, the configuration of the cluster (tuning parameters, topology of the cluster, addresses of server nodes, disks, etc.) has to be replicated on each node by means of ssh via a push mechanism.
• A pull mechanism is however foreseen, e.g. in case the configuration has changed while a node was down; the node can then pull the new configuration when it comes back up.
• Lustre solves the problem of deploying the cluster configuration by using an LDAP-based centralized information service:
• Much simpler; there is no reason (to me) why IBM should not provide something similar in a short time.
• The weakest point of Lustre is the necessity to install patched kernels on the servers and clients:
• Not really an issue for the server side, but it might be for the client side.
• Obviously, farm administrators are quite scared of deploying patched kernels on a production farm... I can't blame them.
• CFS announced that future releases will provide workarounds to this.

8. Parallel File System Benchmarking
• The benchmarking tools allow the user to start, stop and monitor the evolution of simultaneous read/write operations from an arbitrary number of clients, reporting the aggregated throughput at the end of the test.
• Developed as a set of scripts and C programs.
• The tool implements network bandwidth measurements by means of the netperf suite and sequential read/write with dd (a minimal orchestration sketch follows below).
• Designed to be of general use; can be reused with minimal effort for any kind of storage benchmark.
• Completely automated:
• The user does not need to install anything on the target nodes, as all the software is copied by the tool (and also removed at the end).
• The user only has to issue a few commands from the shell prompt to control everything.
• Can perform complex unattended and personalized tests by means of very simple scripts, collect and save all the results and produce plots.
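As a rough idea of what the dd-based part of the tool does, here is a minimal Python sketch that starts one sequential dd write per client over ssh and reports the aggregated throughput. Host names, mount point and file size are placeholders; the real tool is a more elaborate set of scripts and C programs and also drives netperf.

```python
#!/usr/bin/env python
"""Minimal sketch: orchestrate a simultaneous sequential-write benchmark.
Host names, mount point and sizes are placeholders; passwordless ssh to the
clients is assumed."""
import subprocess
import time

CLIENTS = ["wn-%03d" % i for i in range(1, 31)]   # hypothetical client hosts
MOUNT = "/gpfs/testfs"                            # hypothetical mount point
FILE_MB = 1024                                    # 1 GB written per client

def start_write(client):
    # Sequential write of a zeroed file with dd, fsync'ed before dd exits.
    cmd = ("dd if=/dev/zero of=%s/bench_%s bs=1M count=%d conv=fsync"
           % (MOUNT, client, FILE_MB))
    return subprocess.Popen(["ssh", client, cmd],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)

t0 = time.time()
procs = [start_write(c) for c in CLIENTS]
for p in procs:
    p.wait()
elapsed = time.time() - t0

total_gbit = len(CLIENTS) * FILE_MB * 8 / 1024.0   # total data written, in Gbit
print("aggregate write throughput: %.2f Gb/s" % (total_gbit / elapsed))
```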

9. Parallel File System Monitoring Tools
• The monitoring tools make it possible to measure the time dependence of the raw network traffic of each server with a granularity of one second.
• Following the time dependence of the I/O gives important insights and can be very important for a detailed understanding and tuning of the network and parallel filesystem operational parameters.
• The existing tools did not provide such a fine granularity, so we wrote our own, reusing work done for the LHCb online farm monitoring (consider that writing/reading one 1 GB file from a single client takes just a few seconds).
• The tool automatically produces a plot of the aggregated network traffic of the file servers for each test in PDF format.
• The network traffic data of each file server are also saved to ASCII files, in case one wants to make a detailed per-server analysis (see the per-interface sampler sketched below).
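A per-server sampler in the spirit of the tool described above can be sketched as follows: read the interface byte counters from /proc/net/dev once per second and append the instantaneous rates to an ASCII file. The interface name and output path are placeholders; the real tool additionally aggregates the servers and produces the PDF plot.

```python
#!/usr/bin/env python
"""Minimal sketch: 1-second-granularity network rate sampler for one
interface, writing timestamp, receive rate and transmit rate (Gb/s) to an
ASCII file.  Interface name and output path are placeholders."""
import time

IFACE = "eth0"
OUT = "/tmp/netrate_eth0.txt"

def rx_tx_bytes(iface):
    # /proc/net/dev: per-interface line "ifname: rx_bytes ... tx_bytes ..."
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])   # rx_bytes, tx_bytes
    raise RuntimeError("interface %s not found" % iface)

prev = rx_tx_bytes(IFACE)
with open(OUT, "a") as out:
    while True:                                  # sample until interrupted
        time.sleep(1)
        cur = rx_tx_bytes(IFACE)
        rx_rate = (cur[0] - prev[0]) * 8 / 1e9   # Gb/s received in last second
        tx_rate = (cur[1] - prev[1]) * 8 / 1e9   # Gb/s sent in last second
        out.write("%d %.3f %.3f\n" % (time.time(), rx_rate, tx_rate))
        out.flush()
        prev = cur
```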

10. PHASE 1: Generic Benchmark
• Sequential read/write from a variable number of clients simultaneously performing I/O with different protocols (native GPFS/Lustre, RFIO over GPFS/Lustre, NFS over GPFS).
• 1 to 30 Gigabit Ethernet clients, 1 to 4 processes per client.
• Sequential write/read of zeroed files by means of dd.
• File sizes ranging from 1 MB to 1 GB.
• Particular attention to reading the whole files from disk (i.e. no caching at all, neither on the client side nor on the server side).
• Before starting the tests, appropriate sync's are issued to flush the operating system buffers, in order not to have interference between consecutive tests (one write/read cycle is sketched below).
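One write/read cycle of this protocol might look like the following minimal sketch: write a zeroed file with dd, sync to flush the operating system buffers, then read the file back and time both phases. Paths and sizes are placeholders, and whether any additional cache-dropping was used on the clients is not stated in the slides.

```python
#!/usr/bin/env python
"""Minimal sketch of one dd write/read cycle with a sync in between.
Path and size are placeholders."""
import subprocess
import time

PATH = "/gpfs/testfs/bench_1GB"   # hypothetical test file on the parallel FS
BLOCKS = 1024                     # 1024 x 1 MB = 1 GB

def timed(cmd):
    t0 = time.time()
    subprocess.run(cmd, shell=True, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.time() - t0

# Sequential write of a zeroed file.
w = timed("dd if=/dev/zero of=%s bs=1M count=%d" % (PATH, BLOCKS))
# Flush the operating system buffers between the two phases.
subprocess.run(["sync"], check=True)
# Sequential read of the same file.
r = timed("dd if=%s of=/dev/null bs=1M" % PATH)

print("write: %.1f MB/s  read: %.1f MB/s" % (BLOCKS / w, BLOCKS / r))
```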

11. Network benchmark
[Plots] 30 netperf clients, first from the 30 Gigabit Ethernet client nodes to the 6 Gigabit Ethernet servers, then from the 6 servers to the 30 client nodes.

12. Generic Benchmark
[Plots] Raw Ethernet throughput vs time for 20 simultaneous 1 GB file reads with Lustre, and results of read/write tests (1 GB, different files): effective average throughput (Gb/s) vs number of simultaneous reads/writes.

13. Generic Benchmark (here shown for 1 GB files)
• Numbers are reproducible with small fluctuations.
• Lustre tests with NFS export were not performed, but are possible.

14. GPFS/Lustre comparison: 1 GB x 1 files
[Plot] Effective average throughput (Gb/s) vs number of simultaneous reads/writes.

15. Native Lustre with different file sizes
[Plot] Effective average throughput (Gb/s) vs number of simultaneous reads/writes.

16. Native GPFS with different file sizes
[Plot] Effective average throughput (Gb/s) vs number of simultaneous reads/writes.

17. GPFS robustness test
• Done just with GPFS 2.2.
• 2,000,000 files written in 1 directory (for a total of 20 TB) by 100 processes simultaneously with native GPFS and then read back, run continuously for 3 days (a scaled-down sketch of the procedure follows below).
• No failures!
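A scaled-down sketch of such a robustness test is shown below: several processes write files into a single directory on the parallel filesystem and read them back with verification. The directory, the number of workers and the file counts are placeholders, far smaller than the 100 processes, 2,000,000 files and 20 TB of the real three-day test.

```python
#!/usr/bin/env python
"""Minimal sketch of a write-then-read-back robustness test on a parallel
filesystem.  Directory, worker count and file sizes are placeholders."""
import os
from multiprocessing import Pool

TARGET_DIR = "/gpfs/testfs/robustness"   # hypothetical test directory
FILES_PER_WORKER = 100
FILE_BYTES = 10 * 1024 * 1024            # ~10 MB per file

def write_and_read(worker_id):
    payload = b"\0" * FILE_BYTES
    for i in range(FILES_PER_WORKER):
        path = os.path.join(TARGET_DIR, "w%02d_f%06d" % (worker_id, i))
        with open(path, "wb") as f:      # write the file
            f.write(payload)
        with open(path, "rb") as f:      # read it back and verify the content
            assert f.read() == payload
    return worker_id

if __name__ == "__main__":
    os.makedirs(TARGET_DIR, exist_ok=True)
    with Pool(8) as pool:                # 8 concurrent workers instead of 100
        pool.map(write_and_read, range(8))
```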

18. PHASE 2: Realistic analysis
• We focus on analysis jobs, since they are generally the most I/O-bound processes of the experiment activity.
• A realistic LHCb analysis algorithm was run on 8.1 TB of data served by RFIO daemons running on the GPFS parallel file system servers.
• The analysis algorithm performs a selection of an LHCb physics channel by sequentially reading input data files containing simulated events and producing n-tuple files in output.
• Analysis jobs were submitted to the production LSF batch system.
• 14000 jobs submitted to the queue, 500 jobs in simultaneous RUN state; each job (see the sketch below):
• RFIO-copies the file to be processed to the local worker node disk;
• analyzes the data;
• RFIO-copies back the output of the algorithm;
• cleans up the files from the local disk.
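The per-job workflow listed above can be sketched as a simple Python wrapper around rfcp (assuming the usual host:path form for remote RFIO files). The server name, paths and the analysis command are placeholders; the real jobs were steered by the LHCb software and the LSF batch system.

```python
#!/usr/bin/env python
"""Minimal sketch of the copy / analyze / copy back / cleanup job workflow.
Server name, file paths and the analysis command are placeholders."""
import os
import subprocess

SERVER = "diskserv-01.cr.cnaf.infn.it"        # hypothetical RFIO/GPFS server
INPUT = "/gpfs/lhcb/dst/file0001.dst"         # hypothetical input file
SCRATCH = "/pool/lsf_scratch"                 # hypothetical local scratch area

local_in = os.path.join(SCRATCH, os.path.basename(INPUT))
local_out = os.path.join(SCRATCH, "ntuple.root")

try:
    # 1. RFIO-copy the input file to the local worker node disk.
    subprocess.run(["rfcp", "%s:%s" % (SERVER, INPUT), local_in], check=True)
    # 2. Analyze the data (placeholder command).
    subprocess.run(["analysis.exe", local_in, local_out], check=True)
    # 3. RFIO-copy the output n-tuple back to the storage.
    subprocess.run(["rfcp", local_out,
                    "%s:/gpfs/lhcb/ntuples/ntuple0001.root" % SERVER],
                   check=True)
finally:
    # 4. Clean up the local disk.
    for f in (local_in, local_out):
        if os.path.exists(f):
            os.remove(f)
```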

19. PHASE 2: Realistic analysis (II)
• 8.1 TB of data processed in 7 hours; all 14000 jobs completed successfully, with 500 jobs running simultaneously.
• > 3 Gbit/s raw sustained read throughput from the file servers with GPFS (about 320 MByte/s effective I/O throughput).
• Write throughput of the output data negligible (1 MB/job).

20. Some comments on analysis models
• The common approach of copying the input files from the SE to the local disk, writing the output files to the local disk and then moving them back to the SE is conceptually simple but has serious drawbacks:
• The application can only check that the needed local disk space is available at the beginning of the job, but cannot predict whether this space will remain available for the entire life of the job (illustrated in the sketch below).
• In general, other applications may run on the same node at the same time, write data to the local disk and fill it up.
• In real life this actually happens.
• The success of the job is intrinsically not predictable within this model, unless one resorts to complex checkpointing mechanisms.
• A more clever approach (which requires SRM v2.1 and a reliable filesystem that allows keeping a file open for a while) would be to open the input and output files remotely:
• SRM v2.1 functionalities are needed to pin the input files and reserve space for the output files on the SE (see the StoRM talk).
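The drawback described above can be illustrated with a short sketch: the job can verify the free scratch space when it starts, but nothing prevents other jobs on the same node from filling the disk afterwards. The path and the required space are placeholders.

```python
#!/usr/bin/env python
"""Minimal illustration of the local-scratch race: the free-space check only
holds at the moment it is made.  Path and threshold are placeholders."""
import os

SCRATCH = "/pool/lsf_scratch"       # hypothetical local scratch area
NEEDED_BYTES = 2 * 1024**3          # say, input file plus output n-tuple

st = os.statvfs(SCRATCH)
free = st.f_bavail * st.f_frsize    # bytes currently available to the job
if free < NEEDED_BYTES:
    raise SystemExit("not enough scratch space at job start")

# Hours later, while the job is still copying or analysing, other jobs on the
# same worker node may have filled the disk: the check above gives no
# guarantee for the lifetime of the job, which is the drawback the slide
# points out.  Opening input and output files remotely on the SE (with SRM
# v2.1 pinning and space reservation) removes this dependence on local disk.
```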

21. More recent studies with GPFS
[Diagram: GPFS cluster] Worker nodes mounting GPFS are connected via Ethernet to 4 storage servers (Sun Microsystems SunFire V20z with QLogic 2340 HBAs) running GPFS and GridFTPd; the servers attach with 8 x 2 Gb/s FC links to a SAN fabric, which connects with 4 x 2 Gb/s FC links to a StorageTek FlexLine FLX680 providing 22 TB.
• Early studies in 2005 were not performed by locally mounting the filesystem on the production worker nodes (remote access via RFIO was used instead).
• More recent studies in 2006 were made with a local GPFS mount:
• GPFS version 2.3.0-10.
• Installation of the GPFS RPMs completely "quattorized".
• Minimal work required to adapt the RPM packages provided by IBM to make them quattor compliant.
• GPFS mounted on 500 boxes (most of the production farm).

22. WAN data transfers
• Data transfers of pre-staged stripped LHCb data files from CERN (castorgridsc data exchanger pools) to the 4 GPFS servers via third-party globus-url-copy.
• 40 simultaneous transfers, dynamically balanced by the DNS, 5 streams per transfer (see the sketch below).
• Typical file size 500 MB.
• About 2 Gb/s of sustained throughput with this relatively simple testbed.
• CPU load of the servers: 35%
• including I/O wait: 15%
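The transfer pattern can be sketched as follows: launch a fixed number of simultaneous third-party globus-url-copy transfers, each with 5 parallel streams (the -p option). The gsiftp URLs and file names are placeholders; the real balancing across the 4 servers was done by the DNS.

```python
#!/usr/bin/env python
"""Minimal sketch: 40 simultaneous third-party GridFTP transfers with 5
parallel streams each.  Source and destination URLs are placeholders."""
import subprocess

SRC_BASE = "gsiftp://castorgridsc.cern.ch/castor/cern.ch/lhcb/strip/"   # placeholder
DST_BASE = "gsiftp://gridftp.cr.cnaf.infn.it/gpfs/lhcb/strip/"          # placeholder
FILES = ["file%04d.dst" % i for i in range(40)]                         # 40 files

# Start all transfers at once; each uses 5 parallel data streams.
procs = [subprocess.Popen(["globus-url-copy", "-p", "5",
                           SRC_BASE + f, DST_BASE + f])
         for f in FILES]
for p in procs:
    p.wait()
```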

23. Sustained writes on LAN from production worker nodes
• 1000 jobs submitted to the LSF production batch system.
• 400 jobs in simultaneous running state.
• 1 GB file written by each job at the full available throughput.
• About 2.5 Gb/s.
• CPU load of the servers: 70%
• including I/O wait: 20%
• negligible on the client side.

24. Sustained reads on LAN from production worker nodes
• 1000 jobs submitted to the LSF production batch system.
• 300 jobs in simultaneous running state.
• 1 GB file read by each job at the full available throughput.
• 4 Gb/s: the maximum available bandwidth was used.
• CPU load of the servers: 85%
• including I/O wait: 50%
• negligible on the client side.

25. A more realistic scenario: sustained WAN data transfers and local LAN reads from worker nodes at the same time
• 40 x 5 streams from CERN to CNAF.
• 1000 jobs submitted to the LSF production batch system.
• 550 jobs in simultaneous running state.
• 1 GB file read by each job at the full available throughput.
• About 1.7 Gb/s from CERN and 2.5 Gb/s to the worker nodes.
• CPU load of the servers: 100%
• including I/O wait: 60%
• negligible on the client side.

26. Lessons learned and conclusions
• Personally I learned a lot from this work and, at the same time, it was a valuable contribution toward the solution of the LHC storage challenge for the Tier-1:
• Very fruitful collaboration and full support from the CNAF staff.
• The possibility to gain experience with expensive hardware resources and advanced systems.
• See Giacinto's talk for a similar experience.
• Parallel filesystems identified as interesting and valuable solutions:
• Oriented to SAN for high availability.
• Interesting performance figures, already at the scale of what will be required "one day" (not so far away, actually...).
• POSIX I/O access: every existing application can use these filesystems as they are, without any adaptation.
• For advanced storage management they require a dedicated SRM (see the StoRM talk); they then naturally become fully GRID-compliant disk-based storage solutions and can be solid building blocks toward GRID standardization in the I/O sector.
• They can also act as highly available and scalable general purpose filesystems:
• e.g. suitable for serving home directories to large UI clusters.
