
Status of Lustre SE R&D

This presentation covers the status of Lustre SE R&D, including an architecture review, local Lustre filesystem performance, the SRM v2.2 interface, backup and monitoring, Lustre WAN, and more.


Presentation Transcript


  1. Status of Lustre SE R&D. Yujun Wu (1), Paul Avery (1), Bockjoo Kim (1), Fu Yu (1), Dimitri Bourilkov (1), Craig Prescott (2), Charles A. Taylor (2), Jorge L. Rodriguez (3), Justin Kraft (3). (1) UF Physics Dept and CMS Tier-2, (2) UF HPC, (3) FIU. USCMS Tier-2 Workshop

  2. Outline • Lustre architecture review • Local Lustre filesystem performance • Lustre with SRM v2.2 interface • Lustre filesystem backup and monitoring • Lustre WAN • Summary and outlook

  3. Lustre architecture review • A Lustre filesystem consists of one metadata server (MDS) and multiple object storage servers (OSSs); each OSS usually serves between two and eight object storage targets (OSTs); • Lustre clients access data stored on the OSTs concurrently through standard POSIX semantics, and the filesystem achieves high performance by spreading data access across multiple OSSs and OSTs at the same time; • The Florida CMS Tier-2 has been using the Lustre filesystem since 2008, in collaboration with the UF HPC Center;
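From a client node, the MDT/OST layout and striping behavior described above can be inspected with the standard lfs utilities. The following is a minimal sketch, assuming a node with the Lustre client tools installed and the filesystem mounted at a hypothetical path (/lustre/cms is not the actual UF mount point):

#!/usr/bin/env python
"""Minimal sketch: inspect a mounted Lustre filesystem from a client node.
Assumes the Lustre client utilities (lfs) are installed and the filesystem
is mounted at MOUNT_POINT (a hypothetical path)."""
import subprocess

MOUNT_POINT = "/lustre/cms"  # hypothetical mount point

def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.check_output(cmd, universal_newlines=True)

# Per-target usage: one line per MDT and OST, so this also shows how many
# OSTs back the filesystem and how full each one is.
print(run(["lfs", "df", "-h", MOUNT_POINT]))

# Striping defaults for the top-level directory: stripe count and size
# control how many OSTs a single file is spread across, which is what
# drives aggregate bandwidth.
print(run(["lfs", "getstripe", "-d", MOUNT_POINT]))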

  4. Lustre architecture review (2)

  5. Local Lustre filesystem performance • UF CMS Tier-2, working with the UF HPC Center, deployed 215TB of Areca RAID storage running the Lustre filesystem: - 1 MDS: two quad-core 2.3GHz Intel Xeon E5520 processors; 24GB DDR3 memory; mirrored dual 160GB Intel X25-M SATA solid-state drives (MDT); - 5 OSSs: dual quad-core 2.3GHz AMD Opteron 2376 processors with 32GB DDR2 memory. Each OSS connects to two 16-bay iStarUSA mAGE316U40-PCI-E storage enclosures via PCIe extension cables. Each enclosure holds 16 Western Digital RE4-GP 2TB SATA drives and one Areca ARC-1680ix-16 RAID controller; - 30 OSTs: each OST is a 4+1 RAID5 array with a size of 8TB. There are 3 OSTs per Areca controller, plus a global hot spare;
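As a back-of-the-envelope check on how these numbers fit together (our own arithmetic; the assumption that the quoted 215TB is what remains of the raw usable space after the decimal-TB/binary-TiB conversion and filesystem overhead is ours):

# Back-of-the-envelope capacity check for the layout described above.
# Assumption (ours): the quoted 215TB is roughly the raw usable space
# after the TB-to-TiB conversion and filesystem overhead.
data_drives_per_ost = 4      # 4+1 RAID5, so 4 data drives per OST
drive_tb = 2.0               # 2TB WD RE4-GP drives
osts = 30                    # 3 OSTs per Areca controller, 10 controllers

usable_tb = osts * data_drives_per_ost * drive_tb   # 240 decimal TB
usable_tib = usable_tb * 1e12 / 2**40               # about 218 TiB

print("raw usable: %.0f TB (~%.0f TiB before filesystem overhead)"
      % (usable_tb, usable_tib))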

  6. Local Lustre filesystem performance (2) Directly mounted Lustre filesystem test using the IOzone benchmark tool on a single client node with a 10Gb NIC: - 826MB/s sequential write rate and 540MB/s sequential read rate;
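The exact IOzone parameters used are not listed on the slide; the following is an illustrative sketch of a sequential read/write run against the mounted filesystem, with the file path, file size, and record size as our own assumptions:

#!/usr/bin/env python
"""Sketch of a sequential-throughput IOzone run against a mounted Lustre
filesystem. Path, file size, and record size are illustrative assumptions,
not the parameters actually used in the test above."""
import subprocess

TEST_FILE = "/lustre/cms/iozone.tmp"   # hypothetical path on the Lustre mount

subprocess.check_call([
    "iozone",
    "-i", "0",        # test 0: sequential write / rewrite
    "-i", "1",        # test 1: sequential read / reread
    "-r", "1m",       # 1MB record size
    "-s", "32g",      # file large enough to defeat client-side caching
    "-e",             # include flush (fsync) time in the write figure
    "-f", TEST_FILE,
])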

  7. Local Lustre filesystem performance (3) • Local GridFTP put tests to the Lustre filesystem using one and two GridFTP servers with 10Gb NICs, with the source files on a client ramdisk: • - Single GridFTP server: 690MB/s write rate • - Two GridFTP servers: ~1015MB/s write rate;
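A single put of this kind can be reproduced with globus-url-copy; the sketch below is illustrative, with the host name, destination path, and tuning values being our assumptions rather than the site's actual settings:

#!/usr/bin/env python
"""Sketch of a local GridFTP put with globus-url-copy: a file staged on
ramdisk is written through a GridFTP server onto Lustre. The host name,
paths, and tuning values are illustrative assumptions."""
import subprocess

SRC = "file:///dev/shm/testfile"                            # source on ramdisk
DST = "gsiftp://gridftp01.example.edu/lustre/cms/testfile"  # hypothetical server/path

subprocess.check_call([
    "globus-url-copy",
    "-vb",                   # report the transfer rate while running
    "-fast",                 # reuse data channels
    "-p", "8",               # 8 parallel TCP streams
    "-tcp-bs", "4194304",    # 4MB TCP buffer, tuned for a 10Gb LAN
    SRC, DST,
])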

  8. SRM v2.2 Interface and performance • We deployed BestMan, a lightweight full SRM v2.2 implementation, on top of the Lustre filesystem, and everything has worked well so far with authorization through a local grid-mapfile or GUMS; • A BestMan Java transfer-selection plugin has been developed to select GridFTP servers based on load and configuration; • Interoperability tests show that Lustre (through the BestMan SRM v2.2 interface) works well with Castor, dCache, FTS, Hadoop, INFN StoRM, ReddNet, etc.; • Efforts have been made to eliminate transfer errors by: - upgrading the GridFTP server to the new version; - updating the Linux kernel version on the GridFTP server; - adjusting network parameters;
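The transfer-selection plugin itself is Java and its code is not shown on the slide; the Python fragment below is only an illustrative sketch of the idea (pick the least-loaded server from a configured list), with the server names and the load source being hypothetical:

"""Illustrative sketch (not the actual BestMan Java plugin) of load-based
GridFTP server selection: given a configured server list and a recent load
reading for each, return the least-loaded one. How load is obtained (here,
a plain dict, e.g. filled from ganglia) is an assumption."""

def select_gridftp_server(servers, loads):
    """servers: list of gsiftp URL prefixes; loads: {server: load value}.
    Servers with no reading are treated as idle so they still get traffic."""
    return min(servers, key=lambda s: loads.get(s, 0.0))

# Hypothetical configuration and load readings.
servers = [
    "gsiftp://gridftp01.example.edu",
    "gsiftp://gridftp02.example.edu",
]
loads = {
    "gsiftp://gridftp01.example.edu": 7.2,
    "gsiftp://gridftp02.example.edu": 1.4,
}

print(select_gridftp_server(servers, loads))   # -> the gridftp02 URL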

  9. SRM v2.2 Interface and performance (2)

  10. SRM v2.2 Interface and performance (3) • Able to achieve about 600MB/s (production + debug instances) through PhEDEx using our new BestMan+GridFTP+Lustre setup. This was confirmed by both the PhEDEx plots and the local ganglia plots, and the local peak rate from ganglia was near 9Gb/s when writing data into the Lustre filesystem; • Able to run the PhEDEx LoadTest for long periods of time at a sustained rate of about 230-300MB/s (debug instance); • The success rate has greatly improved: it is now 90-100% to our site most of the time in the debug instance, except from a few persistently problematic sites; • Easily passed the PhEDEx LoadTest and SAM tests; • Some CMS datasets are now stored on the Lustre filesystem through the BestMan setup. UF Tier-2 and HPC are getting over 1000 jobs daily, and jobs have completed without any Lustre access errors;

  11. SRM v2.2 Interface and performance (4)

  12. SRM v2.2 Interface and performance (5)

  13. SRM v2.2 Interface and performance (6) • Conducted a BestMan server benchmark using the same srm_punch tool Brian used; • Using both srm-ls and srm-put (5 srm-ls and 4 srm-put, one after another) in srm_punch with 100 UF Tier-2 pg worker nodes while CMS production jobs were running: - With GUMS authentication: 1.9 to 3.3 Hz, and the server can easily melt down with 100 clients; - With a local grid-mapfile: 2.1 to 4.7 Hz; • Using only srm-ls in srm_punch with 100 worker nodes, we achieved 39.8 Hz with no GUMS or proxy delegation, and about 12 Hz with proxy delegation; • These numbers are compatible with Brian's earlier results (20 Hz without GUMS or proxy delegation); the higher number here is likely due to a more powerful server node, so no surprise!
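srm_punch itself is not reproduced on the slide; the sketch below only illustrates the kind of per-client rate measurement it performs (time a fixed 5 srm-ls + 4 srm-put mix and report operations per second). The command lines are placeholders, to be replaced by the site's real srm-ls/srm-put invocations:

#!/usr/bin/env python
"""Sketch of the kind of rate measurement srm_punch performs (not the tool
itself): time a fixed mix of SRM client operations from one client and
report a per-client rate. The commands below are placeholders; the real
srm-ls / srm-put invocations (endpoint, SURL, proxy options) go there."""
import subprocess
import time

# 5 listings and 4 puts per cycle, mirroring the mix described above.
OPS = ([["echo", "srm-ls <endpoint>"]] * 5 +
       [["echo", "srm-put <endpoint>"]] * 4)

start = time.time()
for cmd in OPS:
    subprocess.check_call(cmd)       # each call stands in for one SRM operation
elapsed = time.time() - start

# Aggregate Hz across N concurrent worker nodes is roughly N times the
# per-client rate, until the BestMan server itself becomes the bottleneck.
print("%.2f ops/s from this single client" % (len(OPS) / elapsed))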

  14. Lustre WAN • Besides the work at UF, we also work with FIU and FIT on testing Lustre with CMS storage and analysis jobs from remote locations; • As reported last year, Jorge from FIU was able to saturate their 1G network running a filesystem benchmark tool at FIU against the mounted UF Lustre filesystem. He and his student Justin are now testing performance on a 10G network: writes to the UF Lustre filesystem have exceeded 150MB/s using multiple 1G worker nodes behind FIU's 10G uplink to the FLR; • Patrick Ford from FIT was also able to reach nearly the network connection speed when reading/writing data from/to the UF Lustre filesystem: 111MB/s with iperf, 108MB/s writing to the UF Lustre filesystem, and 72.5MB/s reading;
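A simple way to reproduce this kind of comparison from a remote site is to measure the raw TCP path with iperf and then write a few GB to the remotely mounted Lustre path; the sketch below is illustrative, with the host name, mount path, and test sizes as assumptions:

#!/usr/bin/env python
"""Sketch of a WAN sanity check: compare raw TCP throughput (iperf) with
what the remotely mounted Lustre filesystem actually delivers. Host name,
mount path, and sizes are assumptions."""
import subprocess

# Raw network capacity toward UF (assumes an iperf server at the UF end).
subprocess.check_call(["iperf", "-c", "iperf.example.edu",  # hypothetical host
                       "-t", "30",      # 30-second test
                       "-P", "4"])      # 4 parallel streams

# Then write 4GB to the remotely mounted Lustre path and compare the rate
# dd reports with the iperf figure (path is hypothetical).
subprocess.check_call(["dd", "if=/dev/zero", "of=/lustre/cms/wan-test.tmp",
                       "bs=1M", "count=4096", "conv=fsync"])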

  15. Lustre WAN (2) • FIU (Jorge) has been able to run CMS applications on a directly mounted Lustre filesystem against data stored on the UF HPC Lustre; • Both FIU and FIT are now able to access CMS data stored at UF. Currently, CMSSW performance with direct data access is hindered by a CMSSW issue over high-latency network connections; we hope a better network connection between us can solve the problem (and possibly CMSSW improvements as well). - A good collaboration example of T2 and T3 sites sharing data and resources

  16. Lustre backup and monitoring • We have prototyped the Lustre backup system on our test system. It includes Lustre metadata backup and OST backup, and the backed-up files are copied to a remote location; • The Lustre metadata backup uses the well-known LVM snapshot mechanism to take a snapshot of the metadata; • For the OSTs, besides relying on the underlying RAID storage, we also save the critical ext4 (Lustre) filesystem metadata to a file for each OST using e2image; its “-I” option supports restoring the metadata from the saved image file (see the sketch below); • Implemented Lustre Nagios and ganglia monitoring to watch health and performance;
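A minimal sketch of that backup flow follows. It assumes the MDT lives on an LVM logical volume and that each OST is an ldiskfs/ext4 block device; all device names and backup paths are hypothetical, and the copy to the remote location is omitted:

#!/usr/bin/env python
"""Sketch of the backup flow described above. Assumes the MDT sits on an
LVM logical volume and each OST is an ldiskfs/ext4 block device; device
names and backup paths are hypothetical. Run as root on the MDS/OSS."""
import subprocess

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

# --- MDT backup: take an LVM snapshot, image it, drop the snapshot ---
MDT_LV = "/dev/vg_mds/mdt"     # hypothetical volume holding the MDT
SNAP = "mdt-snap"

run(["lvcreate", "--size", "4G", "--snapshot", "--name", SNAP, MDT_LV])
# Image the frozen snapshot; the image would then be copied off-site
# (scp/rsync step omitted here).
run(["dd", "if=/dev/vg_mds/" + SNAP, "of=/backup/mdt-snap.img", "bs=1M"])
run(["lvremove", "-f", "/dev/vg_mds/" + SNAP])

# --- OST metadata backup: save the ext4 metadata with e2image ---
# e2image stores only filesystem metadata, so the files stay small; they
# can be written back onto the device later with `e2image -I`.
for i, ost_dev in enumerate(["/dev/sdb", "/dev/sdc"]):  # hypothetical OST devices
    run(["e2image", ost_dev, "/backup/ost%02d.e2i" % i])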

  17. Status toward certifying as a CMS SE • Working on fully meeting the Tier-2 Storage Element requirements established by USCMS Tier-2 management; • Have satisfied 13 of the 14 requirements listed by USCMS Tier-2 management; • Have eliminated most of the transfer errors and now have better GridFTP server selection and monitoring; • Have now prototyped the Lustre metadata (and real data) backup systems, along with documentation; • Will try to finish the documentation later this month; • Help from USCMS Tier-2 management is needed for a full SE review;

  18. Summary and outlook • UF CMS Tier-2 has been using Lustre for over two years in collaboration with UF HPC and nearby Tier-3 sites; • Our newly deployed Lustre filesystem shows good read/write performance, and we were able to achieve an almost 10Gb/s put rate into our Lustre filesystem using two local GridFTP servers; • Integrated the Lustre filesystem with BestMan (SRM v2.2 interface) and Globus GridFTP for WAN transfers, using a customized Java transfer-selection plugin. We have achieved about a 600MB/s transfer rate from FNAL, with a peak put rate of around 900MB/s; • Continuing to improve transfer performance and quality: we have eliminated most of the transfer errors seen in PhEDEx transfers;

  19. Summary and outlook (2) • Lustre filesystem backup, covering both the metadata and the object storage targets, has been prototyped and is in testing; it will be put into production use soon. This is necessary even though the Lustre filesystem itself is quite reliable; • Lustre's remote access capability may have important implications for CMS T3 sites. Jorge from FIU has shown that he can access data located at the UF T2 site. This has the potential to remove the need to deploy CMS data management services at small Tier-3s and to let physicists focus on physics; • We are getting involved with TeraGrid in a new proposal in which one of the main tasks is to further develop Lustre's Kerberos-based authentication and wide-area access; • Working on documentation and will be ready for a full CMS SE review soon. We need your suggestions.
