
Grid Datafarm Architecture for Petascale Data Intensive Computing


Presentation Transcript


  1. ACAT 2002 June 28, 2002 Moscow, Russia Grid Datafarm Architecture for Petascale Data Intensive Computing Osamu Tatebe Grid Technology Research Center, AIST On behalf of the Gfarm project http://datafarm.apgrid.org/

  2. Petascale Data Intensive Computing / Large-scale Data Analysis • Data intensive computing, large-scale data analysis, data mining • High Energy Physics • Astronomical Observation, Earth Science • Bioinformatics, ... • Good support is still needed for • Large-scale database search, data mining • E-Government, E-Commerce, Data warehouse • Search engines • Other commercial applications

  3. Example: Large Hadron Collider accelerator at CERN • ~2000 physicists from 35 countries • LHC perimeter: 26.7 km • ATLAS detector: 40 m x 20 m, 7000 tons • Detectors for the LHCb and ALICE experiments (figure shows the detectors, with a truck for scale)

  4. Peta/Exascale Data Intensive Computing Requirements • Peta/Exabyte-scale files • Scalable parallel I/O throughput • > 100 GB/s, hopefully > 1 TB/s within a system and between systems • Scalable computational power • > 1 TFLOPS, hopefully > 10 TFLOPS • Efficient global sharing with group-oriented authentication and access control • Resource management and scheduling • System monitoring and administration • Fault tolerance / dynamic re-configuration • Global computing environment

  5. Current storage approach 1: HPSS/DFS, ... • Architecture: a supercomputer with a local disk cache, data mover nodes, and a metadata manager (Meta DB) in front of a petascale tape archive, connected over an IP network (~10 Gbps) • Single system image & parallel I/O • I/O bandwidth limited by the IP network

  6. Current storage approach 2: striping cluster filesystem, e.g. PVFS, GPFS • Architecture: compute nodes and I/O nodes with a metadata manager (Meta DB), connected over an IP network (~10 Gbps); each file is striped across the I/O nodes • Single system image & parallel I/O • I/O bandwidth limited by the IP network

  7. For Petabyte-scale Computing • Wide-area efficient sharing • Wide-area fast file transfer • Wide-area file replica management • Scalable I/O bandwidth, > TB/s • I/O bandwidth limited by network bandwidth • Utilize local disk I/O as much as possible • Avoid data movement through the network as much as possible • Fault tolerance • Temporary failure of the wide-area network is common • Node and disk failures are not exceptional cases but common • A fundamentally new paradigm is necessary

  8. Our approach: data-parallel cluster-of-cluster filesystem • Architecture: combined compute & I/O nodes plus a metadata manager (Meta DB) on an IP network; file fragments are stored on the nodes' local disks, giving an aggregate bandwidth of N x 100 MB/s • Single system image & parallel I/O • Local file view and file affinity scheduling exploit the locality of local I/O channels

  9. Our approach (2) : Parallel Filesystem for Grid of Clusters • Cluster-of-cluster filesystem on the Grid • File replicas among clusters for fault tolerance and load balancing • Extension of striping cluster filesystem • Arbitrary file block length • Unified I/O and compute node • Parallel I/O, parallel file transfer, and more • Extreme I/O bandwidth, >TB/s • Exploit data access locality • File affinity scheduling and local file view • Fault tolerance – file recovery • Write-once files can be re-generated using a command history and re-computation

  10. Gfarm cluster-of-cluster filesystem (1) • Extension of a cluster filesystem • A file is divided into file fragments • Arbitrary length for each file fragment • Arbitrary number of I/O nodes for each file • Filesystem metadata is managed by a metaserver • Parallel I/O and parallel file transfer • Cluster-of-cluster filesystem • File replicas among (or within) clusters for fault tolerance and load balancing • A filesystem metaserver (MS) manages the metadata at each site; clusters are connected by ~10 Gbps inter-cluster links

  11. Gfarm cluster-of-cluster filesystem (2) • gfsd - I/O daemon running on each filesystem node • Remote file operations • Authentication / access control (via GSI, ...) • Fast executable invocation • Heartbeat / load monitor • Process / resource monitoring and management • gfmd - metaserver and process manager running at each site • Filesystem metadata management • Metadata consists of: mapping from logical filename to physical distributed fragment filenames; replica catalog; command history for regeneration of lost files; platform information; file status information (size, protection, ...)
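
To make the metadata items above concrete, here is a purely illustrative C sketch of what a per-file metadata record could contain; the names and layout are hypothetical and are not gfmd's actual schema.

    #include <sys/types.h>

    /* Hypothetical per-fragment record: maps one fragment of a logical
     * Gfarm file to a physical file on some node (replicas are simply
     * additional records with the same fragment index). */
    struct fragment_copy {
        int    index;           /* fragment index within the logical file */
        char  *hostname;        /* filesystem node holding this copy      */
        char  *physical_path;   /* fragment filename on that node         */
        off_t  length;          /* arbitrary per-fragment length          */
    };

    /* Hypothetical per-file record combining the metadata listed above. */
    struct gfarm_file_metadata {
        char   *logical_name;         /* e.g. "gfarm:input"               */
        int     nfragments;           /* number of fragments              */
        int     ncopies;              /* fragment copies incl. replicas   */
        struct fragment_copy *copies; /* mapping + replica catalog        */
        char   *command_history;      /* how to re-generate a lost file   */
        char   *platform;             /* platform information             */
        off_t   size;                 /* file status: size                */
        mode_t  protection;           /* file status: protection          */
    };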

  12. Extreme I/O bandwidth (1) • A petascale file tends to be accessed with access locality • Local I/O is aggressively utilized for scalable I/O throughput • Target architecture - a cluster of clusters, each node equipped with large, fast local disks • File affinity process scheduling • Essentially "owner computes": a process runs on the node that owns the disk holding its data • Gfarm parallel I/O extension - local file view • MPI-IO is insufficient, especially for irregular and dynamically distributed data • Each parallel process accesses only its own file fragment • Flexible and portable management in a single system image • Grid-aware parallel I/O library

  13. Extreme I/O bandwidth (2): process manager - scheduling • File affinity scheduling: gfmd schedules Process.0-Process.3 onto Host0.ch-Host3.jp, the nodes (each running gfsd) where the fragments File.0-File.3 of gfarm:File are stored, i.e. process scheduling based on the file distribution • Example: % gfrun -H gfarm:File Process

  14. Extreme I/O bandwidth (3): Gfarm I/O API - file view (1) • Global file view: each of Process.0-Process.3 sees the whole file gfarm:File, assembled from the fragments File.0-File.3 stored on Host0.ch-Host3.jp • I/O bandwidth is limited by the bisection bandwidth (~GB/s), as in an ordinary parallel filesystem
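
A minimal sketch of reading in the global file view, using the gfs_pio_* names listed later in the deck (slide 19); the exact prototypes, flag names, header path, and the gfarm:File URL are assumptions, and error handling is reduced to a single check.

    #include <stdio.h>
    #include <gfarm/gfarm.h>   /* assumed header location */

    int
    main(int argc, char **argv)
    {
        GFS_File f;
        char buf[65536], *e;
        int n;
        long long total = 0;

        gfarm_initialize(&argc, &argv);
        e = gfs_pio_open("gfarm:File", GFARM_FILE_RDONLY, &f);
        if (e == NULL) {
            /* global file view: this process sees the whole logical file,
             * no matter how its fragments are distributed over the nodes */
            e = gfs_pio_set_view_global(f, 0);
            while (e == NULL &&
                   (e = gfs_pio_read(f, buf, sizeof(buf), &n)) == NULL && n > 0)
                total += n;    /* e.g. scan or checksum the entire file */
            gfs_pio_close(f);
        }
        gfarm_terminate();
        printf("read %lld bytes in global view (%s)\n",
            total, e == NULL ? "ok" : e);
        return e == NULL ? 0 : 1;
    }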

  15. Extreme I/O bandwidth (4): Gfarm I/O API - file view (2) • Local file view: each process's accessible data set is restricted to its own local file fragment of gfarm:File (the fragment may also be stored on a remote node) • Scalable disk I/O bandwidth (> TB/s)

  16. Gfarm example: gfgrep - parallel grep • % gfgrep -o gfarm:output regexp gfarm:input • gfmd schedules one grep process per fragment of gfarm:input on the host where that fragment is stored (Host1.ch-Host3.ch at CERN.CH, Host4.jp-Host5.jp at KEK.JP) • Each process runs: open("gfarm:input", &f1); create("gfarm:output", &f2); set_view_local(f1); set_view_local(f2); grep regexp over its own fragment input.N, writing output.N; close(f1); close(f2)
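
The per-process call sequence above can be fleshed out into a rough C sketch. This is not the actual gfgrep source: the prototypes, flag names, and header path follow the gfs_pio_* API list on slide 19 but are assumed, the pattern matching is reduced to a fixed-string "grep -c", and matches spanning read-buffer boundaries are ignored for brevity.

    #include <stdio.h>
    #include <string.h>
    #include <gfarm/gfarm.h>   /* assumed header location */

    int
    main(int argc, char **argv)
    {
        GFS_File in, out;
        char buf[65536 + 1], result[64], *e;
        const char *pattern = argc > 1 ? argv[1] : "pattern";
        long matches = 0;
        int n, np;

        gfarm_initialize(&argc, &argv);

        e = gfs_pio_open("gfarm:input", GFARM_FILE_RDONLY, &in);
        if (e != NULL)
            goto done;
        e = gfs_pio_create("gfarm:output", GFARM_FILE_WRONLY, 0644, &out);
        if (e != NULL)
            goto close_in;

        /* local file view: this process sees only the fragment of
         * gfarm:input stored on (or assigned to) its own node, and
         * writes only its own fragment of gfarm:output */
        if ((e = gfs_pio_set_view_local(in, 0)) != NULL ||
            (e = gfs_pio_set_view_local(out, 0)) != NULL)
            goto close_both;

        while ((e = gfs_pio_read(in, buf, sizeof(buf) - 1, &n)) == NULL &&
               n > 0) {
            char *p = buf;

            buf[n] = '\0';
            while ((p = strstr(p, pattern)) != NULL) { /* fixed-string match */
                matches++;
                p += strlen(pattern);
            }
        }
        if (e == NULL) {
            snprintf(result, sizeof(result), "%ld matches\n", matches);
            e = gfs_pio_write(out, result, strlen(result), &np);
        }
    close_both:
        gfs_pio_close(out);
    close_in:
        gfs_pio_close(in);
    done:
        gfarm_terminate();
        if (e != NULL)
            fprintf(stderr, "gfgrep sketch: %s\n", e);
        return e == NULL ? 0 : 1;
    }

Running it under gfrun with file affinity scheduling (as on slide 13) would place one instance on each node holding a fragment of gfarm:input.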

  17. Fault-tolerance support • File replicas on an individual-fragment basis • Re-generation of lost (or not-yet-generated) write-once files using a command history • The program and input files are stored in the fault-tolerant Gfarm filesystem • The program must be deterministic • Re-generation also supports the GriPhyN virtual data concept

  18. Specifics of Gfarm APIs and commands (For Details, Please See Paper at http://datafarm.apgrid.org/)

  19. Gfarm Parallel I/O APIs • gfs_pio_create / open / close • gfs_pio_set_view_local / index / global • gfs_pio_read / write / seek / flush • gfs_pio_getc / ungetc / putc • gfs_mkdir / rmdir / unlink • gfs_chdir / chown / chgrp / chmod • gfs_stat • gfs_opendir / readdir / closedir
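
As a small illustration of the view-setting calls above, the sketch below reads one specific fragment through gfs_pio_set_view_index. The argument order shown (fragment count, fragment index, preferred host, flags) is an assumption based on the paper referenced on slide 18, as are the flag names and header path; consult the paper for the exact signature.

    #include <stdio.h>
    #include <gfarm/gfarm.h>   /* assumed header location */

    /* Read fragment #2 of a 4-fragment Gfarm file, wherever it is stored. */
    int
    main(int argc, char **argv)
    {
        GFS_File f;
        char buf[65536], *e;
        int n;
        long long total = 0;

        gfarm_initialize(&argc, &argv);
        e = gfs_pio_open("gfarm:File", GFARM_FILE_RDONLY, &f);
        if (e == NULL) {
            /* restrict the view to fragment index 2 of 4; NULL = no host hint */
            e = gfs_pio_set_view_index(f, 4, 2, NULL, 0);
            while (e == NULL &&
                   (e = gfs_pio_read(f, buf, sizeof(buf), &n)) == NULL && n > 0)
                total += n;
            gfs_pio_close(f);
        }
        gfarm_terminate();
        printf("fragment 2: %lld bytes (%s)\n", total, e == NULL ? "ok" : e);
        return e == NULL ? 0 : 1;
    }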

  20. Major Gfarm Commands • gfrep - replicate a Gfarm file using parallel streams • gfsched / gfwhere - list the hostnames where each Gfarm fragment and replica is stored • gfls - list the contents of a directory • gfcp - copy files using parallel streams • gfrm, gfrmdir - remove directory entries • gfmkdir - make directories • gfdf - display the number of free disk blocks and files • gfsck - check and repair the filesystem

  21. Porting Legacy or Commercial Applications • Hook syscalls open(), close(), write(), ... to utilize the Gfarm filesystem • Intercepted syscalls are executed in local file view • This allows thousands of files to be grouped automatically and processed in parallel • Quick start for legacy apps (but some portability problems still have to be dealt with) • gfreg command • After creation of thousands of files, gfreg explicitly groups them into a single Gfarm file
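
A hedged illustration of the point above: the legacy code stays plain POSIX and simply names a gfarm: path; only the build or launch environment changes so that the hooking library described on this slide intercepts open()/write()/close(). The path is made up, and the way the hook library is linked or preloaded is not specified here.

    /* Unmodified legacy-style code: plain POSIX I/O on a gfarm: URL.
     * With the Gfarm syscall-hooking library linked in (or preloaded),
     * these calls are redirected to the Gfarm filesystem and executed
     * in local file view, so each of the thousands of parallel
     * instances produces its own fragment of one logical Gfarm file. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        const char record[] = "one result record\n";
        int fd = open("gfarm:results/output",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return 1;
        write(fd, record, strlen(record));
        close(fd);
        return 0;
    }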

  22. Initial Performance Evaluation - Presto III Gfarm Development Cluster (Prototype) • 128 dual Athlon MP 1.2 GHz nodes • 768 MB memory, 200 GB HDD each • 98 GB memory and 25 TB storage in total • Myrinet 2000, full bandwidth, 64-bit PCI • 614 GFLOPS (peak), 331.7 GFLOPS Linpack (Top500) • In operation since Oct 2001

  23. Initial Performance Evaluation (2) - parallel I/O (file affinity scheduling and local file view) • 1742 MB/s on writes, 1974 MB/s on reads • Benchmark kernel: open("gfarm:f", &f); set_view_local(f); write(f, buf, len); close(f); • Presto III, 64 combined compute & I/O nodes, 640 GB of data
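
The benchmark kernel quoted above, spelled out as a hedged C sketch using the gfs_pio_* names from slide 19 (prototypes, flag names, and header path assumed; each of the 64 nodes runs one instance and writes its own local fragment):

    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <gfarm/gfarm.h>   /* assumed header location */

    #define BUFSIZE (1024 * 1024)    /* 1 MB per write                        */
    #define NWRITES (10LL * 1024)    /* ~10 GB per node (640 GB / 64 nodes)   */

    int
    main(int argc, char **argv)
    {
        static char buf[BUFSIZE];
        GFS_File f;
        struct timeval t0, t1;
        char *e;
        long long i;
        double sec;
        int np;

        memset(buf, 'x', sizeof(buf));
        gfarm_initialize(&argc, &argv);

        gettimeofday(&t0, NULL);
        e = gfs_pio_create("gfarm:f", GFARM_FILE_WRONLY, 0644, &f);
        if (e == NULL) {
            e = gfs_pio_set_view_local(f, 0); /* write this node's fragment */
            for (i = 0; e == NULL && i < NWRITES; i++)
                e = gfs_pio_write(f, buf, BUFSIZE, &np);
            gfs_pio_close(f);
        }
        gettimeofday(&t1, NULL);

        sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f MB/s per node (%s)\n",
            NWRITES * (BUFSIZE / 1e6) / sec, e == NULL ? "ok" : e);
        gfarm_terminate();
        return e == NULL ? 0 : 1;
    }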

  24. Initial Performance Evaluation (3) - file replication (gfrep) • 443 MB/s with 23 parallel streams • 180 MB/s with 7 parallel streams • Presto III, Myrinet 2000, 10 GB per fragment • [1] O. Tatebe et al., Grid Datafarm Architecture for Petascale Data Intensive Computing, Proc. of CCGrid 2002, Berlin, May 2002

  25. Design of the AIST Gfarm cluster I and Gfarm testbed • Cluster node: 1U, dual 2.4 GHz Xeon (10 GFLOPS), GbE, 480 GB RAID built from 8 x 2.5" 60 GB HDDs; 105 MB/s on writes, 85 MB/s on reads (~100 MB/s per node) • 10-node experimental cluster (to be installed by July 2002): 10U plus a GbE switch, 5 TB RAID in total with 80 disks; 1050 MB/s on writes, 850 MB/s on reads • Gfarm testbed: AIST 10 + 10 nodes, Titech 256 nodes, KEK, ICEPP, Osaka Univ. 108 nodes • ApGrid/PRAGMA testbed: AIST, Titech, KEK, ...; Indiana Univ., SDSC, ...; ANU, APAC, Monash, ...; KISTI (Korea); NCHC (Taiwan); Kasetsart U., NECTEC (Thailand); NTU, BII (Singapore); USM (Malaysia), ...

  26. Related Work • MPI-IO: no local file view, which is the key to maximizing local I/O scalability • PVFS - striping cluster filesystem: UNIX I/O API and MPI-IO, but I/O bandwidth limited by network bandwidth; fault tolerance? wide-area operation? scalability? • IBM PIOFS, GPFS • HPSS - hierarchical mass storage system: also limited by network bandwidth • Distributed filesystems (NFS, AFS, Coda, xFS, GFS, ...): poor bandwidth for parallel writes • Globus Grid toolkit: GridFTP (Grid security and parallel streams), replica management (replica catalog and GridFTP) • Kangaroo - the Condor approach: latency hiding that uses local disks as a cache, but no solution for bandwidth • Gfarm is the first attempt at a cluster-of-cluster filesystem on the Grid, with file replicas, file affinity scheduling, ...

  27. Grid Datafarm Development Schedule • Initial prototype, 2000-2001 • Gfarm filesystem, Gfarm API, file affinity scheduling, and data streaming • Deployed on the development Gfarm cluster • Second prototype, 2002(-2003) • Grid security infrastructure • Load balance, fault tolerance, scalability • Multiple metaservers with coherent cache • Evaluation in a cluster-of-cluster environment • Study of replication and scheduling policies • ATLAS full-geometry Geant4 simulation (1M events) • Accelerated by the national "Advanced Network Computing" initiative (US$10M / 5 years) • Full production development (2004-2005 and beyond) • Deployment on the production Gfarm cluster • Petascale online storage • Synchronized with the ATLAS schedule • ATLAS-Japan Tier-1 RC as the "prime customer" • Network: Tsukuba WAN (~5 km, 10 x N Gbps) and SuperSINET (10 Gbps) connecting AIST/TACC, KEK, U-Tokyo (60 km), and TITECH (80 km)

  28. Summary http://datafarm.apgrid.org/ datafarm@apgrid.org • A petascale data intensive computing wave is coming • Key technologies: the Grid and clusters • Grid Datafarm is an architecture for • Online > 10 PB storage with > TB/s I/O bandwidth • Efficient sharing on the Grid • Fault tolerance • Initial performance evaluation shows scalable performance • 1742 MB/s on writes on 64 cluster nodes • 1974 MB/s on reads on 64 cluster nodes • 443 MB/s file replication using 23 parallel streams • Metaserver overhead is negligible • I/O bandwidth is limited by disk I/O, not by the network (good!)
