
Workload Analysis of Globus’ GridFTP

Workload Analysis of Globus’ GridFTP. Nicolas Kourtellis Joint Work with: Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi, Dan Fraser University of South Florida & Argonne National Laboratory. Metrics Project Dataset. Start with ~137.5 million records (Jul’05 - Mar’07)


Presentation Transcript


  1. Workload Analysis of Globus’ GridFTP
     Nicolas Kourtellis
     Joint work with: Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi, Dan Fraser
     University of South Florida & Argonne National Laboratory

  2. Metrics Project Dataset
     Start with ~137.5 million records (Jul’05 - Mar’07). Records filtered out:
     • ~22.8 million records with size ≤ 0
     • ~1,000 records with buffer size < 0
     • ~3.9 million records for directory listings
     • ~4,600 records with invalid hostnames/IPs (e.g. /[B@89712e)
     • ~11.4 million records from identified ANL-TeraGrid tests
     • ~16.8 million records: identified duplicate reports
     • ~5.75 million records: self transfers (same hostname for source/destination)
     => In the end: ~77.2 million records, or ~56.2% of the original dataset!
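
The per-record cleaning rules above could be expressed roughly as follows. This is a minimal sketch, not the authors' code: the slides do not give the database schema, so the field names (`nbytes`, `buffer_size`, `is_dir_listing`, `hostname`) and the hostname pattern are assumptions. The ANL-TeraGrid test filter, duplicate detection, and self-transfer checks are left out because they need information beyond a single record.

```python
import re

# Assumption: a "valid" hostname or IPv4 address uses only letters, digits,
# dots, and hyphens. IPv6 addresses are not handled by this simplification.
HOSTNAME_RE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?$")


def is_valid_record(rec):
    """Return True if a record survives the single-record cleaning rules.

    `rec` is a dict with hypothetical keys: nbytes, buffer_size,
    is_dir_listing, hostname.
    """
    if rec["nbytes"] <= 0:          # transfers with size <= 0
        return False
    if rec["buffer_size"] < 0:      # negative buffer sizes
        return False
    if rec["is_dir_listing"]:       # directory listings are not data transfers
        return False
    if not HOSTNAME_RE.match(rec["hostname"]):  # e.g. "/[B@89712e"
        return False
    return True


def clean(records):
    """Keep only the records that pass every rule."""
    return [rec for rec in records if is_valid_record(rec)]
```

Note that a buffer size of exactly 0 passes this filter, matching the slide's "< 0" criterion.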

  3. Metrics Project Dataset (Cont.)
     • Server-to-server transfers => duplicate report of the same transfer
     • Criteria to identify duplicates:
       • Window of 5 records
       • Complementary stor_or_retr code (0 or 1)
       • Same number of bytes, buffer size, and block size
       • Transfer time (end - start) within 1 sec difference
       • Start (or end) time within 60 sec difference
       • If more than one record matches, pick the one with the smallest difference in transfer time.
     • 16.8 million server-to-server transfers
     • ~5.75 million records: self transfers (same hostname for source/destination)
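
The duplicate-matching criteria above can be sketched as a windowed scan. This is only an illustration under stated assumptions: the field names are hypothetical, times are taken to be seconds, the records are assumed sorted by start time, and "window of 5 records" is interpreted here as the next 5 records in that order (the slides do not say how the window is defined).

```python
def find_duplicate_pairs(records, window=5):
    """Pair up records that look like two reports of one server-to-server transfer.

    `records` is a list of dicts sorted by start time, with hypothetical keys:
    stor_or_retr, nbytes, buffer_size, block_size, start, end.
    Returns a list of (i, j) index pairs.
    """
    pairs = []
    matched = set()
    for i, a in enumerate(records):
        if i in matched:
            continue
        best_j, best_diff = None, None
        # Assumption: the window is the next `window` records in order.
        for j in range(i + 1, min(i + 1 + window, len(records))):
            if j in matched:
                continue
            b = records[j]
            # Complementary stor_or_retr code: one side is 0, the other is 1.
            if {a["stor_or_retr"], b["stor_or_retr"]} != {0, 1}:
                continue
            # Number of bytes, buffer size, and block size must be identical.
            if (a["nbytes"], a["buffer_size"], a["block_size"]) != (
                b["nbytes"], b["buffer_size"], b["block_size"]
            ):
                continue
            # Transfer time (end - start) within 1 second of each other.
            diff = abs((a["end"] - a["start"]) - (b["end"] - b["start"]))
            if diff > 1.0:
                continue
            # Start (or end) time within 60 seconds of each other.
            if abs(a["start"] - b["start"]) > 60 and abs(a["end"] - b["end"]) > 60:
                continue
            # Among several candidates, keep the closest transfer time.
            if best_diff is None or diff < best_diff:
                best_j, best_diff = j, diff
        if best_j is not None:
            pairs.append((i, best_j))
            matched.update((i, best_j))
    return pairs
```

Each record joins at most one pair, so one report from each pair can then be dropped from the dataset.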

  4. Results (1): Transfer Size Distribution
     Notes:
     • 1st peak: 16MB - 32MB, ~13 million records.
     • 2nd peak: 512B - 1KB (low transfer size), ~7.4 million records.
     • 3rd peak: 0 - 2B, ~5.2 million records.
     • Maximum bucket: 8TB - 16TB, 45 records.
     • GB-region buckets: ~255,000 records.
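
The bucket labels above (0-2B, 512B-1KB, 16MB-32MB, 8TB-16TB) suggest power-of-two bins. A minimal sketch of such a histogram follows; the half-open `[lower, upper)` boundaries and the binary units (1KB = 1024B) are assumptions, since the slides do not state how bucket edges were handled.

```python
from collections import Counter


def size_bucket(nbytes):
    """Return the (lower, upper) power-of-two bucket for a positive size in bytes.

    Assumption: buckets are half-open, [lower, upper), and the first bucket
    is 0-2B as on the slide.
    """
    if nbytes < 2:
        return (0, 2)
    lower = 1 << (nbytes.bit_length() - 1)  # largest power of two <= nbytes
    return (lower, lower * 2)


def size_histogram(sizes):
    """Count how many transfer sizes fall into each bucket."""
    return Counter(size_bucket(n) for n in sizes)
```

For example, a 20MB transfer lands in the 16MB-32MB bucket and a 700-byte transfer in the 512B-1KB bucket.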

  5. Results (2): Buffer Size Distribution
     Notes:
     • 60% from the original table: 0B.
     • Most commonly used: 16 - 128KB.
     • Maximum bucket: 1 - 2GB, 92 records.

  6. Results (3): Average Bandwidth Distribution
     Notes:
     • Peak: 128 - 256Mbps, ~7.7 million records.
     • Most common: 4Mbps - 1Gbps of average bandwidth (58%).
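
The slides do not say how average bandwidth was derived. A plausible reading, assumed here, is bytes transferred divided by transfer time (end - start), with 1 Mbps taken as 10^6 bits per second. The function and parameter names are hypothetical.

```python
def average_bandwidth_mbps(nbytes, start, end):
    """Average bandwidth of one transfer in Mbps (assumed 10**6 bits/second).

    `start` and `end` are timestamps in seconds. Returns None when the
    duration is not positive, since the rate is then undefined.
    """
    duration = end - start
    if duration <= 0:
        return None
    return nbytes * 8 / duration / 1e6
```

Under these assumptions, a 1GB (10^9 bytes) transfer that takes 10 seconds averages 800 Mbps.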

  7. Results (4): Number of Streams Distribution
     Notes:
     • ~70% of the transfers used 1 stream!!
     • Only ~20% of the transfers used 4 streams (the number suggested on ANL’s website).
     • In total, 10% of the user base used other numbers of streams.
     • Maximum of the CDF: 1010 streams(!!)

  8. Results (5): Average Bandwidth vs. Number of Streams
     Notes:
     • Only transfer sizes ≥ 1GB are included.
     • Bandwidth increases by a factor of 2 or 3 for more than 10 streams.
     • Bandwidth ceiling at 800 - 900Mbps after ~32 streams => Gbps infrastructure.

  9. Results (6): Number of Stripes Distribution
     Summary of results:
     • 1 stripe: 99.5% of the transfers!
     • 2 - 31 stripes: 0.5% of the transfers!

  10. Results (7): User and Organization Evolution over Time
      Notes:
      1) The user and organization populations keep growing.
      2) Forecasts: 43 new IPs and 14 new organizations (domains) per month.

  11. Results (8): Geographical Characterization
      Notes:
      • USA: 78.4% or ~50.8 million transfers, and 82.9% or ~1.3 PB in volume.
      • Some activity from Canada, Taiwan, Japan and Spain (~14 million transfers and 346TB in volume).

  12. Results (9): Server to Server Transfers (a)
      Plot legend: # of transfers (dashed line), volume (solid line), linear fitting (long-dashed line).
      Notes:
      • ~257,000 transfers per month.
      • Growth rate of ~27,000 transfers per month.
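
A growth rate like the one above is the slope of a straight line fitted to monthly counts. The sketch below uses ordinary least squares, which is an assumption: the slides only say "linear fitting". The sample counts in the usage note are synthetic and are not taken from the dataset.

```python
def linear_fit(values):
    """Ordinary least-squares fit of values[i] against month index i = 0, 1, 2, ...

    Returns (slope, intercept); the slope is the growth per month.
    Needs at least two values.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two monthly values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

With synthetic monthly counts `[100, 130, 160, 190]`, the fitted slope is 30 transfers per month and the intercept is 100.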

  13. Results (10): Server to Server Transfers (b)
      Notes:
      • Inter-domain transfers are a small % of the transfers but a correspondingly high % of the volume.
      • The opposite holds for intra-domain (inter-IP) transfers.
      • High reporting of self transfers (more than 1/3).

  14. Results (11): Year-Round Comparison for the Number of Transfers (per month)
      Comment: The ratio decreases as time goes by, suggesting a stabilizing trend in the # of transfers.

  15. Results (12): Year-Round Comparison for the Volume of Transfers (per month)
      Comment: The ratio decreases as time goes by, suggesting a stabilizing trend in the volume of transfers. There is an unexplained (?) dramatic increase in Dec’06!

  16. DISCUSSION
      Open Questions:
      How can the functionalities of GridFTP be explored more for:
      a) Better performance (e.g. speedup of transfers, streams, stripes, etc.)?
      b) Better utilization of resources (e.g. bandwidth, storage, etc.)?
      => Tutorials, suggestions, solutions-tools-applications to users.
      How can the system evolution and usage analysis be useful for:
      a) Prediction and provisioning of the Globus Grid's resources.
      b) Designing new benchmarks for the evaluation of Grids’ resources, such as data transfer components, and for more realistic simulations.
      c) Why aren’t the big players of GridFTP (CERN etc.) reported? (Version or component?)
      d) How much does the version of the component affect the results?
      e) Bottom line: Are these data logs representative of the GridFTP population?
      f) If not, what component would have representative logs, even with limited logging?
      How can we improve the Usage Statistics (Metrics) Collection system?
      a) Efficient for the vast number of reports in the future (e.g. daily summaries?)
      b) Robust (to attacks, bogus data, etc.)
      c) Apply reporting of bugs in the user system (in the form of live feedback)?
      d) Add more details to the reports (like source AND destination(?))
      e) Eliminate bugs in the reporting:
         i) Duplication
         ii) FTP response code (in conjunction with (c))
         iii) Zero & negative buffer size values (crosscheck the values used against the values allowed by the system)
         iv) Time fields sometimes inconsistent
         v) Use of the IP field, indexes on the DB to speed up the analysis
         vi) Change in the schema of the DB (monthly differences, etc.)
