
STFC Cloud Developments

Explore the background, history, and current design considerations of the STFC Cloud developments, including new use cases and capabilities. Learn how this infrastructure allows for improved research and collaboration in various scientific communities.


Presentation Transcript


  1. STFC Cloud Developments Ian Collier & Alex Dibbo STFC Rutherford Appleton Laboratory UK Research & Innovation HEPiX Spring, Madison, WI 17 May 2018

  2. Contents • Background & History • Current design considerations • New Use cases • (Some of) What this allows us to do • IRIS (UKT0)

  3. Background • OpenNebula cloud has been running for several years on a stable, but pre-production basis. • Has served very well. • The rise of more complex use cases – in particular supporting external user communities – led to deploying OpenStack. • OpenStack deployment is in its final stages. • Decommissioning OpenNebula and migrating the remaining users in the coming months.

  4. Hardware • OpenNebula: reducing from ~700 to ~300 cores • OpenStack: ~500 cores, growing to ~3500 by end of May and ~5000 by end of 2018 • Adding • Hypervisors: 108 Dell 6420 sleds (27 x 4-up 2U 6400 chassis) – testing complete, about to enter production • 16 physical cores each, 6GB RAM/core, 25G NIC • Cloud Storage: 12 Dell R730xd (12-bay 2U) – testing complete, about to enter production • 12 x 4TB each, 576TB in total, 25G NICs • Replicated Ceph (3x), VM image storage • Data Storage: 21 x (Dell R630 + 2 x MD1400) sets – just finished testing • Added to ECHO for the UKT0 project • 4PB raw / ~2.9PB configured • Network: 7 x Mellanox SN2100, 16 x 100Gb ports each

  5. OpenStack implementation • Flexible • Want a design that can accommodate, within reason, whatever comes along • Multi tenancy • Separating different user communities • Highly Available • Services should be as highly available as possible – OpenStack services sit behind HAProxy • Ceph • A replicated Ceph cluster called SIRIUS provides block storage for VMs and Volumes • 3x replication • Optimised for lower latency
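
As an illustration of the highly-available API layer (not from the slides), here is a minimal Python sketch that probes OpenStack service endpoints behind a load-balancer virtual IP. The hostname is a placeholder; the standard OpenStack ports are assumed.

```python
# Minimal sketch: probe OpenStack API endpoints sitting behind an HAProxy VIP.
# The VIP hostname is hypothetical; ports are the usual OpenStack defaults.
import requests

VIP = "https://openstack.example.stfc.ac.uk"   # hypothetical HAProxy virtual IP
ENDPOINTS = {
    "keystone": f"{VIP}:5000/v3",    # identity
    "nova":     f"{VIP}:8774/v2.1",  # compute
    "neutron":  f"{VIP}:9696/v2.0",  # networking
}

for name, url in ENDPOINTS.items():
    try:
        resp = requests.get(url, timeout=5)
        print(f"{name:10s} {url} -> HTTP {resp.status_code}")
    except requests.RequestException as exc:
        print(f"{name:10s} {url} -> unreachable ({exc})")
```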

  6. Ceph RBD • A replicated Ceph cluster called SIRIUS provides block storage for VMs and Volumes • 3x replication • Optimised for lower latency • Helps speed up launch of VMs • We are considering use cases that would benefit from local VM storage • Currently being rebuilt – a fresh installation in the new machine room location
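
To make the RBD-backed workflow concrete, the sketch below (not from the slides) uses the openstacksdk cloud layer to create a bootable volume from an image and launch a VM from it; with Cinder backed by a Ceph RBD pool the volume is typically a copy-on-write clone, which is what makes launches fast. The cloud, image, flavor and network names are assumptions.

```python
import openstack

# Hypothetical entry in clouds.yaml; image/flavor/network names are examples.
conn = openstack.connect(cloud="stfc-openstack")

# Create a 20 GB bootable volume from an image. With an RBD-backed Cinder this
# is usually a copy-on-write clone inside the Ceph pool, so it is quick.
vol = conn.create_volume(size=20, image="centos-7-x86_64", wait=True)

# Boot a server from that volume and clean the volume up with the server.
server = conn.create_server(
    name="rbd-demo-vm",
    flavor=conn.get_flavor("m1.medium"),
    boot_volume=vol.id,
    network="project-private",
    terminate_volume=True,
    wait=True,
)
print(server.name, server.status)
```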

  7. Multi Tenancy • Projects (tenants) need to be isolated • From each other • From the STFC site network • Security Groups • VXLAN private project networks • This brings its own problems
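
As a hedged example of the per-project isolation described above (not from the slides), the following openstacksdk sketch creates a baseline security group for one tenant. The group name and the allowed source range are hypothetical.

```python
import openstack

conn = openstack.connect(cloud="stfc-openstack")  # hypothetical clouds.yaml entry

# Baseline per-project security group: allow SSH and ICMP only from an
# example management range (192.0.2.0/24 is a documentation placeholder).
sg = conn.network.create_security_group(
    name="project-base", description="Baseline ingress rules for one tenant")

for protocol, port in (("tcp", 22), ("icmp", None)):
    conn.network.create_security_group_rule(
        security_group_id=sg.id,
        direction="ingress",
        ethertype="IPv4",
        protocol=protocol,
        port_range_min=port,
        port_range_max=port,
        remote_ip_prefix="192.0.2.0/24",
    )
```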

  8. Private Networks • Virtual machines connect to a private network • VXLAN is used to tunnel these networks across hypervisors • Ingress and egress are via a virtual router with NAT • Distributed Virtual Routing is used to minimise this bottleneck – every hypervisor runs a limited version of the virtual router agent
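
The sketch below (not from the slides) shows what setting up such a project network might look like with openstacksdk: a private network and subnet, plus a router with an external gateway providing the NAT'd ingress and egress. The network names and subnet range are assumptions.

```python
import openstack

conn = openstack.connect(cloud="stfc-openstack")  # hypothetical cloud name

# Project-private network; Neutron tunnels it between hypervisors using VXLAN.
net = conn.network.create_network(name="project-private")
subnet = conn.network.create_subnet(
    name="project-private-subnet",
    network_id=net.id,
    ip_version=4,
    cidr="10.0.0.0/24",          # example tenant address range
)

# Virtual router giving NAT'd ingress/egress via the external provider network.
external = conn.network.find_network("external")   # assumed provider network name
router = conn.network.create_router(
    name="project-router",
    external_gateway_info={"network_id": external.id},
)
conn.network.add_interface_to_router(router, subnet_id=subnet.id)
```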

  9. VXLAN • VXLAN by default showed significant overheads • VXLAN performance was ~10% of line rate • Tuning memory pages, CPU allocation, mainline kernel • Performance is ~40% of line rate • Hardware offload • VXLAN offload to NIC gives ~80% of line rate • High Performance Routed network + EVPN • 99+% of line rate
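
As a small, hedged illustration of the hardware-offload point (not from the slides), this Python wrapper around ethtool reports whether a NIC advertises UDP-tunnel (VXLAN) segmentation offload. The interface name is a placeholder and feature names vary between drivers and kernels.

```python
# Sketch: list UDP-tunnel (VXLAN) offload features reported by ethtool.
# Interface name is a placeholder; requires ethtool and sufficient privileges.
import subprocess

def tunnel_offload_features(iface: str = "ens1f0") -> dict:
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True, check=True).stdout
    feats = {}
    for line in out.splitlines():
        if "udp_tnl" in line:                  # e.g. tx-udp_tnl-segmentation
            name, _, state = line.partition(":")
            feats[name.strip()] = state.strip()
    return feats

if __name__ == "__main__":
    for name, state in tunnel_offload_features().items():
        print(f"{name}: {state}")
```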

  10. Cloud Network

  11. Flexible • Availability Zones across site • The first will be in ISIS soon • In discussion with other departments • GPU support • AAI • APIs • Nova, EC2, OCCI • Design decisions should, so far as possible, not preclude anything • ‘Pet’ VMs as well as ‘cattle’
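
To illustrate the availability-zone idea (not from the slides), the sketch below lists the zones a cloud exposes and launches an instance into a named zone with openstacksdk. The zone, image, flavor and network names are hypothetical.

```python
import openstack

conn = openstack.connect(cloud="stfc-openstack")   # hypothetical cloud entry

# Show which availability zones the cloud currently advertises.
for az in conn.compute.availability_zones():
    print(az.name, az.state)

# Launch an instance into a specific zone (all names are placeholders).
server = conn.create_server(
    name="isis-worker-01",
    image="centos-7-x86_64",
    flavor=conn.get_flavor("m1.large"),
    network="project-private",
    availability_zone="isis-az",
    wait=True,
)
print(server.name, server.status)
```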

  12. Current User Communities • STFC Scientific Computing Department “internal” • Self service VMs • Some programmatic use • Tier1 • Bursting the batch farm • ISIS Neutron Source • Data-Analysis-as-a-Service • Jenkins Build Service for other groups in SCD • Central Laser Facility – OCTOPUS project • Diamond Light Source – XChem • Cloud bursting Diamond GridEngine • XChem data processing using OpenShift • WLCG Datalake project • Quattor nightlies • Range of H2020 & similar projects

  13. AAI – EGI CheckIn – Horizon

  14. AAI – EGI CheckIn – Horizon 2

  15. AAI – EGI CheckIn

  16. AAI – Google Login
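
As a hedged sketch of what federated access might look like from the client side (not from the slides), the snippet below authenticates to Keystone with an OIDC access token using keystoneauth's federation support. The auth URL, identity-provider name, protocol and project are all assumptions about how such a federation could be registered.

```python
import openstack

# Sketch only: every name and URL below is a hypothetical example.
conn = openstack.connect(
    auth_type="v3oidcaccesstoken",
    auth={
        "auth_url": "https://openstack.example.stfc.ac.uk:5000/v3",
        "identity_provider": "egi-checkin",
        "protocol": "openid",
        "access_token": "<token obtained from the identity provider>",
        "project_name": "demo-project",
        "project_domain_name": "Default",
    },
)
print([server.name for server in conn.compute.servers()])
```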

  17. What do we do? • Programmes include: Particle Physics, Astronomy, Nuclear Physics • National Laboratories include: Neutron and Muon Source, Synchrotron Radiation Source, High Power Lasers, Scientific Computing/Hartree Centre, Space Science, Electronics, Cryogenics • International involvement includes: CERN, ESO, SKA, ESS, ESRF, ILL • [Slide images: Daresbury Laboratory; ESRF & ILL, Grenoble; Square Kilometre Array; Large Hadron Collider]

  18. STFC computing • GridPP: HTC for the LHC • Hartree Centre: Industry • Scientific Computing Department • DiRAC: HPC for Theory & Cosmology • HTC and data storage for Astronomy, Nuclear, Astroparticle

  19. Wider UK landscape • Effort running for several years now under ‘UKT0’ banner • Bringing together different communities • HTC and HPC for ‘STFC’ science • Linking together diverse resources into coherent and complementary support for existing & emerging communities • “UKT0” (now IRIS) is an umbrella project which • represents users • ‘contracts’ to resource providers • eventually should provide effort for cross-cutting activities

  20. IRIS – eInfrastructure for STFC Science • Initiative to bring STFC computing interests together • Formed bottom up by the science communities and compute providers • Association of peer interests (PPAN + National Facilities + friends) • Particle Physics: LHC + other PP experiments (GridPP) • DiRAC (UK HPC for theoretical physics) • National Facilities: Diamond Light Source, ISIS • Astro: LOFAR, LSST, EUCLID, SKA, ... • Astro-particle: LZ, Advanced-LIGO • STFC Scientific Computing Department (SCD) • CCFE (Culham Fusion) • Not a project seeking users – users wanting and ready to work together

  21. Evolution – UKT0 -> IRIS • 2015-2017: loose association • Community Meeting, Autumn 2015 • 2017/2018: first funding, £1.5M to be spent by 31 March 2018 • Interim PMB (to manage short-term completion) • Grant delivered resources into machine rooms • Cambridge, Edinburgh, Manchester, RAL • First (proto) Collaboration Meeting – March 2018 • Project was successful and on time (spent money/wrote software) • 2018: approved BEIS “keep the lights on” funding of £4M p.a. for 4 years for UKT0/ALC • Now finalising organisational structures and delivery planning

  22. UKT0 ‘idea’ • Science Domains remain “sovereign” where appropriate • Activity 1 (e.g. LHC, SKA, LZ, EUCLID, ...): VO Management, Reconstruction, Data Management, Analysis • Ada Lovelace Centre (Facilities users) • Activity 3, ... • Shared services: Monitoring, Accounting, Incident reporting, Data Management, Job Management, AAI, VO tools, ... • Shared infrastructure: Federated HTC Clusters, Federated Data Storage, Tape Archive, Federated HPC Clusters, Public & Commercial Cloud access • Share where it makes sense to do so

  23. Long Term Aspiration

  24. Initial Architecture

  25. Investment – next 4 years • Bulk is from the Department for Business, Energy & Industrial Strategy (BEIS) • £10.5M for eInfrastructure compute and storage assets • £5.5M for eInfrastructure digital assets • Minimal resource funding • £250K from BEIS in years 3+4 • £250K resource from STFC in years 1+2

  26. Questions? ian.collier@stfc.ac.uk alexander.dibbo@stfc.ac.uk
