
GPU passthrough in the cloud: A comparison of KVM, Xen, VMware ESXi and LXC


Presentation Transcript


  1. GPU passthrough in the cloud: A comparison of KVM, Xen, VMware ESXi and LXC. John Paul Walters, Project Leader, USC Information Sciences Institute, jwalters@isi.edu

  2. ISI@USC: Summary • Large, vibrant, path-breaking research Institute • Part of USC’s Viterbi School of Engineering located in “Marina Tech Campus” (Marina del Rey) and in Arlington, VA • ~300 people mostly research staff • Mission: Perform traditional basic and applied research combined with education and building real systems • Strength: World-class research and development in information sciences targeting issues of societal and commercial importance • Culture: Academic environment that supports researchers setting their own agendas to address important research problems with a full-time focus on research • Technology: Broad research profile across computer sciences, mathematics, engineering, and applied physics • Sponsors: DoD research agencies, intelligence community, NIH, DOE technical offices, and industrial/commercial entities

  3. Motivation • Scientific workloads demand increasing performance with greater power efficiency • Architectures have been driven towards specialization, heterogeneity • We’ve implemented support for heterogeneity in OpenStack • Support GPUs via LXC, large shared memory machines, bare metal provisioning, etc. • Most clouds are still homogeneous, without GPU access • Of the major providers, only Amazon offers virtual machine access to GPUs in the public cloud

  4. Cloud Computing and GPUs • GPU passthrough has historically been hard • Specific to particular GPUs, hypervisors, host OS • Legacy VGA BIOS support, etc. • Today we can access GPUs through most of the major hypervisors • KVM, VMware ESXi, Xen, LXC
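
As a concrete illustration of what "accessing a GPU through a hypervisor" looks like on the KVM side, the sketch below attaches a GPU to a libvirt-managed guest as a PCI hostdev. This is a minimal sketch, not the configuration used in this work: the domain name "gpu-guest" and the PCI address 0000:03:00.0 are hypothetical, and the host is assumed to already have the IOMMU enabled with the GPU bound to a passthrough-capable driver (pci-stub or vfio-pci).

    # Minimal sketch: attach a GPU to a KVM guest as a PCI hostdev via libvirt.
    # Assumes the libvirt Python bindings are installed, the IOMMU is enabled,
    # and the GPU is already detached from the host driver (vfio-pci/pci-stub).
    import libvirt

    GUEST = "gpu-guest"                    # hypothetical domain name
    GPU_BDF = (0x0000, 0x03, 0x00, 0x0)    # hypothetical PCI address 0000:03:00.0

    hostdev_xml = """
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x{:04x}' bus='0x{:02x}' slot='0x{:02x}' function='0x{:x}'/>
      </source>
    </hostdev>
    """.format(*GPU_BDF)

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName(GUEST)
    # Persist the device in the guest definition; it appears on the next boot
    # (VIR_DOMAIN_AFFECT_LIVE could be used for a running guest that supports hotplug).
    dom.attachDeviceFlags(hostdev_xml, libvirt.VIR_DOMAIN_AFFECT_CONFIG)
    conn.close()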

  5. Supporting GPUs • We’re pursuing multiple approaches for GPU support in OpenStack • LXC support for container-based VMs • Xen support for fully virtualized guests • KVM support for fully virtualized guests, SR-IOV • Also compare against VMware ESXi • Our OpenStack work currently supports GPU-enabled LXC containers • Xen prototype implementation as well • Given widespread support for GPUs across hypervisors, does hypervisor choice impact performance?
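
The LXC path does not use PCI passthrough at all: the container shares the host kernel and NVIDIA driver, so GPU support reduces to exposing the /dev/nvidia* device nodes and allowing them in the container's device cgroup. The sketch below generates the relevant config lines for an LXC 1.x-era container; the container path and device list are illustrative, not the exact configuration from our OpenStack work.

    # Minimal sketch: expose host NVIDIA device nodes to an LXC container by
    # emitting device-cgroup and bind-mount entries for its config file.
    # Major number 195 is the NVIDIA character device; the nvidia-uvm major
    # is assigned dynamically and is not handled here.
    import glob

    devices = glob.glob("/dev/nvidia*")   # e.g. /dev/nvidia0, /dev/nvidiactl

    lines = ["lxc.cgroup.devices.allow = c 195:* rwm"]   # allow NVIDIA devices
    for dev in devices:
        # Bind-mount each device node into the container's /dev
        lines.append(
            "lxc.mount.entry = {0} {1} none bind,optional,create=file".format(
                dev, dev.lstrip("/"))
        )

    # Append to the container's config (path is illustrative)
    with open("/var/lib/lxc/gpu-container/config", "a") as f:
        f.write("\n".join(lines) + "\n")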

  6. GPU Performance Across Hypervisors • 2 CPU Architectures, 2 GPU Architectures, 4 Hypervisors • Sandy Bridge + Kepler, Westmere + Fermi • KVM (kernel 3.13), Xen 4.3, LXC (CentOS 6.4), VMware ESXi 5.5.0 • KVM and Xen installed from Arch Linux 2013.10.01 base systems • Standardize on a common CentOS 6.4 base system for comparison • Same 2.6.32-358.23.2 kernel across all guests and LXC • Control for NUMA effects
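
"Control for NUMA effects" means making sure the benchmark CPUs sit on the socket closest to the GPU. One way to check that on the host is sketched below: read the GPU's NUMA node from sysfs and restrict the current process to that node's CPUs. The PCI address is hypothetical, and for the guests themselves the pinning is done at the hypervisor level, which this sketch does not show.

    # Minimal sketch: pin the current process to the NUMA node local to a GPU.
    # The PCI address below is hypothetical; adjust it to the GPU under test.
    import os

    GPU_BDF = "0000:03:00.0"   # hypothetical GPU PCI address

    # NUMA node the GPU is attached to (-1 means "no NUMA information")
    with open("/sys/bus/pci/devices/{}/numa_node".format(GPU_BDF)) as f:
        node = int(f.read().strip())

    if node >= 0:
        # CPU list of that node, e.g. "0-7,16-23"
        with open("/sys/devices/system/node/node{}/cpulist".format(node)) as f:
            cpulist = f.read().strip()

        cpus = set()
        for part in cpulist.split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))

        # Restrict this process (and its children) to the GPU-local CPUs
        os.sched_setaffinity(0, cpus)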

  7. K20 Results - SHOC

  8. C2075 Results - SHOC

  9. K20 Results – SHOC Outliers

  10. C2075 Results – SHOC Outliers

  11. SHOC Observations • Overall both Fermi and Kepler systems perform near-native • This is especially true for KVM and LXC • Xen on the C2075 system shows some overhead • Likely because Xen couldn’t activate large page tables • Some unexpected performance improvement for Kepler Spmv • Appears to be a performance regression in the base CentOS 6.4 system • Performance nearly matches the comparable Arch Linux base.

  12. K20 Results – GPU-LIBSVM

  13. C2075 Results – GPU-LIBSVM

  14. GPU-LIBSVM Observations • Unexpected performance improvement for KVM on both systems • Most pronounced on the Westmere/Fermi platform • This is due to the use of transparent hugepages (THP) • THP backs the entire guest memory with hugepages • Improves TLB performance • Disabling hugepages on the Westmere/Fermi platform reduces performance to 80-87% of the base system
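
To make the hugepage effect concrete: transparent hugepages are toggled globally through sysfs, and the sketch below is one way to check and flip the setting on the KVM host before re-running GPU-LIBSVM. It assumes a Linux host with THP compiled in and root privileges; it is illustrative, not the exact procedure used for these measurements.

    # Minimal sketch: inspect and set transparent hugepage (THP) policy on the
    # KVM host. Requires root; valid policies are "always", "madvise", "never".
    THP = "/sys/kernel/mm/transparent_hugepage/enabled"

    def current_policy():
        # File looks like "always [madvise] never"; brackets mark the active policy
        with open(THP) as f:
            return f.read().split("[")[1].split("]")[0]

    def set_policy(policy):
        with open(THP, "w") as f:
            f.write(policy)

    if __name__ == "__main__":
        print("THP before:", current_policy())
        set_policy("never")      # e.g. disable THP to reproduce the slower case
        print("THP after:", current_policy())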

  15. Multi-GPU with GPUDirect • Many real applications extend beyond a single node’s capabilities • Test multi-node performance with InfiniBand SR-IOV and GPUDirect • 2 Sandy Bridge nodes equipped with K20 GPUs • ConnectX-3 IB with SR-IOV enabled • Ported Mellanox OFED 2.1-1 to the 3.13 kernel • KVM hypervisor • Test with HOOMD, a commonly used GPUDirect-enabled particle dynamics simulator
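
Before a virtual function can be handed to a guest, the host has to expose SR-IOV VFs on the ConnectX-3 adapter; on mlx4 this is typically done through the mlx4_core num_vfs module option, after which the VFs show up as ordinary PCI functions. The sketch below only verifies that step through the generic PCI sysfs interface; the physical function address is hypothetical.

    # Minimal sketch: confirm SR-IOV virtual functions exist for an IB adapter
    # by reading the generic PCI sysfs entries of its physical function.
    # (On ConnectX-3/mlx4 the VF count itself is usually set via the
    #  mlx4_core "num_vfs" module parameter rather than written here.)
    import glob
    import os

    PF_BDF = "0000:06:00.0"   # hypothetical ConnectX-3 physical function
    pf_dir = "/sys/bus/pci/devices/{}".format(PF_BDF)

    def read(name):
        with open(os.path.join(pf_dir, name)) as f:
            return f.read().strip()

    print("VFs supported:", read("sriov_totalvfs"))
    print("VFs enabled:  ", read("sriov_numvfs"))

    # Each enabled VF appears as a "virtfnN" symlink pointing at its PCI address
    for link in sorted(glob.glob(os.path.join(pf_dir, "virtfn*"))):
        print(link, "->", os.path.basename(os.readlink(link)))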

  16. GPUDirect Advantage (image source: http://old.mellanox.com/content/pages.php?pg=products_dyn&product_family=116)

  17. HOOMD Performance

  18. Multi-node Discussion • SR-IOV and GPUDirect are possible • Performance is near-native • Scales with problem size • Next step is to extend this to a much larger cluster and expand applications

  19. Current Status • Source code is available now • https://github.com/usc-isi/nova • Includes support for heterogeneity • GPU-enabled LXC instances • Bare-metal provisioning • Architecture-aware scheduler • Prototype Xen with GPU passthrough implementation
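
For completeness, requesting one of the GPU-enabled LXC instances from that tree looks like an ordinary nova boot against a GPU flavor. The sketch below uses the python-novaclient API of that era; the credentials, flavor, and image IDs are placeholders rather than values tied to the usc-isi/nova branch.

    # Minimal sketch: boot an instance through python-novaclient (2.x-era API).
    # Credentials and the flavor/image IDs below are placeholders.
    from novaclient.client import Client

    nova = Client("2", "USERNAME", "PASSWORD", "PROJECT",
                  "http://keystone:5000/v2.0")

    server = nova.servers.create(
        name="gpu-lxc-test",      # hypothetical instance name
        image="IMAGE_UUID",       # image registered for LXC guests
        flavor="GPU_FLAVOR_ID",   # flavor the scheduler maps to a GPU node
    )
    print(server.id, server.status)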

  20. Future Work • Primary focus: multi-node • Greater range of applications, larger systems • Integrate GPU passthrough support for KVM • This might come free with the existing OpenStack PCI passthrough work • NUMA support • This work assumes perfect NUMA mapping • OpenStack should be NUMA-aware
