HEPiX Spring 2019 BOF


Presentation Transcript


  1. HEPiX Spring 2019 BOF AuriStor File System and Linux Kernel AFS/AF_RXRPC Presented by Jeffrey Altman (jaltman@auristor.com)

  2. The /afs Vision • Location Transparency: one name space; data mobility without service interruption • User Mobility: access from any device • Security: Flexible model for authentication, privacy, data protection and access control • High Availability: failures cause at most temporary loss of access for small groups over short time periods • Integrity: No need for user-initiated backups; users trust the system • Heterogeneity: Multiplatform; one file system for all operating environments • Self service: Low Help Desk costs • Atomic Publishing: Software, documentation, web sites, … • Real-time collaboration: Distributed File Locking; real-time full data coherency • Distributed Administration: administer from any device

  3. AFS is 33 years young. In 1985, CMU’s VICE file system was decades ahead of its time: “A comparison with other distributed file systems reveals that although this design has individual features in common with some of the other systems, it is unique in the way it combines these features to produce a total design. It is further distinguished from all the other systems in that it does not rely on the trustworthiness of all network nodes.”

  4. Legacy AFS limitations: 30+ Years of Technical Debt • Limited network throughput • Increased call processing latency • Decreased service reliability and availability • Elevated risk of distributed deadlocks • Inability to use the full capability of available hardware • Failure to keep up with competing technologies • Multi-producer, multi-consumer work flows cannot be supported • Reliance on weak, deprecated encryption algorithms. That /afs is still in use today is a credit to the uniqueness of the vision and the strength of its architecture.

  5. AuriStor’s Vision for /afs • Application Transparency • Be a First Class file system on all major OSes • Client support built into Linux kernel • Embrace multi-producer, multi-consumer work flows • Provide persistent storage for Container deployments • Extended Integrity: • Enhanced Replication for Disaster Recovery • Be performance competitive • Lustre, GPFS, Panasas, NFSv4, … • Best of breed data security • Wire privacy and integrity protection policy guarantees • Combined identity authentication • Multi-factor authorization • Geographic isolation • Improved Ease of Use

  6. AuriStor File System Status Report

  7. AuriStor File System • U.S. Dept of Energy SBIR 2008 - 2011 • Worldwide deployments began in April 2016 • AWS, GCP • Zero Flag Day conversion from IBM AFS and OpenAFS

  8. AuriStor File System: Key Features • Security • Combined identity authentication • Multi-factor authorization • Volume and File Server policies • Data privacy for all services • Need-to-know policies applied to all metadata and data (EU privacy laws) • Networking • IPv6 • Scalable RX - 500,000 simultaneous connections per fileserver (observed) • File Server performance improvements – 60x over OpenAFS on the same hardware • UBIK Database Improvements • Accelerated quorum establishment – 20 to 53 seconds • Membership scales to 80 servers per cell

  9. AuriStor File System: Key Features • Supports multi-producer, multi-consumer work flows • Year-2038 ready • Simplified configuration management • Coverity analyzed • Robust test suite – more than 7,000 tests • Docker and Kubernetes container deployments • Optimized Linux clients • Servers execute as an unprivileged user, not “root”

  10. AuriStor File System: Extended Access Control Model • Combined Identity Authentication (sequence of identities) • Anonymous User or Authenticated User or Authenticated Machine • Anonymous User and Authenticated Machine • Authenticated User and Authenticated Machine • User Centric, Constrained Elevation Access Control Entries • Grant permissions to matching prefix substrings of the combined identity sequence • Grant permissions based upon membership in more than one group • Permits elevated access from trusted machines • When negative access control entries are used, revokes permissions from untrusted but authenticated machines
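
To make the prefix-matching idea concrete, here is a minimal toy sketch in Python (not AuriStorFS code; the identity names, rights, and exact matching rules are illustrative assumptions) of how an access control entry keyed on a combined-identity prefix could be evaluated:

    # Toy illustration only; real AuriStorFS ACL syntax and semantics differ.
    # A combined identity is an ordered sequence, e.g. (user, machine).
    def ace_matches(ace_identity, caller_identity):
        """Hypothetical rule: an ACE matches if its identity sequence is a
        prefix of the caller's combined identity sequence."""
        return tuple(caller_identity[:len(ace_identity)]) == tuple(ace_identity)

    def evaluate(acl, caller_identity):
        """Union rights from matching positive ACEs, then subtract rights
        revoked by matching negative ACEs."""
        granted, denied = set(), set()
        for ace_identity, rights, negative in acl:
            if ace_matches(ace_identity, caller_identity):
                (denied if negative else granted).update(rights)
        return granted - denied

    # Hypothetical ACL: "write" is only granted when alice authenticates
    # from the trusted machine (constrained elevation); a negative ACE
    # revoking rights from other authenticated machines would look similar.
    acl = [
        (("alice",), {"read"}, False),
        (("alice", "host/trusted.example.org"), {"read", "write"}, False),
    ]
    print(evaluate(acl, ("alice",)))                             # {'read'}
    print(evaluate(acl, ("alice", "host/trusted.example.org")))  # read and write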

  11. AuriStor File System: Intended use cases • Cross-platform home and project directories • Local and remote access • Open Data Sets • Long term storage • Direct access from remote locations • Management of Data by Classification • Software Distribution • Multiple platforms from the same path (@sys) • Linux Containers • Secure Federated Global Name Space • Storage infrastructure stretching: • Internal • DMZ • Cloud hosting providers

  12. AuriStor File System Status Report

  13. AuriStor File System AFS without Fear! • The global /afs file namespace has a long history of use within the scientific computing community • Federated authentication • Home and shared Project directories • Software distribution • Cross-platform distributed access to files and directory trees over the WAN • Anything that benefits from atomic publication and/or readonly replication • Global data replication and live data migration • /afs now also handles work flows that require • Hundreds of nodes modifying a common set of objects (multiple writer, multiple reader) • Hundreds of processes and threads on a single multi-processor client system • Robust multi-factor authorization and security (auth/integ/priv) requirements. Our partners push the limits of /afs without fear! No more walking on eggshells.

  14. Major milestones since HEPiX Spring 2019 AuriStor File System RX Improvements • AuriStor's RX implementation has undergone a major upgrade of its flow control model. Prior implementations were based on TCP Reno congestion control as documented in RFC5681, and SACK behavior that was loosely modelled on RFC2018. The new RX state machine implements SACK-based loss recovery as documented in RFC6675, with elements of New Reno from RFC6582, on top of TCP-style congestion control elements as documented in RFC5681. The new RX also implements RFC2861-style congestion window validation. • When sending data, an RX peer implementing these changes will be more likely to sustain the maximum available throughput while at the same time improving fairness towards competing network data flows. The improved estimation of available pipe capacity permits an increase in the default maximum window size from 60 packets (84.6 KB) to 128 packets (180.5 KB). The larger window size increases the per-call theoretical maximum throughput on a 1 ms RTT link from 693 Mbit/s to 1478 Mbit/s, and on a 30 ms RTT link from 23.1 Mbit/s to 49.3 Mbit/s. • One feature change is experimental support for RX windows larger than 255 packets (360 KB). This release extends the RX flow control state machine to support windows larger than the Selective Acknowledgment table. The new maximum of 65,535 packets (90 MB) could theoretically fill a 100 Gbit/s pipe provided that the packet allocator and packet queue management strategies could keep up.
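
The window and RTT figures above follow from the usual window/RTT bound: with at most one window of data in flight per round trip, throughput is roughly window bytes × 8 / RTT. Here is a minimal sketch reproducing the numbers, assuming a usable payload of about 1,444 bytes per RX packet (inferred from the 60 packets = 84.6 KB figure; actual packet sizes vary):

    # Back-of-the-envelope RX throughput bound: one window per round trip.
    # Assumes ~1,444 bytes of usable payload per RX packet (an inference from
    # the 60 packets = 84.6 KB figure above, not a protocol constant).
    def rx_max_throughput_mbit(window_packets, rtt_seconds, payload_bytes=1444):
        return window_packets * payload_bytes * 8 / rtt_seconds / 1e6

    for window, rtt in [(60, 0.001), (128, 0.001), (60, 0.030), (128, 0.030)]:
        print(f"{window:3d} packets @ {rtt * 1000:4.0f} ms RTT -> "
              f"{rx_max_throughput_mbit(window, rtt):7.1f} Mbit/s")
    # -> roughly 693 and 1479 Mbit/s at 1 ms RTT; 23.1 and 49.3 Mbit/s at
    #    30 ms RTT, matching the figures quoted above within rounding.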

  15. Major milestones since HEPiX Spring 2019 AuriStor File System v0.184 • Security improvements include volserver validation of destination volserver security policies prior to transmitting marshaled volume data. Prior to v0.184, the destination volservers were trusted to reject volumes whose security policy could not be enforced. Linux cache managers can no longer be keyed with rxkad tokens. Introduction of a PAM module capable of managing tokens for both AuriStorFS and Linux kernel kAFS. • The UNIX Cache Manager underwent major revisions to improve the end-user experience by revealing more error codes, improving directory cache efficiency, and improving overall resiliency. The cache manager implementation was redesigned to be more compatible with operating systems such as Linux and macOS that support restartable system calls. With these changes, errors such as "Operation not permitted", "No space left on device", "Quota exceeded", and "Interrupted system call" can be reliably reported to applications. Previously such errors might have been converted to "I/O error". These changes are expected to reduce the likelihood of "mount --bind" and getcwd failures on Linux with "No such file or directory" errors. • Fixed a potentially serious race condition and reference counting error in the vol package shared by the Fileserver and Volserver that could prevent volumes from being detached, which in turn could prevent the Fileserver and Volserver from shutting down. After 30 minutes the BOSServer would terminate both processes. The reference counting errors could also prevent a volserver from marshaling volume data for backups, releases, or migrations. • This release moves the location of the cache manager's cmstate.dat from /etc/yfs/ to /var/yfs/ or /var/lib/yfs, depending upon the operating system. The cmstate.dat file stores the cache manager's persistent UUID, which must be unique. The cmstate.dat file must not be replicated. If virtual machines are cloned, the cmstate.dat must be removed. The cmstate.dat file must not be managed by a configuration management system.
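
As a concrete illustration of the cmstate.dat handling described above, a first-boot step for cloned virtual machines could simply remove any copied cmstate.dat before the client service starts, so the cache manager generates a fresh UUID. This is a minimal sketch, not an AuriStor-supplied tool; run it (or an equivalent) as root before the AuriStorFS client starts and adjust the paths to your platform:

    #!/usr/bin/env python3
    # Sketch of a VM-clone first-boot step: remove any copied cmstate.dat so
    # the cache manager generates a new, unique UUID on first start. Paths
    # follow the locations mentioned on this slide; adjust for your platform.
    import os

    CANDIDATE_PATHS = [
        "/etc/yfs/cmstate.dat",      # pre-v0.184 location
        "/var/yfs/cmstate.dat",      # v0.184+ on some platforms
        "/var/lib/yfs/cmstate.dat",  # v0.184+ on other platforms
    ]

    for path in CANDIDATE_PATHS:
        try:
            os.remove(path)
            print(f"removed {path}")
        except FileNotFoundError:
            pass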

  16. AuriStor File System Latest Linux Platform support • Distributions: • Red Hat Enterprise Linux 7.6 (and clones) • Red Hat Enterprise Linux 6.10 • Red Hat Fedora 29 • Debian Stretch • Ubuntu 18.04 LTS • Amazon Linux 2 • New Architectures • PPC-LE and PPC-BE • aarch64 • End of support • Red Hat Enterprise Linux 5.x • Fedora 26

  17. AuriStor File Server workaround for IBM/OpenAFS client RXAFSCB deadlocks • All IBM and OpenAFS Unix cache managers are vulnerable to callback service deadlocks • A deadlocked client will be unresponsive to all RXAFSCB RPCs • The fileserver times out after 120 seconds and responds to the failed RXAFSCB RPC by failing inbound RPCs with VBUSY • CMs retry VBUSY failures 100 times before failing the VFS operation • A deadlock can resolve itself after approximately 2.5 to 3 hours • The AuriStorFS fileserver detects a deadlocked RXAFSCB service and fails inbound RXAFS RPCs in a way that prevents retries, forcing immediate VFS operation failure
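
A rough back-of-the-envelope check on the 2.5 to 3 hour figure (the per-retry stall is an assumption on my part, not stated on the slide): if each of the 100 VBUSY retries again waits out most of the 120-second callback timeout before failing, the stall adds up to roughly 2.5 to 3.3 hours:

    # Rough arithmetic only; assumes each retried RPC stalls for most of the
    # 120-second RXAFSCB timeout before the VBUSY error comes back.
    retries = 100
    for per_retry_seconds in (90, 120):
        hours = retries * per_retry_seconds / 3600
        print(f"{retries} retries x {per_retry_seconds}s = {hours:.1f} hours")
    # -> 2.5 and 3.3 hours, consistent with the 2.5-3 hour window noted above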

  18. AF_RXRPC, KAFS, and kafs-utils Status Report

  19. The kAFS Project: Why a native Linux kernel /afs client? • IBM AFS, OpenAFS and AuriStorFS are license-incompatible with the Linux kernel • New Linux kernel releases every six to eight weeks • Constant changes to internal kernel interfaces – few compatibility guarantees • Developers making interface changes are only responsible for fixing in-tree source code • Without stable interface semantics, function call changes can silently introduce data corruption, not merely break /afs behavior • Mainline changes have full transparency – distribution kernels do not • Too many variants to keep track of • As an integrated part of the mainline Linux kernel • kAFS can leverage GPL-only kernel APIs and benefits from all Linux mainline development processes • Tree-wide Linux VFS interface changes must be applied to kAFS at the same time as other in-tree file systems • KASAN and Coverity testing is performed against kAFS • Code review by a much broader community of storage and network stack developers

  20. The kAFS Project: The native Linux kernel /afs client • The kAFS project provides access to the /afs global file namespace or to individual AFS volumes from the Linux kernel without any third-party kernel or FUSE modules. • kAFS shares no source code with prior RX and AFS implementations. • kAFS comprises four in-kernel components: • kAFS itself. This is a network filesystem that’s used in much the same way as any other network filesystem provided by the Linux kernel, such as NFS. • AF_RXRPC. This is a socket family implementation of the RX RPC protocol for use by kAFS and userspace applications. • Kernel keyrings. This facility’s primary purpose is to retain and manage the authentication tokens used by kAFS and AF_RXRPC, though it has been kept generic enough that it has been adapted to a variety of roles within the kernel. • FS-Cache. This is an interface layer that acts as a broker between any network filesystem and local storage, allowing retrieved data to be cached. It can be used with the NFS family, CIFS, Plan 9 and Ceph in addition to kAFS. • and three userspace components: • kafs-client. This will be the configuration and systemd integration for automatically mounting the kAFS filesystem, plus tools for managing authentication. • kafs-utils. This is a suite of AFS management tools, written in Python 3, that use AF_RXRPC from userspace as a transport to communicate with a server. • keyutils. This is a set of utilities for manipulating the keyrings facility from userspace. • See https://www.infradead.org/~dhowells/kafs/ for detailed status and example code.

  21. Status of AF_RXRPC Development (Linux 5.1) • Implemented features • Usable from userspace • socket(AF_RXRPC, ...) • Supports client and server calls over the same socket • kauth, Kerberos 4 and Kerberos 5 security • plain, partial and full encryption • IPv6 support • Kernel tracepoints • Slow start and fast restart congestion management • Service upgrades • Features that need to be added for AuriStorFS • YFS-RXGK security class

  22. Status of kAFS Development (Linux 5.1) • Implemented features: • POSIX syscalls • Advisory file locking • Encryption and authentication • Automounting of mountpoints • Failover (DB and FILE services) • Kernel tracepoints. • DNS SRV and AFSDB record lookup • Busy volume failover • Path substitution (@sys and @cell) • IPv6 support (requires AuriStorFS DB and FILE) • AuriStorFS RX service RPCs • Dynamic root /afs mount • mount -t afs none /afs -o dyn • Direct volume mount • mount -t afs "#grand.central.org:root.cell" /mnt https://www.infradead.org/~dhowells/kafs/todo.html

  23. Namespacing for Containers (Linux 5.1) • Network namespacing • Each namespace appears as a separate client to the cell servers • Token management • Network namespace propagation over automounts • New Mount API for Linux introduced in 5.1 • Still to do … • New Namespace aware Linux Keyrings • Populate credentials from the host • Processes in the container are unaware the credentials exist • Credentials apply to Container filesystem access

  24. Linux AFS Next Steps • Linux AFS will provide out-of-the-box support for /afs file access • No more concerns about API and ABI kernel module incompatibilities • File feature requests with your preferred Linux distribution • For Red Hat: • Added to Fedora Rawhide for the 4.19 kernel • Will be included in Fedora 30 • Branch from Rawhide Feb 19, Beta Mar 26, GA Apr 30 • Bugzillas for inclusion in RHEL 8.x and backport to 7.x have been submitted • For Debian, a volunteer packager must be identified • For Ubuntu, Debian must package first • For SUSE: • Added to openSUSE • For SLES, requests must be submitted • For other distributions, ???

  25. HEPiX Spring 2019 BOF AuriStor File System and Linux Kernel AFS/AF_RXRPC Presented by Jeffrey Altman (jaltman@auristor.com)
