Validating a Distributed File Replication System using Functional Programming • Nikolaj Bjørner • Microsoft Corporation
This talk • Distributed File Replication Challenges • Design and prototyping in OCaml • Design and Test using AsmL • Validation on the product core
Some background • Personal experience with SML/NJ during the 90s: • Great language, very reliable compiler • Had to add custom features to CML threads with every compiler release • Had to add custom ad-hoc RPC support • Had to rebuild the runtime for FFI • Current work uses mainly C, C++: • Even CLR is challenged among systems programmers • Not purely a matter of ignorance: • my boss wrote a Scheme compiler 15 years ago
Selected comparisons • IceCube file replication • Based on high-level description of conflicting file operations • Log-based: event logs are shipped and resolved by general purpose constraint solver • Extends to undos touching just cone of influence • Unison • Written in OCaml • Cross platform, limits handling of renames/moves • Provides precise definition of conflict • Geared for manual conflict resolution, pair-wise synchronization • NTFRS • Shipped as part of Windows Server • Targets multi-master 24/7 replication • Automatic conflict resolution is a prerequisite
What makes file replication hard? • Distributed consensus over rich data structure: • Scalability – Large corporations have thousands of servers • Performance – Some customers have a lot of files • Responsiveness – Who wants delays? • Handling limitations and faults: • Memory is finite – relying on event logs is not an option • Machines may appear/disappear at any time • Disks may fault at any time • What the customer needs: • Replication must work with applications it has no control over, e.g. touch –r * • Rich feature interaction
It is an old problem, but: • Protocol specification and implementation are easily confused. • What is this protocol supposed to do? What should be tested? • Reasoning about changes. • It may work for 5 but not 6 machines. • New features are 99% likely to be at odds with previous assumptions. • Typical test metric is code (path) coverage. • It is only a partial metric. It says very little about system behavior on data and combinations of events. • Scalability, reliability and performance continue to be challenging
What we did about it • Specify replication consistency declaratively • Describe conflict resolution scheme using a state machine • Capture design in a high-level language at an abstract level • Validate design by extensive state space exploration against declarative specification • Develop production version using the same decomposition as the design
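The first two bullets can be illustrated with a minimal OCaml sketch. Everything in it is an assumption made for illustration and not the design described in the talk: the `version` record, the "highest (clock, origin) wins" policy in `resolve`, and the names `merge` and `consistent` are hypothetical. It only shows how a declarative consistency predicate can be kept separate from the resolution rule it is checked against.

```ocaml
(* Hypothetical model: a replica maps file names to versioned entries. *)
module SMap = Map.Make (String)

(* Assumed versioning scheme: a logical clock plus a unique machine id. *)
type version = { clock : int; origin : int }

(* content = None stands for a tombstone left by a delete. *)
type entry = { version : version; content : string option }

type replica = entry SMap.t

(* Assumed conflict policy: the larger (clock, origin) pair wins.
   Ties cannot occur as long as machine ids are unique. *)
let resolve a b =
  if compare (a.version.clock, a.version.origin)
             (b.version.clock, b.version.origin) >= 0
  then a else b

(* Synchronization step: take the winner for every file known to either side. *)
let merge (r1 : replica) (r2 : replica) : replica =
  SMap.union (fun _name a b -> Some (resolve a b)) r1 r2

(* Declarative specification: replicas are consistent when they agree
   on every file. It says nothing about how agreement is reached. *)
let consistent (rs : replica list) =
  match rs with
  | [] -> true
  | r :: rest -> List.for_all (SMap.equal ( = ) r) rest

(* Usage: two machines write the same file concurrently, then sync. *)
let write clock origin name data r =
  SMap.add name { version = { clock; origin }; content = Some data } r

let a = write 1 0 "f.txt" "from A" SMap.empty
let b = write 1 1 "f.txt" "from B" SMap.empty
let a' = merge a b and b' = merge b a

let () = assert (consistent [ a'; b' ])
```

Because `resolve` is deterministic and independent of argument order, both machines pick the same winner, which is what the `consistent` predicate checks.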
First prototype used OCaml • Reasons for choosing OCaml for initial design • FP is perfect for fast prototyping • OCaml has a decent compiler and a “no more tears” FFI. • Positioning OCaml in house • Camouflage: “We used a scripting language, you don’t want to know the details” • It was already blessed: “OCaml is a dialect of F#” • Ask the consultants: “Collaboration Application Modeling Language ... distributed by INRIA since 1984…” • Some real hurdles for using OCaml more • Licensing • Most colleagues (in file systems) don’t do FP. • Windows build has rigid requirements for supported compilers. • Even CLR based development has its own hurdles. • Deal with out of memory errors • There are attractive alternatives for design – more about this later.
Prototype validation • Identify main components with modules. • Identify main invariants with each module. • Expose main events as state transformations • Create, update, delete, move file • Synchronize • Perform depth first search over event vocabulary. • Optimization: partial order reduction is easy by weaving • Moral: • This is easy using FP (OCaml) – 5 weeks on initial design + prototype. • Applicative features enabled extremely efficient search over state space.
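A bounded depth-first search over an event vocabulary can be sketched in a few lines of OCaml. This is a toy under stated assumptions and not the prototype from the talk: the two-replica `state`, the `invariant`, the event names and the deliberately broken `buggy_sync` are all hypothetical, and partial order reduction is left out. It needs OCaml 4.10 or later for `List.find_map`.

```ocaml
(* Depth-first exploration: apply every event to every reachable state up to
   [depth] steps and return the first trace that breaks the invariant. *)
let explore ~events ~invariant ~depth init =
  let rec go state trace d =
    if not (invariant state) then Some (List.rev trace)
    else if d = 0 then None
    else
      List.find_map
        (fun (name, step) -> go (step state) (name :: trace) (d - 1))
        events
  in
  go init [] depth

(* Hypothetical toy model: two replicas each hold a version counter for one file. *)
type state = { a : int; b : int; synced : bool }

let init = { a = 0; b = 0; synced = true }

(* Invariant: right after a synchronization both replicas must agree. *)
let invariant s = (not s.synced) || s.a = s.b

let good_sync s = let m = max s.a s.b in { a = m; b = m; synced = true }

(* Deliberately broken: only replica B is brought up to date. *)
let buggy_sync s = { s with b = max s.a s.b; synced = true }

let events sync =
  [ ("update A", fun s -> { s with a = s.a + 1; synced = false });
    ("update B", fun s -> { s with b = s.b + 1; synced = false });
    ("sync", sync) ]

let () =
  match explore ~events:(events buggy_sync) ~invariant ~depth:2 init with
  | Some trace -> print_endline (String.concat "; " trace)
  | None -> print_endline "no violation found"
```

With `buggy_sync` the search reports the trace `update B; sync`. States are immutable records, so backtracking costs nothing: the search simply keeps using the old value, which is one way to read the remark about applicative features.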
Protocol validation on the production code • Repeated validation approach on shippable C++ core. • Challenges we had to solve: • Dealing with threads • Blocking calls • Enforce layering among developers • Efficient state resumption • Benefit: Validation on ≈ ½ trillion traces during early development.
Validation summary • Early CAML prototype was validated on 120B traces (80 machines / 48 hours) • C++ production core was later checked on 500B traces (200 machines / 2 weeks) • Both systems operate in both virtual and real mode • Virtual mode replaces device state by much faster main memory • [Diagram labels: Model Simulation; Actual system Stress Test; ~10^12 scenarios; Synchronization Core modules; ~10^6 scenarios] • Virtual modules: Fake File System, In-memory Database, Network using LPC • Real modules: NTFS, Database, RPC • NTFS: NT File System • RPC: Remote Procedure Call • LPC: Local Procedure Call
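The split between virtual and real modules maps naturally onto an OCaml signature with interchangeable implementations. The sketch below is an illustration under assumptions, not the actual module layout: `FILE_SYSTEM`, `Fake_fs`, `Sync_core` and `replicate_file` are hypothetical names, and only the in-memory variant is shown. A real variant would implement the same signature on top of the actual file system.

```ocaml
(* Hypothetical interface that the synchronization core is written against. *)
module type FILE_SYSTEM = sig
  type t
  val create : unit -> t
  val write : t -> string -> string -> unit
  val read : t -> string -> string option
  val delete : t -> string -> unit
end

(* Virtual module: device state is replaced by a table in main memory. *)
module Fake_fs : FILE_SYSTEM = struct
  type t = (string, string) Hashtbl.t
  let create () = Hashtbl.create 16
  let write fs path data = Hashtbl.replace fs path data
  let read fs path = Hashtbl.find_opt fs path
  let delete fs path = Hashtbl.remove fs path
end

(* The core is a functor, so the same code runs in virtual and real mode. *)
module Sync_core (Fs : FILE_SYSTEM) = struct
  (* Copy one file from [src] to [dst]; a missing source file is
     propagated as a delete on the destination. *)
  let replicate_file ~src ~dst path =
    match Fs.read src path with
    | Some data -> Fs.write dst path data
    | None -> Fs.delete dst path
end

module Virtual = Sync_core (Fake_fs)

let () =
  let src = Fake_fs.create () and dst = Fake_fs.create () in
  Fake_fs.write src "a.txt" "hello";
  Virtual.replicate_file ~src ~dst "a.txt";
  assert (Fake_fs.read dst "a.txt" = Some "hello")
```

Keeping the core ignorant of which implementation sits underneath is what lets exhaustive exploration run against fast in-memory state while stress tests exercise the real devices.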
Time went by, requirements evolved... • Protocol design update with AsmL • Compelling reasons for using AsmL: • Readable Protocol Description • Built-in Test Case Generation tools • Embedded (Word) Design Documentation • It is our up-to-date protocol reference document • Some notes on AsmL: • AsmL borrows several features found in functional and specification languages • AsmL is being displaced by Spec# • Today used more by test and testers than designers • Parallelism by default in AsmL can be hard to adapt to.
FP within Core File Systems • Some of the promises of OCaml and other FP languages • Clear benefit for early prototyping and validation • Within Microsoft: related environments are getting traction: AsmL, F#, Spec#,… • Productivity enhancement is unparalleled • Some of the issues for taking FP further: • Licensing and support • Dealing with research tools/prototypes • Libraries and integration with existing APIs • CLR is one approach, but even CLR libraries require tuning • Python is a different compelling example for the FP community • Ignorance and inertia • College-groomed systems programmers rarely date FP aficionados • Test and support are specialized to mainstream languages • FP systems need to address basic problems: • Memory management: fragmentation, handling memory allocation failure • Engineering-centric problems, such as debugging and build environments