Working Group: Practical Policy Rainer Stotzka, Reagan Moore


Agenda • Thursday March 27, 2014 3:30-5:00 PM • Introduction to policy-based data management • Discussion of data policy manager for EUDAT (Mark van de Sanden) • Presentation on natural language rule processing (Chitta Baral) • Initial presentation of summary of policies across data centers and research projects (Jewel Ward) • Friday March 28, 2014 11:00 AM-12:30 PM • Discussion of policy summary • Identification of best practices • Discussion of policy testing – interoperability testbed • Integration with deliverables from other working groups • Persistent identifiers • Linked-data – HIVE • Type registry • Data Foundation and Terminology • Preservation interest group

Practical Policy Working Group Focuses: • Identify the most important policies • Practical implementations for managing research data collections • Provide recommendations for a “starter kit” • Testbeds: • Evaluate standard policies • Test interoperability across WGs Policy: Assertion or assurance that is enforced about a collection or a dataset

Concept Graph by Reagan Moore [Figure: concept graph. Nodes: Purpose, Collection, Property, Policy, Procedure, Workflow, Function, Persistent State Information. Example instances: Integrity, Consistency, SysChksumDataObj. Edge labels: Defines, Isa, Updates, Controls, Chains, HasFeature.]

Concept Graph by Reagan Moore [Figure: expanded concept graph. Nodes: Purpose, Collection, Digital Object, Attribute, Property, Policy, Procedure, Workflow, Function, Operation, Client Action, Policy Enforcement Point, Persistent State Information, Periodic Assessment Criteria. Policies: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Properties: Integrity, Authenticity, Access control, Completeness, Correctness, Consensus, Consistency. Attributes: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM. Functions: GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj. Edge labels: Defines, Has, Isa, SubType, Updates, Controls, Chains, Invokes, HasFeature.]

Policy Categories Integrity Management Administrative Assessment Access Control Replication Provenance Preservation Collection-based Policies Regulatory Description Data Management Plans Data Staging Publication Data Lifecycle Management Federation Compliance

Management Testbeds • iRODS: Renaissance Computing Institute • E-iRODS: DataNet Federation Consortium – DFC • dCache: Institute of Physics of the Academy of Sciences, CESNET • Dataverse: Odum Institute • List of policies in the RDA Wiki • Monthly telephone conferences (RDA) • “Policy of the month”: review of policies that have been submitted • 54 persons registered

Interactions with other WGs Data Foundation and Terminology WG • Discussion of a vocabulary for operations Preservation Infrastructure IG • Policies for preservation Persistent Identifiers • Properties versus operations on identifiers Data Citation WG • Type registry Metadata • Linked-data vocabularies

EUDAT Data Policy Manager • Peisar – Storage Policies at CESNET

Natural Language Rule Processing • Why? • Users or domain experts need not learn the syntax of the rule language. • They specify their rules in natural language. • How? • A natural language specification of rules is translated into the syntax of the rule language in two steps. • Step 1: Natural language to an intermediate language (the focus is on correct translation and on handling the challenges and quirks of natural language) • Step 2: Intermediate language to rule language (should be more straightforward, as both are formal languages and the intermediate language has a very restricted vocabulary) • Our focus in this presentation is on Step 1.

Underlying Technical Approach • Montague’s approach: The meanings of words and phrases are lambda calculus formulas • The meaning (or translation) of a sentence is obtained by combining the meanings of its words and phrases. • Usually as dictated by a grammar • Categorial grammars (especially CCG) are often used, as they give directionality regarding how to combine.

NL to Policy Example • Print [S/NP]: λz. print(z) • financial [NP/N]: λy. y@finance • report [N]: λx. report(x) • financial report [NP]: (λy. y@finance) @ (λx. report(x)) = (λx. report(x)) @ finance = report(finance) • Print financial report [S]: (λz. print(z)) @ report(finance) = print(report(finance))
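The derivation on this slide can be replayed with ordinary Python functions. The following is a minimal sketch, not the NL2KR implementation: logical terms are built as plain strings, and the helper names (forward_apply, print_term, report_term) are assumptions introduced for illustration only.

```python
# Logical terms are represented as strings so the result is easy to inspect.
# All helper names here are hypothetical, chosen for this illustration.

def report_term(x):
    return f"report({x})"

def print_term(z):
    return f"print({z})"

# Lexicon entries taken from the slide.
lex_print = lambda z: print_term(z)        # Print     [S/NP]: λz. print(z)
lex_financial = lambda y: y("finance")     # financial [NP/N]: λy. y@finance
lex_report = lambda x: report_term(x)      # report    [N]:    λx. report(x)

def forward_apply(functor, argument):
    """CCG forward application: a functor of category X/Y consumes a Y."""
    return functor(argument)

# financial report [NP]: (λy. y@finance) @ (λx. report(x)) = report(finance)
financial_report = forward_apply(lex_financial, lex_report)

# Print financial report [S]: (λz. print(z)) @ report(finance)
sentence = forward_apply(lex_print, financial_report)

print(financial_report)
print(sentence)
```

The category of each word ([S/NP], [NP/N], [N]) only determines the order of application here; the sketch does not check categories.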

Illustration of Montague’s approach using CCG and λ-calculus • Every boxer walks.

The Key Issue(s) • Where do we get the lambda expressions from? • Handcrafting them is not scalable • Lambda expressions get complex in a hurry and handcrafting creates a bottleneck • Too many words • Since the target language is not unique, we cannot painstakingly make new dictionaries for each target language • Target languages evolve • Other standard issues • Ambiguity: Multiple meanings of words; word sense disambiguation; etc.

How to get the lambda expressions? How do we learn natural languages? • Often • We know the meaning of a sentence • We know the meaning of most of the individual words in that sentence • But we do not know a priori the meaning of some particular word(s) in that sentence • We are able to correctly guess the meaning of those words • Follow a similar approach • Given a set of training examples and an initial dictionary, learn the lambda expressions for the words in those examples that are not in the dictionary • Inverse Lambda operators

Inverse λ Example • Every boxer walks.

Inverse λ – another Example • Print financial report [S]: print(report(finance)) • Print [S/NP]: λz.print(z) • financial report [NP]: report(finance) • financial [NP/N]: λy. y@finance • report [N]: λx. report(x)
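The idea behind an inverse lambda operator can be sketched as follows: given the meaning H of a phrase and the meaning G of one of its parts, find F such that F@G = H. The code below is a much simplified stand-in, assuming terms are nested tuples and that F is obtained by abstracting every occurrence of G in H. The function names (contains, substitute, inverse_for_functor) are hypothetical and this is not the operator definition used in NL2KR.

```python
# Terms are nested tuples: ("print", ("report", "finance")) stands for
# print(report(finance)). This representation is an assumption of the sketch.

def contains(term, target):
    """Return True if target occurs anywhere inside term."""
    if term == target:
        return True
    if isinstance(term, tuple):
        return any(contains(part, target) for part in term)
    return False

def substitute(term, target, replacement):
    """Replace every occurrence of target inside term by replacement."""
    if term == target:
        return replacement
    if isinstance(term, tuple):
        return tuple(substitute(part, target, replacement) for part in term)
    return term

def inverse_for_functor(h, g):
    """Given H and G, return a function F with F(G) == H (F = λz. H[G := z])."""
    if not contains(h, g):
        raise ValueError("G does not occur in H; no F found by abstraction")
    return lambda z: substitute(h, g, z)

# Known: the sentence meaning and the meaning of "financial report".
h = ("print", ("report", "finance"))   # Print financial report [S]
g = ("report", "finance")              # financial report [NP]

# Learned: the meaning of "Print", which behaves like λz. print(z).
learned_print = inverse_for_functor(h, g)

print(learned_print(g))
print(learned_print(("report", "budget")))
```

Applying the learned function to a different argument shows that it generalizes in the same way as λz. print(z).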

Another Example • Send email to curator of the collection [S]: send(email, curator(collection)) • Send email [S/NP]: λz. send(email,z) • to curator of the collection [NP]: curator(collection) • Send [(S/NP)/NP]: λy. λz. send(y,z) • email [NP]: email • to [NP/NP]: λx.x • curator of the collection [NP]: curator(collection) • curator [NP]: λx. curator(x) • of the collection [NP\NP]: λy. y@collection • of [(NP\NP)/NP]: λx.λy. y@x • the collection [NP]: collection
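This example uses both directions of combination: "of the collection" [NP\NP] takes its argument from the left, while the other functors take theirs from the right. A minimal sketch under the same assumptions as before (string-valued terms, hypothetical helper names forward and backward, no category checking):

```python
# Term constructors; terms are plain strings for easy inspection.
def send(y, z):
    return f"send({y}, {z})"

def curator(x):
    return f"curator({x})"

# Lexicon from the slide. Keys are the surface words, values their meanings.
lexicon = {
    "Send": lambda y: lambda z: send(y, z),   # (S/NP)/NP:  λy. λz. send(y,z)
    "email": "email",                         # NP
    "to": lambda x: x,                        # NP/NP:      λx. x
    "curator": lambda x: curator(x),          # NP:         λx. curator(x)
    "of": lambda x: lambda y: y(x),           # (NP\NP)/NP: λx. λy. y@x
    "the collection": "collection",           # NP
}

def forward(functor, argument):
    """Forward application: X/Y followed by Y gives X."""
    return functor(argument)

def backward(argument, functor):
    """Backward application: Y followed by X\\Y gives X."""
    return functor(argument)

of_the_collection = forward(lexicon["of"], lexicon["the collection"])  # λy. y@collection
curator_phrase = backward(lexicon["curator"], of_the_collection)       # curator(collection)
to_phrase = forward(lexicon["to"], curator_phrase)                     # curator(collection)
send_email = forward(lexicon["Send"], lexicon["email"])                # λz. send(email,z)
sentence = forward(send_email, to_phrase)

print(sentence)
```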

NL2KR System Architecture • NL2KR-L (learning) • NL2KR-T (translation)

NL2KR-L System: Learning Process • Generate all parse trees of the sentences • Learn lexicon using Inverse-λ and Generalization • Generalize complete lexicon • Parameter Estimation

NL2KR-T System: Translation Process • Generate all parse trees of the sentences • Generalize the missing meanings of words and recompute parse trees • Use PCCG to rank the translations

Current Status • We have a prototype that translates English description of policy rules to a formal representation • Working towards making it usable in iRODS • Step 1: English to a formal policy specification (in an intermediate language) • Step 2: Formal policy specification to Rules (in a lower level language)

Illustration: Training Data Set

Illustration: Initial Lexicon

Illustration: Iteration 1 of Inverse λ

Illustration: Lexicon after parameter estimation

NL2KR Webpage

NL2KR Download Page

From the NL2KR manual

Natural Language Rule Processing: Conclusion • Described an approach to translate a natural language (NL) specification to an intermediate (formal) language, which can then be translated to rules. • Theory: Added inverse-lambda-based learning to Montague’s lambda-calculus-based approach. • System: Developed the NL2KR system. • Used the NL2KR system to build a translation system from NL to an Intermediate Policy Description Language. • The NL2KR system can be used for developing translation systems from natural language to other formal languages. • Has been evaluated in domains such as Geoquery, Robocup language, puzzles, and Biology questions.

Invitation We are seeking: Data experts & domain scientists! • Provide policies already in use: RDA Wiki • Description • Implementation • Express wishes about policies you might need • Discuss and analyze policies • Enhance the cross-over to other WGs, IGs and initiatives

Survey of 30 Institutions for Highest Priority Policies

Summary of policies in production use • Policy for data retention. How long, how short? Need preservation, or not? (5) Retention and disposition • Notification policies. (Ex. must warn data researcher that their data will be deleted at X time.) (6) Notification on event • Transferability policies. The data must be transferable from the repository back to the researcher and the repository of origin. Or, in the event of defunding, the data must be de-accessioned and moved to another repository (or not, depending on relevant SOPs, agreements, etc.). • Policies re: costs and who pays for all of this data storage (8) • Policies around context. Sometimes the original data and additional metadata are needed. Sometimes, the context or derived data is what matters, and not the data itself. (7) • Policies re: tagging/annotating data • Search/Information Retrieval policies. What parts of the data will you search on, or not search on? (4) Controlling search • Standard Sys Admin policies: (1) replication, backup, (2) integrity checks, syncing with backups. • Content policies: do we care what content and file formats users upload? Some do, some don't. (3) Transformative migration • Policy to educate researchers about all of the different policies relevant to the data repository. For example, a user agreement/Terms & conditions statement that researchers must check off.

Best Practices for production policies • Consensus on a policy • Use at multiple institutions • Generality • Best practice policy components • Name of operation that policy controls • Constraints that policy implements • State information that policy uses or modifies • Verification policy • Example of running code • Documentation

Operations managed by policies • Paper posted that lists 70 operations • Policy-verification.docx • Candidate operations • Access control • Backups • Data retention • Descriptive metadata • Format creation • Integrity checks • Notification • Policy constraints • Replication • Restricted search • Storage cost • Tags • Use agreements

Types of policies

Policy Types

Replication Policy • Operation that is being controlled • Replicate a file • Controls • When is replication done? • When file is ingested • When file is changed • Which files are replicated? Choose based on: • Collection • User • Size • Replication properties • Choice of replication location • Choice of access controls on replica • Requirement for checksum • Verification of checksum on replica creation • Variants: • Versioning of changes vs replication • Backups vs replication (time-stamped copy) • Verification • When should replica existence be verified?
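The controls listed on this slide can be made concrete with a small sketch. It assumes an in-memory dictionary standing in for storage locations; the class and field names (DataObject, ReplicationPolicy, replica_location, and so on) are hypothetical and do not correspond to the iRODS, dCache, or Dataverse APIs. It covers selection by collection, user, and size, replication at ingest, checksum verification on replica creation, and a later verification step.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

@dataclass
class DataObject:
    path: str
    collection: str
    owner: str
    data: bytes
    # State information the policy updates: replica location -> checksum.
    replica_checksums: dict = field(default_factory=dict)

@dataclass
class ReplicationPolicy:
    collections: set                  # which collections are replicated
    max_size_bytes: int               # size constraint
    replica_location: str             # choice of replication location
    owners: Optional[set] = None      # None means any user
    require_checksum: bool = True

    def selects(self, obj: DataObject) -> bool:
        """Constraints: choose files by collection, user, and size."""
        if obj.collection not in self.collections:
            return False
        if self.owners is not None and obj.owner not in self.owners:
            return False
        return len(obj.data) <= self.max_size_bytes

    def on_ingest(self, obj: DataObject, storage: dict) -> bool:
        """Operation: replicate the file when it is ingested."""
        if not self.selects(obj):
            return False
        key = (self.replica_location, obj.path)
        storage[key] = bytes(obj.data)
        if self.require_checksum:
            checksum = sha256(storage[key])
            # Verify the checksum at replica creation.
            if checksum != sha256(obj.data):
                raise IOError("replica checksum mismatch")
            obj.replica_checksums[self.replica_location] = checksum
        return True

    def verify(self, obj: DataObject, storage: dict) -> bool:
        """Verification: the replica exists and, if recorded, its checksum matches."""
        copy = storage.get((self.replica_location, obj.path))
        if copy is None:
            return False
        expected = obj.replica_checksums.get(self.replica_location)
        return expected is None or sha256(copy) == expected

storage = {}
policy = ReplicationPolicy({"/archive"}, 1_000_000, "site-b")
obj = DataObject("/archive/a.dat", "/archive", "alice", b"payload")
replicated = policy.on_ingest(obj, storage)
ok_before = policy.verify(obj, storage)
storage[("site-b", obj.path)] = b"corrupted"   # simulate a damaged replica
ok_after = policy.verify(obj, storage)
print(replicated, ok_before, ok_after)
```

Keeping the selection constraints, the operation, and the verification step in one object mirrors the Policy : Operation : Constraints : State Information breakdown used in this deck.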

Policy : Operation : Constraints : State Information

Interactions with other Working Groups • Interoperability testbed • Demonstrate that RDA recommendations can be jointly implemented • Control policies • Demonstrate that a desired practice can be applied consistently • Assessment policies • Verify that a recommended practice is followed • Integration • Demonstrate semantic consistency across systems-level integration • Example – are data objects considered to be immutable?

Practical Policy WG Interfaces with the other WGs • Interoperability testbed provided by Practical Policy WG • Persistent identifiers • Handle system • Metadata • HIVE linked-data vocabularies • Type registry • Expect implementation for integration • Data Foundation and Terminology • Exchange of concepts based on use cases • Preservation interest group • ISO 16363 assessment policies

Proposal - Special Interest Group on Interoperability Testbeds • New interest group is driven by the need to have testbeds with a longer lifetime than the Practical Policy working group. • Current testbeds • Dataverse • dCache • iRODS • Testbed functions • Demonstrate interoperability • Provide platform to evaluate proposed best practices / software • We need working groups to provide software systems or policies for testing. • Need a liaison to each working group

Special Interest Group on Interoperability Testbeds • Interested participants include: • David Antos, CESNET • Jon Crabtree, Dataverse • Marcio Faerman, OSU • Patrick Fuhrmann, dCache testbed, DESY • Thomas Jejkal, KIT Data Manager repository • Tibor Kalman, Persistent identifier consortium • Reagan Moore, DataNet Federation Consortium • Jakub Peisar, dCache testbed • Raphael Ritz, MPG
