
Kathleen Fisher* AT&T Labs Research padsproj

PADS: A System for Managing Ad Hoc Data. Kathleen Fisher* AT&T Labs Research www.padsproj.org. *And many many others…. Kenny Zhu. Dr. Zhu has been one of the main contributors to the PADS project.


Presentation Transcript


  1. PADS: A System for Managing Ad Hoc Data Kathleen Fisher* AT&T Labs Research www.padsproj.org *And many many others…

  2. Kenny Zhu • Dr. Zhu has been one of the main contributors to the PADS project. • He is finishing his Post Doc at Princeton and looking for jobs, both in North America and Asia. http://www.cs.princeton.edu/~kzhu/

  3. Data, Data, Everywhere! Incredible amounts of data stored in well-behaved formats: • Databases • XML. Tools: • Schema • Browsers • Query Languages • Standards • Libraries • Books, documentation • Training courses • Conversion tools • Vendor support • Consultants...

  4. We’re not always so lucky! Vast amounts of chaotic ad hoc data. Tools: • Perl • Awk • C • ...

  5. Web Logs

     207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
     207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/clear.gif HTTP/1.0" 200 76
     207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/back.gif HTTP/1.0" 200 224
     207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/women.html HTTP/1.0" 200 17534
     208.196.124.26 - Dbuser [15/Oct/2006:18:46:55 -0700] "GET /candatop.html HTTP/1.0" 200 -
     208.196.124.26 - - [15/Oct/2006:18:46:57 -0700] "GET /images/done.gif HTTP/1.0" 200 4785
     www.att.com - - [15/Oct/2006:18:47:01 -0700] "GET /images/reddash2.gif HTTP/1.0" 200 237
     208.196.124.26 - - [15/Oct/2006:18:47:02 -0700] "POST /images/refrun1.gif HTTP/1.0" 200 836
     208.196.124.26 - - [15/Oct/2006:18:47:05 -0700] "GET /images/hasene2.gif HTTP/1.0" 200 8833
     www.cnn.com - - [15/Oct/2006:18:47:08 -0700] "GET /images/candalog.gif HTTP/1.0" 200 -
     208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/nigpost1.gif HTTP/1.0" 200 4429
     208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/rally4.jpg HTTP/1.0" 200 7352
     128.200.68.71 - - [15/Oct/2006:18:47:11 -0700] "GET /amnesty/usalinks.html HTTP/1.0" 143 10329
     208.196.124.26 - - [15/Oct/2006:18:47:11 -0700] "GET /images/reyes.gif HTTP/1.0" 200 10859

  6. Haskell HI Files

     00000000: 0001 face 0000 0073 0400 0000 3600 0000 .......s....6...
     00000010: 3000 0000 3500 0000 3000 0000 0000 0000 0...5...0.......
     00000020: 0001 0000 0000 0100 0000 0043 0001 0000 ...........C....
     00000030: 0002 0200 0000 0200 0000 0300 0000 0200 ................
     00000040: 0000 0400 0000 4800 0100 0000 0200 0000 ......H.........
     00000050: 0502 0000 0000 0006 0000 0000 0007 0000 ................
     00000060: 0001 0000 0000 6800 0000 0000 006f 0000 ......h......o..
     00000070: 0000 0100 0000 0800 0000 0968 6173 6b65 ...........haske
     00000080: 6c6c 3938 0000 0007 4350 5554 696d 6500 ll98....CPUTime.
     00000090: 0000 0462 6173 6500 0000 0847 4843 2e42 ...base....GHC.B
     000000a0: 6173 6500 0000 0e47 4843 2e46 6f72 6569 ase....GHC.Forei
     000000b0: 676e 5074 7200 0000 0e53 7973 7465 6d2e gnPtr....System.
     000000c0: 4350 5554 696d 6500 0000 0a67 6574 4350 CPUTime....getCP
     000000d0: 5554 696d 6500 0000 1063 7075 5469 6d65 UTime....cpuTime
     000000e0: 5072 6563 6973 696f 6e                  Precision

  7. Ad Hoc Data from AT&T

  8. And Many Others... • Gene ontology data • Cosmology data • Financial trading data • Telecom billing data • Router config files • System logs • Call detail data • Netflow packets • DNS packets • Java JAR files • Jazz recording info • ...

  9. Why a data description language? • Ad hoc data is difficult to manage • Data arrives “as is” in a wide variety of encodings and formats. • Documentation is out of date or non-existent. • Data is buggy and potentially malicious. • Processing must detect errors and respond in application-specific ways. • Data sources often have high volume. • Existing solutions are insufficient • Lex/Yacc-like technologies target language syntax, rather than data. • Hand-coded C/Perl programs are time-consuming to produce, brittle with respect to changes, and fail to handle errors well. • Data description languages (DDLs) address these issues • Data expert writes a declarative description rather than a parser. • Description serves as living documentation. • Parser exhaustively detects errors without cluttering user code. • Parser can be proven correct with respect to its handling of buggy data. • From the declarative specification, the compiler can generate auxiliary tools. Data description languages facilitate managing ad hoc data.

  10. The PADS/C Data Description Language • Provides a rich and extensible set of base types for describing atomic data. • Pint8, Puint8, … // -123, 44 • Pstring(:'|':) // hello| • Pstring_FW(:3:) // catdog • Pstring_ME(:"/a*/":) // aaaaaab • Pdate, Ptime, Pip, … • Provides type constructors to describe structured data, by analogy with C: • Pstruct, Parray, Punion, Ptypedef, Penum • Allows arbitrary predicates to describe expected properties. • Compiler generates parser, printer, and other useful tools in a type-directed fashion. In the PADS/C DDL, each piece of data is described by a type, which specifies the physical format and semantic constraints of the data. PADS uses a type metaphor to declaratively describe ad hoc data.
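The base-type examples on this slide can be mimicked with a few small reader functions. The following is only a Python sketch of the idea, not the PADS runtime: the function names, the `(value, new_position)` return convention, and the use of `None` for failure are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical stand-ins for PADS base types. Each reader takes the input and
# a position, and returns (value, new_position), or None if it cannot match.

def pstring_term(data: str, pos: int, term: str) -> Optional[Tuple[str, int]]:
    """Like Pstring(:'|':): read up to, but not including, the terminator."""
    end = data.find(term, pos)
    if end < 0:
        return None
    return data[pos:end], end

def pstring_fw(data: str, pos: int, width: int) -> Optional[Tuple[str, int]]:
    """Like Pstring_FW(:3:): read exactly `width` characters."""
    if pos + width > len(data):
        return None
    return data[pos:pos + width], pos + width

def pstring_me(data: str, pos: int, pattern: str) -> Optional[Tuple[str, int]]:
    """Like Pstring_ME(:"/a*/":): read the text matched by a regular expression."""
    m = re.compile(pattern).match(data, pos)
    if m is None:
        return None
    return m.group(0), m.end()

def pint8(data: str, pos: int) -> Optional[Tuple[int, int]]:
    """Like Pint8: read a signed decimal integer that fits in 8 bits."""
    m = re.compile(r"[+-]?[0-9]+").match(data, pos)
    if m is None:
        return None
    value = int(m.group(0))
    if not -128 <= value <= 127:
        return None
    return value, m.end()
```

With these definitions, `pstring_fw("catdog", 0, 3)` consumes only `cat` and leaves `dog` for the next reader, which is the behavior the `// catdog` comment on the slide hints at.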

  11. Common Log Format in PADS/C. A complete PADS/C description of the web server log data shown in the box: 207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

      Parray Phostname {
        Pstring_SE(:"/[. ]/":) [] : Psep('.') && Pterm(Pnosep);
      };

      Punion host {
        Pip       ip;    /- 135.207.23.32
        Phostname host;  /- www.research.att.com
      };

      Punion auth_id {
        Pchar unauthorized : unauthorized == '-';
        Pstring(:' ':) id;
      };

      Penum method {
        GET, PUT, POST, HEAD, DELETE, LINK, UNLINK
      };

      Pstruct version {
        "HTTP/";
        Puint8 major; '.';
        Puint8 minor;
      };

      int chkVersion(version v, method m) {
        if ((v.major == 1) && (v.minor == 0)) return 1;
        if ((m == LINK) || (m == UNLINK)) return 0;
        return 1;
      };

      Pstruct request {
        '\"'; method meth;
        ' ';  Pstring(:' ':) req_uri;
        ' ';  version version : chkVersion(version, meth);
        '\"';
      };

      Ptypedef Puint16_FW(:3:) response : response x => { 100 <= x && x < 600};

      Punion length {
        Pchar unavailable : unavailable == '-';
        Puint32 len;
      };

      Precord Pstruct entry {
        host client;
        ' ';  auth_id remoteID;
        ' ';  auth_id auth;
        " ["; Pdate(:']':) date;
        "] "; request request;
        ' ';  response response;
        ' ';  length length;
      };

      Psource Parray clf {
        entry [];
      }

      PADS allows concise, precise, and intuitive data specifications.
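For comparison, here is roughly what a hand-written check of the same constraints might look like in Python. This is a sketch under assumptions, not generated PADS output: the regular expression, the `parse_entry` name, and the all-or-nothing `None` result are illustrative choices. It mirrors the `method` enumeration, the `chkVersion` predicate, the `100 <= x && x < 600` response constraint, and the `'-'` alternative of `length`.

```python
import re
from typing import Optional

# Hypothetical regular expression for one Common Log Format record.
CLF = re.compile(
    r'(?P<client>\S+) (?P<remote_id>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<meth>[A-Z]+) (?P<uri>\S+) HTTP/(?P<major>[0-9]+)\.(?P<minor>[0-9]+)" '
    r'(?P<response>[0-9]{3}) (?P<length>-|[0-9]+)$'
)

METHODS = {"GET", "PUT", "POST", "HEAD", "DELETE", "LINK", "UNLINK"}

def chk_version(major: int, minor: int, meth: str) -> bool:
    """Same logic as chkVersion in the description: LINK and UNLINK are only
    accepted together with HTTP/1.0."""
    if major == 1 and minor == 0:
        return True
    if meth in ("LINK", "UNLINK"):
        return False
    return True

def parse_entry(line: str) -> Optional[dict]:
    """Return the fields of one record, or None if any constraint fails."""
    m = CLF.match(line)
    if m is None:
        return None
    meth = m.group("meth")
    major, minor = int(m.group("major")), int(m.group("minor"))
    response = int(m.group("response"))
    if meth not in METHODS:
        return None
    if not chk_version(major, minor, meth):
        return None
    if not 100 <= response < 600:
        return None
    raw_length = m.group("length")
    return {
        "client": m.group("client"),
        "remote_id": m.group("remote_id"),
        "auth": m.group("auth"),
        "date": m.group("date"),
        "method": meth,
        "uri": m.group("uri"),
        "version": (major, minor),
        "response": response,
        "length": None if raw_length == "-" else int(raw_length),
    }
```

Unlike this sketch, which simply rejects a bad record, the slides describe a generated parser that also reports where and why a record failed.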

  12. PADS Parsing and Printing From a data description, the PADS compiler generates • a parser, which maps raw input data and a mask to a pair of an in-memory representation and a parse descriptor. • a parse descriptor, which records meta-data about a parse, including location and error information. • a mask, which allows dynamic customization of parser behavior. PADS has a formal semantics, so we can prove formal properties about the generated parsers, such as: • If the mask specifies “check all properties” and “set all representations,” and the parse descriptor indicates no errors, then the in-memory representation is correct. • Malicious data cannot corrupt the parser. The PADS compiler also generates a printer, which maps an in-memory rep and a parse descriptor back to raw form. We’d like printing and parsing to be inverses, but that is a hard problem in general… PADS uses meta-data to manage buggy or malicious data.
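The relationship between representation, parse descriptor, and mask can be illustrated with a toy parser for the three-digit `response` field. This is a minimal Python sketch; the class names, field names, and the two mask flags are assumptions chosen to echo the slide's wording, not the actual PADS data structures.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ParseDescriptor:
    """Meta-data about one parse: an error count plus (offset, message) pairs."""
    nerr: int = 0
    errors: List[Tuple[int, str]] = field(default_factory=list)

    def record(self, offset: int, message: str) -> None:
        self.nerr += 1
        self.errors.append((offset, message))

@dataclass
class Mask:
    """Hypothetical mask: toggles constraint checking and filling in the rep."""
    check: bool = True
    set_rep: bool = True

def parse_response(data: str, pos: int, mask: Mask) -> Tuple[Optional[int], ParseDescriptor, int]:
    """Parse a 3-digit response code; return (rep, parse descriptor, new position)."""
    pd = ParseDescriptor()
    text = data[pos:pos + 3]
    if re.fullmatch(r"[0-9]{3}", text) is None:
        # Syntactic error: nothing is consumed and no representation is built.
        pd.record(pos, "expected a 3-digit number")
        return None, pd, pos
    value = int(text)
    if mask.check and not 100 <= value < 600:
        # Semantic error: the value is still available, but flagged.
        pd.record(pos, "response code out of range")
    rep = value if mask.set_rep else None
    return rep, pd, pos + 3
```

In this sketch a caller that sees `pd.nerr == 0` after parsing with the default mask knows the value was both well-formed and in range, which is a small-scale version of the correctness property stated on the slide.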

  13. Leverage! Given a data description, the computer essentially understands the data. We can leverage that understanding to generate many tools beyond a parser. [Diagram: PADS data description → PADS Compiler → Parser, Formatter, Statistical Analysis Tools, XQuery integration, Translator to XML, Visualization Tools, …] Type-directed programming provides this leverage. For each base type, we have to specify the desired behavior. The compiler then lifts the behavior to all structured types. Type-directed programming allows generation of useful tools from descriptions.
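The "specify behavior per base type, then lift to structured types" idea can be shown with a small XML translator. This is a Python sketch under assumptions: the `Base`, `Struct`, and `Array` classes, the `PRINTERS` table, and the `elt` tag for array elements are hypothetical and only loosely model Pstruct and Parray.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple, Union
from xml.sax.saxutils import escape

@dataclass
class Base:
    name: str                       # e.g. "Puint8" or "Pstring"

@dataclass
class Struct:
    fields: List[Tuple[str, Any]]   # (field name, description) pairs

@dataclass
class Array:
    elem: Any                       # description of each element

Desc = Union[Base, Struct, Array]

# The only per-type work: how to print each base type.
PRINTERS: Dict[str, Callable[[Any], str]] = {
    "Puint8": str,
    "Pstring": escape,
}

def to_xml(desc: Desc, value: Any, tag: str) -> str:
    """Lift the base-type printers to structs and arrays by recursion on the description."""
    if isinstance(desc, Base):
        body = PRINTERS[desc.name](value)
    elif isinstance(desc, Struct):
        body = "".join(to_xml(fdesc, value[fname], fname) for fname, fdesc in desc.fields)
    elif isinstance(desc, Array):
        body = "".join(to_xml(desc.elem, item, "elt") for item in value)
    else:
        raise TypeError(f"unknown description: {desc!r}")
    return f"<{tag}>{body}</{tag}>"
```

Adding another tool, such as a statistical accumulator, would follow the same shape in this sketch: one small function per base type and one recursive traversal over the description.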

  14. Learning: Goals & Approach. Problem: Producing useful tools for ad hoc data takes a lot of time. Solution: A learning system to generate data descriptions and tools automatically. [Diagram labels: Raw Data: Email, ASCII log files, Binary Traces; Data Description: struct { ... }; Standard formats & schema: CSV, XML; End-user tools; Visual Information]

  15. Format Inference Overview [Diagram labels: Raw Data, Chunking Process, Tokenization, Structure Discovery, Format Refinement, Scoring Function, IR to PADS Printer, PADS Description, PADS Compiler, XMLifier, XML, Accumulator, Analysis Report]

  16. Possible Additional Material • PADS in More Depth: The language, the tools, the semantics. [PLDI 05, POPL 06, POPL 07, PADL 08] (long talk). • Format Inference: Basic algorithm, small demo, and experimental evaluation [POPL 08] (long talk). • In Progress: (short talk) • Improving format inference by learning tokenizations [PADL 09] • Taking steps towards making inference incremental. • Learning Demo: Perhaps better offline.

  17. Contributors • AT&T: Yitzhak Mandelbaum, Mary Fernandez, and Andrew Forest • Princeton: David Walker, Kenny Zhu, Qian Xi • Galois: Peter White and David Burke • Penn: Nate Foster and Michael Greenberg

  18. Motivation: Token Ambiguity Problem (TAP) • Given a string, there are multiple ways to tokenize it. • Example 1: 127.0.0.1 • IP • Float Dot Float • Int Dot Int Dot Int Dot Int Example 2: • Message • Word White Word White Word White... White URL • Word White Quote Filepath Quote White Word White...
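The ambiguity in Example 1 can be made concrete by enumerating every way to cover `127.0.0.1` with a small token set. The Python sketch below is illustrative only: the token names follow the slide, but the regular expressions are assumptions, and each token takes its maximal match at a position (so `Int` never splits `127` into smaller pieces).

```python
import re
from typing import List

# Hypothetical token definitions; the patterns are assumptions for illustration.
TOKENS = [
    ("IP", re.compile(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")),
    ("Float", re.compile(r"[0-9]+\.[0-9]+")),
    ("Int", re.compile(r"[0-9]+")),
    ("Dot", re.compile(r"\.")),
]

def tokenizations(s: str, pos: int = 0) -> List[List[str]]:
    """Return every sequence of token names that exactly covers s[pos:]."""
    if pos == len(s):
        return [[]]
    results = []
    for name, pattern in TOKENS:
        m = pattern.match(s, pos)
        if m is not None and m.end() > pos:
            for rest in tokenizations(s, m.end()):
                results.append([name] + rest)
    return results

if __name__ == "__main__":
    for names in tokenizations("127.0.0.1"):
        print(" ".join(names))
```

Under these particular patterns the three readings listed on the slide all appear, together with mixed ones such as `Int Dot Float Dot Int`.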

  19. How does learnPADS deal with TAP? • Tokenization Phase: • Take the first, longest match. • A fixed order is assigned by the end user. • We have no order to pick. [Figure labels: Float; Int, ID, Path] As a result, the current learning system: • can’t have ambiguous base tokens (Message, Text, ID). • sometimes produces descriptions that are too precise.
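One reading of "take the first, longest match" with a fixed token order is sketched below in Python. It is not the learnPADS tokenizer: the `tokenize` function and the token patterns are assumptions. At each position the longest match wins, and ties go to the token listed earlier.

```python
import re
from typing import List, Pattern, Tuple

def tokenize(s: str, tokens: List[Tuple[str, Pattern[str]]]) -> List[Tuple[str, str]]:
    """Greedy tokenizer: longest match wins, ties go to the earlier token."""
    out = []
    pos = 0
    while pos < len(s):
        best_len, best_name = 0, None
        for name, pattern in tokens:
            m = pattern.match(s, pos)
            # Strict '>' keeps the earlier token when two matches tie in length.
            if m is not None and m.end() - pos > best_len:
                best_len, best_name = m.end() - pos, name
        if best_name is None:
            raise ValueError(f"no token matches at offset {pos}")
        out.append((best_name, s[pos:pos + best_len]))
        pos += best_len
    return out

# Hypothetical token sets; the order is what an end user would have to fix.
INT = ("Int", re.compile(r"[0-9]+"))
ID = ("ID", re.compile(r"[A-Za-z0-9]+"))
FLOAT = ("Float", re.compile(r"[0-9]+\.[0-9]+"))
DOT = ("Dot", re.compile(r"\."))
```

With this sketch, `tokenize("42", [INT, ID])` labels the input `Int` while `tokenize("42", [ID, INT])` labels it `ID`, so the result depends entirely on an order that someone has to pick in advance.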

  20. Scaling to Larger Data Sets • Original algorithm keeps entire data set in memory, so won’t scale to large data sets. • Proposed conceptual architecture to permit incremental learning:
