
File Format Identification and Archival Processing

File Format Identification and Archival Processing. William Underwood, GTRI, Atlanta, Georgia. NARA Briefing, Washington, DC, February 6, 2009.


Presentation Transcript


  1. File Format Identification and Archival Processing. William Underwood, GTRI, Atlanta, Georgia. NARA Briefing, Washington, DC, February 6, 2009

  2. Overview: • Background • File Command: Magic Expressions • DROID: File Format Signature Expressions • Comparison: File Command/Magic and DROID/File Format Signatures • Summary

  3. Background – Projects: • Presidential Electronic Records PilOt System (PERPOS) (2001-2006) • Advanced Decision Support for Archival Processing of Presidential Electronic Records (2007-2009)

  4. Background: Electronic Records at the George H.W. Bush Presidential Library. One of the first presidential libraries to have electronic presidential records, particularly from hard drives: • Word Processing Files • Databases • Spreadsheets • Presentations • Email • Computer Programs • Scanned Paper Records

  5. Background: Where We Began. • The archival functions needed to process paper records are well understood. • We had few tools to identify, view or review electronic records in response to PRA/FOIA requests. • Tools initially needed: File Format Identification Tool; Viewers for Records in Legacy File Formats; Tool for Filtering OS and Office Applications Software from User-created Files; Tools for Converting Legacy to Current Formats; Tools to Support Redaction of E-records

  6. Background: Evolutionary Prototyping. Result: an integrated set of tools called PERPOS

  7. Background: Archival Activities Supported by PERPOS

  8. Contents of PC Hard Disk

  9. File Format Names

  10. Filter Contents of a Hard Drive

  11. OS and Software Application Files Blocked by Filter

  12. File Types of Passed Files

  13. Properties of Filtered Files

  14. OS/App Hash Code Filter

  15. National Software Reference Library

  16. NSRL Reference Data Set
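The filtering slides above survive only as titles, so the following is a minimal sketch of the general idea rather than the PERPOS implementation. It assumes the filter compares each file's SHA-1 digest with a set of digests of known operating-system and application files, such as one loaded from the NSRL Reference Data Set. The function names and the `known_hashes` parameter are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Return the uppercase hexadecimal SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def filter_files(root, known_hashes):
    """Split the files under root into (passed, blocked) lists.

    known_hashes is assumed to be a set of uppercase SHA-1 hex digests of
    known OS and application files. Files whose digest is in the set are
    blocked; everything else is passed on as a possible user-created file.
    """
    passed, blocked = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            if sha1_of(path) in known_hashes:
                blocked.append(path)
            else:
                passed.append(path)
    return passed, blocked
```

A set lookup keeps the per-file cost independent of the size of the reference data, which matters when the reference set holds millions of digests.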

  17. Viewers, Archive Extractors, Password Recovery, Decrypters, Converters, Repairers

  18. Magic File – Man Page

  19. Magic File – Man Page

  20. Magic File – Man Page

  21. Extensions of File Command and Magic File: • Magic for individual file formats • Output of file command/magic file is File Format ID • Rewriting file command code for identifying Characteristics of Text files and Document Types • Defined approx. 750 file format signatures • Collected examples of approx. 500 of the file format types • Created File Signature Database • Verified that magic file correctly identifies approx. 500 File Types

  22. GUI for File Type Identifier

  23. • File signatures for about 200 file formats that are currently defined in the DROID File Signature file only by file name extensions. Examples: Microsoft Outlook Personal Folders (97-2002), AIFF (Compressed), AutoCAD Design Web Format, Adobe FrameMaker Document, Applixware Spreadsheet, ChiWriter 3 Document • File signatures for about 300 file formats that probably should be included in the PRONOM Registry and DROID Signature File. Examples: MHTML Web Page Archive, Outlook Express E-mail Folder, Autodesk Revit Project, CATIA Model File V4, CATIA Drawing V5, ClarisWorks 3 Document, MacWrite 4.x Document, PDF/X-1a

  24. DROID – File Signature Expressions. In PRONOM, an internal signature is composed of one or more byte sequences, each comprising a continuous sequence of hexadecimal byte values and, optionally, regular expressions. A signature byte sequence is modelled by describing its starting position within a bitstream and its value. The starting position can be one of two basic types: • Absolute: the byte sequence starts at a fixed position within the bitstream. This position is described as an offset from either the beginning or the end of the bitstream. • Variable: the byte sequence can start at any offset within the bitstream. The byte sequence can be located by examining the entire bitstream.
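The two position types can be sketched as follows. This is a minimal illustration, not DROID's code: the function name and the position labels `"BOF"`, `"EOF"` and `"variable"` are assumptions made for the example, and the byte sequence is a plain hexadecimal string without wildcards.

```python
def matches(data, seq_hex, position, offset=0):
    """Check one wildcard-free byte sequence against a bitstream.

    position is "BOF" (absolute, offset counted from the beginning),
    "EOF" (absolute, offset counted back from the end), or
    "variable" (the sequence may start anywhere in the bitstream).
    """
    seq = bytes.fromhex(seq_hex)
    if position == "BOF":
        return data[offset:offset + len(seq)] == seq
    if position == "EOF":
        end = len(data) - offset   # index just past the sequence
        start = end - len(seq)
        return start >= 0 and data[start:end] == seq
    if position == "variable":
        return seq in data         # examines the entire bitstream
    raise ValueError(f"unknown position type: {position!r}")


sample = b"%PDF-1.4\nbody\n%%EOF"
print(matches(sample, "25504446", "BOF"))      # "%PDF" at the start
print(matches(sample, "2525454F46", "EOF"))    # "%%EOF" at the end
print(matches(sample, "626F6479", "variable")) # "body" somewhere inside
```

An absolute test inspects a fixed window of bytes, while a variable test may have to scan the whole file, which is why absolute positions are cheaper to check.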

  25. The value of the byte sequence is defined as a sequence of hexadecimal values, optionally incorporating any of the following regular expressions: • ??: wildcard matching any pair of hexadecimal values (i.e. a single byte). • *: wildcard matching any number of bytes (0 or more). • {n}: wildcard matching n bytes, where n is an integer. • {m-n}: wildcard matching between m and n bytes inclusive, where m and n are integers or ‘*’. • (a|b): wildcard matching one from a list of values (e.g. a or b), where each value is a hexadecimal byte sequence of arbitrary length containing no wildcards. • [a:b]: wildcard matching any sequence of bytes which lies lexicographically between a and b, inclusive (where both a and b are byte sequences of the same length, containing no wildcards, and where a is less than b). The endian-ness of a and b is the same as the endian-ness of the signature as a whole. • [!a]: wildcard matching any sequence of bytes other than a itself (where a is a byte sequence containing no wildcards). • [!a:b]: wildcard matching any sequence of bytes which does not lie lexicographically between a and b, inclusive (where a and b are both byte sequences of the same length, containing no wildcards, and where a is less than b).
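To make the wildcard syntax concrete, the sketch below translates a subset of it into a Python bytes regular expression. It is an illustration under stated assumptions, not how DROID evaluates signatures: `compile_signature` is a hypothetical helper, only `??`, `*`, `{n}`, `{m-n}` (with `*` allowed as the upper bound) and `(a|b)` are handled, and the bracket forms `[a:b]`, `[!a]` and `[!a:b]` are rejected as unsupported.

```python
import re

# One token of the supported subset of the signature syntax.
TOKEN = re.compile(
    r"(?P<byte>[0-9A-Fa-f]{2})"                  # literal byte, e.g. 4A
    r"|(?P<any>\?\?)"                            # ?? : any single byte
    r"|(?P<star>\*)"                             # *  : zero or more bytes
    r"|\{(?P<lo>\d+)(?:-(?P<hi>\d+|\*))?\}"      # {n} or {m-n}
    r"|\((?P<alts>[0-9A-Fa-f|]+)\)"              # (a|b) alternatives
)


def literal(hex_text):
    """Return a regex fragment matching exactly the given hex bytes."""
    return re.escape(bytes.fromhex(hex_text))


def compile_signature(sig):
    """Compile a signature string into a bytes regex (subset only)."""
    sig = sig.replace(" ", "")
    parts, pos = [], 0
    while pos < len(sig):
        m = TOKEN.match(sig, pos)
        if m is None:
            raise ValueError(f"unsupported syntax: {sig[pos:]!r}")
        if m.group("byte"):
            parts.append(literal(m.group("byte")))
        elif m.group("any"):
            parts.append(b".")
        elif m.group("star"):
            parts.append(b".*?")
        elif m.group("lo") is not None:
            lo, hi = m.group("lo").encode(), m.group("hi")
            if hi is None:
                parts.append(b".{" + lo + b"}")
            elif hi == "*":
                parts.append(b".{" + lo + b",}?")
            else:
                parts.append(b".{" + lo + b"," + hi.encode() + b"}?")
        else:
            options = [literal(a) for a in m.group("alts").split("|")]
            parts.append(b"(?:" + b"|".join(options) + b")")
        pos = m.end()
    # DOTALL lets "." match every byte value, including 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)
```

With this helper, `compile_signature("474946 38 (37|39) 61").match(data)` tests for a sequence anchored at the beginning of `data`, while `.search(data)` corresponds to a variable position.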

  26. DROID Applied to Sample Files

  27. Comparison of DROID and GTRI File Type Identifier Technologies. DROID: • Matches sequences of hex values at offsets • Regular expressions on hex values • Efficient substring search • Identifies all possible signatures and then selects the one of highest priority • Includes offsets from EOF. GTRI File Type Identifier: • Matches a variety of data types at offsets • Regular expressions on strings in lines • Less efficient substring search, but more indirect offsets increase efficiency • Preorders signatures and stops search when pattern matches • Lacks offsets from EOF
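The two selection strategies in the comparison can be contrasted with a small sketch. Everything here is hypothetical: the two example signatures, the numeric priorities (used only as a simple stand-in for whatever priority scheme a tool applies), and the function names are chosen for illustration.

```python
def is_zip(data):
    """Generic test: ZIP local file header at the beginning."""
    return data.startswith(b"PK\x03\x04")


def is_odt(data):
    """More specific test: a ZIP container carrying an ODF text mimetype."""
    marker = b"mimetypeapplication/vnd.oasis.opendocument.text"
    return is_zip(data) and marker in data[:100]


# Hypothetical signatures: (format name, priority, test function).
# A higher number means a more specific, preferred identification.
SIGNATURES = [
    ("ZIP archive", 1, is_zip),
    ("OpenDocument Text", 2, is_odt),
]


def identify_all_then_select(data):
    """Evaluate every signature, then keep the hit of highest priority."""
    hits = [(priority, name) for name, priority, test in SIGNATURES if test(data)]
    return max(hits)[1] if hits else None


# Preordering: most specific signatures first, so the first hit is final.
PREORDERED = sorted(SIGNATURES, key=lambda sig: sig[1], reverse=True)


def identify_first_match(data):
    """Walk the preordered signatures and stop at the first match."""
    for name, _priority, test in PREORDERED:
        if test(data):
            return name
    return None
```

With a consistent ordering both functions return the same answer. The difference is in the work done: the first always evaluates every signature, while the second can stop early but depends on the signatures having been ordered correctly in advance.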

  28. Summary. PERPOS File Format Resources: • File Format Signatures • File Format Specifications/Reverse Engineering Documents • Software • Viewers/players • Archive Extractors • Converters • Password Recovery & Decryption • Repairers • Sample Files. Research Issues: • File Signature Representation Languages • Metadata Extraction Languages • File Format Description Languages
