XML Compression Techniques: Survey and Comparison

XML Compression Techniques: Survey and Comparison. Angela McCarthy CP5080, SP1 2010. Overview. Received: 14 August 2008 Revised: 13 November 2008 Written by Sherif Sakr of University of New South Wales, Australia


Presentation Transcript


  1. XML Compression Techniques: Survey and Comparison Angela McCarthy CP5080, SP1 2010

  2. Overview • Received: 14 August 2008 • Revised: 13 November 2008 • Written by Sherif Sakr of the University of New South Wales, Australia • eXtensible Markup Language (XML) is a standard for data representation over the World Wide Web • XML documents tend to be large, so compression was introduced to deal with the resulting issues • Paper provides a survey of compression techniques

  3. Introduction • Author examines XML compression techniques and launches a study • Surveys each of the different compression techniques and compares the advantages and disadvantages of each • Data transmitted online is rather large • XML usage is growing, thus a demand for efficient XML compression tools exists

  4. Introduction • Contributions made: • Comprehensive survey of XML compression techniques • A rich XML corpus collected and constructed • Contains wide variety of XML data sources, natures and document sizes • Detailed results examining performance and characteristics • Work repeatable • Webpage of study provides access to test files, examined XML compressors and detailed results of study

  5. Classifications • Each section goes through each of the classifications of compressors • General Text Compressors • Treats XML as plain text, uses traditional text compression techniques • XML Conscious Compressors • Takes advantage of awareness of XML files • Uses document structure to achieve better compression rates

  6. Classifications • Non-Queriable (Archival) XML Compressors • No queries can be processed over the compressed format • Focus is on achieving the highest compression ratio • Queriable XML Compressors • Queries can be processed over the compressed format • Compression ratio is actually worse than that of archival XML compressors • Focus is on avoiding full document decompression during query execution

  7. Compressor Characteristics

  8. XML Data Sets

  9. XML Testing Corpus • Large variety of data sets (see previous) • From 0.5MB to 1.3GB • Four Categories • Structural Documents • Textual Documents • Regular Documents • Irregular Documents • Testing Environments • To ensure consistency, two different environments were used, high vs. low

  10. Performance Metrics • Performance metrics measured and compared • Compression Ratio • Ratio between the sizes of the compressed and uncompressed documents • Compression Ratio = (Compressed Size)/(Uncompressed Size) • Compression Time • Elapsed time during the compression process • Decompression Time • Elapsed time during the decompression process • For all three metrics, the lower the value, the better the compressor
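The three metrics on this slide can be sketched in a few lines of Python. This is only an illustration, not the study's test harness: Python's standard `gzip` and `bz2` modules stand in for the command-line tools, and `SAMPLE_XML` and the `measure` helper are hypothetical names introduced here.

```python
import bz2
import gzip
import time

# Hypothetical sample document (not from the study's corpus).
SAMPLE_XML = (
    b"<catalog>"
    + b"<book><title>XML</title><price>10</price></book>" * 2000
    + b"</catalog>"
)


def measure(name, compress, decompress, data):
    """Return the slide's three metrics for one compressor on one document."""
    start = time.perf_counter()
    packed = compress(data)
    compression_time = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompression_time = time.perf_counter() - start

    if restored != data:
        raise ValueError(f"{name} did not round-trip the document")

    return {
        "name": name,
        # Compression Ratio = (Compressed Size) / (Uncompressed Size)
        "CR": len(packed) / len(data),
        "CT": compression_time,     # seconds
        "DCT": decompression_time,  # seconds
    }


results = [
    measure("gzip", gzip.compress, gzip.decompress, SAMPLE_XML),
    measure("bzip2", bz2.compress, bz2.decompress, SAMPLE_XML),
]

for r in results:
    print(f"{r['name']}: CR={r['CR']:.4f} CT={r['CT']:.6f}s DCT={r['DCT']:.6f}s")
```

Timings from a single run on a small in-memory document are noisy; a real comparison would repeat each measurement and use documents of realistic size.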

  11. Framework • 11 XML Compressors Evaluated • Three general purpose text compressors • Gzip, bzip2, PPM • Eight XML conscious compressors • XMillGzip, XMillBzip, XMillPPM, XMLPPM, SCMPPM, XWRT, AXECHOP • Compressors evaluated under default settings • Additional experiments run with parameters tuned for the highest level of compression • In total, 16 variant compressors
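The difference between a default run and a run tuned for the highest compression level can be illustrated with Python's standard `zlib` and `bz2` modules. This is a sketch under assumptions: the levels chosen here (6 and 9 for zlib, 1 and 9 for bz2) only illustrate the idea of a tuned variant and are not the settings used in the study, and `SAMPLE_XML` and `sizes_by_level` are hypothetical names.

```python
import bz2
import zlib

# Hypothetical sample document (not from the study's corpus).
SAMPLE_XML = (
    b"<orders>"
    + b"".join(
        b"<order id='%d'><item>widget</item><qty>%d</qty></order>" % (i, i % 7)
        for i in range(3000)
    )
    + b"</orders>"
)


def sizes_by_level(data):
    """Compressed size in bytes for a lower and a maximum compression level."""
    return {
        "zlib level 6": len(zlib.compress(data, 6)),
        "zlib level 9": len(zlib.compress(data, 9)),
        "bz2 level 1": len(bz2.compress(data, 1)),
        "bz2 level 9": len(bz2.compress(data, 9)),
    }


sizes = sizes_by_level(SAMPLE_XML)
for variant, size in sizes.items():
    print(f"{variant}: {size} bytes, CR={size / len(SAMPLE_XML):.4f}")
```

Counting each tuned configuration as its own variant is how a fixed set of tools can yield a larger number of evaluated compressors.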

  12. Results • Ideally want to provide a global ranking of XML compression tools • Results show there is no clear winner • Ranking is dependent upon the weight given to each metric • Three ranking functions • – WF1 = (1/3 ∗ CR)+(1/3 ∗ CT)+(1/3 ∗ DCT) • – WF2 = (1/2 ∗ CR)+(1/4 ∗ CT)+(1/4 ∗ DCT) • – WF3 = (3/5 ∗ CR)+(1/5 ∗ CT)+(1/5 ∗ DCT) • CR represents the compression ratio metric, CT represents the compression time metric and DCT represents the decompression time metric
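A short Python sketch shows how the three ranking functions can disagree. The metric values below are invented for illustration and assume CR, CT and DCT have already been normalised to a comparable scale; the slides do not say how the study normalised them. The names `compressor_a`, `compressor_b`, `score` and `rank` are hypothetical.

```python
# Weights (CR, CT, DCT) taken from the three ranking functions on the slide.
WEIGHTS = {
    "WF1": (1 / 3, 1 / 3, 1 / 3),
    "WF2": (1 / 2, 1 / 4, 1 / 4),
    "WF3": (3 / 5, 1 / 5, 1 / 5),
}

# Hypothetical, pre-normalised metric values (CR, CT, DCT); lower is better.
METRICS = {
    "compressor_a": (0.10, 0.50, 0.50),  # strong ratio, slow
    "compressor_b": (0.40, 0.25, 0.25),  # weaker ratio, fast
}


def score(cr, ct, dct, weights):
    """Weighted sum of the three metrics; lower scores rank higher."""
    w_cr, w_ct, w_dct = weights
    return w_cr * cr + w_ct * ct + w_dct * dct


def rank(metrics, weights):
    """Return compressor names ordered from best (lowest score) to worst."""
    return sorted(metrics, key=lambda name: score(*metrics[name], weights))


for function_name, weights in WEIGHTS.items():
    print(function_name, rank(METRICS, weights))
```

With these invented numbers the faster compressor ranks first under WF1, while the one with the better ratio ranks first under WF2 and WF3, which is the sense in which the ranking depends on the weight of each metric.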

  13. Compression Ratio

  14. Compression Time

  15. Decompression Time

  16. Conclusions • Paper surveyed state-of-the-art XML compression techniques • Reported the behaviour of various XML compressors using a large corpus of XML documents • Paper could be valuable for • Developers of new XML compression tools • Users making a decision on the most suitable compressor for their requirements • Fig. 7 shows that none of the XML conscious compressors achieved an outstanding compression ratio

  17. Average Compression Ratios

  18. Future Work • Planning to continue maintaining and updating the webpage of the study with further evaluations • Enable visitors to perform online experiments using the set of available compressors and their own XML documents

  19. Metadata • Large number of references • Due to the different compression techniques used • Large amount of data • Thorough in research methods • Large amount of data tested • Tested on different systems • Tested using different techniques • Abbreviations/Acronyms given • Designed for a specific audience • Paper seems to be a reference tool • Users can read it to help decide which compression tool to use

  20. Questions? Thanks for listening!
