
Clustering by Compression




Presentation Transcript


  1. Clustering by Compression Rudi Cilibrasi (CWI), Paul Vitanyi (CWI/UvA)

  2. Overview • Input to the software is a set of files • Output is a hierarchical clustering shown as an unrooted binary tree • This is a case of unsupervised learning • (example follows)

  3. Process Overview • 1. File translations, if necessary, for example from MIDI to “player-piano” type format. • 2. Calculation of Normalized Compression Distance, or NCD. • 3. Representation as an unrooted binary tree.

  4. What’s Unique? • This clustering system is unique in that it can be described as feature-free • There are no parameters to tune, and no domain-specific knowledge went into it. • Using general-purpose data compressors gives us a parameterized family of features automatically for each domain

  5. Featureless Clustering • Having no parameters and no customized features makes it convenient to develop as well as to use • Since it is based on information-theoretic foundations, it tends to be less brittle than other methods that make considerably more domain-specific assumptions • So how does it work?

  6. Midi Translation • In order to restrict the information entering the algorithm, we remove undesirable MIDI fields such as artist or composer name, headers, and other non-musical data. • We keep only the basic MIDI-track decomposition as well as note timing and duration events. We throw away individual note volume.
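As a rough illustration of this stripping step (a minimal sketch, not the authors' original tooling), the translation might look like the following in Python, assuming the third-party mido library; the input file name is hypothetical.

```python
# Minimal sketch of the MIDI translation step, assuming the third-party
# `mido` library; "song.mid" is a hypothetical file name.
import mido

def midi_to_text(path: str) -> str:
    """Keep only the track decomposition and note on/off timing;
    drop meta data (artist, composer, headers) and note volume."""
    mid = mido.MidiFile(path)
    lines = []
    for track_no, track in enumerate(mid.tracks):
        t = 0  # absolute time in ticks
        for msg in track:
            t += msg.time  # accumulate delta times, even for skipped messages
            if msg.is_meta:
                continue  # discard composer names, headers, and other non-musical data
            if msg.type in ("note_on", "note_off"):
                # keep track number, note number, and timing; ignore velocity (volume)
                lines.append(f"{track_no} {msg.type} {msg.note} {t}")
    return "\n".join(lines)
```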

  7. Gene sequence translation • Genetic sequences are represented in ASCII over a four-letter alphabet: A, T, G, C • Almost no translation is needed at all

  8. Image Translation • Black and white images are converted to ASCII using spaces for black and # for white • Newlines are used to separate rows
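A minimal sketch of this conversion, assuming the Pillow imaging library and an already thresholded black-and-white input (both are assumptions, not the authors' original code):

```python
# Minimal sketch of the image translation step, assuming Pillow;
# "image.png" would be a hypothetical black-and-white input.
from PIL import Image

def image_to_ascii(path: str) -> str:
    """Spaces for black pixels, '#' for white pixels, one image row per line."""
    img = Image.open(path).convert("1")  # 1-bit pixels: 0 = black, 255 = white
    width, height = img.size
    rows = []
    for y in range(height):
        rows.append("".join(" " if img.getpixel((x, y)) == 0 else "#"
                            for x in range(width)))
    return "\n".join(rows)
```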

  9. NCD • Once a group of songs has been acquired and translated, a quantity is computed on each pair in the group • Normalized Compression Distance measures how different two files are from one another.

  10. NCD • NCD is based on an earlier idea called Normalized Information Distance. • NID uses as its compressor a mathematical abstraction called Kolmogorov Complexity, often abbreviated K. • K represents a perfect data compressor, and is therefore uncomputable.
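For reference, the standard definition of NID from the literature, stated in terms of the (conditional) Kolmogorov complexity K:

$$\mathrm{NID}(x,y) = \frac{\max\{K(x \mid y),\ K(y \mid x)\}}{\max\{K(x),\ K(y)\}}$$

Because K is uncomputable, so is NID, which is what motivates the compressor-based approximation on the following slides.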

  11. NCD • Since we cannot compute K, we approximate it using real general-purpose file-compressors like gzip, bzip2, winzip, ppmz, and others • NCD depends on a particular compressor and NCD with different compressors may give different results for the same pair of objects

  12. NCD • C(x) means “the compressed size of x” • C(xy) means “the compressed size of the concatenation of x and y” • 0 <= NCD(x,y) <= 1 (roughly)
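Concretely, with C a real compressor, the paper defines NCD(x,y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}. A minimal sketch in Python, using bzip2 as one of the interchangeable general-purpose compressors (the choice of bz2 here is illustrative):

```python
# Minimal sketch of NCD, using Python's built-in bz2 as the stand-in compressor.
import bz2

def C(data: bytes) -> int:
    """Compressed size of a byte string, in bytes."""
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: roughly 0 for identical inputs,
    roughly 1 for completely unrelated inputs."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```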

  13. NCD • NCD measures how similar or different two strings (or, equivalently, files) are. • NCD(x,x) = 0, because nothing is different from itself • NCD(x,y) = 1 means that x and y are completely unrelated • In real cases, values usually fall strictly between these extremes

  14. NCD • Computing the NCD of every song with every other song yields a 2-dimensional symmetric distance matrix • The next step is transforming this matrix of distances into something easier to grasp • We use the Quartet Method to construct an unrooted binary tree from the NCD matrix
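A minimal sketch of this step, reusing the ncd() helper sketched after slide 12; the list of translated objects is hypothetical.

```python
# Minimal sketch of the pairwise distance matrix; `dist` would be the ncd()
# helper from the earlier sketch, and `objs` a list of translated files.
from itertools import combinations
from typing import Callable, List

def distance_matrix(objs: List[bytes],
                    dist: Callable[[bytes, bytes], float]) -> List[List[float]]:
    """Symmetric matrix of pairwise distances, with zeros on the diagonal."""
    n = len(objs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = dist(objs[i], objs[j])
    return d

# e.g. matrix = distance_matrix(translated_songs, ncd)
```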

  15. Quartet Method • Our algorithm is a slight enhancement of the standard quartet method of tree reconstruction, popular for the last 30 years • The input is a matrix of distances (NCD) • The output is an unrooted binary tree topology where each song is at a leaf and each non-leaf node has exactly three connections. • The tree is just one visualization of the NCD matrix
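The full quartet-tree search is too involved for a short sketch, but its core scoring ingredient can be illustrated. Following the cost convention used in the Clustering by Compression paper, the quartet topology ij|kl costs d(i,j) + d(k,l), and the tree search favors trees whose embedded quartets are cheap; the helper below is an illustrative sketch, not the authors' implementation.

```python
# Minimal sketch of quartet scoring: for four leaves there are exactly three
# possible pairings, each with cost d(a,b) + d(c,d) for the pairing ab|cd.
from typing import List, Tuple

def best_quartet_topology(d: List[List[float]], i: int, j: int, k: int,
                          l: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return the cheapest of the three pairings of leaves i, j, k, l."""
    costs = {
        ((i, j), (k, l)): d[i][j] + d[k][l],
        ((i, k), (j, l)): d[i][k] + d[j][l],
        ((i, l), (j, k)): d[i][l] + d[j][k],
    }
    return min(costs, key=costs.get)
```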

  16. Newer developments • Since the original Algorithmic Clustering of Music paper, we have further developed the underlying mathematical formalisms upon which the method is based, in a new paper, Clustering by Compression • We’ve included experiments from many other areas: biology, astronomy, images…

  17. Current and future work • This year, we’ve begun experimenting with automatic conversion from .mp3 (and most other audio formats) to MIDI. This enables us to participate in new emerging spaces • We’re investigating alternatives for all stages of this process, to try to understand more about this apparently general machine learning algorithm

  18. New directions • Combination of NCD and Support Vector Machine (SVM) learning to provide scalable generalization in a wide class of domains, both musical and otherwise • Application of our techniques to real outstanding questions within the musical community

  19. Contact and more info • Related papers and information: http://www.cwi.nl/~cilibrar • Software: http://complearn.sourceforge.net/ • Rudi.Cilibrasi@cwi.nl • Paul.Vitanyi@cwi.nl • Ronald.de.Wolf@cwi.nl
