
An Introduction to Machine Learning with Perl



Presentation Transcript


  1. An Introduction to Machine Learning with Perl February 3, 2003 O’Reilly Bioinformatics Conference Ken Williams ken@mathforum.org

  2. Tutorial Overview • What is Machine Learning? (20’) • Why use Perl for ML? (15’) • Some theory (20’) • Some tools (30’) • Decision trees (20’) • SVMs (15’) • Categorization (40’)

  3. References & Sources • Machine Learning, Tom Mitchell. McGraw-Hill, 414 pp, 1997 • Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze. MIT Press, 680 pp, 1999 • Perl-AI list (perl-ai@perl.org)

  4. What Is Machine Learning? • A subfield of Artificial Intelligence (but without the baggage) • Usually concerns some particular task, not the building of a sentient robot • Concerns the design of systems that improve (or at least change) as they acquire knowledge or experience

  5. Typical ML Tasks • Clustering • Categorization • Recognition • Filtering • Game playing • Autonomous performance

  6. Typical ML Tasks • Clustering

  7. Typical ML Tasks • Categorization

  8. Typical ML Tasks • Recognition: Vincent Van Gogh, Michael Stipe, Muhammad Ali, Ken Williams, Burl Ives, Winston Churchill, Grover Cleveland

  9. Typical ML Tasks • Recognition: "Little red corvette", "The kids are all right", "The rain in Spain", "Bort bort bort"

  10. Typical ML Tasks • Filtering

  11. Typical ML Tasks • Game playing

  12. Typical ML Tasks • Autonomous performance

  13. Typical ML Buzzwords • Data Mining • Knowledge Management (KM) • Information Retrieval (IR) • Expert Systems • Topic detection and tracking

  14. Who does ML? • Two main groups: research and industry • These groups do listen to each other, at least somewhat • Not many reusable ML/KM components, outside of a few commercial systems • KM is seen as a key component of big business strategy - lots of KM consultants • ML is an extremely active research area with relatively low “cost of entry”

  15. When is ML useful? • When you have lots of data • When you can’t hire enough people, or when people are too slow • When you can afford to be wrong sometimes • When you need to find patterns • When you have nothing to lose

  16. An aside on your presenter • Academic background in math & music (not computer science or even statistics) • Several years as a Perl consultant • Two years as a math teacher • Currently studying document categorization at The University of Sydney • In other words, a typical ML student

  17. Why use Perl for ML? • CPAN - the viral solution™ • Perl has rapid reusability • Perl is widely deployed • Perl code can be written quickly • Embeds both ways • Human-oriented development • Leaves your options open

  18. But what about all the data? • ML techniques tend to use lots of data in complicated ways • Perl is great at data in general, but tends to gobble memory or forego strict checking • Two fine solutions exist: • Be as careful in Perl as you are in C (Params::Validate, Tie::SecureHash, etc.) • Use PDL or Inline (more on these later)

  19. Interfaces vs. Implementations • In ML applications, we need both data integrity and the ability to “play with it” • Perl wrappers around C/C++ structures/objects are a nice balance • Keeps high-level interfaces in Perl, low-level implementations in C/C++ • Can be prototyped in pure Perl, with C/C++ parts added later

  20. Some ML Theory and Terminology • ML concerns learning a target function from a set of examples • The target function is often called a hypothesis • Example: with a neural network, a trained network is a hypothesis • The set of all possible target functions is called the hypothesis space • The training process can be considered a search through the hypothesis space

  21. Some ML Theory and Terminology • Each ML technique will • probably exclude some hypotheses • prefer some hypotheses over others • A technique’s exclusion & preference rules are called its inductive bias • If it ain’t biased, it ain’t learnin’ • No bias = rote learning • Bias = generalization • Example: kids learning multiplication (understanding vs. memorization)

  22. Some ML Theory and Terminology • Ideally, an ML technique will • not exclude the “right” hypothesis, i.e. the hypothesis space will include the target hypothesis • prefer the target hypothesis over others • Measuring the degree to which these criteria are satisfied is important and sometimes complicated

  23. Evaluating Hypotheses • We often want to know how good a hypothesis is • To know how it performs in the real world • May be used to improve the learning technique or tune parameters • May be used by a learner to automatically improve the hypothesis • Usually evaluated on test data • Test data must be kept separate from training data • Test data used for the third purpose (automatic improvement) is usually called validation or held-out data • Training, validation, and test data should not contaminate each other
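Keeping those sets separate is easy to sketch in Perl. This is a hypothetical example - the contents of @examples and the 70/15/15 split ratios are invented for illustration:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical example set - in practice each element would hold
# features and a correct label
my @examples = map { { id => $_ } } 1 .. 100;

# Shuffle once, then carve off 70% training, 15% validation, 15% test
my @shuffled = shuffle(@examples);
my $n_train  = int(0.70 * @shuffled);
my $n_valid  = int(0.15 * @shuffled);

my @train = @shuffled[0 .. $n_train - 1];
my @valid = @shuffled[$n_train .. $n_train + $n_valid - 1];
my @test  = @shuffled[$n_train + $n_valid .. $#shuffled];
```

Splitting once, up front, is what keeps the three sets from contaminating each other.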

  24. Evaluating Hypotheses • Some standard statistical measures are useful • Error rate, accuracy, precision, recall, F1 • Calculated using contingency tables

  25. Evaluating Hypotheses • From a contingency table with a = assigned & correct, b = assigned & incorrect, c = not assigned but correct, d = not assigned & incorrect: • Error = (b+c)/(a+b+c+d) • Accuracy = (a+d)/(a+b+c+d) • Precision = p = a/(a+b) • Recall = r = a/(a+c) • F1 = 2pr/(p+r) • Precision is easy to maximize by assigning nothing • Recall is easy to maximize by assigning everything • F1 combines precision and recall equally
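These formulas are straightforward to compute directly. The contingency counts below are made up, chosen only so the resulting numbers roughly resemble the categorization example that follows:

```perl
use strict;
use warnings;

# Hypothetical contingency counts:
#   a = assigned & correct,       b = assigned & incorrect,
#   c = not assigned but correct, d = not assigned & incorrect
my ($a, $b, $c, $d) = (40, 7, 16, 937);

my $error     = ($b + $c) / ($a + $b + $c + $d);
my $accuracy  = ($a + $d) / ($a + $b + $c + $d);
my $precision = $a / ($a + $b);
my $recall    = $a / ($a + $c);
my $f1        = 2 * $precision * $recall / ($precision + $recall);

printf "P=%.3f R=%.3f F1=%.3f\n", $precision, $recall, $f1;
# prints P=0.851 R=0.714 F1=0.777
```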

  26. Evaluating Hypotheses • Example (from categorization): Precision = 0.851, Recall = 0.711, F1 = 0.775 • Note that precision is higher than recall - indicates a cautious categorizer • These scores depend on the task - can’t compare scores across tasks • Often useful to compute scores for each category separately, then average (macro-averaging)

  27. Evaluating Hypotheses • The Statistics::Contingency module (on CPAN) helps calculate these figures:

use Statistics::Contingency;
my $s = new Statistics::Contingency;
while (...) {
  # ... do some categorization ...
  $s->add_result($assigned, $correct);
}
print "Micro F1: ", $s->micro_F1, "\n";
print $s->stats_table;

Output:

Micro F1: 0.774803607797498
+---------------------------------------------+
|   miR   miP  miF1   maR   maP  maF1   Err   |
| 0.243 0.843 0.275 0.711 0.851 0.775 0.006   |
+---------------------------------------------+

  28. Useful Perl Data-Munging Tools • Storable - cheap persistence and cloning • PDL - helps performance and design • Inline::C - tight loops and interfaces

  29. Storable • One of many persistence classes for Perl data (Data::Dumper, YAML, Data::Denter) • Allows saving structures to disk:

store($x, $filename);
$x = retrieve($filename);

• Allows cloning of structures:

$y = dclone($x);

• Not terribly interesting, but handy
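A complete round trip looks like the sketch below. The data structure is invented, and File::Temp is used here only to get a scratch file for the demonstration:

```perl
use strict;
use warnings;
use Storable qw(store retrieve dclone);
use File::Temp qw(tempfile);

my $x = { weights => [0.5, 1.2, -0.3], label => 'spam' };

# Save the structure to disk and read it back
my (undef, $filename) = tempfile(UNLINK => 1);
store($x, $filename);
my $copy = retrieve($filename);

# Deep-clone in memory - changes to the clone don't touch the original
my $y = dclone($x);
$y->{weights}[0] = 99;

print $copy->{label}, "\n";    # prints "spam"
print $x->{weights}[0], "\n";  # still 0.5
```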

  30. PDL • Perl Data Language • On CPAN, of course (PDL-2.3.4.tar.gz) • Turns Perl into a data-processing language similar to Matlab • Native C/Fortran numerical handling • Compact multi-dimensional arrays • Still Perl at highest level

  31. PDL demo • PDL experimentation shell:

ken% perldl
perldl> demo pdl

  32. Extending PDL • PDL has an extension language, PDL::PP • Lets you write C extensions to PDL • Handles many gory details (data types, loop indexes, “threading”)

  33. Extending PDL • Example: $sum = $pdl->sum_elements;

# Usage:
$pdl = PDL->random(7);
print "PDL: $pdl\n";
$sum = $pdl->sum_elements;
print "Sum: $sum\n";

# Output:
PDL: [0.513 0.175 0.308 0.534 0.947 0.171 0.702]
Sum: [3.35]

  34. Extending PDL

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
double tmp;
tmp = 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);


  37. Extending PDL • The same code, generalized over PDL’s data types with $GENERIC():

pp_def('sum_elements',
  Pars => 'a(n); [o]b();',
  Code => <<'EOF',
$GENERIC() tmp;
tmp = ($GENERIC()) 0;
loop(n) %{
  tmp += $a();
%}
$b() = tmp;
EOF
);

  38. Inline::C • Allows very easy embedding of C code in Perl modules • Also Inline::Java, Inline::Python, Inline::CPP, Inline::ASM, Inline::Tcl • Considered much easier than XS or SWIG • Developers are very enthusiastic and helpful

  39. Inline::C basic syntax • A complete Perl script using Inline (taken from the Inline docs):

#!/usr/bin/perl
greet();
use Inline C => q{
  void greet() {
    printf("Hello, world\n");
  }
};

  40. Inline::C for writing functions • Find the next prime number greater than $x

#!/usr/bin/perl
foreach (-2.7, 29, 30.33, 100_000) {
  print "$_: ", next_prime($_), "\n";
}
. . .

  41. Inline::C for writing functions

use Inline C => q{
  int next_prime(double in) {
    // Implements a Sieve of Eratosthenes
    int *is_prime;
    int i, j;
    int candidate = ceil(in);
    if (in < 2.0) return 2;
    is_prime = malloc(2 * candidate * sizeof(int));
    for (i = 0; i < 2*candidate; i++)
      is_prime[i] = 1;
. . .

  42. Inline::C for writing functions

    for (i = 2; i < 2*candidate; i++) {
      if (!is_prime[i]) continue;
      if (i >= candidate) {
        free(is_prime);
        return i;
      }
      for (j = i; j < 2*candidate; j += i)
        is_prime[j] = 0;
    }
    return 0; // Should never get here
  }
};

  43. Inline::C for wrapping libraries • We’ll create a wrapper for ‘libbow’, an IR package • Contains an implementation of the Porter word-stemming algorithm (i.e., the stem of 'trying' is 'try')

# A Perlish interface:
$stem = stem_porter($word);

# A C-like interface:
stem_porter_inplace($word);

  44. Inline::C for wrapping libraries

package Bow::Inline;
use strict;
use Exporter;
use vars qw($VERSION @ISA @EXPORT_OK);
BEGIN { $VERSION = '0.01'; }
@ISA = qw(Exporter);
@EXPORT_OK = qw(stem_porter stem_porter_inplace);
. . .

  45. Inline::C for wrapping libraries

use Inline (C => 'DATA',
            VERSION => $VERSION,
            NAME    => __PACKAGE__,
            LIBS    => '-L/tmp/bow/lib -lbow',
            INC     => '-I/tmp/bow/include',
            CCFLAGS => '-no-cpp-precomp',
           );
1;
__DATA__
__C__
. . .

  46. Inline::C for wrapping libraries

// libbow includes bow_stem_porter()
#include "bow/libbow.h"

// The bare-bones C interface exposed
int stem_porter_inplace(SV* word) {
  int retval;
  char* ptr = SvPV_nolen(word);
  retval = bow_stem_porter(ptr);
  SvCUR_set(word, strlen(ptr));
  return retval;
}
. . .

  47. Inline::C for wrapping libraries

// A Perlish interface - return an SV* so we can hand back undef
SV* stem_porter (char* word) {
  if (!bow_stem_porter(word))
    return &PL_sv_undef;
  return newSVpv(word, 0);
}

// Don't know what the hell these are for in libbow,
// but it needs them.
const char *argp_program_version = "foo 1.0";
const char *program_invocation_short_name = "foofy";

  48. When to use speed tools • A word of caution - don’t use C or PDL before you need to • Plain Perl is great for most tasks and usually pretty fast • Remember - external libraries (like libbow, pari-gp) both solve problems and create headaches
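As a case in point, the next_prime() function from the Inline::C section can be written in plain Perl in about the same number of lines. This sketch mirrors the C version (returning the first prime ≥ ceil($x)) and is plenty fast for modest inputs:

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Sieve of Eratosthenes, same idea as the Inline::C version
sub next_prime {
    my $in = shift;
    return 2 if $in < 2;
    my $candidate = ceil($in);
    my $limit     = 2 * $candidate;
    my @is_prime  = (1) x $limit;
    for my $i (2 .. $limit - 1) {
        next unless $is_prime[$i];
        return $i if $i >= $candidate;
        for (my $j = $i; $j < $limit; $j += $i) {
            $is_prime[$j] = 0;
        }
    }
    return 0;    # never reached - there is always a prime below 2*candidate
}

print next_prime(30.33), "\n";   # prints 31
```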

  49. Decision Trees • Conceptually simple • Fast evaluation • Scrutable structures • Can be learned from training data • Can be difficult to build • Can “overfit” training data • Usually prefer simpler, i.e. smaller trees
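A decision tree maps naturally onto nested Perl hashes, which is part of what makes trees scrutable and fast to evaluate. This hypothetical sketch (attribute names and labels invented, in the spirit of Mitchell's "play tennis" example) classifies an example by walking from root to leaf:

```perl
use strict;
use warnings;

# A hypothetical tree: internal nodes test an attribute, leaves hold a label
my $tree = {
    attribute => 'outlook',
    branches  => {
        sunny => {
            attribute => 'humidity',
            branches  => {
                high   => { label => 'no' },
                normal => { label => 'yes' },
            },
        },
        overcast => { label => 'yes' },
        rain     => { label => 'no' },
    },
};

# Walk from the root until we hit a leaf
sub classify {
    my ($node, $example) = @_;
    return $node->{label} if exists $node->{label};
    my $value = $example->{ $node->{attribute} };
    return classify($node->{branches}{$value}, $example);
}

print classify($tree, { outlook => 'sunny', humidity => 'normal' }), "\n";
# prints "yes"
```

Learning such a tree from data (e.g. with ID3/C4.5-style splitting) is the hard part; evaluation is just this walk.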

  50. Decision Trees • Sample training data: (table not preserved in this transcript)
