Kleene Would Be Shocked: Redrawing the Link Between Theory and Modern Regex Engines • A Presentation by Ian Graham • Carnegie Mellon University • August 2, 2002
The March of Progress • 1. Literal string search (exact substring) • 2. Extended string search (character classes) • 3. Regular expression matching • 4. Approximate matching • 5. “Extended” regular expression matching
Begin at the Beginning • The simplest case of a regular expression is a literal string search • Literal—any symbol in the alphabet • Literal string—a concatenation of literals • Literal string search—the problem of finding all occurrences of one literal string within another literal string (find “cad” in “abracadabra”)
Quick Review • Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM) • Two classical literal string search algorithms • About 25 years old • Used to achieve O(m+n) search performance, where m is the length of the search pattern and n is the length of the text to be searched
Quick Review • KMP scans from left to right, shifting by aligning the longest prefix of the search pattern which matches a suffix of the text scanned • BM scans from right to left along a window that shifts from left to right by choosing the largest shift amount from multiple shift rules
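The KMP idea on this slide (shift by the longest pattern prefix that is also a suffix of what has been scanned) can be sketched as follows. This is a minimal textbook-style sketch in Python, not code from the presentation; the names `build_failure` and `kmp_search` are made up for illustration.

```python
def build_failure(pattern):
    """failure[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]          # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    matches = []
    k = 0                               # number of pattern characters matched so far
    for i, ch in enumerate(text):       # the text is scanned left to right, once
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]          # keep going to find overlapping matches
    return matches
```

With the example from the earlier slide, `kmp_search("cad", "abracadabra")` reports the single occurrence starting at index 4.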
Practical Developments • For any alphabet size, there is always an algorithm which achieves better experimental results than KMP or BM. • The Horspool algorithm (1980) simplifies BM, using only the bad character shift rule instead of calculating multiple shift amounts and using the best
Practical Developments • Horspool is O(m+n) in the average case (assuming equal probability of all alphabet characters), O(mn) in the worst case • BM is O(m+n) average, O(m+n) worst • Evaluating multiple shift rules for BM greatly increases its runtime constant • Horspool is much faster in practice, and is extremely hard for any algorithm to beat over large alphabets
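The single bad-character rule that Horspool keeps from BM can be sketched as below. This is an illustrative Python sketch under simplifying assumptions: the function name `horspool_search` is hypothetical, and the window is compared with a slice equality test for brevity rather than character by character from right to left.

```python
def horspool_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character in pattern[:-1], the distance from its last
    # occurrence to the end of the pattern. Later positions overwrite
    # earlier ones, so the smallest safe shift is kept.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Shift by the rule for the text character under the last window
        # position; characters absent from pattern[:-1] allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

The full shift of m for characters that do not occur in the pattern is what makes the approach attractive over large alphabets, where most text characters fall into that case.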
Bit-Parallelism • Recent algorithms (1992~) create nondeterministic automata to keep track of each possible match along the length of the pattern • States of these NFAs are mapped to bits in a word, and transitions are simulated utilizing the parallelism of bitwise operations
Bit-Parallelism • Possible matches may be represented by “1”s, and proceed in parallel along the pattern until they reach the end, indicating a match
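One well-known instance of this idea is the Shift-And algorithm; the sketch below is an illustration of it in Python, not necessarily the specific algorithm the slides have in mind. The name `shift_and_search` is made up here, and because Python integers are unbounded the sketch does not model the machine word-size limit discussed on the following slides.

```python
def shift_and_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)               # bit that signals a complete match
    state = 0                           # bit i set: pattern[:i+1] matches ending here
    matches = []
    for j, c in enumerate(text):
        # Advance every partial match by one position, start a new one at
        # bit 0, then keep only those consistent with the current character.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            matches.append(j - m + 1)
    return matches
```

Each text character costs one shift, one OR and one AND regardless of how many partial matches are alive, which is where the parallelism comes from.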
Bit-Parallelism • Savings due to parallelism depend on the word size • Bit-parallel algorithms often only perform well for patterns of size near to or less than the word size
Bit-Parallelism • Most analysis assumes constant word size, either 32 or 64 bits • Savings under this assumption are constant, but result in extremely good performance for practical applications
A Wrench • Let a “character class” be an item which matches a single character from a range or explicit list. • Examples • [0-9] matches any digit • [Aa] matches A or a • [A-Za-z] matches any English letter
A Wrench • Let an “extended string” be a literal string with the additional property that it may contain character classes in place of literals. • Examples: • “abc[de]f” matches “abcdf” or “abcef” • “[Aa][Nn][Ee][Uu][Rr][Ii][Ss][Mm]” matches “aneurism”, “ANEURISM”, “aNeUrIsM”…
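Extended strings fit naturally into a bit-parallel search: a class simply sets the same pattern bit for every character it contains. The sketch below is a hedged illustration in Python. The helper names `parse_extended` and `extended_search` are invented for this example, and the tiny parser only understands literals, explicit lists and ranges such as `[a-z]` (no negation or escapes).

```python
def parse_extended(pattern):
    """Turn 'abc[de]f' into a list of character sets, one per position."""
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            body = pattern[i + 1:end]
            chars = set()
            k = 0
            while k < len(body):
                if k + 2 < len(body) and body[k + 1] == "-":
                    # A range such as 0-9 or A-Z.
                    lo, hi = ord(body[k]), ord(body[k + 2])
                    chars.update(chr(c) for c in range(lo, hi + 1))
                    k += 3
                else:
                    chars.add(body[k])
                    k += 1
            positions.append(chars)
            i = end + 1
        else:
            positions.append({pattern[i]})
            i += 1
    return positions


def extended_search(pattern, text):
    """Return the start index of every match of the extended string in text."""
    positions = parse_extended(pattern)
    m = len(positions)
    if m == 0:
        return []
    masks = {}
    for i, chars in enumerate(positions):
        for c in chars:                 # every member of a class sets bit i
            masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    state, starts = 0, []
    for j, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            starts.append(j - m + 1)
    return starts
```

Only the preprocessing changes; the scanning loop is identical to the one used for literal strings, so classes add no per-character cost in this formulation.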
A Wrench • Moving from literal string searches to extended string searches confounds many algorithms • Horspool may be extended, but its performance suffers greatly • Boyer-Moore may also be extended, and performs better than other well-known extensions
Bit-Parallelism on Top? • A recent (Navarro and Raffinot, 1998) bit-parallel algorithm claims to be 10-40% faster than any known variant of BM • Appears to be the fastest algorithm given: • moderate-sized alphabet (e.g. English) • moderate pattern sizes (5-110 characters)
What is a Regular Expression? • Says Stephen Kleene: • “A notation to describe regular languages.” • “A description of the behavior of a finite state machine.” • “Regular.”
A Familiar Definition • 1. a for some a in the alphabet Σ • 2. ε • 3. the null language • 4. R1 ∪ R2 (R1, R2 regular languages) • 5. R1 ◦ R2 (R1, R2 regular languages) • 6. R1* (R1 regular language)
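The six cases of the definition map directly onto a small data type. The sketch below is an illustration in Python, with class and function names (`Lit`, `Eps`, `Empty`, `Union`, `Concat`, `Star`, `ends`, `matches`) chosen here for the example. The matcher tracks the set of text positions reachable after each sub-expression, a simple set-based simulation and not an optimized engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:            # 1. a single symbol a
    char: str

@dataclass(frozen=True)
class Eps:            # 2. the empty string
    pass

@dataclass(frozen=True)
class Empty:          # 3. the null language
    pass

@dataclass(frozen=True)
class Union:          # 4. R1 ∪ R2
    left: object
    right: object

@dataclass(frozen=True)
class Concat:         # 5. R1 ◦ R2
    left: object
    right: object

@dataclass(frozen=True)
class Star:           # 6. R1*
    inner: object


def ends(r, s, i):
    """Set of positions j such that r matches s[i:j]."""
    if isinstance(r, Lit):
        return {i + 1} if i < len(s) and s[i] == r.char else set()
    if isinstance(r, Eps):
        return {i}
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Union):
        return ends(r.left, s, i) | ends(r.right, s, i)
    if isinstance(r, Concat):
        return {k for j in ends(r.left, s, i) for k in ends(r.right, s, j)}
    if isinstance(r, Star):
        reached, frontier = {i}, {i}
        while frontier:                 # repeat until no new position appears
            frontier = {k for j in frontier for k in ends(r.inner, s, j)} - reached
            reached |= frontier
        return reached
    raise TypeError(f"not a regular expression: {r!r}")


def matches(r, s):
    """True when r matches the whole string s."""
    return len(s) in ends(r, s, 0)
```

For example, `Concat(Star(Union(Lit("a"), Lit("b"))), Lit("c"))` encodes (a ∪ b)* ◦ c and accepts strings such as "abac".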
Efficiently Matching Regular Expressions • Attempts to extend classical literal search algorithms to process regular expressions have largely been fruitless • Efficient algorithms involve clever ways of simulating an NFA equivalent of the regular expression
Efficiently Matching Regular Expressions • For small to moderate pattern sizes, optimizations using bit-parallelism appear to result in the fastest algorithms (Navarro, Raffinot) • For large pattern sizes (greater than about 4 times the word size), partial conversion from NFA to DFA results in good performance
Where can we go from here? • Approximate matching—match a literal string to within some “difference” • Edit distance is commonly used • Rules much more complex for computational biology applications • Extensions to regular expressions • Used by most languages and applications
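As a rough illustration of approximate matching under edit distance, the sketch below shows the classic dynamic-programming recurrence and a column-by-column search variant in which a match may start anywhere in the text. It is written in Python with hypothetical function names (`edit_distance`, `approximate_search`) and unit costs for insertion, deletion and substitution; the more complex scoring rules used in computational biology are not modelled.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete ca
                           cur[j - 1] + 1,           # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def approximate_search(pattern, text, k):
    """End positions j (exclusive) such that some substring of text ending
    at j is within edit distance k of pattern."""
    # col[i] = smallest distance between pattern[:i] and any substring
    # ending at the current text position.
    col = list(range(len(pattern) + 1))
    hits = []
    for j, ct in enumerate(text, 1):
        new = [0]                       # row 0 stays 0: a match may start anywhere
        for i, cp in enumerate(pattern, 1):
            new.append(min(col[i] + 1,
                           new[i - 1] + 1,
                           col[i - 1] + (cp != ct)))
        col = new
        if col[-1] <= k:
            hits.append(j)
    return hits
```

With k = 0 the search degenerates to exact matching, so `approximate_search("cad", "abracadabra", 0)` reports only the end position 7.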
Where can we go from here? • Efficiently handling regular expressions and approximate matching are problems in much of today’s research • Flexible Pattern Matching in Strings, by Navarro and Raffinot, referenced here, was published June 15, 2002
What is a Regular Expression? • Say modern developers: • A pattern that can be matched against a string • Not necessarily a model of any particular machine • Not necessarily (and not usually) regular • A very powerful tool for solving text-based problems
Who uses regular expressions? • Where to find built-in “regular expression” support today? • awk, grep, sed, vi, emacs, find, more, less, lex, Perl, Ruby, Tcl, MySQL, JavaScript, PHP, Python, Java, Microsoft .NET, and many, many more • Built-in support has become more frequent and more advanced in the past few years
Irregular Regular Expressions? • Matching the patterns accepted by most popular “regular expression” engines is NP-hard • Construction of a “regular expression” in Perl which matches representations of 3-colorable graphs is fairly straightforward
Irregular Regular Expressions? • A Perl “regular expression” that matches exactly when a graph is 3-colorable, given the number of vertices $V and an edge list @E (pairs of vertex numbers, counted from 1 because they are used as backreference numbers):
$string = (join "\n", (("rgb") x $V)) . "\n:\n" . join "\n", (("rgbrbgr") x @E);
$regex = '^' . (join "\\n", (".*(.).*") x $V) . "\\n:\\n" . (join "\\n", map {".*\\$_->[0]\\$_->[1].*"} @E) . '$';
• The graph is 3-colorable iff $regex matches $string (http://perl.plover.com/NPC/NPC-3COL.html)
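The same reduction can be tried in Python, whose `re` module also supports numbered backreferences. The sketch below is a port written for illustration, not taken from the linked page; the function name `three_colorable` is hypothetical and vertices are assumed to be numbered from 1. Each vertex line "rgb" lets one capture group pick a color, and each edge line "rgbrbgr" contains every ordered pair of distinct colors side by side, so an edge pattern can only match when its two endpoints captured different colors.

```python
import re


def three_colorable(num_vertices, edges):
    """edges is a list of (u, v) pairs with 1-based vertex numbers."""
    # One "rgb" line per vertex, a separator, one "rgbrbgr" line per edge.
    string = ("\n".join(["rgb"] * num_vertices)
              + "\n:\n"
              + "\n".join(["rgbrbgr"] * len(edges)))
    # Group i captures the color chosen for vertex i.
    vertex_part = r"\n".join([".*(.).*"] * num_vertices)
    # For edge (u, v): the two captured colors must appear adjacent,
    # which "rgbrbgr" allows only for two different colors.
    edge_part = r"\n".join(".*\\" + str(u) + "\\" + str(v) + ".*"
                           for u, v in edges)
    regex = "^" + vertex_part + r"\n:\n" + edge_part + "$"
    return re.match(regex, string) is not None
```

A triangle, `three_colorable(3, [(1, 2), (2, 3), (1, 3)])`, is accepted, while the complete graph on four vertices is rejected after the engine has backtracked through every color assignment. That exhaustive backtracking is the practical face of the NP-hardness claim.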
Irregular Regular Expressions? • Usage of the term “regular expression” in modern development conflicts with its theoretical definition • Many are unaware of or ignore this conflict, while others choose different terminology: • “Extended regular expression” • “Regex”
Clear Definitions • Regular expression—a description of a regular language, as defined by Kleene • Regex—any pattern matched against a string, not necessarily regular
The Main Culprit • Backreferences • Ability to refer to text that has been matched in a previous part of the regex • Typically expressed as \n, where n is a number referring to the text matched by the regex inside the nth set of parentheses • “(.*)\1\1” matches “abcabcabc”, “abaabaaba”... • “\b(\w+)\b\s+\b\1\b” matches “the the”, “a a”…(double words)
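Both example patterns from this slide can be tried with Python's `re` module, which uses the same `\n` backreference syntax. The wrapper names `is_triple` and `doubled_words` are made up for this sketch.

```python
import re


def is_triple(s):
    """True when s is some string repeated exactly three times."""
    return re.fullmatch(r"(.*)\1\1", s) is not None


def doubled_words(text):
    """Return each word that is immediately followed by itself."""
    return [m.group(1) for m in re.finditer(r"\b(\w+)\b\s+\b\1\b", text)]
```

The language of strings of the form www is a standard example of a non-regular language, which is why the first pattern already falls outside Kleene's definition.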
Backreferences • Supported in limited number by vi, sed, grep, emacs, Ruby, Python, PHP • POSIX standard for Basic Regular Expressions includes capability to process nine backreferences • Bounding the number available places a bound on the worst-case matching cost
Backreferences • Supported without quantity bounds by Perl 5 and later, Tcl, Java 1.4, .NET • Number of backreferences limited only by physical memory restrictions
Backreferences • Slow: matching a regex becomes NP-hard (for an unbounded number of backreferences) • Extremely useful: they add a great deal of expressive power to a regex • Largely untouched by theoretical analysis • No real bounds on efficiency
Lookahead • Also known as “zero-width matching” • Ability to check text ahead without consuming it in a match • Typically expressed as (?=text) • Example • “abc(?=def)” will match “abc”, but only if followed by “def”
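A short sketch of lookahead in Python, where a positive lookahead is written `(?=...)`, as in Perl. The function name `abc_before_def` is hypothetical. The point to notice is that the text inside the lookahead is checked but is not part of the reported match.

```python
import re


def abc_before_def(text):
    """Return the (start, end) span of the first 'abc' that is directly
    followed by 'def', or None when there is no such occurrence."""
    m = re.search(r"abc(?=def)", text)
    return m.span() if m else None
```

For "xxabcdefyy" the reported span covers only "abc"; the "def" that follows remains available to whatever pattern element comes next.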
Thank Larry Wall • Perl 5 regexes offer the ability to embed code within a regex • Perl 6 will support recursive regexes
Why the divide? • Very little theory has touched on extended regular expressions. • Backreferences are indispensable for many programmers, and often even in non-development use of *NIX systems
Why the divide? • Developers implemented regular expression processors shortly after Kleene created regular expressions in the 1950s
Why the divide? • New and more powerful features were quickly added to practical “regular expressions” so that users and programmers could express more languages • Regexes soon left theory in the dust
Moral of the Story • It’s much easier to hack than to make a good proof
The Future • Unbounded backreferences are becoming a standard feature in regex libraries and languages • The idea of implementing regexes in a common module and sharing it among different languages and platforms is growing in popularity • PCRE (Perl Compatible Regular Expressions) is used by Python, PHP, Apache, KDE…
The Future • Regex implementations seem to be moving towards more standardization • Meanwhile, a solid theoretical foundation has been laid down for regular expressions and modest extensions • Practice may not come to theory, but theory may soon come to practice