Pattern Matching: Simple Patterns
Introduction • Programmers often need to scan a file, directory, etc. for a specific substring. • Find all files that begin with “A”. • Find all files that end in “txt”. • This capability is provided by a variety of tools. • e.g. egrep, grep, awk. • It is useful to include this functionality in a programming language.
Perl’s Pattern Matcher • Perl has a built-in pattern matcher. • Motivation: system administrators frequently use regular expressions. They also use Perl. • The syntax is borrowed from the grep utility in Unix. • It is based on regular expressions from computer science.
Perl’s Pattern Matcher (cont.) • Operates over a single string. • Contexts: • Scalar: Returns true or false. • List: Matching substrings returned in a list. • The syntax is: m dl pattern dl [modifiers] • (/) is the most common delimiter. • The m is optional when the delimiter is (/). • Other delimiters can be used with an explicit m: m~pattern~
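The two contexts and the delimiter choice can be sketched as follows. This is a minimal illustration and not taken from the slides: the sample string and variable names are hypothetical, and the g modifier (match repeatedly) is used so that list context returns every matching substring.

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the slides.
my $line = "alpha=1 beta=22 gamma=333";

# Scalar context: the match is simply true or false.
my $found = ($line =~ /beta/) ? 1 : 0;

# List context with the g modifier: all matching substrings come back.
my @numbers = ($line =~ /\d+/g);

# The same pattern with an explicit m and ~ as the delimiter.
my @same = ($line =~ m~\d+~g);

print "found=$found numbers=@numbers same=@same\n";
```

A delimiter other than (/) is mostly a convenience when the pattern itself contains slashes, such as a file path.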
Simple Patterns • Simple patterns – match individual characters or character classes. • An abstract representation of a set of strings. • A pattern “matches” when the string it’s compared with is in the set. • Matching is done from left to right.
Three Categories of Characters • Normal characters: • Match themselves. • Includes escape sequences – e.g. \t, \cC • Metacharacters: • Have special meanings in patterns • \ | ( ) [ ] { } ^ $ * + ? • Period: • Matches any character except newline.
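A short sketch of the three categories, assuming a made-up sample string. It shows an escape sequence matching a tab, the period matching any character, a backslash turning the period back into a normal character, and the period refusing to match a newline.

```perl
use strict;
use warnings;

# Hypothetical sample: "a.c", a tab, then "abc".
my $text = "a.c\tabc";

# Normal character written as an escape sequence: \t matches the tab.
my $has_tab = ($text =~ /\t/) ? 1 : 0;

# Period: matches any character, so both "a.c" and "abc" are found.
my @any = ($text =~ /a.c/g);

# Escaped period: the backslash makes it match only a literal ".".
my @literal = ($text =~ /a\.c/g);

# The period does not match a newline.
my $across_newline = ("a\nc" =~ /a.c/) ? 1 : 0;

print "tab=$has_tab any=@any literal=@literal newline=$across_newline\n";
```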
An Example $_ = "It's snowing today."; if (/snow/) { print "There was snow somewhere in $_\n"; } else { print "$_ was snowless\n"; }
Character Classes • Character classes specify collections of characters in patterns. • Defined by placing the set in [ ] • e.g. /[<>=]/ • Dashes are used to specify ranges of characters: • /[A-Za-z]/ • /[0-7]/ • /[0-3-]/ (a dash placed last is a literal dash)
Exclusion From a Class • Characters can be excluded from a class by placing (^) first inside the brackets. • The class then matches any one character except the specified characters. • For example, • /[^A-Za-z]/ • /[^01]/
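The classes from the last two slides can be tried against a small string. This is only a sketch: the sample expression is hypothetical, and the g modifier is used to collect every matching character in list context.

```perl
use strict;
use warnings;

# Hypothetical sample expression.
my $expr = "x<=7-3";

# A set of individual characters.
my @operators = ($expr =~ /[<>=]/g);

# A range of characters.
my @octal_digits = ($expr =~ /[0-7]/g);

# A range plus a literal dash (the dash comes last in the class).
my @low_or_dash = ($expr =~ /[0-3-]/g);

# Exclusion: any character that is not a letter.
my @non_letters = ($expr =~ /[^A-Za-z]/g);

print "@operators | @octal_digits | @low_or_dash | @non_letters\n";
```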
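Some Examples • /[A-Z]"\s/ - an uppercase letter, a double quote, then a whitespace character • /[\dA-Fa-f]/ - one hexadecimal digit • /\w\w:\d\d/ - two word characters, a colon, then two digits • /0x\d/ - 0x followed by one decimal digit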
Variables in Patterns • A variable in a pattern is interpolated. • In a double-quoted string every backslash meant for the pattern must be doubled. • For example, $hexpat = "\\s[\\dA-Fa-f]\\s"; if (/$hexpat/) { print "$_ has a hex digit."; }
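A hedged sketch of interpolation, using hypothetical sample lines. A single-quoted string avoids doubling the backslashes; the qr// operator, which is not covered on the slide, is shown as an alternative that stores a compiled pattern.

```perl
use strict;
use warnings;

# Single quotes keep the backslashes exactly as typed.
my $hexpat = '\s[\dA-Fa-f]\s';

# Hypothetical input lines.
my @lines = ("value is F here", "no digits here");

# The variable is interpolated into the pattern before matching.
my @hits = grep { /$hexpat/ } @lines;

# Alternative (an addition to the slide): a precompiled pattern via qr//.
my $hexre = qr/\s[\dA-Fa-f]\s/;
my @hits_qr = grep { $_ =~ $hexre } @lines;

print "$_ has a hex digit.\n" for @hits;
```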
Quantifiers • Quantifiers can make a pattern more powerful. • A quantifier allows a pattern to be repeated a specified number of times. • Perl has four kinds of quantifier: • *, +, ?, {m,n} • A quantifier immediately follows the pattern it quantifies.
{m,n} • {n} – exactly n repetitions. • {m,} – at least m repetitions. • {m,n} – at least m, but not more than n repetitions.
{m,n} Examples • /a{1,3}b/ - ab, aab, aaab • /ab{3}c/ - abbbc • /ab{2,}c/ - abbc, abbbc, abbbbc, … • /c{3} z{5}/ - ccc zzzzz • /[abc]{1,2}/ - a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc
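The counted quantifiers above can be checked with a few candidate strings. This is a sketch with hypothetical word lists; the anchors ^ and $ are added so that the whole string must match, not just a part of it.

```perl
use strict;
use warnings;

# One to three a's followed by b.
my @one_to_three = grep { /^a{1,3}b$/ } qw(b ab aab aaab aaaab);

# Exactly three b's between a and c.
my @exactly_three = grep { /^ab{3}c$/ } qw(abbc abbbc abbbbc);

# Two or more b's between a and c.
my @two_or_more = grep { /^ab{2,}c$/ } qw(abc abbc abbbbc);

# One or two characters, each taken from the class [abc].
my @class_twice = grep { /^[abc]{1,2}$/ } qw(a cc ab abc d);

print "@one_to_three | @exactly_three | @two_or_more | @class_twice\n";
```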
Asterisk (*) • (*) means zero or more repetitions. • Equivalent to {0,} • For example, • /0\d\d*/ • /\w\w*/ • /bob.*cat/
Plus (+) • (+) means one or more repetitions. • Equivalent to {1,} • For example, • /\w+/ • /[A-Za-z][A-Za-z\d_]+/ • /\d+\.\d+/
Question Mark (?) • (?) means either zero or one. • Equivalent to {0,1}. • For example, • /\d+\.?/ • /\$?\d+\.\d\d/ • /"?\w+"?/
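The three shorthand quantifiers from the last slides can be sketched together. The input values here are hypothetical; anchors are added in the grep tests so that the whole string must match.

```perl
use strict;
use warnings;

# (*): a 0, one digit, then zero or more further digits.
my @octal = grep { /^0\d\d*$/ } qw(0 07 0755 755);

# (+): one or more digits on each side of a literal period.
my @decimals = ("pi is 3.14159, e is 2.718" =~ /\d+\.\d+/g);

# (?): an optional dollar sign in front of a price.
my @prices = grep { /^\$?\d+\.\d\d$/ } ('$12.50', '7.05', '12.5', 'abc');

print "@octal | @decimals | @prices\n";
```

Note that "0" alone is rejected by the first pattern, because (*) applies only to the final \d and the one before it is still required.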
Subpatterns • A quantifier modifies only the item immediately before it, which is normally a single character. • e.g. /ball*/ matches bal, ball, balll, … • () can be used to group parts of patterns. • The quantifier then modifies the whole group. • For example, • /(ball)*/ • /(boo! ){3}/
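The difference between quantifying one character and quantifying a group can be sketched as below. The candidate strings are hypothetical, and anchors are used so that the entire string has to match.

```perl
use strict;
use warnings;

# Without parentheses the * applies only to the final l.
my @last_char = grep { /^ball*$/ } qw(ba bal ball balll ballball);

# With parentheses the * applies to the whole group "ball".
my @whole_group = grep { /^(ball)*$/ } ("", "ball", "ballball", "bal");

# A counted group: "boo! " exactly three times.
my $cheer = "boo! " x 3;
my $three_boos = ($cheer =~ /^(boo! ){3}$/) ? 1 : 0;

print "last_char=@last_char\n";
print "whole_group matched ", scalar(@whole_group), " strings\n";
print "three_boos=$three_boos\n";
```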
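Alternation • (|) is the logical OR operator in a pattern. • /a|e|i|o|u/ is equivalent to /[aeiou]/ • For example, • /(Bob|Tom|Pussy|Scaredy)cat/ • /t(oo?|wo)/ • Be careful! Alternatives are tried from left to right. • /Tom|Tommie/ matches only the Tom in “Tommie”; the second alternative is never used.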
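A minimal sketch of alternation and of the ordering pitfall, with hypothetical inputs. The special variable $& holds the text matched by the last successful match; it is used here only to show which alternative won.

```perl
use strict;
use warnings;

# Alternation inside a group, followed by a fixed suffix.
my @cats = grep { /^(Bob|Tom)cat$/ } qw(Bobcat Tomcat Alleycat);

# "to", "too" and "two" all match; "tow" does not.
my @t_words = grep { /^t(oo?|wo)$/ } qw(to too two tow);

# Alternatives are tried left to right, so order matters.
my $name = "Tommie";
my $short_first = ($name =~ /Tom|Tommie/) ? $& : '';
my $long_first  = ($name =~ /Tommie|Tom/) ? $& : '';

print "@cats | @t_words | $short_first | $long_first\n";
```

Listing the longer alternative first is one way to let it match.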
Precedence • The precedence of the operators, from highest to lowest, is: • Parentheses • Quantifiers • Character sequence • Alternation • For example, • /#|-+/ - a single # or one or more - • /(#|-)+/ - one or more characters, each a # or a -
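The effect of precedence on the two example patterns can be observed by looking at what each one actually matches. This is a sketch with a hypothetical input string; $& holds the text matched by the last successful match.

```perl
use strict;
use warnings;

# Hypothetical input: two # characters followed by two dashes.
my $rule = "##--";

# The + binds to the dash only, so the pattern means: # OR one-or-more dashes.
my $ungrouped = ($rule =~ /#|-+/) ? $& : '';

# Parentheses bind first, so the + repeats the whole alternation.
my $grouped = ($rule =~ /(#|-)+/) ? $& : '';

print "ungrouped='$ungrouped' grouped='$grouped'\n";
```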