Regular expressions step by step Tamás Váradi varadi@nytud.hu BTANT129 w6
What are they? • Regular expressions (regexp) define a pattern, which may match a whole series of strings • Powerful, compact, fast • Useful for all sorts of text processing tasks
Where can I use them? • In text editors and word processors (even in MS Word to some extent!), such as TextPad and EditPad Pro (to name but two) • In special programs for searching a set of files: grep, egrep, sed (free), PowerGREP, Visual REGEXP • In programming languages: Perl, Python and other so-called scripting languages
What about INTEX? • Yes, INTEX has a built-in regexp facility • But it is a little limited and peculiar (INTEX offers graphs as an alternative) • In this lecture, we are going to cover regular expressions as used in the text processing tools mentioned above
Is there a standard variety? • More or less • There are variants that differ in notation and in features (expressive power, elegance etc.) • Here we'll concentrate on what you can expect regular expressions to do
First things first • Any character will match itself • Except characters with a special meaning (metacharacters): \ | ( ) [ { ^ $ * + ? . < > • The pattern is applied to the text from top to bottom and from left to right, like a sliding window moving over the text
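As a minimal sketch of literal matching and the left-to-right scan, here is how it might look in Python's `re` module (Python is one of the languages named earlier in the lecture; the sample sentence and variable names are made up for illustration):

```python
import re

text = "the cat sat on the mat"

# Ordinary characters match themselves: "at" is found wherever
# the letters a and t occur next to each other.
positions = [m.start() for m in re.finditer("at", text)]
print(positions)  # [5, 9, 20]

# The scan runs left to right, so search() reports the leftmost fit.
first = re.search("at", text)
print(first.start())  # 5
```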
Special characters • . will match any one character • ? will match the preceding character zero times or once (at most once) • + will match the preceding character one or more times (at least once) • * will match the preceding character zero or more times • {n,m} will match the preceding character at least n and at most m times
Examples • .at matches bat, cat, fat, pat, rat etc. • c*at matches at, cat, ccat, cccat etc. • Guess what c* will match, and why? • c+at matches cat, ccat, cccat etc., but not at • c?at matches at and cat, but not ccat
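The examples above can be tried out with a short Python sketch. The candidate list and the helper `whole_matches` are hypothetical names chosen for this illustration; `re.fullmatch` is used so that a pattern only counts when it covers the entire string.

```python
import re

candidates = ["at", "cat", "ccat", "cccat", "bat"]

def whole_matches(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(whole_matches(".at"))       # ['cat', 'bat']
print(whole_matches("c*at"))      # ['at', 'cat', 'ccat', 'cccat']
print(whole_matches("c+at"))      # ['cat', 'ccat', 'cccat']
print(whole_matches("c?at"))      # ['at', 'cat']
print(whole_matches("c{2,3}at"))  # ['ccat', 'cccat']
```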
Anchor points • A regexp is matched against the text at any point where the first character of the regexp matches a character in the target text (the sliding window again) • Matching is done line by line by default • ^ : match at the beginning of the line • $ : match at the end of the line
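A small Python sketch of the two anchors follows. Note one assumption: Python's `re` does not itself work line by line by default, so the sketch splits the text into lines first, roughly as a tool like grep would. The sample text is invented.

```python
import re

text = "cat and dog\ndog and cat"
lines = text.splitlines()

# Without an anchor, "cat" is found in both lines.
anywhere = [line for line in lines if re.search("cat", line)]
print(anywhere)  # ['cat and dog', 'dog and cat']

# ^ ties the match to the beginning of the line.
at_start = [line for line in lines if re.search("^cat", line)]
print(at_start)  # ['cat and dog']

# $ ties the match to the end of the line.
at_end = [line for line in lines if re.search("cat$", line)]
print(at_end)  # ['dog and cat']
```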
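Groups and alternations • (bla)* : parentheses group characters into a unit, so the * applies to the whole of bla; matches bla, blabla, blablabla etc. (and also nothing at all) • Sir|Madam : the | separates alternatives; matches either Sir or Madam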
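To see what the parentheses and the | do, the two patterns on this slide can be tried in Python. This is only a sketch; the test strings and variable names are made up.

```python
import re

# The quantifier after the parentheses applies to the whole group.
empty_ok = bool(re.fullmatch("(bla)*", ""))
twice_ok = bool(re.fullmatch("(bla)*", "blabla"))
partial_ok = bool(re.fullmatch("(bla)*", "blab"))
print(empty_ok, twice_ok, partial_ok)  # True True False

# Without parentheses, * applies only to the preceding "a".
ungrouped_ok = bool(re.fullmatch("bla*", "blaaa"))
print(ungrouped_ok)  # True

# | offers a choice between the alternatives on either side of it.
titles = re.findall("Sir|Madam", "Dear Sir or Madam,")
print(titles)  # ['Sir', 'Madam']
```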
Character classes • [aeiou] matches any one character of the set • [^aeiou] matches any one character except those in the set • [a-zA-Z0-9] consecutive characters can be referred to with a range • Note: whatever the length of the set, it always represents a single character in the pattern, so it is a single-character alternation (an 'or' relation between characters)
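The following Python sketch illustrates the three kinds of class and the point that a class always stands for exactly one character. The sample strings and variable names are invented for the illustration.

```python
import re

vowels = re.findall("[aeiou]", "regexp")
print(vowels)  # ['e', 'e']

non_vowels = re.findall("[^aeiou]", "regexp")
print(non_vowels)  # ['r', 'g', 'x', 'p']

alnum = re.findall("[a-zA-Z0-9]", "B-129 w6!")
print(alnum)  # ['B', '1', '2', '9', 'w', '6']

# The class stands for ONE character, so "boot" (two vowels) is skipped.
words = re.findall("b[aeiou]t", "bat bet bit boot but")
print(words)  # ['bat', 'bet', 'bit', 'but']
```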
Extended features • \d a digit • \D a non-digit • \s a whitespace character (space, tab, newline etc.) • \S a non-whitespace character • \w a word character • \W a non-word character • \b a word boundary • \n a newline • \t a tab
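A few of these shorthands in action, sketched in Python. The patterns are written as raw strings (`r"..."`) so that Python passes the backslash through to the regexp engine unchanged; the sample text is made up.

```python
import re

text = "Room 129,\tweek 6"

digits = re.findall(r"\d", text)
print(digits)  # ['1', '2', '9', '6']

numbers = re.findall(r"\d+", text)
print(numbers)  # ['129', '6']

words = re.findall(r"\w+", text)
print(words)  # ['Room', '129', 'week', '6']

# \s+ treats the space and the tab alike as separators.
fields = re.split(r"\s+", text)
print(fields)  # ['Room', '129,', 'week', '6']

# \b keeps "cat" from matching inside "concatenate".
whole_words = re.findall(r"\bcat\b", "cat concatenate cat.")
print(whole_words)  # ['cat', 'cat']
```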
Longest vs. shortest match • When using quantifiers with non-literal characters (".", "\w", "\S" etc.), one can easily get unintended matches • .+ longest match (default) • .+? shortest match
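A classic way to see the difference is to look for tags in marked-up text. The sketch below uses Python, which supports the .+? form; the sample text is invented.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Longest match: .+ runs on to the very last ">" in the text.
longest = re.findall("<.+>", text)
print(longest)  # ['<b>bold</b> and <i>italic</i>']

# Shortest match: .+? stops at the first ">" it can reach.
shortest = re.findall("<.+?>", text)
print(shortest)  # ['<b>', '</b>', '<i>', '</i>']
```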
The escape character • Problem: What if we want to find characters that are themselves regexp metacharacters ( \ | ( ) [ { ^ $ * + ? . < > )? • Solution: They have to be preceded by "\" to strip them of their special meaning, e.g.: \( \$ \[ \? etc.
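The effect of escaping can be sketched in Python as follows. The sample text is made up; `re.escape` is a helper in Python's `re` module that adds the backslashes automatically, which is handy when the text to search for comes from elsewhere.

```python
import re

text = "Price: $5.00 (approx.)"

# Unescaped, $ means "end of text", so "$5" cannot match anything.
unescaped = re.search("$5", text)
print(unescaped)  # None

# Escaped, \$ and \. stand for the literal characters.
escaped = re.findall(r"\$5\.00", text)
print(escaped)  # ['$5.00']

# An unescaped . still means "any character".
loose = re.findall("5.00", "5.00 5x00")
print(loose)  # ['5.00', '5x00']

# re.escape() puts a backslash before every metacharacter for us.
literal = re.findall(re.escape("(approx.)"), text)
print(literal)  # ['(approx.)']
```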
Things to do • Look up the tutorial at http://www.zvon.org/other/PerlTutorial/Output/contents.html • Download one of the tools Visual REGEXP, PowerGREP or EditPad Pro and experiment with texts • Follow the tutorial of EditPad Pro, which you can find in its Help