Text Processing with Regular Expressions

Mercy

Presentation Transcript


  1. Text Processing with Regular Expressions

  2. What is a Regular Expression?
     • A regular expression is a pattern language designed to manipulate text. Users write regular expressions with its extensive pattern-matching notation to:
       • Search text;
       • Extract, edit, replace, or delete text substrings;
       • Validate input data: values, formats.
     • Examples of familiar pattern-matching notations (wildcards, which regular expressions generalize):
       • *.doc
       • Select * From Student Where Sname Like 'C%';

  3. System.Text.RegularExpressions Namespace
     • We need to import the System.Text.RegularExpressions namespace and use the Regex class to create regular expressions.
     • Imports System.Text.RegularExpressions
     • Dim re As New Regex("[aeiou]\d")
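
The two statements on this slide can be tried in a small console program. This is a minimal sketch; the module name RegexBasics and the function HasVowelThenDigit are made up for illustration and are not part of the original slides.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexBasics
    ' Hypothetical helper: True when the input contains a lowercase
    ' vowel immediately followed by a decimal digit.
    Function HasVowelThenDigit(ByVal input As String) As Boolean
        Dim re As New Regex("[aeiou]\d")
        Return re.IsMatch(input)
    End Function

    Sub Main()
        Console.WriteLine(HasVowelThenDigit("area51")) ' "a5" satisfies the pattern
        Console.WriteLine(HasVowelThenDigit("x9"))     ' no vowel before the digit
    End Sub
End Module
```

Matching is case-sensitive by default, so "A5" would not match this pattern unless RegexOptions.IgnoreCase is passed to the Regex constructor.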

  4. Regular Expression Language Elements: 1. Character Escapes
     Escape characters signal to the regular expression parser that a character is not an operator and should be interpreted as a matching character.
     • ordinary characters: Characters other than . $ ^ { [ ( | ) * + ? \ match themselves.
     • \a Matches a bell (alarm).
     • \b Matches a backspace, but only inside a [ ] character class; elsewhere \b is a word boundary.
     • \t Matches a tab.
     • \r Matches a carriage return.
     • \f Matches a form feed.
     • \n Matches a new line.
     • \e Matches an escape.
     • \* When the backslash is followed by a character that doesn't form an escape sequence, it matches that character: \* matches *, \( matches (.
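
The escapes above can be sketched as follows. The module and function names are hypothetical. Regex.Escape is a .NET helper that adds the backslashes for you when the text to find is meant literally.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module EscapeDemo
    ' Hypothetical helper: searches for 'literal' as plain text by escaping
    ' any characters that would otherwise act as regex operators.
    Function ContainsLiteral(ByVal source As String, ByVal literal As String) As Boolean
        Return Regex.IsMatch(source, Regex.Escape(literal))
    End Function

    Sub Main()
        ' Hand-written escapes: \* and \( match the characters * and ( themselves.
        Console.WriteLine(Regex.IsMatch("f(x) = 2*x", "2\*x"))
        Console.WriteLine(Regex.IsMatch("f(x) = 2*x", "f\(x\)"))
        ' \t matches a tab character in the input.
        Console.WriteLine(Regex.IsMatch("a" & vbTab & "b", "a\tb"))
        ' Regex.Escape produces an escaped pattern from literal text.
        Console.WriteLine(ContainsLiteral("total = 3*(4+5)", "3*(4+5)"))
    End Sub
End Module
```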

  5. 2. Character Classes Provides information on the set of regular expression characters that define the substring to match.

  6. • . Matches any character except \n.
     • [aeiou] Matches any single character included in the specified set of characters.
     • [^aeiou] Matches any single character not in the specified set of characters.
     • [0-9a-fA-F] Use of a hyphen (-) allows specification of contiguous character ranges.
     • \w Matches any word character. With the ECMAScript option, \w is the same as [a-zA-Z_0-9]; by default it also matches other Unicode letters and digits.
     • \W Matches any nonword character. With the ECMAScript option, \W is the same as [^a-zA-Z_0-9].
     • \s Matches any white-space character. With the ECMAScript option, \s is the same as [ \f\n\r\t\v].
     • \S Matches any non-white-space character. With the ECMAScript option, \S is the same as [^ \f\n\r\t\v].
     • \d Matches any decimal digit.
     • \D Matches any nondigit.
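
A quick way to get a feel for the character classes is to count how many characters of a sample string each one matches. The sketch below assumes a made-up sample string and a hypothetical helper CountMatches.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharClassDemo
    ' Hypothetical helper: number of non-overlapping matches of 'pattern' in 'input'.
    Function CountMatches(ByVal input As String, ByVal pattern As String) As Integer
        Return Regex.Matches(input, pattern).Count
    End Function

    Sub Main()
        Dim s As String = "Room 42, code beef"
        Console.WriteLine(CountMatches(s, "\d"))            ' decimal digits
        Console.WriteLine(CountMatches(s, "\s"))            ' white-space characters
        Console.WriteLine(CountMatches(s, "[aeiou]"))       ' lowercase vowels
        Console.WriteLine(CountMatches(s, "[^aeiou\s\d]"))  ' everything else
    End Sub
End Module
```

Each pattern here matches one character at a time, so the four counts add up to the length of the string.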

  7. Atomic Zero-Width Assertions Provides information on zero-width assertions that cause a match to succeed or fail depending on the regular expression parser's current position in the input string.

  8. • ^ Specifies that the match must occur at the beginning of the string or, with the Multiline option, at the beginning of a line.
     • $ Specifies that the match must occur at the end of the string or, with the Multiline option, at the end of a line. Ex. abc$ matches any abc immediately before the end of a line.
     • \A Specifies that the match must occur at the beginning of the string (ignores the Multiline option).
     • \Z Specifies that the match must occur at the end of the string or before a final \n at the end of the string (ignores the Multiline option).
     • \z Specifies that the match must occur at the very end of the string (ignores the Multiline option).
     • \G Specifies that the match must occur at the point at which the current search started (often, this is one character beyond where the last search ended).
     • \b Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters, that is, at the first or last character of a word delimited by nonalphanumeric characters such as spaces or punctuation.
     • \B Specifies that the match must not occur on a \b boundary.
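
The effect of the Multiline option and of the \b, \B, \Z, and \z assertions can be sketched as below. The sample text and all module and function names are assumptions made for illustration. Lines are separated with vbLf because $ matches before \n, not before a carriage return.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module AnchorDemo
    ' Hypothetical three-line sample; only the first two lines end in "abc".
    Function SampleText() As String
        Return "abc" & vbLf & "xyz abc" & vbLf & "abc def"
    End Function

    ' With RegexOptions.Multiline, $ also matches at the end of every line.
    Function CountAtLineEnds(ByVal text As String) As Integer
        Return Regex.Matches(text, "abc$", RegexOptions.Multiline).Count
    End Function

    ' Without Multiline, $ only matches at the end of the whole string.
    Function CountAtStringEnd(ByVal text As String) As Integer
        Return Regex.Matches(text, "abc$").Count
    End Function

    Sub Main()
        Console.WriteLine(CountAtLineEnds(SampleText()))
        Console.WriteLine(CountAtStringEnd(SampleText()))
        ' \b requires a word boundary, so "cat" inside "concat" is skipped.
        Console.WriteLine(Regex.Matches("cat concat cat.", "\bcat\b").Count)
        ' \B requires the absence of a boundary, so only "concat" qualifies.
        Console.WriteLine(Regex.Matches("cat concat cat.", "\Bcat").Count)
        ' \Z tolerates one trailing newline; \z does not.
        Console.WriteLine(Regex.IsMatch("abc" & vbLf, "abc\Z"))
        Console.WriteLine(Regex.IsMatch("abc" & vbLf, "abc\z"))
    End Sub
End Module
```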

  9. Quantifiers Add optional quantity data to regular expressions. A particular quantifier applies to the character, character class, or group that immediately precedes it.

  10. • * Specifies zero or more matches; same as {0,}.
     • + Specifies one or more matches; same as {1,}. For example, \w+ matches a run of one or more word characters.
     • ? Specifies zero or one matches; same as {0,1}.
     • {n} Specifies exactly n matches; for example, \d{3} matches exactly three digits.
     • {n,} Specifies at least n matches; for example, \w{3,} matches words with at least 3 characters.
     • {n,m} Specifies at least n, but no more than m, matches; for example, \d{3,5} matches groups of three, four, or five digits. Do not put a space inside the braces.
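
The quantifiers above can be sketched as follows. FindAll is a hypothetical helper that joins all matched values with commas, and the sample inputs are made up. The \b anchors are added so that a quantifier applies to whole runs of digits rather than to a slice of a longer run.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module QuantifierDemo
    ' Hypothetical helper: returns every match of 'pattern', comma-separated.
    Function FindAll(ByVal input As String, ByVal pattern As String) As String
        Dim values As New List(Of String)
        For Each m As Match In Regex.Matches(input, pattern)
            values.Add(m.Value)
        Next
        Return String.Join(",", values.ToArray())
    End Function

    Sub Main()
        Dim numbers As String = "7 42 123 4567 12345 1234567"
        Console.WriteLine(FindAll(numbers, "\b\d{3}\b"))      ' runs of exactly three digits
        Console.WriteLine(FindAll(numbers, "\b\d{3,5}\b"))    ' runs of three to five digits
        Console.WriteLine(FindAll("a to the moon", "\w{3,}")) ' words of three or more characters
        ' Without \b, {3} slices a longer run into consecutive groups of three.
        Console.WriteLine(FindAll("1234567", "\d{3}"))
    End Sub
End Module
```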

  11. Alternation Constructs Alternation constructs modify a regular expression to allow either/or matching.

  12. • | Matches any one of the terms separated by the | (vertical bar) character; for example, cat|dog|tiger. The leftmost successful match wins.
     • (?(expression)yes|no) Matches the "yes" part if the expression matches at this point; otherwise, matches the "no" part. The "no" part can be omitted.
     • (?(name)yes|no) Matches the "yes" part if the named capture string has a match; otherwise, matches the "no" part. The "no" part can be omitted.
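
A minimal sketch of both constructs follows. The phone-number pattern is an illustration rather than something taken from the slides: it uses a named group, open, for an optional opening parenthesis and the (?(name)yes) conditional to demand the closing parenthesis only when the opening one was present. The module and function names are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AlternationDemo
    ' Hypothetical helper: accepts "(415)-555-1234" and "415-555-1234",
    ' but rejects an unbalanced parenthesis around the area code.
    Function IsBalancedPhone(ByVal input As String) As Boolean
        Dim pattern As String = "^(?<open>\()?\d{3}(?(open)\))-\d{3}-\d{4}$"
        Return Regex.IsMatch(input, pattern)
    End Function

    Sub Main()
        ' Plain alternation: finds the first animal name in the text.
        Console.WriteLine(Regex.Match("my dog chased a cat", "cat|dog|tiger").Value)
        ' Alternatives are tried left to right at each position,
        ' so the shorter "cat" wins over "category" here.
        Console.WriteLine(Regex.Match("category", "cat|category").Value)
        Console.WriteLine(IsBalancedPhone("(415)-555-1234"))
        Console.WriteLine(IsBalancedPhone("(415-555-1234"))
    End Sub
End Module
```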

  13. Grouping Constructs Provides information on grouping constructs that cause a regular expression to capture groups of subexpressions.

  14. • ( ) Captures the matched substring (or acts as a noncapturing group; for more information, see the ExplicitCapture option in Regular Expression Options). Captures using ( ) are numbered automatically based on the order of the opening parenthesis, starting from one. Capture element number zero is the text matched by the whole regular expression pattern.
     • (?<name> ) Captures the matched substring into a group name or number name. The string used for name must not contain any punctuation and cannot begin with a number. You can use single quotes instead of angle brackets; for example, (?'name' ).
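
Numbered and named captures are read through the Groups property of a Match. The sketch below uses made-up input strings and hypothetical names (GroupDemo, ExtractYear).

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupDemo
    ' Hypothetical helper: returns the text captured by the named group
    ' "year", or an empty string when the pattern does not match.
    Function ExtractYear(ByVal input As String) As String
        Dim m As Match = Regex.Match(input, "(?<month>\d{2})-(?<day>\d{2})-(?<year>\d{4})")
        If m.Success Then Return m.Groups("year").Value
        Return ""
    End Function

    Sub Main()
        Dim m As Match = Regex.Match("call 555-1234 now", "(\d{3})-(\d{4})")
        Console.WriteLine(m.Groups(0).Value) ' group 0: the whole match
        Console.WriteLine(m.Groups(1).Value) ' group 1: first parenthesized part
        Console.WriteLine(m.Groups(2).Value) ' group 2: second parenthesized part
        Console.WriteLine(ExtractYear("due 03-15-2024"))
    End Sub
End Module
```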

  15. Matches Method
     • Matches: takes a string as input and returns a MatchCollection object that contains 0 or more Match objects.
     • Match object properties and method:
       • Value
       • Index
       • Length
       • NextMatch
     • MatchCollection properties:
       • Count
       • Item(index)
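
The members listed above can be sketched in a console program. Describe is a hypothetical helper that reports each match as value@index+length.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module MatchesDemo
    ' Hypothetical helper: lists every match as "value@index+length".
    Function Describe(ByVal input As String, ByVal pattern As String) As String
        Dim parts As New List(Of String)
        For Each m As Match In Regex.Matches(input, pattern)
            parts.Add(String.Format("{0}@{1}+{2}", m.Value, m.Index, m.Length))
        Next
        Return String.Join(";", parts.ToArray())
    End Function

    Sub Main()
        Dim mc As MatchCollection = Regex.Matches("one two", "\w+")
        Console.WriteLine(mc.Count)         ' number of matches found
        Console.WriteLine(mc.Item(0).Value) ' first match, by index
        Console.WriteLine(Describe("one two", "\w+"))
    End Sub
End Module
```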

  16. Imports System.Text.RegularExpressions

      Public Class Form1
          Inherits System.Windows.Forms.Form

          ' Lists every match of the pattern in txtRE against the text in txtSource.
          Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
              Dim re As Regex
              re = New Regex(txtRE.Text)
              Dim source As String
              source = txtSource.Text
              Dim mc As MatchCollection = re.Matches(source)
              Dim m As Match
              Dim result As String = ""
              For Each m In mc
                  result = result & m.Value & vbCrLf
              Next
              txtMatches.Text = result
          End Sub
      End Class

  17. Match Method
     • Returns the first Match object:
     • Dim m As Match = re.Match(source)
     • Do While m.Success
     • …
     • m = m.NextMatch()
     • Loop

  18. IsMatch Method
     • Checks if the pattern is contained in the source string:
     • If re.IsMatch(source) Then
     • …
     • End If

  19. Replace Method
     • Replaces the portions of the source string that match the regular expression.
     • re.Replace(source, newString)
     • Dim re As Regex
     • re = New Regex(txtRE.Text)
     • Dim source As String
     • source = txtSource.Text
     • txtSource.Text = re.Replace(source, txtReplace.Text)
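
Outside a form, the same Replace call can be sketched with fixed patterns. The two helpers below are hypothetical examples and not part of the slides. One collapses runs of white space into a single space and the other masks every digit.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo
    ' Hypothetical helper: replaces each run of white space with one space.
    Function CollapseSpaces(ByVal source As String) As String
        Dim re As New Regex("\s+")
        Return re.Replace(source, " ")
    End Function

    ' Hypothetical helper: replaces every decimal digit with "#".
    Function MaskDigits(ByVal source As String) As String
        Dim re As New Regex("\d")
        Return re.Replace(source, "#")
    End Function

    Sub Main()
        Console.WriteLine(CollapseSpaces("a   b    c"))
        Console.WriteLine(MaskDigits("PIN 1234"))
    End Sub
End Module
```

Replace returns a new string and leaves the source string unchanged, which is why the slide assigns the result back to txtSource.Text.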

  20. Validating Input Format with Regular Expressions
     • Date format (^ and $ anchor the pattern so that the whole input must match):
       • ^\d{2}-\d{2}-\d{2}$
       • ^\d{2}-\d{2}-(\d{2}|\d{4})$
     • Phone number:
       • ^\(\d{3}\)-\d{3}-\d{4}$
     • EmpID begins with E followed by 3 digits:
       • ^E\d{3}$
     • 5 lower or upper case letters:
       • ^[a-zA-Z]{5}$
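
The formats on this slide can be wrapped in small validation functions built on IsMatch. This is a sketch with hypothetical function names. Each pattern is anchored with ^ and $ so that the whole input has to fit the format, since IsMatch otherwise succeeds when the pattern matches anywhere in the input.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ValidationDemo
    ' Two-digit month and day, then a two- or four-digit year.
    Function IsValidDate(ByVal input As String) As Boolean
        Return Regex.IsMatch(input, "^\d{2}-\d{2}-(\d{2}|\d{4})$")
    End Function

    ' Phone number written as (ddd)-ddd-dddd.
    Function IsValidPhone(ByVal input As String) As Boolean
        Return Regex.IsMatch(input, "^\(\d{3}\)-\d{3}-\d{4}$")
    End Function

    ' Employee ID: the letter E followed by exactly three digits.
    Function IsValidEmpId(ByVal input As String) As Boolean
        Return Regex.IsMatch(input, "^E\d{3}$")
    End Function

    Sub Main()
        Console.WriteLine(IsValidDate("01-02-1999"))
        Console.WriteLine(IsValidDate("01-02-199"))
        Console.WriteLine(IsValidPhone("(415)-555-1234"))
        Console.WriteLine(IsValidEmpId("E1234"))
    End Sub
End Module
```

These patterns check only the shape of the input; a value such as 99-99-9999 still passes the date check.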

  21. Validating Input Value with Regular Expressions
     • Allowable values (anchored so that only an exact value is accepted):
       • ^(San Francisco|Los Angeles|Taipei)$

  22. Searching with Regular Expressions
     • Pattern search:
       • \D\d
       • \w+
     • Value search:
       • http

  23. ' Button1 lists all matches; Button2 replaces them. Try/Catch reports invalid patterns.
      Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
          Try
              Dim re As Regex
              re = New Regex(txtRE.Text)
              Dim source As String
              source = txtSource.Text
              Dim mc As MatchCollection = re.Matches(source)
              Dim m As Match
              Dim result As String = ""
              For Each m In mc
                  result = result & m.Value & vbCrLf
              Next
              txtMatches.Text = result
          Catch ex As System.Exception
              MessageBox.Show(ex.Message)
          End Try
      End Sub

      Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
          Try
              Dim re As Regex
              re = New Regex(txtRE.Text)
              Dim source As String
              source = txtSource.Text
              txtSource.Text = re.Replace(source, txtReplace.Text)
          Catch ex As System.Exception
              MessageBox.Show(ex.Message)
          End Try
      End Sub

  24. IsMatch Method
      ' Reports whether the text in txtSource fits the pattern in txtFormat.
      Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
          Try
              Dim re As Regex
              re = New Regex(txtFormat.Text)
              Dim source As String
              source = txtSource.Text
              If re.IsMatch(source) Then
                  MessageBox.Show("valid")
              Else
                  MessageBox.Show("not valid")
              End If
          Catch ex As SystemException
              MessageBox.Show(ex.Message)
          End Try
      End Sub
