Text manipulation • Suppose you want to build a web-page which will always contain the latest sports headlines collected from several newspaper websites • You might, for example, wish to include the Guardian’s sports headlines on your page
Adding these headlines manually • You would have to access the source of the Guardian page
You would then have to find the text which defines the headlines • Analyse it • And copy the relevant bits into the HTML for your own web-page
Examining it, we find that the source contains one HTML table for each sport in the list of top stories • Here is the table for the tennis headlines on the page seen earlier: <table cellspacing="0"><tr> <td class="imgholder"><a HREF="/tennis/story/0,10069,1581862,00.html"><img src="http://image.guardian.co.uk/sys-images/Sport/Pix/pictures/2005/09/30/andy2.jpg" width="128" height="128" border="0" alt="Andy Murray in action during his win over Robby Ginepri" /></a></td> <td><font face="Geneva, Arial, Helvetica, sans-serif" size="2"><span class="mainlink"><a HREF="/tennis/story/0,10069,1581862,00.html">Murray magic books semi spot</a></span><br /><b>Tennis:</b> The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open. <br /> <a HREF="/tennis/story/0,10069,1580918,00.html">Tough home Davis Cup tie for GB</a><br /> <a HREF="/tennis/0,10067,495916,00.html">More tennis</a></font></td> </tr></table>
Here is the text which defines the main tennis headline on the page shown earlier: <span class="mainlink"> <a HREF="/tennis/story/0,10069,1581862,00.html"> Murray magic books semi spot </a></span> <br /> <b>Tennis:</b> The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open.
To get this story onto your own web-page you could then copy the relevant HTML segment into the source code for your web-page • But … • … doing this manually is very labour-intensive • We ought to automate the complete task
Adding headlines automatically • To add headlines automatically, you would have to write a program which would • Download the source code for the Guardian page • Analyse this source code to extract the appropriate text • Add the relevant text to the source code for your own web-page
Adding headlines automatically • Later, we will see how to download page sources from other websites • Now, we will focus on the issue of text analysis
Regular Expressions • Regular expression technology provides a convenient way of searching strings for patterns of interest
Regular expressions (contd.) • Example regular expression: /ab*c/ this searches the target string for substring(s) that comprise “an a followed by zero or more instances of b followed by a c” • It will match any of the following substrings: ac abc abbc abbbc ….
Using regular expressions in PHP • Regular expressions are supported in several languages, including PHP • PHP provides a group of pre-defined functions for using them • For now, we will focus on just one of these, the preg_replace function
The preg_replace function • Format of call: preg_replace (regexp, replacement, subject [, int limit]) • This function returns the result of replacing substrings in subject which match regexp with replacement • The number of matching substrings which are replaced is controlled by the optional parameter limit • An example application is on the next slide
Regular expressions (contd.) • PHP code <?php $myString = "xyzacklmabbcpqrabbbbbcstu"; echo "myString is $myString <br />"; $myString = preg_replace("/ab*c/","_",$myString); echo "myString is now $myString"; ?> • Resultant output is myString is xyzacklmabbcpqrabbbbbcstu myString is now xyz_klm_pqr_stu
Using the limit parameter in preg_replace • PHP code <?php $myString = "xyzacklmabbcpqrabbbbbcstu"; echo "myString is $myString <br />"; $myString = preg_replace("/ab*c/","_",$myString,1); echo "myString is now $myString"; ?> • Resultant output is myString is xyzacklmabbcpqrabbbbbcstu myString is now xyz_klmabbcpqrabbbbbcstu
Meta-characters • We have seen that certain characters have a special meaning in regular expressions: • the example on the last few slides used the * character which means “0 or more instances of the preceding character or pattern” • These are called meta-characters • Other meta-characters are listed on the next slide
The meta-characters include: • the * character, which means “0 or more instances of preceding” • the + character, which means “1 or more instances of preceding” • the ? character, which means “0 or 1 instances of preceding” • the { and } characters delimit an expression specifying a range of acceptable occurrences of the preceding character • Examples: {m} means exactly m occurrences of preceding character/pattern; {m,} means at least m occurrences of preceding char/pattern; {m,n} means at least m, but not more than n, occurrences of preceding char/pattern • Thus, {0,} is equivalent to *; {1,} is equivalent to +; {0,1} is equivalent to ?
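The equivalences listed above can be checked directly with preg_replace. The following is a minimal sketch; the subject string and variable names are invented for illustration and are not part of the original slides.

```php
<?php
// Illustrative subject string (an assumption made for this sketch)
$subject = "ac abc abbc abbbc";

// * and {0,} both mean "zero or more b characters"
$star      = preg_replace("/ab*c/", "_", $subject);
$starBrace = preg_replace("/ab{0,}c/", "_", $subject);

// + and {1,} both mean "one or more b characters"
$plus      = preg_replace("/ab+c/", "_", $subject);
$plusBrace = preg_replace("/ab{1,}c/", "_", $subject);

// ? and {0,1} both mean "zero or one b character"
$opt       = preg_replace("/ab?c/", "_", $subject);
$optBrace  = preg_replace("/ab{0,1}c/", "_", $subject);

// {2} means "exactly two b characters"
$exactlyTwo = preg_replace("/ab{2}c/", "_", $subject);

echo "star: $star <br />";        // _ _ _ _
echo "plus: $plus <br />";        // ac _ _ _
echo "optional: $opt <br />";     // _ _ abbc abbbc
echo "exactly two: $exactlyTwo";  // ac abc _ abbbc
?>
```

Each short form and its brace form produce the same string, which is what the equivalences predict.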
Regular expressions (contd.) • Further meta-characters are: • the ^ character, which matches the start of a string • the $ character, which matches the end of a string • the . character, which matches anything except a newline character • the [ and ] characters delimit an equivalence class of characters, any of which can match one character in the target string • the ( and ) characters delimit a group of sub-patterns • the | character separates alternative patterns
Regular expressions (contd.) • Example expression: /^a.*d$/ this matches the entire target string provided the target string starts with an a, followed by zero or more non-newline characters, and ends with a d • An example application is on the next slide
Example application • PHP code <?php $myString1 = "abcdefghijklmnopqrstuvd"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/^a.*d$/","_",$myString1); echo "myString1 is now $myString1 <br />"; $myString2 = "xabcdefghijklmnopqrstuvd"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/^a.*d$/","_",$myString2); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is abcdefghijklmnopqrstuvd myString1 is now _ myString2 is xabcdefghijklmnopqrstuvd myString2 is now xabcdefghijklmnopqrstuvd
Regular expressions (contd.) • Example expression: /^a.{2,5}d$/ this matches the entire target string, provided the target string starts with an a, followed by between two and five non-newline characters, and ends with a d • An example application is on the next slide
Regular expressions (contd.) • PHP code <?php $myString1 = "adabbbbccccaaaabbbbccccd"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/^a.{2,5}d$/","_",$myString1); echo "myString1 is now $myString1 <br />"; $myString2 = "afghd"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/^a.{2,5}d$/","_",$myString2); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is adabbbbccccaaaabbbbccccd myString1 is now adabbbbccccaaaabbbbccccd myString2 is afghd myString2 is now _
Regular expressions (contd.) • Example regular expression: /(abc){2,5}d/ this matches sub-string(s) in the target that comprise “between 2 and 5 repeats of the pattern abc followed by a d” • An example application is on the next slide
Regular expressions (contd.) • PHP code <?php $myString = "klmabcabcabcdpqrabcdklmabcabcabcabcdxyz"; echo "myString is $myString <br />"; $myString = preg_replace("/(abc){2,5}d/","_",$myString); echo "myString is now $myString"; ?> • Resultant output is myString is klmabcabcabcdpqrabcdklmabcabcabcabcdxyz myString is now klm_pqrabcdklm_xyz
Regular expressions (contd.) • Example regular expression: /(foo|bar)/ this matches sub-strings foo or bar • An example application is on the next slide
Regular expressions (contd.) • PHP code <?php $myString = "abcfoodefbarghi"; echo "myString is $myString <br />"; $myString = preg_replace("/(foo|bar)/","_",$myString); echo "myString is now $myString"; ?> • Resultant output is myString is abcfoodefbarghi myString is now abc_def_ghi
Regular expressions (contd.) • Although some characters have special meanings in regular expressions, we may sometimes just want to use them to match themselves in the target string • We do this by escaping them in the regular expression, by preceding them with a backslash \ • Example regular expression: /^a\^+.*d$/ this matches the entire target string, provided the target string starts with an a, followed by one or more caret characters, followed by zero or more non-newline characters, and ends with a d • An example application is on the next slide
Example application • PHP code <?php $myString1 = "adabbbbcabbcabced"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/^a\^+.*d$/","_",$myString1); echo "myString1 is now $myString1 <br />"; $myString2 = "a^^^abbbbcabbcabceed"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/^a\^+.*d$/","_",$myString2); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is adabbbbcabbcabced myString1 is now adabbbbcabbcabced myString2 is a^^^abbbbcabbcabceed myString2 is now _
Regular expressions (contd.) • As mentioned earlier, the [ and ] characters have a special meaning in regular expressions • they delimit an equivalence class of characters, any one of which may be used to match one character in the target string • Example regular expression: /a[KLM]b/ matches any substring comprising “the letter a followed by one of the three letters KLM, followed by the letter b”
Regular expressions (contd.) • The ^ character has a special meaning when used as the first character between [ and ] characters; this meaning is different from its special meaning when used outside the [ and ] characters • when used as the first character between the [ and ] characters, the ^ character specifies the complement of the equivalence class that would have been specified if it were absent • Example regular expression: /a[^KLM]b/ matches any substring comprising “the letter a followed by any single character that is not one of KLM, followed by the letter b”
Regular expressions (contd.) • The - character also has a special meaning when used between [ and ] characters: • it is used to join the start and end of a sequence of characters, any one of which may be used to match one character in the target string • Example regular expression: /a[0-9]b/ matches any substring comprising “the letter a followed by one digit, followed by the letter b”
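The three forms of equivalence class described on the last few slides (a plain list, a complemented list, and a range) can be compared side by side. This is a small sketch with an invented test string; the variable names are illustrative only.

```php
<?php
// Illustrative test string (an assumption made for this sketch)
$myString = "aKb aXb a7b aLb a_b";

// [KLM] : a, then one of K, L or M, then b
$inClass = preg_replace("/a[KLM]b/", "_", $myString);

// [^KLM] : a, then any single character except K, L or M, then b
$notInClass = preg_replace("/a[^KLM]b/", "_", $myString);

// [0-9] : a, then one digit, then b
$digit = preg_replace("/a[0-9]b/", "_", $myString);

echo "myString is $myString <br />";
echo "with [KLM]: $inClass <br />";      // _ aXb a7b _ a_b
echo "with [^KLM]: $notInClass <br />";  // aKb _ _ aLb _
echo "with [0-9]: $digit";               // aKb aXb _ aLb a_b
?>
```

Note that [^KLM] accepts the digit 7 and the underscore as well as the letter X, because the complement covers every character outside the class, not only letters.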
Regular expressions (contd.) • Example regular expression: /%[a-fA-F0-9]/ matches any substring comprising “a % followed by a hexadecimal digit”
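As a quick illustration of the hexadecimal class, the sketch below applies it to an invented string. The second call combines the class with the {2} quantifier from the meta-character slides; that combination is an addition made for this sketch, not something taken from the original example.

```php
<?php
// Illustrative test string (an assumption made for this sketch)
$myString = "100%25 sure, 50%zz off, path%2Fname";

// A % followed by ONE hexadecimal digit
$oneDigit = preg_replace("/%[a-fA-F0-9]/", "_", $myString);

// A % followed by exactly TWO hexadecimal digits
$twoDigits = preg_replace("/%[a-fA-F0-9]{2}/", "_", $myString);

echo "myString is $myString <br />";
echo "one hex digit: $oneDigit <br />";  // 100_5 sure, 50%zz off, path_Fname
echo "two hex digits: $twoDigits";       // 100_ sure, 50%zz off, path_name
?>
```

The text %zz is left alone in both cases because z is not a hexadecimal digit.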
Regular expressions (contd.) • Certain escape sequences also have a special meaning in regular expressions. They define certain commonly used equivalence classes of characters: \w is equivalent to [a-zA-Z0-9_]; \W is equivalent to [^a-zA-Z0-9_]; \d is equivalent to [0-9]; \D is equivalent to [^0-9]; \s is equivalent to [ \n\t\f\r]; \S is equivalent to [^ \n\t\f\r]; \b denotes a word boundary; \B denotes a non-word boundary • Note the SP (space) characters in the meaning of \s and \S; that is, the white-space equivalence class includes SP • By the way, \f is form feed and \r is carriage return
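The \b and \B sequences differ from the others in that they match a position rather than a character. The sketch below, using an invented test string, may help to show the difference. The patterns are written in single-quoted PHP strings here, a choice made for this sketch so that the backslashes reach the regular expression engine unchanged.

```php
<?php
// Illustrative test string (an assumption made for this sketch)
$myString = "cat concatenate cat's category";

// No boundaries: every occurrence of cat is replaced
$anywhere = preg_replace('/cat/', "_", $myString);

// \b on both sides: only cat standing as a whole word is replaced
$wholeWord = preg_replace('/\bcat\b/', "_", $myString);

// \B on both sides: only cat buried inside a longer word is replaced
$inside = preg_replace('/\Bcat\B/', "_", $myString);

echo "anywhere: $anywhere <br />";     // _ con_enate _'s _egory
echo "whole word: $wholeWord <br />";  // _ concatenate _'s category
echo "inside a word: $inside";         // cat con_enate cat's category
?>
```

The apostrophe in cat's is not a word character, so there is a word boundary between the t and the apostrophe.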
Regular expressions (contd.) • Example regular expression: /%\d\d\d\D/ matches any substring comprising “a % followed by three decimal digits, followed by a non-digit” • Example regular expression: /\s\w\w\s/ matches any substring comprising “a white-space character, followed by two word characters, followed by another white-space character”
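The first of these expressions can be tried on a short invented string, as sketched below. The second call rewrites \d\d\d as \d{3}, which should behave identically; single-quoted pattern strings are used here as a choice made for this sketch.

```php
<?php
// Illustrative test string (an assumption made for this sketch)
$myString = "%123x %12x %1234 %999.";

// A %, then three decimal digits, then one non-digit
$result = preg_replace('/%\d\d\d\D/', "_", $myString);

// The same pattern written with a quantifier
$resultBrace = preg_replace('/%\d{3}\D/', "_", $myString);

echo "myString is $myString <br />";
echo "myString is now $result";  // _ %12x %1234 _
?>
```

Here %12x fails because only two digits follow the %, and %1234 fails because the character after the third digit is itself a digit.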
Regular expressions (contd.) • PHP code <?php $myString = "This is not an apple"; echo "myString is $myString <br />"; $myString = preg_replace("/\s\w\w\s/","_",$myString); echo "myString is now $myString"; ?> • Resultant output is myString is This is not an apple myString is now This_not_apple
Regular expressions (contd.) • The standard quantifiers are all “greedy” • they match as many occurrences as possible without causing the pattern to fail • It is possible to make them “frugal” • that is, make them match the minimum number of times necessary • We do this by following the quantifier with a "?" • *? Match 0 or more times, preferably only 0 • +? Match 1 or more times, preferably only 1 time • ?? Match 0 or 1 time, preferably only 0 • {n}? Match exactly n times • {n,}? Match at least n times, preferably only n times • {n,m}? Match at least n but not more than m times, preferably only n times
Regular expressions (contd.) • PHP code <?php $myString1 = "abcabcabcabc"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/(abc){2,5}/","x",$myString1); echo "myString1 is now $myString1 <br />"; $myString2 = "abcabcabcabc"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/(abc){2,5}?/","x",$myString2); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is abcabcabcabc myString1 is now x myString2 is abcabcabcabc myString2 is now xx • What is going on here? See next slide for contrast
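Regular expressions (contd.) • PHP code <?php $myString1 = "abcabcabcabc"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/(abc){2,5}/","x",$myString1,1); echo "myString1 is now $myString1 <br />"; $myString2 = "abcabcabcabc"; echo "myString2 is $myString2 <br />"; $myString2 = preg_replace("/(abc){2,5}?/","x",$myString2,1); echo "myString2 is now $myString2"; ?> • Resultant output is myString1 is abcabcabcabc myString1 is now x myString2 is abcabcabcabc myString2 is now xabcabc • Contrast with the previous slide: the greedy {2,5} consumes all four repeats of abc in a single match, so there is only one replacement with or without a limit • The frugal {2,5}? stops after the minimum of two repeats, so the unlimited call on the previous slide finds a second match in the remaining abcabc (giving xx), while a limit of 1 leaves that remainder untouched (giving xabcabc)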
A digression • Before proceeding to further regexp concepts, let’s look at applying to HTML manipulation what we have already seen
Example task • Suppose we have the following HTML <ul><li>wine</li><li>f12</li><li>cheese</li></ul> • Suppose we want to eliminate from the list any list item whose content comprises only non-digits • That is, we want the HTML to become <ul><li>f12</li></ul>
Regular expressions (contd.) • PHP code <?php $myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>"; echo "myString is $myString <br />"; $myString = preg_replace("/<li>\D+<\/li>/","",$myString); echo "myString is now $myString <br />"; ?> • Resultant output is myString is • wine • f12 • cheese myString is now • f12
Seeing the raw HTML • Suppose we want to see the raw HTML in our output • That is, suppose we wanted to see myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul><li>f12</li></ul> • We would have to replace all occurrences of < with &lt; • We could use regular expressions for this but, • the string to be replaced is a constant • so we can use a simpler technology
Regular expressions (contd.) • PHP code <?php $myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>"; echo "myString is ".str_replace("<","&lt;",$myString)."<br>"; $myString = preg_replace("/<li>\D+<\/li>/","",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Now the resultant output is myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul><li>f12</li></ul>
Suppose we want to replace every list item with the fixed phrase listItem • That is, we wanted to see this output myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul> listItem listItem listItem </ul>
Regular expressions (contd.) • Suppose we try this <?php $myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>"; echo "myString is ".str_replace("<","&lt;",$myString)."<br>"; $myString = preg_replace("/<li>.+<\/li>/"," listItem ",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Resultant output is myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul> listItem </ul> • What is wrong? • The greedy + runs from the first <li> to the last </li>, so the whole list is treated as one match • We need to make the + quantifier ungreedy
Regular expressions (contd.) • We must do this <?php $myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>"; echo "myString is ".str_replace("<","&lt;",$myString)."<br>"; $myString = preg_replace("/<li>.+?<\/li>/"," listItem ",$myString); echo "myString is now ".str_replace("<","&lt;",$myString); ?> • Resultant output is myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul> listItem listItem listItem </ul>
End of digression • Back to regular expressions ...
Regular expressions (contd.) -- remembering subpattern matches • When a pattern is being matched with a target string, substrings that match sub-patterns can be remembered and re-used later in the same pattern • Sub-patterns whose matching substrings are to be remembered are enclosed in parentheses • The sub-patterns are implicitly numbered, starting from 1, and their matching substrings can then be re-used later in the pattern by using back-references like \1 or \2 or \3 • However, inside a double-quoted PHP string a backslash followed by a digit is treated as an escape sequence, so we must escape the backslash itself and type \\1 or \\2 or \\3 in our regular expressions
Using back-references (contd.) • PHP code <?php $myString1 = "klmAklmAAklmABklmBklmBBklm"; echo "myString1 is $myString1 <br />"; $myString1 = preg_replace("/([A-Z])\\1/","_",$myString1); echo "myString1 is now $myString1 "; ?> • Resultant output is myString1 is klmAklmAAklmABklmBklmBBklm myString1 is now klmAklm_klmABklmBklm_klm
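Back-references can also be combined with the HTML manipulation seen in the digression. The following sketch, which uses an invented HTML fragment, replaces only those elements whose closing tag name is the same as the opening tag name. It is an illustration of the back-reference idea under these assumptions and is not a robust way of processing HTML in general.

```php
<?php
// Illustrative HTML fragment (an assumption made for this sketch);
// the middle element is deliberately mismatched: <i> closed by </b>
$myString = "<b>bold</b> <i>mixed</b> <em>stress</em>";

// ([a-z]+) remembers the opening tag name,
// .*? frugally matches the element content,
// \\1 (that is, \1 in the regexp) demands the same name in the closing tag
$result = preg_replace("/<([a-z]+)>.*?<\/\\1>/", "_", $myString);

echo "myString is ".str_replace("<","&lt;",$myString)."<br />";
echo "myString is now ".str_replace("<","&lt;",$result);
// myString is now _ <i>mixed</b> _
?>
```

The mismatched element survives because no </i> follows the <i>, so the back-reference can never be satisfied at that position.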