
Algorithm Design and Analysis (ADA)



  1. Algorithm Design and Analysis (ADA), 242-535, Semester 1, 2013-2014. 13. String Searching. Objective: to discuss four kinds of string searching.
  2. Overview 1. What is String Searching? 2. The Brute Force Algorithm 3. The Boyer-Moore Algorithm 4. The Knuth-Morris-Pratt Algorithm 5. The Rabin-Karp Algorithm 6. Summary of the Implementations 7. Hashing Reminder
  3. 1. What is String Searching? Definition: given a text string T and a search string (pattern) P, find P inside T T: “the rain in spain stays mainly on the plain” P: “n th” Applications: text editors, Web search engines (e.g. Google), image analysis
  4. String Concepts Assume S is a string of size m. A substring S[i .. j] of S is the string fragment between indexes i and j. A prefix of S is a substring S[0 .. i] (the "start of S"). A suffix of S is a substring S[i .. m-1] (the "end of S"). i is any index between 0 and m-1.
  5. Examples S = "andrew" (indexes 0 to 5). Substring S[1..3] == "ndr". All possible prefixes of S: "andrew", "andre", "andr", "and", "an", "a". All possible suffixes of S: "andrew", "ndrew", "drew", "rew", "ew", "w".
  6. 2. The Brute Force Algorithm Check each position in the text T to see if the pattern P starts in that position. For example, with T = "andrew" and P = "rew", P moves 1 char at a time through T until a match is found or T is exhausted.
  7. Brute Force Code Return index where pattern starts, or n:

    int brute(String text, String pattern)
    {
      int n = text.length();      // n is length of text
      int m = pattern.length();   // m is length of pattern
      for (int i = 0; i <= (n-m); i++) {
        int j;
        for (j = 0; j < m; j++)
          if (text.charAt(i+j) != pattern.charAt(j))
            break;
        if (j == m)
          return i;   // match at i
      }
      return n;   // no match
    }  // end of brute()
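The method above can be checked with a small harness (the class wrapper and sample strings are mine, not from the slides):

```java
public class BruteDemo {
    // Brute-force search: return index where pattern starts, or text length if absent.
    static int brute(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        for (int i = 0; i <= n - m; i++) {
            int j;
            for (j = 0; j < m; j++)
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;              // mismatch: slide the pattern right by 1
            if (j == m)
                return i;               // all m chars matched starting at i
        }
        return n;                       // no match: return n, as in the slides
    }

    public static void main(String[] args) {
        System.out.println(brute("andrew", "rew"));   // 3
        System.out.println(brute("andrew", "xyz"));   // 6 (== n, no match)
    }
}
```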
  8. Example 2 Text: A B A C A D A B R A C Pattern: A B R A
  9. Running Time Brute force pattern matching runs in time O(mn) in the worst case; more precisely, O((n-m+1)*m). But most searches of ordinary text take O(m+n), which is very quick. The brute force algorithm is fast when the alphabet of the text is large, e.g. A..Z, a..z, 0..9, etc. It is slower when the alphabet is small, e.g. 0, 1 (as in binary files, image files, etc.) continued
  10. Worst case Example Text: A A A A A A A A A B Pattern: A A A A B
  11. 3. The Boyer-Moore Algorithm The Boyer-Moore pattern matching algorithm is based on two techniques. 1. The looking-glass technique: find P in T by moving backwards through P, starting at its end
  12. 2. The character-jump technique: when a mismatch occurs at T[i] == x (the character in pattern P[j] is not the same as T[i]), there are 3 possible cases, tried in order.
  13. Case 1 If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i]. Then move i and j right, so j is at the end of P again.
  14. Case 2 If P contains x somewhere, but a shift right to the last occurrence is not possible (x's last occurrence is after position j), then shift P right by 1 character to T[i+1]. Then move i and j right, so j is at the end of P again.
  15. Case 3 If cases 1 and 2 do not apply (there is no x in P), then shift P to align P[0] with T[i+1]. Then move i and j right, so j is at the end of P again.
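All three cases above reduce to a single shift: i advances by m - min(j, 1 + L(x)), which is what the bmMatch code on the later slides computes. A minimal sketch, with illustrative values of my choosing:

```java
public class BmShiftDemo {
    // Character-jump: after a mismatch at pattern index j on text char x,
    // i advances by m - min(j, 1 + lo), where lo is the last occurrence of x
    // in the pattern (-1 if x is absent). This one formula covers all 3 cases.
    static int advance(int m, int j, int lo) {
        return m - Math.min(j, 1 + lo);
    }

    public static void main(String[] args) {
        int m = 4;                              // pattern length
        System.out.println(advance(m, 3, 1));   // case 1: align last occurrence -> 2
        System.out.println(advance(m, 1, 3));   // case 2: lo past j, so P slides right by 1 -> 3
        System.out.println(advance(m, 2, -1));  // case 3: x not in P, P[0] jumps to T[i+1] -> 4
    }
}
```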
  16. Boyer-Moore Example (1) (the worked T/P trace was shown as a figure in the original slides)
  17. Last Occurrence Function Boyer-Moore's algorithm preprocesses the pattern P and the alphabet A to build a last occurrence function L(). L() maps all the letters in A to integers. L(x) is defined as the largest index i such that P[i] == x, or -1 if no such index exists (x is a letter in A).
  18. L() Example P: "abacab" (indexes 0 1 2 3 4 5), A = {a, b, c, d}

    x     a   b   c   d
    L(x)  4   5   3  -1

L() stores indexes into P[].
  19. Note In Boyer-Moore code, L() is calculated when the pattern P is read in. Usually L() is stored as an array, something like the table in the previous slide.
  20. Boyer-Moore Example (2)

    x     a   b   c   d
    L(x)  4   5   3  -1

(the worked T/P trace was shown as a figure in the original slides)
  21. Boyer-Moore Code Return index where pattern starts, or n:

    int bmMatch(String text, String pattern)
    {
      int last[] = buildLast(pattern);
      int n = text.length();
      int m = pattern.length();
      int i = m-1;
      if (i > n-1)
        return n;   // no match if pattern is longer than text
      :
  22.
      int j = m-1;
      do {
        if (pattern.charAt(j) == text.charAt(i)) {
          if (j == 0)
            return i;   // match
          else {        // looking-glass technique
            i--; j--;
          }
        }
        else {          // character jump technique
          int lo = last[text.charAt(i)];   // last occurrence
          i = i + m - Math.min(j, 1+lo);
          j = m - 1;
        }
      } while (i < n);
      return n;   // no match
    }  // end of bmMatch()
  23.
    int[] buildLast(String pattern)
    /* Return array storing index of last occurrence
       of each ASCII char in pattern. */
    {
      int last[] = new int[128];   // ASCII char set
      for (int i=0; i < 128; i++)
        last[i] = -1;   // initialize array
      for (int i = 0; i < pattern.length(); i++)
        last[pattern.charAt(i)] = i;
      return last;
    }  // end of buildLast()
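The two methods above can be assembled into a self-contained class for a quick test (the harness and sample strings are mine, not from the slides):

```java
public class BoyerMooreDemo {
    // Build last-occurrence table for the ASCII alphabet.
    static int[] buildLast(String pattern) {
        int[] last = new int[128];
        for (int i = 0; i < 128; i++) last[i] = -1;
        for (int i = 0; i < pattern.length(); i++)
            last[pattern.charAt(i)] = i;        // index of last occurrence
        return last;
    }

    // Boyer-Moore search: return index where pattern starts, or n.
    static int bmMatch(String text, String pattern) {
        int[] last = buildLast(pattern);
        int n = text.length();
        int m = pattern.length();
        int i = m - 1;
        if (i > n - 1) return n;                // pattern longer than text
        int j = m - 1;
        do {
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == 0) return i;           // match found at i
                i--; j--;                       // looking-glass: compare leftwards
            } else {                            // character-jump
                int lo = last[text.charAt(i)];
                i = i + m - Math.min(j, 1 + lo);
                j = m - 1;
            }
        } while (i < n);
        return n;                               // no match
    }

    public static void main(String[] args) {
        System.out.println(bmMatch("find a needle in a haystack", "needle")); // 7
        System.out.println(bmMatch("aaaa", "b"));                             // 4 (no match)
    }
}
```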
  24. Example 2 Text: F I N D I N A H A Y S T A C K N E E D L E Pattern: N E E D L E
  25. Running Time Boyer-Moore worst case running time is O(mn + |A|). But Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small: e.g. good for English text, poor for binary. Boyer-Moore is significantly faster than brute force for searching English text.
  26. Worst Case Example T: "aaaaa…a" P: "baaaaa" (the trace was shown as a figure in the original slides)
  27. 4. The KMP Algorithm The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm). But it shifts the pattern more intelligently than the brute force algorithm. continued
  28. If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons? Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]
  29. Example (shown as a figure in the original slides: a mismatch at j = 5 lets the pattern shift so that jnew = 2)
  30. Why With j == 5, find the largest prefix (start) of "a b a a b" ( P[0..j-1] ) which is also a suffix (end) of "b a a b" ( P[1..j-1] ). Answer: "a b". Set j = 2 // the new j value
  31. KMP Failure Function KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself. j = mismatch position in P[] k = position before the mismatch (k = j-1). The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
  32. Failure Function Example P: "abaaba", j: 0 1 2 3 4 5

    k     0   1   2   3   4
    F(k)  0   0   1   1   2

(k == j-1) In code, F() is represented by an array, like the table. F(k) is the size of the largest prefix.
  33. Why is F(4) == 2? P: "abaaba" F(4) means find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4] = find the size of the largest prefix of "abaab" that is also a suffix of "baab" = find the size of "ab" = 2
  34. Using the Failure Function Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm. if a mismatch occurs at P[j] (i.e. P[j] != T[i]), then k = j-1; j = F(k); // obtain the new j
  35. KMP Code Return index where pattern starts, or n:

    int kmpMatch(String text, String pattern)
    {
      int n = text.length();
      int m = pattern.length();
      int fail[] = computeFail(pattern);
      int i = 0;
      int j = 0;
      :
  36.
      while (i < n) {
        if (pattern.charAt(j) == text.charAt(i)) {
          if (j == m - 1)
            return i - m + 1;   // match
          i++;
          j++;
        }
        else if (j > 0)
          j = fail[j-1];
        else
          i++;
      }
      return n;   // no match
    }  // end of kmpMatch()
  37.
    int[] computeFail(String pattern)
    {
      int fail[] = new int[pattern.length()];
      fail[0] = 0;
      int m = pattern.length();
      int j = 0;
      int i = 1;
      :
  38.
      while (i < m) {
        if (pattern.charAt(j) == pattern.charAt(i)) {
          // j+1 chars match
          fail[i] = j + 1;
          i++;
          j++;
        }
        else if (j > 0)   // j follows matching prefix
          j = fail[j-1];
        else {            // no match
          fail[i] = 0;
          i++;
        }
      }
      return fail;
    }  // end of computeFail()

Similar code to kmpMatch().
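computeFail() and kmpMatch() can likewise be wrapped for a quick check (the harness and sample strings are mine); the failure function computed for "abaaba" agrees with the earlier table:

```java
import java.util.Arrays;

public class KmpDemo {
    // Failure function: fail[k] = size of largest prefix of P[0..k]
    // that is also a suffix of P[1..k].
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int j = 0, i = 1;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;        // j+1 chars match
                i++; j++;
            } else if (j > 0) {
                j = fail[j - 1];        // fall back to shorter matching prefix
            } else {
                fail[i] = 0;            // no match
                i++;
            }
        }
        return fail;
    }

    // KMP search: return index where pattern starts, or n.
    static int kmpMatch(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        int[] fail = computeFail(pattern);
        int i = 0, j = 0;
        while (i < n) {
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1) return i - m + 1;   // match
                i++; j++;
            } else if (j > 0) {
                j = fail[j - 1];        // shift pattern, never move i backwards
            } else {
                i++;
            }
        }
        return n;                       // no match
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeFail("abaaba"))); // [0, 0, 1, 1, 2, 3]
        System.out.println(kmpMatch("the rain in spain", "ain"));   // 5
    }
}
```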
  39. Example P: "abacab" (the T/P trace was shown as a figure in the original slides)

    k     0   1   2   3   4
    F(k)  0   0   1   0   1
  40. Why is F(4) == 1? P: "abacab" F(4) means find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4] = find the size of the largest prefix of "abaca" that is also a suffix of "baca" = find the size of "a" = 1
  41. KMP Advantages KMP runs in optimal time: O(m+n), which is very fast. The algorithm never needs to move backwards in the input text, T; this makes the algorithm good for processing very large files that are read in from external devices or through a network stream.
  42. KMP Disadvantages KMP doesn't work so well as the size of the alphabet increases: there is more chance of a mismatch (more possible mismatches); mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later.
  43. KMP Extensions The basic algorithm doesn't take into account the letter in the text that caused the mismatch. (The T/P example illustrating this was shown as a figure in the original slides; basic KMP does not use that letter when deciding how far to shift.)
  44. 5. The Rabin-Karp Algorithm String search based on hashing (see section 7 for a reminder on hashing). We compute a hash for the pattern and then look for a match by using the same hash function on each possible M-character substring of the text.
  45. Note: we do not need a hash table because we only need to store the pattern's hash, and can generate the M-character text hashes at run-time. Lots of hashing can be slow, but Rabin and Karp have a fast way to compute hashes for M-character substrings in constant time.
  46. From Text to Hash A substring of length M is converted to an M-digit number (p). We require the hash function to convert this M-digit number (p) into an integer between 0 and Q-1. One solution is to use modulo: return the remainder when p is divided by Q: (p mod Q). Use a prime number for Q that is small enough to fit within a computer word, to help speed up the calculation.
  47. Example Convert the string "2 6 5 3 5" to a hash value. Q = 997 (a prime). Assume that a string can only contain digit characters, so represent it by the 5-digit number 26535. The hash value is 26535 % 997 = 613
  48. Speedy Hashing of the Text Use ti to mean the character at position i in the text; then the M-character substring of the text starting at position i is converted into the number xi using: xi = ti*R^(M-1) + ti+1*R^(M-2) + ... + ti+M-1. R is the base used for converting each character into a number (e.g. base 2 for binary data, 10 for digits, 128 for ASCII).
  49. The hash function is: hash(xi) = xi mod Q. The next M-character substring is 1 character to the right in the text. Its numerical value (xi+1) can be calculated much more quickly using: xi+1 = (xi - ti*R^(M-1)) * R + ti+M. We convert xi to xi+1 by subtracting the leading digit of xi, multiplying by R, and adding the new digit for the new character on the right (ti+M).
  50. Example The current substring is "4 1 5 9 2", and the next will be one character to the right: "1 5 9 2 6". xi == 41592 (assuming base R = 10). How is xi+1 calculated? xi+1 = (41592 - 4*10^4) * 10 + 6 = 1592*10 + 6 = 15926
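The rolling step can be written in one line of code; in this sketch the helper name `next` and its signature are mine, with the slide's values (R = 10, M = 5) hard-coded:

```java
public class RollingDemo {
    // One rolling-hash step with base R = 10 and M = 5 digits:
    // drop the leading digit of xi, shift left by one digit, append the new digit.
    static long next(long xi, int leading, int incoming) {
        final long RM1 = 10_000L;                    // R^(M-1) = 10^4
        return (xi - leading * RM1) * 10 + incoming;
    }

    public static void main(String[] args) {
        System.out.println(next(41592, 4, 6));       // 15926
    }
}
```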
  51. Modulo Properties A nice property of mod is that it is distributive over +, - and *; e.g: (a + b) mod q == ((a mod q) + (b mod q)) mod q, and (a * b) mod q == ((a mod q) * (b mod q)) mod q. xi is calculated only with these operations, and hash() is implemented with mod. This means that we do not need to store xi to calculate xi+1 and then hash; instead we can store the hash(xi) value (which is small), and use it to calculate hash(xi+1) (which will also be small).
  52. And Horner's Method In addition, we can simplify the calculation of xi by using Horner's method, which rephrases a polynomial, such as: cn*x^n + cn-1*x^(n-1) + cn-2*x^(n-2) + ... + c2*x^2 + c1*x + c0 into a nested series of multiplications and additions: ((...((cn*x + cn-1)*x + cn-2)*x + ... + c2)*x + c1)*x + c0
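Applied to the earlier example string "26535" with Q = 997, Horner's method yields the same hash value, 613. This digit-string version is my sketch; the slides' own hash() code later uses raw character codes with R = 256:

```java
public class HornerDemo {
    // Horner's method: hash of a digit string, base R = 10, modulus Q.
    // Taking mod at every step keeps intermediate values small.
    static long hash(String key, long Q) {
        long h = 0;
        for (int j = 0; j < key.length(); j++)
            h = (10 * h + (key.charAt(j) - '0')) % Q;   // nested multiply-add
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("26535", 997));   // 613, matching slide 47
    }
}
```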
  53. O(1) Hashing The result is that we can generate hash values for substrings in constant time (O(1)) We move through the text left-to-right, one character at a time, building up the hash for a M-character substring from preceding hash values. The execution speed is independent of the size of M.
  54. Hash of the Pattern Pattern: 2 6 5 3 5. Assume that text characters are digits, so set the base R = 10. Use Q = 997; the hash value is 26535 mod 997 = 613.
  55. Hashing the M-substrings Text: "3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3", M = 5, R = 10, Q = 997. (The hash values for the M-char substrings were shown as a table in the original slides.)
  56. The hash of the first M-char substring was built from the hash values for smaller substrings. Seven hash values for M-char substrings were calculated before one was found that had the same value as the hash for the pattern "2 6 5 3 5"
  57. The Problem of Collisions The preceding example ended with the pattern hash (613) matching the M-char substring hash from the text. Is that enough for the code to report "match found"? No, since a hash function might map two different strings to the same hash value (i.e. create a collision).
  58. What's To be Done? Two possible solutions: 1. when the pattern and substring hash values match, return to the original text and double-check that they are the same; 2. increase the Q value, thereby making the mapping range VERY big (e.g. Q = 2^32 - 1 is 4.3 billion); this will make the chance of a collision VERY small (1/Q), and so don't bother double-checking.
  59. Gambling Names The "hope for the best" algorithm (no. 2) is a famous example of a Monte Carlo algorithm: it has linear running time but can output an incorrect answer with a small probability (1/Q). The "double-checking" algorithm (no. 1) is known as a Las Vegas algorithm: it can be slow (but is usually almost linear) but is guaranteed correct.
  60. Rabin-Karp Code Monte Carlo version:

    long Q;        // a large prime
    int R = 256;   // alphabet size

    int RabinKarp(String txt, String pat)
    {
      Q = longRandomPrime();
      int M = pat.length();          // pattern length
      long patHash = hash(pat, M);   // pattern hash value
      int N = txt.length();          // text length
      long txtHash = hash(txt, M);   // hash of initial M-char substring of text
      :
  61.
      if (patHash == txtHash)
        return 0;   // match found at beginning of text

      long RM = 1;
      for (int i = 1; i <= M-1; i++)   // compute R^(M-1) % Q for use
        RM = (R * RM) % Q;             // in removing leading digit

      // Search for hash match in text
      for (int i = M; i < N; i++) {
        // Remove leading digit, add trailing digit, check for match.
        txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
        txtHash = (txtHash*R + txt.charAt(i)) % Q;
        if (patHash == txtHash) {
          if (doubleCheck(i - M + 1))
            return i - M + 1;   // match
        }
      }
      return N;   // no match found
    }  // end of RabinKarp()
  62.
    long hash(String key, int M)
    // Compute hash for key[0..M-1] using Horner
    {
      long h = 0;
      for (int j = 0; j < M; j++)
        h = (R * h + key.charAt(j)) % Q;
      return h;
    }

    boolean doubleCheck(int i)
    // in Monte Carlo do nothing by returning true;
    // for Las Vegas, check pat vs txt[i..i+M-1]
    { return true; }
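The pieces above can be combined into one runnable class. Since longRandomPrime() is not shown on the slides, this sketch fixes Q to a known large prime, and it uses the Las Vegas double-check (an actual string comparison) instead of the always-true stub:

```java
public class RabinKarpDemo {
    static final long Q = 1_000_000_007L;   // fixed large prime (the slides pick Q at random)
    static final int R = 256;               // alphabet size

    // Compute hash for key[0..M-1] using Horner's method.
    static long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (R * h + key.charAt(j)) % Q;
        return h;
    }

    // Rabin-Karp (Las Vegas): return index where pattern starts, or N.
    static int search(String txt, String pat) {
        int M = pat.length(), N = txt.length();
        if (M > N) return N;
        long patHash = hash(pat, M);
        long txtHash = hash(txt, M);
        if (patHash == txtHash && txt.startsWith(pat))
            return 0;                        // verified match at the beginning
        long RM = 1;
        for (int i = 1; i <= M - 1; i++)
            RM = (R * RM) % Q;               // R^(M-1) % Q, to remove leading char
        for (int i = M; i < N; i++) {
            // Remove leading char, add trailing char, check for match.
            txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
            txtHash = (txtHash * R + txt.charAt(i)) % Q;
            int start = i - M + 1;
            if (patHash == txtHash && txt.regionMatches(start, pat, 0, M))
                return start;                // hash hit confirmed against the text
        }
        return N;                            // no match found
    }

    public static void main(String[] args) {
        System.out.println(search("find a needle in a haystack", "needle")); // 7
        System.out.println(search("abc", "zzz"));                            // 3 (no match)
    }
}
```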
  63. 6. Summary of Implementations (shown as a table in the original slides). Worst-case running times from the preceding sections: brute force O(mn); Boyer-Moore O(mn + |A|); KMP O(m+n); Rabin-Karp (Monte Carlo) linear.
  64. 7. Hashing Reminder Hashing = a hash table (often an array) and a hash function. If the array is not sorted, a search requires O(n) time. If the array is sorted, a binary search requires O(log n) time. If the array is organized using hashing, then it is possible to have constant time search: O(1).
  65. A hash function takes an item and returns a number which can be used as the item's index position in the table, but the function must be carefully chosen. A hash table is often an array, so that insertion and search take O(1) time when lookup can use a known index.
  66. Example 1 The simple hash function: hash("apple") = 5, hash("watermelon") = 3, hash("grapes") = 8, hash("cantaloupe") = 7, hash("kiwi") = 0, hash("strawberry") = 9, hash("mango") = 6, hash("banana") = 2. The hash table (slots 0-9) then holds: 0 kiwi, 2 banana, 3 watermelon, 5 apple, 6 mango, 7 cantaloupe, 8 grapes, 9 strawberry.
  67. Example 2 A hash table storing (ID, Name) items; ID is a nine-digit integer. The hash table is an array of size N = 10,000. The hash function hash(ID) returns the last four digits of the ID key. E.g.: slot 1 holds (025610001, jim), slot 2 holds (981100002, ad), slot 4 holds (451220004, tim), ..., slot 9998 holds (200759998, tim); slots 0, 3, 9997 and 9999 are empty.
  68. A Real Hash Function A more realistic hash function produces: hash("apple") = 5, hash("watermelon") = 3, hash("grapes") = 8, hash("cantaloupe") = 7, hash("kiwi") = 0, hash("strawberry") = 9, hash("mango") = 6, hash("banana") = 2, hash("honeydew") = 6. Both "mango" and "honeydew" hash to slot 6. Now what?
  69. Collisions A collision occurs when two items hash to the same table index position, e.g. "mango" and "honeydew" both hash to 6. Where should we place the second item (and any others) that hash to this same index? Three popular solutions: linear probing, double hashing, bucket hashing (chaining).
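Of the three solutions, chaining is the easiest to sketch: every slot holds a list, and colliding items simply share a bucket. All names, the table size, and the toy hash function below are mine, not from the slides:

```java
import java.util.LinkedList;

public class ChainingDemo {
    static final int N = 10;                         // small table for illustration
    @SuppressWarnings("unchecked")
    static LinkedList<String[]>[] table = new LinkedList[N];

    // Toy hash function: bucket index from the built-in hashCode.
    static int hash(String key) {
        return Math.abs(key.hashCode() % N);         // 0 .. N-1
    }

    static void put(String key, String value) {
        int i = hash(key);
        if (table[i] == null) table[i] = new LinkedList<>();
        table[i].add(new String[] { key, value });   // colliding keys share bucket i
    }

    static String get(String key) {
        int i = hash(key);
        if (table[i] == null) return null;
        for (String[] pair : table[i])               // walk the chain
            if (pair[0].equals(key)) return pair[1];
        return null;                                 // not present
    }

    public static void main(String[] args) {
        put("mango", "fruit");
        put("honeydew", "melon");                    // if it collides, the chain absorbs it
        System.out.println(get("mango"));            // fruit
        System.out.println(get("honeydew"));         // melon
    }
}
```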