
Data Structures and Algorithms Analysis

Data Structures and Algorithms Analysis. String Matching Dr. Ken Cosh. Review. Memory Management Memory Allocation Garbage Collection. This Week. String Matching String matching is a common task for many computer users; Internet Searches String manipulation in word processing


Presentation Transcript


  1. Data Structures and Algorithms Analysis String Matching Dr. Ken Cosh

  2. Review • Memory Management • Memory Allocation • Garbage Collection

  3. This Week • String Matching • String matching is a common task for many computer users; • Internet Searches • String manipulation in word processing • Advanced DNA sequence matching • Therefore effective pattern matching algorithms are essential.

  4. Brute Force • Our first simple string matching algorithm is brute force. • We check the first character: if it matches, we check the second character, and so on; if it does not match, we step forward one character in the text and start again from the beginning of the pattern. • Any useful information gained during a failed attempt, which could have been used in subsequent comparisons, is lost.

  5. Brute Force
     bruteForceStringMatching(pattern P, text T)
       i = 0;
       while i ≤ |T| - |P|
         j = 0;
         while j < |P| && Ti == Pj
           i++; j++;
         if j == |P| return match at i - |P|;
         i = i - j + 1;
       return no match;
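The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the lecturer's own code: the function name `brute_force_match` and the convention of returning -1 for "no match" are assumptions. Unlike the pseudocode, it keeps `i` fixed at the start of the current alignment and indexes the text with `i + j`, which does the same work without having to rewind `i`.

```python
def brute_force_match(pattern: str, text: str) -> int:
    """Return the 0-based index of the first match, or -1 if there is none."""
    m, n = len(pattern), len(text)
    i = 0  # start of the current alignment in the text
    while i <= n - m:
        j = 0  # number of pattern characters matched at this alignment
        # Check the length bound first so pattern[j] is never read out of range.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # every pattern character matched
        i += 1  # step forward one character and start again
    return -1


print(brute_force_match("babab", "ababcdababababababad"))  # 7
```

With T = ababcdababababababad and P = babab the match starts at index 7, which is the 8th alignment tried.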

  6. Brute Force • T = ababcdababababababad, P = babab
       ababcdababababababad
     1 babab
     2  babab
     3   babab
     4    babab
     5     babab
     6      babab
     7       babab
     8        babab
     • In this case the match is found on the 8th try.

  7. Brute Force Complexity • The best case for the algorithm is that the string is matched straight away (consider searching this sentence for “The”). Here |P| comparisons are required, giving O(|P|). • The worst case is when the string isn’t found and, for each character in T, we are required to make |P| comparisons; here the worst case is O(|T||P|). • The average case depends on the size and frequencies of the character set.

  8. Brute Force Complexity • Notice the nested while loops in the Brute Force algorithm: the outer loop is while i ≤ |T| - |P| and the inner loop is while j < |P| && Ti == Pj • Shortly we’ll investigate how we can reduce the number of iterations of each loop. • For the worst case to occur we could search for a pattern such as aaaaaaaaaaaab within a text such as aaaaaaaaaaaaaaaaaaaaaaaaaaa etc.
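To make the worst case concrete, we can count character comparisons. The sketch below is only an illustration: the function name `count_comparisons` is hypothetical, and the specific lengths (12 a's followed by b as the pattern, 27 a's as the text) are assumed values in the spirit of the slide's example. It tries every alignment without stopping at a match, which is what happens anyway when the pattern is absent.

```python
def count_comparisons(pattern: str, text: str) -> int:
    """Count the character comparisons brute force makes over all alignments."""
    m, n = len(pattern), len(text)
    comparisons = 0
    for i in range(n - m + 1):      # outer loop: one pass per alignment
        for j in range(m):          # inner loop: compare until a mismatch
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
    return comparisons


pattern = "a" * 12 + "b"   # assumed lengths, for illustration only
text = "a" * 27
print(count_comparisons(pattern, text))  # 195
```

There are 27 - 13 + 1 = 15 alignments, and each one matches 12 a's before failing on the b, so the total is 15 × 13 = 195 comparisons, in line with the O(|T||P|) bound.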

  9. Improving Brute Force • A key problem with brute force is that each time we abort the comparison we have to start from the beginning of the pattern again. • We could reduce the number of comparisons by skipping positions that cannot lead to a match. • Hancart’s algorithm allows the search to step forward 2 characters when a match at the next position is impossible.

  10. Hancart • Hancart’s algorithm refines brute force in a couple of ways. • First, the first two characters of the pattern are compared with each other: either they are the same, or they are different. • Second, comparisons at each position begin with the 2nd character of the pattern, so the search starts by examining the 2nd character in the text.

  11. Hancart • Hancart’s revision works by allowing us to skip forward 2 characters in situations where there can’t be a match. • Notice that the situations where 2 steps forward are allowed depend on whether the first 2 characters of the pattern are the same. • We can refine the search further by extending this observation: the number of steps forward allowed depends on the contents of the pattern. • The Knuth Morris Pratt algorithm observes that the pattern contains enough information to determine where the next match could begin.

  12. Hancart • Hancart’s algorithm reduces the number of iterations through the outer loop by sometimes allowing the increment to be i = i - j + 2;
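The slides do not give pseudocode for Hancart's refinement, so the following is only a sketch, assuming they refer to the scheme usually published under the name "Not So Naive". The function name `not_so_naive_match`, the two shift variables, and the restriction to patterns of at least 2 characters are assumptions made for this illustration. Here `i` stays at the start of the alignment, so the slide's increment i = i - j + 2 appears simply as a shift of 2.

```python
def not_so_naive_match(pattern: str, text: str) -> int:
    """Return the 0-based index of the first match, or -1 if there is none."""
    m, n = len(pattern), len(text)
    if m < 2:
        raise ValueError("this sketch assumes a pattern of at least 2 characters")

    # Preprocessing: compare the first two pattern characters with each other.
    if pattern[0] == pattern[1]:
        # If text[i+1] != pattern[1], it also differs from pattern[0],
        # so no match can start at i+1 either: skip 2.
        shift_on_mismatch, shift_on_match = 2, 1
    else:
        # If text[i+1] == pattern[1], it differs from pattern[0],
        # so no match can start at i+1: skip 2.
        shift_on_mismatch, shift_on_match = 1, 2

    i = 0  # start of the current alignment in the text
    while i <= n - m:
        # Comparisons begin with the 2nd character of the alignment.
        if text[i + 1] != pattern[1]:
            i += shift_on_mismatch
        else:
            if text[i] == pattern[0] and text[i + 2:i + m] == pattern[2:]:
                return i
            i += shift_on_match
    return -1


print(not_so_naive_match("babab", "ababcdababababababad"))  # 7
```

On the earlier example this visits alignments 0, 1, 3, 4, 5 and 7, six tries instead of brute force's eight, because two of the steps move forward by 2.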

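  13. Knuth Morris Pratt • The Knuth Morris Pratt algorithm begins by finding, for each position in the substring, the length of the longest suffix ending at that position which is equal to a prefix of the same substring (not counting the whole string itself). • Substring: A,B,C,D,A,B,D • Longest Suffix: 0,0,0,0,1,2,0 • i.e. when the 2nd A comes it is both a suffix and a prefix of the substring so far. The following B forms ‘AB’, a 2 character prefix and suffix. • Now on each mismatch the pattern can be moved forward by j-x, where j is the number of pattern characters already matched and x is the Longest Suffix value for the last matched character (if nothing has matched yet, we simply step forward 1). • e.g. if a mismatch is found when comparing the second A, then ABCD has already matched, so j=4 and x=0, and the pattern can be moved forward by 4.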
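The Longest Suffix table above can be computed in a single pass over the pattern. The sketch below uses the standard construction; the function name `build_prefix_table` is an assumption, not a name taken from the slides.

```python
def build_prefix_table(pattern: str) -> list[int]:
    """table[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of pattern[:q+1]."""
    table = [0] * len(pattern)
    k = 0  # length of the prefix/suffix overlap found so far
    for q in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter overlap.
        while k > 0 and pattern[q] != pattern[k]:
            k = table[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        table[q] = k
    return table


print(build_prefix_table("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
```

The output reproduces the row 0,0,0,0,1,2,0 given for A,B,C,D,A,B,D.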

  14. Test Try searching for this substring, A,B,C,D,A,B,D within this string ABCDABCABCDABDE

  15. Knuth Morris Pratt complexity • Knuth Morris Pratt removes some of the complexity of the brute force algorithm by preprocessing the substring being searched for (to create the suffix table). • Because characters in the text never need to be rechecked, the search itself is O(|T|). • Preprocessing can be performed quickly, in O(|P|) time, leaving a total complexity of O(|T|+|P|).
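Putting the two phases together, a Knuth Morris Pratt search might look like the sketch below. The names `build_prefix_table` and `kmp_search`, and the choice to return -1 when nothing is found, are assumptions made for this illustration. Note that the text index only ever moves forward, which is where the O(|T|) search bound comes from.

```python
def build_prefix_table(pattern: str) -> list[int]:
    """Preprocessing, O(|P|): longest proper prefix that is also a suffix."""
    table = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[q] != pattern[k]:
            k = table[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        table[q] = k
    return table


def kmp_search(pattern: str, text: str) -> int:
    """Search, O(|T|): return the index of the first match, or -1."""
    if not pattern:
        return 0
    table = build_prefix_table(pattern)
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):  # each text character is visited once
        # On a mismatch, reuse the table instead of rewinding the text.
        while j > 0 and ch != pattern[j]:
            j = table[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1  # start index of the match
    return -1


print(kmp_search("ABCDABD", "ABCDABCABCDABDE"))  # 7
```

On the test string from the earlier slide, the first attempt fails at text position 6 after matching ABCDAB; the table lets the search keep the trailing AB and continue from there, and the match is found starting at index 7.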
