Data Structures and Algorithms Analysis

Data Structures and Algorithms Analysis String Matching Dr. Ken Cosh

Review • Memory Management • Memory Allocation • Garbage Collection

This Week • String Matching • String matching is a common task for many computer users; • Internet Searches • String manipulation in word processing • Advanced DNA sequence matching • Therefore effective pattern matching algorithms are essential.

Brute Force • Our first simple string matching algorithm is brute force. • We check the first character; if it matches, we check the second character, and so on. If a character does not match, we step forward one character in the text and start again from the beginning of the pattern. • Any useful information gained during the comparison that could be used in subsequent attempts is then lost.

Brute Force
bruteForceStringMatching(pattern P, text T)
    i = 0;
    while i ≤ |T| - |P|
        j = 0;
        while j < |P| && Ti == Pj
            i++; j++;
        if j == |P| return match at i - |P|;
        i = i - j + 1;
    return no match;
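The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the lecturer's code: the function name `brute_force_match` and the choice to return -1 for "no match" are assumptions made here, and the sketch keeps a separate alignment index `i` rather than advancing `i` together with `j`.

```python
def brute_force_match(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    i = 0  # start of the current alignment in the text
    while i <= n - m:
        j = 0  # number of pattern characters matched at this alignment
        # Test the bound on j first so pattern[j] is never read out of range.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # every pattern character matched
        i += 1  # step forward one character and start again
    return -1  # no match (an assumed convention for this sketch)


if __name__ == "__main__":
    print(brute_force_match("babab", "ababcdababababababad"))
```

Checking `j < m` before comparing characters is the design choice that matters here: it keeps the comparison from reading past the end of the pattern once a full match has been found.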

Brute Force • T = ababcdababababababad, P = babab
  ababcdababababababad
1 babab
2  babab
3   babab
4    babab
5     babab
6      babab
7       babab
8        babab
• In this case the match is found on the 8th try.

Brute Force Complexity • The best case for the algorithm is that the string is matched straight away (consider searching this sentence for “The”). Here |P| comparisons are required – O(|P|). • The worst case is if the string isn’t found, but for almost every character in T we are required to make up to |P| comparisons before the mismatch appears – here the worst case is O(|T||P|). • The average case depends on the size of the character set and the frequencies of its characters.

Brute Force Complexity • Notice the nested while loops in the Brute Force algorithm; while i ≤ |T| - |P| while j < |P| && Ti == Pj • Shortly we’ll investigate how we can reduce the number of iterations of each loop. • For the worst case to occur we could search for a string such as aaaaaaaaaaaab within a string aaaaaaaaaaaaaaaaaaaaaaaaaaa etc.
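To make the best and worst cases concrete, the sketch below counts character comparisons during a brute-force search. The helper name `count_comparisons` and the specific string lengths (12 a's followed by b, searched for in 27 a's) are assumptions chosen for illustration; they mirror the shape of the example strings rather than reproduce them exactly.

```python
def count_comparisons(pattern, text):
    """Brute-force search; return (match_index, number_of_comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):      # outer loop: one pass per alignment
        j = 0
        while j < m:                # inner loop: compare until a mismatch
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


if __name__ == "__main__":
    # Worst-case shape: every alignment matches all but the last character.
    pattern = "a" * 12 + "b"
    text = "a" * 27
    index, comparisons = count_comparisons(pattern, text)
    alignments = len(text) - len(pattern) + 1
    print(index, comparisons, alignments * len(pattern))

    # Best-case shape: the pattern sits at the very start of the text.
    print(count_comparisons("The", "The best case"))
```

In the worst-case run every alignment costs |P| comparisons, so the count equals (|T| - |P| + 1) × |P|, which is the O(|T||P|) behaviour described above.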

Improving Brute Force • A key problem with brute force is that each time we abort the comparison we have to start from the beginning of the pattern again. • We could reduce the work the algorithm does by skipping alignments that cannot lead to a match. • Hancart’s algorithm allows the search to step forward 2 characters if a match won’t be found.

Hancart • Hancart’s algorithm refines brute force in a couple of ways. • First, the first two characters of the pattern are compared with each other. • Either they are the same, or they are different. • Second, comparisons at each alignment begin with the 2nd character of the pattern, checked against the corresponding character in the text.

Hancart • Hancart’s revision works by allowing us to skip forward 2 characters in situations where there can’t be a match. • Notice that the situations where 2 steps forward are allowed depend on whether the first 2 characters of the pattern are the same or different. • We can refine the search further by extending this observation – that the number of steps forward allowed depends on the contents of the pattern. • The Knuth Morris Pratt algorithm observes that the pattern contains enough information to determine where the next match could begin.

Hancart • Hancart’s algorithm reduces the number of iterations through the outer loop – by sometimes allowing the increment to be; i = i - j + 2;
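The two-step skip can be sketched as below. This follows the formulation commonly published as the "Not So Naive" algorithm attributed to Hancart, which is assumed here to be the variant the slides describe; the function name `not_so_naive_match`, the variable names, and the fallback to `str.find` for patterns shorter than 2 characters are choices made for this illustration.

```python
def not_so_naive_match(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m < 2:
        # The two-character trick needs at least 2 pattern characters.
        return text.find(pattern)

    if pattern[0] == pattern[1]:
        # If text[i+1] != pattern[1], it also differs from pattern[0],
        # so neither alignment i nor alignment i+1 can match.
        shift_on_mismatch, shift_on_match = 2, 1
    else:
        # If text[i+1] == pattern[1], it differs from pattern[0],
        # so alignment i+1 cannot match.
        shift_on_mismatch, shift_on_match = 1, 2

    i = 0
    while i <= n - m:
        if text[i + 1] != pattern[1]:      # compare the 2nd character first
            i += shift_on_mismatch
        else:
            # Check the first character and the rest of the pattern.
            if text[i] == pattern[0] and text[i + 2:i + m] == pattern[2:]:
                return i
            i += shift_on_match
    return -1


if __name__ == "__main__":
    print(not_so_naive_match("babab", "ababcdababababababad"))
```

The skip of 2 is only ever taken when the comparison of the second character already rules out the next alignment, so no possible match is missed.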

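Knuth Morris Pratt • The Knuth Morris Pratt algorithm begins by finding, for each position in the pattern, the length of the longest suffix of the substring ending there that is also a prefix of the pattern (not counting the whole substring itself). • Substring: A,B,C,D,A,B,D • Longest Suffix: 0,0,0,0,1,2,0 • i.e. when the 2nd A comes it is both a suffix and a prefix for the substring. The following B forms ‘AB’, a 2 character prefix and suffix. • Now after a mismatch the alignment can move forward by m-x, where m is the number of pattern characters already matched and x is the longest suffix value of the last matched character. • i.e. if a mismatch is found when comparing the second A (the 5th character of the pattern), ABCD has already matched, so m=4 and x=0, and the alignment can move forward by 4.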
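One common way to build the longest suffix table is sketched below. The function name `longest_suffix_table` is an assumption for this illustration; the technique is the standard prefix-function computation, which reuses earlier table entries instead of rechecking the pattern from scratch.

```python
def longest_suffix_table(pattern):
    """For each position q, return the length of the longest proper suffix
    of pattern[:q + 1] that is also a prefix of pattern."""
    table = [0] * len(pattern)
    k = 0  # length of the prefix/suffix overlap carried from position q - 1
    for q in range(1, len(pattern)):
        # Fall back to shorter overlaps until one can be extended.
        while k > 0 and pattern[q] != pattern[k]:
            k = table[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        table[q] = k
    return table


if __name__ == "__main__":
    print(longest_suffix_table("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
```

For the pattern A,B,C,D,A,B,D this reproduces the table 0,0,0,0,1,2,0 given above.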

Test • Try searching for the substring A,B,C,D,A,B,D within the string ABCDABCABCDABDE.

Knuth Morris Pratt complexity • Knuth Morris Pratt removes some of the work of the brute force algorithm by preprocessing the substring being searched for (to create the suffix table). • As we never need to recheck characters in the text, the search itself is O(|T|). • Preprocessing can be performed quickly, in O(|P|) time, giving a total complexity of O(|T|+|P|).
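A compact sketch of the whole search is given below, under the assumption that the suffix table is built with the standard prefix-function method. The name `kmp_search` and the -1 return value for "no match" are choices made for this example. Note that the text index only ever moves forward, which is where the O(|T|) bound for the search comes from.

```python
def kmp_search(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0

    # Preprocessing, O(|P|): longest suffix that is also a prefix.
    table = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[q] != pattern[k]:
            k = table[k - 1]
        if pattern[q] == pattern[k]:
            k += 1
        table[q] = k

    # Search, O(|T|): each text character is read once, never rechecked.
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]  # reuse the overlap instead of restarting
        if ch == pattern[k]:
            k += 1
        if k == m:
            return i - m + 1
    return -1


if __name__ == "__main__":
    print(kmp_search("babab", "ababcdababababababad"))
```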
