Repeats Professor Dina Sokol The Graduate Center of the City University of N.Y. (http://www.sci.brooklyn.cuny.edu/~sokol/)
Maximal Runs • A run is also called: • periodic string • repetition • tandem repeat • tandem array • A run that occurs in a larger string is maximal if it cannot be extended to the right or left. e.g. aaabcxabcabcabcabaa
Problem Definition • Input: a string S over an alphabet Σ • Output: all squares that occur in S. After we show an algorithm for squares, we show how it can be used to find all maximal runs.
Naive Algorithm • Consider all possible pairs (i,j), 1 ≤ i < j ≤ n • Compare the suffixes s_i … s_n and s_j … s_n • If the length of the match is at least j-i characters, then there is a square beginning at location i with size 2(j-i).
Naïve Algorithm [Figure: suffix S_i (positions i … n) aligned above suffix S_j (positions j … n).] Example: x x x x x a b c d a b c d x x x with i=6, j=10
Time Complexity of Naïve Algorithm • Consider all pairs (i,j): O(n^2) pairs • Comparing the substrings: O(n) per pair • Overall: O(n^3) time.
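The naive algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are 0-based (the slides use 1-based positions), and the function name `naive_squares` is hypothetical.

```python
def naive_squares(s):
    """Return (start, length) for every square in s, with 0-based starts."""
    n = len(s)
    squares = []
    for i in range(n):
        for j in range(i + 1, n):
            period = j - i
            # Compare the suffix starting at i with the suffix starting at j.
            match = 0
            while j + match < n and s[i + match] == s[j + match]:
                match += 1
            # A match of at least j - i characters means a square of size 2(j - i).
            if match >= period:
                squares.append((i, 2 * period))
    return squares
```

On the slide's example string, the pair i=6, j=10 (1-based) corresponds to the 0-based entry `(5, 8)` in the result. The three nested loops make the O(n^3) bound visible directly.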
Main and Lorentz algorithm • Use Landau/Vishkin or KMP to reduce the matching from O(n) to constant time. • Reduce the number of pairs considered from O(n^2) to O(n log n). How?
Longest Common Extension • LP[i] = length of the longest common prefix between a string and its ith suffix. • This can be computed using suffix trees with an LCP query between S and the ith suffix of S (constant time per location i).
Longest Common Extension • Use KMP to compute the LP array: Text = Pattern = S • Or, use a variation of KMP: assume LP[j] is computed for all j < i. To compute LP[i], always remember the position whose match reaches the rightmost (RM) position reached so far.
Formally, we know the position k that maximizes k+LP[k]. If k+LP[k] > i, we consider three cases for ℓ = k+LP[k]-i: • if LP[i-k+1] > ℓ then LP[i] = ℓ • if LP[i-k+1] < ℓ then LP[i] = LP[i-k+1] • if LP[i-k+1] = ℓ then we continue comparing and position i becomes the new k. Otherwise (k+LP[k] ≤ i), we compare directly from position i, and i becomes the new k.
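Consider three cases for ℓ = k+LP[k]-i: [Figure: the string with positions 1, i-k+1, k and i marked, the match of length LP[k] starting at k, the remaining length ℓ starting at i, and the characters S[ℓ+1], S[i+ℓ] and S[k+LP[k]]; LP[i-k+1] is drawn as a grey box.] • if LP[i-k+1] (grey box) > ℓ then LP[i] = ℓ • if LP[i-k+1] < ℓ then LP[i] = LP[i-k+1] • if LP[i-k+1] = ℓ then we continue comparing and position i becomes the new k.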
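The three cases translate fairly directly into code. The sketch below assumes 0-based indexing, so the slide's LP[i-k+1] becomes `lp[i - k]` and `lp[0]` is the full string length. The name `longest_prefix_extension` is borrowed from the pseudocode on a later slide, but the body is an illustration, not the author's implementation.

```python
def longest_prefix_extension(s):
    """lp[i] = length of the longest common prefix of s and s[i:] (0-based)."""
    n = len(s)
    lp = [0] * n
    if n == 0:
        return lp
    lp[0] = n
    k = 0      # start of the match that reaches furthest to the right
    right = 0  # k + lp[k]: first position past that match
    for i in range(1, n):
        length = 0
        if right > i:
            ell = right - i        # characters of the match remaining from i
            mirrored = lp[i - k]   # value at the corresponding prefix position
            if mirrored < ell:
                lp[i] = mirrored
                continue
            if mirrored > ell:
                lp[i] = ell
                continue
            length = ell           # equal: keep comparing past the known match
        while i + length < n and s[length] == s[i + length]:
            length += 1
        lp[i] = length
        k, right = i, i + length   # position i becomes the new k
    return lp
```

Each character comparison in the `while` loop either fails once per position or advances `right`, which is why the whole array takes linear time.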
Find Repeats Crossing a Boundary • Find all repeats whose right half crosses the boundary (right repeats) • Find all repeats whose left half crosses the boundary (left repeats)
To find right repeats that cross the center • Use the center as an anchor, and pair each index (j) with the center. [Figure: the string with positions 1, n/2, j and n marked.] • Forward Extension: match to the right as much as possible • Backward Extension: match to the left as much as possible
Forward and backward extensions • If the back extension meets the forward extension, we have a repeat with period j-n/2. [Figure: the string with positions 1, n/2, j and n marked.] • Note: green arrow + back arrow is at least p = j-n/2.
Example: Main and Lorentz [Figure: positions 1, n/2, j and n marked over the string … a b c d a b c d …]
Algorithm: Find Right Repeats
Right-Repeats(x, y), where m = |x|
    LP_y ← Longest-Prefix-Extension(y)
    LS_x|y ← Longest-Suffix-Extension(x, y)
    R ← ∅
    for p ← 1 to |y| do
        if LS_x|y(p) + LP_y(p + 1) ≥ p then
            r ← (m − LS_x|y(p) + 1, m + p + LP_y(p + 1))
            R ← R ∪ {r}
    return R
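A runnable reading of this pseudocode, as a sketch only: it assumes m = |x|, computes both extensions with plain character-by-character loops instead of the precomputed LP and LS arrays (so it is quadratic rather than linear), and also records the period p in each result, which the pseudocode does not do.

```python
def right_repeats(x, y):
    """Periodic regions of x + y found by anchoring at the x|y boundary.

    Returns (start, end, p) with 1-based inclusive positions in x + y.
    """
    s = x + y
    m, n = len(x), len(s)
    found = []
    for p in range(1, len(y) + 1):
        # Forward extension: compare y with y shifted by p (plays the role of LP_y(p + 1)).
        fwd = 0
        while m + p + fwd < n and s[m + fwd] == s[m + p + fwd]:
            fwd += 1
        # Backward extension: common suffix of x and x + y[:p] (plays the role of LS_x|y(p)).
        back = 0
        while back < m and s[m - 1 - back] == s[m + p - 1 - back]:
            back += 1
        if back + fwd >= p:
            found.append((m - back + 1, m + p + fwd, p))
    return found
```

For example, `right_repeats("xabcd", "abcdy")` reports the region covering positions 2 to 9 with period 4, which is the substring abcdabcd. The same region can be reported again under a multiple of its period, so a full implementation would still need to filter the output.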
Reduce number of iterations • In the first iteration, find all repeats that cross the center of the input string. • Recursively solve each half. [Figure: the string with positions 1, n/2 (center) and n marked.] Clearly, there are O(log n) levels.
Overall Time: O(n log n) • Each iteration is done in linear time: every index j along the string is paired with the center in two directions, and each longest common extension takes constant time per j. • There are log n iterations.
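The same bound can be read off the usual divide-and-conquer recurrence. As a brief worked derivation, write T(n) for the running time on a string of length n, and let c be an assumed constant bounding the work per position at one level:

```latex
\begin{aligned}
T(n) &= 2\,T(n/2) + c\,n \\
     &= 4\,T(n/4) + 2c\,n \\
     &\;\;\vdots \\
     &= n\,T(1) + c\,n\log_2 n \;=\; O(n \log n).
\end{aligned}
```

Each of the log n levels contributes c·n in total, because the substrings handled at one level are disjoint and together cover the input.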
Approximate Repeats • Suppose we have exactly two copies (square), and errors are introduced. • Assume that a Hamming Distance of k is allowed between the first and second copy of each repeat.
Use Kangaroo Jumps • Using the same framework of Main and Lorentz, pair every j with the center • Instead of computing the Longest Common Extension, use the Kangaroo method to find the positions of the first k mismatches.
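The kangaroo idea is to jump from one mismatch to the next with a longest-common-extension query, skip the mismatching character, and repeat. A minimal sketch follows, with hypothetical function names. The helper `lce` here is a naive loop standing in for the constant-time suffix-tree query, so the sketch shows the jumping pattern but not the real time bound.

```python
def lce(s, a, b):
    """Naive longest common extension of s[a:] and s[b:] (stand-in for an O(1) query)."""
    length = 0
    while a + length < len(s) and b + length < len(s) and s[a + length] == s[b + length]:
        length += 1
    return length


def kangaroo_mismatches(s, a, b, k):
    """Offsets of the first k mismatches between s[a:] and s[b:]."""
    offsets = []
    offset = 0
    limit = len(s) - max(a, b)  # number of positions that can be compared
    while len(offsets) < k:
        offset += lce(s, a + offset, b + offset)  # jump to the next mismatch
        if offset >= limit:
            break                                  # ran off the end of the string
        offsets.append(offset)
        offset += 1                                # hop over the mismatch
    return offsets
```

With a constant-time `lce`, finding the first k mismatches costs O(k) per pair, which is what lets the Hamming-distance version reuse the same divide-and-conquer framework.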
Runs • Recall: a maximal run r[i…j] is a repeat with period p whose length is at least 2p, and it cannot be extended to the right or left. • The rational number (j-i+1)/p is the exponent of the run. Example: alfalfa = (alf)^(7/3)
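As a small illustration of the exponent, the sketch below computes the smallest period of a non-empty string from its longest border (the KMP failure function) and divides the length by it. The function names are hypothetical.

```python
from fractions import Fraction


def smallest_period(s):
    """Smallest period of a non-empty string, via the KMP border array."""
    n = len(s)
    border = [0] * n
    for i in range(1, n):
        b = border[i - 1]
        while b > 0 and s[i] != s[b]:
            b = border[b - 1]
        if s[i] == s[b]:
            b += 1
        border[i] = b
    # The smallest period is the length minus the longest border.
    return n - border[-1]


def exponent(s):
    """Length divided by smallest period, as an exact rational number."""
    return Fraction(len(s), smallest_period(s))
```

For alfalfa the longest border is alfa, so the period is 3 and the exponent is 7/3, matching the slide.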
Maximal Runs • Goal: Find all maximal runs in an input string S of length n. • Main and Lorentz can be trivially extended to find all maximal runs!
Periodicity • Definition: A string p is periodic if p = v^k u, with k > 1 and u a proper prefix of v. e.g. p = abcabcabca • Alternate Definition: A string p is periodic if it matches itself before position |p|/2. e.g. p = abcabcabca [Figure: abcabcabca aligned against a shifted copy of itself.]
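The alternate definition is easy to test directly: slide the string against itself by d positions, for every d up to |p|/2, and check whether the overlapping parts agree. A brute-force sketch, with a hypothetical function name and no attempt at efficiency:

```python
def is_periodic(p):
    """True if p matches itself under some shift d with 1 <= d <= len(p) // 2."""
    n = len(p)
    return any(p[d:] == p[:n - d] for d in range(1, n // 2 + 1))
```

For p = abcabcabca the shift d = 3 works, since abcabca is both a suffix and a prefix, so the string is periodic.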
Maximal Runs • In general: we have a repeat from the leftmost point of the back extension until the rightmost point of the forward extension. (Notice the periodicity, using the alternate definition.) [Figure: the string with positions 1, n/2, j and n marked.]
Approximate Runs • also called tandem repeats • need a distance measure • need a way to count errors • consider Hamming Distance measure – how to count mismatches • give 2 ideas (LSS and ACLS) • time permitting: continue to edit distance
What is a Tandem Repeat? A tandem repeat is a pattern of nucleotides that occurs consecutively 2 or more times. Example: The pattern CGT is repeated 5 times. …tcata cgt cgt cgt cgt cgt tacaaacgtcttccgt…
Approximate Tandem Repeats • More typically, the tandem copies are only approximate due to mutations. • Here is an alignment of copies from a human TR from Chromosome 5 (23.7 copies). [Figure: the alignment and a consensus pattern.] From TRDB, the Tandem Repeats Database. NAR v35, D80-87, January 2007.
Why are tandem repeats interesting? • They are associated with human disease: Fragile-X mental retardation, Myotonic dystrophy, Huntington’s disease, Friedreich’s ataxia, Epilepsy, Diabetes, Ovarian cancer • They are often polymorphic, making them valuable genomic markers. • They are involved in gene regulation and often contain putative transcription factor binding sites. • They can cause paramutation, an epigenetic suppression of gene expression.
Approximate Tandem Repeats (ATR) Allow errors in the copies of the repeats, such as: • Mismatches (also called point mutations) • Insertions and Deletions (also called frame-shift mutations)
Defining an ATR 2 approaches to describing the errors in an ATR: • Consensus-type repeat – errors relative to a consensus • Evolutive Repeats – errors relative to the preceding copy
Available Tools • TRF – Tandem Repeats Finder [Benson] • ATRHunter [Wexler et al.] • TandemSWAN [Boeva et al.] - use heuristics and statistical methods. • mreps [Kolpakov and Kucherov] • [Landau/Schmidt/Sokol] - exhaustive search, allow only mismatches
Consensus-type Repeats e.g. AGAC AGCC ATAC AGAA
Evolutive Tandem Repeats The assumption here is that each copy is derived from a neighbor copy. e.g. AGCC ACCC ACCT GCCT
Observation: Every consensus-type repeat with k errors is also an evolutive repeat with no more than 2k errors. e.g. consensus AGAC, copies AGAC AGCC ATAC AGAA AGAC: k = 3 (consensus), k = 6 (evolutive)
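The two error counts can be checked with a few lines of Python. The sketch counts mismatches only (Hamming distance) and reads the slide's example as five copies, AGAC AGCC ATAC AGAA AGAC, with consensus AGAC. That reading is an interpretation of the slide layout, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def consensus_errors(copies, consensus):
    """Total errors when every copy is compared with the consensus."""
    return sum(hamming(c, consensus) for c in copies)


def evolutive_errors(copies):
    """Total errors when every copy is compared with the preceding copy."""
    return sum(hamming(a, b) for a, b in zip(copies, copies[1:]))


copies = ["AGAC", "AGCC", "ATAC", "AGAA", "AGAC"]
k_consensus = consensus_errors(copies, "AGAC")  # 0 + 1 + 1 + 1 + 0
k_evolutive = evolutive_errors(copies)          # 1 + 2 + 2 + 1
```

Under that reading the counts come out as 3 and 6, so this example meets the 2k bound with equality: each error relative to the consensus can be charged at most twice, once on entering the erroneous copy and once on leaving it.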
Our Goal • Perform an exhaustive search for all evolutive tandem repeats in a given sequence. • Allow up to k insertions, deletions, and mismatches.
K-edit Repeats A k-edit repeat is a tandem repeat that has at most k indels/mismatches (copy to copy) over all copies of the repeat. Ex. CAAGCTCAGCTCCGCT is a 2-edit repeat: • copy 1: CAAGCT • copy 2: CA-GCT when aligned with copy 1, CAGCT when aligned with copy 3 (copy 2 is shown twice) • copy 3: CCGCT
Observation A string is a k-edit repeat iff there exists an alignment of the string with a proper suffix of itself, with ≤ k errors.
From previous Example CAAGCTCAGCTCCGCT • copy 1: CAAGCT • copy 2: CA-GCT / CAGCT • copy 3: CCGCT
alignment of the string with its proper suffix starting at copy 2:
    copy 1  copy 2
    CAAGCT  CAGCT
    CA-GCT  CCGCT
    copy 2  copy 3
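The copy-to-copy error count for this example can be verified with a standard edit-distance computation. The sketch assumes the copy boundaries are already known, as in the example (finding them is the actual problem), and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, mismatches) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def copy_to_copy_edits(copies):
    """Sum of edit distances between each copy and the next one."""
    return sum(edit_distance(a, b) for a, b in zip(copies, copies[1:]))


total = copy_to_copy_edits(["CAAGCT", "CAGCT", "CCGCT"])
```

Here copy 1 to copy 2 costs one deletion and copy 2 to copy 3 costs one mismatch, so `total` is 2, which agrees with the string being a 2-edit repeat.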
Problem Definition • Input: (1) a string S over an alphabet Σ, (2) an integer k • Output: all maximal k-edit repeats that occur in S. A repeat is maximal if it cannot be extended to the right or left.
Straightforward Algorithm • Consider all possible pairs (i,j), 1 ≤ i < j ≤ n • Construct the edit distance alignment of S1 = s_i … s_n to S2 = s_j … s_n using dynamic programming. • If the first j-i characters of S1 participate in the alignment with ≤ k errors, then a repeat exists.
Straightforward Algorithm [Figure: S1 (positions i … n) drawn above S2 (positions j … n).] Attempt to align s_i … s_n to s_j … s_n.
p-Restriction Alignment Since we are comparing a string with a suffix of the same string, we need to ensure that the string does not “catch up” with itself. e.g. ABCDEFGHI aligned with ---DEFGHI [Figure: A B C D E F G H I above its suffix D E F G H I.]
Example of Straightforward Algorithm k = 4, j = i + 6 • string: ctcgagctcctgacctcgtga • shown with gaps: ctc--gagctcctgacctcgtga • copy 1: ctc--gag (positions i … i+5) • copy 2: ctcctgac (positions i+6 … i+13) • copy 3: ctcgtga (positions i+14 … i+20)
Analysis of Straightforward Algorithm • Consider all pairs (i,j): O(n^2) • Computing the edit distance matrix: O(n^2) per pair. • Overall: O(n^4) time.
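For a single pair (i, j), the core of the straightforward algorithm is one dynamic-programming table. The sketch below is a simplification under stated assumptions: it only aligns the first copy S[i..j-1] against every prefix of the suffix starting at j and returns the smallest error count, it does not enforce the p-restriction, and it does not follow the alignment into later copies. Indices are 0-based and half-open, and the function name is hypothetical.

```python
def first_copy_errors(s, i, j):
    """Fewest edits needed to align s[i:j] with some prefix of s[j:]."""
    first, rest = s[i:j], s[j:]
    # prev[c] = edit distance between the processed part of `first` and rest[:c]
    prev = list(range(len(rest) + 1))
    for r, ch in enumerate(first, 1):
        cur = [r]
        for c, other in enumerate(rest, 1):
            cur.append(min(prev[c] + 1,                    # gap in the second copy
                           cur[c - 1] + 1,                 # gap in the first copy
                           prev[c - 1] + (ch != other)))   # match or mismatch
        prev = cur
    # The first copy may be aligned with a prefix of any length.
    return min(prev)
```

A caller would compare the result with k for every pair (i, j). The table has up to n rows and n columns, which is where the O(n^2) cost per pair comes from.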
Speedups • Use the Main and Lorentz algorithm to reduce the number of pairs considered from O(n^2) to O(n log n). • Use Ukkonen and Landau/Vishkin to reduce the edit distance matrix computation from O(n^2) to O(k^2). • Use Landau/Myers/Schmidt to reduce the computation of each following matrix.
Speedup #1: Reduce number of iterations • In the first iteration, find all repeats that cross the center of the input string. • Recursively solve each half. [Figure: the string with positions 1, n/2 (center) and n marked.] Clearly, there are O(log n) levels.
To find repeats that cross the center • Use the center as an anchor, and pair each index (j) with the center. [Figure: the string with positions 1, n/2, j and n marked.] • Forward Extension: match to the right as much as possible • Backward Extension: match to the left as much as possible