
Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

Kun-Mao Chao (趙坤茂), Department of Computer Science and Information Engineering, National Taiwan University, Taiwan. E-mail: kmchao@csie.ntu.edu.tw WWW: http://www.csie.ntu.edu.tw/~kmchao


Presentation Transcript


  1. Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan E-mail: kmchao@csie.ntu.edu.tw WWW: http://www.csie.ntu.edu.tw/~kmchao

  2. Dynamic Programming • Dynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure. • Richard Bellman was one of the principal founders of this approach.

  3. Two key ingredients • Two key ingredients for an optimization problem to be suitable for a dynamic-programming solution: 1. optimal substructures: each substructure is optimal (principle of optimality). 2. overlapping subproblems: subproblems are dependent (otherwise, a divide-and-conquer approach is the choice).

  4. Three basic components • The development of a dynamic-programming algorithm has three basic components: • The recurrence relation (for defining the value of an optimal solution); • The tabular computation (for computing the value of an optimal solution); • The traceback (for delivering an optimal solution).

  5. Fibonacci numbers • The Fibonacci numbers are defined by the following recurrence: $F_0 = 0$, $F_1 = 1$, and $F_i = F_{i-1} + F_{i-2}$ for $i > 1$.

  6. How to compute F10? (Figure: recursion tree. F10 calls F9 and F8; F9 calls F8 and F7; F8 calls F7 and F6; ……)

  7. Tabular computation • The tabular computation can avoid recomputation.
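
As a minimal sketch of the tabular computation for the Fibonacci recurrence of slide 5 (Python; the function name `fib_table` is illustrative and not from the slides), each entry is computed once from the two entries before it, so F10 needs only ten additions instead of a branching recursion:

```python
def fib_table(n):
    """Return the list [F_0, ..., F_n], filled left to right."""
    table = [0, 1]
    for i in range(2, n + 1):
        # Each entry reuses the two previously stored values.
        table.append(table[i - 1] + table[i - 2])
    return table[: n + 1]


print(fib_table(10)[10])  # 55
```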

  8. Maximum-sum interval • Given a sequence of real numbers a1a2…an, find a consecutive subsequence with the maximum sum. Example: 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 –9 • For each position, we can compute the maximum-sum interval starting at that position in O(n) time. Therefore, a naive algorithm runs in O(n^2) time.

  9. O-notation: an asymptotic upper bound • f(n) = O(g(n)) iff there exist two positive constants c and n0 such that 0 ≤ f(n) ≤ cg(n) for all n ≥ n0. (Figure: curves f(n) and cg(n), with n0 marked on the n axis.)

  10. How do functions grow? • For large data sets, algorithms with a complexity greater than O(n log n) are often impractical! (Table: running time of each function for increasing n, assuming one million operations per second.)

  11. Maximum-sum interval (The recurrence relation) • Define S(i) to be the maximum sum of the intervals ending at position i: S(i) = ai + max{S(i-1), 0}. • If S(i-1) < 0, concatenating ai with its previous interval gives less sum than ai itself.

  12. Maximum-sum interval (Tabular computation) ai: 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 –9; S(i): 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7. The maximum sum is 16.

  13. Maximum-sum interval (Traceback) ai: 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 –9; S(i): 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7. The maximum-sum interval: 6 –2 8 4
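
The three components for the maximum-sum interval can be sketched in a few lines of Python. This is a minimal sketch, assuming the recurrence S(i) = ai + max(S(i-1), 0), which agrees with the S(i) row on slides 12 and 13; the function name and the 0-based indices are illustrative.

```python
def max_sum_interval(a):
    """Return (S, start, end) with a[start..end] a maximum-sum interval (0-based)."""
    # Tabular computation: S[i] is the best sum of an interval ending at i.
    S = []
    for i, x in enumerate(a):
        S.append(x + max(S[i - 1], 0) if i > 0 else x)
    # The value of an optimal solution is the largest table entry.
    end = max(range(len(a)), key=S.__getitem__)
    # Traceback: keep extending to the left while the previous entry helped.
    start = end
    while start > 0 and S[start - 1] > 0:
        start -= 1
    return S, start, end


a = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
S, start, end = max_sum_interval(a)
print(S[end], a[start:end + 1])  # 16 [6, -2, 8, 4]
```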

  14. Two fundamental problems we recently solved (I) (joint work with Lin and Jiang) • Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum --- an O(n)-time algorithm. Example with U = 3: 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 –9
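
The slide states the O(n) bound without giving the method. One standard way to reach that bound, not necessarily the authors' algorithm, is to slide a window over the prefix sums and keep the window minimum in a monotone deque. The sketch below assumes U ≥ 1 and a non-empty sequence; the function name is illustrative.

```python
from collections import deque


def max_sum_at_most(a, U):
    """Return (best_sum, start, end) over intervals a[start..end] of length <= U."""
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)
    best = None
    window = deque()  # candidate start indices, prefix values increasing
    for j in range(1, len(prefix)):
        i = j - 1
        # Drop candidates that can never beat the new start index i.
        while window and prefix[window[-1]] >= prefix[i]:
            window.pop()
        window.append(i)
        # Drop starts that would make the interval longer than U.
        while window[0] < j - U:
            window.popleft()
        total = prefix[j] - prefix[window[0]]
        if best is None or total > best[0]:
            best = (total, window[0], j - 1)
    return best


a = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
print(max_sum_at_most(a, 3)[0])  # 12
```

Each index enters and leaves the deque at most once, which is where the linear bound comes from.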

  15. Joint work with Huang, Jiang and Lin • Xiaoqiu Huang (黃曉秋), Iowa State University, USA • Tao Jiang (姜濤), University of California Riverside, USA • Yaw-Ling Lin (林耀鈴), Providence University, Taiwan

  16. Two fundamental problems we recently solved (II) (joint work with Lin and Jiang) • Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average --- an O(n log L)-time algorithm. Example with L = 4: 3 2 14 6 6 2 10 2 6 6 14 2 1

  17. Another example • Given the sequence 2, 6.6, 6.6, 3, 7, 6, 7, 2 and L = 2, the highest-average interval is 7, 6, 7 (boxed on the slide), which has the average value 20/3.

  18. C+G rich regions • Our method can be used to locate a region of length at least L with the highest C+G ratio in O(n log L) time. • Score each C or G as 1 and each A or T as 0: ATGACTCGAGCTCGTCA becomes 00101011011011010. • Search for an interval of length at least L with the highest average.
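
The reduction from a DNA string to a 0/1 sequence is a one-liner. A minimal Python sketch follows; the function name `cg_indicator` is illustrative, and any character other than C or G is scored 0, which is an assumption for inputs that contain symbols beyond A, C, G, T.

```python
def cg_indicator(dna):
    """Map each base to 1 if it is C or G, and to 0 otherwise."""
    return [1 if base in "CG" else 0 for base in dna.upper()]


bits = cg_indicator("ATGACTCGAGCTCGTCA")
print("".join(map(str, bits)))  # 00101011011011010
```

The average of any interval of this 0/1 sequence is the C+G ratio of the corresponding region.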

  19. After I opened this problem in Academia Sinica in Aug. 2001 ... • JCSS (2002) – Lin, Jiang, Chao • A linear-time algorithm for a maximum-sum segment with a lower bound L and an upper bound U. • An O(n log L) time algorithm for a maximum-average segment with a lower bound L. • Two open problems raised: • O(n log L) → O(n)? • average (with an upper bound U)?

  20. After I opened this problem in Academia Sinica in Aug. 2001 ... • Comb. Math. (2002) – Yeh et al. • An expected O(nL)-time algorithm • Comb. Math. (2002) – Lu • A linear time algorithm for a maximum-average segment with a lower bound L in a binary sequence.

  21. After I opened this problem in Academia Sinica in Aug. 2001 ... • WABI (2002) & JCSS – Goldwasser, Kao, and Lu • O(n) for a lower bound L only • O(n log (U-L+1)) for L and U • IPL (2003) – Kim • O(n) for a lower bound L only (reformulated as a computational geometry problem)

  22. After I opened this problem in Academia Sinica in Aug. 2001 ... • Bioinformatics (2003) – Lin, Huang, Jiang, Chao • A tool for finding k-best maximum-average segments • Bioinformatics (2003) – Wang, Xu • A tool for locating a maximum-sum segment with a lower bound L and an upper bound U

  23. After I opened this problem in Academia Sinica in Aug. 2001 ... • CIAA (2003) – Lu et al. • An alternative linear-time algorithm for a maximum-sum segment (online; for finding repeats) • ESA (2003) – Chung, Lu • O(n log (U-L+1)) → O(n) for L and U • ISAAC (2003) – Lin (R.R.), Kuo, Chao • O(nL) for a maximum-average path in a tree • O(n) for a complete tree • Some were omitted here... • Some are coming soon...

  24. Length-unconstrained version • Maximum-average interval 3 2 14 6 6 2 10 2 6 6 14 2 1 The maximum element is the answer. It can be done in O(n) time.

  25. Q: computing density in O(1) time? • prefix-sum(i) = S[1]+S[2]+…+S[i], • all n prefix sums are computable in O(n) time. • sum(i, j) = prefix-sum(j) – prefix-sum(i-1) • density(i, j) = sum(i, j) / (j-i+1) (Figure: the interval from i to j, obtained as prefix-sum(j) minus prefix-sum(i-1).)
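
The prefix-sum trick can be sketched as follows (Python; positions are 1-based as on the slide, and the helper names are illustrative). After the O(n) precomputation, each density query is two subtractions and one division.

```python
def prefix_sums(seq):
    """prefix[i] = seq[1] + ... + seq[i] in 1-based terms; prefix[0] = 0."""
    prefix = [0]
    for x in seq:
        prefix.append(prefix[-1] + x)
    return prefix


def density(prefix, i, j):
    """Average of positions i..j (1-based, inclusive) in O(1) time."""
    return (prefix[j] - prefix[i - 1]) / (j - i + 1)


seq = [3, 2, 14, 6, 6, 2, 10, 2, 6, 6, 14, 2, 1]  # the sequence of slide 24
prefix = prefix_sums(seq)
print(density(prefix, 3, 5))  # (14 + 6 + 6) / 3
```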

  26. A naive algorithm • A simple shift algorithm can compute the highest-average interval of a fixed length in O(n) time. • Try lengths L, L+1, L+2, ..., n. In total, O(n^2) time.

  27. An O(nL)-time algorithm (Huang, CABIOS '94) • Observing that there exists an optimal interval of length bounded by 2L, we immediately have an O(nL)-time algorithm. • We can bisect a region of length >= 2L into two segments, each of length >= L; at least one of the two has an average no less than that of the whole region.
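
A minimal Python sketch of the O(nL) idea: by the bisection argument it is enough to try lengths L, L+1, ..., 2L-1 at every start position, using prefix sums for each average. The function name and the 0-based result indices are illustrative, and the sketch assumes 1 ≤ L ≤ n.

```python
def max_average_at_least(a, L):
    """Return (average, start, end) of a highest-average interval of length >= L."""
    n = len(a)
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)
    best = None
    for i in range(n - L + 1):
        # Lengths of 2L or more can be bisected, so they never need testing.
        for length in range(L, min(2 * L - 1, n - i) + 1):
            avg = (prefix[i + length] - prefix[i]) / length
            if best is None or avg > best[0]:
                best = (avg, i, i + length - 1)
    return best


a = [2, 6.6, 6.6, 3, 7, 6, 7, 2]  # the example of slide 17
avg, start, end = max_average_at_least(a, 2)
print(a[start:end + 1])  # [7, 6, 7]
```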

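  28. Good partners • Finding a good partner g(i) for each i = 1, 2, …, n-L+1, where g(i) >= i+L-1 is the position maximizing avg[i, g(i)]. (Figure: position i, the first L positions up to i+L, and the partner g(i) further right.)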

  29. Right-Skew Decomposition [Lin, Jiang, Chao] • Partition S into substrings S1,S2,…,Sk such that • each Si is a right-skew substring of S, i.e., the average of any prefix is always less than or equal to the average of the remaining suffix; • density(S1) > density(S2) > … > density(Sk). • The decomposition is unique and computable in linear time.

  30. 1 1 1 0 1 1 0 1 0 1 1 0 0 Right-Skew Decomposition 1 > 2/3 > 3/5 > 1/3 Lu’s illustration
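
The decomposition can be sketched with a right-to-left scan that merges a segment into its right neighbour whenever its density does not exceed the neighbour's. This is a minimal Python sketch of that merging idea, not the authors' code; the names are illustrative, indices are 0-based, and densities are compared by cross-multiplication to avoid rounding.

```python
def right_skew_pointers(a):
    """p[i] = last index of the first right-skew segment of the suffix a[i:]."""
    n = len(a)
    p = list(range(n))  # segment starting at i currently ends at p[i]
    s = list(a)         # s[i] = sum of a[i..p[i]]
    w = [1] * n         # w[i] = length of a[i..p[i]]
    for i in range(n - 1, -1, -1):
        # Merge while density(current segment) <= density(next segment).
        while p[i] + 1 < n and s[i] * w[p[i] + 1] <= s[p[i] + 1] * w[i]:
            nxt = p[i] + 1
            s[i] += s[nxt]
            w[i] += w[nxt]
            p[i] = p[nxt]
    return p


def right_skew_decomposition(a):
    """Return the segments (start, end) of the decomposition, left to right."""
    p = right_skew_pointers(a)
    segments, i = [], 0
    while i < len(a):
        segments.append((i, p[i]))
        i = p[i] + 1
    return segments


# The first 11 values of the 0/1 sequence on slide 30.
print(right_skew_decomposition([1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1]))
# [(0, 2), (3, 5), (6, 10)]  with densities 1 > 2/3 > 3/5
```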

  31. An O(n log L)-time algorithm (Lin, Jiang, Chao, JCSS’02) • Decreasingly right-skew decomposition (O(n) time) 5 6 7.5 5 9 8 9 8 7 8 9 7 8 1 9 8 3 7 2 8

  32. Right-Skew pointers p[ ] 5 6 7.5 5 9 8 9 8 7 8 9 7 8 1 9 8 3 7 2 8 1 2 3 4 5 6 7 8 9 10 p[ ] 1 3 3 6 5 6 10 8 10 10

  33. Illustration (Figure: positions i, i+L and the good partner g(i), with the length-L prefix marked; Lu's illustration)

  34. An O(n log L)-time algorithm • Jumping tables that allow binary search. • O(log L) time for each position to find its good partner; therefore the total running time is O(n log L). • We also implemented a program that linearly scans right-skew segments for the good partnership. In our empirical tests, it ran in linear time.

  35. Pointer-jumping tables 5 6 7.5 5 9 8 9 8 7 8 9 7 8 1 9 8 3 7 2 8; index: 1 2 3 4 5 6 7 8 9 10; p(0)[ ]: 1 3 3 6 5 6 10 8 10 10; p(1)[ ]: 3 6 6 10 6 10 10 10 10 10; p(2)[ ]: 10 10 10 10 10 10 10 10 10 10
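
The slide does not state the rule that links one table to the next. The rows shown are consistent with p(k+1)[i] = p(k)[min(p(k)[i] + 1, n)], that is, each level jumps over twice as many right-skew segments as the level before. The Python sketch below takes that rule as an assumption inferred from the rows above; the function name is illustrative and the lists are 1-based with an unused entry at index 0.

```python
def build_jump_tables(p0, levels):
    """Build p(1), ..., p(levels) from the 1-based pointer list p0 (p0[0] unused)."""
    n = len(p0) - 1
    tables = [p0]
    for _ in range(levels):
        prev = tables[-1]
        nxt = [0] * (n + 1)
        for i in range(1, n + 1):
            # Jump to the end of the segment at i, step over it, and jump again.
            nxt[i] = prev[min(prev[i] + 1, n)]
        tables.append(nxt)
    return tables


p0 = [0, 1, 3, 3, 6, 5, 6, 10, 8, 10, 10]  # p(0)[ ] from the slide
for table in build_jump_tables(p0, 2):
    print(table[1:])
```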

  36. Goldwasser, Kao, and Lu’s progress • An O(n)-time algorithm for the problem • Appeared in Proceedings of the Second Workshop on Algorithms in Bioinformatics (WABI), Rome, Italy, Sep. 17-21, 2002. • JCSS, accepted.

  37. A new important observation • i < j < g(j) < g(i) implies that density(i, g(i)) is no more than density(j, g(j)). (Figure: positions i and j with their good partners g(j) and g(i); Lu’s illustration)

  38. (Figure: positions i and j with their good partners g(j) and g(i); Lu’s illustration)

  39. Searching for all g(i) in linear time Lu’s illustration

  40. k non-overlapping maximum-average segments(Lin, Huang, Jiang, Chao, Bioinformatics) • Maintain all candidates in a priority queue. • A new k-best algorithm + algorithms on trees (Lin and Chao (2003))

  41. Longest increasing subsequence (LIS) • The longest increasing subsequence problem is to find a longest increasing subsequence of a given sequence of distinct integers a1a2…an. • e.g. 9 2 5 3 7 11 8 10 13 6 • 3 7, 7 10 13, and 7 11 are increasing subsequences. • 3 5 11 13 is not an increasing subsequence (3 appears after 5 in the sequence). • We want to find a longest one.

  42. A naive approach for LIS • Let L[i] be the length of a longest increasing subsequence ending at position i. • L[i] = 1 + max{L[j] | 0 <= j <= i-1 and aj < ai} (use a dummy a0 = minimum, and L[0] = 0). ai: 9 2 5 3 7 11 8 10 13 6; L[i]: 1 1 2 2 3 4 ?

  43. A naive approach for LIS • L[i] = 1 + max{L[j] | 0 <= j <= i-1 and aj < ai}. ai: 9 2 5 3 7 11 8 10 13 6; L[i]: 1 1 2 2 3 4 4 5 6 3. The maximum length is 6. • The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence. • This method runs in O(n^2) time.
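
The quadratic method, including a traceback, can be sketched as follows (Python; 0-based indices, so the dummy a0 of the slide is replaced by initialising every L[i] to 1; the names and the `prev` array are illustrative, and a non-empty input is assumed).

```python
def lis_quadratic(a):
    """Return (L, subsequence): L[i] is the LIS length ending at a[i]."""
    n = len(a)
    L = [1] * n
    prev = [-1] * n  # predecessor of a[i] in a best subsequence ending at i
    for i in range(n):
        for j in range(i):
            if a[j] < a[i] and L[j] + 1 > L[i]:
                L[i] = L[j] + 1
                prev[i] = j
    # Traceback from a position holding the maximum length.
    i = max(range(n), key=L.__getitem__)
    out = []
    while i != -1:
        out.append(a[i])
        i = prev[i]
    return L, out[::-1]


L, seq = lis_quadratic([9, 2, 5, 3, 7, 11, 8, 10, 13, 6])
print(L)         # [1, 1, 2, 2, 3, 4, 4, 5, 6, 3]
print(len(seq))  # 6
```

Several longest increasing subsequences may exist; which one the traceback returns depends on how ties are broken.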

  44. Binary search • Given an ordered sequence x1x2 ... xn, where x1<x2< ... <xn, and a number y, a binary search finds the largest xi such that xi < y in O(log n) time. (Figure: the search range shrinking from n to n/2 to n/4 ...)

  45. Binary search • How many steps does a binary search need to reduce the problem size to 1? n → n/2 → n/4 → n/8 → n/16 → ... → 1. O(log n) steps.
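
A minimal Python sketch of this search: it returns the 0-based index of the largest element smaller than y, or -1 when no such element exists (the function name and the -1 convention are illustrative choices).

```python
def largest_below(xs, y):
    """Index of the largest xs[i] < y in a strictly increasing list, or -1."""
    lo, hi = 0, len(xs)  # invariant: xs[:lo] < y and xs[hi:] >= y
    while lo < hi:
        mid = (lo + hi) // 2
        if xs[mid] < y:
            lo = mid + 1
        else:
            hi = mid
    return lo - 1


print(largest_below([2, 3, 7, 8, 10, 13], 6))  # 1, i.e. the element 3
```

The range [lo, hi) halves on every iteration, which gives the O(log n) step count.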

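  46. An O(n log n) method for LIS • Define BestEnd[k] to be the smallest ending number of an increasing subsequence of length k. • Sequence: 9 2 5 3 7 11 8 10 13 6. Each row lists the value of BestEnd[k] after each further number is read, starting when the entry is first defined (shown here up to the ninth number): BestEnd[1]: 9 2 2 2 2 2 2 2 2; BestEnd[2]: 5 3 3 3 3 3 3; BestEnd[3]: 7 7 7 7 7; BestEnd[4]: 11 8 8 8; BestEnd[5]: 10 10; BestEnd[6]: 13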

  47. An O(n log n) method for LIS • Define BestEnd[k] to be the smallest ending number of an increasing subsequence of length k. • Sequence: 9 2 5 3 7 11 8 10 13 6. Each row lists the value of BestEnd[k] after each further number is read, starting when the entry is first defined: BestEnd[1]: 9 2 2 2 2 2 2 2 2 2; BestEnd[2]: 5 3 3 3 3 3 3 3; BestEnd[3]: 7 7 7 7 7 6; BestEnd[4]: 11 8 8 8 8; BestEnd[5]: 10 10 10; BestEnd[6]: 13 13 • For each position, we perform a binary search to update BestEnd. Therefore, the running time is O(n log n).
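
The BestEnd update can be sketched with Python's standard `bisect` module. The list `best_end` is 0-based, so `best_end[k - 1]` plays the role of BestEnd[k]; the function name is illustrative, and the input is assumed to consist of distinct integers as on slide 41.

```python
from bisect import bisect_left


def lis_best_end(a):
    """Return the final BestEnd table; its length is the LIS length."""
    best_end = []  # always strictly increasing
    for x in a:
        # First entry that is >= x: the shortest length whose end x can improve.
        k = bisect_left(best_end, x)
        if k == len(best_end):
            best_end.append(x)  # x extends the longest subsequence so far
        else:
            best_end[k] = x     # x is a smaller end for length k + 1
    return best_end


print(lis_best_end([9, 2, 5, 3, 7, 11, 8, 10, 13, 6]))
# [2, 3, 6, 8, 10, 13]
```

Note that the final table is not itself guaranteed to be an increasing subsequence of the input; only its length is the answer. Here 6 occurs after 8, 10 and 13 in the sequence.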

  48. Longest Common Subsequence (LCS) • A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent. • The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.

  49. LCS • For instance, Sequence 1: president; Sequence 2: providence. Their LCS is priden.

  50. LCS • Another example: Sequence 1: algorithm; Sequence 2: alignment. One of their LCSs is algm.
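
The transcript up to this point does not show a recurrence for LCS. The sketch below uses the standard textbook dynamic program, stated here as an assumption rather than as the slides' own formulation: c[i][j] is the LCS length of the first i symbols of x and the first j symbols of y, followed by a traceback. The function name is illustrative.

```python
def lcs(x, y):
    """Return one longest common subsequence of the strings x and y."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    # Tabular computation of the LCS lengths.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # Traceback from the bottom-right corner.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("president", "providence"))     # priden
print(len(lcs("algorithm", "alignment")))  # 4
```

When several LCSs exist, as for algorithm and alignment, the tie-breaking rule in the traceback decides which one is returned.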
