
The Longest Common Subsequence Problem and Its Variants

The Longest Common Subsequence Problem and Its Variants. Chang-Biau Yang (楊昌彪), Department of Computer Science and Engineering, National Sun Yat-sen University, http://www.nsysu.edu.tw. Outline: Introduction to Bioinformatics; Traditional LCS Algorithms; Our Works (Block Edit Problems, LCS of Run-Length Encoded Strings, Merged LCS Problem, Mosaic LCS Problem); Conclusions.


Presentation Transcript


  1. The Longest Common Subsequence Problem and Its Variants • Chang-Biau Yang (楊昌彪), Department of Computer Science and Engineering, National Sun Yat-sen University • http://www.nsysu.edu.tw

  2. Outline • Introduction to Bioinformatics • Traditional LCS Algorithms • Our Works • Block Edit Problems • LCS of Run-Length Encoded Strings • Merged LCS Problem • Mosaic LCS Problem • Conclusions

  3. Introduction to Bioinformatics

  4. Animal cell (nucleus, cytoplasm, cell membrane) • DNA is located in the cell nucleus, including the nucleolus

  5. DNA and RNA • Nucleotide bases: adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U) • DNA (deoxyribonucleic acid): {A, G, C, T} (base pairs: G≡C, A=T) • RNA (ribonucleic acid): {A, G, C, U} (base pairs: G≡C, A=U, G-U)

  6. DNA Double Helix

  7. DNA Length • The total length of human DNA is about $3 \times 10^9$ (3 billion) base pairs. • Only 1% to 1.5% of the DNA sequence is useful. • # of human genes: 30,000 to 40,000 • Conclusion from the Human Genome Project (1990 to 2003) • The number originally expected was 100,000.

  8. From DNA via RNA to Protein

  9. DNA, Genes and Proteins • DNA: program for cell processes • Proteins: execute cell processes • [Figure: a gene within the DNA sequence TCCAACGGTGCTGAGGTGCAC and the protein it encodes]

  10. Promoter and Gene

  11. Amino Acids • Amino acids are the basic units of proteins; there are 20 kinds in total.

  12. Protein Structure

  13. Traditional Dynamic Programming (DP) for the Longest Common Subsequence (LCS) Problem

  14. The Longest Common Subsequence (LCS) Problem • A string: S1 = “TAGTCACG” • A subsequence of S1: a string obtained by deleting 0 or more symbols (not necessarily consecutive) from S1, e.g. G, AGC, TATC, AGACG • Common subsequences of S1 = “TAGTCACG” and S2 = “AGACTGTC”: GG, AGC, AGACG • Longest common subsequence (LCS): S1: TAGTCACG, S2: AGACTGTC, LCS: AGACG

  15. Applications of LCS • The edit distance of two strings or files (# of deletions and insertions): S1: TAGTCACG, S2: AGACTGTC, Operations: DMMDDMMIMII (D = delete, M = match, I = insert) • Spoken word recognition • Similarity of two biological sequences (DNA or protein) • Sequence alignment

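  16. The Traditional LCS Algorithm • $S_1 = a_1 a_2 \cdots a_m$ and $S_2 = b_1 b_2 \cdots b_n$ • $A_{i,j}$ denotes the length of the longest common subsequence of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$. • Dynamic programming: $A_{i,j} = \begin{cases} A_{i-1,j-1} + 1 & \text{if } a_i = b_j \\ \max\{A_{i-1,j},\, A_{i,j-1}\} & \text{if } a_i \neq b_j \end{cases}$ with $A_{0,0} = A_{0,j} = A_{i,0} = 0$ for $1 \le i \le m$, $1 \le j \le n$. • Time complexity: $O(mn)$ • [Figure: the prefixes $a_1 a_2 \cdots a_{i-1} a_i$ and $b_1 b_2 \cdots b_{j-1} b_j$]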
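
The recurrence on slide 16 can be sketched in Python as follows. This is a minimal illustration rather than the presenter's code; the function name `lcs_length` is an assumption introduced here, and the table is indexed exactly like $A_{i,j}$ with row 0 and column 0 set to zero.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2, in O(mn) time."""
    m, n = len(s1), len(s2)
    # A[i][j] = LCS length of the prefixes s1[:i] and s2[:j]; row 0 and column 0 stay 0.
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                # a_i = b_j: extend the LCS of the two shorter prefixes.
                A[i][j] = A[i - 1][j - 1] + 1
            else:
                # a_i != b_j: drop the last symbol of one of the prefixes.
                A[i][j] = max(A[i - 1][j], A[i][j - 1])
    return A[m][n]


print(lcs_length("TAGTCACG", "AGACTGTC"))  # 5, e.g. AGACG
```

Only the previous row is needed to fill the current one, so the space can be reduced to O(n) when just the length is wanted.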

  17. LCS and Edit Distance • Edit distance = |S1| + |S2| - 2 * |LCS(S1, S2)|
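
As a quick check of this identity, the sketch below computes the insertion/deletion-only edit distance with its own DP and compares it with the LCS-based formula. The helper names `lcs_length` and `indel_distance` are assumptions for illustration, and substitutions are deliberately not allowed, matching the "# of deletions and insertions" definition on slide 15.

```python
def lcs_length(s1: str, s2: str) -> int:
    """LCS length using a rolling row of the standard DP table."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(s1: str, s2: str) -> int:
    """Minimum number of single-symbol insertions and deletions turning s1 into s2."""
    prev = list(range(len(s2) + 1))  # empty prefix of s1: insert j symbols
    for i, a in enumerate(s1, start=1):
        cur = [i]  # empty prefix of s2: delete i symbols
        for j, b in enumerate(s2, start=1):
            if a == b:
                cur.append(prev[j - 1])                    # match, no cost
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))   # delete a or insert b
        prev = cur
    return prev[-1]


s1, s2 = "TAGTCACG", "AGACTGTC"
print(indel_distance(s1, s2))                      # 6
print(len(s1) + len(s2) - 2 * lcs_length(s1, s2))  # 6
```

For the running example this gives 8 + 8 - 2 × 5 = 6, which agrees with the three deletions and three insertions in the operation string DMMDDMMIMII.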

  18. Sequence Alignment • S1 = TAGTCACG, S2 = AGACTGTC • Alignment 1: ----TAGTCACG / AGACT-GTC--- • Alignment 2: TAGTCAC-G-- / -AG--ACTGTC • Which one is better? • We can set different gap penalties as parameters for different purposes.
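
To make the "which one is better?" question concrete, here is a small sketch that scores a given gapped alignment column by column. The scoring values (match +2, mismatch -1, gap -1) and the function name `alignment_score` are hypothetical choices for illustration; they are not the parameters used in the slides.

```python
def alignment_score(row1: str, row2: str, match: int = 2,
                    mismatch: int = -1, gap: int = -1) -> int:
    """Score two equal-length gapped strings; '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap        # a symbol aligned against a gap
        elif a == b:
            score += match      # identical symbols
        else:
            score += mismatch   # two different symbols
    return score


# The two alignments of S1 = TAGTCACG and S2 = AGACTGTC from slide 18.
print(alignment_score("----TAGTCACG", "AGACT-GTC---"))  # 4 matches, 8 gaps -> 0
print(alignment_score("TAGTCAC-G--", "-AG--ACTGTC"))    # 5 matches, 6 gaps -> 4
```

Under these assumed parameters the second alignment scores higher; changing the gap penalty changes the scores, which is why it is exposed as a parameter.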

  19. Gap Penalty for Sequence Alignment • is the gap penalty. • Suppose

  20. Example for Sequence Alignment TAGTCAC-G-- -AG--ACTGTC

  21. PAM250 Score Matrix for Protein Alignment

  22. MSA, ET and LCS • Multiple sequence alignment (MSA) • LCS • Phylogeny (evolutionary tree, ET)

  23. Hunt-Szymanski LCS Algorithm • By extending the idea of the RSK (Robinson-Schensted-Knuth) algorithm for the longest increasing subsequence problem, the LCS problem can be solved in O(r log n) time, where r denotes the number of matches. • This algorithm is faster than the traditional dynamic programming if r is small.

  24. The Pairs of Matching in Hunt-Szymanski Algorithm • Input sequences: TAGTCACG and AGACTGTC • Pairs of matching:

  25. Example for Hunt-Szymanski Algorithm • The insertion order is row major and column backward. • Time complexity: O(r log n), where r is the # of matches. Each match needs O(log n) time for binary search.
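
The idea on slides 23 to 25 can be sketched as a longest-increasing-subsequence computation over the matching pairs. This is a simplified illustration under stated assumptions, not the presenter's implementation: the function name `hunt_szymanski` and the `thresholds` list are names introduced here, and only the LCS length (plus the match count r) is returned.

```python
from bisect import bisect_left
from collections import defaultdict


def hunt_szymanski(s1: str, s2: str) -> tuple[int, int]:
    """Return (LCS length, number of matching pairs r)."""
    # positions[c] = increasing list of columns j with s2[j] == c.
    positions = defaultdict(list)
    for j, ch in enumerate(s2):
        positions[ch].append(j)

    # thresholds[k] = smallest column at which a common subsequence
    # of length k + 1 can end, given the rows processed so far.
    thresholds = []
    r = 0
    for ch in s1:                                  # row major
        for j in reversed(positions.get(ch, [])):  # column backward
            r += 1
            k = bisect_left(thresholds, j)         # O(log n) binary search
            if k == len(thresholds):
                thresholds.append(j)               # a longer subsequence is found
            else:
                thresholds[k] = j                  # a better (smaller) end column
    return len(thresholds), r


print(hunt_szymanski("TAGTCACG", "AGACTGTC"))  # (5, 16)
```

Scanning the columns of each row backward prevents two matches from the same row from being chained together. For the example strings, each of the four symbols occurs twice in both sequences, so r = 4 × 4 = 16.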

  26. Time and Space Complexities for LCS

  27. Block Edit Problems

  28. Motivation – Finding Similar Codes

  29. Block Edit Problems • Operations: block copy, block deletion and block move. • Shapira and Storer (2002) proved that the problem is NP-hard when recursive block-move operations are allowed. • Various approximations were proposed. • Our assumptions – Restricted edit sequence: • A series of edit operations is performed from left to right on the source string X. • No two block-edit operations are performed on overlapping regions of X.

  30. A Series of Block Edit Operations

  31. Restricted Edit Sequence (a) General (recursive) edit operations (b) Restricted edit sequence

  32. Definitions of the Problems (1/2) • Let P(o, c) denote a block edit problem: • o: a composition of block-edit operations • c: the class of cost measures • The block-copy operations: • External copy: copy a substring of $X$ to $W_i$ • Internal copy: copy a valid substring of $W_{i-1}$ to $W_i$ • Shifted copy: copy a shifted substring

  33. Definitions of the Problems (2/2) • The cost measures that can be chosen: • Constant cost: $p_{copy}$ • Linear cost: $p_s + k \times p_e$ • Nested cost: $p_{copy} + d_c(A, B)$ • Three problems are defined in our work: • P(EIS,C) • P(EI,L) • P(EI,N)
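
The three cost measures can be written out as small functions. This is only a sketch under explicit assumptions that the slides do not confirm: k is taken to be the length of the copied or deleted block, and $d_c(A, B)$ is taken to be the Levenshtein (character-edit) distance between the source block A and its edited copy B. All function names and the penalty values in the example are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def constant_cost(p_copy: int) -> int:
    """Constant cost: every block operation costs p_copy."""
    return p_copy


def linear_cost(p_s: int, p_e: int, k: int) -> int:
    """Linear cost: initial penalty p_s plus p_e per unit of k (assumed block length)."""
    return p_s + k * p_e


def nested_cost(p_copy: int, a: str, b: str) -> int:
    """Nested cost: p_copy plus the assumed character-edit distance d_c(A, B)."""
    return p_copy + levenshtein(a, b)


print(constant_cost(2))                # 2
print(linear_cost(3, 1, 5))            # 3 + 5 * 1 = 8
print(nested_cost(2, "ACGT", "AGGT"))  # 2 + 1 = 3
```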

  34. Problem 1 -- P(EIS,C) – External, Internal, Shifted, Constant • External and internal copies are allowed at constant cost. • Shifted copies are allowed at constant cost. • It can be solved by a straightforward DP algorithm in $O(nm^2 (n + m) |\Sigma|)$ time. • We propose an $O(nm)$ time DP algorithm with • $O(n + m^2)$ preprocessing time in the worst case • $O(n + m \log m)$ preprocessing time in the average case

  35. Recurrence DP Formula for P(EIS,C) • Straightforward implementation: $O(nm^2 (n + m) |\Sigma|)$ time.

  36. Functions and Operations (1) • Character operations: • Block deletions:

  37. Functions and Operations (2) • External copies: • Internal copies:

  38. Functions and Operations (3) • Shifted copies:

  39. Preprocessing for P(EIS,C) • For external copies: • Build a suffix tree $T(X^R \# Y^R \$)$ to find the common substrings between $X$ and $Y$. • For internal copies: • Build a suffix tree $T(Y^R)$ to find the valid common substrings to be copied from working string $W_i$ to $W_{i+1}$. • For shifted copies: • Compute the differential strings $X'$ and $Y'$ of $X$ and $Y$. • Find the valid common substrings for external / internal copies.

  40. Preprocessing - Suffix Trees

  41. Preprocessing – Longest Common Prefixes (LCP) and Suffix trees

  42. Finding and Maintaining the Range Minimum in Constant Time

  43. Problem 2 -- P(EI,L) – External, Internal, Linear • The cost of each copy or deletion consists of an initial penalty plus a linear extension penalty.

  44. Problem 3 -- P(EI,N) – External, Internal, Nested • The copied strings can be further edited with character-edit operations.

  45. Summary of Block Edit Problems

  46. LCS of Run-Length Encoded Strings

  47. LCS of Run-Length Encoded Strings • Run-length encoding (RLE) compression: aaaaabbbccccdd → a5b3c4d2 • Input: • RLE string X: length n, k runs • RLE string Y: length m, l runs • Output: • LCS between X and Y.
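
A minimal sketch of run-length encoding, reproducing the example on slide 47. The function names `rle_runs` and `rle_string` are assumptions introduced here; the textual form (symbol followed by run length) follows the slide's example.

```python
from itertools import groupby


def rle_runs(s: str) -> list[tuple[str, int]]:
    """Split s into maximal runs, returned as (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_string(s: str) -> str:
    """Render the runs as text: each symbol followed by its run length."""
    return "".join(f"{ch}{count}" for ch, count in rle_runs(s))


runs = rle_runs("aaaaabbbccccdd")
print(rle_string("aaaaabbbccccdd"))  # a5b3c4d2
print(len(runs))                     # 4 runs, for a string of length 14
```

In the slide's notation this string has length n = 14 and k = 4 runs.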

  48. Dark & Light Blocks • Divide the DP lattice into k × l blocks. • Dark blocks: matched blocks • Light blocks: mismatched blocks
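
The block decomposition can be illustrated by classifying each pair of runs: a block is dark when the two runs carry the same symbol and light otherwise. The sketch below only builds this k × l classification; it does not implement the block lemmas of the following slides. The names `runs` and `block_grid` and the example strings are assumptions for illustration.

```python
from itertools import groupby


def runs(s: str) -> list[tuple[str, int]]:
    """Maximal runs of s as (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def block_grid(x: str, y: str) -> list[list[str]]:
    """k x l grid: 'dark' where the run symbols match, 'light' otherwise."""
    x_runs, y_runs = runs(x), runs(y)
    return [["dark" if cx == cy else "light" for cy, _ in y_runs]
            for cx, _ in x_runs]


# X = a3b2 (k = 2 runs), Y = a1b3a2 (l = 3 runs).
for row in block_grid("aaabb", "abbbaa"):
    print(row)
# ['dark', 'light', 'dark']
# ['light', 'dark', 'light']
```

Each block covers (run length in X) × (run length in Y) cells of the DP lattice, so the grid has k × l entries regardless of the uncompressed lengths n and m.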

  49. Results of Bunke and Csirik (1995) • Lemma 1 (Dark block): • Lemma 2 (Light block): • Only the boundaries of the blocks are needed.

  50. Results of Liu et al. (2008) • A complex modified DP formula which computes the DP lattice row by row. • Only the bottom boundaries of the blocks are needed.
