Lecture 2 Pairwise Sequence Alignment
WHAT? • Given any two sequences (DNA or protein)
Seq 1: CATATTGCAGTGGTCCCGCGTCAGGCT
Seq 2: TAAATTGCGTGGTCGCACTGCACGCT
we want to know to what extent they are similar.
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
WHY? • Discover function • Study evolution • Find crucial features within a sequence • Identify cause of diseases
Discover function Sequences that are similar probably have the same function
Find crucial features in the genome • Regions that are strongly conserved between different sequences can indicate their functional importance [Figure: conservation scale, from High to Low]
Identify cause of disease • Comparison of sequences between individuals can detect changes that are related to diseases
Sickle Cell Anemia • Due to a single nucleotide substitution (an A swapped for a T), which causes valine to be inserted instead of glutamic acid in hemoglobin Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/
Healthy Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH
Diseased Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GGTGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH
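The single amino acid difference between the two beta globin protein records can be located with a position-by-position comparison. The following is a minimal sketch in Python that uses only the first 21 residues of each protein record; the helper name `differences` is illustrative and not part of the lecture.

```python
def differences(seq_a, seq_b):
    """Return (1-based position, letter in seq_a, letter in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("position-wise comparison needs sequences of equal length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# First 21 residues of the healthy and diseased beta globin records.
healthy = "MVHLTPEEKSAVTALWGKVNV"
diseased = "MVHLTPVEKSAVTALWGKVNV"

print(differences(healthy, diseased))  # [(7, 'E', 'V')]

# The same comparison on a glutamic acid codon and a valine codon.
print(differences("GAG", "GTG"))  # [(2, 'A', 'T')]
```

E is glutamic acid and V is valine, so the protein-level change is E to V; the position is counted with the initial methionine included.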
Sequence Modifications • Three types of changes, starting from TCAGT • Substitution (point mutation): TCCGT • Insertion: TCGAGT • Deletion: TCGT • Insertions and deletions are together called indels (replication slippage)
In order to align two sequences we need a quantitative model to evaluate similarity between sequences. How do we quantify sequence similarity? For example: A and A, score = 2; A and T, score = -1
Substitutions Only Model (not including indels) • Sequences compared base-by-base • Count the number of matches and mismatches • For example: Matches score +2, Mismatches score -1
TTCGTCGTAGTCGGCTCGACCTG
GTACGTCTAGCGAGCGTGATCCT
9 matches +18 • 14 mismatches -14 • Total score +4 • A weak match
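The count on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming the two 23-base sequences from the slide and its +2 / -1 scores; the function name `ungapped_score` is made up for illustration.

```python
MATCH, MISMATCH = 2, -1  # scores used on the slide


def ungapped_score(seq_a, seq_b):
    """Compare two equal-length sequences base by base, with no indels allowed.

    Returns (matches, mismatches, total score).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("the substitutions-only model needs equal-length sequences")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    mismatches = len(seq_a) - matches
    return matches, mismatches, matches * MATCH + mismatches * MISMATCH


seq1 = "TTCGTCGTAGTCGGCTCGACCTG"
seq2 = "GTACGTCTAGCGAGCGTGATCCT"
print(ungapped_score(seq1, seq2))  # (9, 14, 4)
```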
Including Indels • Create an 'alignment' • Count matches within alignment • Indels are scored as mismatches: -1
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
17 matches +34 • 2 mismatches -2 • 8 indels -8 • Total score +24 • A strong match
Choosing an Alignment • Many different alignments are possible • Should consider all possible alignments • Take the best score found • There may be more than one best alignment
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
Score: +24
-TTCGT-CGTAGTC-GGCTCG-ACCTG
GTAC-GTCTA-GCGAGCGT-GATCC-T
Score: 0
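Scoring a given gapped alignment is a column-by-column sum, which makes it easy to compare the two candidate alignments on this slide. The sketch below assumes the slide's scheme (match +2, everything else, including an indel column, -1); `alignment_score` is an illustrative name.

```python
def alignment_score(row_a, row_b, match=2, other=-1):
    """Score two aligned rows of equal length; '-' marks an indel."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        # A gap never equals a base, so indel columns fall into the 'other' case.
        score += match if a == b else other
    return score


good = ("TT-CGTCGTAGTCG-GC-TCGACC-TG", "GTACGTC-TAG-CGAGCGT-GATCCT-")
poor = ("-TTCGT-CGTAGTC-GGCTCG-ACCTG", "GTAC-GTCTA-GCGAGCGT-GATCC-T")

print(alignment_score(*good))  # 24
print(alignment_score(*poor))  # 0
```

Both rows pairs spell out the same two underlying sequences; only the placement of the gaps differs, which is why the alignment has to be searched for rather than simply read off.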
Why is it hard? Alignment requires an algorithm that performs a number of comparisons roughly proportional to the square of the average sequence length, n².
Dynamic Programming • A method for reducing a complex problem to a set of identical sub-problems • The best solution to one sub-problem is independent from the best solution to the other sub-problem
What does it mean? If a path from X→Z passes through Y, the best path from X→Y is independent of the best path from Y→Z
Global Sequence Alignment: Needleman-Wunsch
Sequences: A = ACGCTG, B = CATGT
          A   C   G   C   T   G
          1   2   3   4   5   6
  C   1
  A   2
  T   3
  G   4
  T   5                       Z
Example • Sequences: A = ACGCTG, B = CATGT • Match: +2, Other: -1 • Score of best alignment between AC and CATG: -1 • ...between ACG and CATG: 2 • ...between AC and CATGT: -2 • Calculate score between ACG and CATGT: ?
Example • Align the next letter in the sequences (3 above 5) • Insertion in the first sequence (del) (- above 5) • Insertion in the second sequence (3 above -)
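The three options on this slide correspond to the three terms of the usual Needleman-Wunsch recurrence. Written out, with F(i, j) standing for the best score for aligning the first i letters of A with the first j letters of B; the symbols F, s and g are notation introduced here rather than taken from the slides, and the boundary values are an assumption consistent with scoring every indel as -1:

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{align } a_i \text{ with } b_j \\
F(i,\,j-1) + g             & \text{align } b_j \text{ with a gap} \\
F(i-1,\,j) + g             & \text{align } a_i \text{ with a gap}
\end{cases}
\qquad
s(a,b) = \begin{cases} +2 & a = b \\ -1 & a \neq b \end{cases},
\quad g = -1,
\quad F(i,0) = i\,g,\; F(0,j) = j\,g .
```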
Example • Sequences: A = ACGCTG, B = CATGT • -1 from before plus -1 for mismatch of G against T: -2 • 2 from before plus -1 for mismatch of - against T: 1 • -2 from before plus -1 for mismatch of G against -: -3 • Cell gets the highest score of -2, 1, -3: 1
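The cell-by-cell calculation above can be repeated for the whole matrix. The following is a minimal sketch of the matrix fill, assuming the slide's scores (match +2, other -1) and assuming that the first row and column are initialised with one -1 per leading indel, which the slides do not state explicitly. The name `nw_matrix` is illustrative.

```python
def nw_matrix(a, b, match=2, other=-1):
    """Fill a Needleman-Wunsch score matrix.

    Columns follow sequence a, rows follow sequence b, so F[i][j] is the best
    score for aligning the first j letters of a with the first i letters of b.
    """
    rows, cols = len(b) + 1, len(a) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):          # leading gaps in b (assumed boundary)
        F[0][j] = j * other
    for i in range(1, rows):          # leading gaps in a (assumed boundary)
        F[i][0] = i * other
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[j - 1] == b[i - 1] else other)
            up = F[i - 1][j] + other      # letter of b against a gap
            left = F[i][j - 1] + other    # letter of a against a gap
            F[i][j] = max(diag, up, left)
    return F


F = nw_matrix("ACGCTG", "CATGT")
# AC/CATG, ACG/CATG, AC/CATGT, ACG/CATGT
print(F[4][2], F[4][3], F[5][2], F[5][3])  # -1 2 -2 1
print(F[-1][-1])  # 2
```

Under these assumptions the three neighbouring cells take the values -1, 2 and -2 used in the example, and the cell for ACG against CATGT comes out as 1.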
Example • Sequences: A = ACGCTG, B = CATGT • Neighbouring cell scores: -1, 2, -2
A
-

ACGCTG
------

-----
CATGT

A
C

AC
-C

ACG
-C-

ACGC
---C

ACGC
-C--

ACG
-CA

ACGCTG-
-C-ATGT

ACGCTG-
-CA-TGT

-ACGCTG
CATG-T-
Needleman-Wunsch Global Alignment • Compare one entire sequence against another • The global alignment score is in the bottom-right cell
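Once the matrix is filled, the alignments themselves are recovered by walking back from the bottom-right cell along every move that could have produced each cell's score. The sketch below is one way to do this, under the same assumptions as the example (match +2, other -1, first row and column initialised with -1 per indel); `nw_all_alignments` is a hypothetical helper name, and it enumerates every best alignment rather than stopping at the first.

```python
def nw_all_alignments(a, b, match=2, other=-1):
    """Return (best global score, sorted list of all best alignments)."""
    rows, cols = len(b) + 1, len(a) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * other
    for i in range(1, rows):
        F[i][0] = i * other
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[j - 1] == b[i - 1] else other
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + other,
                          F[i][j - 1] + other)

    found = []

    def walk(i, j, top, bottom):
        # top/bottom are built right to left, so reverse them at the origin.
        if i == 0 and j == 0:
            found.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0:
            sub = match if a[j - 1] == b[i - 1] else other
            if F[i][j] == F[i - 1][j - 1] + sub:
                walk(i - 1, j - 1, top + a[j - 1], bottom + b[i - 1])
        if i > 0 and F[i][j] == F[i - 1][j] + other:
            walk(i - 1, j, top + "-", bottom + b[i - 1])
        if j > 0 and F[i][j] == F[i][j - 1] + other:
            walk(i, j - 1, top + a[j - 1], bottom + "-")

    walk(len(b), len(a), "", "")
    return F[-1][-1], sorted(found)


score, alignments = nw_all_alignments("ACGCTG", "CATGT")
print(score)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

For longer sequences the number of equally good alignments can grow quickly, so practical tools usually report only one of them.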
Sequences: DorothyHodgkin, DorothyCrowfootHodgkin
Global alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN
Local alignment:
DOROTHY        HODGKIN
DOROTHY        HODGKIN
Local Alignment: Smith-Waterman • Best score for aligning part of sequences • Often beats global alignment score
Global Alignment
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT
Local Alignment
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
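The change from global to local alignment is small in code: every cell is floored at zero, so an alignment may start anywhere, and the answer is the largest cell rather than the bottom-right one. This is a minimal sketch of the Smith-Waterman score only (no traceback). Reusing match +2 and other -1 is an assumption carried over from the global example, since the slide does not give a scoring scheme for local alignment; `sw_score` is an illustrative name.

```python
def sw_score(a, b, match=2, other=-1):
    """Best local alignment score between any part of a and any part of b."""
    prev = [0] * (len(a) + 1)   # previous matrix row; row 0 is all zeros
    best = 0
    for ch_b in b:
        curr = [0]              # column 0 is always zero
        for j, ch_a in enumerate(a, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else other)
            up = prev[j] + other
            left = curr[j - 1] + other
            curr.append(max(0, diag, up, left))  # 0 = start a fresh alignment here
        best = max(best, max(curr))
        prev = curr
    return best


print(sw_score("ACGT", "TTACGTTT"))  # 8
print(sw_score("AAA", "TTT"))        # 0
```

Only two rows are kept in memory because the score alone is needed; recovering the aligned region would require the full matrix and a traceback that stops at the first zero.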
Global vs. Local alignment Alignment of two Genomic sequences >Human DNA CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA >Mouse DNA CATGCGTCTGACgctttttgctagcgatatcggactATCGATATA
Global vs. Local alignment Alignment of two Genomic sequences
Global Alignment
Human:CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
Mouse:CATGCGTCTGACgct---ttttgctagcgatatcggactATCGAT-ATA
      ****** *****         *     ***      *  ****** ***
Local Alignment
Human:CATGCGACTGAC
Mouse:CATGCGTCTGAC
Human:ATCGATCATA
Mouse:ATCGAT-ATA
Global vs. Local alignment Alignment of Genomic DNA and mRNA >Human DNA CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA >Human mRNA CATGCGACTGACATCGATCATA
Global vs. Local alignment Alignment of Genomic DNA and mRNA
Global Alignment
DNA: CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
mRNA:CATGCGACTGAC---------------------------ATCGATCATA
     ************                           **********
Local Alignment
DNA: CATGCGACTGAC
mRNA:CATGCGACTGAC
DNA: ATCGATCATA
mRNA:ATCGATCATA