Sequence Comparison
• Intragenic: self to self; find internal repeating units.
• Intergenic: compare two different sequences.
• Dotplot: visual alignment of two sequences.
• Multiple sequence alignment: two or more sequences.
Overview
• Why compare sequences
• Homology vs. identity/similarity
• DotPlots
• Scoring
  • Match
  • Mismatch
  • Gap penalty
• Global vs. local alignment
• Do the results make biological sense?
Why Align Sequences
• Identify conserved sequences.
• Identify elements that repeat in a single sequence.
• Identify elements conserved between genes.
• Identify elements conserved between species.
• Regulatory elements
• Functional elements
Underlying Hypothesis? EVOLUTION
• Based on the conservation of sequence during evolution, we can infer function.
Basic terms:
• Similarity: a measurable quantity. For proteins, it is applied using the concept of conservative substitutions.
• Identity: the percentage of aligned positions that are identical.
• Homology: a specific term indicating a relationship by evolution.
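Because identity is a measurable percentage, it can be computed directly from an alignment. The following is a minimal sketch, not a standard implementation: the function name `percent_identity` is hypothetical, gaps are assumed to be written as `-`, and the choice to divide by the full alignment length (gapped columns included) is an assumption, since other conventions exist.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same residue.

    Assumption: both strings are already aligned (equal length, '-' marks a gap),
    and the denominator is the number of alignment columns, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as identical only if it holds the same residue, not a gap.
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * identical / len(aligned_a)


print(percent_identity("GATCT", "GATTT"))  # 4 of 5 columns identical
```

Identity alone says nothing about homology; it is only the measurement from which an evolutionary relationship may be inferred.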
Basic terms:
• Orthologs: homologous sequences found in two or more species that have the same function (e.g. alpha-hemoglobin).
• Paralogs: homologous sequences found in the same species that arose by gene duplication (e.g. alpha- and beta-hemoglobin).
Pairwise comparison
• Dotplot
  • All-against-all comparison.
  • Every position is compared with every other position.
• Nucleic acids and proteins have polarity.
  • Typically only one direction makes biological sense.
  • 5’ to 3’, or amino terminus to carboxyl terminus.
DotPlot
• Dotplot: a matrix with one sequence across the top and the other down the side. Put a dot, or 1, wherever there is identity.
  G A T C T
G .
A   .
T     .   .
C       .
T     .   .
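The rule above (one sequence across the top, one down the side, a mark wherever the residues are identical) can be sketched in a few lines. This is an illustrative sketch only; the names `dot_matrix` and `render` are hypothetical, and the matrix stores 1 for identity and 0 otherwise.

```python
def dot_matrix(top: str, side: str) -> list[list[int]]:
    """One row per residue of `side`, one column per residue of `top`; 1 marks identity."""
    return [[1 if s == t else 0 for t in top] for s in side]


def render(top: str, side: str) -> str:
    """Draw the matrix as text, with a dot wherever the two residues are identical."""
    lines = ["  " + " ".join(top)]
    for residue, row in zip(side, dot_matrix(top, side)):
        lines.append(residue + " " + " ".join("." if cell else " " for cell in row))
    return "\n".join(lines)


print(render("GATCT", "GATCT"))
```

For GATCT against itself, the main diagonal is filled, and the two T residues also match each other off the diagonal, giving seven dots in total.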
Simple plot
• Window: the size of the sequence block used for comparison. In the previous example, window = 1.
• Stringency: the number of matches within a window required to score positive. In the previous example, stringency = 1 (an exact match was required).
Dot Plot
• Compare two sequences in every register.
• Vary the window size and stringency depending on the sequences being compared.
• For nucleotide sequences, typically start with window = 21 and stringency = 14.
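Comparing two sequences in every register with a window and a stringency might be sketched as follows. The function name `windowed_dotplot` is hypothetical, and recording each hit at the window's start position is an assumption (some programs place the dot at the window's center instead).

```python
def windowed_dotplot(seq_a: str, seq_b: str, window: int = 21, stringency: int = 14):
    """Return (i, j) pairs where the windows starting at seq_a[i] and seq_b[j]
    share at least `stringency` identical positions.

    Assumption: a hit is recorded at the window start, not the window center.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        block_a = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            block_b = seq_b[j:j + window]
            # Count identical residues position by position within the window.
            matches = sum(a == b for a, b in zip(block_a, block_b))
            if matches >= stringency:
                hits.append((i, j))
    return hits


# window = 1, stringency = 1 reduces to the simple identity dotplot.
print(windowed_dotplot("GATCT", "GATCT", window=1, stringency=1))
```

Raising the window and stringency removes isolated chance matches, which is why longer nucleotide sequences are usually plotted with a larger window than the simple example.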
DotPlot: WINDOW = 4; STRINGENCY = 2
GATCGTACCATGGAATCGTCCAGATCA
GATC + (4/4)
GATC - (0/4)
GATC - (0/4)
GATC + (2/4)
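The window = 4, stringency = 2 example can be reproduced by sliding the four-residue block GATC along the longer sequence and counting identities in each register. This is a sketch under assumptions: the name `score_registers` is hypothetical, and offsets are 0-based, which the slide does not specify.

```python
def score_registers(sequence: str, probe: str, stringency: int):
    """Score `probe` against every register of `sequence`.

    Returns (offset, matches, positive) tuples, where `positive` is True
    when the number of identical positions reaches `stringency`.
    """
    window = len(probe)
    results = []
    for offset in range(len(sequence) - window + 1):
        block = sequence[offset:offset + window]
        matches = sum(a == b for a, b in zip(block, probe))
        results.append((offset, matches, matches >= stringency))
    return results


for offset, matches, positive in score_registers("GATCGTACCATGGAATCGTCCAGATCA", "GATC", 2)[:5]:
    print(offset, f"{matches}/4", "+" if positive else "-")
```

With these assumptions, offsets 0, 1, and 2 score 4/4, 0/4, and 0/4, and offset 4 scores 2/4, which matches the values listed in the example; the slide itself does not state which registers it shows.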
Intragenic Comparison
• Rat Groucho gene
Intergenic Comparison
• Rat and Drosophila Groucho gene
Intergenic comparison
• The nucleotide sequence contains three domains.
• 50-350: strong conservation
  • An indel places the comparison out of register.
• 450-1300: slightly weaker conservation
• 1300-2400: strong conservation
Groucho
• These three coding regions correspond to apparent functional domains of the encoded protein.
Scoring Alignments
• Quality score:
  • Score x for a match, -y for a mismatch.
  • Penalty for:
    • Creating a gap
    • Extending a gap
Scoring Alignments
• Quality score:
  • Quality = [10 × (matches)] + [-1 × (mismatches)] - [(Gap Creation Penalty) × (# of gaps)]
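The quality formula, as far as it is written out above, can be evaluated for an existing alignment roughly as follows. This is a sketch under stated assumptions: the function name `quality` is hypothetical, gaps are written as `-`, a gap is counted once per run of consecutive `-` characters in either sequence, and the default `gap_creation_penalty=50` is an illustrative value, not one given in the text. Only the terms shown above are included.

```python
def quality(aligned_a: str, aligned_b: str,
            match_score: int = 10, mismatch_score: int = -1,
            gap_creation_penalty: int = 50) -> int:
    """Quality = 10*matches + (-1)*mismatches - gap_creation_penalty * number_of_gaps.

    Assumptions: inputs are already aligned ('-' marks a gap) and the
    default gap creation penalty is a hypothetical, illustrative value.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    in_gap_a = in_gap_b = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            # A new gap is created when a run of '-' starts in either sequence.
            if a == "-" and not in_gap_a:
                gaps += 1
            if b == "-" and not in_gap_b:
                gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
        in_gap_a = a == "-"
        in_gap_b = b == "-"
    return match_score * matches + mismatch_score * mismatches - gap_creation_penalty * gaps


print(quality("GATCT", "GATTT"))                             # 4 matches, 1 mismatch
print(quality("GA--CT", "GATTCT", gap_creation_penalty=5))   # 4 matches, 1 gap
```

Because a match is worth far more than a mismatch costs, the gap creation penalty is what keeps an aligner from opening a gap merely to gain one extra match.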