Sequence Alignment

Finding similarities between sequences (Chapter 11)

The Problem • you have a sequence and want to know how it relates to another, known sequence • Is it identical to another, known sequence? • Is it similar to another, known sequence? • If so, how similar, and is the similarity restricted to a few regions or spread over the whole sequence?

Methods - Conceptual • one way of looking at this problem is as a big motif search, i.e. treating one sequence as a motif and scanning the other to see whether it matches • Regular expressions • HMM • Scoring

Regular Expressions • not a workable approach • a simple regex scan probes only a small subset of all possible matches, the simplest being a perfect match • no way to quantify how similar two sequences are • not fast enough for searching big databases

HMM Methods • currently very powerful for finding subtle similarities between a number of sequences • conceptually complex, not easy to implement (although some tools are coming on line) • cannot easily be sped up, so not workable for searches of large databases, but a very good method for a small number of alignments

Quantitative Method • want to be able to assign a score to each potential alignment between two sequences, and find the best score • therefore, this is a maximization problem • we will go through the development of the current best quantitative alignment methods

Simple Example • have two nucleotide sequences:
ACGGTTGAATGC
CGATTCATGC
• by eye, you would probably get:
ACGGTTGAATGC
-CGATTCA-TGC

Three Important Points • maximized the number of exact matches • to do this had to add gaps to one or both of the sequences • allowed some mismatches

But what about alternatives:
ACGGTTGAATGC
-CGATTCA-TGC
-CGATT-CATGC
-CGATTC-ATGC
• Or
ACGG-TTG-AATGC
-CG-ATT-CA-TGC

Finding the “Best” Alignment • to call something the “best” you need to have some criterion • typically this involves a scoring scheme for assigning value to any alignment and then finding the alignment (out of all possible alignments) that has the maximum score based on that scoring scheme

Scoring Scheme • implicit in that simple, intuitive alignment are three general concepts: • you get points for matching • gapping has no cost • you didn’t subtract points for mismatch • however, cost-free gapping and mismatching are usually not optimal

Boundary Conditions • a very simple scheme gives you 1 point for each match, a negative score for a mismatch, zero for a gap • under this scheme the optimal score can always be obtained by adding unlimited gaps to capture every possible individual match, or at worst by pairing every mismatch with a gap
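The effect of cost-free gaps can be checked on the two hand alignments shown earlier. The sketch below is only an illustration: the function name `score_alignment`, the mismatch value of -1, and the alternative gap penalty of -2 are assumptions, not values given in the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=0):
    """Sum per-column scores for two gapped strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # column pairs a residue with a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# The two by-eye alignments of ACGGTTGAATGC and CGATTCATGC.
few_gaps = ("ACGGTTGAATGC", "-CGATTCA-TGC")
many_gaps = ("ACGG-TTG-AATGC", "-CG-ATT-CA-TGC")

# Free gaps (gap=0) versus an assumed gap penalty of -2.
free_few = score_alignment(*few_gaps)
free_many = score_alignment(*many_gaps)
penalized_few = score_alignment(*few_gaps, gap=-2)
penalized_many = score_alignment(*many_gaps, gap=-2)

print("free gaps:     ", free_few, free_many)
print("penalized gaps:", penalized_few, penalized_many)
```

With free gaps the heavily gapped alignment scores higher because every mismatch has been replaced by a pair of gaps; once gaps cost something, the ranking reverses.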

Gap Penalties • in biological terms, a gap in one sequence in an alignment is called an indel, short for insertion/deletion • reason: a priori you cannot tell whether extra sequence was inserted into one homologue, or sequence was removed from the other

in any case, a gap is a hypothesis that either an insertion or a deletion occurred, and such events are relatively uncommon • therefore, a penalty should be imposed for a gap, usually rather high to reflect the idea that some number of mismatches are more likely than an indel

Affine Gap Penalties • once a gap is hypothesized, its size is not well constrained, so the penalty for a gap of length n should be less than n × x, where x is the penalty for opening a gap • this yields a 2-parameter formula for the gap penalty, an affine gap penalty

Penalty = G + (n-1)L • G is the gap opening penalty • L is the gap extension penalty • n is the length of the gap • values for these parameters are empirically determined • depend on the scoring scheme that is in use • usually G>L
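As a quick check of the formula, the sketch below compares the affine penalty with a penalty that charges the full opening cost at every gap position. The parameter values (G = 5, L = 1) are made up for illustration; real values are determined empirically, as noted above.

```python
def affine_gap_penalty(n, gap_open, gap_extend):
    """Penalty = G + (n - 1) * L for a gap of length n >= 1."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (n - 1) * gap_extend


def linear_gap_penalty(n, per_position):
    """Charge the same cost for every gap position."""
    return n * per_position


G, L = 5, 1  # hypothetical values, chosen so that G > L
for length in (1, 2, 4, 8):
    print(length, affine_gap_penalty(length, G, L), linear_gap_penalty(length, G))
```

For a single-position gap the two penalties agree; for longer gaps the affine penalty grows more slowly, which is the intended behaviour.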

Scoring Matrix • when we were looking at the scanning window case, we assigned a number value to each amino acid and then summed them • for an alignment, we need to assign a scoring value to every position in an alignment

therefore, need a score for every pairwise combination of amino acids • gaps in either sequence are scored using a gap penalty • so, how do we get a scoring matrix? • simplest possibility is to assign 1 for a perfect match and 0 or -1 for every mismatch • not bad for nucleic acid sequences • terrible for protein sequences

Problems with Simple Identity Matrix • the model implicit in this simple matrix grossly oversimplifies the process by which two sequences diverge. • different amino acids convert to other amino acids with different frequencies • this phenomenon is based on • the chemical nature of the aa sidechain • the genetic code

Amino Acid Similarity • some side chains are chemically very similar, e.g. D and E, R and K, S and T, I and L and V • this similarity means that changes within these groups tend to have a smaller effect on protein structure and function than changes between the groups

therefore, we can score an S:T pairing as more of a match than an S:I pairing • conversely, an R:E pairing, or an R:L pairing should get a negative score, since they involve major changes in sidechain identity, which tend to be selected against • Note, this phenomenon is based on natural selection, not on susceptibility to mutational change

Genetic Code • the genetic code is a degenerate 3-letter nucleotide code that translates into the amino acid sequence of proteins. • some nucleotide changes will not alter the amino acid encoded by the codon that contains the change (degeneracy)

each amino acid is related to the other 19 amino acids by one or two or three nucleotide changes • therefore, you could score mismatches at the amino acid level based on the minimum number of nucleotide changes that would be required to interconvert the two residues • e.g. F:L = UUU:UUA, 1 change • e.g. W:M = UGG:AUG, 2 changes
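The minimum number of nucleotide changes between two amino acids can be computed directly from the standard genetic code. This is a minimal sketch; the helper names are hypothetical, and the table is the standard code written in the usual U, C, A, G ordering with `*` for stop codons.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated as UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def min_nucleotide_changes(aa1, aa2):
    """Smallest number of base differences over all codon pairs."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


print("F:L", min_nucleotide_changes("F", "L"))
print("W:M", min_nucleotide_changes("W", "M"))
```

A score based only on this count ignores side-chain chemistry, which is one reason such a priori schemes are considered oversimplified.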

How Do We Generate a Good Scoring Matrix? • a priori approaches are so oversimplified as to be misleading • therefore, need to extract information from real data so the scoring matrix reflects the real process underlying the comparison • therefore, extract information from “undeniable alignments”


Protein Scoring Matrices • two major sets: • PAM - Point Accepted Mutation matrix, based on differences between closely related proteins (Dayhoff et al. [1978] in Atlas of Protein Sequence and Structure) • BLOSUM - BLOcks SUbstitution Matrix, based on the BLOCKS database of local alignments with different similarities (Henikoff and Henikoff [1992] PNAS 89:10915-10919)

PAM Matrices • based on alignments of closely related sequences, mainly of antibody proteins • if there is no selection pressure, then the substitution matrix could be directly derived from the amino acid frequency (called the background frequency) • but the observed substitution frequencies (target frequencies) differ

that means that certain substitutions are more readily accepted as point mutations • from comparisons in which the overall sequence difference is <15%, constructed a matrix, PAM1, of the log of the ratio (target frequency/background frequency), normalized to reflect a 1% divergence • multiplying the underlying mutation probability matrix by itself generates the appropriate matrix for successively more divergent sequences
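The extrapolation step can be sketched on a toy three-letter alphabet. Everything numeric here is invented for illustration: the 1-step mutation probabilities, the background frequencies, and the 10·log10 scaling are assumptions, and the convention that `m[i][j]` is the probability of letter i being replaced by letter j is one of two equivalent choices.

```python
import math


def mat_mul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, steps):
    """Multiply the mutation probability matrix by itself `steps` times."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


def log_odds(m, background):
    """Score = 10 * log10(target probability / background frequency)."""
    n = len(m)
    return [[10 * math.log10(m[i][j] / background[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical 1-step mutation probabilities; each row sums to 1.
toy_m1 = [
    [0.990, 0.008, 0.002],
    [0.010, 0.985, 0.005],
    [0.005, 0.010, 0.985],
]
toy_background = [0.5, 0.3, 0.2]  # hypothetical background frequencies

toy_m250 = mat_power(toy_m1, 250)
scores_1 = log_odds(toy_m1, toy_background)
scores_250 = log_odds(toy_m250, toy_background)
print(scores_1[0])
print(scores_250[0])
```

After many steps the diagonal probabilities fall and the off-diagonal ones rise, so the derived scores become more tolerant of substitutions, which is why larger PAM numbers suit more divergent sequences.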

typically people use the PAM250 matrix, but others are available • appropriate scoring matrix is the one that reflects the amount of divergence between sequences, so the more divergent the proteins being aligned the larger the PAM number should be

BLOSUM Matrices • PAM matrices for higher divergence are extrapolated by repeatedly multiplying a mutation matrix derived from highly similar sequences • BLOSUM matrices, in contrast, are derived directly from alignments in the BLOCKS database of multiple alignments

blocks are chosen to have different levels of divergence; more divergent blocks yield a matrix that is appropriate for aligning more divergent proteins • hence, BLOSUM is based on observed target frequencies at different levels of divergence, rather than extrapolation from very similar sequence alignments

Finding the “Best” Alignment • we have a scoring algorithm, a scoring matrix and gap penalties • we operationally define the best alignment as that which yields the highest score when the scoring scheme is applied to it • How do we find the best score?

Simple Method • construct all possible alignments • determine the score for each of them • select the one(s) with the highest score

Problem • how many alignments can you make with two sequences of lengths m and n? • the number is so large (of the order of 3^(mn)) as to make the problem computationally intractable for most cases of interest • so, need an algorithm that will allow you to find the maximum score without evaluating every alignment
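The exact count depends on what is treated as a distinct alignment. One common convention, assumed in this sketch, counts every distinct sequence of columns, where each column is a residue pair, a gap in the top sequence, or a gap in the bottom sequence. The function name is hypothetical.

```python
def count_alignments(m, n):
    """Count global alignments of sequences of lengths m and n.

    The last column is either a residue pair, a top residue over a gap,
    or a gap over a bottom residue, which gives a three-term recurrence.
    """
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = (table[i - 1][j - 1]   # residue pair
                           + table[i - 1][j]     # gap in bottom sequence
                           + table[i][j - 1])    # gap in top sequence
    return table[m][n]


for length in (1, 2, 3, 10, 50):
    print(length, count_alignments(length, length))
```

Even under this convention the count grows exponentially with sequence length, so enumerating and scoring every alignment is impractical for sequences of realistic size.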

Dot Plot • place the two sequences along the two axes of a table (grid) • simple algorithm • mark with a dot each cell whose row and column elements match • pattern should show diagonals where regions of sequence match
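The simple dot plot takes only a few lines. This is a minimal text-only sketch with hypothetical function names; a real tool would draw the grid graphically.

```python
def dot_plot(seq_a, seq_b):
    """Grid of booleans: rows follow seq_a, columns follow seq_b."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(plot):
    """Draw matching cells as '*' and all other cells as '.'."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in plot
    )


print(render(dot_plot("ACGGTTGAATGC", "CGATTCATGC")))
```

With a four-letter alphabet many cells match by chance, so the diagonals of real similarity sit in a noisy background.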

Improved Dot Plot • instead of using single site comparisons, compare equal sized windows using a scoring matrix. • graph dots with an intensity that is a function of the score between the windows. • implemented in Peptool
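A windowed version of the dot plot might look like the sketch below. The default identity scoring function and the window size of 3 are assumptions for illustration; a substitution matrix lookup could be passed in as `score` instead. This is not a description of how Peptool implements it.

```python
def identity_score(a, b):
    """Stand-in for a scoring matrix lookup: 1 for a match, else 0."""
    return 1 if a == b else 0


def windowed_dot_plot(seq_a, seq_b, window=3, score=identity_score):
    """Score equal-sized windows starting at every pair of positions.

    Cell [i][j] holds the summed score of seq_a[i:i+window] against
    seq_b[j:j+window]; it would set the dot intensity in a plot.
    """
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [sum(score(seq_a[i + k], seq_b[j + k]) for k in range(window))
         for j in range(cols)]
        for i in range(rows)
    ]


for row in windowed_dot_plot("ACGGTTGAATGC", "CGATTCATGC"):
    print(" ".join(str(value) for value in row))
```

Summing over a window suppresses isolated chance matches, so high values cluster along the diagonals where the sequences are similar over a stretch.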

Dot Plot Between Proteins • if you use two different sequences on the dot plot, get a graphical indication of where they align and where they do not • for very similar sequences, get a diagonal with varying intensity along its length quantitatively indicating the sequence similarity along the alignment

Dot Plot Comparison • when the proteins are more different, get diagonal line segments, indicating regions of similarity • gaps between lines indicate areas of low sequence similarity • offsets between the segments indicate different lengths of the low-similarity regions

Connection • dot plot is a heuristic for visualizing the quantitative relationship between two aligned sequences • implicit in the dot matrix are all possible alignments • each global alignment can be represented by tracing a path through the matrix

Dynamic Programming • a class of algorithms that can rapidly find an optimal solution to a problem if that problem can be broken down into a set of sub-problems that can also be optimized • does a good job of finding optimal paths through graphs, which is one way of looking at the alignment problem

General idea: • If you have a prefix alignment of length i then there are only three possibilities for lengthening that alignment: • The next elements of each sequence are aligned with each other • The next element of the top sequence is aligned with a gap • The next element of the bottom sequence is aligned with a gap
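The three possibilities translate directly into a dynamic programming recurrence for global alignment. The sketch below is one way to write it, under stated assumptions: a simple match/mismatch score instead of a scoring matrix, a fixed per-position gap penalty instead of an affine one, and illustrative parameter values.

```python
def global_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned top, aligned bottom)."""
    m, n = len(top), len(bottom)

    def pair_score(i, j):
        return match if top[i - 1] == bottom[j - 1] else mismatch

    # score[i][j] = best score for aligning top[:i] with bottom[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align both elements
                score[i - 1][j] + gap,                   # top element vs gap
                score[i][j - 1] + gap,                   # bottom element vs gap
            )

    # Trace back from the corner to recover one optimal alignment.
    out_top, out_bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_top.append(top[i - 1])
            out_bottom.append(bottom[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_top.append(top[i - 1])
            out_bottom.append("-")
            i -= 1
        else:
            out_top.append("-")
            out_bottom.append(bottom[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(out_top)), "".join(reversed(out_bottom))


best, aligned_top, aligned_bottom = global_alignment("ACGGTTGAATGC", "CGATTCATGC")
print(best)
print(aligned_top)
print(aligned_bottom)
```

The table has (m+1) × (n+1) cells and each cell looks at three neighbours, so the work grows with the product of the sequence lengths rather than with the number of possible alignments. When several alignments tie for the best score, the traceback returns only one of them.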
