Pairwise alignment algorithms (Haverford College, Fall 2001)

Pairwise alignment algorithms(Haverford College, Fall 2001) Elisabetta Manduchi manduchi@pcbi.upenn.edu phone: 215-573-4408

References • Ewens and Grant, Statistical Methods in Bioinformatics: An Introduction, Springer-Verlag, New York, 2001 Sections 6.2 and 6.4 (all subsections) See also problems 6.1-6.5 Background: appendix B, sections B.1-B.9 • Most textbooks on Computational Biology/Bioinformatics cover these topics

Algorithms: definition Webster’s definition: “a procedure for solving a mathematical problem in a finite number of steps that frequently involves a repetition of an operation; or broadly: a step-by-step procedure for solving a problem or accomplishing some end”

Time and Space Complexity • Time Complexity or Running Time of an algorithm: number of steps. • Space Complexity: amount of temporary storage required to run the algorithm. These are functions of the input size. They refer to the worst case scenario.

Example • Given a list of numbers x1,x2,…,xn, where each xi is 0 or 1, find the first 0, if any, in this list and change it to a 1. • For i=1,2,…,n, if xi=0, set xi=1. Stop. • Input size is n. • Total number of steps? Depends on the input, e.g. 1 for 0,1,1,1 3 for 1,1,0,0 • In the worst case scenario, the total number of steps is n. This is the time complexity of this algorithm.

The O notation • Ignore constant factors when trying to evaluate time or space complexity. • Definition: Let f and g be positive-valued functions of n. We say that f(n)=O(g(n)) as n if there exist constants c0 and N such that, for all nN, we have f(n)cg(n). • That is, if for sufficiently large values of n, f(n) is no bigger than some fixed non-negative constant times g(n).

Examples • 2n3=O(n3) • 10n4+15=O(n4)=O(n5)=… • a polynomial of degree s is O of any polynomial of degree t, for ts. In particular it is O(nt). • If a>1 and if g is a polynomial, then f(n)=an is NOT O(g(n))

Substitutions, Insertions, Deletions • Mutation: one of • switch from one nucleotide to another • insertion • deletion • Substitution: a switch in nucleotides which spreads throughout most of a species. • Substitutions, insertions and deletions passed along two independent lines of descent cause a divergence of the two sequences from the original (and from each other): cgggtatccaa cggtatgcca ccctaggtccca

Questions • Given a collection of sequences. Do they descend from the same “ancestor sequence”? • If so, how did they diverge from their ancestor? I.e. how should they be aligned?

Example • For the previous example cggtatgcca cgggtatccaa , ccctaggtccca, the two descendent sequences align as follows c g g g t a - - t - c c a a c c c - t a g g t c c c - a • “-” (indel) represents an insertion or deletion. • A sequence of  consecutive indels is called a gap of length  .

Alignments • A means of comparing sequences to answer the questions stated above. • Types of alignments: • global/local • gapped/ungapped • pairwise/multiple • In what follows we will focus on pairwise alignments.

Alignments (cont.) • Given two sequences, find an “optimal” alignment between them and use it to answer the questions stated above. • What is an “optimal” alignment? • Need a way to compare alignments. • Attach a score to each alignment. • This should reflect the likelihood that this alignment was produced as a consequence of divergence from a common ancestor.

Scoring schemes • Given a scoring scheme, • an optimal alignment between two sequences is one with the best score (there might be more than one optimal alignment). • the score of the sequence pair is such a best score. • Using the scores of sequence pairs one can: • investigate the hypothesis that two sequences diverged from a common ancestor • use the alignment of a pair of sequences that are judged to be related in order to discover common patterns. • by comparing scores among different species, get information to help reconstruct the phylogenetic tree that relates them all.

Types of scores • Similarity Scores: the higher the score, the more closely related are the two aligned sequences. • Distance scores (or distance measures): the lower the score, the more closely related the sequences. In what follows we will use similarity score.

Using substitution matrices and gap penalties • Score of an alignment = sum of individual scores, one for each aligned pair of residues, and a score (penalty) for each gap. • A substitution matrix s is a (symmetric) matrix where s(X,Y) is the score assigned to the aligned pair of residues X and Y. • A gap penalty is usually a function of the length  of a gap: (). The simplest example is the linear gap model() = -d, for some non-negative linear gap penalty d.

Substitution Matrices • A 44 (NA) or a 2020(AA) symmetric matrix. • Examples: • see table 6.7. • s(X,Y)=1 if X=Y, -1 otherwise. • In what follows we will assume that a scoring scheme, consisting of a substitution matrix and a gap penalty function, is given. • Substitution matrices (such as BLOSUM and PAM matrices) are discussed in detail in section 6.5 of the Ewens and Grant book and will also be discussed next time.

Example • Let s(X,Y)=1 if X=Y, -1 otherwise and use a linear gap score with d=2. Then the score of the alignment c t t a g - g - - c a t - g a g a a is 1 –1 +1 -2 +1 -2 +1 -2  2 = -5

Gapped Global Pairwise Alignments(linear gap model) Given: • a scoring scheme described by a substitution matrix and a linear gap penalty • two sequences of lengths m and n: x=X1X2…Xm and y=Y1Y2…Yn Goal: • find the score of this sequence pair (i.e. the score of an optimal gapped global alignment between x and y) • find an optimal alignment of this type.

Naïve approach Exhaustive search: • List all possible global gapped alignments of x and y. • For each such alignment, compute its score using the given scoring scheme. • Find the maximum of the scores and the corresponding alignment(s).

Time complexity of the naïve approach • Let c(m,n) be the number of global gapped alignments between two sequences of lengths m and n respectively. • How does c grow with m and n? • Let g(m,n) be the number of groups obtained by grouping together those alignments that have the same combination of aligned residue pairs, ignoring indels. e.g. X1 X2 X3 - X4and X1 X2 - X3 X4 - Y1 - Y2 Y3 - Y1 Y2 - Y3

Time complexity of the naïve approach (cont.) • Then g(m,n)c(m,n). • But • The RHS equals (see problem 6.1). • This number grows quite fast with m and n.

Time complexity of the naïve approach (cont.) • To get a sense of how fast this number grows, assume m=n. • Then (use Stirling’s approximation B.5). • For example

Needleman-Wunsch algorithm (1970)Gotoh’s version (1982) • This is an example of dynamic programming algorithm: • break the problem into sub-problems of the same kind • build the final solution using the solutions for the sub-problems.

Notation • s = substitution matrix (for brevity, we will write s(i,j) for s(Xi,Yj)), d = linear gap penalty • x=X1X2…Xm and y=Y1Y2…Yn: sequences to align For i=1,2,…,m and j=1,2,…,n, let: • x1,i=X1X2…Xi, y1,j=Y1Y2…Yj • B(i,j)= score of the pair (x1i,,y1,j) • B(i,0)=-id, B(0,j)=-jd, B(0,0)=0 Get an (m+1)(n+1) matrix B, whose entryB(m,n) is the desired score of the pair (x,y)

The algorithm • Fill in the elements of B recursively, proceeding from top left to bottom right. • A highest scoring alignment between x1,i andy1,j could terminate in one of 3 ways: Xi Xi - Yj - Yj • In the first case, B(i,j)= B(i-1,j-1)+s(i,j) • In the second case, B(i,j)= B(i-1,j)-d • In the third case, B(i,j)= B(i,j-1)-d

Recursive formula B(i,j) = max{B(i-1,j-1)+s(i,j), B(i-1,j)-d, B(i,j-1)-d} Thus, to compute B(m,n), the running time is O(mn). To find an alignment that has this score, we must keep track, at each step of the recursion, of the choice made to get the max.

Example • x=gaatct, y=catt (m=6 and n=4) • s(X,Y)=1 if X=Y, -1 otherwise • d=2

Example (cont.) • 3 optimal alignments: gaatct gaatct gaatct c-at-t ca-t-t -cat-t

Fitting one sequence into another(linear gap model) Given: • a scoring scheme described by a substitution matrix and a linear gap penalty • two sequences of lengths m and n: x=X1X2…Xm and y=Y1Y2…Yn, with m n. Goal: • find the score of an optimal (gapped) alignment of x with a subsequence of y. • find an optimal alignment of this type.

Restating the goal Notation: • s = substitution matrix (for brevity, we will write s(i,j) for s(Xi,Yj)), d = linear gap penalty • for 1 i m and 1 k j n, let x1,i=X1X2…Xi, and yk,j=YkY2…Yj • for two sequences u and v, let B(u,v) be the score of the pair (u,v) Goal: find max{B(x,yk,j): 1 k j n}

Naïve approach • For each choice of k and j, with 1 k jn, apply the NW algorithm to the sequences x and yk,j to find B(x,yk,j). Then find the max. • Running time to find the optimal score is therefore O(mn3): • there are choices for k and j • for each such choice, the running time of NW to find B(x,yk,j) is O(m(j-k)).

A better approach • For i=1,2,…,m and j=1,2,…,n, let F(i,j)=max{B(x1,i,yk,j): k=1,2,…,j} • Then max{B(x,yk,j): 1 k j n} equals max{F(m,j): j=1,2,…,n}, since max{B(x,yk,j): 1 k j n}= max{B(x1,m,yk,j): 1 k j n}= max {max{B(x1,m,yk,j): k=1,2,…,j }: j=1,2,…,n}= max{F(m,j): j=1,2,…,n}

Computing the matrix F • Initialize • F(i,0)=-id, for i=1,2,…,m • F(0,j)=0, for j=0,1,…,n, since deletions of the beginning of y should be without penalty • Fill in the elements of F recursively, proceeding from top left to bottom right. • Look at the last row of the (m+1)(n+1) matrix F. The max value on this row is the desired score. • Note that there might be more than one cell on this row at which the max is attained.

Recursive formula F(i,j)=max{F(i-1,j-1)+s(i,j), F(i-1,j)-d, F(i,j-1)-d} Thus, to compute max{F(m,j): j=1,2,…,n}, the running time is O(mn). To find an alignment that has this score, we must keep track, at each step of the recursion, of the choice made to get the max.

Example • x=atg, y=gaatct (m=3 and n=6) • s(X,Y)=1 if X=Y, -1 otherwise • d=2

Example (cont.) • One optimal fit: a t g g a a t c t

Local alignments(linear gap model) Given: • a scoring scheme described by a substitution matrix and a linear gap penalty • two sequences of lengths m and n: x=X1X2…Xm and y=Y1Y2…Yn. Goal: • find the score of an optimal (gapped) alignment of a subsequence of x with a subsequence of y. • find an optimal alignment of this type.

Assumption • In what follows we make the assumption that the scoring scheme we use is such that the expected score for a random alignment is negative. • If this assumption did not hold, then long matches between subsequences could score highly just because of their lengths, so that two long unrelated subsequences could give a highest scoring alignment. We do not want this to occur.

Restating the goal Notation: • s = substitution matrix (for brevity, we will write s(i,j) for s(Xi,Yj)), d = linear gap penalty • for 1h im and 1 k j n, let xh,i=XhX2…Xi, and yk,j=YkY2…Yj • for two sequences u and v, let B(u,v) be the score of the pair (u,v) Goal: find max{B(xh,i,yk,j): 1 h i m, 1 k j n}, when this is non-negative.

Naïve approach • For each choice of h, i, k and j, with 1 h i m and 1 k j n, apply the NW algorithm to the sequences xh,i and yk,j to find B(xh,i,yk,j). Then find the max. • Running time to find the optimal score is therefore O(m3n3): • there are choices for h and i and choices for k and j • for each such choice, the running time of NW to find B(xh,i,yk,j) is O((i-h)(j-k)).

A better approach:the Smith-Waterman algorithm (1981) • For i=1,2,…,m and j=1,2,…,n, let L(i,j)=max{0, B(xh,i,yk,j): h=1,2,…, i, k=1,2,…,j} • Then max{0, B(xh,iyk,j): 1 h i m, 1 k j n} equals max{L(i,j): i=1,2,…,m, j=1,2,…,n}, since max{0, B(xh,i,yk,j): 1 h i m, 1 k j n}= max {max{0, B(xh,i,yk,j): h=1,2,…,i, k=1,2,…,j }: i=1,2,…,m, j=1,2,…,n}= max{L(i,j): i=1,2,…,m, j=1,2,…,n}

Computing the matrix L • Initialize • L(i,0)=0, for i=1,2,…,m • L(0,j)=0, for j=0,1,…,n, since deletions at the beginning or end of our two sequences should be without penalty • Fill in the elements of L recursively, proceeding from top left to bottom right. • Look at the cell of the (m+1) (n+1) matrix L. The max value is the desired score. • Note that there might be more than one cell in L at which this max is attained.

Recursive formula L(i,j)=max{0, L(i-1,j-1)+s(i,j), L(i-1,j)-d, L(i,j-1)-d} Thus, to compute max{L(i,j): i=1,2,…,m, j=1,2,…,n} the running time is O(mn). To find an alignment that has this score, we must keep track, at each step of the recursion, of the choice made to get the max.

Example X2 X3 - X4 X5 Y5 Y6 Y7 Y8 Y9

Other gap models • Linear gap model: • simple • but often not appropriate for biological sequences: often it is harder for a gap to open than it is for it to extend • penalize additional gap steps a little less than the first one • use a more complicated gap cost () • need to adjust the recurrence relations in the dynamic programming algorithms

Other gap models (cont.) E.g., for global alignments, need to distinguish among alignments of x1,iand y1,j that end with Xi aligned to an indel • the score of such an alignment will also depend on how many symbols in x immediately preceding Xi are aligned to indels • if the last symbol not aligned to an indel preceding Xi is Xk, then the i-k symbols from Xk+1 to Xi are aligned to indels, and the score of a highest scoring alignment is B(k, j)+(i-k)

Recursive formula for a general gap model(global alignment) B(0,0)=0, B(i,0)=(i), B(0,j)=(j) B(i,j)= max{B(i-1,j-1)+s(i,j), B(k,j)+(i-k): k=0,1,…,i-1, B(i,k)+(j-k): k=0,1,…,j-1} Finding B(m,n) has a running time of O(m2n+mn2) as, for each cell, one must consider i+j+1 previous ones, not just 3.

Special case: affine gap model • ()=-d-(-1)e, 0<e<d; d=gap open penalty, e=gap extension penalty • Trick to get a running time of O(mn): use 3 matrices • S(i,j) holds the score of a highest scoring alignment between x1,iand y1,j , given that the alignment ends with Xi aligned to Yj • Ix(i,j) holds the score of a highest scoring alignment between x1,iand y1,j , given that the alignment ends with Xi aligned toan indel. • Iy(i,j) holds the score of a highest scoring alignment between x1,iand y1,j , given that the alignment ends with and indel aligned to Yj.

Initialization • S(0,0)= Ix(0,0)= Iy(0,0)=0 • S(0,j)= Ix(0,j)= -d-(j-1)e • S(i,0)= Iy(i,0)= -d-(i-1)e We assume that a deletion will not be followed directly by an insertion

Recursive formulae S(i,j) = max{S(i-1,j-1)+s(i,j ), Ix(i-1,j-1)+s(i,j), Iy(i-1,j-1)+s(i,j)} Ix(i,j)= max{S(i-1,j)-d, Ix(i-1,j)-e} Iy(i,j)= max{S(i,j-1)-d, Iy(i,j-1)-e} The score of the pair (x,y) is then max{S(m,n), Ix(m,n), Iy(m,n)}

Pairwise alignment algorithms (Haverford College, Fall 2001)

Pairwise alignment algorithms (Haverford College, Fall 2001)

Presentation Transcript

Developing Pairwise Sequence Alignment Algorithms

Gene Expression Arrays (Haverford College, Fall 2001)

Pairwise sequence Alignment

Pairwise Alignment

Pairwise sequence Alignment

Pairwise sequence alignment

Pairwise alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment

Pairwise profile alignment

Pairwise alignment of DNA/protein sequences

Algorithms for Pairwise Sequence Alignment

Pairwise Sequence Alignment

Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms

Pairwise alignment

Developing Pairwise Sequence Alignment Algorithms

Pairwise alignment

Pairwise Sequence Alignment

Pairwise alignment

Pairwise sequence alignment