Sequence analysis

Sequence analysis How to locate rare/important sub-sequences.

Sequence Analysis Tasks • Representing sequence features, and finding sequence features using consensus sequences and frequency matrices • Sequence features • Features following an exact pattern- restriction enzyme recognition sites • Features with approximate patterns • promoters • transcription initiation sites • transcription termination sites • polyadenylation sites • ribosome binding sites • protein features

Representing uncertainty in nucleotide sequences • It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is “possible” at a given position • to express ambiguity during sequencing • to express variation at a position in a gene during evolution • to express ability of an enzyme to tolerate more than one base at a given position of a recognition site

Representing uncertainty in nucleotide sequences • To do this for nucleotides, we use a set of single character codes that represent all possible combinations of bases • This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code • Given the size of the amino acid “alphabet”, it is not practical to design a set of codes for ambiguity in protein sequences

The I.U.B. Code • A, C, G, T, U • R = A, G (puRine) • Y = C, T (pYrimidine) • S = G, C (Strong hydrogen bonds) • W = A, T (Weak hydrogen bonds) • M = A, C (aMino group) • K = G, T (Keto group) • B = C, G, T (not A) • D = A, G, T (not C) • H = A, C, T (not G) • V = A, C, G (not T/U) • N = A, C, G, T/U (iNdeterminate) X or - are sometimes used

Definitions • A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function • A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature • Consensus sequences are regular expressions

Finding occurrences of consensus sequences • Example: recognition site for a restriction enzyme • EcoRI recognizes GAATTC • AccI recognizes GTMKAC • Basic Algorithm • Start with first character of sequence to be searched • See if enzyme site matches starting at that position • Advance to next character of sequence to be searched • Repeat previous two steps until all positions have been tested

Block Diagram for Search with a Consensus Sequence Consensus Sequence (in IUB codes) Search Engine List of positions where matches occur Sequence to be searched

Statistics of pattern appearance • Goal: Determine the significance of observing a feature (pattern) • Method: Estimate the probability that that pattern would occur randomly in a given sequence. Three different methods • Assume all nucleotides are equally frequent • Use measured frequencies of each nucleotide (mononucleotide frequencies) • Use measured frequencies with which a given nucleotide follows another (dinucleotide frequencies)

Determining mononucleotide frequencies • Count how many times each nucleotide appears in sequence • Divide (normalize) by total number of nucleotides • Result: fAmononucleotide frequency of A (frequency that A is observed) • Define: pAmononucleotide probability that a nucleotide will be an A • pA assumed to equal fA

Determining dinucleotide frequencies • Make 4 x 4 matrix, one element for each ordered pair of nucleotides • Zero all elements • Go through sequence linearly, adding one to matrix entry corresponding to the pair of sequence elements observed at that position • Divide by total number of dinucleotides • Result: fACdinucleotide frequency of AC (frequency that AC is observed out of all dinucleotides)

Determining conditional dinucleotide probabilities • Divide each dinucleotide frequency by the mononucleotide frequency of the first nucleotide • Result: p*ACconditional dinucleotide probability of observing a C given an A • p*AC = fAC/ fA

Illustration of probability calculation • What is the probability of observing the sequence feature ART?A followed by a purine, (either A or G), followed by a T? • Using equal mononucleotide frequencies • pA = pC = pG = pT = 1/4 • pART = 1/4 * (1/4 + 1/4) * 1/4 = 1/32

Illustration (continued) • Using observed mononucleotide frequencies: • pART = pA (pA + pG) pT • Using dinucleotide frequencies: • pART = pA (p*AAp*AT + p*AGp*GT)

Another illustration • What is pACT in the sequence TTTAACTGGG? • fA = 2/10, fC = 1/10 • pA = 0.2 • fAC = 1/10, fCT = 1/10 • p*AC = 0.1/0.2 = 0.5, p*CT = 0.1/0.1 = 1 • pACT = pA p*AC p*CT = 0.2 * 0.5 * 1 = 0.1 • (would have been 1/5 * 1/10 * 4/10 = 0.008 using mononucleotide frequencies)

Expected number and spacing • Probabilities are per nucleotide • How do we calculate number of expected features in a sequence of length L? • Expected number (for large L)  Lp • How do we calculate the expected spacing between features? • ART expected spacing between ART features = 1/pART

Renewals • For greatest accuracy in calculating spacing of features, need to consider renewals of a feature (taking into account whether a feature can overlap with a neighboring copy of that feature) • For example what is the frequency of GCGC in : ACTGCATGCGCGCATGCGCATATGACGA

Renewals • We define a renewal as the end of a non overlapping motif. • For example: The renewals of GCGC in ACTGCATGCGCGCATGCGCATATGCGCGCGC Are at 11,19,27,31 The clamps size are: 2,1,2,1

Renewals and Clump size. • Let R be a general pattern: R=(r1,…,rm) • Let us denote: R(i)=(r1,…,ri) R(i)=(rm-i+1,…,rm) • The clamp size is:

Clamp Frequency • Let us assume that the clamps are distributed randomly. Their frequency, and the interval between any two clamps would be:

Statistical tests • In order to test if the motif is over/under represented or non-uniformly distributed we must test the clamp distribution. • In order to test motif frequency we can test if the clamp frequency has an average and variance of nl • In order to test their distribution, we can divide the entire sequence into k subsequences of size: m<T<<1/land test that S has a c2 distribution, where Tiis the clump frequency in the subsequence and S is:

Frequency of simple motifs

Statistics of AT- or GC-rich regions • What is the probability of observing a “run” of the same nucleotide (e.g., 25 A’s) • Let px be the mononucleotide probability of nucleotide x • The per nucleotide probability of a run of N consecutive x’s is pxN • The probability of occurrence in a sequence of length L much longer than N is ≈ L pxN

Statistics of AT- or GC-rich regions • What if J “mismatches” are allowed? • Let py be the probability of observing a different nucleotide (normally py = 1 - px) • The probability of observing n-j of nucleotide x and j of nucleotide y in a region of length n is

Statistics of AC- or GC-rich regions • As before, we can multiply by L to approximate the probability of observing that combination in a sequence of length L • Note that this is the probability of observing exactlyN-J matches and exactlyJ mismatches. We may also wish to know the probability of finding at least N-J matches, which requires summing the probability for I=0 to I=J.

Frequency matrices

Frequency matrices • Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences • Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature

Weight matrix • Probabilistic model: How likely is each letter at each motif position? A C G T

Nomenclature Weight matrices are also known as • Position-specific scoring matrices • Position-specific probability matrices • Position-specific weight matrices

Scoring a motif model • A motif is interesting if it is very different from the background distribution A C G T less interesting more interesting

Relative entropy • A motif is interesting if it is very different from the background distribution • Use relative entropy*: pi, = probability of  in matrix position i b = background frequency (in non-motif sequence) * Relative entropy is sometimes called information content.

Scoring motif instances • A motif instance matches if it looks like it was generated by the weight matrix A C G T “ A C G G C G C C T” Not likely! Hard to tell Matches weight matrix

Log likelihood ratio • A motif instance matches if it looks like it was generated by the weight matrix • Use log likelihood ratio • Measures how much more like the weight matrix than like the background. i: the character at position i of the instance

Alternating approach • Guess an initial weight matrix • Use weight matrix to predict instances in the input sequences • Use instances to predict a weight matrix • Repeat 2 & 3 until satisfied. Examples: Gibbs sampler (Lawrence et al.) MEME (expectation max. / Bailey, Elkan) ANN-Spec (neural net / Workman, Stormo)

Expectation-maximization foreach subsequence of width W convert subsequence to a matrix do { re-estimate motif occurrences from matrix re-estimate matrix model from motif occurrences } until (matrix model stops changing) end select matrix with highest score EM

Sample DNA sequences >ce1cg TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC AAAAATGGAAGTCCACAGTCTTGACAG >ara GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAG AAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTG CTATGCCATAGCATTTTTATCCATAAG >bglr1 ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATA TAACTTTATAAATTCCTAAAATTACACAAAGTTAATAAC TGTGAGCATGGTCATATTTTTATCAAT >crp CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTAC AGTAATACATTGATGTACTGCATGTATGCAAAGGACGTC ACATTACCGTGCAGTACAGTTGATAGC

Motif occurrences >ce1cg taatgtttgtgctggtttttgtggcatcgggcgagaata gcgcgtggtgtgaaagactgttttTTTGATCGTTTTCAC aaaaatggaagtccacagtcttgacag >ara gacaaaaacgcgtaacaaaagtgtctataatcacggcag aaaagtccacattgattaTTTGCACGGCGTCACactttg ctatgccatagcatttttatccataag >bglr1 acaaatcccaataacttaattattgggatttgttatata taactttataaattcctaaaattacacaaagttaataac TGTGAGCATGGTCATatttttatcaat >crp cacaaagcgaaagctatgctaaaacagtcaggatgctac agtaatacattgatgtactgcatgtaTGCAAAGGACGTC ACattaccgtgcagtacagttgatagc

Starting point …gactgttttTTTGATCGTTTTCACaaaaatgg… T T T G A T C G T T A 0.17 0.17 0.17 0.17 0.50 ... C 0.17 0.17 0.17 0.17 0.17 G 0.17 0.17 0.17 0.50 0.17 T 0.50 0.50 0.50 0.17 0.17

Re-estimating motif occurrences TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA T T T G A T C G T T A 0.17 0.170.170.17 0.50 ... C 0.17 0.17 0.17 0.17 0.17 G 0.17 0.17 0.17 0.50 0.17 T 0.50 0.50 0.50 0.170.17 Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...

Scoring each subsequence Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA Subsequences Score TGTGCTGGTTTTTGT 2.95 GTGCTGGTTTTTGTG 4.62 TGCTGGTTTTTGTGG 2.31 GCTGGTTTTTGTGGC ... Select from each sequence the subsequence with maximal score.

Re-estimating motif matrix Occurrences TTTGATCGTTTTCAC TTTGCACGGCGTCAC TGTGAGCATGGTCAT TGCAAAGGACGTCAC Counts A 000132011000040 C 001010300200403 G 020301131130000 T 423001002114001

Adding pseudocounts Counts + Pseudocounts A 111243122111151 C 112121411311514 G 131412242241111 T 534112113225112 Counts A 000132011000040 C 001010300200403 G 020301131130000 T 423001002114001

Converting to frequencies Counts + Pseudocounts A 111243122111151 C 112121411311514 G 131412242241111 T 534112113225112 T T T G A T C G T T A 0.13 0.13 0.13 0.25 0.50 ... C 0.13 0.13 0.25 0.13 0.25 G 0.13 0.38 0.13 0.50 0.13 T 0.63 0.38 0.50 0.13 0.13

Amino acid weight matrices • A sequence logo is a scaled position-specific A.A. distribution. Scaling is by a measure of a position’s information content.

Sequence logos • A visual representation of a position-specific distribution. Easy for nucleotides, but we need colour to depict up to 20 amino acid proportions. • Idea: overall height at position l proportional to information content (2-Hl); proportions of each nucleotide ( or amino acid) are in relation to their observed frequency at that position, with most frequent on top, next most frequent below, etc..

Summary of motif detection

Block Diagram for Searching with a PSSM PSSM Sequences that match above threshold PSSM search Threshold Set of Sequences to search Positions and scores of matches

Block Diagram for Searching for sequences related to a family with a PSSM Set of Aligned Sequence Features PSSM builder Expected frequencies of each sequence element PSSM Sequences that match above threshold PSSM search Threshold Positions and scores of matches Set of Sequences to search

Consensus sequences vs. frequency matrices • Should I use a consensus sequence or a frequency matrix to describe my site? • If all allowed characters at a given position are equally "good", use IUB codes to create consensus sequence • Example: Restriction enzyme recognition sites • If some allowed characters are "better" than others, use frequency matrix • Example: Promoter sequences • Advantages of consensus sequences: smaller description, quicker comparison • Disadvantage: lose quantitative information on preferences at certain locations

Similarity Functions • Used to facilitate comparison of two sequence elements • logical valued (true or false, 1 or 0) • test whether first argument matches (or could match) second argument • numerical valued • test degree to which first argument matches second

Sequence analysis