Repeats, Pseudogenes etc.

Repeats, Pseudogenes etc. Lecture 11/4. Paper : Wexler et al. Finding Approximate Tandem Repeats in Genomic Sequences Recomb 2004. Goal. Find approximate tandem repeat (ATR) A string of bases repeated consecutively at least twice with small differences between instances. ATR hunter.

Presentation Transcript


  1. Repeats, Pseudogenes etc. Lecture 11/4

  2. Paper: Wexler et al. • Finding Approximate Tandem Repeats in Genomic Sequences • RECOMB 2004

  3. Goal • Find approximate tandem repeats (ATRs) • An ATR is a string of bases repeated consecutively at least twice, with small differences between instances

  4. ATR hunter • Screening phase (detection of candidates) • Verification of candidates

  5. Definitions of ATR • Simple ATR: a concatenation of sequences T = T1T2…Tr for which there exists a sequence T* such that score(Ti,T*) > n for all i • score(·,·) is an alignment score • A neighboring ATR is one where score(Ti,Ti+1) > n for all i • A pairwise ATR is one where score(Ti,Tj) > nij for all i,j, with nij a monotonically decreasing function of |i-j|

  6. Algorithm: Screening phase • Want to find repeats of size (period) t • Look for matching pairs of words of length l < t • Fig 1 (on whiteboard) • For each pair of l-windows, compute 0/1 vector of length l • Such a vector is a “q-quality” vector iff fraction of 1s is >= q
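
The 0/1 vector and the q-quality test on this slide can be sketched in a few lines of Python. This is only an illustration: Fig 1 is not in the transcript, so it assumes the vector holds a 1 wherever the two l-windows carry the same base, and the function names `match_vector` and `is_q_quality` are hypothetical.

```python
def match_vector(seq, i, j, l):
    """0/1 vector for the l-windows starting at i and j.

    Assumption: an entry is 1 where the two windows have the same base.
    """
    return [1 if seq[i + k] == seq[j + k] else 0 for k in range(l)]


def is_q_quality(vec, q):
    """True iff the fraction of 1s in vec is at least q."""
    return sum(vec) >= q * len(vec)


seq = "ACGTACGAACGT"
vec = match_vector(seq, 0, 4, 4)   # compares ACGT with ACGA
print(vec, is_q_quality(vec, 0.75))
```

With l = 4 and q = 0.75, one mismatch in the window is still tolerated; raising q to 0.8 would reject the same pair.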

  7. Definitions • Given l, the length of the scanning window, and q, the quality • Score St(i) is the number of q-quality vectors in the t-length string beginning at position i • Gap Deltat(i) is the maximal number of consecutive l-length windows that are not q-quality, in the t-length string beginning at i • Fig 2 (whiteboard)
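
A minimal sketch of the score and gap definitions, assuming (for simplicity, and unlike the full method) that every l-window is compared with the window exactly t positions downstream, so indels are ignored. The helper names `window_flags` and `score_and_gap` are hypothetical.

```python
def window_flags(seq, i, t, l, q):
    """One flag per l-window in the t-length string starting at i.

    A flag is True when the window is q-quality against the window
    t positions downstream (fixed shift, no indels: an assumption).
    """
    flags = []
    for k in range(t - l + 1):
        a, b = i + k, i + k + t
        matches = sum(seq[a + m] == seq[b + m] for m in range(l))
        flags.append(matches >= q * l)
    return flags


def score_and_gap(flags):
    """Return (score, gap): number of True flags and longest run of False."""
    score = sum(flags)
    gap = run = 0
    for f in flags:
        run = 0 if f else run + 1
        gap = max(gap, run)
    return score, gap


# A perfect repeat of period 8: every window matches, so the gap is 0.
print(score_and_gap(window_flags("ACGTTGCA" * 2, 0, 8, 4, 1.0)))
```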

  8. Screening: Criteria • A t-length string starting at position i is a candidate ATR (along with an adjacent string of length about t) if • St(i) > Threshold 1 • Deltat(i) < Threshold 2 • For every q-quality vector counted in St(i), two matched l-windows are at a maximum distance of some Threshold 3 • Due to indels, we allow matching pairs of l-windows to be closer together or farther apart than t
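
The three criteria combine into a single test, sketched below. It reads Threshold 3 as a bound on how far the distance between matched windows may drift from t, which is an interpretation taken from the later Threshold 3 slide; the function name and the threshold values in the example are hypothetical.

```python
def is_candidate(score, gap, distances, t, thr1, thr2, thr3):
    """Apply the three screening criteria to one t-length string.

    distances: distance between the two matched l-windows for every
    q-quality vector counted in the score (assumed to be recorded).
    """
    return (
        score > thr1
        and gap < thr2
        and all(abs(d - t) <= thr3 for d in distances)
    )


print(is_candidate(5, 1, [8, 9, 8], t=8, thr1=3, thr2=2, thr3=1))
```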

  9. Screening: algorithm • Start with a pair of l-windows at positions 1 and t+1 • In each step slide the first window by 1, and slide the second window by 0, 1, or 2 • Greedily maximize the number of q-quality windows produced this way • Upper bounds on q and l chosen to ensure that the desired alignment score is exceeded • (See definition of pairwise ATR)
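
One way to read the sliding step is sketched below. The slide does not say how ties are broken, so this version assumes a shift of 1 (no indel) is tried first, then 0 and 2, and falls back to 1 when no shift gives a q-quality pair. Positions are 0-based and `greedy_screen` is a hypothetical name, not the authors' code.

```python
def greedy_screen(seq, t, l, q, steps):
    """Slide two l-windows along seq, recording quality and distance."""

    def quality(a, b):
        if b + l > len(seq):
            return False
        return sum(seq[a + m] == seq[b + m] for m in range(l)) >= q * l

    a, b = 0, t
    flags, distances = [], []
    for _ in range(steps):
        flags.append(quality(a, b))
        distances.append(b - a)
        a += 1
        # Assumed tie-break: prefer no indel, then a deletion, then an insertion.
        for shift in (1, 0, 2):
            if quality(a, b + shift):
                b += shift
                break
        else:
            b += 1
    return flags, distances


# Second copy carries one inserted base, so the distance grows from 8 to 9.
print(greedy_screen("ACGTGGCA" + "ACGTCGGCA" + "TTTT", 8, 4, 1.0, 5))
```

In the example the windows lose the match while they straddle the inserted base and pick it up again one position farther apart.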

  10. Verification phase • Explicit alignment of the two strings to check that alignment threshold crossed • Building longer ATRs out of smaller ones

  11. Statistical Framework • Determination of thresholds

  12. Threshold 3 • For every q-quality vector counted in St(i), two matched l-windows are at a maximum distance of some Threshold 3 • Distance not always t, due to indels happening • Distance may be d1, d2, d3, etc. • How much fluctuation to allow is the Threshold 3 • As in Benson’s TRF (previous lecture) • Random walk with probability pI = probability of insertion or deletion
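
The random-walk view can be made concrete with a small dynamic program. The sketch assumes that at each position the distance changes by +1 or -1 with probability pI/2 each and stays put otherwise, and that the threshold is chosen to cover a fixed share of the probability mass; both the symmetric split and the `coverage` parameter are assumptions, not details from the slides.

```python
def displacement_distribution(steps, p_indel):
    """Distribution of the net drift after `steps` positions."""
    dist = {0: 1.0}
    moves = ((0, 1 - p_indel), (1, p_indel / 2), (-1, p_indel / 2))
    for _ in range(steps):
        nxt = {}
        for d, p in dist.items():
            for delta, w in moves:
                nxt[d + delta] = nxt.get(d + delta, 0.0) + p * w
        dist = nxt
    return dist


def fluctuation_threshold(steps, p_indel, coverage=0.95):
    """Smallest d such that |drift| <= d with probability >= coverage."""
    dist = displacement_distribution(steps, p_indel)
    d = 0
    while sum(p for x, p in dist.items() if abs(x) <= d) < coverage:
        d += 1
    return d


print(fluctuation_threshold(steps=50, p_indel=0.1))
```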

  13. Threshold 1 • Number of q-quality vectors of length l, appearing in a random sequence of length t, drawn uniformly • This probability distribution used to decide threshold on score St(i) • Hard to compute this distribution analytically • Approximated

  14. Approximating the score distribution • Build a Markov Chain • State vi : represents all l-length strings of Hamming weight i, beginning with 0 • State v’i : represents all l-length strings of Hamming weight i, beginning with 1 • Transition probability among various vi and v’i computed • Fig 3 (whiteboard)

  15. Dynamic Programming • Compute the probability distribution of number of q-vertices (vq or v’q) visited in a random walk of length t-l • Dynamic programming.
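
The Markov chain and the dynamic program from the last two slides can be sketched together. Fig 3 is not in the transcript, so the transition probabilities here are an assumption: each new bit is 1 with probability p, and the bit that becomes the new first bit is 1 with probability (w - b)/(l - 1), where w is the current weight and b the current first bit. A state counts as a q-vertex when its weight is at least q·l. The function name is hypothetical and l must be at least 2.

```python
from math import ceil, comb


def score_distribution(t, l, q, p):
    """Approximate distribution of the number of q-quality l-windows
    in a random 0/1 string of length t (each bit is 1 with prob. p)."""
    need = ceil(q * l - 1e-9)          # minimum weight of a q-quality window
    dist = {}                          # (weight, first bit, count) -> prob.
    for w in range(l + 1):
        pw = comb(l, w) * p**w * (1 - p) ** (l - w)
        for b in (0, 1):
            pb = w / l if b else 1 - w / l
            if pw * pb > 0:
                dist[(w, b, int(w >= need))] = pw * pb
    for _ in range(t - l):             # random walk of length t - l
        nxt = {}
        for (w, b, k), pr in dist.items():
            r = w - b                  # weight left after dropping first bit
            for x, px in ((1, p), (0, 1 - p)):
                for b2, pb2 in ((1, r / (l - 1)), (0, 1 - r / (l - 1))):
                    if px * pb2 == 0:
                        continue
                    w2 = r + x
                    key = (w2, b2, k + int(w2 >= need))
                    nxt[key] = nxt.get(key, 0.0) + pr * px * pb2
        dist = nxt
    out = [0.0] * (t - l + 2)
    for (_, _, k), pr in dist.items():
        out[k] += pr
    return out


probs = score_distribution(t=30, l=8, q=0.75, p=0.25)
print(sum(probs[3:]))                  # chance of 3 or more q-quality windows
```

A threshold on St(i) could then be read off the tail of this distribution, for example the smallest score that random sequence reaches with probability below some chosen level.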

  16. Performance compared to TRF (Benson) • In synthetic data set, about 10% more ATRs discovered • Also compared to TRF on real data, more repeats found

  17. Pseudogenes: Coin & Durbin • Novel methods for separating pseudogenes from functional genes • Unprocessed pseudogenes: result of gene duplication, and loss of function of one copy • Processed pseudogenes: due to reverse transcription of processed mRNA • lack introns

  18. Pseudogene loss • If pseudogenes are lost very quickly, we expect few pseudogenes in a genome • Prokaryotes have few pseudogenes • Eukaryotes have many pseudogenes • ~20,000 human pseudogenes

  19. Pseudogene detection • Detect truncations in genes • Ratio of synonymous to non-synonymous substitution rate • Approach in this paper: • Pattern of substitution in conserved protein domains • Profile HMMs to model protein domains

  20. Program PSILC • Given an alignment A, an unrooted tree T, profile HMM D representing a protein domain aligned to A • Output: for each leaf-node n, a score representing our belief that the node is a pseudogene • Assume that the rest of the tree evolves as the protein domain would

  21. Two scores • Final branch to node n evolved as neutral (non-coding) OR as a protein domain • Final branch to node n evolved as protein-coding OR as a protein domain • Log odds ratio • If a node is a pseudogene, it does not have the protein domain constraint, so both scores should be higher than usual

  22. Terminology • A : alignment • T : tree • Xn* : row n, i.e., the sequence at node n • X*i : column i, i.e., the ith position of all sequences • Fig 4 (whiteboard)

  23. Terminology • Probability that evolution on branch b in the tree is due to: • neutral DNA : Pnuc(b) • protein-coding : Pprot(b) • protein domain encoding : Pdom(b)

  24. Terminology • Cnuc = {Pnuc(bn), Pdom(T\bn)} : neutral DNA on bn, otherwise domain encoding • Cprot = {Pprot(bn), Pdom(T\bn)} : protein-coding on bn, otherwise domain encoding • Cdom = Pdom(T) = {Pdom(bn), Pdom(T\bn)} : domain encoding on all T

  25. Scores • PSILCnuc/dom(n) = • PSILCprot/dom(n) = • Each computed in a manner similar to Felsenstein’s algorithm • Fig 5 (whiteboard)
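
The right-hand sides of the two scores did not survive in the transcript. Given the "log odds ratio" wording and the constraint sets Cnuc, Cprot and Cdom defined on the earlier slides, one plausible reading (an assumption, not recovered slide content) is:

```latex
\mathrm{PSILC}_{nuc/dom}(n) = \log \frac{P(A \mid T, C_{nuc})}{P(A \mid T, C_{dom})}
\qquad
\mathrm{PSILC}_{prot/dom}(n) = \log \frac{P(A \mid T, C_{prot})}{P(A \mid T, C_{dom})}
```

Under this reading a leaf that fits the neutral or generic protein-coding model better than the domain model gets a positive score, consistent with the earlier remark that both scores should be higher for a pseudogene.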

  26. Likelihood calculation • Compute prob. distr. at parent node pn given the entire tree T, except node n (assume domain-encoding evolution) • Compute probability of parent pn mutating to leaf n, given whatever evolutionary constraint Ck

  27. First step: Rest of the tree • Reroot the tree at parent pn and remove branch to node n. New tree is T\bn. • Fig 6 (whiteboard) • Product of two terms: • Probability of leaves of tree T\bn given root • Felsenstein’s algorithm • Prior probability of root of T\bn • Use equilibrium distribution
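
Felsenstein's algorithm on the rerooted tree can be sketched for a single alignment column as below. To keep the example short it swaps in the Jukes-Cantor substitution model with a uniform equilibrium distribution as the root prior, rather than the profile-HMM-based domain model used in the lecture; the nested-list tree encoding and the function names are hypothetical.

```python
from math import exp

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability of a -> b on a branch of length t.

    A simple stand-in; the lecture uses other models on the branches.
    """
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional(node):
    """P(leaves below node | base at node), for each possible base.

    A node is either a leaf base such as "A" or a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == node else 0.0 for x in BASES}
    like = {x: 1.0 for x in BASES}
    for child, t in node:
        sub = conditional(child)
        for x in BASES:
            like[x] *= sum(jc_prob(x, y, t) * sub[y] for y in BASES)
    return like


def site_likelihood(root, prior=None):
    """Prior over the root base times P(leaves | root), summed over bases."""
    prior = prior or {x: 0.25 for x in BASES}
    cond = conditional(root)
    return sum(prior[x] * cond[x] for x in BASES)


tree = [("A", 0.1), ([("A", 0.2), ("G", 0.2)], 0.05)]
print(site_likelihood(tree))
```

The two factors on the slide map directly onto `conditional` (probability of the leaves given the root) and the `prior` argument (equilibrium distribution at the root).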

  28. Second step (and part of first): The branch mutation model • P(xchild,i|xparent,i,bchild,Pk(bchild)) • Phylogenetic models available • neutral DNA evolution (Pnuc) : HKY model • protein-coding evolution (Pprot) : WAG model • domain-encoding evolution (Pdom) : profile HMM match state emission probabilities • These give us the rate matrix Q • Pk(t) = exp(Qrt) • Free rate parameter r
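
As an illustration of Pk(t) = exp(Qrt) for the neutral DNA case, the sketch below builds an HKY rate matrix and exponentiates it with a plain Taylor series plus scaling and squaring. The base frequencies and kappa in the example are made-up values, Q is left unnormalized, and the helper names are hypothetical; a numerical library's matrix exponential would normally be used instead.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(pi, kappa):
    """HKY rate matrix: rate i -> j is pi[j], times kappa for transitions."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = pi[j] * (kappa if (a, b) in TRANSITIONS else 1.0)
        q[i][i] = -sum(q[i])           # rows of a rate matrix sum to zero
    return q


def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(m, squarings=10, terms=12):
    """Matrix exponential via Taylor series with scaling and squaring."""
    n = len(m)
    scaled = [[v / 2**squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def transition_matrix(q, r, t):
    """P(t) = exp(Q r t), with r the free rate parameter."""
    return expm([[v * r * t for v in row] for row in q])


pi = [0.3, 0.2, 0.2, 0.3]              # illustrative base frequencies
P = transition_matrix(hky_rate_matrix(pi, kappa=2.0), r=1.0, t=0.5)
print([round(sum(row), 6) for row in P])
```

Each row of the result is a probability distribution over the child base given the parent base, which is the quantity P(xchild,i | xparent,i, …) on the slide.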

  29. Tests • On human, mouse, rat data • PSILCprot/dom outperforms all others, including PSILCnuc/dom
