
Comparative genomics to identify DNA binding motifs

Comparative genomics to identify DNA binding motifs. Saurabh Sinha Dept. of Computer Science University of Illinois, Urbana-Champaign. Outline. Binding sites and motifs The motif finding problem in one species Comparative genomics and alignment


Presentation Transcript


  1. Comparative genomics to identify DNA binding motifs Saurabh Sinha Dept. of Computer Science University of Illinois, Urbana-Champaign

  2. Outline • Binding sites and motifs • The motif finding problem in one species • Comparative genomics and alignment • The motif finding problem with comparative genomics

  3. Motif finding in multiple species • Footprinter : the approach without alignments • PhyloCon : The use of alignments • PhyME & PhyloGibbs : The use of alignments and an evolutionary model • MCS : Genome-wide motif finding from multiple species

  4. Binding sites and motifs

  5. Binding sites • A few binding sites of transcription factor “Bicoid” in the Drosophila (fruitfly) genome, collected experimentally

  6. http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif

  7. Motif: T A A T C C C (source: http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif)

  8. Motif as a “consensus string”: W A A T C C N, where W = T or A and N = A, C, G, or T (source: http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif)
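To make the consensus-string idea concrete, here is a minimal Python sketch that scans a sequence for matches to a degenerate consensus such as WAATCCN. The function name `find_consensus_matches` and the test sequence are invented for illustration; the symbol table covers only a subset of the standard IUPAC nucleotide codes.

```python
import re

# Degenerate symbols mapped to regex character classes (subset of IUPAC codes).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]",
    "N": "[ACGT]",
}

def find_consensus_matches(consensus, sequence):
    """Return (start, matched_site) for every match, including overlapping ones."""
    pattern = "".join(IUPAC[symbol] for symbol in consensus)
    # Wrapping the pattern in a lookahead lets finditer report overlapping hits.
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]

# Hypothetical test sequence, not taken from the Bicoid data.
matches = find_consensus_matches("WAATCCN", "GGTAATCCCAA")
print(matches)
```

Because every position is either matched or not, a consensus string gives a yes/no answer per site, which is what keeps the search space enumerable.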

  9. Motif • Common sequence “pattern” in the binding sites of a transcription factor • A succinct way of capturing variability among the binding sites

  10. Alternative way to represent motif Position weight matrix (PWM) Or simply, “weight matrix”

  11. Motif representation • Consensus string: may allow “degenerate” symbols in the string, e.g., N = A/C/G/T; W = A/T; S = C/G; R = A/G; Y = T/C; tractable search space, suited to enumerative algorithms • Position weight matrix: more powerful representation; probabilistic treatment and algorithms; more popular
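As a rough illustration of the position weight matrix representation, the sketch below builds column-wise base frequencies from a set of aligned sites and scores a candidate window by its log-odds against a uniform background. The sites are hypothetical (they are not the Bicoid sites from the figure), and the names `build_pwm` and `log_odds_score` and the uniform background of 0.25 are assumptions made for this example.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.0):
    """Column-wise base frequencies for a list of equal-length sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def log_odds_score(pwm, window, background=0.25):
    """Sum of log2(f / background); -inf if any base has frequency 0."""
    score = 0.0
    for column, base in zip(pwm, window):
        f = column[base]
        if f == 0:
            return float("-inf")
        score += math.log2(f / background)
    return score

# Hypothetical binding sites, invented for illustration only.
sites = ["TAATCCC", "AAATCCG", "TAATCCA", "TAATCCT"]
pwm = build_pwm(sites)
print(pwm[0])                           # frequencies in the first column
print(log_odds_score(pwm, "TAATCCC"))   # score of one candidate window
```

Unlike the consensus string, the matrix assigns a graded score to every window, which is what makes the probabilistic treatment possible. In practice a nonzero `pseudocount` is often used so that unseen bases do not produce a score of negative infinity.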

  12. The motif finding problem (in one species) • Suppose a transcription factor (TF) regulates five different genes • Each of the five genes should have binding sites for the TF in its promoter region [Figure: Gene 1 through Gene 5, each with binding sites for the TF marked in its promoter]

  13. The motif finding problem • Now suppose we are given the promoter regions of the five genes G1, G2, … G5 • Can we find the binding sites of the TF without knowing about them a priori? • Binding sites are similar to each other, but not necessarily identical • This is the motif finding problem: to find a motif that represents the binding sites of an unknown TF

  14. Motif finding algorithms • Version 1: Given promoter regions of co-regulated genes, find the motif • Existing algorithms: Gibbs sampling (MCMC): Lawrence et al. 1993; MEME (Expectation-Maximization): Bailey & Elkan 1994; CONSENSUS (greedy local search, beam search): Hertz & Stormo; word enumeration methods (with emphasis on statistical accuracy): van Helden et al. 1998, Sinha & Tompa 2000; and a hundred others
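The simplest member of the word enumeration family can be sketched in a few lines of Python: count, for every k-mer, how many promoters contain it. This is only the enumeration step, under the assumption that exact k-mers are counted; the published methods cited above additionally assess each word's statistical significance against a background model, which this sketch does not attempt. The promoter sequences and the name `count_kmer_support` are invented for illustration.

```python
from collections import Counter

def count_kmer_support(promoters, k):
    """For each k-mer, count how many promoters contain it at least once."""
    support = Counter()
    for seq in promoters:
        # A set ensures each promoter contributes at most 1 per k-mer.
        kmers_here = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers_here)
    return support

# Hypothetical promoter fragments with a shared word planted in each.
promoters = ["GCATAATCCGTA", "TTGACTAATCCA", "CATAATCCTGGC"]
support = count_kmer_support(promoters, k=6)
print(support.most_common(1))
```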

  15. Comparative Genomics

  16. More Data • Genomes of multiple species available • Example: species1 GCGTGATCGAGCTATAACGGAA; species2 CTGTGATCGTCGGGTAACGCCC; species3 TGGTGATCGGAACCCCTAACGA; species4 AAGTGATCGATTATCCTAACGT [Figure: the four sequences related by an evolutionary tree, with blocks of conservation highlighted]
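The conserved blocks in this four-species example can be spotted with a very small Python sketch that lists the k-mers present in every sequence. This is an alignment-free shortcut chosen for illustration, not a sequence alignment algorithm; the function name `shared_kmers` and the choice k = 5 are assumptions.

```python
def shared_kmers(sequences, k):
    """Return the set of k-mers that occur in every sequence."""
    common = None
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        common = kmers if common is None else common & kmers
    return common

# The four example sequences from the slide.
species = [
    "GCGTGATCGAGCTATAACGGAA",
    "CTGTGATCGTCGGGTAACGCCC",
    "TGGTGATCGGAACCCCTAACGA",
    "AAGTGATCGATTATCCTAACGT",
]
print(sorted(shared_kmers(species, 5)))
```

The shared words overlap to form the block GTGATCG near the start of each sequence, and TAACG appears further downstream at slightly different offsets, which is why a real alignment would need gaps to line it up.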

  17. Using multiple genomes • Functional parts of the genome evolve more slowly than non-functional parts • Identify conserved parts by sequence alignment algorithms • Look for functional features in conserved regions; this improves the signal • A popular paradigm in computational biology

  18. Multiple sequence alignment • Comparative genomics relies upon the ability to detect “similar” (evolutionarily related) regions in different genomes • The problem of multiple species alignment • A hard computational problem (“NP-hard”) • Several fast heuristics exist (MLAGAN, TBA) • Assume this functionality exists …

  19. Back to motif finding [Figure: Gene 1 through Gene 5, each with binding sites for the TF marked in its promoter]

  20. Motif finding from multiple species data • Version 2: Given promoter regions of the same gene from multiple species, find the motif [Figure: upstream regions of Gene G in Species 1 through Species 5, with binding sites for the TF marked]

  21. One approach • Do multiple sequence alignment of the upstream regions of the gene • Look for recurring motifs in conserved blocks [Figure: upstream regions of Gene G in Species 1 through Species 5, with blocks of conservation marked]

  22. Another approach (alignment-free) • What if binding sites are not entirely within conserved blocks? • Look for recurring motifs in entire upstream regions [Figure: upstream regions of Gene G in Species 1 through Species 5, with blocks of conservation marked]

  23. Footprinter (Blanchette et al.): The method without alignments

  24. Footprinter • The input sequences are promoter regions of the same gene, but from multiple species. • Such sequences are said to be “orthologous” to each other.

  25. Footprinter • Input sequences, related by an evolutionary tree • Find motif

  26. A side note: Parsimony • A guiding principle in cross-species comparison • If the data can be explained in multiple ways, prefer the explanation with the fewest events (be parsimonious) • Parsimony score = number of evolutionary events (e.g., substitutions) on the tree • Maximum parsimony principle: minimize parsimony score

  27. Phylogenetic footprinting: formally speaking • Given: phylogenetic tree T; set of orthologous sequences at the leaves of T; length k of motif; threshold d • Problem: find a set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d

  28. Small Example • Size of motif sought: k = 4 • AGTCGTACGTGAC... (Human) • AGTAGACGTGCCG... (Chimp) • ACGTGAGATACGT... (Rabbit) • GAACGGAGTACGT... (Mouse) • TCGTGACGGTGAT... (Rat)

  29. Solution • Chosen k-mers: ACGT, ACGT, ACGT, ACGG • Parsimony score: 1 mutation [Figure: the five sequences of the small example with the chosen k-mers highlighted]

  30. … ACGG: +ACGT: 0 ... … ACGG:ACGT :0 ... … ACGG:ACGT :0 ... … ACGG:ACGT :0 ... … ACGG: 1 ACGT: 0 ... 4k entries AGTCGTACGTG ACGGGACGTGC ACGTGAGATAC GAACGGAGTAC TCGTGACGGTG … ACGG: 2ACGT: 1... … ACGG: 1ACGT: 1... … ACGG: 0ACGT: 2 ... … ACGG: 0 ACGT: +... An Exact Algorithm(Blanchette’s algorithm) Wu [s] = best parsimony score for subtree rooted at node u, if u is labeled with string s.

  31. Recurrence • $W_u[s] = \sum_{v:\,\text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$ • A post-order traversal algorithm

  32. Running Time • $W_u[s] = \sum_{v:\,\text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$ • $O(k \cdot 4^{2k})$ time per node
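The recurrence can be sketched directly in Python as a post-order traversal that fills a table of all 4^k labels at every node. This is a minimal illustration run on the small example with k = 4, not the Footprinter implementation. Several things here are assumptions: the tree topology ((Human, Chimp), (Rabbit, (Mouse, Rat))) is invented because the transcript does not give one, d(s, t) is taken to be Hamming distance, only the sequence prefixes shown on the slide are used, and a leaf's table is set to 0 for k-mers that occur in its sequence and infinity otherwise.

```python
from itertools import product

INF = float("inf")

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_parsimony(tree, seqs, k):
    """Return (root_label, score) minimizing the parsimony score.

    `tree` is a nested tuple of leaf names; `seqs` maps leaf name -> sequence.
    """
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    def table(node):
        if isinstance(node, str):  # leaf: 0 for k-mers present, infinity otherwise
            present = {seqs[node][i:i + k] for i in range(len(seqs[node]) - k + 1)}
            return {s: (0 if s in present else INF) for s in all_kmers}
        total = {s: 0 for s in all_kmers}
        for child in node:  # post-order: children are solved first
            w_child = table(child)
            finite = [(t, w) for t, w in w_child.items() if w != INF]
            for s in all_kmers:
                total[s] += min(w + hamming(s, t) for t, w in finite)
        return total

    root = table(tree)
    label = min(root, key=root.get)
    return label, root[label]

seqs = {
    "Human": "AGTCGTACGTGAC", "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT", "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
# Assumed topology, for illustration only.
tree = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))
label, score = best_parsimony(tree, seqs, k=4)
print(label, score)
```

Each internal node compares every label s against every child label t, which is where the 4^k × 4^k factor in the per-node running time comes from.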

  33. Footprinter: features • One of the earliest motif-finding algorithms based on comparative genomics • Simple formulation of motif score; algorithm is efficient in practice • Cannot combine evolutionary conservation information with overrepresentation information, e.g., it cannot prefer one of two equally conserved motifs because that one also occurs in the promoters of many co-regulated genes

  34. PhyloCon (Stormo lab): The method with alignments

  35. The underlying single-species algorithm: CONSENSUS • Final goal: find a set of substrings, one in each input sequence • The set of substrings defines a PWM • Goal: this PWM should have high information content • High information content means that the motif “stands out”

  36. The underlying single-species algorithm: CONSENSUS • Start with a substring in one input sequence • Build the set of substrings incrementally, adding one substring at a time [Figure: the current set of substrings]

  37. The underlying single-species algorithm: CONSENSUS • Start with a substring in one input sequence • Build the set of substrings incrementally, adding one substring at a time [Figure: the current set of substrings and the current motif]

  38. The underlying single-species algorithm: CONSENSUS • Start with a substring in one input sequence • Build the set of substrings incrementally, adding one substring at a time • Consider every substring in the next sequence, try adding it to the current motif, and score the resulting motif [Figure: the current set of substrings and the current motif, with candidate substrings marked “?”]

  39. The underlying single-species algorithm: CONSENSUS • Start with a substring in one input sequence • Build the set of substrings incrementally, adding one substring at a time • Pick the best one … [Figure: the current set of substrings and the current motif]

  40. The underlying single-species algorithm: CONSENSUS • Start with a substring in one input sequence • Build the set of substrings incrementally, adding one substring at a time • Pick the best one … and repeat [Figure: the current set of substrings and the current motif]
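The greedy procedure described on these slides can be sketched in Python as follows. This is a simplified illustration and not the CONSENSUS program itself: it keeps only the single best extension at each step (the real program retains several partial motifs, in the style of a beam search), it processes sequences in the given order, and it scores motifs by information content against a uniform background. The function names and the test sequences are invented for this example.

```python
import math

def information_content(sites, background=0.25):
    """Sum over columns of relative entropy (in bits) against the background."""
    total = 0.0
    for column in zip(*sites):
        for base in set(column):  # bases with zero frequency contribute 0
            f = column.count(base) / len(column)
            total += f * math.log2(f / background)
    return total

def greedy_consensus(sequences, k):
    """Try every k-mer of the first sequence as a seed; extend greedily."""
    best_sites, best_score = None, float("-inf")
    first = sequences[0]
    for start in range(len(first) - k + 1):
        sites = [first[start:start + k]]
        for seq in sequences[1:]:
            candidates = [seq[i:i + k] for i in range(len(seq) - k + 1)]
            # Add the substring that gives the highest-scoring motif so far.
            best = max(candidates, key=lambda c: information_content(sites + [c]))
            sites.append(best)
        score = information_content(sites)
        if score > best_score:
            best_sites, best_score = sites, score
    return best_sites, best_score

# Hypothetical input with the word TAATCC planted in each sequence.
sequences = ["GCATAATCCGTA", "TTGACTAATCCA", "CATAATCCTGGC"]
sites, score = greedy_consensus(sequences, k=6)
print(sites, score)
```

Because each step commits to one substring, the result can depend on the seed and on the order of the sequences, which is the usual motivation for keeping more than one partial motif alive.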

  41. The key: Scoring a motif [Figure: the current motif]

  42. The key: Scoring a motif • Build a PWM from the current motif • Compute the information content of the PWM: for each column, compute the relative entropy with respect to a “background” distribution, then sum over all columns • Key: align the sites of a motif, and score the alignment
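Written out, the score described above takes the following form. The notation is an assumption introduced here: $f_{b,j}$ is the frequency of base $b$ in column $j$ of the PWM, $p_b$ is the background frequency of base $b$, and $L$ is the motif length; the choice of logarithm base (2, giving bits) is also an assumption.

$$
\mathrm{IC} = \sum_{j=1}^{L} \sum_{b \in \{A,C,G,T\}} f_{b,j} \, \log_2 \frac{f_{b,j}}{p_b}
$$

With a uniform background ($p_b = 0.25$), a column in which all sites agree contributes $1 \cdot \log_2(1/0.25) = 2$ bits, while a column split evenly between two bases contributes $2 \cdot 0.5 \cdot \log_2(0.5/0.25) = 1$ bit. Terms with $f_{b,j} = 0$ are taken to be 0.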

  43. Extending CONSENSUS to multiple species Final goal: Find a set of substrings, one in each input sequence

  44. Extending CONSENSUS to multiple species Final goal: Find a set of “profiles”, one in each set of orthologous input sequences

  45. Extending CONSENSUS to multiple species “Profiles”

  46. Extending CONSENSUS to multiple species “Profiles”

  47. Extending CONSENSUS to multiple species

  48. Aligning two “profiles” • Compare two profiles column by column • Each column of a profile is $(n_A, n_C, n_G, n_T)$ or, equivalently, $(f_A, f_C, f_G, f_T)$ • Probabilistic score to capture whether two columns $\{n_{bi}, f_{bi}\}_b$ and $\{n_{bj}, f_{bj}\}_b$ are from the same distribution (and different from background) • ALLR: average log likelihood ratio, where $p_b$ is the background frequency of base b

  49. One cool feature of ALLR • The expected value is negative, which means that very long profiles will not automatically give large ALLR scores • Therefore, ALLR can automatically detect the “right” motif length
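The ALLR formula itself is not reproduced in the transcript. The sketch below assumes the form in which the statistic is usually stated for two columns i and j: $\left(\sum_b n_{bj} \ln\frac{f_{bi}}{p_b} + \sum_b n_{bi} \ln\frac{f_{bj}}{p_b}\right) / \sum_b (n_{bi} + n_{bj})$. The pseudocount used when turning counts into frequencies is an additional assumption, added so that a zero count does not produce a logarithm of zero; the function names are invented for this example.

```python
import math

BASES = "ACGT"

def frequencies(counts, pseudocount):
    """Convert base counts to frequencies, smoothing with a pseudocount."""
    total = sum(counts[b] for b in BASES) + 4 * pseudocount
    return {b: (counts[b] + pseudocount) / total for b in BASES}

def allr(counts_i, counts_j, background=None, pseudocount=0.5):
    """Average log likelihood ratio between two profile columns (assumed form)."""
    if background is None:
        background = {b: 0.25 for b in BASES}
    f_i = frequencies(counts_i, pseudocount)
    f_j = frequencies(counts_j, pseudocount)
    # Each column's counts are scored under the other column's frequencies.
    numerator = sum(
        counts_j[b] * math.log(f_i[b] / background[b])
        + counts_i[b] * math.log(f_j[b] / background[b])
        for b in BASES
    )
    denominator = sum(counts_i[b] + counts_j[b] for b in BASES)
    return numerator / denominator

all_a = {"A": 4, "C": 0, "G": 0, "T": 0}
all_c = {"A": 0, "C": 4, "G": 0, "T": 0}
print(allr(all_a, all_a))   # agreeing columns score positive
print(allr(all_a, all_c))   # disagreeing columns score negative
```

Summing this per-column score over aligned columns gives a profile-to-profile score in which mismatched columns subtract from the total rather than merely adding nothing.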

  50. PhyloCon: features • One of the first algorithms to find motifs that are conserved across species and occur in multiple co-regulated gene promoters • Does not consider the evolutionary relationships among species (all species weighted equally)
