
Regulatory Motif Finding


Presentation Transcript


  1. Regulatory Motif Finding Mohammed AlQuraishi

  2. Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance

  3. Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance

  4. Cell = Factory, Proteins = Machines Biovisions, Harvard

  5. DNA • “Coding” regions: instructions for making the machines • “Regulatory” regions (regulons): instructions for when and where to make them

  6. Transcriptional Regulation • Regulatory regions are composed of “binding sites” • “Binding sites” attract a special class of proteins, known as “transcription factors” • Bound transcription factors can promote or inhibit DNA transcription

  7. DNA Regulation Source: Richardson, University College London

  8. Cell Regulation • Transcriptional regulation (the focus of this talk) is one of many regulatory mechanisms in the cell Source: Mallery, University of Miami

  9. Structural Basis of Interaction

  10. Structural Basis of Interaction • Key Feature: • Transcription factors are not 100% specific when binding DNA • Not one sequence, but a family of sequences, with varying affinities [Figure: example binding-site sequences with affinities 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]

  11. Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance

  12. Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance

  13. Motif Finding • Basic Objective: • Find regions in the genome that bind transcription factors • Many classes of algorithms, differ in: • Types of input data • Motif representation

  14. Input Data • Single sequence • Evolutionarily related set of sequences • Sequence + other data • Microarray expression profile • ChIP-chip • Others…

  15. Motif Representation • Probabilistic • Word-Based Focus of Talk

  16. Motif Representation • Structural discussion immediately raises difficulties

  17. Structural Basis of Interaction • Key Feature: • Transcription factors are not 100% specific when binding DNA • Not one sequence, but a family of sequences, with varying affinities [Figure: example binding-site sequences with affinities 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]

  18. Motif Representation • Structural discussion immediately raises difficulties • Least Expressive: • Single sequence • Most Expressive: • 4^k-dimensional probability distribution • Independently assign a probability to each possible kmer [Figure: example kmer GACCG]

  19. Motif Representation • Standard Solution: • Position-Specific Scoring Matrix (PSSM) • Assuming independence of positions, assign a probability for each position • Fraught with problems… (Will revisit this)
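
A minimal sketch of the PSSM idea on this slide, assuming the sites are already aligned and of equal length. The function names and the example sites are hypothetical and not taken from the talk; real PSSMs usually add pseudocounts and log-odds scoring against a background, which are left out here.

```python
from collections import Counter

BASES = "ACGT"

def build_pssm(sites):
    """Return one {base: frequency} dict per position of the aligned sites."""
    k = len(sites[0])
    if any(len(site) != k for site in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for pos in range(k):
        counts = Counter(site[pos] for site in sites)
        pssm.append({base: counts[base] / len(sites) for base in BASES})
    return pssm

def kmer_probability(pssm, kmer):
    """Probability of a kmer when positions are treated as independent."""
    p = 1.0
    for column, base in zip(pssm, kmer):
        p *= column[base]
    return p

# Hypothetical aligned binding sites, used only for illustration.
example_sites = ["GACCG", "GACCG", "GGCTA", "GACTG"]
example_pssm = build_pssm(example_sites)
print(kmer_probability(example_pssm, "GACCG"))
```

The product over columns is exactly where the independence assumption enters: no column can see which base was chosen at any other position.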

  20. Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance

  21. Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance

  22. Reference • Authors: • Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou • Title: • MotifCut: regulatory motifs finding with maximum density subgraphs • Publication: • Bioinformatics Vol. 22 no. 14 2006, pages e150–e157

  23. Overview • Motif Finding Algorithm (“MotifCut”) • Motivation • Oversimplicity of PSSMs • Intractability of more complex models

  24. Oversimplicity of PSSMs • Assumes independence between positions • ~25% of TRANSFAC motifs have been shown to violate this assumption • Two Examples: ADR1 and YAP6

  25. Oversimplicity of PSSMs • Assumes independence between positions • Generates potentially unseen motifs
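
To illustrate the "potentially unseen motifs" point, here is a small sketch under the assumption of a plain frequency PSSM with no pseudocounts. The two observed sites are made up for the example. Because each column is sampled independently, the model assigns nonzero probability to combinations that never occurred together.

```python
from itertools import product

BASES = "ACGT"

def column_frequencies(sites):
    """Per-position base frequencies of equal-length sites."""
    k = len(sites[0])
    return [
        {base: sum(site[i] == base for site in sites) / len(sites) for base in BASES}
        for i in range(k)
    ]

def kmers_with_nonzero_probability(sites):
    """All kmers the position-independent model could generate."""
    columns = column_frequencies(sites)
    k = len(columns)
    generated = set()
    for letters in product(BASES, repeat=k):
        if all(columns[i][letters[i]] > 0 for i in range(k)):
            generated.add("".join(letters))
    return generated

# Hypothetical sites in which the two positions are perfectly correlated.
observed = {"AC", "GT"}
generated = kmers_with_nonzero_probability(sorted(observed))
unseen = sorted(generated - observed)
print(unseen)  # ['AT', 'GC']
```

Here the PSSM gives "AT" and "GC" the same probability as the two sites that were actually observed.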

  26. Basic Features of MotifCut • Does not assume an underlying PSSM • Represents a motif with a graph structure • In principle maximally expressive • In practice not quite • Motif finding cast as maximum density subgraph • Subquadratic complexity

  27. Motif Graph Representation • Nodes are kmers • Edge weights are distances between kmers • Generative model: Frequency of kmer node equal to frequency of generating kmer • Distance definition is complicated (Will come back to) • Same kmer node can appear multiple times [Figure: example graph with kmer nodes AGTGCGAC, AGTGGGAC, AGTGGGAC, AGTGCTAC and edge labels 0, 1, 2]

  28. Motif Finding • Find highest density subgraph • Density is defined as sum of edge weights per node • Somewhat limits representational power
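
The density definition on this slide (sum of edge weights per node) can be sketched as follows. The edge-dictionary layout and the toy weights are assumptions made for illustration.

```python
def subgraph_density(edge_weights, nodes):
    """Sum of weights of edges inside `nodes`, divided by the number of nodes.

    `edge_weights` maps an undirected edge (u, v) to its weight; each edge
    is listed once.
    """
    nodes = set(nodes)
    if not nodes:
        return 0.0
    inside = sum(
        weight
        for (u, v), weight in edge_weights.items()
        if u in nodes and v in nodes
    )
    return inside / len(nodes)

# Hypothetical toy graph: a tight triangle plus one weakly attached node.
toy_edges = {
    ("a", "b"): 1.0,
    ("b", "c"): 0.5,
    ("a", "c"): 0.5,
    ("c", "d"): 0.1,
}
print(subgraph_density(toy_edges, {"a", "b", "c"}))
print(subgraph_density(toy_edges, {"a", "b", "c", "d"}))
```

Adding the weakly connected node "d" lowers the density, which is why the densest subgraph tends to pick out a tightly connected cluster of similar kmers.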

  29. Motif Finding • Read new sequence • Generate graph as previously described • Kmers are generated by shifting one base pair at a time • Each kmer in the sequence gets a node, including identical kmers • Graph contains roughly as many nodes as there are base pairs (n − k + 1 for a sequence of length n) • Connect edges with weights based on distances between nodes • Find densest subgraph
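
The node-generation step can be sketched as a sliding window, assuming a single input sequence and a fixed k. The function name and the example sequence are hypothetical.

```python
def sliding_kmers(sequence, k):
    """Return (start, kmer) for every window of length k, shifting by one base."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

# Hypothetical input; each window becomes its own node, even when the
# kmer text repeats.
nodes = sliding_kmers("ACGTACGT", 4)
for start, kmer in nodes:
    print(start, kmer)
```

Keeping the start position with each kmer is one simple way to let identical kmers remain distinct nodes, as the slide requires.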

  30. Edge Weights • Heart of the algorithm, will focus on this • Semantics: • Edge weight is the likelihood of two kmers to be in the same motif • Use Hamming distance as a way to quantify distance between kmers [Figure: example kmers with Hamming distances 0 through 3]

  31. Edge Weights • Heart of the algorithm, will focus on this • Semantics: • Edge weight is the likelihood of two kmers to be in the same motif • Use Hamming distance as a way to quantify distance between kmers • “Interpret” Hamming distance as a measure of the likelihood of two kmers to be in the same motif: • F(Hamming distance) = likelihood of two kmers to be in the same motif
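
For reference, Hamming distance between two equal-length kmers is just the number of mismatching positions. The sketch below applies it to the kmers shown on the graph slide; the function name is an assumption, and the mapping F from distance to likelihood is deliberately not implemented here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Kmers taken from the motif graph example earlier in the talk.
print(hamming("AGTGCGAC", "AGTGGGAC"))  # 1
print(hamming("AGTGCGAC", "AGTGCTAC"))  # 1
print(hamming("AGTGGGAC", "AGTGCTAC"))  # 2
print(hamming("AGTGGGAC", "AGTGGGAC"))  # 0
```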

  32. Edge Weights • Let’s make this a bit more precise: • But how to compute F? • Simulate it! • Way too many variables to account for analytically: background model, kmer length, Hamming distance, etc.

  33. “Genome Simulation” • Background + Motifs • No genes, promoters, signaling sequences, etc. • Background Model • 3rd order Markov model • Probability of next base depends on previous 3 bases • Modeled on the yeast genome • Incorporates GC bias • Motif Model • PSSM • Based on empirically observed information content of yeast motifs

  34. “Genome Simulation” • Use the Markov model to generate background sequences 10k – 20k bases long • Seed with 20 motifs generated by the PSSM • Result is a simulated yeast genome • We know which parts are the real motifs, and which are not
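
A rough sketch of this simulation, under several assumptions: the transition table is supplied by the caller (the talk's table was modeled on yeast and is not reproduced here), unknown contexts fall back to uniform probabilities, and planted sites are placed in non-overlapping slots. All names and parameters are hypothetical.

```python
import random

BASES = "ACGT"

def sample_background(length, transitions, rng):
    """Order-3 Markov chain: the next base depends on the previous 3 bases.

    `transitions` maps a 3-base context to four weights (A, C, G, T).
    Missing contexts use uniform weights (an assumption of this sketch).
    """
    seq = [rng.choice(BASES) for _ in range(min(3, length))]
    while len(seq) < length:
        context = "".join(seq[-3:])
        weights = transitions.get(context, [0.25, 0.25, 0.25, 0.25])
        seq.append(rng.choices(BASES, weights=weights)[0])
    return seq

def sample_site(pssm, rng):
    """Draw one motif instance; each PSSM column is four weights (A, C, G, T)."""
    return "".join(rng.choices(BASES, weights=column)[0] for column in pssm)

def simulate(length, n_sites, pssm, transitions=None, seed=0):
    """Return (genome, planted_positions) so the true motif sites are known."""
    rng = random.Random(seed)
    seq = sample_background(length, transitions or {}, rng)
    k = len(pssm)
    # Choose distinct slots of width k so planted sites never overlap.
    slots = rng.sample(range(length // k), n_sites)
    positions = sorted(slot * k for slot in slots)
    for p in positions:
        seq[p:p + k] = sample_site(pssm, rng)
    return "".join(seq), positions

# Hypothetical deterministic PSSM that always emits "ACGT".
toy_pssm = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
genome, positions = simulate(200, 5, toy_pssm, seed=1)
print(positions)
```

Returning the planted positions alongside the sequence is what makes the ground truth available for the later counting step.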

  35. Edge Weights • Back to F: • α(k, l) is the number of true motifs of length k that are at distance l • β(k, l) is the number of non-motifs of length k that are at distance l

  36. Edge Weights [Figure: 6-mers labeled either as true motifs or as false motifs that are part of the background]

  37. Edge Weights • Let’s perform the calculation from the perspective of one motif • All ≤1 distance away (Hamming distance) • α(k = 6, l = 1) = 1 • β(k = 6, l = 1) = 1 [Figure: the reference 6-mer highlighted among neighboring true and false motifs]
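
The counting on this slide can be sketched as below. Two things are assumptions of this sketch rather than statements from the talk: neighbors are counted within distance at most l (following the "≤1" wording here), and the counts are turned into a likelihood as α / (α + β), since the transcript does not preserve the actual formula. The labeled kmers are invented to reproduce α = 1 and β = 1.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length kmers."""
    return sum(x != y for x, y in zip(a, b))

def count_alpha_beta(reference, labeled_kmers, max_distance):
    """Count true motifs (alpha) and non-motifs (beta) near `reference`.

    `labeled_kmers` is a list of (kmer, is_true_motif) pairs that excludes
    the reference itself.
    """
    alpha = beta = 0
    for kmer, is_true_motif in labeled_kmers:
        if hamming(reference, kmer) <= max_distance:
            if is_true_motif:
                alpha += 1
            else:
                beta += 1
    return alpha, beta

def same_motif_estimate(alpha, beta):
    """Assumed estimator: fraction of nearby kmers that are true motifs."""
    total = alpha + beta
    return alpha / total if total else 0.0

# Hypothetical labeled neighbors of the reference 6-mer.
neighbors = [
    ("GGGGGG", True),   # distance 0, true motif
    ("GGGTGG", False),  # distance 1, background
    ("GGCCGG", False),  # distance 2, background
    ("ACGTAC", False),  # distance 5, background
]
alpha, beta = count_alpha_beta("GGGGGG", neighbors, max_distance=1)
print(alpha, beta, same_motif_estimate(alpha, beta))
```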

  38. Edge Weights • Computation provides an empirical estimate for F • Parameterized by two quantities: • k, the kmer length • l, the Hamming distance between two kmers • Fit to a sigmoidal function

  39. Edge Weights • Normalization step • Won’t go into details • This covers problem formulation • How is motif finding actually done?

  40. Maximum Density Subgraph • Standard graph theory method • Max-flow / min-cut • O(nm log(n^2/m)) • Need faster method • Developed heuristic approach that utilizes max-flow / min-cut method with modifications

  41. Maximum Density Subgraph • Remove all edges below a certain threshold

  42. Maximum Density Subgraph • Pick one vertex (do this for every vertex)

  43. Maximum Density Subgraph • Put back all neighboring edges for that vertex

  44. Maximum Density Subgraph • Use standard algorithm to calculate densest subgraph
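
The four steps above (threshold, pick a vertex, restore its edges, solve) can be sketched as follows. One substitution is made plainly: instead of the exact max-flow / min-cut solver the talk uses, this sketch uses greedy peeling (repeatedly removing the lowest-degree node), which is only an approximation. The graph layout, names, and threshold are hypothetical.

```python
def build_adjacency(edges):
    """Symmetric dict-of-dicts from a list of (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def greedy_densest(adj):
    """Greedy peeling approximation (NOT the exact flow-based method)."""
    nodes = set(adj)
    if not nodes:
        return set(), 0.0
    degree = {u: sum(adj[u].values()) for u in nodes}
    total = sum(degree.values()) / 2.0
    best_nodes, best_density = set(nodes), total / len(nodes)
    while len(nodes) > 1:
        # Remove the node with the smallest weighted degree (ties by name).
        u = min(nodes, key=lambda n: (degree[n], n))
        nodes.remove(u)
        total -= degree[u]
        for v, w in adj[u].items():
            if v in nodes:
                degree[v] -= w
        density = total / len(nodes)
        if density > best_density:
            best_nodes, best_density = set(nodes), density
    return best_nodes, best_density

def local_densest(adj, seed, threshold):
    """Drop edges below `threshold`, but keep every edge touching `seed`."""
    pruned = {u: {} for u in adj}
    for u in adj:
        for v, w in adj[u].items():
            if w >= threshold or seed in (u, v):
                pruned[u][v] = w
    return greedy_densest(pruned)

def best_over_all_seeds(adj, threshold):
    """Repeat the local search for every vertex and keep the densest result."""
    results = [local_densest(adj, seed, threshold) for seed in sorted(adj)]
    return max(results, key=lambda r: r[1])

# Hypothetical graph: a heavy triangle with two weakly attached nodes.
toy = build_adjacency([
    ("a", "b", 0.9), ("b", "c", 0.9), ("a", "c", 0.9),
    ("a", "d", 0.1), ("d", "e", 0.2),
])
print(best_over_all_seeds(toy, threshold=0.5))
```

Thresholding shrinks the edge set that each per-vertex solve has to handle; swapping `greedy_densest` for an exact solver would not change the surrounding structure.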

  45. Results • Synthetic Tests • Plenty of test cases • Measure performance as data set size grows • Avoid over-biasing toward empirical data • Know the real answer, so performance can be tested unambiguously • Yeast Test • Gold standard data (Harbison et al., 2004)

  46. Synthetic Tests • Varied: • Motif length • Information content • Simulated genome (as before) • Correlated predicted PSSMs to real ones, counted as true positive if correlation > 0.7
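
The true-positive criterion can be sketched as a Pearson correlation between the entries of the predicted and real PSSMs. How the matrices were aligned and compared in the actual evaluation is not stated in the transcript, so the entrywise flattening below is an assumption, as are the function names and toy matrices; only the 0.7 threshold comes from the slide.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def pssm_correlation(predicted, real):
    """Correlate two same-shaped PSSMs entry by entry (assumed pre-aligned)."""
    if len(predicted) != len(real):
        raise ValueError("PSSMs must have the same number of positions")
    flat_pred = [value for column in predicted for value in column]
    flat_real = [value for column in real for value in column]
    return pearson(flat_pred, flat_real)

def is_true_positive(predicted, real, threshold=0.7):
    """Apply the correlation > 0.7 criterion from the slide."""
    return pssm_correlation(predicted, real) > threshold

# Hypothetical 2-position PSSMs; columns are weights for A, C, G, T.
real_pssm = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
close_pssm = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.8, 0.1, 0.0]]
wrong_pssm = [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
print(is_true_positive(close_pssm, real_pssm))
print(is_true_positive(wrong_pssm, real_pssm))
```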

  47. Synthetic Tests Results

  48. Yeast Test Results

  49. Performance

  50. Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance
