
DNA Motif Finding

DNA Motif Finding. Katherina Kechris Introduction to Bioinformatics BIOI 7710/7711 Lecture 12 10/6/05. DNA Motifs. Short repeating sequence elements in the genome can have important regulatory function Transcription, splicing, post-transcriptional processing, …


Presentation Transcript


  1. DNA Motif Finding Katherina Kechris Introduction to Bioinformatics BIOI 7710/7711 Lecture 12 10/6/05

  2. DNA Motifs • Short repeating sequence elements in the genome can have important regulatory function • Transcription, splicing, post-transcriptional processing, … • Motifs are representations of known examples • Local multiple sequence alignment

  3. Genes to Proteins • DNA: actggtacgtggaccgttacg • RNA: acugguacguggaccguuacg • Protein: TGTWTVT

  4. Transcription

  5. Examples: Transcription Factors • Environment: yeast, Gal4 (galactose-rich conditions) • Development: Drosophila, Hb, Bi, Kr, Gt (embryonic patterning) • Tissue-specific: mammals, C/EBP β (liver) • Figure: expression of even-skipped (eve)

  6. Transcription Factor Binding Sites CGGCGCACTCTCGCCCG CGGGGCAGACTATTCCG CGGCGGCTTCTAATCCG CGGAGGGCTGTCGCCCG CGGAGGAGAGTCTTCCG CGGAGCAGTGCGGCGCG CGCGCCGCACTGCTCCG CGGAAGACTCTCCTCCG CGGGCGACAGCCCTCCG CGGATTAGAAGCCGCCG CGGGGCGGATCACTCCG CGGCGGTCTTTCGTCCG CGGCGCACTCTCGCCCG CGGGGCAGACTATTCCG

  7. Databases TRANSFAC: http://www.gene-regulation.com/pub/databases.html#transfac Binding Sites

  8. More Databases • Species-specific: • SCPD (yeast) http://rulai.cshl.edu/SCPD/ • DPInteract (E. coli) http://arep.med.harvard.edu/dpinteract/ • Drosophila DNase I Footprint Database (v2.0) http://www.flyreg.org/

  9. Transcription Factor Binding Sites CGGCGCACTCTCGCCCG CGGGGCAGACTATTCCG CGGCGGCTTCTAATCCG CGGAGGGCTGTCGCCCG CGGAGGAGAGTCTTCCG CGGAGCAGTGCGGCGCG CGCGCCGCACTGCTCCG CGGAAGACTCTCCTCCG CGGGCGACAGCCCTCCG CGGATTAGAAGCCGCCG CGGGGCGGATCACTCCG CGGCGGTCTTTCGTCCG CGGCGCACTCTCGCCCG CGGGGCAGACTATTCCG

  10. Motif Representations CGGCGCACTCTCGCCCG CGGGGCAGACTATTCCG CGGCGGCTTCTAATCCG ... CGGGGCAGACTATTCCG • Consensus: CGGNGCACANTCNTCCG • Frequency Matrix • Logo
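The first two representations can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the slide's consensus: the function names and the strict-majority rule (a base is reported only if its frequency exceeds `threshold`, otherwise `N`) are assumptions made here, and only four of the listed sites are used.

```python
from collections import Counter

# Four of the binding sites listed on the slides.
SITES = [
    "CGGCGCACTCTCGCCCG",
    "CGGGGCAGACTATTCCG",
    "CGGCGGCTTCTAATCCG",
    "CGGAGGGCTGTCGCCCG",
]

def count_matrix(sites):
    """One Counter of base counts per motif position."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(width)]

def frequency_matrix(sites):
    """Relative frequency of A, C, G, T at each motif position."""
    n = len(sites)
    return [{b: col[b] / n for b in "ACGT"} for col in count_matrix(sites)]

def consensus(sites, threshold=0.5):
    """Most frequent base per position, or N if no base exceeds threshold."""
    out = []
    for col in frequency_matrix(sites):
        base, freq = max(col.items(), key=lambda kv: kv[1])
        out.append(base if freq > threshold else "N")
    return "".join(out)
```

With more sites or a different threshold the consensus changes, which is one reason a frequency matrix is usually preferred over a single consensus string.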

  11. Logos • Graphical representation of nucleotide base (or amino acid) conservation in a motif (or alignment) • Information theory • Height of letters represents relative frequency of nucleotide bases http://weblogo.berkeley.edu/
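As a rough sketch of the information-theoretic quantity behind a logo (following Schneider & Stephens, without their small-sample correction): the stack height at a position is 2 bits minus the Shannon entropy of the column, and each letter's height is its frequency times that stack height. The function names below are illustrative assumptions.

```python
import math

def information_content(column):
    """Bits of information at one DNA motif position.

    `column` maps each base to its relative frequency (summing to 1).
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(4) - entropy

def letter_heights(column):
    """Height of each letter in the logo stack for one position."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}
```

A perfectly conserved position scores 2 bits and a uniform position scores 0 bits.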

  12. Position Weight Matrix (PWM) • Frequency Matrix • Weight Matrix: frequencies scaled by the background frequencies • NOTE: Use pseudo-counts for zero frequencies
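One common way to turn a frequency matrix into a weight matrix is a log-odds score against the background. The sketch below assumes log base 2, a uniform background by default, and a pseudo-count distributed in proportion to the background. These are illustrative choices, since the slide's own formula is not preserved in the transcript.

```python
import math

def weight_matrix(sites, background=None, pseudocount=1.0):
    """Log-odds weight matrix: log2(frequency / background) per position.

    The pseudo-count keeps unseen bases from producing log(0).
    """
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(sites)
    width = len(sites[0])
    pwm = []
    for i in range(width):
        column = {}
        for b in "ACGT":
            count = sum(1 for s in sites if s[i] == b)
            freq = (count + pseudocount * background[b]) / (n + pseudocount)
            column[b] = math.log2(freq / background[b])
        pwm.append(column)
    return pwm
```

Bases seen more often than the background expects get positive weights. Unseen bases get a finite negative weight rather than minus infinity.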

  13. Predicting Motif Occurrences: Sequence Scoring • Example sequence: a g c g g t a • Example window scores: Sum = 13.5, Sum = -15.6
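Scoring a sequence with a weight matrix amounts to sliding a window of the motif width along it and summing one weight per position. The sketch below uses a made-up 3-column matrix (`TOY_PWM`) and a hypothetical threshold. It does not reproduce the sums shown on the slide.

```python
# Toy weight matrix favouring the word ACG (values are illustrative only).
TOY_PWM = [
    {"A": 1.5, "C": -2.0, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": 1.5, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": -2.0, "G": 1.5, "T": -2.0},
]

def score_window(pwm, window):
    """Sum of the weights of the observed base at each motif position."""
    return sum(col[b] for col, b in zip(pwm, window.upper()))

def scan(pwm, sequence, threshold=0.0):
    """Return (start, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for j in range(len(sequence) - width + 1):
        s = score_window(pwm, sequence[j:j + width])
        if s >= threshold:
            hits.append((j, s))
    return hits
```

Start positions are 0-based. Picking the threshold is the hard part in practice, as it trades false positives against missed sites.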

  14. Novel Motif Prediction • Goal: Characterize and predict locations of novel motifs in sequences • Challenges: • Short (6-20 bases) • Degenerate • Locations not fixed • Signal to noise • e.g., yeast: 600-800 bp

  15. Problem Data: Upstream sequences from co-regulated/co-expressed genes. Assumption: Binding site occurs in most sequences 1: actcgtcggggcgtacgtacgtaacgtacgtacggacaactgttgaccg 2: cggagcactgttgagcgacaagtacggagcactgttgagcggtacgtac 3: ccccgtaggcggcgcactctcgcccgggcgtacgtacgtaacgtacgta 4: agggcgcgtacgctaccgtcgacgtcgcgcgccgcactgctccgacgct Goals: 1) Estimate motif 2) Predict locations of motifs 1: actcgtcggggcgtacgtacgtaacgtacgtaCGGACAACTGTTGACCG 2: cggagcactgttgagcgacaagtaCGGAGCACTGTTGAGCGgtacgtac 3: ccccgtaggCGGCGCACTCTCGCCCGggcgtacgtacgtaacgtacgta 4: agggcgcgtacgctaccgtcgacgtcgCGCGCCGCACTGCTCCGacgct

  16. Strategies • Deterministic • Regular expression representation A-C-[AG]-x(2,5)-T-x(2)-A • Enumerative • Probabilistic • Statistical model • Frequency matrix
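The regular-expression representation on this slide maps directly onto ordinary regex syntax. The converter below is a small sketch for this pattern style only (literal bases, bracketed alternatives, and `x`, `x(n)` or `x(m,n)` wildcards). The function names are assumptions, not part of any tool mentioned in the lecture.

```python
import re

def pattern_to_regex(pattern):
    """Translate a pattern like A-C-[AG]-x(2,5)-T-x(2)-A into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            parts.append(element)          # literal base or [..] class
        elif m.group(1) is None:
            parts.append("[ACGT]")         # bare x: one wildcard
        elif m.group(2) is None:
            parts.append("[ACGT]{%s}" % m.group(1))
        else:
            parts.append("[ACGT]{%s,%s}" % (m.group(1), m.group(2)))
    return "".join(parts)

def find_matches(pattern, sequence):
    """Non-overlapping (start, matched text) pairs, 0-based starts."""
    regex = pattern_to_regex(pattern)
    return [(m.start(), m.group()) for m in re.finditer(regex, sequence.upper())]
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping occurrences would need a lookahead-based search.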

  17. RSA Tools http://rsat.ulb.ac.be/rsat/

  18. Strategies • Deterministic • Enumerative • Regular expression representation (consensus) A-C-[AG]-x(2,5)-T-x(2) • Probabilistic • Statistical model • Frequency matrix
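In its simplest form, the enumerative strategy counts every word of a fixed length and reports the ones shared by most sequences. The sketch below assumes exact matching and a fixed `k`. Real enumerative tools also allow degenerate positions and compare counts with a background expectation, which is omitted here.

```python
from collections import Counter

def kmer_support(sequences, k):
    """Number of sequences in which each k-mer occurs at least once."""
    support = Counter()
    for seq in sequences:
        seq = seq.upper()
        seen = {seq[j:j + k] for j in range(len(seq) - k + 1)}
        support.update(seen)  # each sequence contributes at most 1 per word
    return support

def shared_kmers(sequences, k, min_fraction=1.0):
    """k-mers present in at least min_fraction of the sequences."""
    support = kmer_support(sequences, k)
    needed = min_fraction * len(sequences)
    return sorted(w for w, c in support.items() if c >= needed)
```

Counting sequences rather than total occurrences matches the assumption that the binding site occurs in most of the co-regulated sequences.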

  19. Model cggagcactgttgagcgacaagtaCGGAGCACTGTTGAGCGgtacgtac • Motif positions (upper case): independent, non-identically distributed • Background positions (lower case): independent, identically distributed • Motif start-positions are missing data • Assume one motif occurrence per sequence • Goals: 1) estimate motif and 2) predict locations of motifs

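  20. Basics for Estimation • Conditional on the frequency matrix: for each sequence k and position j, the probability of the sequence given a motif starting at j, e.g., sequence = “ctCGTCggggc”, j = 3, motif width W = 4 • Conditional on the motif start-positions j: the frequency matrix estimated from the aligned motif instances, e.g., cgTACGtaacg acaagtaCGGA cCCCGtaggcg cgcgCGCCgca, with N = number of sequences and counts of the number of b’s at motif position i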
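The formulas for these two conditional quantities did not survive in the transcript. A hedged reconstruction under the standard one-occurrence-per-sequence model is given below. The symbols x_{k,l} (base at position l of sequence k, 1-based), p_0 (background frequencies), p_i (motif column i) and n_{ib} are notation assumed here, not taken from the slide.

```latex
% Conditional on the frequency matrix p and background p_0:
\Pr(x_k \mid \text{start} = j,\, p,\, p_0)
  = \prod_{l < j \ \text{or}\ l \ge j+W} p_0(x_{k,l})
    \;\prod_{i=1}^{W} p_i(x_{k,\,j+i-1})

% Slide example: sequence "ctCGTCggggc", j = 3, W = 4
  = p_0(c)\,p_0(t)\;\, p_1(C)\,p_2(G)\,p_3(T)\,p_4(C)\;\, p_0(g)^4\,p_0(c)

% Conditional on the motif start-positions:
\hat{p}_i(b) = \frac{n_{ib}}{N},
\qquad n_{ib} = \text{number of } b\text{'s at motif position } i
```

For the four aligned instances on the slide (TACG, CGGA, CCCG, CGCC), N = 4 and the first column contains T, C, C, C, so the estimate for C at position 1 is 3/4.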

  21. Estimation: Method I • Gibbs Motif Sampler • Bayesian model, prior distribution • Algorithm (MCMC) Initialization: Randomly select motif start-positions in each sequence Iterations: Remove randomly selected sequence k’ • Update frequency matrix • Randomly select a motif start-position j for k’ proportional to:
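The sampling loop above can be sketched as follows. The weight used to draw a new start-position did not survive in the transcript. The sketch assumes the usual choice, the ratio of the window's probability under the motif matrix to its probability under the background, and it replaces the Bayesian prior with simple pseudo-counts. It is an illustration of the idea, not the Gibbs Motif Sampler software.

```python
import random

BASES = "ACGT"

def column_freqs(sequences, starts, width, skip, pseudo=0.5):
    """Frequency matrix from current motif instances, leaving out one sequence."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for k, (seq, j) in enumerate(zip(sequences, starts)):
        if k == skip:
            continue
        for i in range(width):
            counts[i][seq[j + i]] += 1
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def base_composition(sequences):
    """Background frequencies estimated from all sequences (with pseudo-counts)."""
    counts = {b: 1.0 for b in BASES}
    for seq in sequences:
        for b in seq:
            counts[b] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}

def gibbs_sampler(sequences, width, iterations=2000, seed=0):
    """Return one sampled motif start-position (0-based) per sequence."""
    rng = random.Random(seed)
    sequences = [s.upper() for s in sequences]  # assumes only A, C, G, T
    background = base_composition(sequences)
    # Initialization: random start-position in each sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        k = rng.randrange(len(sequences))        # sequence to remove
        p = column_freqs(sequences, starts, width, skip=k)
        seq = sequences[k]
        weights = []
        for j in range(len(seq) - width + 1):
            ratio = 1.0
            for i in range(width):
                b = seq[j + i]
                ratio *= p[i][b] / background[b]  # assumed sampling weight
            weights.append(ratio)
        # Draw the new start for sequence k in proportion to the weights.
        starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because the result is a sample rather than an optimum, it is common to run several chains from different seeds and compare the motifs they settle on.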

  22. Gibbs Motif Sampler http://bayesweb.wadsworth.org/gibbs/gibbs.html

  23. Estimation: Method II • MEME • Missing data problem: Expectation-Maximization (EM) Algorithm to obtain maximum likelihood estimates • EM Algorithm • Initialization: Set frequency matrix p and p0 • Iterations: • E-step: Calculate probability of motif start-positions: for each sequence k and position j, Wkj = Pr(motif start-position = j | p) • M-step: Update frequency matrix estimate
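A compact sketch of the EM iteration for the one-occurrence-per-sequence model is shown below. Several details are assumptions made for illustration: the matrix is initialized from a seed word (0.7 for the seed base, 0.1 for the others), the background p0 is estimated once from base composition and held fixed, start-positions get a uniform prior, and a small pseudo-count is added in the M-step. This is not MEME's actual implementation.

```python
BASES = "ACGT"

def e_step(sequences, p, p0):
    """W[k][j]: posterior probability that the motif in sequence k starts at j."""
    width = len(p)
    posteriors = []
    for seq in sequences:
        ratios = []
        for j in range(len(seq) - width + 1):
            r = 1.0
            for i in range(width):
                b = seq[j + i]
                r *= p[i][b] / p0[b]   # background terms outside the window cancel
            ratios.append(r)
        total = sum(ratios)
        posteriors.append([r / total for r in ratios])
    return posteriors

def m_step(sequences, posteriors, width, pseudo=0.1):
    """Re-estimate the frequency matrix from posterior-weighted base counts."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for seq, w_k in zip(sequences, posteriors):
        for j, w_kj in enumerate(w_k):
            for i in range(width):
                counts[i][seq[j + i]] += w_kj
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def em_motif(sequences, seed_word, iterations=50):
    """Return (frequency matrix, most probable 0-based start per sequence)."""
    sequences = [s.upper() for s in sequences]  # assumes only A, C, G, T
    width = len(seed_word)
    tally = {b: 1.0 for b in BASES}
    for seq in sequences:
        for b in seq:
            tally[b] += 1
    p0 = {b: tally[b] / sum(tally.values()) for b in BASES}
    p = [{b: (0.7 if b == s else 0.1) for b in BASES} for s in seed_word.upper()]
    for _ in range(iterations):
        posteriors = e_step(sequences, p, p0)
        p = m_step(sequences, posteriors, width)
    posteriors = e_step(sequences, p, p0)
    starts = [max(range(len(w)), key=w.__getitem__) for w in posteriors]
    return p, starts
```

EM converges to a local maximum of the likelihood, so the result depends on the starting matrix. Trying several seed words and keeping the best-scoring run is a common workaround.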

  24. MEME http://meme.sdsc.edu/meme/website/meme.html

  25. MEME Output

  26. MAST: Sequence Scoring

  27. Model Extensions • Multiple occurrences in sequence • Motif width • Multiple motifs • Alternative background models • Palindromes • Gapped motifs • Dependencies between positions • Software: AlignACE (Roth et al., 1998), BioProspector (Liu et al., 2001) • Sometimes predicted motifs do not look “real”: they do not reflect structural constraints.

  28. Gal4 Motif Information Content

  29. Examples: Information Content Bi-modal yeast : gal4, abf1, pho4 E. coli : crp, purR Uni-modal

  30. Goal: Incorporate Structural Constraints into the Model • The nature of transcription factor-DNA interactions imposes constraints on the motifs: not all motifs are equally likely! • The objective is to bias the search toward motifs that reflect these types of structural constraints. • Example sites: CGGACAACTGATGACCG CGGAGCACAGTTGAGCG CGGCGGCTTCTAATCCG CGGAGGGCTGTCGCCCG • Methods: TFEM (Kechris et al., 2004), van Zwet et al. (2005)

  31. Permuted Motif

  32. TFEM: Blocks • For each position i= 1,2,… W, assign prior distribution f on multinomial parameters pi • According to block, prior distribution: high (fh) or medium (fm) information • Prior distribution penalizes deviations from high or medium information HIGH HIGH MEDIUM Bi-modal Information (reverse for Uni-modal)

  33. TFEM: Change Points • May not know change points between blocks • Include unobserved random variable for all change point pairs (s,t): W(W+1)/2 + 1 possible pairs

  34. TFEM: Results • Application • Sequences from co-regulated/co-expressed genes • Knowledge about transcription factor (family, structure) • Use method with expected motif shape (uni/bi-modal) • Results • Extended model with prior distribution helps bias the search • Evaluated with decoy motifs and “noisy” data (longer sequences)

  35. Recent Directions • Experimental Data • Microarrays • ChIP-Chip • Phylogenetic Analysis • Cross-species comparisons • Higher organisms • Motif Modules

  36. References • Reviews • Stormo GD (2000), Bioinformatics, 16:16-23 • Bulyk (2003), Genome Biology 5:201 • Logos • Schneider & Stephens (1990), Nucleic Acids Res. 18:6097-6100 • Enumerative • Jones and Pevzner (4.4-4.6) • Brazma et al. (1998), J. Comp. Bio, 5:279-305 • Probabilistic • GMS: Lawrence et al. (1993), Science, 262:208-214 • MEME: Bailey & Elkan (1995), Machine Learning, 21:51-80 • Structural Constraints • Kechris et al. (2004), Genome Biology, 5(7):R50. • van Zwet et al. (2005), Stat. Appl. Genet. & Mol. Biol., 4(1) Article 1.
