Algorithms in Bioinformatics: A Practical Introduction
Project: Motif finding using ChIP-seq peak data
Transcriptional Control (II)
• TATAAT is the motif!
Motif model
• A motif can be described in two ways, based on the binding sites discovered:
• Binding sites: TTGACA TCGACA TTGACA TTGAAA ATGACA TTGACA GTGACA TTGACT TTGACC TTGACA
• Consensus pattern: TTGACA
• Positional Weight Matrix (PWM)
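To make the two descriptions concrete, here is a minimal Python sketch that derives both from the ten binding sites listed on this slide. The function names (`count_matrix`, `frequency_matrix`, `consensus`) and the optional `pseudocount` parameter are illustrative choices, not part of the original material, and the "PWM" here is simply a per-position table of base frequencies.

```python
from collections import Counter

# The ten binding sites shown on the slide.
SITES = ["TTGACA", "TCGACA", "TTGACA", "TTGAAA", "ATGACA",
         "TTGACA", "GTGACA", "TTGACT", "TTGACC", "TTGACA"]
ALPHABET = "ACGT"


def count_matrix(sites):
    """Return one Counter per motif position, counting each base."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]


def frequency_matrix(sites, pseudocount=0.0):
    """Return per-position base frequencies as a list of dicts.

    A positive pseudocount keeps every entry non-zero, which matters
    if the matrix is later used in a log-odds score.
    """
    total = len(sites) + pseudocount * len(ALPHABET)
    return [{b: (col[b] + pseudocount) / total for b in ALPHABET}
            for col in count_matrix(sites)]


def consensus(sites):
    """Pick the most frequent base per position (ties go to A, C, G, T order)."""
    return "".join(max(ALPHABET, key=lambda b: col[b])
                   for col in count_matrix(sites))


if __name__ == "__main__":
    print(consensus(SITES))
    for position, column in enumerate(frequency_matrix(SITES), start=1):
        print(position, column)
```

The consensus discards the per-position variation that the matrix keeps. For example, position 1 is T in eight of the ten sites, with one A and one G.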
ChIP experiment
• Chromatin immunoprecipitation (ChIP) experiment
• Detects the interaction between a protein (transcription factor) and DNA.
Peak data
• Peak data represents the locations where a particular TF binds.
• The data tells us the locations and intensities of the peaks.
• (Note that, due to experimental error, peaks of low intensity may be noise.)
Figure: ChIP-seq data for human (MCF7), E2 treatment at 45 min, chr1:883,686-958,485
Our aim
• Given the DNA sequences of those peaks, find the motifs that occur in those peak regions.
• For the example below, we have two motifs: TTGACA and GCATC.
• Note that each instance has at most one mutation (mismatch) relative to its motif.
GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGT
GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG
CCCTCTGGAAATTAGTGCGGCATCTCACAACCCGAGGAATGACCAAATG
GTATTGAAAGTAAGGCAACGGTGATCCCCATGACACCAAAGATGCTAAG
CAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGC
GTCTCTTGACCGCTTAATCCTAAAGGCCTCCTATTAGTATCCGCAATGT
GAACAGGAGCGCGAGCCATCAATTGAAGCGAAGTTGACACCTAATAACT
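The "at most one mutation" condition can be read as a Hamming-distance threshold between the motif and each window of the sequence. The sketch below illustrates that reading, assuming the motif is already known. The helper names `hamming` and `find_instances` are made up for this example, and only the forward strand is scanned.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, motif, max_mismatches=1):
    """Return (position, window, mismatches) for every window of the
    sequence that is within max_mismatches of the motif (0-based positions)."""
    k = len(motif)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        distance = hamming(window, motif)
        if distance <= max_mismatches:
            hits.append((i, window, distance))
    return hits


if __name__ == "__main__":
    # First example sequence from the slide.
    seq = "GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGT"
    for hit in find_instances(seq, "TTGACA"):
        print(hit)
```

In the actual project the motif is unknown, so a scan like this is a building block for evaluating candidate motifs rather than a motif finder in itself.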
Input (I)
• For every peak, we extract the DNA sequence within approximately ±200 bp of the peak.
>cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20
CCTCCATACCAGCCCCAATGTTCTGCGTTCCCGAATGAAAGACACACAACACAGCCTTTATATTTTGATATGCCTAAAACTGCTCAATGGCTGGGCCACTTCCTAGCTAGTATCCACGTGGCTATCCCACCTCTCTCTGATATTCCCAAGTCATTACTTACTAAAATCTGTAATTACATCTTTGCTGCCCTAGGCCCAATCTGGCAGCCCTCCTGTGGCCCCTCAGGCTACTACATGGCAGCTAAGCTCTCTGACCCACATCTTCTCAGGCACCGTGCCTCCTCTTCTCCACCTTATTCAAACATGGTGGCTCTCCTTCCTCCTTCTTCCTGTCTGTCCCCAGCCTGGGAATTCTAAAAGTCCCACCTCTGTCTGCCCTGTTCAGCCATTGGCTGTCGGCATCTTTATTTACGAG
>cmyc_2_chr1_5073201_5073215_range_chr1_5073002_5073415_intensity_15
GGTCATAAACCAAGCTTCTTCAAAGATTTTTGGCTTTTTGGCACCAGTGGCCTGCAGGGTGGCGAGCTCTGCCAGTTTGAAGTGACCAAGTTAAGTGGCCTGGGAAAGGCCATTTGGTGCGCGGTCCAGCAGTTTTGGGCGCTCTCGGCTTCCGCCCTCAGCTGCGGTCACGTGCGGCTGCTCACGTGCCAGACGCTGCTGTCACTTCGTAGCTGTTCCGGCTTCCTCTGAGTGAGGCTCGCAACGTCTCCCACGGAGTCGCCTTCGTTCTGCTCTGGGTCTCCCGTGGCCACTGAGACCTCGGAGCTCGACCGGCGCCTGCCCGCCCGTGCGGCCCTCACTCCCCGAGGCTATCCAGGTGAGGCCGCCTGGGGTCCCTCCCCGGCTCCGGAGAGCCGACTGGTTTCCCTGCCG
>cmyc_3_chr1_9530642_9530652_range_chr1_9530443_9530852_intensity_36
GTAGTCCCAACCAGGTCCTGAGCTGGTTAGCCAACCCTCAGCGCCAGTCGGGCCAACATCCGGTGACGAATCCAAGTCCCGCCTCTAAGCCCATCTGCTGTCCAATGCCGCCCTCTGCCGGTCTTTACCTCCCCGCCTAGCTGTGAGCCGCTTCCAGACAACCCGGAAGTGATCTTTCCTCTTCCGGATTACGGGTCCGGACGTCCGCACGTGGTTGCCGGTTTAGGGTGCTGCTGTAGTGGCGATACGTCCCGCCGCTGTCCCGAAGTGAGGGATCCGAGCCGCAGCGAGAGCCATGGAGGGCCAGCGCGTGGAGGAGCTGCTGGCCAAGGCAGAGCAGGAGGAGGCGGAGAAGCTGCAGCGCATCACGGTGCACAAGGAGCTGGAGCTGGAGTTCGACCTGGGCAACC
……………
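The record headers in this sample file appear to encode the peak location, the extracted range, and the peak intensity. Below is a hedged Python sketch for reading such a file. The field interpretation (name, chromosome, peak start and end, range start and end, intensity) is inferred from the three headers shown above and is an assumption. `Peak`, `parse_header`, and `read_fasta` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class Peak(NamedTuple):
    name: str
    chrom: str
    peak_start: int
    peak_end: int
    range_start: int
    range_end: int
    intensity: int


# Pattern inferred from headers such as
# cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20
HEADER_RE = re.compile(
    r"^>?(?P<name>[A-Za-z0-9]+_\d+)"
    r"_(?P<chrom>chr[A-Za-z0-9]+)_(?P<pstart>\d+)_(?P<pend>\d+)"
    r"_range_chr[A-Za-z0-9]+_(?P<rstart>\d+)_(?P<rend>\d+)"
    r"_intensity_(?P<intensity>\d+)$"
)


def parse_header(header):
    """Split a sample-file header into its (assumed) fields."""
    m = HEADER_RE.match(header.strip())
    if m is None:
        raise ValueError(f"unrecognised header: {header!r}")
    return Peak(m["name"], m["chrom"], int(m["pstart"]), int(m["pend"]),
                int(m["rstart"]), int(m["rend"]), int(m["intensity"]))


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

With the intensity available as a number, low-intensity peaks (which may be noise) can be filtered out or down-weighted before motif finding. Any threshold would have to be chosen empirically.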
Input (II)
• A set of background sequences that are likely to contain no motif.
>SEQ_1
AACAAGGGAAAGAGTAGTGAGTGCTTCTTTCTATTCAGAGGGAGGGGAAGTTGCTGTTAGCTAAGACAGTCAGGACTGAGAAGGGGGGGGGGGGTTTAACTCTCCTGGAGGGAGCTGAGAGGTAAAGGGAGGGGCGTGAGGTAGAACAAGCCGAGAACACAGGGCAGGTTGGTCTGACTCCAGAGCACAGTGCAGGAGCCCGGAAGTTGACTCAGTTCAGTTAGCAAGTATTTTCACACAAGGCGTGAACACTGAAGACAAAAGCAAGAGACACAGCTCTATCTCTAAGAAGATTTTCAGAGCCAAGATCGATGGGGCACACCTGTTAATCCCAGCACTTAGGAGGCTGAGGCAGGAGGATCCCAAGTTCAAAACCAGCCTGGACTTGTTTTAAGGAAAA
>SEQ_2
AAAAAAAAAAAAAAGACTTCCAGTTTAATAAATGACCAATTCAGGAATGGAGATTAGGGCTGGATGACAAGTTTTTAATTGTCAAGGACTCAATTCTGTTTATCAGTTGGTATGGAATTATGTAAGCTTTTAGCGATATGACCGCACGGAGCAGTGTAGAGAGTGATCTGAGAGACGCTTGGGGGTCAGGATGGAGATAGAACTCCCTCTCTATTAGAAGGTGTTTGGTGGTAGGTAACCCTGGGCTAGCATGGTGGGTCTCTTCTTACTTAGGCTTCCATCTTTGTGGTTCAAATCCAAGAAGGACCTGCGTTCCCTCCCTCCTTGTGATCAGCTGATTGCTAGAGCATAACTCATCTTAACTTCTCATGTACTCTCCGGGTACAGGAAGGGAGGGGGC
>SEQ_3
CCACTGCTGACAGTGGAGCATGAAACGACCGGCTTCCTGACTATGTTGGTACCCTTTCAGGAGCCTAAAACAGTGCTTTCAATACTTGTGTCTATGTCTGTTAGCCACAACTTTCTAGTTTCCCAGAGAGATTTTGAAGTGTAGTTTTGTATTTGCTCAAATATATATTCATATGGTGAGGTGCACATTTTTTATATTATATTTTTATTCATTTATTTTTGGTGCTTGGGAATTATACTCTAGGAATAAAGCGCCTGGTAGAAAGTGGCACACATCTTTAATCCCAGCACTCAGGAAGCAGAGGCAGACAAATCTCTGCGTTCCAGGACAGCCTGGTCTATAGAGCAAGGTCCAAGCCAGCCAGGTTTACACAAAGAAACCTAGTGTGGAAAAGACAAAA
……………
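One common use of motif-free sequences, offered here as an assumption rather than something the slides prescribe, is to estimate background base frequencies and score candidate sites by log-odds against them. In this sketch the PWM is assumed to be a list of per-position dicts with no zero entries (for example, built with a pseudocount). `base_frequencies` and `log_odds_score` are illustrative names.

```python
import math
from collections import Counter


def base_frequencies(sequences, alphabet="ACGT"):
    """Estimate background base frequencies, ignoring other symbols such as N."""
    counts = Counter()
    for seq in sequences:
        counts.update(b for b in seq.upper() if b in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no usable bases in the background sequences")
    return {b: counts[b] / total for b in alphabet}


def log_odds_score(site, pwm, background):
    """Sum over positions of log2(P_motif(base) / P_background(base)).

    Assumes len(site) == len(pwm) and that no probability is zero.
    """
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    return sum(math.log2(pwm[i][b] / background[b])
               for i, b in enumerate(site))
```

A positive score means the site looks more like the motif than like background. A zeroth-order model such as this one ignores dependencies between neighbouring bases, which a higher-order background model would capture.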
Output
• You need to output a ranked list of candidate motifs.
• You can model each motif as a PWM or as a consensus sequence.
• If you model the motif as a PWM, one of the answers for the previous dataset is: (PWM figure not preserved)
• You may also return other significant motifs.
Aim of the project
• Given a sample file and a background file, you need to implement a method which outputs a list of motifs.
• You need to take advantage of the fact that this is a ChIP-seq dataset.
• Hint: Read papers on ChIP-seq and understand its properties.
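As a possible starting point, and not the method the project expects, the sketch below ranks fixed-length k-mers by how much more often they occur in sample sequences than in background sequences. The function names, the pseudocount, and the choice of k=6 are all assumptions. It scans only the forward strand and tolerates no mismatches, and it does not yet use ChIP-seq-specific properties such as peak intensity.

```python
from collections import Counter


def kmer_presence(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A set ensures each k-mer is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def rank_kmers(sample, background, k=6, pseudocount=1.0, top=10):
    """Rank k-mers by the ratio of (smoothed) sample to background presence."""
    in_sample = kmer_presence(sample, k)
    in_background = kmer_presence(background, k)
    n_sample, n_background = len(sample), len(background)
    scored = []
    for kmer, count in in_sample.items():
        f_sample = (count + pseudocount) / (n_sample + 2 * pseudocount)
        f_background = ((in_background[kmer] + pseudocount)
                        / (n_background + 2 * pseudocount))
        scored.append((kmer, f_sample / f_background))
    # Highest enrichment first; break ties alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]


if __name__ == "__main__":
    sample = ["AATTGACAGG", "CCTTGACATT", "GGTTGACACC"]
    background = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
    print(rank_kmers(sample, background, k=6, top=3))
```

Top-ranked k-mers could then seed a PWM built from their near-matches in the sample sequences.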