
Bioinformatics and Data Warehousing Introduction to Bioinformatics FASTA File Format

Bioinformatics and Data Warehousing Introduction to Bioinformatics FASTA File Format Searching Gene Sequences (BLAST) Data Management in Biomedical Informatics. Michael Kane, Ph.D. Computer & Information Technology Bindley Bioscience Center Purdue University. DNA is Information Storage.


Presentation Transcript


  1. Bioinformatics and Data Warehousing
     • Introduction to Bioinformatics
     • FASTA File Format
     • Searching Gene Sequences (BLAST)
     • Data Management in Biomedical Informatics
     Michael Kane, Ph.D., Computer & Information Technology, Bindley Bioscience Center, Purdue University

  2. DNA is Information Storage

  3. “Zipped Files” Decompression “Executable Files”

  4. DNA is Double Stranded – One strand is the “coding strand” and the other strand is there to stabilize the DNA sequence when not in use. Double-stranded DNA is very durable in our environment.

  5. CAGGACCATGGAACTCAGCGTCCTCCTCTTCCTTGCACTCCTCACAGGACTCTTGCTACTCCTGGTTCAGCGCCACCCTAACACCCATGACCGCCTCCCACCAGGGCCCCGCCCTCTGCCCCTTTTGGGAAACCTTCTGCAGATGGATAGAAGAGGCCTACTCAAATCCTTTCTGAGGTTCCGAGAGAAATATGGGGACGTCTTCACGGTACACCTGGGACCGAGGCCCGTGGTCATGCTGTGTGGAGTAGAGGCCATACGGGAGGCCCTTGTGGACAAGGCTGAGGCCTTCTCTGGCCGGGGAAAAATCGCCATGGTCGACCCATTCTTCCGGGGATATGGTGTGATCTTTGCCAATGGAAACCGCTGGAAGGTGCTTCGGCGATTCTCTGTGACCACTATGAGGGACTTCGGGATGGGAAAGCGGAGTGTGGAGGAGCGGATTCAGGAGGAGGCTCAGTGTCTGATAGAGGAGCTTCGGAAATCCAAGGGGGCCCTCATGGACCCCACCTTCCTCTTCCAGTCCATTACCGCCAACATCATCTGCTCCATCGTCTTTGGAAAACGATTCCACTACCAAGATCAAGAGTTCCTGAAGATGCTGAACTTGTTCTACCAGACTTTTTCACTCATCAGCTCTGTATTCGGCCAGCTGTTTGAGCTCTTCTCTGGCTTCTTGAAATACTTTCCTGGGGCACACAGGCAAGTTTACAAAAACCTGCAGGAAATCAATGCTTACATTGGCCACAGTGTGGAGAAGCACCGTGAAACCCTGGACCCCAGCGCCCCCAAGGACCTCATCGACACCTACCTGCTCCACATGGAAAAAGAGAAATCCAACGCACACAGTGAATTCAGCCACCAGAACCTCAACCTCAACACGCTCTCGCTCTTCTTTGCTGGCACTGAGACCACCAGCACCACTCTCCGCTACGGCTTCCTGCTCATGCTCAAATACCCTCATGTTGCAGAGAGAGTCTACAGGGAGATTGAACAGGTGATTGGCCCACATCGCCCTCCAGAGCTTCATGACCGAGCCAAAATGCCATACACAGAGGCAGTCATCTATGAGATTCAGAGATTTTCCGACCTTCTCCCCATGGGTGTGCCCCACATTGTCACCCAACACACCAGCTTCCGAGGGTACATCATCCCCAAGGACACAGAAGTATTTCTCATCCTGAGCACTGCTCTCCATGACCCACACTACAGGACCATGGAACTCAGCGTCCTCCTCTTCCTTGCACTCCTCACAGGACTCTTGCTACTCCTGGTTCAGCGCCACCCTAACACCCATGACCGCCTCCCACCAGGGCCCCGCCCTCTGCCCCTTTTGGGAAACCTTCTGCAGATGGATAGAAGAGGCCTACTCAAATCCTTTCTGAGGTTCCGAGAGAAATATGGGGACGTCTTCACGGTACACCTGGGACCGAGGCCCGTGGTCATGCTGTGTGGAGTAGAGGCCATACGGGAGGCCCTTGTGGACAAGGCTGAGGCCTTCTCTGGCCGGGGAAAAATCGCCATGGTCGACCCATTCTTCCGGGGATATGGTGTGATCTTTGCCAATGGAAACCGCTGGAAGGTGCTTCGGCGATTCTCTGTGACCACTATGAGGGACTTCGGGATGGGAAAGCGGAGTGTGGAGGAGCGGATTCAGGAGGAGGCTCAGTGTCTGATAGAGGAGCTTCGGAAATCCAAGGGGGCCCTCATGGACCCCACCTTCCTCTTCCAGTCCATTACCGCCAACATCATCTGCTCCATCGTCTTTGGAAAACGATTCCACTACCAAGATCAAGAGTTCCTGAAGATGCTGAACTTGTTCTACCAGACTTTTTCACTCATCAGCTCTGTATTCGGCCAGCTGTTTGAGCTCTTCTCTGGCTTCTTGAAATACTTTCCTGGGGCACACAGGCAAGTTTACAAAAACCTGCAGGAAATCAATGCTTACATTGGCCACAGTGTGGAGAAGCACCGTGAAACCCTGGACCCCAGCGC
CCCCAAGGACCTCATCGACACCTACCTGCTCCACATGGAAAAAGAGAAATCCAACGCACACAGTGAATTCAGCCACCAGAACCTCAACCTCAACACGCTCTCGCTCTTCTTTGCTGGCACTGAGACCACCAGCACCACTCTCCGCTACGGCTTCCTGCTCATGCTCAAATACCCTCATGTTGCAGAGAGAGTCTACAGGGAGATTGAACAGGTGATTGGCCCACATCGCCCTCCAGAGCTTCATGACCGAGCCAAAATGCCATACACAGAGGCAGTCATCTATGAGATTCAGAGATTTTCCGACCTTCTCCCCATGGGTGTGCCCCACATTGTCACCCAACACACCAGCTTCCGAGGGTACATCATCCCCAAGGACACAGAAGTATTTCTCATCCTGAGCACTGCTCTCCATGACCCACACTA

  6. THEREDCAT_HSDKLSD_WASNOTHOTBUT_WKKNASDNKSAOJ.ASDNALKS_WASWET_ASDFLKSDOFIJEIJKNAWDFN_ANDMAD_WERN.JSNDFJN_YETSAD_MNSFDGPOIJD_BUTTHEFOX_SDKMFIDSJIR.JER_GOTWET_JSN.DFOIAMNJNER_ANDATEHIM.THEREDCAT_HSDKLSD_WASNOTHOTBUT_WKKNASDNKSAOJ.ASDNALKS_WASWET_ASDFLKSDOFIJEIJKNAWDFN_ANDMAD_WERN.JSNDFJN_YETSAD_MNSFDGPOIJD_BUTTHEFOX_SDKMFIDSJIR.JER_GOTWET_JSN.DFOIAMNJNER_ANDATEHIM.

  7. Add a 2 x 2 lego block… Add a 2 x 3 lego block… Add a 2 x 4 lego block… Start with a thin 2 x 4 lego block…

  8. What are the comparative genome sizes of humans and other organisms being studied? Genome size does not correlate with evolutionary status, nor is the number of genes proportional to genome size.

  9. FASTA File Format
      >gi|1924940|emb|CAA67058.1| myosin-IF [Homo sapiens]
      QEKLTSRKMDSRWGGRSESINVTLNVEQAAYTRDALAKGLYARLFDFLVEAINRAMQKPQEEYSIGVLDI
      YGFEIFQKNGFEQFCINFVNEKLQQIFIELTLKAEQEEYVQEGIRWTPIQYFNNKVVCDLIENKLSPPGI
      MSVLDDVCATMHATGGGADQTLLQKLQAAVGTHEHFNSWSAGFVIHHYAGKVSYDVSGFCERNRDVLFSD
      LIELMQSSDQAFLRMLFPEKLDGDKKGRPSTAGSKIKKQANDLVATLMRCTPHYIRCIKPNETKHARDWE
      ENRVQHQVEYLGLKENIRVRRAGFAYRRQFAKFLQRYAILTPETWPRWRGDERQGVQHLLRAVNMEPDQY
      QMGSTKVFVKNPESLFLLEEVRERKFDGFARTIQKAWRRHVAVRKYEEMREEASNILLNKKERRRNSINR
      NFVGDYLGLEERPELRQFLGKKERVDFADSVTKYDRRFKPIKRDLILTPKCVYVIGREKMKKGPEKGPVC
      EILKKKLDIQALRGVSLSTRQDDFFILQEDAADSFLESVFKTEFVSLLCKRFEEATRRPLPLTFSDTLQF
      RVKKEGWGGGGTRSVTFSRGFGDLAVLKVGGRTLTVSVGDGLPKNSKPTGKGLAKGKPRRSSQAPTRAAP
      GAPQGMDRNGAPLCPQGGAPCPLEKFIWPRGHPQASPALRPHPWDASRRPRARPPSEHNTEFLNVPDQGM
      AGMQRKRSVGQRPVPVGRPKPQPRTHGPRCRALYQYVGQDVDELSFNVNEVIEILMEDPSGWWKGRLHGQ
      EGLFPGNYVEKI

      TinySeq XML
      <?xml version="1.0"?>
      <!DOCTYPE TSeq PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
      <TSeq>
      <TSeq_seqtype value="nucleotide"/>
      <TSeq_gi>1924939</TSeq_gi>
      <TSeq_accver>X98411.1</TSeq_accver>
      <TSeq_taxid>9606</TSeq_taxid>
      <TSeq_orgname>Homo sapiens</TSeq_orgname>
      <TSeq_defline>Homo sapiens partial mRNA for myosin-IF</TSeq_defline>
      <TSeq_length>2711</TSeq_length>
      <TSeq_sequence>CAGGAGAAGCTGACCAGCCGCAAGATGGACAGCCGCTGGGGCGGGCGCAGCGAGTCCATCAATGT……
      </TSeq>
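
The header line of the FASTA record on this slide packs several identifiers into one pipe-delimited string, whereas the TinySeq XML version gives each field its own tag. A minimal sketch of pulling the fields out of such a header follows; the function name `parse_defline` is hypothetical, and the code assumes the identifiers come in tag/value pairs as they do in this example, which is not guaranteed for every FASTA file.

```python
def parse_defline(defline):
    """Split an NCBI-style FASTA header into identifier pairs and a description."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # Everything up to the first space is the identifier block; the rest is free text.
    id_block, _, description = defline[1:].partition(" ")
    fields = id_block.split("|")
    # Assumption: fields alternate tag, value, tag, value (gi / 1924940, emb / CAA67058.1).
    identifiers = dict(zip(fields[0::2], fields[1::2]))
    return identifiers, description.strip()


identifiers, description = parse_defline(
    ">gi|1924940|emb|CAA67058.1| myosin-IF [Homo sapiens]"
)
print(identifiers)
print(description)
```

The XML form trades compactness for explicit field names, so a parser does not need to rely on positional conventions like the one assumed here.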

  10. FASTA File Format DATABASE (DATA WAREHOUSE)
      >GENE NUMBER ONE
      AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC........
      >GENE NUMBER TWO
      AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC........
      >GENE NUMBER THREE
      AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC........
      >GENE NUMBER FOUR
      AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC........
      >GENE NUMBER FIVE
      AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC........
      >GENE NUMBER SIX
      AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC........
      >GENE NUMBER SEVEN
      AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC........
      . . . . .
      >GENE NUMBER TWENTY MILLION
      AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC........
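
A FASTA database like the one on this slide is simply many records concatenated in one text file, each starting with a `>` header line. As a rough sketch of how such a file can be read one record at a time, the hypothetical generator `read_fasta` below yields (header, sequence) pairs; the two toy records are made up for illustration, and the code assumes sequence data may wrap over several lines.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Toy input standing in for a real database file.
example = io.StringIO(
    ">GENE NUMBER ONE\nAGCTGCTAGTAGAGTCG\nCTCGGATAGGACTGCTAGCTGC\n"
    ">GENE NUMBER TWO\nAGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC\n"
)
records = list(read_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

Because the generator holds only one record in memory at a time, the same loop works whether the file has seven records or twenty million.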

  11. DYNAMIC PROGRAMMING and SEQUENCE SEARCHES
      'Dynamic programming' is an efficient programming technique for solving certain combinatorial problems. It is particularly important in bioinformatics because it is the basis of the sequence alignment algorithms used to compare protein and DNA sequences. In this application, dynamic programming gives a spectacular efficiency gain over a purely recursive algorithm, because each subproblem is solved once and its result is reused.
      Don't expect much enlightenment from the etymology of the term 'dynamic programming,' though. Dynamic programming was formalized in the early 1950s by mathematician Richard Bellman, who was working at RAND Corporation on optimal decision processes. He wanted to concoct an impressive name that would shield his work from US Secretary of Defense Charles Wilson, a man known to be hostile to mathematics research. His work involved time series and planning, hence 'dynamic' and 'programming' (note, nothing particularly to do with computer programming). Bellman especially liked 'dynamic' because "it's impossible to use the word dynamic in a derogatory sense"; he figured dynamic programming was "something not even a Congressman could object to."
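
To make the efficiency gain concrete, the sketch below scores the best alignment of two short sequences twice: once with plain recursion and once with the same recursion plus a cache of already-solved subproblems. The scoring scheme (1 for a match, 0 for a mismatch or gap) and the function names are assumptions chosen for illustration, and the sequences are shortened so the plain recursion finishes quickly.

```python
from functools import lru_cache


def naive_score(a, b, counter):
    """Best alignment score by plain recursion; counter[0] counts the calls."""
    counter[0] += 1
    if not a or not b:
        return 0
    match = 1 if a[0] == b[0] else 0
    return max(
        naive_score(a[1:], b[1:], counter) + match,  # align a[0] with b[0]
        naive_score(a[1:], b, counter),              # gap opposite a[0]
        naive_score(a, b[1:], counter),              # gap opposite b[0]
    )


def memo_score(a, b):
    """Same recursion, but each (i, j) subproblem is computed only once."""
    calls = [0]

    @lru_cache(maxsize=None)
    def score(i, j):
        calls[0] += 1  # only incremented on a cache miss
        if i == len(a) or j == len(b):
            return 0
        match = 1 if a[i] == b[j] else 0
        return max(score(i + 1, j + 1) + match, score(i + 1, j), score(i, j + 1))

    return score(0, 0), calls[0]


seq_a, seq_b = "GAATTCAG", "GGATCGA"
naive_calls = [0]
naive_result = naive_score(seq_a, seq_b, naive_calls)
memo_result, memo_calls = memo_score(seq_a, seq_b)
print("plain recursion calls:", naive_calls[0])
print("cached subproblems:   ", memo_calls)
```

The cached version can never evaluate more than (len(a) + 1) × (len(b) + 1) subproblems, while the number of calls made by the plain recursion grows exponentially with sequence length.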

  12. DYNAMIC PROGRAMMING and SEQUENCE SEARCHES
      The following is an example of global sequence alignment using the Needleman-Wunsch technique. For this example, the two sequences to be globally aligned are:
          G A A T T C A G T T A (sequence #1)
          G G A T C G A (sequence #2)
      Initialization Step: Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.

  13. DYNAMIC PROGRAMMING and SEQUENCE SEARCHES Matrix Fill Step

  14. DYNAMIC PROGRAMMING and SEQUENCE SEARCHES
      Traceback Step

          (Seq #1) A
                   |
          (Seq #2) A

          (Seq #1) T A
                     |
          (Seq #2) _ A

          (Seq #1) T T A
                       |
          (Seq #2) _ _ A

      There are multiple solutions to this alignment, and most dynamic programming algorithms print out only a single solution.

          G A A T T C A G T T A
          |   |   | |   |     |
          G G A _ T C _ G _ _ A

          _ G A A T T C A G T T A
            | |     | |   |     |
          G G A _ _ T C _ G _ _ A
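
The three steps (initialization, matrix fill, traceback) can be sketched in a few lines of Python. The slides state only that gaps carry no penalty; the match score of 1 and mismatch score of 0 used here are assumptions, as are the function name and the tie-breaking order in the traceback. Because several alignments share the top score, this sketch may return a different, equally good alignment than the ones shown on the slide.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=0):
    """Global alignment; returns (score, aligned seq1, aligned seq2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1

    # Initialization: first row and column hold cumulative gap penalties
    # (all zeros when gap == 0, as in the slide example).
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if seq2[i - 1] == seq1[j - 1] else mismatch

    # Matrix fill: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback: walk from the bottom-right cell back to the top-left,
    # following one move that explains each cell's score.
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1]); i -= 1; j -= 1
        elif j > 0 and score[i][j] == score[i][j - 1] + gap:
            top.append(seq1[j - 1]); bottom.append("_"); j -= 1
        else:
            top.append("_"); bottom.append(seq2[i - 1]); i -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, aligned1, aligned2 = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(best)
print(" ".join(aligned1))
print(" ".join(aligned2))
```

Under this scoring the total score equals the number of matched columns, which agrees with the six vertical bars in each alignment on the slide.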

  15. BLAST (Basic Local Alignment Search Tool)
      Why is BLAST so fast? Because every possible 11-letter word is pre-indexed against the database records that contain it (4^11 = 4,194,304 possible words).
          ...
          GTCGTAGTCGATCGTAGTCG
          CTCGTAGTCG
          ...
      Steps:
      1) Find all the 11-letter words in your query sequence, plus a few variations.
      2) Look these up in the 11-letter-word index.
      3) Retrieve all sequences containing those words.
      4) Use a rigorous algorithm (e.g. Smith-Waterman) to extend the match in both directions.
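
Steps 1 to 3 amount to building a lookup table from each 11-letter word to the places it occurs, then looking up the words of the query. The sketch below illustrates that idea only; it is not BLAST's actual implementation. The function names and the toy database are hypothetical, and the "few variations" of step 1 and the extension of step 4 are left out.

```python
from collections import defaultdict

WORD_SIZE = 11
print(4 ** WORD_SIZE)  # number of possible 11-letter DNA words


def build_word_index(database, k=WORD_SIZE):
    """Map every k-letter word to the (record name, position) pairs where it occurs."""
    index = defaultdict(set)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add((name, pos))
    return index


def find_seeds(query, index, k=WORD_SIZE):
    """Return (query position, record name, record position) for each exact word hit."""
    hits = []
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        for name, dpos in sorted(index.get(word, ())):
            hits.append((qpos, name, dpos))
    return hits


# Toy database; real indexes are built once and reused for every query.
database = {
    "gene_one": "GTCGTAGTCGATCGTAGTCG",
    "gene_two": "AAAAAAAAAAAAAAAAAAAA",
}
index = build_word_index(database)
seeds = find_seeds("TTGTAGTCGATCGTAGAA", index)
for seed in seeds:
    print(seed)
```

Each seed pinpoints a record and an offset, so the expensive alignment step only has to run on the handful of records that share a word with the query rather than on the whole database.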

  16. http://www.ncbi.nlm.nih.gov/
      >UNKNOWN GENE SEQUENCE
      AGGACCATGGAACTCAGCGTCCTCCTCTTCCTTGCACTCCTCACAGGACTCTTGCTACTCCTGGTTCAGCGCCACCCTAACACCCATGACCGCCTCCCACCAGGGCCCCGCCCTCTGCCCCTTTTGGGAAACCTTCTGCAGATGGATAGAAGAGGCCTACTCAAATCCTTTCTGAGGTTCCGAGAGAAATATGGGGACGTCTTCACGGTACACCTGGGACCGAGGCCCGTGGTCATGCTGTGTGGAGTAGAGGCCATACGGGAGGCCCTTGTGGACAAGGCTGAGGCCTTCTCTGGCCGGGGAAAAATCGCCATGGTCGACCCATTCTTCCGGGGATATGGTGTGATCTTTGCCAATGGAAACCGCTGGAAGGTGCTTCGGCGATTCTCTGTGACCACTATGAGGGACTTCGGGATGGGAAAGCGGAGTGTGGAGGAGCGGATTCAGGAGGAGGCTCAGTGTCTGATAGAGGAGCTTCGGAAATCCAAGGGGGCCCTCATGGACCCCACCTTCCTCTTCCAGTCCATTACCGCCAACATCATCTGCTCCATCGTCTTTGGAAAACGATTCCACTACCAAGATCAAGAGTTCCTGAAGATGCTGAACTTGTTCTACCAGACTTTTTCACTCATCAGCTCTGTATTCGGCCAGCTGTTTGAGCTCTTCTCTGGCTTCTTGAAATA

  17. Gene Cloning DB New Gene Sequences (1 per second!) >GENE Agtgctcgatagatcgctcgcata… Genomic Database (DATA WAREHOUSE) RESULTS Results DB
