
Bioinformatics Basics

Bioinformatics Basics. Cyrus. Courtesy of LO Leung Yau’s original presentation. Outline: Biological Background, Cell, Protein, DNA & RNA, Central Dogma, Gene Expression, Bioinformatics, Sequence Analysis, Phylogenetic Trees, Data Mining.


Presentation Transcript


  1. Bioinformatics Basics • Cyrus • Courtesy of LO Leung Yau’s original presentation

  2. Outline • Biological Background • Cell • Protein • DNA & RNA • Central Dogma • Gene Expression • Bioinformatics • Sequence Analysis • Phylogenetic Trees • Data Mining

  3. Biological Background – Cell • Basic unit of organisms • Prokaryotic • Eukaryotic • A bag of chemicals • Metabolism controlled by various enzymes • Correct working needs • Suitable amounts of various proteins Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

  4. Biological Background – Protein • Polymer of 20 types of Amino Acids • Folds into 3D structure • Shape determines the function • Many types • Transcription Factors • Enzymes • Structural Proteins • … Picture taken from http://en.wikipedia.org/wiki/Protein http://en.wikipedia.org/wiki/Amino_acid

  5. Biological Background – DNA & RNA • DNA • Double stranded • Adenine, Cytosine, Guanine, Thymine • A-T, G-C • Those parts coding for proteins are called genes • RNA • Single stranded • Adenine, Cytosine, Guanine, Uracil Picture taken from http://en.wikipedia.org/wiki/Gene
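The A-T / G-C pairing rule and the T-to-U difference between DNA and RNA can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are made up for the example, and transcription is simplified to rewriting the coding strand with U in place of T.

```python
# Watson-Crick pairing: A<->T and G<->C (lower case is preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_strand):
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return coding_strand.replace("T", "U").replace("t", "u")


print(reverse_complement("ATGC"))  # GCAT
print(transcribe("ATGC"))          # AUGC
```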

  6. Biological Background – Genes • Genes – protein coding regions 3 nucleotides code for one amino acid There are also start and stop codons
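The triplet code with start and stop codons can be illustrated with a short Python sketch. It assumes the standard genetic code and a deliberately naive reading rule (start at the first ATG, stop at the first in-frame stop codon); the function name `translate` and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# '*' marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first ATG (start codon) up to the first stop codon.

    Assumes the input contains only A, C, G, T (any case).
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ccATGGCCTGAaa"))  # MA
```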

  7. Biological Background – in a nutshell • Abstractions • Blueprints: DNAs • Templates: RNAs • Functional Units: Proteins • Not only the information (data), but also the control signals about what and how much data is to be sent; proteins (TFs) also help

  8. Biological Background – Sequences • Abstractions
     Sequences:
     acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAacctactggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaatactggatacagggcatataaaacaggggcaaggcacagactc
     Annotations:
     FT   intron          <1..28
     FT                   /gene="CREB"
     FT                   /number=3
     FT                   /experiment="experimental evidence…
     FT                   recorded"
     FT   exon            29..174
     FT                   /gene="CREB"
     FT                   /number=4
     FT                   /experiment="experimental evidence…
     FT                   recorded"
     FT   intron          175..>189
     FT                   /gene="CREB"
     FT                   /number=4
     Visualizations
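To show how such annotations map onto a sequence, here is a small Python sketch that pulls the feature type and coordinates out of the `FT` lines and slices a sequence with them. The regular expression, the function names and the trimmed `ft_lines` list are assumptions made for the example; it is not a full parser for the annotation format.

```python
import re

# Matches feature lines such as "FT   exon   29..174"; the optional
# '<' and '>' mark features that extend beyond the shown sequence.
FEATURE_RE = re.compile(r"^FT\s+(\w+)\s+<?(\d+)\.\.>?(\d+)$")


def parse_features(lines):
    """Return (feature_type, start, end) tuples with 1-based inclusive coordinates."""
    features = []
    for line in lines:
        match = FEATURE_RE.match(line.strip())
        if match:  # qualifier lines like /gene="CREB" do not match and are skipped
            features.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return features


def extract(sequence, start, end):
    """Slice a sequence using 1-based inclusive coordinates."""
    return sequence[start - 1:end]


# Feature lines from the slide (the /experiment qualifiers are left out here).
ft_lines = [
    'FT   intron          <1..28',
    'FT                   /gene="CREB"',
    'FT                   /number=3',
    'FT   exon            29..174',
    'FT                   /gene="CREB"',
    'FT                   /number=4',
    'FT   intron          175..>189',
    'FT                   /gene="CREB"',
    'FT                   /number=4',
]

for kind, start, end in parse_features(ft_lines):
    print(kind, start, end)
```

Calling `extract(sequence, 29, 174)` on the slide's sequence would then return the exon portion.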

  9. Biological Background – DNA → RNA → Protein [Diagram label: gene] Picture taken from http://en.wikipedia.org/wiki/Gene

  10. Biological Background – DNA → RNA → Protein • Diagram labels: Genes, Promoter regions, Transcription Factors, Binding sites, Other functions • A Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).

  11. Complex Interactions between Genes, TFs and TFBSs

  12. Biological Background – DNA → RNA → Protein • Diagram labels: Genes, Promoter regions, Transcription Factors, Binding sites, Other functions • A Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).

  13. Gene Expression Microarray Data • High throughput • Measures RNA level • Relies on A-T, G-C pairing • Can monitor expression of many genes Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

  14. Gene Expression Microarray Data • Axes: Time points/Conditions and Genes • Colors: Expression (RNA) Levels Picture taken from http://en.wikipedia.org/wiki/DNA_microarray

  15. Bioinformatics—Sequence Analysis • Alignments • a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences http://en.wikipedia.org/wiki/Sequence_alignment

  16. Bioinformatics—Sequence Analysis • Pair-wise alignments • Method: dynamic programming! No penalty for the consecutive ‘-’s before and after the sequence to be aligned \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures
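The dynamic-programming idea with free leading and trailing gaps can be sketched as follows. This is an illustrative Python version under assumed scoring values (match +1, mismatch -1, gap -1), which the slide does not specify; it returns only the best score, not the alignment itself, and the function name is hypothetical.

```python
def semiglobal_score(query, target, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of `query` somewhere inside `target`.

    Runs of '-' placed before and after the query cost nothing;
    gaps inside the alignment are penalised as usual.
    """
    m, n = len(query), len(target)
    # score[i][j]: best score for query[:i] aligned so that it ends at target[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # Row 0 stays at 0: skipping a prefix of the target is free.
    for i in range(1, m + 1):
        score[i][0] = i * gap  # query letters with nothing to pair against
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two letters
                score[i - 1][j] + gap,       # gap in the target
                score[i][j - 1] + gap,       # gap in the query
            )
    # Taking the maximum over the last row makes the trailing gaps free.
    return max(score[m])


print(semiglobal_score("TTGCATCA", "ACATGTTTGCATCAACCT"))  # 8: exact occurrence
```

Setting the first row to `j * gap` and returning `score[m][n]` instead would turn this into ordinary global alignment.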

  17. Bioinformatics—Sequence Analysis • Multiple (global) sequence alignment • Also dynamic programming (but can’t scale up!)
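A rough way to see why the approach does not scale, assuming the pairwise table is simply extended to one dimension per sequence: the number of table cells grows exponentially with the number of sequences. The helper name below is hypothetical.

```python
def dp_table_cells(num_sequences, length):
    """Cells in a full DP table for `num_sequences` sequences of equal length."""
    return (length + 1) ** num_sequences


for k in (2, 3, 5, 10):
    print(k, "sequences of length 100:", dp_table_cells(k, 100), "cells")
```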

  18. Bioinformatics—Sequence Analysis • Multiple local sequence alignment • i.e. Motif (pattern) discovery >seq1 acatggccgatcagctggtttttgtgtgcctgtttctgaatc >seq2 ttctattttacgtaaatcagcttgaacatgtacctactggtg >seq3 atgcacctttgatcaataccagctagacaaacgtgtgttg >seq4 agtccaaagatcagggctggctgaatactggatcagct >seq5 cagctacagggcatataaaggggcaaggcacagactc Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes). TFBSs are the controlling key holes in gene regulation!
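As a very simplified illustration of looking for overrepresented patterns, the Python sketch below counts, for every exact substring of length k, how many of the five sequences contain it. Real motif discovery tolerates mismatches and uses statistical models; exact k-mer counting, the choice k = 5 and the function name are assumptions made only for this example.

```python
from collections import Counter


def kmer_support(sequences, k):
    """Count in how many sequences each k-mer occurs (at most once per sequence)."""
    support = Counter()
    for seq in sequences:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return support


# The five sequences from the slide.
seqs = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",
    "agtccaaagatcagggctggctgaatactggatcagct",
    "cagctacagggcatataaaggggcaaggcacagactc",
]

support = kmer_support(seqs, 5)
shared = sorted(kmer for kmer, count in support.items() if count == len(seqs))
print(shared)  # 5-mers present in every sequence; "cagct" is among them
```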

  19. DNA motifs • Similar DNA fragments across individuals and/or species • TFBS Motifs: DNA fragments similar to “TATAA” are common and are needed for genes to function • Expensive and time-consuming to try a large set of candidates in biological experiments [Diagram labels: Transcription Factor (TF); TFBS (controlling), e.g. TATAA; Gene (functioning); DNA; Transcription; RNA; Translation; Protein]

  20. Motif discovery • TFBS Motif Discovery: similar controlled functions, e.g. cancer gene activities (example motif: CGATTGA) • SNP (single nucleotide polymorphism) Motif Discovery: distinguish DNA from different people (Normal vs. Disease) [Diagram labels: A, T, C, G; Normal; Disease; Maximized]

  21. Bioinformatics—Data mining • Classification • To predict! • Pre-processing—tidy up your materials! • Feature selection—the key points to go over • Classifier—the thinking style/manner of how to combine the key points and get some answer • Training—your practice of your thinking manner with answers known • Validation—mock quiz to evaluate what you’ve learnt from the training • Testing—your examination! Underfitting & Overfitting \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf
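The train / validate / test steps can be walked through with a toy Python example. The nearest-centroid classifier is a stand-in chosen for brevity, not a method named on the slides, and every number and label below is hypothetical.

```python
def centroid(rows):
    """Mean feature vector of a list of equal-length feature lists."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def train(samples):
    """Training: learn one centroid per class from (features, label) pairs."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, features):
    """Classify by the closest class centroid (squared Euclidean distance)."""
    def distance_to(label):
        return sum((a - b) ** 2 for a, b in zip(features, model[label]))
    return min(model, key=distance_to)


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the known label."""
    return sum(predict(model, f) == y for f, y in samples) / len(samples)


# Hypothetical two-feature samples, already pre-processed and split three ways.
training = [([1.0, 1.2], "normal"), ([0.8, 1.0], "normal"),
            ([3.0, 3.1], "disease"), ([3.2, 2.9], "disease")]
validation = [([1.1, 0.9], "normal"), ([2.9, 3.3], "disease")]
testing = [([0.9, 1.1], "normal"), ([3.1, 3.0], "disease")]

model = train(training)
print("validation accuracy:", accuracy(model, validation))  # the mock quiz
print("test accuracy:", accuracy(model, testing))           # the examination
```

Keeping the test set untouched until the very end is what makes its score a fair estimate; a model that scores well on training data but poorly on validation data is likely overfitting.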

  22. TRANSFAC Project • 1 Modeling: statistical models, representations, Markov chains; Discovery: stochastic searching, indexing (suffix trees) • 2 Relationship: TF-TFBS, TFBS-Gene, … (understanding, prediction); Mining: text mining, approximate matching • 3 Annotations: accurate wet-lab candidates (reduced labor and costs); Computation: large-scale data processing, parallel computing • TF: Transcription Factors, important regulators • TFBS: Transcription Factor Binding Site, major regulatory elements • TRANSFAC: the most representative DB for TFs and TFBSs • Representative Publications [1] Gang Li, Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee, A Cluster Refinement Algorithm for Motif Discovery, IEEE/ACM Transactions on Computational Biology and Bioinformatics (accepted) [2] Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee, TFBS identification based on genetic algorithm with combined representations and adaptive post-processing, Bioinformatics, 2008, 24(3), pp. 341-349

  23. Bioinformatics—Data mining • Evaluation (scores!) • Confusion Matrix • Binary Classification • Performance Evaluation Metrics • Accuracy • Sensitivity/Recall/TP Rate • Specificity/TN Rate • Precision/PPV • … \\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf
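The listed metrics all follow from the four cells of a binary confusion matrix. A minimal Python sketch, with a hypothetical function name and made-up counts, assuming none of the denominators is zero:

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive/negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # recall, TP rate
        "specificity": tn / (tn + fp),  # TN rate
        "precision": tp / (tp + fp),    # positive predictive value (PPV)
    }


# Hypothetical counts: 50 actual positives and 50 actual negatives.
for name, value in binary_metrics(tp=40, fp=5, tn=45, fn=10).items():
    print(f"{name}: {value:.3f}")
```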

  24. Bioinformatics—Data mining • Evaluation • ROC (Receiver Operating Characteristics) • Trade-off between positive hits (TP) and false alarms (FP)
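The trade-off can be made concrete by sweeping a decision threshold over classifier scores and recording the false positive rate and true positive rate at each step. The Python sketch below is illustrative only: the scores and labels are invented, the function names are hypothetical, and it assumes all scores are distinct and both classes are present.

```python
def roc_points(scores, labels):
    """ROC curve as (false positive rate, true positive rate) points.

    `labels` are 1 for positive and 0 for negative. The threshold is lowered
    one sample at a time, from the highest score to the lowest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label:
            tp += 1  # one more positive hit
        else:
            fp += 1  # one more false alarm
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(points)
print("AUC:", area_under_curve(points))
```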

  25. Not The End • Your corresponding tutor will have more project-specific stuff to tell you • Thanks • Q & A
