Introduction to Bioinformatics

Introduction to Bioinformatics
ChBi406/506
Ozlem Keskin
For today’s lectures: many slides are from gersteinlab.org/courses/452 and from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8).

E-mails
Ozlem Keskin: okeskin@ku.edu.tr
Engin Cukuroglu: ecukuroglu@ku.edu.tr

Who is taking this course?
• People with very diverse backgrounds in biology, chemical engineering (MS/BS)
• People with diverse backgrounds in computer science - please visit Attila Hoca’s office!
• Most people have a favorite gene, protein, or disease

What are the goals of the course?
• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI
• To focus on the analysis of DNA, RNA and proteins
• To introduce you to the analysis of genomes
• To combine theory and practice to help you solve research problems

Themes throughout the course
• Textbooks
• Web sites
• Literature references
• Gene/protein families
• Computer labs

Textbook
The course textbook is J. Pevsner, Bioinformatics and Functional Genomics (Wiley, 2009).
Several other bioinformatics texts are available:
• Baxevanis and Ouellette
• Mount
• Durbin et al.
• Lesk
In our library you will find:
• (e-book) Bioinformatics [electronic resource]: sequence and genome analysis / David W. Mount. Imprint: Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory Press, c2001.
• Bioinformatics: a practical guide to the analysis of genes and proteins / edited by Andreas D. Baxevanis, B.F. Francis Ouellette. Imprint: Hoboken, N.J.: John Wiley, 2005. (SOON)

Themes throughout the course: Literature references
You are encouraged to read original source articles. Although articles are not required, they will enhance your understanding of the material.
You can obtain articles through PubMed and Web of Science.

Web sites
The course website is reached via: http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm (or Google “pevsnerlab” courses)
This site contains the PowerPoint slides for each lecture.
The textbook website is: http://www.bioinfbook.org
It has 1000 URLs, organized by chapter, and also contains the same PowerPoint slides.
You will also find the lecture slides in the F-folder.

Grading
• Midterm 30%
• Final 35%
• HWs 20%
• Project 15%
(might change, the course will evolve)

Themes throughout the course: gene/protein families
We will use beta globin and retinol-binding protein 4 (RBP4) as model genes/proteins throughout the course.
Globins, including hemoglobin and myoglobin, carry oxygen.
RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein.
We will study globins and lipocalins in a variety of contexts, including
• sequence alignment
• gene expression
• protein structure
• phylogeny
• homologs in various species

The HIV-1 pol gene encodes three proteins
• Aspartyl protease (PR)
• Reverse transcriptase (RT)
• Integrase (IN)

Outline for today (chapters 1 and 2)
• Definition of bioinformatics
• Overview of the NCBI website
• Accessing information about DNA and proteins
  - Definition of an accession number
  - Four ways to find information on proteins and DNA
• Access to biomedical literature

Biological Data + Computer Calculations = Bioinformatics

What is bioinformatics?
• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Genomics is the analysis of genomes.
• The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
• Protein coordinates, DNA array data, annotated gene sequences
• Biological information is nowadays generated in parallel.
• We can easily run 10,000 simultaneous experiments on a single DNA microarray.
• To cope with this much data we really need computers.
• So bioinformatics is the field that combines biology and computers.

Where does Bioinformatics come from? Data from the Human Genome Project has fueled the development of new bioinformatics methods

HGP

What is Bioinformatics?
• (Molecular) Bio - informatics
• One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large scale.

Top ten challenges for bioinformatics
[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes
[5] Accurate ab initio protein structure prediction

Top ten challenges for bioinformatics
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula
Source: Ewan Birney, Chris Burge, Jim Fickett

Simulating the cell

On bioinformatics
“Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.

On bioinformatics
However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave. Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for. This issue of Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome.”
Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1, introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project

[Figure: bioinformatics, medical informatics and public health informatics; algorithms, databases and infrastructure; tool-users and tool-makers]

Three perspectives on bioinformatics
• The cell
• The organism
• The tree of life
Page 4

After Pace NR (1997) Science 276:734 Page 6

[Figure: time of development; body region, physiology, pharmacology, pathology. Page 5]

DNA -> RNA -> protein -> phenotype (Page 5)

Growth of GenBank
[Figure: base pairs of DNA (billions) and sequences (millions) by year, 1982 to 2002. Fig. 2.1, Page 17. Updated 8-12-04: >40 billion base pairs]

Growth of GenBank
[Figure: base pairs of DNA (billions) and sequences (millions) by year, December 1982 to June 2006]

Growth of the International Nucleotide Sequence Database Collaboration
[Figure: base pairs of DNA (billions) contributed by GenBank, EMBL and DDBJ]
http://www.ncbi.nlm.nih.gov/Genbank/

Central dogma of molecular biology: DNA -> RNA -> protein
Central dogma of bioinformatics and genomics: genome -> transcriptome -> proteome
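As a rough illustration of the DNA -> RNA -> protein flow on this slide, the sketch below transcribes a coding-strand DNA string into mRNA and translates it with the standard genetic code. The function names (`transcribe`, `translate`) and the short example input are illustrative assumptions, not part of the course material; real analyses would normally rely on an established library.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate coding-strand DNA in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    example = "atggcaattaaaattggt"  # short illustrative input (assumption)
    print(transcribe(example))
    print(translate(example))
```

The sketch reads only the first reading frame of the given strand; a fuller treatment would consider all six frames and alternative genetic codes.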

Central Dogma of Molecular Biology
DNA -> RNA -> Protein -> Phenotype -> DNA
• Molecules: Sequence, Structure, Function
• Processes: Mechanism, Specificity, Regulation

Central Paradigm for Bioinformatics
Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
• Large Amounts of Information
• Standardized
• Statistical

What is the Information? Molecular Biology as an Information Science
• Most cellular functions are performed or facilitated by proteins.
  - Primary biocatalyst
  - Cofactor transport/storage
  - Mechanical motion/support
  - Immune protection
  - Control of growth/differentiation
• Information transfer (mRNA)
• Protein synthesis (tRNA/mRNA)
• Some catalytic activity
• Genetic material
(idea from D Brutlag, Stanford, graphics from S Strobel)

• Proteins fold into 3D structures with specific functions, which are reflected in a phenotype.
• These functions are selected in a Darwinian sense by the environment of the phenotype, which drives the evolution of the DNA sequence.
• Many bioinformatics techniques address this flow of molecular biology information inside the organism, hoping to understand the organization and control of genes, and even to predict protein structure from sequence.
• A second flow of information that bioinformatics seeks to address is the large amount of data generated by new high-throughput methods. Bioinformatics owes its livelihood to the availability of large data sets that are too complex to allow manual analysis.

DNA -> RNA -> protein -> phenotype
• DNA: genomic DNA databases
• RNA: cDNA, ESTs, UniGene
• protein: protein sequence databases
Fig. 2.2, Page 20

There are three major public DNA databases
• EMBL
• GenBank
• DDBJ
The underlying raw DNA sequences are identical.
Page 16

There are three major public DNA databases
• EMBL: housed at EBI, the European Bioinformatics Institute
• GenBank: housed at NCBI, the National Center for Biotechnology Information
• DDBJ: housed in Japan
Page 16

>100,000 species are represented in GenBank
all species   128,941
viruses         6,137
bacteria       31,262
archaea         2,100
eukaryota      87,147
Table 2-1, Page 17

The most sequenced organisms in GenBank
Homo sapiens (6.9 million entries)
Mus musculus (5.0 million)
Zea mays (896,000)
Rattus norvegicus (819,000)
Gallus gallus (567,000)
Arabidopsis thaliana (519,000)
Danio rerio (492,000)
Drosophila melanogaster (350,000)
Oryza sativa (221,000)

National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov

Taxonomy nodes at NCBI 8/06 http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi

The most sequenced organisms in GenBank
Homo sapiens              10.7 billion bases
Mus musculus               6.5b
Rattus norvegicus          5.6b
Danio rerio                1.7b
Zea mays                   1.4b
Oryza sativa               0.8b
Drosophila melanogaster    0.7b
Gallus gallus              0.5b
Arabidopsis thaliana       0.5b
Table 2-2, Page 18. Updated 8-12-04, GenBank release 142.0

The most sequenced organisms in GenBank
Homo sapiens              11.2 billion bases
Mus musculus               7.5b
Rattus norvegicus          5.7b
Danio rerio                2.1b
Bos taurus                 1.9b
Zea mays                   1.4b
Oryza sativa (japonica)    1.2b
Xenopus tropicalis         0.9b
Canis familiaris           0.8b
Drosophila melanogaster    0.7b
Table 2-2, Page 18. Updated 8-29-05, GenBank release 149.0

The most sequenced organisms in GenBank
Homo sapiens                     12.3 billion bases
Mus musculus                      8.0b
Rattus norvegicus                 5.7b
Bos taurus                        3.5b
Danio rerio                       2.5b
Zea mays                          1.8b
Oryza sativa (japonica)           1.5b
Strongylocentrotus purpuratus     1.2b
Sus scrofa                        1.0b
Xenopus tropicalis                1.0b
Table 2-2, Page 18. Updated 7-19-06, GenBank release 154.0

Molecular Biology Information - DNA
• Raw DNA Sequence
• Coding or Not?
• Parse into genes?
• 4 bases: AGCT
• ~1 K in a gene, ~2 M in genome
• ~3 Gb Human

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca
gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac
atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg
aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca
gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc
ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact
ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca
ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca
tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct
gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt
gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc
aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact
gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca
gacgctggtatcgcattaactgattctttcgttaaattggtatc
. . . . . .
caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg
cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt
gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc
aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc
ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa
aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
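To make the "4 bases" and "~3 Gb" points concrete, here is a minimal sketch that counts bases in a raw DNA string, computes its GC fraction, and estimates storage at 2 bits per base. The helper names (`base_counts`, `gc_content`, `packed_size_bytes`) are hypothetical and chosen only for this illustration; the 2-bit packing is an assumption about one simple encoding, not a statement about how GenBank stores sequences.

```python
from collections import Counter


def base_counts(dna: str) -> dict:
    """Count A, C, G, T in a DNA string, ignoring case and any other characters."""
    counts = Counter(dna.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(dna: str) -> float:
    """Fraction of A/C/G/T positions that are G or C (0.0 for empty input)."""
    counts = base_counts(dna)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


def packed_size_bytes(n_bases: int) -> float:
    """Bytes needed at 2 bits per base, since 4 symbols need log2(4) = 2 bits."""
    return n_bases * 2 / 8


if __name__ == "__main__":
    fragment = "atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca"
    print(base_counts(fragment))
    print(gc_content(fragment))
    # A ~3 Gb genome at 2 bits per base is about 750 million bytes.
    print(packed_size_bytes(3_000_000_000))
```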

Molecular Biology Information: Protein Sequence
• 20-letter alphabet
• ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
• Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain
• >1M known protein sequences (UniProt)

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI
d3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF

d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA
d3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV

d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA
d3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV
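The 20-letter alphabet and the aligned rows on this slide can be explored with a small sketch like the one below. It checks that a string uses only the 20 standard amino-acid letters and computes percent identity between two aligned rows. The function names are hypothetical, and the identity definition used here (matches divided by columns where neither row has a gap) is one common convention among several, so treat it as an assumption.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
GAP_CHARS = set("-.")  # both gap symbols appear in the alignment rows


def is_valid_protein(seq: str) -> bool:
    """True if every character is one of the 20 standard amino-acid letters."""
    return all(residue in AMINO_ACIDS for residue in seq.upper())


def percent_identity(a: str, b: str) -> float:
    """Percent of gap-free aligned columns in which both rows share a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (x, y)
        for x, y in zip(a.upper(), b.upper())
        if x not in GAP_CHARS and y not in GAP_CHARS
    ]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    # Two rows from the first alignment block (d1dhfa_ and d8dfr__).
    row1 = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI"
    row2 = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI"
    print(is_valid_protein(row1.replace("-", "")))
    print(round(percent_identity(row1, row2), 1))
```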

Molecular Biology Information: Macromolecular Structure
• DNA/RNA/Protein
• Almost all protein
(RNA adapted from D Soll web page; right-hand top protein from M Levitt web page)
