Introduction to BioInformatics

BSC4933/5936: Introduction to BioInformatics. Florida State University, Department of Biology (www.bio.fsu.edu). October 2, 2003.

Steven M. Thompson, Florida State University School of Computational Science and Information Technology (CSIT).

Genomics: What sort of information can be determined from a genomic sequence?

Lecture Topics.
• Easy: restriction digests and associated mapping, e.g. with software like the Wisconsin Package’s (Genetics Computer Group [GCG]) Map, MapSort, and MapPlot.
• Harder: fragment assembly and genome mapping, with packages such as those from the University of Washington’s Genome Center (http://www.genome.washington.edu/), Phred/Phrap/Consed (http://www.phrap.org/) and SegMap, and The Institute for Genomic Research’s (http://www.tigr.org/) Lucy and Assembler programs.
• Very hard: gene finding and sequence annotation. This will be the bulk of today’s lecture and is a primary focus in current genomics research.
• Easy: forward translation to peptides.
• Hard again: genome scale comparisons and analyses.

Nucleic Acid Characterization: Recognizing Coding Sequences.
• Three general solutions to the gene finding problem:
• 1) all genes have certain regulatory signals positioned in or about them,
• 2) all genes by definition contain specific code patterns, and
• 3) many genes have already been sequenced and recognized in other organisms, so we can infer function and location by homology if our new sequence is similar enough to an existing sequence.
• All of these principles can be used to help locate the position of genes in DNA; they are often known as “searching by signal,” “searching by content,” and “homology inference,” respectively.

URFs and ORFs: definitions.
• URF: Unidentified Reading Frame. Any potential string of amino acids encoded by a stretch of DNA. Any given stretch of DNA has potential URFs in any of six reading frames, three on the forward strand and three on the reverse strand.
• ORF: Open Reading Frame. By definition, any continuous reading frame that starts with a start codon and stops with a stop codon. Not usually relevant to discussions of genomic eukaryotic DNA, where introns interrupt the coding sequence, but very relevant when dealing with mRNA/cDNA or prokaryotic DNA.
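The ORF definition above can be sketched as a six-frame scan. This is a minimal illustration, not any package's method: the function names, the choice of ATG as the only start codon, and the `min_codons` threshold are assumptions made for the example.

```python
# Minimal six-frame ORF scan (illustrative sketch).
# Assumptions: ATG is the only start codon, the standard stop codons apply,
# and coordinates are 0-based, end-exclusive, on the strand being scanned.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end) for each ATG...stop stretch.

    The reported length includes the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```

Coordinates on the "-" strand refer to the reverse-complemented string, so they would need converting before being compared with forward-strand positions.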

Signal Searching: locating transcription and translation effector sites. One strategy: One-Dimensional Signal Recognition.
• Start Sites:
• Prokaryote promoter (-35 region plus the -10 ‘Pribnow Box’), TTGACwx{15,21}TAtAaT;
• Eukaryote transcription factor site database, TFSites.Dat;
• Shine-Dalgarno site, (AGG,GAG,GGA)x{6,9}ATG, in prokaryotes;
• Kozak eukaryote start consensus, cc(A,g)ccAUGg;
• AUG start codon in about 90% of genomes, exceptions in some prokaryotes and organelles.
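One-dimensional patterns like those above translate fairly directly into regular expressions. The sketch below is one possible translation, under stated assumptions: `x` is read as any base, `w` as A or T, lowercase (less conserved) positions are treated as required, and the names `SHINE_DALGARNO`, `PROMOTER`, and `find_signal` are invented for the example.

```python
import re

# Assumed regex translations of the slide's patterns (DNA alphabet, T for U).
# The lookahead wrapper lets overlapping candidate sites all be reported.
SHINE_DALGARNO = re.compile(r"(?=((?:AGG|GAG|GGA)[ACGT]{6,9}ATG))")
PROMOTER = re.compile(r"(?=(TTGAC[AT][ACGT]{15,21}TATAAT))")


def find_signal(pattern, seq):
    """Return (0-based start, matched text) for every candidate site."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]


print(find_signal(SHINE_DALGARNO, "CCAGGTTTTTTATGCC"))
```

Exact-match patterns of this kind miss real sites that deviate at a single position, which is the motivation for the weight-matrix approach.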

Signal Searching: locating transcription and translation effector sites. One-Dimensional Approaches, cont.
• End Sites:
• ‘Nonsense’ chain terminating stop codons: UAA, UAG, UGA;
• Eukaryote terminator consensus, YGTGTTYY;
• Eukaryote polyadenylation signal, AAUAAA;
• but exceptions in some ciliated protists and due to eukaryote suppressor tRNAs.

Signal Searching: locating transcription and translation effector sites. Another Strategy: Two-Dimensional Weight Matrix.
• Exon/Intron Junctions (numbers are percent occurrence; | marks the splice cut site).
• Donor Site: Exon A64 G73 | G100 T100 A62 A68 G84 T63 . . . Intron
• Acceptor Site: Intron . . . 6Py74-87 N C65 A100 G100 | N Exon
The splice cut sites occur before a 100% GT consensus at the donor site and after a 100% AG consensus at the acceptor site, but a simple consensus is not informative enough.

Signal Searching: locating transcription and translation effector sites. Two-Dimensional Weight Matrices describe the probability, in percent, that each base position is A, C, U, or G.
• The Donor Matrix.
• CONSENSUS from: Donor Splice site sequences, from Stephen Mount, NAR 10(2), 459-472, figure 1, page 460.

          Exon           | Intron (cut site at |)
  %G   20   9  11  74 | 100   0  29  12  84   9  18  20
  %A   30  40  64   9 |   0   0  61  67   9  16  39  24
  %U   20   7  13  12 |   0 100   7  11   5  63  22  27
  %C   30  44  11   6 |   0   0   2   9   2  12  20  28

• CONSENSUS sequence to a certainty level of 75 percent: VMWKGTRRGWHH
• The matrix begins four bases ahead of the splice site!
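A weight matrix such as the donor matrix can be slid along a sequence to score every window. The sketch below uses a deliberately simple scheme, which is an assumption of this example and not necessarily how any GCG program scores: the percentages of the observed bases are summed and divided by the best possible sum, and windows scoring at least a hypothetical cutoff of 0.75 are reported.

```python
# Donor splice-site matrix from the slide (percent per position).
DONOR = {
    "G": [20, 9, 11, 74, 100, 0, 29, 12, 84, 9, 18, 20],
    "A": [30, 40, 64, 9, 0, 0, 61, 67, 9, 16, 39, 24],
    "U": [20, 7, 13, 12, 0, 100, 7, 11, 5, 63, 22, 27],
    "C": [30, 44, 11, 6, 0, 0, 2, 9, 2, 12, 20, 28],
}
WIDTH = 12
# Highest score any window could reach: the column maxima added up.
BEST = sum(max(DONOR[b][i] for b in "GAUC") for i in range(WIDTH))


def score_window(window):
    """Score a 12-base window as a fraction of the best possible sum."""
    rna = window.upper().replace("T", "U")
    return sum(DONOR[base][i] for i, base in enumerate(rna)) / BEST


def scan_donor_sites(seq, cutoff=0.75):
    """Return (index of first intron base, score) for windows >= cutoff.

    The matrix starts four bases before the cut site, so the intron
    begins at window start + 4 (0-based).
    """
    hits = []
    for i in range(len(seq) - WIDTH + 1):
        score = score_window(seq[i:i + WIDTH])
        if score >= cutoff:
            hits.append((i + 4, round(score, 3)))
    return hits


print(scan_donor_sites("TTTACAGGTAAGTACTTT"))
```

A log-odds score against background base frequencies is a common alternative; the normalized sum is used here only because it is easy to check by hand.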

Signal Searching: locating transcription and translation effector sites. Two-Dimensional Weight Matrices, cont.
• The Acceptor Matrix.
• CONSENSUS of: Acceptor.Dat, IVS Acceptor Splice Site Sequences, from Stephen Mount, NAR 10(2), 459-472, figure 1, page 460.

          Intron                                                       | Exon (cut site at |)
  %G   15  22  10  10  10   6   7   9   7   5   5  24   1   0 100 |  52  24  19
  %A   15  10  10  15   6  15  11  19  12   3  10  25   4 100   0 |  22  17  20
  %T   52  44  50  54  60  49  48  45  45  57  58  30  31   0   0 |   8  37  29
  %C   18  25  30  21  24  30  34  28  36  35  27  21  64   0   0 |  18  22  32

• CONSENSUS sequence to a certainty level of 75.0 percent at each position: BBYHYYYHYYYDYAGVBH
• The matrix begins fifteen bases upstream of the splice site!

Signal Searching: locating transcription and translation effector sites. Two-Dimensional Weight Matrices, cont.
• The CCAAT site: occurs around 75 base pairs upstream of the start point of eukaryotic transcription; may be involved in the initial binding of RNA polymerase II.
• Base frequencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.
• Preferred region: motif within -212 to -57. Optimized cut-off value: 87.2%.

  %G    7  25  14  40  57   1   0   0  12   9  34  30
  %A   32  18  14  58  29   0   0 100  68  10  13  66
  %U   30  27  45   1  11   1   1   0  15  82   2   1
  %C   31  30  27   1   3  99  99   0   5   0  51   3

• CONSENSUS sequence to a certainty level of 68 percent at each position: HBYRRCCAATSR
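The consensus strings on these slides use IUPAC ambiguity codes. One plausible reading of "certainty level", offered here as an assumption and not as a documented description of the GCG program, is: at each position, take bases in order of decreasing frequency until their combined percentage reaches the certainty level, then write the IUPAC code for that set. The sketch below applies that rule to the CCAAT matrix, for which it yields the consensus shown on the slide.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

# CCAAT-box matrix from the slide; the %U row is stored under "T".
CCAAT = {
    "G": [7, 25, 14, 40, 57, 1, 0, 0, 12, 9, 34, 30],
    "A": [32, 18, 14, 58, 29, 0, 0, 100, 68, 10, 13, 66],
    "T": [30, 27, 45, 1, 11, 1, 1, 0, 15, 82, 2, 1],
    "C": [31, 30, 27, 1, 3, 99, 99, 0, 5, 0, 51, 3],
}


def consensus(matrix, certainty):
    """Build an IUPAC consensus under the assumed cumulative-percent rule."""
    width = len(next(iter(matrix.values())))
    codes = []
    for i in range(width):
        # Most frequent base first; ties keep the matrix row order.
        ranked = sorted(matrix, key=lambda b: matrix[b][i], reverse=True)
        chosen, total = set(), 0
        for base in ranked:
            chosen.add(base)
            total += matrix[base][i]
            if total >= certainty:
                break
        codes.append(IUPAC[frozenset(chosen)])
    return "".join(codes)


print(consensus(CCAAT, 68))
```

Positions where two bases tie can resolve differently depending on row order, so agreement with a published consensus at tied positions should not be taken for granted.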

Signal Searching: locating transcription and translation effector sites. Two-Dimensional Weight Matrices, cont.
• The TATA site (aka “Hogness” box): a conserved A-T rich sequence found about 25 base pairs upstream of the start point of eukaryotic transcription; may be involved in positioning RNA polymerase II for correct initiation, and binds Transcription Factor IID.
• Base frequencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.
• Preferred region: center between -36 and -20. Optimized cut-off value: 79%.

  %G   39   5   1   1   1   0   5  11  40  39  33  33  33  36  36
  %A   16   4  90   1  91  69  93  57  40  14  21  21  21  17  20
  %U    8  79   9  96   8  31   2  31   8  12   8  13  16  19  18
  %C   37  12   0   3   0   0   1   1  11  35  38  33  30  28  26

• CONSENSUS sequence to a certainty level of 61 percent at each position: STATAWAWRSSSSSS

Signal Searching: locating transcription and translation effector sites. Two-Dimensional Weight Matrices, cont.
• The GC box: may relate to the binding of transcription factor Sp1.
• Base frequencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.
• Preferred region: motif within -164 to +1. Optimized cut-off value: 88%.

  %G   18  41  56  75 100  99   0  82  81  62  70  13  19  40
  %A   37  35  18  24   0   1  20  17   0  29   8   0   7  15
  %U   30  12  23   0   0   0  18   1  18   9  15  27  42  37
  %C   15  11   2   0   0   0  62   0   1   0   6  61  31   9

• CONSENSUS sequence to a certainty level of 67 percent at each position: WRKGGGHGGRGBYK

Signal Searching: locating transcription and translation effector sites. Two-Dimensional Weight Matrices, cont.
• The cap signal: a structure at the 5’ end of eukaryotic mRNA introduced after transcription by linking the 5’ end of a guanine nucleotide to the terminal base of the mRNA and methylating at least the additional guanine; the structure is 7MeG5’ppp5’Np.
• Base frequencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.
• Preferred region: center between 1 and +5. Optimized cut-off value: 81.4%.

  %G   23   0   0  38   0  15  24  18
  %A   16   0  95   9  25  22  15  17
  %U   45   0   5  26  43  24  33  33
  %C   16 100   0  27  31  39  28  32

• CONSENSUS sequence to a certainty level of 63 percent at each position: KCABHYBY

Signal Searching: locating transcription and translation effector sites. Two-Dimensional Weight Matrices, cont.
• The eukaryotic terminator weight matrix.
• Base frequencies according to McLauchlan et al. (1985) N.A.R. 13:1347-1368.
• Found in about two-thirds of all eukaryotic gene sequences.

  %G   19  81   9  94  14  10  11  19
  %A   13   9   3   3   4   0  11  13
  %U   51   9  89   3  79  61  56  47
  %C   17   1   0   0   3  29  21  21

• CONSENSUS sequence to a certainty level of 68 percent at each position: BGTGTBYY

Content Approaches. Strategies for finding coding regions based on the content of the DNA itself.
• Searching by content exploits the fact that the genetic code imposes many implicit biological constraints on genes. These constraints induce periodicities and compositional patterns that set coding sequences apart; non-coding stretches do not exhibit this type of periodic compositional bias. These principles can help discriminate structural genes in two ways:
• 1) based on the local “non-randomness” of a stretch, and
• 2) based on the known codon usage of a particular life form.
• The first, the non-randomness test, does not tell us anything about the particular strand or reading frame; however, it does not require a previously built codon usage table. The second approach is based on the fact that different organisms use synonymous codons at different frequencies. It does require a codon usage table built up from known translations; however, unlike the first approach, it also tells us the strand and reading frame of the gene products.
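The codon-usage idea can be sketched as a log-likelihood comparison across reading frames. Everything numeric below is hypothetical: `TOY_USAGE` is an invented five-codon table (a real table would cover all 64 codons for the organism of interest), `FLOOR` is an assumed frequency for codons missing from the toy table, and the uniform 1/64 background is a simplification.

```python
import math

# Hypothetical toy codon frequencies, NOT real data.
TOY_USAGE = {"ATG": 0.05, "GCC": 0.08, "AAA": 0.07, "CTG": 0.10, "GAA": 0.06}
BACKGROUND = 1 / 64   # uniform expectation if codons were used at random
FLOOR = 0.002         # assumed frequency for codons absent from the table


def frame_score(seq, frame, usage=TOY_USAGE):
    """Sum of log(usage / background) over the codons of one forward frame."""
    s = seq.upper()
    codons = (s[i:i + 3] for i in range(frame, len(s) - 2, 3))
    return sum(math.log(usage.get(c, FLOOR) / BACKGROUND) for c in codons)


def best_frame(seq):
    """Return the forward frame (0, 1, or 2) with the highest score."""
    return max(range(3), key=lambda frame: frame_score(seq, frame))


print(best_frame("CATGGCCAAACTGGAA"))
```

Scoring the reverse complement in the same way would cover the other three frames, which is how a codon-usage method can indicate strand as well as frame.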

Content Approaches, cont. “Non-Randomness” Techniques: GCG’s TestCode. Relies solely on the base compositional bias of every third position base. The plot is divided into three regions: the top and bottom areas predict coding and noncoding regions, respectively, to a confidence level of 95%; the middle area claims no statistical significance. Diamonds and vertical bars above the graph denote potential stop and start codons, respectively.

Content Approaches, cont. Codon Usage Techniques: GCG’s CodonPreference.
• Genomes use synonymous codons unequally, and the preferences sort phylogenetically. The plot for each forward reading frame shows a red codon preference curve and a blue third position GC bias curve. The horizontal lines within each plot are the average values of each attribute. Start codons are represented as vertical lines rising above each box, and stop codons are shown as lines falling below the reading frame boxes. Rare codon choices are shown for each frame with hash marks below each reading frame.
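The third position GC bias mentioned above is simple to compute for a whole sequence. The sketch below is an illustration only: the function name is invented, and a plotting program would compute the same fraction within a sliding window rather than once per frame.

```python
def third_position_gc(seq, frame):
    """Fraction of G or C among third codon positions of a forward frame."""
    # For a frame starting at index `frame`, the third positions of
    # successive codons fall at frame + 2, frame + 5, frame + 8, ...
    thirds = seq.upper()[frame + 2::3]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


for frame in range(3):
    print(frame, third_position_gc("ATGGCCAAGCTG", frame))
```

In organisms with GC-rich third positions, the true coding frame would be expected to stand out from the other two, as frame 0 does in this contrived example.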

Internet World Wide Web servers.
• Many servers have been established that can be a huge help with gene finding analyses. Most of these servers combine many of the methods previously discussed, but they consolidate the information and often combine signal and content methods with homology inference in order to ascertain exon locations. Many use powerful neural net or artificial intelligence approaches to assist in this difficult ‘decision’ process.
• A wonderful bibliography on computational methods for gene recognition has been compiled by Wentian Li (http://www.nslij-genetics.org/gene/),
• and the Baylor College of Medicine’s Gene Feature Search (http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html) is another nice portal to several gene finding tools.

World Wide Web Internet servers, cont.
• Five popular gene-finding services are GrailEXP, GeneId, GenScan, NetGene2, and GeneMark.
• The neural net system GrailEXP (Gene recognition and analysis internet link–EXPanded, http://grail.lsd.ornl.gov/grailexp/) is a gene finder, an EST alignment utility, an exon prediction program, a promoter and polyA recognizer, a CpG island locater, and a repeat masker, all combined into one package.
• GeneId (http://www1.imim.es/software/geneid/index.html) is an ‘ab initio’ Artificial Intelligence system for predicting gene structure, optimized for genomic Drosophila or Homo DNA.
• NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/), another ‘ab initio’ program, predicts splice site likelihood using neural net techniques in human, C. elegans, and A. thaliana DNA.
• GenScan (http://genes.mit.edu/GENSCAN.html) is perhaps the most ‘trusted’ server these days with vertebrate genomes.
• The GeneMark (http://opal.biology.gatech.edu/GeneMark/) family of gene prediction programs is based on Hidden Markov Chain modeling techniques; originally developed in a prokaryotic context, the programs have now been expanded to include eukaryotic modeling as well.

Homology Inference.
• Similarity searching can be particularly powerful for inferring gene location by homology. This can often be the most informative of any of the gene finding techniques, especially now that so many sequences have been collected and analyzed. Wisconsin Package programs such as the BLAST and FastA families, Compare and DotPlot, Gap and BestFit, and FrameAlign and FrameSearch can all be a huge help in this process. But this too can be misleading and seldom gives exact start and stop positions. For example:

  805 GCCATCGCCCGGGGCCGAGGGAAGGGCCCGGCAGCTGAGGAGCCG...CT 851
      ||| |||||| ::: ||||||...|||||| ||
   46 AlaAlaAlaArgCysLysAlaAlaGluAlaAlaAlaAspGluProAlaLe 62

  852 GAGCTTGCTGGACGACATGAACCACTGCTACTCCCGCCTGCGGGAACTGG 901
      | ||| ||||||||| |||||||||||||||||| ||||
   63 uCysLeuGlnCysAspMetAsnAspCysTyrSerArgLeuArgArgLeuV 79

  902 TACCCGGAGTCCCGAGAGGCACTCAGCTTAGCCAGGTGGAAATCCTACAG 951
      ||||| :::||| ... |||...|||||||||||||||
   80 alProThrIleProProAsnLysLysValSerLysValGluIleLeuGln 95

  952 CGCGTCATCGACTACATTCTCGACCTGCAGGTAGTCCTG 990
      ||||||||||||||||||||||||||| |||
   96 HisValIleAspTyrIleLeuAspLeuGlnLeuAlaLeu 108

Summary. The combinatorial approach. Get all your data in one place. GCG’s SeqLab is a great way to do this due to its advanced annotation capabilities.

The Objective: a complete protein. Now what?

Beyond just finding genes: Genome scale analyses. Unfortunately, much ‘traditional’ sequence analysis software can’t handle analyses at this scale, but there are some very good Web resources available for these types of ‘global view’ analyses. Let’s run through a few examples. NCBI’s Genome pages (http://www.ncbi.nlm.nih.gov/) present a good starting point in North America.

Beyond just finding genes: Genome scale analyses, cont. That can lead to neat places like the Genome Browser at the University of California, Santa Cruz (http://genome.ucsc.edu/) and the Ensembl project at the Sanger Center for BioInformatics (http://www.ensembl.org/).

Beyond just finding genes: Genome scale analyses, cont. And sites like the University of Wisconsin’s E. coli Genome Project (http://www.genome.wisc.edu/) and The Institute for Genomic Research’s (http://www.tigr.org/) MUMMER package.

A perplexing variety of techniques exists for the identification and analysis of protein coding regions in genomic DNA. Knowing which to use when, and how to combine their inferences, will go a long way in your genomic analyses!

References.
Bucher, P. (1990). Weight Matrix Descriptions of Four Eukaryotic RNA Polymerase II Promoter Elements Derived from 502 Unrelated Promoter Sequences. Journal of Molecular Biology 212, 563-578.
Bucher, P. (1995). The Eukaryotic Promoter Database EPD. EMBL Nucleotide Sequence Data Library Release 42, Postfach 10.2209, D-6900 Heidelberg.
Ghosh, D. (1990). A Relational Database of Transcription Factors. Nucleic Acids Research 18, 1749-1756.
Gribskov, M. and Devereux, J., editors (1992). Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.
Hawley, D.K. and McClure, W.R. (1983). Compilation and Analysis of Escherichia coli Promoter Sequences. Nucleic Acids Research 11, 2237-2255.
Kozak, M. (1984). Compilation and Analysis of Sequences Upstream from the Translational Start Site in Eukaryotic mRNAs. Nucleic Acids Research 12, 857-872.
McLauchlan, J., Gaffney, D., Whitton, J. and Clements, J. (1985). The Consensus Sequence YGTGTTYY Located Downstream from the AATAAA Signal is Required for Efficient Formation of mRNA 3’ Termini. Nucleic Acids Research 13, 1347-1368.
Proudfoot, N.J. and Brownlee, G.G. (1976). 3’ Noncoding Region in Eukaryotic Messenger RNA. Nature 263, 211-214.
Stormo, G.D., Schneider, T.D. and Gold, L.M. (1982). Characterization of Translational Initiation Sites in E. coli. Nucleic Acids Research 10, 2971-2996.
von Heijne, G. (1987a). Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, CA.
von Heijne, G. (1987b). SIGPEP: A Sequence Database for Secretory Signal Peptides. Protein Sequences & Data Analysis 1, 41-42.
