Genome analysis

Bioinformatics Genome analysis Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/

Contents
• Genome annotation
• Comparative genomics
• Phylogenetic profiles
• Gene fusion analysis
• Phylogenetic footprinting

Bioinformatics From sequences to genomes

From sequences to genomes
• Before the 1990s, DNA sequencing required a major investment of human labour. A PhD student could spend a significant fraction of their thesis sequencing a single gene.
• Genome projects stimulated the development of automated sequencing methods and led to important technological improvements.
• There are currently (2008) several hundred publicly available, fully sequenced genomes.
• The NCBI genome distribution (ftp://ftp.ncbi.nih.gov/genomes/) contains
  • >650 prokaryotes (Bacteria and Archaea)
  • Insects (Drosophila melanogaster, Apis mellifera)
  • Plants (Arabidopsis thaliana, rice, maize)
  • A worm (Caenorhabditis elegans)
  • Some fungi (Saccharomyces cerevisiae, Schizosaccharomyces pombe, …)
  • Some mammals (Homo sapiens, Mus musculus, Rattus norvegicus)
• Other genome centres give access to other genomes.
  • ENSEMBL (http://www.ensembl.org/) maintains many vertebrate genomes
  • UCSC (http://genome.ucsc.edu/) maintains genomes of metazoans (including insects)
  • Sanger Institute (http://www.sanger.ac.uk/genbiol/)
  • Integr8: ~800 genomes in 2008
• Many other genomes were sequenced by commercial companies and are not available to the public.

Gene organization Source: Mount (2000)

Gene function
>PHO4,SPBC428.03C : THIAMINE-REPRESSIBLE ACID PHOSPHATASE PRECURSOR: Q01682;Q9UU70;
Length = 463
Score = 161 bits (408), Expect = 1e-40
Identities = 138/473 (29%), Positives = 223/473 (46%), Gaps = 47/473 (9%)

Query: 9 ILAASLVNAGTIPLGKLSDIDKIGTQTEIFPFLGGSGPYYSFPGDYGISRDLPESCEMKQ 68
+LAAS+V+AG S + + LG Y+ P G + PESC +KQ
Sbjct: 10 LLAASIVHAGK------SQFEAFENEFYFKDHLGTISVYHE-PYFNGPTTSFPESCAIKQ 62

Query: 69 VQMVGRHGERYPT-------VSKAKSIMTTWYKLSNYTGQFSGALSFLNDDYEFFIRDTK 121
V ++ RHG R PT VS A+ I KL N G S+ + F T
Sbjct: 63 VHLLQRHGSRNPTGDDTATDVSSAQYIDIFQNKLLN--GSIPVNFSYPENPLYFVKHWTP 120

Query: 122 NLEMETTLANSVNVLNPYTGEMNAKRHARDFLAQYGYMVENQTSFAVFTSNSNRCHDTAQ 181
++ E S + G + R +Y Y + + + + T+ R D+A+
Sbjct: 121 VIKAENADQLSSS------GRIELFDLGRQVFERY-YELFDTDVYDINTAAQERVVDSAE 173

Query: 182 YFIDGL-GDKFN--ISLQTISEAESAGANTLSAHHSCPAWDDDVNDDILKK-----YDTK 233
+F G+ GD + + E +SAGAN+L+ ++SCP ++D+ D+ + +
Sbjct: 174 WFSYGMFGDDMQNKTNFIVLPEDDSAGANSLAMYYSCPVYEDNNIDENTTEAAHTSWRNV 233

Query: 234 YLSGIAKRLNKE-NKGLNLTSSDANTFFAWCAYEINARGYSDICNIFTKDELVRFSYGQD 292
+L IA RLNK + G NLT SD + + C YEI R SD C++FT E + F Y D
Sbjct: 234 FLKPIANRLNKYFDSGYNLTVSDVRSLYYICVYEIALRDNSDFCSLFTPSEFLNFEYDSD 293

Query: 293 LETYYQTGPGYDVVRSVGANLFNASVKLLKE--SEVQDQKVWLSFTHDTDILNYLTTIGI 350
L+ Y GP + ++G N L++ + D+KV+L+FTHD+ I+ +G
Sbjct: 294 LDYAYWGGPASEWASTLGGAYVNNLANNLRKGVNNASDRKVFLAFTHDSQIIPVEAALGF 353

Query: 351 IDDKNNLTAEH-VPFMENTF----HRSWYVPQGARVYTEKFQCS-NDTYVRYVINDAVVP 404
D +T EH +P +N F S +VP + TE F CS N YVR+++N V P
Sbjct: 354 FPD---ITPEHPLPTDKNIFTYSLKTSSFVPFAGNLITELFLCSDNKYYVRHLVNQQVYP 410

Query: 405 IETCSTGPGFS----CEINDFYDYAEKRVAGTDFLKVCNVSSVSNSTELTFFW 453
+ C GP + CE++ + + + + + ++ + N ++ST +T ++
Sbjct: 411 LTDCGYGPSGASDGLCELSAYLNSSVRVNSTSNGIANFNSQCQAHSTNVTVYY 463

• After locating the genes on the sequence, we have to predict their function.
• Some genes had already been characterized before the genome project, but these are generally a minority of those found in the genome.
• For the majority of genes, function is predicted on the basis of similarity between the sequence of the newly sequenced gene and previously known genes (function assignment by sequence similarity).
• Example: yeast genome (1996): there are still 2,500 genes (39%) whose function is completely unknown, even though
  • Yeast is among the best-known model organisms (genetics, molecular biology).
  • The full genome has been available since 1996.
• When the first draft of the human genome was published, 60% of the predicted genes were of unknown function.

Some milestones

Genes and genome size
• In prokaryotes, the number of genes increases linearly with genome size.
• In eukaryotes, this is not the case: genome size increases faster than the number of genes.

Genes and genome size
• Beware: the axes are logarithmic.
• This plot represents the same data as the previous one, but on a logarithmic scale, so that mammals are visible as well.

Gene spacing
• Gene spacing increases considerably with the complexity of the organism.
• Note: the X axis is logarithmic but the Y axis is not, so the increase appears roughly exponential.

Proportion of intergenic regions
• Beware: the X axis is logarithmic.
• The proportion of intergenic regions increases with the complexity of an organism.
• In addition (not shown here), introns represent an increasing fraction of the genome.
• For example, the exonic fraction represents <5% of the human genome.

Protein size versus genome size
• Protein sequences are shorter in prokaryotes than in eukaryotes.
• Among eukaryotes, the increase in genome size is not correlated with an increase in protein size.
  • Higher eukaryotes have a much larger genome than fungi, without any increase in protein size.

Bioinformatics Genome annotation

Gene prediction
• Starting from a completely sequenced genome, predict the positions of genes.
• Elements of prediction
  • Open Reading Frames (ORFs)
    • A start codon and a stop codon, separated by an uninterrupted run of non-stop codons.
  • Region content
    • Hexanucleotide composition
    • Codon adaptation index (CAI)
  • Signals
    • In prokaryotes: Shine-Dalgarno boxes.
    • In eukaryotes: intron/exon boundary elements (splicing signals).
  • Similarity with known genes.
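The first element of prediction, ORF detection, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: ATG is taken as the sole start codon, the standard stop codons are used, introns are ignored, and the function names and the `min_codons` parameter are hypothetical. Real gene finders combine ORFs with region content, signals and similarity.

```python
# Minimal ORF scanner (illustrative sketch, not a real gene finder).
# Assumptions: ATG is the only start codon, standard stop codons, no introns.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Scan the six reading frames and return (strand, frame, start, end).

    Coordinates are 0-based on the scanned strand; `end` is exclusive and
    includes the stop codon. `min_codons` (hypothetical parameter) is the
    minimal ORF length in codons, stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i                      # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF in this frame
    return orfs


# Toy sequence: ATG AAA TTT TAG embedded in flanking bases.
print(find_orfs("CCATGAAATTTTAGGG"))
```

Lowering `min_codons` increases the number of spurious ORFs, which is one reason why ORF-only protocols tend to over-predict genes.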

Gene prediction - limitations
• Typical problems:
  • Gene prediction programs are trained for a specific organism and can give very poor results with other organisms (e.g., the first rounds of annotation of A. thaliana were done with programs trained on mammals).
  • Any gene prediction program will unavoidably predict false genes and miss some true genes.
  • The prediction of intron/exon boundaries is particularly difficult.
  • For prokaryotes, the predicted start codons are sometimes imprecise.
• Example: genome of the yeast Saccharomyces cerevisiae
  • For the yeast genome, the gene detection protocol used in 1996 was over-predictive.
  • The program relied essentially on ORFs and predicted 6,400 genes.
  • Some researchers estimated that ~1,000 ORFs might be false predictions.
  • Since 1996, the reality of the predicted genes has been tested by combining several functional genomics methods (expression studies, mutant phenotypes, comparative genomics between closely related species, …).
  • A few hundred of the initially predicted genes have been removed from the annotations.

Non-coding genes
• There are many types of non-coding genes
  • tRNA: transfer RNA
  • rRNA: ribosomal RNA
  • snRNA: small nuclear RNA (elements of the spliceosome)
  • snoRNA: small nucleolar RNA (methylation guides)
  • ...
• Detection of non-coding RNA
  • Non-coding RNA genes are generally transcribed by polymerases I and III and have different promoters.

Annotation of gene function
• Once a genomic region has been predicted to contain a gene, the next step is to predict the function of this gene.
• The translated product is compared with all known proteins, and a putative function can be assigned on the basis of high-similarity matches.
• Problems
  • Sequence similarity is not always sufficient to infer the same function.
    • Where should the threshold be set?
  • Some proteins might have similar functions despite having different sequences (convergent evolution).
  • Once a gene has been assigned a putative function, this annotation will be used to assign the same function to other genes → propagation of errors.
• We should thus be aware that gene annotations have to be taken with caution.
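The threshold problem can be made concrete with a small sketch of similarity-based function transfer. Everything here is an assumption made for illustration: the `Hit` record, the `assign_function` helper and both default thresholds are hypothetical and are not recommended values. The example hit reuses the figures of the alignment on the "Gene function" slide (138 identities over 473 positions, Expect = 1e-40).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    """One similarity match between the query protein and a known protein."""
    subject: str      # identifier of the matched protein
    annotation: str   # function recorded for the matched protein
    evalue: float     # expected number of chance matches with this score
    identity: float   # fraction of identical aligned positions (0 to 1)


def assign_function(hits: List[Hit],
                    max_evalue: float = 1e-10,
                    min_identity: float = 0.25) -> Tuple[str, Optional[str]]:
    """Return (annotation, supporting subject id) for a query protein.

    Both thresholds are arbitrary illustrative values.
    """
    passing = [h for h in hits
               if h.evalue <= max_evalue and h.identity >= min_identity]
    if not passing:
        # No convincing match: the gene stays a "hypothetical protein".
        return ("hypothetical protein", None)
    best = min(passing, key=lambda h: h.evalue)
    # "putative" flags that the function is inferred, not demonstrated.
    return ("putative " + best.annotation, best.subject)


# Hit built from the alignment statistics of the "Gene function" slide.
pho4 = Hit("PHO4", "thiamine-repressible acid phosphatase", 1e-40, 138 / 473)
print(assign_function([pho4]))                     # passes the default thresholds
print(assign_function([pho4], min_identity=0.40))  # rejected by a stricter threshold
```

The same hit is accepted or rejected depending on the identity threshold, which is exactly why the choice of threshold matters. Keeping the supporting subject identifier with each annotation makes it possible to trace an inferred function back to its source when an error has propagated.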

Genes with unknown function
• When the genomes of model organisms were sequenced, about 40% of the predicted genes could not be associated with any known function.
• These genes are annotated as "hypothetical proteins".
• Note
  • In the yeast genome, many of these hypothetical proteins have been removed from the annotations since 1996, because they were false predictions.

Bioinformatics Comparative genomics

Phylogenetic footprinting
[Figure: Genome 1 and Genome 2, with a conserved exon and a conserved non-coding region]
• One of the main reasons for sequencing the mouse genome was to detect regions conserved between mouse and human, which reveal exons and regulatory regions.
• The fact that an unknown gene is found in different genomes gives more confidence in the existence of this gene.
• Another important goal was to detect conserved segments within non-coding regions.
• On the basis of a few known cases, it has been shown that conserved non-coding regions contain a high concentration of regulatory elements.
• The detection of conserved non-coding sequences thus gives indications about regions potentially involved in regulation.
• Such conserved regions are called phylogenetic footprints.
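One simple way to look for such footprints, sketched below, is to slide a window along a pairwise alignment of two orthologous non-coding regions and keep the windows whose identity exceeds a threshold. The function name, the window size and the threshold are hypothetical choices made for this example, and the input sequences are assumed to be already aligned (gaps written as "-"). Dedicated footprinting tools use more elaborate models.

```python
def conserved_windows(aln1, aln2, window=10, min_identity=0.8):
    """Return merged (start, end) alignment intervals with high identity.

    aln1, aln2: two rows of a pairwise alignment (same length, '-' for gaps).
    window, min_identity: illustrative parameters, not recommended values.
    Intervals are 0-based, end exclusive, in alignment coordinates.
    """
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    # 1 where both rows carry the same residue, 0 for mismatches and gaps.
    matches = [int(a == b and a != "-")
               for a, b in zip(aln1.upper(), aln2.upper())]
    regions = []
    for start in range(len(matches) - window + 1):
        identity = sum(matches[start:start + window]) / window
        if identity >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlapping or adjacent window: extend the previous region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy alignment: the first 10 columns are identical, the last 10 all differ.
genome1 = "ACGTACGTACTTTTTTTTTT"
genome2 = "ACGTACGTACGGGGGGGGGG"
print(conserved_windows(genome1, genome2))
```

Because a window is kept as soon as it reaches the threshold, reported regions can extend slightly into the flanking divergent sequence. A smaller window or a higher threshold tightens the boundaries at the cost of missing weaker footprints.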

Phylogenetic profiles
• For each gene of the query genome (e.g. E.coli), orthologs are searched in all the sequenced genomes.
• Each gene is characterized by a profile of presence/absence in all the sequenced genomes.
• Groups of genes having similar phylogenetic profiles are likely to be functionally related.
Pellegrini et al. (1999). Proc Natl Acad Sci U S A 96(8), 4285-8.
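The comparison of profiles can be illustrated with a short sketch. The gene names and the presence/absence vectors below are invented for the example, and using the Hamming distance with a `max_distance` cutoff is one simple choice among several possible similarity measures. It is not claimed to be the measure used by Pellegrini et al.

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def similar_profile_pairs(profiles, max_distance=0):
    """Return pairs of genes whose profiles differ in at most max_distance genomes.

    profiles: dict mapping gene name -> tuple of 0/1 (one entry per genome).
    Genes present in all genomes or in none are skipped, since such
    profiles carry no information about co-occurrence.
    """
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        if len(set(p1)) < 2 or len(set(p2)) < 2:
            continue
        if hamming(p1, p2) <= max_distance:
            pairs.append((g1, g2))
    return pairs


# Hypothetical profiles over five reference genomes (1 = ortholog found).
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),   # same profile as geneA
    "geneC": (0, 1, 0, 0, 1),
    "geneD": (1, 1, 1, 1, 1),   # present everywhere: uninformative
}
print(similar_profile_pairs(profiles))
```

With only a handful of genomes, identical profiles arise easily by chance. The signal becomes more meaningful as the number of sequenced genomes grows.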

Gene fusion analysis
[Figure, example 1: query genome E.coli, 2 components (A, B); reference genomes B.subtilis and H.pylori, 1 composite each (A^B)]
[Figure, example 2: query genome E.coli, 5 components (A, B, C, D, E); reference genome Yeast, 1 composite (C^D^A^B^E)]
• It is quite frequent to observe that two genes of a given organism are fused into a single gene in another organism.
• Fusions between more than 2 genes are occasionally observed.
• Fused genes are likely to be functionally related.
References
Marcotte, et al. (1999). Science 285(5428), 751-3.
Marcotte, et al. (1999). Nature 402(6757), 83-6.
Enright, et al. (1999). Nature 402(6757), 86-90.
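A minimal sketch of the detection step: two genes of the query genome are linked when they both match the same protein of a reference genome on distinct, non-overlapping segments. The input format (a list of similarity hits with coordinates on the reference protein), the function name and the `max_overlap` parameter are assumptions made for this illustration. The published methods cited above apply additional filters.

```python
from itertools import combinations


def fusion_links(hits, max_overlap=0):
    """Return sorted (gene1, gene2, composite) triples suggesting a fusion.

    hits: iterable of (query_gene, composite_protein, start, end), where
    start/end are 1-based inclusive coordinates of the match on the
    reference (composite) protein. max_overlap is an illustrative
    tolerance, in residues, for overlapping matches.
    """
    by_composite = {}
    for gene, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((gene, start, end))

    links = set()
    for composite, segments in by_composite.items():
        for (g1, s1, e1), (g2, s2, e2) in combinations(segments, 2):
            if g1 == g2:
                continue
            # Length of the shared stretch (zero or negative if disjoint).
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                links.add((min(g1, g2), max(g1, g2), composite))
    return sorted(links)


# Hypothetical hits: A and B cover distinct halves of composite X,
# whereas C and D match the same region of Y (homologs, not a fusion).
hits = [
    ("A", "X", 1, 200),
    ("B", "X", 210, 400),
    ("C", "Y", 1, 300),
    ("D", "Y", 50, 280),
]
print(fusion_links(hits))
```

Requiring non-overlapping matches separates true composites from the trivial case where two query genes are simply homologous to each other and to the same reference protein.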

Bioinformatics Conclusion

The genome challenge
• Despite the availability of several hundred genomes, we are far from understanding the organization and function of a single genome.
• In particular, a lot of work remains to be done to decipher the genomes of higher organisms.
• The genome sequence by itself is far from sufficient for this.
• Since 1997, several high-throughput methods have been invented to give complementary information about gene function (see courses on transcriptome, proteome and interactome).

Some milestones concerning genome size
