The H uman G enome , impact in the biomedical domain

The Human Genome, impact in the biomedical domain Sonia ABDELHAK, PhD Molecular Investigation of Genetic Orphan Disorders Institut Pasteur de Tunis

Human Genome Project • Historical context. • Goals of the HGP. • Strategy. • Results. • Impact on Biomedical domain. • Discussion.

February 2001 « Finished » sequence April 1953-April 2003

Brief history of HGP 1984 to 1986 – first proposed at US DOE meetings 1988 – endorsed by US National Research Council (Funded by NIH and US DOE $3 billion set aside) 1990 – Human Genome Project started (NHGRI) Later – UK, France, Japan, Germany, China 1998. Celera announces a 3-year plan to complete the project years early First draft published in Science and Nature in February, 2001 Finished Human Genome sequence published in Nature 2003.

Challenges • Genome Attributes • Size • Polymorphism • Repeats (Smaller repeats are technically difficult to sequence, some sequences are repeated all over the genome: How can these be placed?). • Available Technology • 600 bp per “read”(Sequencing works by extension from a primer/ gel electrophoresis. Limited by resolution of gel). • Error (~1 error per 600. Sequencing multiple times decreases error; same error unlikely in multiple reads. 10x Coverage = error rate ~1/10,000). • Relies on cloning (Some regions are difficult to clone Heterochromatin; some sequences rearrange or are deleted when cloned)

Goals of HGP • Create a genetic and physical map of the 24 human chromosomes (22 autosomes, X & Y) • Identify the entire set of genes & map them all to their chromosomes • Determine the nucleotide sequence of the estimated 3 billion base pairs • Analyze genetic variation among humans • Map and sequence the genomes of model organisms

Model organisms • Bacteria (E. coli, influenza, several others) • Yeast (Saccharomyces cerevisiae) • Plant (Arabidopsis thaliana) • Roundworm (Caenorhabditis elegans) • Fruit fly (Drosophila melanogaster) • Mouse (Mus musculus)

Goals of HGP (II) • Develop new laboratory and computing technologies to make all this possible • Disseminate genome information • Consider ethical, legal, and social issues associated with this research

Time-line large scale genomic analysis

IL-12p35AC F tggtggcagaaatcattgtctgaaaagtaattgttttacttttattcttttcgtgtgtgtgtgtgt gtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgcatgtgccagatttcttgtttgaaaggcaat gagcttcatccaagtatcaa 78.57% IL-12p35AC R IL-12p40AC F atttcaggtgtgagccactgtgcctggccagaactttttcaatgaatattcaagataattgtatacacattttatatatatatatatatatacacacacacacacacacacatatgtatacacacattatatatataatccatgttatatacatctctacattatatatatccactatatatattttacttatacatatagattttatttttatgaactaggatcaaattgta 69.23% 1 2 3 4 5 IL-12p40AC R 174 170 166 Identification de Polymorphismes de type microsatellites par analyse de séquence:

sequence1 ESTs TAGTCA clone xyz sequence2 CGTACT - isolate unique clones - sequence once from each end make cDNA library 80-100,000 unique cDNA clones in library EST Division: Expressed Sequence Tags dbEST http://www.ncbi.nlm.nih.gov/dbEST/ >IMAGE:275615 5' mRNA sequence GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG nucleus 80-100,000 genes >IMAGE:275615 3', mRNA sequence NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC 80-100,000 RNA gene products

Chimie de séquençage Dye Terminator (6) Taq A G C T A T ... amorce T C G A T A ADN réaction de séquence Electrophorèse Gel plat / capillaire A G C T A T Analyse automatique A G C T A A G C T A T A G C T dépot détection A G C A G A

Two Competing Strategies for Human Genome • (Hierarchical shotgun) [Public human genome project] • Whole-genome Shotgun [Celera project]

Sequencing BAC: Bacterial Artificial Chromosome clone Contig: joined overlapping collection of sequences or clones.

Whole-genome shotgun sequencingPrivate company Celera used to sequence whole human genome • Whole genome randomly sheared three times • Plasmid library constructed with ~ 2kb inserts • Plasmid library with ~10 kb inserts • BAC library with ~ 200 kb inserts • Computer program assembles sequences into chromosomes • No physical map construction • Only one BAC library • Reduces problems of repeat sequences

A G C T A T Différentes étapes d’analyse de séquence Vérification de la qualité de séquence Elimination des séquences contaminantes Blastn contre des banques de vecteurs, de bactéries, levures,… Assemblage, Phred, Phrap, Consed Identification des séquences potentiellement codantes Comparaison avec les banques de données, Logiciels de prédictions d’exons.

EBI Entrez NIH NCBI GenBank • Submissions • Updates • Submissions • Updates EMBL DDBJ CIB NIG • Submissions • Updates SRS EMBL getentry

phase 1 HTG Acc = AC008701 gi = 6601005 phase 2 HTG Acc = AC008701 gi = 6671909 PRI phase 3 Acc = AC008701 gi = 7328720 HTG Division: High Throughput Genome Records 40,000 to > 350,000 bp

2.88 Gbp 2,851,330,913

Gene prediction • Easy for procaryotes (single cell) – one gene, one protein • More difficult for eukaryotes (multicell) – one gene, many proteins • Very difficult for Human – short exons separated by non-coding long introns

Gene recognition • Coding region and non-coding region have different sequence profiles • coding region is “protected” from mutation and is less random • Gene recognition by sequence alignment • Gene prediction by Hidden Markov Model trained by set of known genes • Many genes are homologs – similar in vastly different organisms

Two predictions disagree “…predicted transcripts collectively contain partial matches to nearly all known genes, but the novel genes predicted by both groups are largely non-overlapping.” John B. Hogenesch, et al Cell, Vol. 106, 413–415 August 24, 2001

Human genome content • The Human Genome • Total length 3000 Mb • ~ 40,000 genes (coding seq) • Gene sequences < 5% • Exons ~ 1.5% (coding) • Introns ~ 3.5% (noncoding) • Intergenic regions (junk) > 95% • Repeats > 50%

Global properties • Pericentromeric and subtelomeric regions of chromosomes filled with large recent transposable elements • Marked decline in the overall activity of transposable elements or transposons • Male mutation rate about twice female • most mutation occurs in males • Recombination rates much higher in distal regions of chromosomes and on shorter chromosome arms • > one crossover per chromosome arm in each meiosis

Fig 17 transposables Interspersed repeats: fixed transposable elements copied to non-homologous regions. Total 45% Classes of transposable elements. LINE, long interspersed element. SINE short interspersed element.

Fig 21 Genes are sometimes protected from repeats Two regions of about 1 Mb on chromosomes 2 and 22. Red bars, interspersed repeats; blue bars, exons of known genes. Note the deficit of repeats in the HoxD cluster, which contains a collection of genes with complex, interrelated regulation.

Important features of Human proteome • 30,000–40,000 protein-coding genes • Proteome (full set of proteins) more complex than those of invertebrates. • pre-existing components arranged into a richer architectures. • Hundreds of genes seem to come from horizontal transfer from bacteria questionable • Dozens of genes seem to come from transposable elements.

Noncoding RNA genes • Transfer RNAs (tRNAs) – adaptors that translate triplet code of RNA into amino acid sequence of proteins • Ribosomal RNAs (rRNAs) – components of ribosome • Small nucleolar RNAs (snoRNAs) – RNA processing and base modification in nucleolus • Small nuclear RNAs (sncRNAs) - spliceosomes

Human races have similar genes • Genome sequence centers have sequenced significant portions of at least three races • Range of polymorphisms within a race can be much greater than the range of differences between any two individuals of different race • Very few genes are race specific

Genome Sizes (MegaBases)

Fig 35a Size distributions of exons in Human, Worm and Fly. Human have shorter exons.

Fig 35c Size distributions of intons in Human, Worm and Fly. Human have longer introns.

Complexity of proteome increase from yeast to humans • More genes • Shuffling, increase, or decrease of functional modules • Alternative RNA splicing – humans exhibit significantly more • Chemical modification of proteins is higher in humans

Combinatorial strategies • At DNA level – T-cell receptor genes are encoded by a multiplicity of gene segments Fig. 10.21 • At RNA level – splicing of exons in different orders

Yeast • 70 human genes are known to repair mutations in yeast • Nearly all we know about cell cycle and cancer comes from studies of yeast • Advantages: • fewer genes (6000) • few introns • 31% of yeast genes give same products as human homologues

Drosophila • nearly all we know of how mutations affect gene function come from Drosophila studies • We share 50% of their genes • 61% of genes mutated in 289 human diseases are found in fruit flies • 68% of genes associated with cancers are found in fruit flies • Knockout mutants • Homeobox genes

C. elegans • 959 cells in the nervous system • 131 of those programmed for apoptosis • apoptosis involved in several human genetic neurological disorders • Alzheimers • Huntingtons • Parkinsons

Mouse • known as “mini” humans • Very similar physiological systems • Share 90% of their genes

Questions Remain about the Human Genome • Difficult to precisely estimate number of genes at this time • Small genes are hard to identify • Some genes are rarely expressed and do not have normal codon usage patterns – thus hard to detect

Impact of HG on Biomedical domain

Applications to medicine and biology • Disease genes • human genomic sequence in public databases allows rapid identification of disease genes in silico • Drug targets • pharmaceutical industry has depended upon a limited set of drug targets to develop new therapies • now can find new target in silico • Basic biology • basic physiology, cell biology…

Hérédité liée au chromosome X

Hérédité autosomique dominante

The H uman G enome , impact in the biomedical domain