Concepts, historical milestones & the central place of bioinformatics in modern biology: a European perspective
Teresa K. Attwood, University of Manchester

Overview
• Where the term bioinformatics originated
• Where the ‘modern’ concept originated
• Some key events & folk
• Its place in ‘the new biology’

Origin of Bioinformatics
• The origin of the term ‘bioinformatics’ has been attributed to Paulien Hogeweg
  • Dutch theoretical biologist
• She & colleague Ben Hesper coined the term in the early ’70s, defining it as
  • “the study of informatic processes in biotic systems”
  • Hogeweg, P. (2011) The roots of bioinformatics in theoretical biology. PLoS Computational Biology, 7(3), e1002021
• The term failed to gain traction for ~20 years

Origin of Bioinformatics
• The origins of the ‘modern’ concept of bioinformatics are rooted in sequence analysis
• Driven by the desire to
  • collect
  • annotate
  • & analyse sequence data
  • systematically (i.e., using computers)!
This concept of ‘bioinformatics’ was barely known pre-1990…

Key milestones [timeline figure, 1950 to 2010]
• Sequences shown (insulin A & B chains): GIVEQCCASVCSLYQLENYCN, FVNQHLCGSHLVEALYLVCGERGFFYTPKA
• Labels: insulin, ribonuclease, Dayhoff Atlas, CSD

Margaret Dayhoff (1925-1983)
• Pioneer of computer methods to compare proteins
  • & to derive evolutionary histories from alignments
• Particular interest in deducing evolutionary connections from sequence evidence

Margaret Dayhoff
• Collected all the known protein sequences
  • made them available to the scientific community
• In 1965, she compiled a book
  • Atlas of Protein Sequence & Structure

Margaret Dayhoff
“There is a tremendous amount of information regarding the evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it”
M.O. Dayhoff to C. Berkley, February 27, 1967
Strasser, B. (2008) “GenBank – Natural history in the 21st century?” Science, 322, 537-538

Key milestones [timeline figure, 1950 to 2010]
• Labels: insulin, ribonuclease, Dayhoff Atlas, CSD, PDB, ARPAnet, DNA sequencing, auto protein sequencers, auto DNA sequencing
• Counts shown on the slide: 65, 7
• Exam 1: What pernicious, life-changing development occurred in 1971?

Data overload in the USA
“the rate limiting step in the process of nucleic acid sequencing is now shifting from data acquisition towards the organization and analysis of that data”
Gingeras, T.R. & Roberts, R.J. (1980) “Steps toward Computer Analysis of Nucleotide Sequences,” Science, 209, 1322-1328

Data overload in the USA
“a centralized data bank [is] essential for the efficient use of nucleic acid sequence information”
C. Anderson, Minutes, 1980

Data overload in Europe
• While the US debated where to locate a new centralised resource, EMBL acted…
• The 1st internationally funded, public ‘central’ nucleotide sequence database was thus European
  • the EMBL data library, Heidelberg
  • preceded the 1st release of GenBank by ~6 months
Attwood, T.K. et al. (2011) Concepts, Historical Milestones & the Central Place of Bioinformatics in Modern Biology: A European Perspective. In Bioinformatics - Trends & Methodologies, Intech Online Publishers

Data overload in Europe
• Copies of the EMBL data library & GenBank were being maintained in Cambridge
  • together with their search tools, etc.
• An integrated system gave access to the dbs & tools
  • “this system is presently being used by over 30 researchers in 8 departments in the University & in local research institutes. These users can keep in touch with each other via the MAIL command”!

Key milestones [timeline figure, 1950 to 2010]
• Labels: insulin, ribonuclease, Dayhoff Atlas, CSD, PDB, ARPAnet, DNA sequencing, auto protein sequencers, auto DNA sequencing, PIR, EMBL, GenBank, Internet email
• Counts shown on the slide: 568, 65, 859, 7

Enter Amos Bairoch
• A crazy postgrad student in Switzerland
  • interested in space exploration & the search for ET life
• His project was to develop s/w to analyse protein & nucleotide sequences
  • PC/Gene

Amos Bairoch
• Published his 1st paper in 1982
  • a letter to the BJ
• Suggested use of checksums
  • “to facilitate detection of typographical & keyboard errors”

Amos Bairoch
• Why?
• Alongside PC/Gene, he needed to supply a db
• The Atlas wasn’t available electronically
  • typed in >1,000 protein sequences
  • some from the literature
  • most from the Atlas
  • by 1981, this was a large book, plus several supplements, listing 1,660 proteins
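
The checksum idea can be illustrated with a minimal Python sketch. This is not the scheme proposed in the 1982 letter, which these slides do not describe: CRC-32 from Python's standard zlib module is swapped in purely for illustration, and the function names sequence_checksum and is_intact are hypothetical.

```python
import zlib


def sequence_checksum(sequence: str) -> str:
    """Return a CRC-32 checksum (8 hex digits) for a sequence.

    Whitespace and letter case are ignored, so re-wrapping an entry
    does not change its checksum. CRC-32 is an illustrative choice only.
    """
    normalised = "".join(sequence.split()).upper()
    return f"{zlib.crc32(normalised.encode('ascii')):08X}"


def is_intact(sequence: str, expected_checksum: str) -> bool:
    """Compare a (re)typed sequence with the checksum stored alongside it."""
    return sequence_checksum(sequence) == expected_checksum.upper()


# Insulin A chain as shown on the milestones slide.
original = "GIVEQCCASVCSLYQLENYCN"
stored = sequence_checksum(original)

# A single mistyped residue (final N typed as M) changes the checksum.
retyped = "GIVEQCCASVCSLYQLENYCM"

print(is_intact(original, stored))  # True
print(is_intact(retyped, stored))   # False
```

Storing the checksum next to each entry means a sequence typed in by hand from a printed source can be checked against the published value without comparing it residue by residue.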

Amos Bairoch
• In 1983, he acquired a computer tape of the EMBL Data Library
  • version 2, with 811 sequences
• In 1984, he received the 1st available computer tape copy of the Atlas
  • (which became known as the PIR-PSD)
  • but… he disliked the PIR format

Amos Bairoch
• So he converted the PIR database into the semi-structured format of EMBL
  • part manually & part automatically
• The result was PIR+
  • & was distributed as part of PC/Gene (now commercial)
• In summer 1986, he finally released the database independently of PC/Gene
  • to make it available to all, free of charge

Amos Bairoch
• This new database was called Swiss-Prot
  • 1st released on 21 July 1986
  • the exact number of entries is unknown, as he lost the original floppy disks!

Amos Bairoch
• As part of his work on PC/Gene, he created another key database
  • diagnostic tool for characterising protein families
  • 1st released March 1989, with 58 entries
  • this was PROSITE
• Philosophy of his approach
  • coupling high quality data analysis with manual annotation

Characterising protein families [figure]
• PROSITE pattern shown: [IVM]-[AS]-L-W-S-L-V2-L-A-[IV]-E-R-Y-[IV]3-C-K-P-M
• PRINTS
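
A PROSITE-style pattern like the one on this slide maps closely onto a regular expression. The Python sketch below converts the core pattern syntax (single residues, [allowed] and {forbidden} sets, x wildcards, (n) or (n,m) repeats, and the < and > terminus anchors) into a regex. It is an illustration only, not PROSITE's own software. The function name prosite_to_regex and the test peptide are hypothetical, and the slide's "V2" and "[IV]3" are assumed to denote the repeats V(2) and [IV](3).

```python
import re

# One pattern element: a residue, an [allowed] set, a {forbidden} set or
# the wildcard x, optionally followed by a repeat count (n) or (n,m).
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":               # any residue
            core = "."
        elif residue.startswith("{"):    # any residue except these
            core = "[^" + residue[1:-1] + "]"
        else:                            # a literal residue or an [allowed] set
            core = residue
        if low is not None:              # repeat count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


# Slide pattern, with the repeats written out under the assumption above.
SLIDE_PATTERN = "[IVM]-[AS]-L-W-S-L-V(2)-L-A-[IV]-E-R-Y-[IV](3)-C-K-P-M"
regex = prosite_to_regex(SLIDE_PATTERN)
print(regex)  # [IVM][AS]LWSLV{2}LA[IV]ERY[IV]{3}CKPM

# Made-up peptide constructed to satisfy the pattern.
print(bool(re.search(regex, "MALWSLVVLAIERYVVVCKPM")))  # True
```

A pattern of this kind is strict: a sequence either matches or it does not, so a single substitution at a fixed position is enough to miss a family member.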

The burden of maintenance
• Database annotation…
[Figure labels: Database Maintenance, Nirvana, Database annotation]

Amos Bairoch’s lament
“It is quite depressive to think that we are spending millions in grants for people to perform experiments, produce new knowledge, hide this knowledge in often badly written text and then spend some more millions trying to second guess what the authors really did and found”
Bairoch, A. (2009) The future of annotation/biocuration. Nature Precedings

Key milestones [timeline figure, 1950 to 2010]
• Labels: insulin, ribonuclease, Dayhoff Atlas, CSD, PDB, ARPAnet, DNA sequencing, auto protein sequencers, auto DNA sequencing, PIR, EMBL, GenBank, Internet email, Swiss-Prot, PROSITE, PRINTS
• Counts shown on the slide: 568, 65, 859, 7, 3,900

Global data overload
• The number of sequences was growing
• The number of structures was growing
• The number of protein family signatures was growing
Exam 2: Two extraordinary developments had yet to take place. What were they?

Key milestones [timeline figure, 1950 to 2010]
• Databases & networks: CSD, PDB, PIR, EMBL, GenBank, Swiss-Prot, TrEMBL, FlyBase, PROSITE, PRINTS, Pfam, InterPro, ARPAnet, Internet email, www
• Sequencing: insulin, ribonuclease, Dayhoff Atlas, DNA sequencing, auto protein sequencers, auto DNA sequencing, HT DNA sequencing
• Genomes: H. influenzae, S. cerevisiae, C. elegans, D. melanogaster, H. sapiens
• Counts shown on the slide: 568, 65, 859, 7, 2,423, 3,900, 105,000

InterPro [figure]: resources shown around InterPro are PROSITE, HAMAP, PIRSF, PRINTS, ProDom, Gene3D, SUPERFAMILY, TIGRFAM, PANTHER, Pfam, Profiles, SMART

Key milestones [timeline figure, 1950 to 2010]
• Organisations & networks: EMBnet, ELIXIR, NCBI, SIB, EBI, ARPAnet, Internet email, www
• Databases: CSD, PDB, PIR, EMBL, GenBank, ENA, Swiss-Prot, TrEMBL, UniProt, FlyBase, PROSITE, PRINTS, Pfam, InterPro
• Sequencing: insulin, ribonuclease, Dayhoff Atlas, DNA sequencing, auto protein sequencers, auto DNA sequencing, HT DNA sequencing
• Genomes: H. influenzae, S. cerevisiae, C. elegans, D. melanogaster, H. sapiens
• Counts shown on the slide: 568, 65, 859, 7, 2,423, 3,900, 105,000, >500B, 36.0M

Key milestones [timeline figure, 1950 to 2010; same labels and counts as the previous slide, with the annotations ‘hundreds more’, ‘thousands more’ and ‘billions more’ added]

Scary monsters! [growth chart]
• Red line: growth of EMBL since its inception
• Green line: growth of manually annotated Swiss-Prot
• Blue line: growth of PDB
• Values marked on the chart: 282 M, 35 M, 540 K, 84 K
• By 2020, NGS & 3Gen technologies will be producing data a million times faster than the current rate

The central place of bioinformatics in modern biology
• Hopefully, this potted history speaks for itself
• In the last 30 years, bioinformatics has given us
  • the first ‘complete’ catalogues of DNA & protein sequences
  • including genomes & proteomes of organisms across the Tree of Life
  • software to analyse biological data on an unprecedented scale
  • & hence tools to help understand
  • more about evolutionary processes in general
  • our place on the Tree of Life in particular
  • &, ultimately, more about health & disease
• It isn’t a panacea, but its contribution has been huge

Recommended reading
Richon, A.B. A short history of bioinformatics (http://www.netsci.org/Science/Bioinform/feature06.html)
Bairoch, A. (2000) Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times. Bioinformatics, 16(1), 48-64.
Ashburner, M. (2006) Won for all – How the Drosophila genome was sequenced. Cold Spring Harbor Lab. Press
Strasser, B.J. (2008) GenBank – Natural history in the 21st century? Science, 322, 537-538.
Attwood, T.K., Gisel, A., Eriksson, N-E. & Bongcam-Rudloff, E. (2011) Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: A European Perspective
