
BioInformatics Consultation Practice 9 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account: 50400113-11065546 Location: 1st Széchenyi str., 7666 Pogány, Hungary Tel: +36-309-015-488 E-mail: pauler@t-online.hu

Content of the Practice
• Genome Browsers
  • Basic terms
  • GUI
  • Database engine: On-Line Analytical Processing
• Methodology: On-Line Analytical Processing
  • Relational Database System
  • The Star Schema
  • Data Cubes
  • Dimensions
  • Hierarchies: Aggregate/Drill
  • Measures
  • Formulas
• Software: NCBI Genome
• References

Genome Browsers: Basic terms 1
• Genome Browsers (Genom Böngésző) are integrated software tools. Their visible part is a Graphical User Interface (GUI) that can:
  • Display entire genomes with annotation data in graphic form, organized along the genomic nucleotide position coordinate axis:
    • Gene prediction and structure
    • Proteins, Domains, Motifs
    • Expression regulation
    • Alternative splicing variations
  • Display a comparison of multiple genomes to facilitate Comparative Genomic Analysis (Összehasonlító Genom Elemzés)

Genome Browsers: Basic terms 2
• Hidden inside the software: genome browsers (should) contain a Database Engine (Adatbázis Motor) enabled for On-Line Analytical Processing, OLAP (On-Line Analitikus Feldolgozás), which:
  • Stores genomic data and annotations from multiple (even partially conflicting) sources in a Data Warehouse (Adattárház): a Relational Database (Relációs Adatbázis) with standardized design
  • Can generate multi-dimensional (Többdimenziós) reports
  • Offers an easy-to-use GUI where the breakup levels and aggregation of genomic data can be changed with a mouse click, without any manual coding
• Why is this important for a biologist (besides looking nice to IT people)?
  • Genomes are large data sets stored as 1-dimensional data (nucleotide sequences on the 5'-3' or 3'-5' DNA strand)
  • But their logical structure is a multi-dimensional (Több dimenziós) data set packed into 1-dimensional storage, where each dimension is hierarchic (Dimenzionális hierarchia) with multiple breakup levels (Dimenzió Szint):
    • Phylogenetic groups: Kingdoms > Families > Clusters > Species
    • Physical layout: Genome > Chromosome > DNA Strand > Gene
    • Gene structure: Gene > Expression factors > Introns/Exons
    • Proteomics: Protein > Domain > Motifs
    • Sequencing+Assembly: Sample > Dictionary > Contig > Fragment > EST
  • A biotech researcher deals with Ill-structured Problems (Rosszul Strukturált Probléma): the algorithm and the input/output data structures of the problem are not clearly defined
  • Therefore he/she may need to view genomic data aggregated (Aggregálva) in any possible combination of breakup levels of different dimensions, and very quickly (e.g. which final protein products can be formed by alternative splicing of gene x, what are all the possible alternatively spliced cDNAs of gene x in all paralog sequences, etc.)
  • The point is that the researcher usually cannot name predefined "standard breakup level combinations" in advance: they depend on the iterative research process
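The idea of viewing the same data at different breakup levels of one hierarchic dimension can be sketched in a few lines of Python. This is only an illustration: the level order follows the "Physical layout" hierarchy above, while the record values (GenomeA, Chr1, gene1, ...) and the function name `aggregate` are hypothetical.

```python
from collections import Counter

# Level order of the "Physical layout" dimension, from coarse to fine.
LEVELS = ["Genome", "Chromosome", "Strand", "Gene"]

# Hypothetical annotation records: one full path through the hierarchy per gene.
RECORDS = [
    ("GenomeA", "Chr1", "5'-3'", "gene1"),
    ("GenomeA", "Chr1", "5'-3'", "gene2"),
    ("GenomeA", "Chr1", "3'-5'", "gene3"),
    ("GenomeA", "Chr2", "5'-3'", "gene4"),
]

def aggregate(records, level):
    """Count records grouped by their path down to the chosen breakup level."""
    depth = LEVELS.index(level) + 1
    return Counter(rec[:depth] for rec in records)

# The same data, aggregated at two different breakup levels.
by_chromosome = aggregate(RECORDS, "Chromosome")
by_strand = aggregate(RECORDS, "Strand")
print(by_chromosome)
print(by_strand)
```

Switching the level is a single argument change, which is the kind of "mouse click" regrouping an OLAP front end is expected to offer.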

Genome Browsers: Analogy with dirty business: Managerial reporting
• This problem has an interesting analogy in a very distant field: top managerial reporting of corporate business data.
• The first Corporate Data Management applications went into service at large multinational companies in the 1970s:
  • They were incredibly expensive (both the software and the mainframe hardware of the day)
  • They made it possible to serve customers faster and more reliably at the operative level
  • But they were incredibly useless at the strategic level: every week they flooded top managers with tons of predefined standardized reports, which went straight into the garbage can, accompanied by the complaint "How expensive it was, and it cannot give me even the simplest info I need in time!"
• The reason was that early relational database systems could do only static structured reporting, which conflicted with the ill-structured nature of top managerial decisions:
  • A top manager needs only 5-6 numbers every week, but several million dollars can live or die on those numbers
  • Therefore the deadline is yesterday: he needs them damn quickly!
  • He cannot tell even a week ahead what he will need next week: it depends!
  • He has no time to read lengthy standard reports containing every possible breakup of the data: an average top manager can read 4 pages/day (if he can read at all…); all the remaining time is invested in lobbying for the company through social relations: tennis or dinner with politicians of the governing party, group orgies with representatives of the opposition party (just in case…)
• The introduction of OLAP dynamic structured reporting tools solved this problem at the beginning of the 1990s, and they will probably be heavily involved in genome browsers in the future. Therefore, let us look at the theoretical basics:

Basics of OLAP: Relational Database Management (RDBM) 1
• All genome browsers translate annotations of genome data into a Relational Database (Relációs Adatbázis) and store them there, instead of simply using the original FASTA or EBI records. Why?
  • Large data sets can be searched fast if they are stored on the hard drive in data tables with fixed record length: the start of the nth record can be computed instead of reading through all the data
  • The data structure of genomes, called the Empirical Data Structure (EDS), is by definition not of fixed length: 1 chromosome can contain many genes, 1 gene can contain many exons, etc.
  • RDBM resolves this conflict by decomposing (Szétbont) the EDS into Entities (Egyedek): an object that can have numerous occurrences (Előfordulás), all described by the very same attributes (Tulajdonság), can be stored in fixed-length space, e.g. Exon: StartPos, EndPos, SpliceNum
• The decomposition is made by Cardinality Analysis, CA (Számosság elemzés): for each pair of entities it examines, in both directions, how many occurrences of one entity can be related to one occurrence of the other:
  • 1 Gene can contain many Exons, but 1 Exon belongs to 1 Gene = 1:many relation of 2 entities (on the slides, entities, attributes, relations and their cardinalities are marked with color codes)
  • 1 Gene can code many Proteins, and 1 Protein can be coded by many Genes = many:many relation of 2 separate entities
  • 1 Exon has only 1 StartPos, 1 EndPos, 1 SpliceSequenceNumber = 1:1 relation: these are attributes of the very same entity
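Both ideas on this slide can be tried out in a small Python sketch. The first part shows why fixed record length makes lookup fast: the nth record starts at byte n × record size. The second part classifies the cardinality of a relation from observed pairs. The record layout (four unsigned 32-bit integers), the sample values and the function names are assumptions made for illustration only. Also note that observed data can only show that a "many" side exists; real cardinality analysis is a design decision about what is possible, not only about what has been seen.

```python
import io
import struct

# Hypothetical fixed-length Exon record:
# ExonID, StartPos, EndPos, SpliceNum as four little-endian unsigned 32-bit ints.
EXON_RECORD = struct.Struct("<4I")

def read_nth_exon(f, n):
    """Jump straight to record n (0-based) without reading the records before it."""
    f.seek(n * EXON_RECORD.size)
    return EXON_RECORD.unpack(f.read(EXON_RECORD.size))

def cardinality(pairs):
    """Classify the relation between entity A and entity B from observed (a, b) pairs."""
    a_to_b, b_to_a = {}, {}
    for a, b in pairs:
        a_to_b.setdefault(a, set()).add(b)
        b_to_a.setdefault(b, set()).add(a)
    one_a_has_many_b = any(len(v) > 1 for v in a_to_b.values())
    one_b_has_many_a = any(len(v) > 1 for v in b_to_a.values())
    left = "many" if one_b_has_many_a else "1"
    right = "many" if one_a_has_many_b else "1"
    return f"{left}:{right}"

# An in-memory "file" of three exon records stands in for a table on disk.
exon_file = io.BytesIO()
for rec in [(1, 100, 180, 1), (2, 250, 400, 2), (3, 520, 610, 3)]:
    exon_file.write(EXON_RECORD.pack(*rec))
third_exon = read_nth_exon(exon_file, 2)

gene_exon = cardinality([("G1", "E1"), ("G1", "E2"), ("G2", "E3")])
gene_protein = cardinality([("G1", "P1"), ("G1", "P2"), ("G2", "P1")])
exon_start = cardinality([("E1", 100), ("E2", 250)])
print(third_exon, gene_exon, gene_protein, exon_start)
```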

Basics of OLAP: Relational Database Management (RDBM) 2
• To preserve the original data of the EDS, the decomposed Entities should be connected by relations (Relációk): referential connections with cardinality 1:many between the following attributes:
  • Primary key (Elsődleges kulcs) attribute: uniquely identifies the occurrences of an entity, therefore it will be the 1 side of the relation. It is denoted with orange, e.g. GeneID
  • Foreign key (Idegen kulcs) attribute: a reference to the primary key of another entity, with the same name and type. It will be the many side of the relation, denoted with olive, e.g. GeneID
• For example: 1 Gene can contain many Exons, but 1 Exon belongs to 1 Gene, so it is always the many side (Exon) that references the uniquely identified 1 side (Gene).
• Many:many relations are assembled from two 1:many relations and a Relation entity: 1 Gene can code many Proteins, and 1 Protein can be coded by many Genes
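The key concepts above can be sketched with Python's built-in `sqlite3` module. The Gene, Exon and Protein entities and their ID/Name attributes follow the slides; the name of the relation entity (`Codes`) and all row values are hypothetical. This is a minimal sketch, not the schema of any particular genome browser.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
con.executescript("""
CREATE TABLE Gene    (GeneID    INTEGER PRIMARY KEY, GeneName    TEXT NOT NULL);
CREATE TABLE Protein (ProteinID INTEGER PRIMARY KEY, ProteinName TEXT NOT NULL);
-- 1:many: the many side (Exon) carries the foreign key GeneID
CREATE TABLE Exon (
    ExonID   INTEGER PRIMARY KEY,
    ExonName TEXT NOT NULL,
    GeneID   INTEGER NOT NULL REFERENCES Gene(GeneID)
);
-- many:many: a relation entity built from two 1:many relations
CREATE TABLE Codes (
    GeneID    INTEGER NOT NULL REFERENCES Gene(GeneID),
    ProteinID INTEGER NOT NULL REFERENCES Protein(ProteinID),
    PRIMARY KEY (GeneID, ProteinID)
);
""")
con.executemany("INSERT INTO Gene VALUES (?, ?)", [(1, "gene1"), (2, "gene2")])
con.executemany("INSERT INTO Protein VALUES (?, ?)", [(1, "protein1"), (2, "protein2")])
con.executemany("INSERT INTO Exon VALUES (?, ?, ?)",
                [(1, "exon1", 1), (2, "exon2", 1), (3, "exon3", 2)])
con.executemany("INSERT INTO Codes VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

# Follow the 1:many relation: number of exons per gene.
exon_counts = con.execute("""
    SELECT g.GeneName, COUNT(e.ExonID)
    FROM Gene g LEFT JOIN Exon e ON e.GeneID = g.GeneID
    GROUP BY g.GeneID ORDER BY g.GeneID
""").fetchall()

# Follow the many:many relation through the relation entity.
genes_for_protein1 = [row[0] for row in con.execute("""
    SELECT g.GeneName FROM Codes c JOIN Gene g ON g.GeneID = c.GeneID
    WHERE c.ProteinID = 1 ORDER BY g.GeneID
""")]

# An exon that references a non-existent gene violates referential integrity.
try:
    con.execute("INSERT INTO Exon VALUES (4, 'exon4', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(exon_counts, genes_for_protein1, orphan_rejected)
```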

Basics of OLAP: Relational Database Management (RDBM) 3
• The relations of a complex database with dozens of entities are hard to overview from small sample tables. Therefore a relational database design is represented in an Entity Relationship Diagram, ERD (Egyedkapcsolati Diagram):
  • Entities are rounded-corner boxes with the EntityName at the top. Blue background denotes code table/master entities whose data change minimally in time; yellow denotes relational/transaction entities with rapid, irrevocable data changes in time
  • Attributes are listed with their data type icons (Text, Integer, Fraction, Binary, Date, Time, Image, Sound, Movie) and names: italic means optional, normal means required, bold means auto-filled attribute
  • Data attributes are purple, primary keys are orange and foreign keys are olive, each marked by its own icon; auto-filled system logging attributes are black
  • 1:many relations are denoted by connectors between primary and foreign keys
  • Legend entities shown in the diagram: EntityName (EntityNameID, Text, Integer, Fraction, Binary, Date, Time, Image, Sound, Movie, ReqForeignKey, OptForeignKey, Modifier, Modified, Status) and MasterEntity (MasterID, MasterName)
• OLAP systems can work only with a database design called Star (Csillag) Schema:
  • In the "center" there is the transaction entity of observed sequences: Sequence (SequenceID, SeqenString, StartPos, EndPos, Strand, Date, ESTID, MotiveID, ExonID, GeneID, SpeciesID)
  • In the "arms" there are the master data entities of the dimension levels:
    • Dimension: Sequencing+Assembly: Sample (SampleID, SampleName); Dictionary (DictionaryID, Restrictase, SampleID); Contig (ContigID, ContigName, DictionaryID); Fragment (FragmentID, FragmentName, ContigID); EST (ESTID, ESTName, FragmentID)
    • Dimension: Philogeny: Kingdom (KingdomID, KingdomName); Family (FamilyID, FamilyName, KingdomID); Cluster (ClusterID, ClusterName, FamilyID); Species (SpeciesID, SpeciesName, ClusterID)
    • Dimension: Gene structure: Genome (GenomeID, GenomeName); Chromosome (ChromosomeID, ChromosomeName, GenomeID); Strand (StrandID, Direction, ChromosomeID); Gene (GeneID, GeneName, StrandID); Exon (ExonID, ExonName, GeneID)
    • Dimension: Proteomics: Protein (ProteinID, ProteinName); Domain (DomainID, DomainName, ProteinID); Motive (MotiveID, MotiveName, DomainID)
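How a star schema supports aggregation at different levels can be sketched with `sqlite3`. Only a subset of the entities from the diagram is used (Sequence in the center; Gene, Strand, Chromosome and Species, Cluster in two arms), and all row values are hypothetical. A real OLAP engine would precompute such aggregates rather than join on every query; the joins here only illustrate how the foreign key chains of the arms are walked.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Arm 1 (Gene structure): Chromosome > Strand > Gene
CREATE TABLE Chromosome (ChromosomeID INTEGER PRIMARY KEY, ChromosomeName TEXT);
CREATE TABLE Strand (StrandID INTEGER PRIMARY KEY, Direction TEXT,
                     ChromosomeID INTEGER REFERENCES Chromosome(ChromosomeID));
CREATE TABLE Gene (GeneID INTEGER PRIMARY KEY, GeneName TEXT,
                   StrandID INTEGER REFERENCES Strand(StrandID));
-- Arm 2 (Philogeny): Cluster > Species
CREATE TABLE Cluster (ClusterID INTEGER PRIMARY KEY, ClusterName TEXT);
CREATE TABLE Species (SpeciesID INTEGER PRIMARY KEY, SpeciesName TEXT,
                      ClusterID INTEGER REFERENCES Cluster(ClusterID));
-- Center: the transaction entity references the finest level of each arm
CREATE TABLE Sequence (SequenceID INTEGER PRIMARY KEY, StartPos INTEGER, EndPos INTEGER,
                       GeneID INTEGER REFERENCES Gene(GeneID),
                       SpeciesID INTEGER REFERENCES Species(SpeciesID));
""")
rows = {
    "Chromosome": [(1, "chr1"), (2, "chr2")],
    "Strand": [(1, "5'-3'", 1), (2, "3'-5'", 1), (3, "5'-3'", 2)],
    "Gene": [(1, "gene1", 1), (2, "gene2", 2), (3, "gene3", 3)],
    "Cluster": [(1, "clusterA"), (2, "clusterB")],
    "Species": [(1, "speciesX", 1), (2, "speciesY", 1), (3, "speciesZ", 2)],
    "Sequence": [(1, 100, 180, 1, 1), (2, 100, 180, 1, 2), (3, 300, 420, 2, 1),
                 (4, 50, 90, 3, 3), (5, 60, 95, 3, 3)],
}
for table, data in rows.items():
    placeholders = ", ".join("?" * len(data[0]))
    con.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)

# Aggregate: sequence count at the coarse levels Chromosome x Cluster.
by_chromosome_cluster = con.execute("""
    SELECT c.ChromosomeName, cl.ClusterName, COUNT(*)
    FROM Sequence s
    JOIN Gene g ON g.GeneID = s.GeneID
    JOIN Strand st ON st.StrandID = g.StrandID
    JOIN Chromosome c ON c.ChromosomeID = st.ChromosomeID
    JOIN Species sp ON sp.SpeciesID = s.SpeciesID
    JOIN Cluster cl ON cl.ClusterID = sp.ClusterID
    GROUP BY c.ChromosomeID, cl.ClusterID
    ORDER BY c.ChromosomeID, cl.ClusterID
""").fetchall()

# Drill down: the same measure at the fine levels Gene x Species.
by_gene_species = con.execute("""
    SELECT g.GeneName, sp.SpeciesName, COUNT(*)
    FROM Sequence s
    JOIN Gene g ON g.GeneID = s.GeneID
    JOIN Species sp ON sp.SpeciesID = s.SpeciesID
    GROUP BY g.GeneID, sp.SpeciesID
    ORDER BY g.GeneID, sp.SpeciesID
""").fetchall()
print(by_chromosome_cluster)
print(by_gene_species)
```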


OLAP: Terminology
[Figure: a data cube whose cells hold a "Seqn. Count" measure; positions Protein1, Protein2 × Gene1, Gene2 × Splice1, Splice2]
• OLAP systems import data from star schema based relational databases, but use slightly modified basic terms and more advanced data storage tools, which consume far more computing resources:
  • Dimension (Dimenzió): a variable by which data can be grouped, e.g. GeneStructure
  • Level (Szint): the hierarchic internal structure of a dimension, based on a chain of 1:many relationships, e.g. Genome > Chromosome > Strand > Gene > Exon
  • Position (Pozíció): the possible values of a dimension level, e.g. Strand: (5'-3', 3'-5')
  • Data Cube (Adatkocka): a multi-dimensional data storage and aggregation object consisting of:
    • Cells (Cella): data storage formed by the Cartesian product of the positions of selected levels of dimensions, e.g. Protein(P001, P002) × Gene(G001, G002) = (P001, G001), (P001, G002), (P002, G001), (P002, G002). They store:
    • Measures (Mérték): data aggregated or computed from transaction data records, e.g. count of sequences
  • View (Nézet): as data cubes are multi-dimensional objects, they cannot be fully represented on a 2D display (screen or printout). A view is a dynamically selected part of the cube, defined by the user with:
    • Row Dimension|Level
    • Column Dimension|Level
    • Page filter positions for any further dimensions
    • Cell data content measures:
      • Aggregated from the original data (Count, Sum, Avg, Min, Max)
      • Calculated by a mathematical formula (Ln, Sin, Cos, Exp, ...)
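The terms above can be made concrete with a toy data cube in plain Python. It is only a sketch: the transaction records, the position names (P001, G001, S1, ...) and the function names are hypothetical, and a real OLAP engine stores and indexes cells far more efficiently than a dictionary.

```python
import math
from collections import Counter
from itertools import product

LEVELS = ["Protein", "Gene", "Splice"]  # one selected level per dimension

# Hypothetical transaction records: one observed sequence per entry.
RECORDS = [
    {"Protein": "P001", "Gene": "G001", "Splice": "S1"},
    {"Protein": "P001", "Gene": "G001", "Splice": "S1"},
    {"Protein": "P001", "Gene": "G001", "Splice": "S2"},
    {"Protein": "P002", "Gene": "G002", "Splice": "S1"},
    {"Protein": "P002", "Gene": "G001", "Splice": "S2"},
]

def positions(records, level):
    """Positions: the distinct values occurring at a dimension level."""
    return sorted({r[level] for r in records})

def build_cube(records, levels):
    """Cells are the Cartesian product of positions; the measure is the sequence count."""
    counts = Counter(tuple(r[lv] for lv in levels) for r in records)
    cells = product(*(positions(records, lv) for lv in levels))
    return {cell: counts.get(cell, 0) for cell in cells}

def view(cube, levels, row, col, page=None):
    """2D view: row and column levels, with optional page filter positions."""
    page = page or {}
    ri, ci = levels.index(row), levels.index(col)
    table = Counter()
    for cell, n in cube.items():
        if all(cell[levels.index(lv)] == pos for lv, pos in page.items()):
            table[(cell[ri], cell[ci])] += n  # aggregate over the hidden dimensions
    return dict(table)

cube = build_cube(RECORDS, LEVELS)
all_splices = view(cube, LEVELS, row="Protein", col="Gene")
splice1_only = view(cube, LEVELS, row="Protein", col="Gene", page={"Splice": "S1"})

# A calculated measure: a formula applied to the aggregated measure.
log_counts = {cell: math.log1p(n) for cell, n in all_splices.items()}
print(all_splices)
print(splice1_only)
```

Changing `row`, `col` or `page` re-slices the same cube without touching the underlying records, which is what distinguishes dynamic OLAP views from predefined static reports.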

Software: NCBI Genome • http://www.ncbi.nlm.nih.gov/sites/genome

References
• Theory of Genome Projects and Annotation:
  • http://www.plosone.org/article/info:doi/10.1371/journal.pone.0006291
  • http://en.wikipedia.org/wiki/Genome_project
  • http://www.arabidopsis.org/portals/genAnnotation/genome_annotation_tools/index.jsp
  • http://vega.sanger.ac.uk/index.html
  • http://www.ensembl.org/index.html
• Genome Browser Software:
  • http://www.ncbi.nlm.nih.gov/sites/genome
  • http://genome.ucsc.edu/
  • http://genome.ucsc.edu/cgi-bin/hgGateway
  • http://www.bioviz.org/igb/
  • http://genoviz.sourceforge.net/
  • http://www.affymetrix.com/partners_programs/programs/developer/tools/download_igb.affx
  • http://apollo.berkeleybop.org/current/index.html
