
The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective

Ugis Sarkans, European Bioinformatics Institute. Outline: microarray data and standards overview; ArrayExpress overall principles; ArrayExpress architecture; AE repository; AE data warehouse.


Presentation Transcript


  1. The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective Ugis Sarkans European Bioinformatics Institute

  2. Outline • Microarray data and standards overview • ArrayExpress overall principles • ArrayExpress architecture • AE repository • AE data warehouse • Future plans and conclusions

  3. Gene expression data and annotation • Gene expression matrix (axes: samples, genes) holding gene expression levels • Sample annotations • Gene annotations • Callouts on the figure: problem 1, problem 2
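
The three parts named on slide 3 (expression matrix, sample annotations, gene annotations) can be sketched as one small data structure. This is an illustration only, not ArrayExpress code: the class name `ExpressionDataset`, the identifiers, and all values are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionDataset:
    """Gene expression matrix plus its two annotation tables (hypothetical layout)."""
    genes: list      # row labels
    samples: list    # column labels
    levels: list     # levels[i][j] = expression of genes[i] in samples[j]
    gene_annotations: dict = field(default_factory=dict)    # gene id -> attributes
    sample_annotations: dict = field(default_factory=dict)  # sample id -> attributes

    def level(self, gene, sample):
        """Look up one expression level by gene and sample identifier."""
        return self.levels[self.genes.index(gene)][self.samples.index(sample)]

    def samples_where(self, key, value):
        """Select samples by annotation; the result is only as good as the annotation."""
        return [s for s in self.samples
                if self.sample_annotations.get(s, {}).get(key) == value]


# Illustrative values only.
ds = ExpressionDataset(
    genes=["geneA", "geneB"],
    samples=["s1", "s2", "s3"],
    levels=[[1.0, 2.5, 0.3], [0.7, 0.9, 4.2]],
    gene_annotations={"geneA": {"description": "hypothetical kinase"}},
    sample_annotations={"s1": {"tissue": "liver"}, "s2": {"tissue": "liver"},
                        "s3": {"tissue": "brain"}},
)
liver_samples = ds.samples_where("tissue", "liver")
```

Queries such as `samples_where` depend entirely on the annotation tables, which is why the slide treats annotation as a problem in its own right next to the expression levels.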

  4. Platform comparison (Tan et al, PNAS, 2003) ‘Our conclusion was very straightforward: there was very little overlap in the types of data in terms of differential expression’ (Margaret Cam, NIH)

  5. Microarray experiment (figure) • Experiment: Sample → RNA extract → labelled nucleic acid → hybridisation → array • Array design • Protocol attached to each step • normalization, integration → Gene expression data matrix (genes)

  6. Different processing levels of MA data • Figure panels A, B, C, D • Labels: array scans, quantitations, samples, spots, genes

  7. MGED standards • MIAME – minimum information about a microarray experiment • MAGE-OM and MAGE-ML – microarray gene expression object model and mark-up language • MO – microarray ontology • Data normalisation and transformations (and quality control)

  8. UML Packages of MAGE • Diagram groups: results; what was done; what was used; miscellaneous • Packages: HigherLevelAnalysis, Experiment, BioMaterial, BioAssayData, BioAssay, Array, QuantitationType, ArrayDesign, AuditAndSecurity, Measurement, DesignElement, Protocol, Description, BQS, BioSequence, BioEvent

  9. MAGE – an example diagram

  10. ArrayExpress aims • An archive for microarray data supporting scientific publications • Providing easy access to public gene expression and other microarray data in a structured format • Facilitating the sharing of microarray designs and protocols • Facilitating the establishment of infrastructure for microarray data sharing

  11. AE users • Experimentalists • “Single-gene” biologists • Bioinformaticians; genome-wide studies • Bioinformaticians – algorithm developers • Software developers

  12. ArrayExpress infrastructure (diagram, at EBI) • Submissions (MAGE-ML): array manufacturers (Affymetrix, Agilent); MIAMExpress (www); external MIAMExpress installations (Camb. U., EMBL); other microarray databases (SMD, TIGR, Utrecht, RZPD) • Submission tracking/curation tool • ArrayExpress repository • Warehouse (BioMart): queries, analysis (www) • Data analysis: Expression Profiler; data analysis software (R/Bioconductor, J-Express, Resolver) via MAGE-ML • External databases (EMBL, UniProt, Ensembl)

  13. AE: overall principles • Adherence to community standards • Data captured in a granular, formalized manner • Modern but proven software technologies • Incremental development

  14. AE design considerations • Separate data archiving from the query-optimized data warehouse • Generate default implementation, then refine • ~2 full-time developers • pressure to bring system online quickly • Use object abstraction layer • deal with performance overhead on case-by-case basis

  15. Repository architecture overview • Input and output: MAGE-ML documents, MAGE-ML DTD • MAGE validator, error.log • MAGE loader and MAGE unloader • MAGE-OM, object/relational mapping (Castor) • Oracle DB • Web layer: Tomcat, Java servlets, Velocity, web page templates • Curation environment
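
The validator, loader and unloader stages on slide 15 can be sketched as a small round trip. This is a rough illustration under loud assumptions: the XML below is a heavily simplified stand-in for MAGE-ML (real documents nest elements in packages), Python's `sqlite3` stands in for Oracle, and hand-written functions stand in for the Castor mapping. None of the function names come from the ArrayExpress code base.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Heavily simplified stand-in for a MAGE-ML document (assumption, not the real format).
DOC = """<MAGE-ML>
  <Experiment identifier="E-1" name="demo experiment"/>
  <Experiment identifier="E-2" name="second experiment"/>
</MAGE-ML>"""


def validate(root):
    """Validator stage: collect all problems instead of stopping at the first."""
    errors = []
    for el in root.iter("Experiment"):
        if not el.get("identifier"):
            errors.append("Experiment without identifier")
    return errors


def load(root, db):
    """Loader stage: XML elements become rows in the relational store."""
    db.execute("CREATE TABLE Experiment (identifier TEXT PRIMARY KEY, name TEXT)")
    db.executemany(
        "INSERT INTO Experiment VALUES (?, ?)",
        [(e.get("identifier"), e.get("name")) for e in root.iter("Experiment")],
    )


def unload(db):
    """Unloader stage: rows are written back out as an XML document."""
    root = ET.Element("MAGE-ML")
    query = "SELECT identifier, name FROM Experiment ORDER BY identifier"
    for ident, name in db.execute(query):
        ET.SubElement(root, "Experiment", identifier=ident, name=name)
    return root


db = sqlite3.connect(":memory:")
doc = ET.fromstring(DOC)
errors = validate(doc)        # would go to error.log in the slide's diagram
if not errors:
    load(doc, db)
exported = unload(db)
```

Keeping the unloader symmetric with the loader is what lets a repository of this kind hand back what was submitted.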

  16. AE schema • Why auto-generated? • AE must be able to import any valid MAGE-ML and not lose information • good for navigating through data in terms of object model • if some queries don’t work well, add something to the schema • Experiment-Biomaterial, Experiment-Protocol links • so far works for 400 GB of data
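
The idea of an auto-generated schema with hand-added shortcuts can be sketched as follows. This is a minimal illustration, assuming Python dataclasses in place of MAGE-OM classes and `sqlite3` in place of Oracle; the two classes are cut down to two attributes each, and `table_ddl` and `link_ddl` are hypothetical helper names.

```python
import sqlite3
from dataclasses import dataclass, fields

SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}


@dataclass
class Experiment:       # simplified stand-in for a MAGE-OM class
    identifier: str
    name: str


@dataclass
class BioMaterial:      # simplified stand-in for a MAGE-OM class
    identifier: str
    name: str


def table_ddl(cls):
    """Default mapping: one table per class, one column per attribute."""
    cols = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__} ({cols})"


def link_ddl(a, b):
    """Hand-added shortcut table, in the spirit of the Experiment-Biomaterial links."""
    return (f"CREATE TABLE {a.__name__}_{b.__name__} "
            f"({a.__name__.lower()}_id TEXT, {b.__name__.lower()}_id TEXT)")


db = sqlite3.connect(":memory:")
for cls in (Experiment, BioMaterial):
    db.execute(table_ddl(cls))                    # generated from the model
db.execute(link_ddl(Experiment, BioMaterial))     # added for query performance
tables = sorted(r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
```

Because the generated tables mirror the model one to one, nothing in a valid document is lost on import; the shortcut table only adds a faster path for a query that would otherwise walk many associations.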

  17. Auto-generated web pages

  18. To ontologize or not to ontologize At the beginning: At the end:

  19. To ontologize or not to ontologize At the beginning: At the end:

  20. Model vs. ontology • Model – stable; ontologies – flexible • Adding/modifying/deleting attributes – easy; adding/modifying/deleting associations – hard • Therefore: attributes and their types in ontologies, domain structure (classes + associations) in the model
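
The split on slide 20 can be sketched in a few lines: the class and its associations stay fixed, while the permitted attribute categories and terms live in an ontology that can grow. This is an assumption-laden illustration; the `ONTOLOGY` dictionary, its category names and its terms are made up for the example and are not taken from the MGED ontology files.

```python
from dataclasses import dataclass, field

# Ontology: flexible. Terms can be added without touching the class below.
ONTOLOGY = {
    "Organism": {"Homo sapiens", "Mus musculus"},
    "OrganismPart": {"liver", "brain"},
}


@dataclass
class BioMaterial:
    """Model: stable structure (classes and associations)."""
    name: str
    characteristics: dict = field(default_factory=dict)  # category -> term
    derived_from: list = field(default_factory=list)     # association to other BioMaterial

    def annotate(self, category, term):
        """Accept an attribute only if the ontology defines the category and term."""
        if term not in ONTOLOGY.get(category, ()):
            raise ValueError(f"{category}={term!r} is not in the ontology")
        self.characteristics[category] = term


sample = BioMaterial("sample 1")
sample.annotate("Organism", "Mus musculus")

# A new kind of attribute is an ontology edit, not a change to the model.
ONTOLOGY["DiseaseState"] = {"normal", "diabetic"}   # hypothetical terms
sample.annotate("DiseaseState", "normal")
```

Adding `DiseaseState` required no change to `BioMaterial`, whereas adding a new association such as `derived_from` would have meant changing the class and everything generated from it.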

  21. >15 000 000 000 data points • Experiment1 • type • performer • …. • Hybridization data 1 • Experimental factors • Quantitation type definitions • … NetCDF

  22. Data warehouse schema

  23. What BioMart gives to AEDW • Query language abstraction • Joins automatically generated • Schema optimized for performance • Clear database integration roadmap
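
What "joins automatically generated" means in practice can be sketched with a toy query builder. This is not the BioMart API: the table layout, the `ATTRIBUTE_TABLE` and `JOIN_KEY` maps, and `build_query` are all hypothetical, and `sqlite3` stands in for the warehouse database. The user names attributes and filters; the builder works out which tables to join.

```python
import sqlite3

# Hypothetical warehouse layout: one main table plus dimension tables.
MAIN = "expression"
ATTRIBUTE_TABLE = {"value": "expression", "gene_name": "gene", "tissue": "sample"}
JOIN_KEY = {"gene": "gene_id", "sample": "sample_id"}


def build_query(attributes, filters):
    """Turn attribute and filter names into SQL, adding only the joins needed."""
    needed = {ATTRIBUTE_TABLE[a] for a in list(attributes) + list(filters)} - {MAIN}
    sql = "SELECT " + ", ".join(f"{ATTRIBUTE_TABLE[a]}.{a}" for a in attributes)
    sql += f" FROM {MAIN}"
    for table in sorted(needed):
        key = JOIN_KEY[table]
        sql += f" JOIN {table} ON {MAIN}.{key} = {table}.{key}"
    if filters:
        sql += " WHERE " + " AND ".join(
            f"{ATTRIBUTE_TABLE[f]}.{f} = ?" for f in filters)
    return sql, list(filters.values())


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE gene (gene_id INTEGER, gene_name TEXT);
    CREATE TABLE sample (sample_id INTEGER, tissue TEXT);
    CREATE TABLE expression (gene_id INTEGER, sample_id INTEGER, value REAL);
    INSERT INTO gene VALUES (1, 'geneA'), (2, 'geneB');
    INSERT INTO sample VALUES (1, 'liver'), (2, 'brain');
    INSERT INTO expression VALUES (1, 1, 2.5), (1, 2, 0.3), (2, 1, 0.9);
""")

# The caller never writes a join; it only names what it wants.
sql, params = build_query(["gene_name", "value"], {"tissue": "liver"})
rows = sorted(db.execute(sql, params).fetchall())
```

The caller's request stays the same if the underlying tables are later reorganised for performance, which is the point of the query language abstraction on the slide.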

  24. ArrayExpress environment

  25. Future plans • Data management environment automation • Flexible data warehouse interface • Programmatic interface (HTTP/XML based) • Distributed infrastructure??

  26. Distributed data infrastructure • Users query ArrayExpress • Query broker: find resource, deliver data • A local database (three shown)
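
The broker idea on slide 26 is a future plan rather than a described implementation, so the following is only one possible reading of the diagram. The class names `LocalDatabase` and `QueryBroker`, the accession strings, and the in-memory lookup are all assumptions made for the sketch; a real broker would talk to remote sites over a network.

```python
class LocalDatabase:
    """Stand-in for one site's local microarray database (hypothetical)."""

    def __init__(self, name, experiments):
        self.name = name
        self.experiments = experiments   # accession -> data

    def has(self, accession):
        return accession in self.experiments

    def fetch(self, accession):
        return self.experiments[accession]


class QueryBroker:
    """Finds which registered resource holds an experiment, then delivers its data."""

    def __init__(self):
        self.resources = []

    def register(self, database):
        self.resources.append(database)

    def find_resource(self, accession):
        """Return the first registered database holding the accession, or None."""
        for db in self.resources:
            if db.has(accession):
                return db
        return None

    def query(self, accession):
        db = self.find_resource(accession)
        if db is None:
            raise KeyError(f"no registered database holds {accession}")
        return {"source": db.name, "data": db.fetch(accession)}


broker = QueryBroker()
broker.register(LocalDatabase("site A", {"EXP-1": [1.0, 2.0]}))
broker.register(LocalDatabase("site B", {"EXP-2": [0.5]}))
result = broker.query("EXP-2")
```

The user asks the broker once and does not need to know which site holds the data, matching the "find resource" and "deliver data" arrows in the diagram.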

  27. Conclusions • Conceptual object modeling works well for complex life sciences domains • Many software infrastructure components can be auto-generated from object models • A range of approaches can be used for modeling, e.g., UML framework + ontologies • Repository and data warehouse – different aims and different implementation principles

  28. Acknowledgements • MGED collaborators • Stanford, TIGR, Affymetrix, EMBL, …. • BioMart team • Gonzalo Garcia Lara - web interface • Ahmet Oezcimen - DBA • Anjan Sharma - curation tool • Sergio Contrino, Richard Coulson – data warehouse • Niran Abeygunawardena – webmaster • Mohammadreza Shojatalab – MIAMExpress • Misha Kapushesky – Expression Profiler • Curation team: • Helen Parkinson, Ele Holloway, Gaurab Mukherjee, Anna Farne, Tim Rayner • Domain-specific projects: • Susanna Sansone, Philippe Rocca-Serra • Alvis Brazma
