
Functional Genomics Data and Expression look-up tools: ArrayExpress and Expression Atlas

Sarah Morgan, PhD (sarahm@ebi.ac.uk), Training Programme Manager, EMBL-EBI. Girona Workshop, 1st July 2014.


Presentation Transcript


  1. Functional Genomics Data and Expression look-up tools: ArrayExpress and Expression Atlas. Sarah Morgan, PhD (sarahm@ebi.ac.uk), Training Programme Manager, EMBL-EBI. Girona Workshop, 1st July 2014

  2. In this session… • What do we mean by “functional genomics data”? • Two databases: ArrayExpress and Expression Atlas. Why not just one database? • What’s in each database? • How to search, interpret & download data? (File types pictured: fastq, txt, CEL, bam)

  3. What is functional genomics (FG)? • The aim of FG is to understand the function of genes and other parts of the genome • FG experiments typically use genome-wide assays to measure and track many genes (or proteins) in parallel under different conditions • High-throughput technologies such as microarrays and high-throughput sequencing (HTS) are frequently used in this field to interrogate the transcriptome

  4. What biological questions is FG addressing? • When and where are genes expressed? • How do gene expression levels differ in various cell types and states? • What are the functional roles of different genes and in what cellular processes do they participate? • How are genes regulated? • How do genes and gene products interact? • How is gene expression changed in various diseases or following a treatment?

  5. The two databases • ArrayExpress: www.ebi.ac.uk/arrayexpress (daily release at 6am UK time) • Expression Atlas: www.ebi.ac.uk/gxa (monthly release); wwwdev.ebi.ac.uk/gxa (updated more frequently)

  6. The two databases: how are they related? (Diagram: expression data sets enter ArrayExpress through direct submission or through import from external databases, mainly NCBI Gene Expression Omnibus. Curation and statistical analysis connect ArrayExpress to Expression Atlas. Both provide links to analysis software and to other databases.)

  7. The two databases: how do they compare?

  8. Data volume in ArrayExpress: ~50,511 experiments, ~1/5 direct submissions, the rest imported. Pie charts (as of 19 February 2014): microarray vs HTS; RNA-, DNA-, ChIP-seq breakdown

  9. Data content in ArrayExpress (www.ebi.ac.uk/arrayexpress) • Curated data from direct submissions, available in a structured and standardised format, which is essential for easy data sharing • Submissions are curated to the Functional Genomics Data Society (FGED) standards: • MIAME guidelines & MAGE-TAB format for microarray data • MINSEQE guidelines & MAGE-TAB format for HTS data • Many experiments have supporting publications

  10. Community standards for data requirement • MIAME = Minimum Information About a Microarray Experiment (http://www.mged.org/Workgroups/MIAME/miame_2.0.html) • MINSEQE = Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (http://www.mged.org/minseqe) • The checklist:

  11. Reporting standards - MAGE-TAB format. A simple spreadsheet format that uses a number of tab-delimited text files: • Array Design Format file (ADF, microarray only): probe names, sequence, genomic mapping location • Investigation Description Format file (IDF): experiment title + description, submitter’s details, all protocols • Sample and Data Relationship Format file (SDRF): samples, sequencing libraries, hybridisation/sequencing assays and their data files • Raw and processed data files (e.g. A1.CEL, 1.fq.gz, 2.fq.gz, Data1.txt, Data2.txt, Normalized.txt) MAGE-TAB in FGED: http://www.mged.org/mage-tab/index.html
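
Because MAGE-TAB files are plain tab-delimited text, they can be read with a standard CSV reader. The sketch below is a minimal illustration, not the Limpopo or Bio::MAGETAB parsers: the SDRF content is a hypothetical, heavily shortened example (real SDRF files carry many more columns), and the sample names and factor values are invented for the demonstration.

```python
import csv
import io

# Hypothetical, shortened SDRF content; columns are separated by tabs.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[disease]\tArray Data File\n"
    "sample 1\tHomo sapiens\tdiabetes\tA1.CEL\n"
    "sample 2\tHomo sapiens\tnormal\tA2.CEL\n"
)


def read_sdrf(text):
    """Return one dict per sample row, keyed by the SDRF column headings."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_sdrf(SDRF_TEXT)

# Group raw data files by the value of the experimental factor.
files_by_factor = {}
for row in rows:
    factor_value = row["Factor Value[disease]"]
    files_by_factor.setdefault(factor_value, []).append(row["Array Data File"])

print(files_by_factor)
```

Grouping by the factor-value column mirrors what the SDRF is for: tracing each data file back to the sample and condition it came from.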

  12. Example IDF: expt. info and protocols Row headings from MAGE-TAB spec, often with controlled vocab Submitter-supplied information

  13. Example SDRF: workflow from samples to data

  14. ArrayExpress Archive: when to use it? • Find FG experiments that might be relevant to your research • Download data and re-analyze it. Data deposited in public repositories can often be used to answer biological questions different from the one asked in the original experiment. • Submit microarray or HTS data that you want to publish. Major journals will require data to be submitted to a public repository like ArrayExpress as part of the peer-review process.

  15. Browsing ArrayExpress (www.ebi.ac.uk/arrayexpress)

  16. ArrayExpress “Experiments” (= GEO “Series”) Sortable headings Data for all samples

  17. Feature (1): Ontology-based search extension. Term suggestions from the Experimental Factor Ontology (EFO, www.ebi.ac.uk/efo)

  18. Expt. factor: “intent” of the study • The main variable(s) studied, related to the hypothesis or intent of the experiment, e.g. “disease” (diabetes patients vs healthy individuals) • Values of a factor should vary among samples (e.g. “p53-/-”, “wild type”).

  19. Experimental Factor Ontology (EFO) (http://www.ebi.ac.uk/efo) • A way to systematically organise experimental factor terms: controlled vocabulary + hierarchy (relationship) • Used in EBI databases and external projects (e.g. NHGRI GWAS Catalogue) • Combines terms from a subset of well-maintained and compatible ontologies, e.g. • Gene Ontology (cellular component + biological process terms) • NCBI Taxonomy. Ontology in layman's terms: http://jamesmaloneebi.blogspot.co.uk/2012/06/common-ontology-questions-1-what-is-it.html

  20. EFO in ArrayExpress data (http://www.ebi.ac.uk/efo) • expand on search terms when querying ArrayExpress (and Expression Atlas, coming soon) • using synonyms (e.g. “cerebral cortex” = “adult brain cortex”) • using child terms (e.g. “bone” → “rib” and “vertebra”) • promote consistency (e.g. F/female, 1day/24hours) • avoid ambiguity (e.g. “m” = ? or ?) • facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically)
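
The query expansion described on this slide can be sketched with a toy ontology. This is only an illustration of the idea: the two dictionaries below are hypothetical stand-ins built from the slide's own examples, not the real EFO data model, and `expand_term` is an invented helper name.

```python
# Toy ontology fragment. The terms come from the slide; the data structure is hypothetical.
SYNONYMS = {"cerebral cortex": ["adult brain cortex"]}
CHILDREN = {"bone": ["rib", "vertebra"]}


def expand_term(term):
    """Return the search term plus its synonyms and all descendant terms."""
    expanded = [term]
    expanded.extend(SYNONYMS.get(term, []))
    # Breadth-first walk over child terms, so grandchildren are included too.
    queue = list(CHILDREN.get(term, []))
    while queue:
        child = queue.pop(0)
        if child not in expanded:
            expanded.append(child)
            queue.extend(CHILDREN.get(child, []))
    return expanded


print(expand_term("bone"))
print(expand_term("cerebral cortex"))
```

A search for “bone” would then also match experiments annotated only with “rib” or “vertebra”, which is the behaviour the slide describes.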

  21. EFO marked-up search results • Exact match to search term • Matched EFO synonyms to search term • Matched EFO child term of search term

  22. Combining search terms

  23. EFO mark-up continues on single-expt page

  24. More examples of EFO terms • Sample attributes and experimental factor / factor values: “genetic modification”, “kidney”, “diabetes”, “keratinocyte”, “arsenic oxide”, “potassium bromate”, “RNA-seq of coding RNA”. What other search terms can I use? • ArrayExpress accession number, e.g. “E-MEXP-568” • Secondary accession number, e.g. GEO series “GSE5389” • Experiment title, description, e.g. “TG-GATEs” • Submitter's email address • Publication title, authors and journal name, PubMed ID

  25. Feature (2): Advanced search (i.e. filters) Task: Find experiments with rat liver samples and look at the effect of compounds ????

  26. Feature (2): Advanced search (i.e. filters) • Format of search term: field_name:search_term • Hints: • Some examples: • https://www.ebi.ac.uk/arrayexpress/help/how_to_search.html#AdvancedSearchExperiment

  27. Feature (2): Advanced search (i.e. filters) sa:"liver" AND ef:"compound" AND organism:"Rattus norvegicus"
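
The field_name:search_term pattern can be assembled programmatically when several filters are combined. The following is a minimal sketch; `build_query` is a hypothetical helper, and only the field names shown on the slide (`sa`, `ef`, `organism`) are used.

```python
def build_query(filters):
    """Join field:"value" pairs with AND, following the field_name:search_term pattern."""
    parts = [f'{field}:"{value}"' for field, value in filters]
    return " AND ".join(parts)


query = build_query([
    ("sa", "liver"),                    # sample attribute
    ("ef", "compound"),                 # experimental factor
    ("organism", "Rattus norvegicus"),
])
print(query)
```

The resulting string can be pasted into the ArrayExpress search box as is.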

  28. Feature (3): Samples table with expt “factor”

  29. Feature (3): Samples table with expt “factor” • Sortable headings: very handy for these 8105 rows! • Data download links for each sample/assay

  30. Feature (4): programmatic access options (http://www.ebi.ac.uk/arrayexpress/help/programmatic_access.html) “I want to download data for 250 experiments in one go…” 1. REST / XML: http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments?keywords="breast cancer cell line" 2. JSON: http://www.ebi.ac.uk/arrayexpress/json/v2/experiments?keywords="breast cancer cell line" 3. R/Bioconductor: “ArrayExpress” R package 4. FTP: ftp.ebi.ac.uk/pub/databases/microarray/data (5.) MAGE-TAB parsers: Limpopo (Java, SourceForge) and Bio::MAGETAB (Perl, CPAN)

  31. Questions about ArrayExpress?
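
Keyword queries contain spaces, so the URL needs encoding before it is sent. The sketch below only builds the request URLs from the patterns shown on this slide and makes no network call; `experiments_url` is a hypothetical helper, and the endpoints are those listed on the 2014 slide, so they may have changed since.

```python
from urllib.parse import urlencode

# Base address as given on the slide (2014); assumed here, not verified as current.
BASE = "http://www.ebi.ac.uk/arrayexpress"


def experiments_url(keywords, fmt="xml"):
    """Build an experiments query URL for the XML or JSON interface."""
    if fmt not in ("xml", "json"):
        raise ValueError("fmt must be 'xml' or 'json'")
    # urlencode replaces spaces with '+' and escapes reserved characters.
    return f"{BASE}/{fmt}/v2/experiments?" + urlencode({"keywords": keywords})


xml_url = experiments_url("breast cancer cell line")
json_url = experiments_url("breast cancer cell line", fmt="json")
print(xml_url)
print(json_url)
```

Generating the URLs in a loop over many keyword sets is one way to approach the "250 experiments in one go" scenario.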

  32. The two databases (Diagram repeated: expression data sets enter ArrayExpress through direct submission or through import from external databases, mainly NCBI Gene Expression Omnibus. Curation and statistical analysis connect ArrayExpress to Expression Atlas. Both provide links to analysis software and to other databases.)

  33. Expression Atlas (www.ebi.ac.uk/gxa / wwwdev.ebi.ac.uk/gxa) • All manually curated, high-quality data sets, standard analysis pipeline.

  34. Baseline atlas selection criteria • Experiment with a broad selection of tissues/cell lines/conditions covered preferred * • Presence of good-quality raw fastq files (QC) • Reference genome build in GenBank/ENA/DDBJ for read alignment • Biological replicates preferred. * Long term: pool samples from multiple studies, report summarised expression per gene per condition per species.

  35. Baseline Atlas construction (RNA-seq data only!) Input: fastq files, illustrated by two reads, @read_name/1 and @read_name/2, each with sequence line GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT, separator line +, and quality line !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65. 1. Data quality control: low-quality reads (e.g. NNNACTNNN), contamination 2. Align with TopHat against the reference genome from Ensembl, giving mapped reads (bam) 3. Cufflinks: FPKMs
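
The FPKM unit reported at the end of this pipeline (fragments per kilobase of transcript per million mapped fragments) can be illustrated with the basic definition. This is a sketch of the unit only: Cufflinks estimates abundances with a statistical model rather than this direct formula, and the numbers below are invented for the example.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments on a 2,000 bp transcript, 10 million mapped fragments.
value = fpkm(500, 2000, 10_000_000)
print(value)
```

Normalising by both transcript length and library size is what makes values comparable across genes and across samples.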

  36. Differential atlas selection criteria • Clear contrast(s) • At least 3 replicates for each factor value • Maximum 4 factors • Adequate sample annotation using EFO terms • Adequate array (platform) design to map probes to genes and to external references (e.g. Ensembl gene ID, UniProt ID) • Good-quality raw data files, e.g. CEL (Affy), fastq (HTS) • RNA-seq experiments: reference genome build in GenBank/ENA/DDBJ

  37. Differential atlas: how many contrasts per expt? Simple case: “diabetes” vs “normal”. Complex case: E-MTAB-800 (rat compound treatment experiment, TG-GATEs) • ~130 compounds • 4 doses: (none), low, medium, high • Time of sacrifice: 4, 8, 15, 29 days • 2 tissues: liver, kidney. >1000 contrasts!!
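
A rough count shows how the E-MTAB-800 design can exceed 1000 contrasts. The calculation below is a back-of-envelope sketch under a stated assumption: every compound is tested at every treated dose, time point and tissue, and each treated group is compared with its matched “none” control. The slide itself only states the total as “>1000”.

```python
compounds = 130        # approximate figure from the slide
treated_doses = 3      # low, medium, high, each compared against "none"
time_points = 4        # sacrifice at 4, 8, 15 and 29 days
tissues = 2            # liver, kidney

# Assumption: a fully crossed design with one contrast per treated group.
max_contrasts = compounds * treated_doses * time_points * tissues
print(max_contrasts)
```

Even if only part of the design is populated, the count comfortably passes the figure quoted on the slide.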

  38. Differential Atlas construction (microarray) Input: CEL files and a manually curated “contrast”, e.g. disease: “diabetes” vs “normal”. 1. RMA normalization, giving normalised expression values per probe set 2. Moderated t-test (limma) 3. False discovery rate adjustment for p-values (Benjamini & Hochberg, 1995). Output: fold-change, p-values
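
Step 3, the Benjamini & Hochberg adjustment, is simple enough to sketch directly. The function below is a plain-Python illustration of the procedure, not the code used in the Atlas pipeline, and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]   # hypothetical per-gene p-values
adj = benjamini_hochberg(raw)
print(adj)
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum ensures a gene with a smaller raw p-value never ends up with a larger adjusted one.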

  39. Differential Atlas construction (HTS) Input: a “contrast” and fastq files, illustrated by two reads, @read_name/1 and @read_name/2, each with sequence line GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT, separator line +, and quality line !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65. 1. Data quality control: low-quality reads (e.g. NNNACTNNN), contamination 2. Align with TopHat against the reference genome from Ensembl, giving mapped reads (bam) 3. HTSeq 4. DESeq. Output: fold-change and p-values
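
The quality line in each fastq record drives step 1. The sketch below decodes it under the assumption of Phred+33 (Sanger) encoding and applies a mean-quality filter; the threshold of 20 is a hypothetical choice, and the slide does not say which QC tool or cut-off the Atlas pipeline uses.

```python
def phred_scores(quality_line, offset=33):
    """Decode a fastq quality string into per-base Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, offset=33):
    """Average Phred score across the read."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores)


def keep_read(quality_line, min_mean_quality=20.0):
    """Hypothetical filter: keep reads whose mean quality reaches the threshold."""
    return mean_quality(quality_line) >= min_mean_quality


start_of_read = "!''*"   # first four quality characters of the slide's example read
print(phred_scores(start_of_read))
print(keep_read(start_of_read))
```

The opening bases of the example read score very low, which is the kind of signal a quality-control step would act on.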

  40. Mapping microarray probes to genes • Every (~monthly) Atlas release takes the latest Ensembl gene-to-probe identifier mapping data. • From Ensembl genes, we also get: • Compara genes • External references (xrefs) to other databases, e.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, Gene Ontology terms, InterPro terms. (Diagram: expression data per probe, probe identifiers, Ensembl genes)

  41. Baseline Atlas use case: KCC2 gene. Scenario: You study the health impact of Bisphenol A (BPA). BPA: common additive in household plastic items. Negative health effects have been linked to BPA, e.g. on foetal and neonatal brain development. • PNAS paper (Yeo et al., 2013): Bisphenol A delays the perinatal chloride shift in cortical neurons by epigenetic effects on the Kcc2 promoter. (Diagram: BPA, epigenetic downregulation, potassium chloride cotransporter 2 (Kcc2) mRNA levels ↓ in mouse.) Your question: What is the general expression profile of KCC2 in human tissues?

  42. Expression Atlas: Front page (www.ebi.ac.uk/gxa/home)

  43. Baseline Atlas use case: KCC2 gene

  44. Baseline Atlas use case: KCC2 gene

  45. Baseline Atlas use case: KCC2 gene

  46. Human KCC2 gene in Baseline Atlas • Analysis method, experiment design • FPKM threshold slider • Tool tips!

  47. Baseline Atlas: ENCODE cell lines. Scenario: You study the role of the apoptosis pathway (Reactome accession: REACT_578) in hepatoma cell line HepG2. Your question: What genes in the apoptosis pathway are expressed in HepG2?

  48. Baseline Atlas: Apoptosis genes in HepG2

  49. Baseline Atlas: Apoptosis genes in HepG2 (Ensembl)

  50. Differential Atlas use case: human primary hepatocyte and drug Trovafloxacin • Analytics, experiment design, data download • FDR and fold-change cut-offs • Curated experimental factor and contrast • Colour gradient showing significance of differential expression • See fold-changes • MA plots
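
The FDR and fold-change cut-offs on this page act as a simple filter over per-gene results. The sketch below shows the idea on invented data: the gene labels, values and default thresholds (adjusted p-value of at most 0.05, absolute log2 fold-change of at least 1) are all hypothetical and are not the Atlas defaults.

```python
def passes_cutoffs(adj_p, log2_fc, max_fdr=0.05, min_abs_log2_fc=1.0):
    """True if a gene is significant and changes by at least the given magnitude."""
    return adj_p <= max_fdr and abs(log2_fc) >= min_abs_log2_fc


# Hypothetical results: (gene label, adjusted p-value, log2 fold-change)
results = [
    ("geneA", 0.001, 2.3),    # significant, up-regulated
    ("geneB", 0.20, 3.0),     # large change but not significant
    ("geneC", 0.01, -1.5),    # significant, down-regulated
    ("geneD", 0.01, 0.2),     # significant but tiny change
]

selected = [gene for gene, adj_p, fc in results if passes_cutoffs(adj_p, fc)]
print(selected)
```

Taking the absolute value of the fold-change keeps both up- and down-regulated genes, which matches a two-sided colour gradient.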
