
Galaxy: Integrative, Reproducible Analysis of Genomics Data

Galaxy: Integrative, Reproducible Analysis of Genomics Data. Genomic and Proteomic Approaches to Heart, Lung, Blood and Sleep Disorders, Jackson Laboratories. Ross Hardison, September 10, 2008. Galaxy is developed and maintained by Anton Nekrutenko (PSU) and James Taylor (Emory U).


Presentation Transcript


  1. Galaxy: Integrative, Reproducible Analysis of Genomics Data
     • Genomic and Proteomic Approaches to Heart, Lung, Blood and Sleep Disorders, Jackson Laboratories
     • Ross Hardison, September 10, 2008
     • Galaxy is developed and maintained by Anton Nekrutenko (PSU) and James Taylor (Emory U)

  2. Types of data in genomics
     • Sequences
     • Comparisons of DNA and protein sequences
     • Expression data
     • Chromosomes and chromatin data
     • Experimental manipulation
     • Variation and phenotypes
     • Protein structure and function
     • Stored in databases and browsers (e.g. UCSC Genome Browser)
     • Many analysis tools (Galaxy)

  3. Some major web resources in genomics
     • UCSC Genome Browser and Table Browser: http://genome.ucsc.edu/
     • Ensembl and EnsMart/BioMart: http://www.ensembl.org/
     • TIGR Comprehensive Microbial Resource: http://cmr.tigr.org/
     • NCBI for Blast server, PubMed, Gene Expression Omnibus, dbSNP, etc.: http://www.ncbi.nlm.nih.gov/
     • dCode for alignments and other: http://dcode.org
     • HapMap for haplotype and variation: http://hapmap.org
     • Galaxy for data retrieval and analysis: http://galaxy.psu.edu

  4. Sequences • DNA sequences • Whole genomes and chromosomes • Genes • Transcripts • Protein-coding and noncoding transcripts • Full-length or partial (expressed sequence tags or ESTs) • Protein sequences • Known • Predicted • Repeats • Variants

  5. Sequences from CFTR: Browser view

  6. Regulation-related features around T2D risk variants (track label shown: Reg Pot)

  7. Browsers vs Data Retrieval
     • Browsers are designed to show selected information on one locus or region at a time.
        • UCSC Genome Browser
        • Ensembl
     • They run on top of databases that record vast amounts of information.
     • Sometimes you need to retrieve one type of information for many genomic intervals or genome-wide.
     • Access this by querying the tables in the databases or “data marts”
        • UCSC Table Browser
        • EnsMart or BioMart

  8. Retrieve all the protein-coding exons in humans
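
The retrieval named on this slide returns exons restricted to the coding region. As a minimal sketch of that idea, assuming each gene record carries exon start and end lists plus coding-region bounds (as in UCSC genePred-style tables, with half-open coordinates), the clipping step could look like this in Python. The function name `coding_exons` and the coordinates are hypothetical, for illustration only.

```python
def coding_exons(exon_starts, exon_ends, cds_start, cds_end):
    """Clip each exon to the coding region; drop exons that are entirely untranslated."""
    clipped = []
    for start, end in zip(exon_starts, exon_ends):
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:  # keep only exons that still contain coding bases
            clipped.append((s, e))
    return clipped


# Made-up gene: three exons, coding region from 150 to 350.
exons = coding_exons([100, 300, 500], [200, 400, 600], 150, 350)
print(exons)  # [(150, 200), (300, 350)]
```

The third exon lies entirely outside the coding region, so it is dropped.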

  9. Challenges in genomic data analysis
     • We have great browsers and data warehouses
        • But most lack facilities for performing sophisticated analysis
     • Many useful computational tools have been developed in bioinformatics
        • But they are not well integrated, they have different user interfaces, different data formats, etc.

  10. Some common solutions
     • Glue it all together with Excel
        • Until you realize Excel cannot handle that much data and the math isn’t coming out right anyway…
     • Glue it all together with Perl
        • But that leads to duplication of effort, duplication of bugs, ….

  11. A better solution
     • Build a framework that:
        • Defines a common format for describing the interfaces of different computational tools and databases
        • Provides the infrastructure to adapt those interfaces into standard form
        • Defines common data types and standards for integrating the results
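
To make the framework idea above concrete, here is a rough Python sketch of a declarative tool description that turns user-chosen parameters into a command line. The `ToolDescription` class, its fields, and the `sort_intervals` example are hypothetical and are not Galaxy's actual tool configuration format.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class ToolDescription:
    """Hypothetical declarative description of a command-line tool."""
    name: str
    command: str                                  # template with $placeholders
    inputs: dict = field(default_factory=dict)    # parameter name -> data type
    output_type: str = "tabular"

    def build_command(self, **params):
        # Refuse to build a command if a declared parameter is missing.
        missing = set(self.inputs) - set(params)
        if missing:
            raise ValueError(f"missing parameters: {sorted(missing)}")
        return Template(self.command).substitute(params)


# Hypothetical example: wrap the Unix `sort` command for interval files.
sort_tool = ToolDescription(
    name="sort_intervals",
    command="sort -k1,1 -k2,2n $input_file",
    inputs={"input_file": "interval"},
    output_type="interval",
)

cmd = sort_tool.build_command(input_file="dhs.bed")
print(cmd)  # sort -k1,1 -k2,2n dhs.bed
```

Because every tool is described the same way, one generic piece of code can render a form for it, validate the parameters, and record the output data type.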

  12. Two faces of Galaxy
     • A web site where you can easily perform complex analysis integrating various data sources and computational tools
     • A framework to easily build similar sites that integrate your choice of tools and data sources

  13. Galaxy: Data retrieval and analysis
     • Flexible data retrieval
        • From multiple external sources
        • Upload from user’s computer
        • Upload as URL from any site
     • Hundreds of computational tools
        • Data editing, filter, sort
        • File format conversion
        • Extract sequences and alignments
        • Operations: merge, intersection, complement, cluster …
        • Get conservation and other scores for intervals
        • Statistics
        • Graphs and displays
        • EMBOSS tools for sequence analysis
        • HyPhy tools for molecular evolutionary analysis
     • Workflows: run multiple steps reproducibly
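
The interval operations listed above (merge, intersection) can be illustrated with a small Python sketch. It assumes intervals are `(chrom, start, end)` tuples with half-open coordinates; the function names and toy coordinates are hypothetical, and this simple version is not how Galaxy implements its tools.

```python
def merge(intervals):
    """Merge overlapping or adjacent intervals on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect(a, b):
    """Return the regions covered by both interval sets (simple quadratic scan)."""
    merged_b = merge(b)
    result = []
    for chrom_a, start_a, end_a in merge(a):
        for chrom_b, start_b, end_b in merged_b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                result.append((chrom_a, start, end))
    return result


# Made-up coordinates for illustration.
a = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 10, 50)]
b = [("chr1", 250, 400), ("chr2", 40, 60)]
print(merge(a))         # [('chr1', 100, 300), ('chr2', 10, 50)]
print(intersect(a, b))  # [('chr1', 250, 300), ('chr2', 40, 50)]
```

The nested loop is fine for small examples; genome-wide data would call for a sorted sweep or an interval index instead.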

  14. Welcome to Galaxy News
     • Welcome screen, changes periodically
     • When tools are invoked, displays information on the tool and allows the user to choose parameters

  15. Tool choice
     • Titles are toggles; more options are displayed when you click on them

  16. History
     • “Refresh” to get results if they have not appeared or to get status of query
     • Titles are toggles; more information is displayed when you click on them
     • Click on the “eye” to see all the data on another page
     • Click on the “pencil” to edit the attributes
     • Click on the “x” to delete
     • Use “options” next to “History” to save, rename, move to or share histories. Must be logged in to do this.

  17. Proxy-based tools (e.g. UCSC Table Browser)
     • User makes request to Galaxy
     • Galaxy delegates request to external site

  18. Proxy-based tools
     • External site generates response
        • If data, Galaxy determines data type, processes it and adds it to the history
        • Otherwise, response is returned to user
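
The two cases above can be sketched as a small routing function in Python. This is only an illustration under stated assumptions: deciding "data versus not data" from the content type, the simplified type-guessing rules, and the names `sniff_data_type` and `handle_response` are all hypothetical and do not describe Galaxy's actual code.

```python
def sniff_data_type(text):
    """Guess a data type from the first non-comment line (simplified, hypothetical rules)."""
    first = next((ln for ln in text.splitlines() if ln and not ln.startswith("#")), "")
    if first.startswith(">"):
        return "fasta"
    fields = first.split("\t")
    if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
        return "interval"
    return "text"


def handle_response(content_type, body, history):
    """Route a response from an external site, following the two cases above."""
    if content_type.startswith("text/html"):
        # Assumption: an HTML page (e.g. a query form) is not data, so return it to the user.
        return body
    # Data: determine its type and add it to the history.
    history.append({"type": sniff_data_type(body), "data": body})
    return None


history = []
page = handle_response("text/html", "<form>...</form>", history)
result = handle_response("text/plain", "chr1\t100\t200\n", history)
print(history[0]["type"])  # interval
```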

  19. Command-line tools
     • Pick one of the programs from the left “Tools” bar

  20. User chooses parameters for tool

  21. Command is run

  22. Background jobs in Galaxy

  23. Web page with datasets on transcriptional regulation

  24. Data uploads to Galaxy: use the URL

  25. How many DHS overlap with high RP intervals?

  26. Overlaps of DHS with high RP segments (25%) and highly constrained segments (43%)
     • High RP segments: 24,330/95,709 = 0.254
     • Highly constrained segments: 41,000/95,709 = 0.428
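
The overlap question on these slides comes down to counting query intervals that touch at least one reference interval and dividing by the total. A minimal Python sketch, assuming half-open `(chrom, start, end)` tuples; the function name and the toy coordinates are made up for illustration, while the two final divisions use the counts reported on the slide.

```python
def count_overlapping(query, reference):
    """Count query intervals that overlap at least one reference interval."""
    count = 0
    for chrom, start, end in query:
        if any(c == chrom and s < end and start < e for c, s, e in reference):
            count += 1
    return count


# Toy, made-up coordinates.
dhs = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80), ("chr2", 300, 400)]
high_rp = [("chr1", 150, 250), ("chr2", 350, 360)]

n = count_overlapping(dhs, high_rp)
fraction = n / len(dhs)
print(n, fraction)  # 2 0.5

# Fractions reported on the slide.
print(round(24330 / 95709, 3))  # 0.254
print(round(41000 / 95709, 3))  # 0.428
```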

  27. Get constraint scores for intervals

  28. Histogram of phastCons scores

  29. Mean vs Maximum phastCons
     • [Figure: distribution of mean and max phastCons scores in DHS that are also occupied by CTCF, n=7000]
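
Summarizing per-base constraint scores as a mean and a maximum for each interval can be sketched in Python as follows. The assumptions here are that scores are available as a mapping from `(chrom, position)` to a value and that intervals are half-open; the function name `summarize_scores` and all numbers are made up for illustration.

```python
from statistics import mean


def summarize_scores(intervals, scores):
    """Return (mean, max) of per-base scores for each interval.

    `scores` maps (chrom, position) to a score; positions without a score are skipped.
    """
    summary = []
    for chrom, start, end in intervals:
        values = [scores[(chrom, pos)] for pos in range(start, end) if (chrom, pos) in scores]
        if values:
            summary.append((mean(values), max(values)))
        else:
            summary.append((None, None))  # no scored bases in this interval
    return summary


# Made-up scores for three bases on chr1.
scores = {("chr1", 10): 0.25, ("chr1", 11): 0.5, ("chr1", 12): 0.75}
intervals = [("chr1", 10, 13), ("chr1", 20, 25)]
summary = summarize_scores(intervals, scores)
print(summary)  # [(0.5, 0.75), (None, None)]
```

The mean describes overall constraint across an interval, while the maximum picks out a short, highly constrained stretch inside an otherwise weakly constrained one.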

  30. Many thanks …
     • Yong Cheng, Demesew Abebe, Christine Dorman, …, Ying Zhang, David King, Swathi Ashok Kumar
     • James Taylor, Anton Nekrutenko
     • Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
