Learn about R, an open-source language for statistical computing, and Bioconductor, software for biomedical data analysis, including packages and functionalities. Explore statistical and graphical methods for genomic data analysis.
Introduction to R and Bioconductor. BMI 731, Winter 2005. Catalin Barbacioru, Department of Biomedical Informatics, Ohio State University
References • R Project (www.r-project.org): open-source language and environment for statistical computing and graphics. Comprehensive R Archive Network, CRAN (cran.r-project.org): source code and precompiled binary distributions for Linux, Windows, MacOS; base and contributed packages. • Bioconductor Project (www.bioconductor.org): open-source software for the analysis of biomedical and genomic data, mainly R packages.
R Project • R is a language and environment for statistical computing and graphics. It is an open-source project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S. • R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an open-source route to participation in that activity.
R Project • R can be extended (easily) via packages. • An R package is a structured collection of code (R, C, or other), documentation, and/or data for performing specific types of analyses. • Packages only need to be installed once, but they must be loaded in each new R session. • Loading: use the R function library, e.g., library(Biobase). • Various functions are available to obtain information on a package. • For example, packageDescription returns the content of the DESCRIPTION file and find.package (.find.package in older R versions) returns the directory where the package was installed. > packageDescription("hgu95av2")
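The install-once, load-per-session workflow described above can be sketched as follows. This is a minimal illustration that queries the base `stats` package so that it runs without any Bioconductor packages installed; the commented-out install line and its package name are only an example, not part of the original slides.

```r
# Installing is done once per R installation (left commented out here;
# the package name is only an example).
# install.packages("ellipse")

# Loading must be repeated in every new R session.
library(stats)

# packageDescription() returns the parsed DESCRIPTION file of a package.
desc <- packageDescription("stats")
print(desc$Package)
print(desc$Version)

# find.package() returns the directory where the package is installed
# (older R versions exposed this as .find.package()).
pkg_dir <- find.package("stats")
print(pkg_dir)
```

The same calls apply to a Bioconductor package such as hgu95av2 once it is installed.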
R Packages • Analysis packages: implementation of statistical and graphical methods. E.g. cluster, glm, graph, hexbin, lattice, rpart. • Data packages: Biological metadata packages consisting of environment objects for mappings between different gene identifiers (e.g., Affymetrix ID, GO ID, LocusLink ID, PubMed ID), CDF and probe sequence information for Affymetrix chips. E.g. GO, hgu95av2, humanLLMappings, KEGG. • Specialized/custom packages: code, data, documentation, and exercises, for a particular project, article, or course. E.g. EMBO03: Bioconductor course package; golubEsets: Golub et al. (2000) ALL/AML dataset; yeastCC: Spellman et al. (1998) yeast cell cycle dataset.
R Packages • Base packages (CRAN). E.g. base, graphics, methods, stats. • Contributed packages (CRAN). E.g. ellipse, XML. • Bioconductor packages. E.g. annotate, affy, marray, multtest, hgu95av2, ALL, EMBO03.
Bioconductor Project • Bioconductor is an open-source and open-development software project for the analysis of biomedical and genomic data. • The project was started in the Fall of 2001 and includes 25 core developers in the US, Europe, and Australia. • Provide access to powerful statistical and graphical methods for the analysis of biomedical and genomic data. • Facilitate the integration of biological metadata from WWW in the analysis of experimental data. E.g. GenBank, GO, LocusLink, PubMed. • Provide training in computational and statistical methods.
Bioconductor Packages • Statistical methods: cluster analysis, estimation and (multiple) testing for linear and non-linear models (with possibly censored continuous and polychotomous outcomes), resampling, visualization, etc. • Biological assays: cell-based assays, DNA microarrays (transcript levels, DNA copy number from CGH), proteomics, SAGE, SELDI-TOF, SNP, etc. • Biological metadata from WWW: GenBank, GO, KEGG, PubMed, etc. • Interfaces with other languages: C, Java, Perl, Python, XML, etc. – Omega Project (www.omegahat.org). • Interactions with other projects: BGL, GeneSpring, Graphviz, MAGE-ML, Resourcerer, etc.
Bioconductor Packages • Analysis packages: e.g., annotate, affy, marray, multtest. • Data packages: • Biological metadata: mappings between different gene identifiers (e.g., AffyID, GO ID, LocusID, PMID), CDF and probe sequence information for Affymetrix chips. E.g. hgu95av2, GO, KEGG. • Experimental data: code, data, and documentation for specific experiments or projects. ALL: Chiaretti et al. (2004) ALL dataset. golubEsets: Golub et al. (2000) ALL/AML dataset. yeastCC: Spellman et al. (1998) yeast cell cycle dataset.
Bioconductor Packages • General infrastructure: Biobase, Biostrings, DynDoc, reposTools, rhdf5, ruuid, tkWidgets, widgetTools. • Annotation: annotate, AnnBuilder + metadata packages. • Graphics: geneplotter, hexbin. • Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, affylmGUI, affyPLM, annaffy, gcrma, makecdfenv, vsn. • Pre-processing two-color spotted DNA microarray data: arrayMagic, arrayQuality, limma, limmaGUI, marray, vsn. • Other assays: aCGH, DNAcopy, prada, PROcess, RSNPer, SAGElyzer. • Differential gene expression: EBarrays, edd, factDesign, genefilter, limma, limmaGUI, multtest, ROC. • Graphs and networks: graph, RBGL, Rgraphviz. • Gene Ontology: GOstats, goTools.
Microarray data analysis • Pre-processing of – spotted array data with marray packages; – Affymetrix chip data with affy packages. • List of differentially expressed genes from genefilter, limma, or multtest packages. • Prediction of tumor class using randomForest package. • Clustering of genes using cluster or hopach packages. • Use of annotate package – to retrieve and search PubMed abstracts; – to generate an HTML report with links to LocusLink and PubMed for each gene.
affy Package • To load the necessary packages,
> library(affy)
> library(affydata)
• One of the main functions for reading in Affymetrix data is ReadAffy. It reads in data from CEL files and creates objects of class AffyBatch. • In this lab we will work mainly with the Dilution dataset, which is included in the affydata package. To load the dataset, type
> data(Dilution)
For a description of Dilution, type
> ?Dilution
affy classes and methods • One of the main classes in affy is the AffyBatch class.
> class(Dilution)
[1] "AffyBatch"
> slotNames(Dilution)
[1] "cdfName"     "nrow"        "ncol"        "exprs"       "se.exprs"    "phenoData"
[7] "description" "annotation"  "notes"
> Dilution
AffyBatch object
size of arrays=640x640 features (12805 kb)
cdf=HG_U95Av2 (12625 affyids)
number of samples=4
number of genes=12625
annotation=hgu95av2
affy classes and methods • The exprs slot contains a matrix with columns corresponding to chips and rows to individual probes on the chip. To obtain the matrix of intensities for all four chips, > e <- exprs(Dilution) • Probe-level PM and MM intensities can be accessed using the pm and mm methods. > PM <- pm(Dilution)
affy classes and methods
> PM[1:5, ]
       20A   20B   10A   10B
[1,] 468.8 282.3 433.0 198.0
[2,] 430.0 265.0 308.5 192.8
[3,] 182.3 115.0 138.0  86.3
[4,] 930.0 588.0 752.8 392.5
[5,] 171.0 128.0 152.3  97.8
affy classes and methods To get the probe-set names (Affy IDs),
> gnames <- geneNames(Dilution)
> length(gnames)
[1] 12625
> gnames[1:5]
[1] "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"
[5] "1004_at"
affy classes and methods To produce boxplots of log base 2 probe intensities,
> boxplot(Dilution, col = c(2, 2, 3, 3))
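As a rough illustration of what the boxplot summarizes, the sketch below log2-transforms the five PM rows printed earlier and compares per-chip medians. It uses base R only and just those five probes, so it hints at the pattern rather than reproducing the full Dilution boxplot; the variable names `log_pm` and `chip_medians` are introduced here for illustration.

```r
# Five PM rows copied from the PM[1:5, ] output shown earlier
# (chips 20A, 20B, 10A, 10B).
PM <- matrix(c(468.8, 282.3, 433.0, 198.0,
               430.0, 265.0, 308.5, 192.8,
               182.3, 115.0, 138.0,  86.3,
               930.0, 588.0, 752.8, 392.5,
               171.0, 128.0, 152.3,  97.8),
             ncol = 4, byrow = TRUE,
             dimnames = list(NULL, c("20A", "20B", "10A", "10B")))

# Intensities are compared on the log base 2 scale.
log_pm <- log2(PM)

# One median per chip (column).
chip_medians <- apply(log_pm, 2, median)
print(round(chip_medians, 2))

# boxplot() on a matrix draws one box per column, i.e. one per chip.
boxplot(log_pm, col = c(2, 2, 3, 3), ylab = "log2 PM intensity")
```

On these five probes the A chips have higher medians than the B chips within each concentration, which suggests a systematic chip-to-chip difference (assuming the A/B suffix distinguishes the two scans of the same concentration).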
affy classes and methods • The boxplots show that the Dilution data needs normalization. As described in the dataset help file and in the phenoData slot (pData(Dilution)), two concentrations of mRNA were used and, for each concentration, two scanners were used. From the plots, we note that scanner effects seem stronger than concentration effects (different colors). In other words, chips that should be the same are different; chips that should be different are similar. • Because different mRNA concentrations were used, we perform normalization within concentration groups. The default procedure implemented in the normalize method is probe-level quantile normalization.
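The idea behind probe-level quantile normalization can be sketched in a few lines of base R. This is a simplified illustration on a hypothetical toy matrix, not the implementation used by the affy normalize method; the function name `quantile_normalize` is introduced here, and the sketch assumes a complete matrix (no missing values) with more than one row.

```r
# Quantile normalization sketch: force every chip (column) to share the
# same empirical distribution of intensities.
quantile_normalize <- function(x) {
  # Sort each column, then average across columns rank by rank to get
  # a common reference distribution.
  sorted <- apply(x, 2, sort)
  ref <- rowMeans(sorted)
  # Replace each value by the reference value at its within-column rank.
  # Ties are broken by order of appearance (a simplifying assumption).
  ranks <- apply(x, 2, rank, ties.method = "first")
  out <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(x)
  out
}

# Hypothetical probe-by-chip intensities: 4 probes, 3 chips.
x <- matrix(c(5, 2, 3, 4,
              4, 1, 4, 2,
              3, 4, 6, 8),
            nrow = 4,
            dimnames = list(NULL, c("chip1", "chip2", "chip3")))

norm_x <- quantile_normalize(x)
print(norm_x)

# After normalization every column contains the same set of values.
sorted_after <- apply(norm_x, 2, sort)
print(sorted_after)
```

Applying such a procedure separately to the two chips of each concentration corresponds to the within-group normalization described above.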
affy classes and methods
> Dil20 <- normalize(Dilution[, 1:2])
> Dil10 <- normalize(Dilution[, 3:4])
> normDil <- merge(Dil20, Dil10)
> boxplot(normDil, col = c(2, 2, 3, 3))
affy classes and methods We view the process of going from probe-level intensities to gene-level expression measures as a three-step procedure consisting of: (i) background adjustment; (ii) normalization; (iii) summarization. The affy package provides implementations for a number of methods for each of these steps: (i) background correction: e.g., none, MAS 5.0, convolution; (ii) normalization: e.g., probe-level quantile, cyclic loess, contrast loess; (iii) summarization: e.g., MAS 4.0, MAS 5.0, MBEI (Li & Wong, 2001), median polish for additive linear model (Irizarry et al., 2003). The Robust Multichip Average (RMA) method refers to the sequence: convolution background adjustment, probe-level quantile normalization, and median polish summarization for gene-specific additive models with probe and chip effects. > rmaDil <- rma(Dilution)
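The summarization step (iii) by median polish can be illustrated with the `medpolish` function from the base `stats` package. The sketch below fits the additive probe-plus-chip model to hypothetical log2 intensities for a single probe set; it is not the code path used by `rma`, and the effect sizes, the baseline of 8, and the variable names are made up for illustration.

```r
# Hypothetical additive model for one probe set on the log2 scale:
# y[i, j] = baseline + probe_effect[i] + chip_effect[j]
probe_effect <- c(-1, 0, 1, 0.5, -0.5)   # 5 probes, median 0
chip_effect  <- c(0, 0.2, -0.3, 0.1)     # 4 chips
y <- 8 + outer(probe_effect, chip_effect, "+")
colnames(y) <- c("20A", "20B", "10A", "10B")

# Median polish alternately sweeps out row (probe) and column (chip)
# medians until the residuals stabilize.
fit <- medpolish(y, trace.iter = FALSE)

# The expression measure for each chip is the overall term plus the
# chip (column) effect.
expr_measure <- fit$overall + fit$col
print(round(expr_measure, 3))
```

Because the toy data are exactly additive and the probe effects have median zero, the fitted chip-level values equal 8 plus the chip effects. With real data the residuals are non-zero, and the use of medians limits the influence of outlying probes.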
affy classes and methods CDF data packages Data packages providing CDF information can be downloaded from www.bioconductor.org. These packages contain environment objects which provide mappings between AffyIDs and matrices of probe locations, with rows corresponding to probe-pairs and columns to PM and MM cells. The CDF environment for the HGU95Av2 chip is already in the package. For information on the environment object, type
> ?hgu95av2cdf
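The structure of such an environment object can be mimicked with a small base R sketch. The environment `toy_cdf` and all cell indices below are hypothetical stand-ins, not values from hgu95av2cdf; only the layout (one matrix per AffyID, rows as probe-pairs, columns for PM and MM cells) follows the description above.

```r
# A hypothetical stand-in for a CDF environment: keys are probe-set IDs,
# values are matrices of cell locations with columns "pm" and "mm".
toy_cdf <- new.env()
assign("1000_at",
       cbind(pm = c(101, 205, 309), mm = c(741, 845, 949)),
       envir = toy_cdf)
assign("1001_at",
       cbind(pm = c(12, 34), mm = c(652, 674)),
       envir = toy_cdf)

# List the probe-set IDs stored in the environment.
print(ls(toy_cdf))

# Look up the probe-pair matrix for one probe set.
probe_cells <- get("1000_at", envir = toy_cdf)
print(probe_cells)

# Retrieve several probe sets at once and count probe-pairs in each.
all_sets <- mget(ls(toy_cdf), envir = toy_cdf)
print(sapply(all_sets, nrow))
```

The same `ls`, `get`, and `mget` calls work on any R environment, which is why they are a convenient way to inspect metadata packages built on environments.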