
Intro to R / BioConductor





Presentation Transcript


  1. Intro to R / BioConductor

  2. Intro to R / BioConductor Contacts: Kim Gernert gernert@emory.edu Lance Waller lwaller@sph.emory.edu Steve Pittard wsp@emory.edu Website: www.bimcore.emory.edu/bbseries

  3. Intro to R / BioConductor

  4. Intro to R / BioConductor “Companies as diverse as Google, Pfizer, Merck, Bank of America, the InterContinental Hotels Group and Shell use R.” “R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.” “R has really become the second language for people coming out of grad school now, and there’s an amazing amount of code being written for it,” said Max Kuhn, associate director of nonclinical statistics at Pfizer. Close to 1,600 different packages reside on just one of the many Web sites devoted to R, and the number of packages has grown exponentially.

  5. What is R?
  • R is an interactive framework for data and statistical analysis that also happens to have a programming language.
  • Compare this to languages such as Perl, Python, and Java, which have data-analysis add-ons.
  • Which language to use? Exploit the strengths of all of them, but if data analysis is a big part of the work, consider using R as part of the “pipeline”.
  • R integrates well with Excel and with existing C, C++, Perl, Python, XML, and FORTRAN programs. See www.omegahat.com
  • Most of the effort in using R goes into shaping the data for analysis and understanding the available functions and packages - deep programming skills are not required but are definitely rewarded.

  6. Where to Find Software & Documentation
  Project Home page: http://www.r-project.org/
  Comprehensive R Archive Network (CRAN): http://cran.cnr.berkeley.edu/
  Manuals: http://cran.cnr.berkeley.edu/manuals.html
  Good Intro Manual: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
  FAQs: http://cran.cnr.berkeley.edu/faqs.html
  The R Journal: http://journal.r-project.org/
  Contributed Documentation: http://cran.cnr.berkeley.edu/other-docs.html
  BioConductor: http://www.bioconductor.org (400+ add-ons for life scientists)
  R Graphical User Interface (GUI): http://sciviews.org/_rgui
  R Programming Jobs: http://www.programmingr.com/category/stype/r-job-listings
  Spreadsheet Addiction: http://www.burnsstat.com/pages/Tutor/spreadsheet_addiction.html

  7. Additional Resources - Books/Courses
  R in a Nutshell - Joseph Adler (2009)
  Data Manipulation with R - Phil Spector (2008)
  ggplot2: Elegant Graphics for Data Analysis - Hadley Wickham
  A Handbook of Statistical Analyses Using R - Everitt and Hothorn (2006)
  BioConductor Case Studies - Hahne, Huber, Gentleman, Falcon (2008)
  Introductory Statistics with R - Peter Dalgaard
  Software for Data Analysis: Programming with R - John Chambers (2008)
  BioConductor Courses & Conferences: www.bioconductor.org/help/events/


  9. Additional Resources - CRAN
  To obtain information on the wide variety of add-ons available for R, visit http://cran.cnr.berkeley.edu. Here are some of the available package categories. Use the “search” function to drill down:
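  Once a package of interest turns up, installing and loading it from CRAN takes only a couple of calls. A minimal sketch (the package name ggplot2 here is just an example, not something singled out on the slide):
  > install.packages("ggplot2")    # download and install from a CRAN mirror
  > library(ggplot2)               # load the package into the current session
  > help(package = "ggplot2")      # browse the package's help index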


  11. Clash of The Titans - R vs MATLAB vs SAS
  Let the debate begin… but there are no easy answers here. There are lots of good things about all three.

  12. To Use a Graphical Interface or Not?
  R is provided with a command line interface (CLI), which is the preferred user interface for power users because it allows direct control over calculations and is flexible. However, it requires good knowledge of the language, so the CLI can be intimidating for beginners. Some common R GUIs include: R Commander, JGR, SciViews-R, R.app (OS X), Eclipse with StatET, Emacs.

  13. Commercially Supported R?
  Revolution Analytics (www.revolutionanalytics.com) claims to provide an enhanced version of R with additional features such as a visual IDE & debugger and on-call technical support, both for end-user questions and statistical consultation. I have no practical experience with this product and can’t say whether it is good or not. However, for those seeking production support for R this is one possibility.

  14. Ever try reading code written by someone else? Style Guides for R Programming
  Take some “style” hints from the big guys. See “Google’s R Style Guide” and Hadley Wickham’s style guide at:
  http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
  https://github.com/hadley/devtools/wiki/Style

  15. R - The Work Flow
  The typical R workflow is: Data Cleanup/Reshaping --> Model/Analyze/Statistics --> Visualize. Repeat until it makes sense. Data cleanup can frequently require more time than the other two steps combined! Beware: it takes TIME to get used to transforming data. Data sets rarely arrive at your door in the ideal format, so spend time with the R input and transformation functions.
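  As a hedged illustration of that loop (using only a built-in data set and base functions, so none of these names come from the slides), one complete cleanup --> model --> visualize pass might look like:
  > data(airquality)                           # built-in data set with missing values
  > aq <- na.omit(airquality)                  # cleanup: drop incomplete rows
  > aq$Month <- factor(aq$Month)               # reshape: treat month as a category
  > fit <- lm(Ozone ~ Temp + Wind, data = aq)  # model: simple linear regression
  > summary(fit)                               # analyze: inspect the coefficients
  > plot(aq$Temp, aq$Ozone, col = aq$Month)    # visualize: raw data, colored by month
  > abline(lm(Ozone ~ Temp, data = aq))        # ...with a fitted trend line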

  16. Your First R Session
  $ R
  > help.start()
  > ?mean                 # Get help on the mean function
  > example(kmeans)       # Run an example of kmeans (if it exists)
  > ??clust               # Search for info on clust
  > x <- rnorm(50)
  > y <- rnorm(x)         # Generate two pseudo-random normal vectors
  > plot(x, y)            # Plot the points in the plane
  > ls()                  # See R objects that are in the R workspace
  > x <- 1:20             # Make x = (1, 2, ..., 20)
  > w <- 1 + sqrt(x)/2
  > dummy <- data.frame(x = x, y = x + rnorm(x)*w)
  > fm <- lm(y ~ x, data = dummy)
  > q()
  Save workspace image? [y/n/c]: y
  $

  17. R As a “Calculator”
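  The original slide showed only a screenshot, so here is a hedged sketch of the kind of interactive arithmetic it likely illustrated:
  > 2 + 2
  [1] 4
  > (17 * 365.25) / 7        # days in 17 years, expressed in weeks
  [1] 887.0357
  > sqrt(2)^2 == 2           # beware floating-point comparisons
  [1] FALSE
  > all.equal(sqrt(2)^2, 2)  # the safer test for near-equality
  [1] TRUE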

  18. Reading Data Into R
  > Biomarkers <- read.csv("bio.csv", header=TRUE, sep=",")
  > library(RODBC)                    # odbcConnectExcel() lives in the RODBC package
  > sheet <- "c:\\Documents and Settings\\user\\sheet.xls"
  > con <- odbcConnectExcel(sheet)
  Can also use library(XML) to read MS Office docs since they are zipped XML files.
  > library(RMySQL); drv <- dbDriver("MySQL")
  > con <- dbConnect(drv, dbname=, user=, password=, host=)   # supply your own connection details
  > mydata <- dbGetQuery(con, "SELECT * FROM mydata")
  > wpage <- readLines("http://www.programr.com/list.html")
  > author_lines <- wpage[grep("<I>", wpage)]

  19. Reading/Writing Data with R
  > load("My_Saved_WorkSpace.Rdata")            # Binary format
  > save(x, y, z, file="precious_data.rda")     # Binary format
  > write.csv(x, file="co2.txt")                # CSV, text
  > library(R2HTML)
  > dir.create(file.path(tempdir(), "R2HTML"))
  > tgt <- HTMLInitFile(file.path(tempdir(), "R2HTML"), filename="sample", BackGroundColor="#BBBBEE")
  > HTML("<br>This is a test", file=tgt)
  > tmp <- as.data.frame(matrix(rnorm(100), ncol=10))
  > HTML(tmp, file=tgt)
  > HTMLEndFile()

  20. Reading/Writing Data with R

  21. Concepts for Using R - Functions
  Functional Programming: “Computations are based around functions, which can encapsulate and return specific, meaningful results” *
  > rnorm(10)
   [1]  0.005746913  0.137718058 -0.350398360 -1.258142029 -1.549439542
   [6]  0.390550679 -0.368581797  1.050090167 -0.351906139 -1.273418525
  > avg <- mean(rnorm(10))
  > avg
  [1] -0.1479458
  > quantile(rnorm(10))
          0%        25%        50%        75%       100%
  -1.1462175  0.3044227  0.4740667  0.8506002  1.3871263
  * John Chambers, Software for Data Analysis: Programming with R

  22. Concepts for Using R - Objects
  Everything (vector, factor, matrix, array, list, data.frame) is an object, which also has a type and belongs to a class:
  > 3 + 5
  [1] 8
  > typeof(3)
  [1] "double"
  > class(3)
  [1] "numeric"
  > typeof(`+`)
  [1] "builtin"
  > as.Date("12/11/10", "%m/%d/%y") + 1
  [1] "2010-12-12"
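  To extend the idea slightly (these particular objects are illustrative, not from the slide), the same inspection functions work on anything you create:
  > x <- c(1, 2, 3)
  > class(x)
  [1] "numeric"
  > l <- list(a = 1, b = "two")
  > class(l); typeof(l)
  [1] "list"
  [1] "list"
  > df <- data.frame(id = 1:3)
  > class(df); typeof(df)       # a data.frame is built on top of a list
  [1] "data.frame"
  [1] "list"
  > f <- function(n) n^2
  > class(f); typeof(f)
  [1] "function"
  [1] "closure"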

  23. Concepts for Using R - Vectors
  R implements vectors quite well. The vector is a fundamental object type, so get familiar with stashing data in vectors. They can be combined into matrices.
  > x <- c(1,3,2,10,5)      # create a vector x with 5 components
  > y <- 1:5                # create a vector of consecutive integers
  > y + 2                   # scalar addition
  [1] 3 4 5 6 7
  > 2*y                     # scalar multiplication
  [1] 2 4 6 8 10
  > y^2                     # raise each component to the second power
  [1] 1 4 9 16 25
  > x <- c(1,3,2,10,5); y <- 1:5   # two or more statements are separated by semicolons
  > x*y
  [1] 1 6 6 40 25
  > diff(x)
  [1] 2 -1 8 -5
  > max(x)
  [1] 10
  > x[3:5]                  # returns elements 3 to 5 inclusively
  [1] 2 10 5
  > x > 3                   # returns a logical vector
  [1] FALSE FALSE FALSE TRUE TRUE
  > x[x > 3]                # returns those elements greater than 3
  [1] 10 5

  24. Concepts for Using R - Vectors
  > seq(10)
   [1] 1 2 3 4 5 6 7 8 9 10
  > seq(0, 1, length=10)
  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
  [8] 0.7777778 0.8888889 1.0000000
  > seq(0, 1, by=0.1)
   [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
  > rep(1, 3)
  [1] 1 1 1
  > c(rep(1,3), rep(2,2), rep(-1,4))
  [1] 1 1 1 2 2 -1 -1 -1 -1
  > g <- sample(c(12,17,24), 50, prob=c(20,30,50), replace=TRUE)
  > g
   [1] 17 24 17 24 24 24 17 24 12 17 24 12 24 17 24 17 24 24 24 24 17 17 12 17 24
  [26] 24 24 24 24 12 17 24 17 17 12 24 12 17 12 17 24 17 17 24 24 24 12 24 24
  > length(g[g >= 24])
  [1] 25

  25. Concepts for Using R - Matrices
  R also supports matrices, which are objects that typically refer to a numeric array of rows and columns. An easy way to create a matrix is to combine vectors of equal length using the “cbind()” function (column bind).
  > x
  [1] 1 3 2 10 5
  > y
  [1] 1 2 3 4 5
  > mymat <- cbind(x, y); mymat     # Create the matrix
        x y
  [1,]  1 1
  [2,]  3 2
  [3,]  2 3
  [4,] 10 4
  [5,]  5 5
  > t(mymat)                        # Transpose it
    [,1] [,2] [,3] [,4] [,5]
  x    1    3    2   10    5
  y    1    2    3    4    5
  > mymat[2,]                       # Get the second row of mymat
  x y
  3 2

  26. Concepts for Using R - Matrices
  Matrices can also be built directly with the “matrix()” function.
  > ex1 <- matrix(1:4, ncol=2)              # Use the “matrix” command
  > ex2 <- matrix(c(20,40,40,40), ncol=2)
  > ex1
       [,1] [,2]
  [1,]    1    3
  [2,]    2    4
  > ex2
       [,1] [,2]
  [1,]   20   40
  [2,]   40   40
  > solve(ex1)                              # Find the inverse
       [,1] [,2]
  [1,]   -2  1.5
  [2,]    1 -0.5
  > diag(ex2)                               # Get the diagonal of ex2
  [1] 20 40

  27. Concepts for Using R - Data Frames
  A data frame is an object that contains data in a format that allows for easier manipulation, reshaping, and open-ended analysis. Data frames are tightly coupled collections of variables. The data frame is one of the more important constructs you will encounter when using R, so learn all you can about it.
  > head(clinical.trial)
    patient      age treatment center
  1       1 67.28203 Treatment Center E
  2       2 65.74676 Treatment Center B
  3       3 72.69997 Treatment Center A
  4       4 58.33269 Treatment Center E
  5       5 67.31442 Treatment Center B
  > clinical.trial$age          # extract the age column (output omitted)
  > head(clinical.trial$age)
  [1] 67.28203 65.74676 72.69997 58.33269 67.31442 70.84661
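  The clinical.trial data frame itself is never defined on the slide. A hedged sketch that builds a similar object (the column names match the output above; the values themselves are simulated, not the original data):
  > set.seed(1)
  > clinical.trial <- data.frame(
  +     patient   = 1:100,
  +     age       = rnorm(100, mean = 60, sd = 6),
  +     treatment = rep("Treatment", 100),
  +     center    = sample(paste("Center", LETTERS[1:5]), 100, replace = TRUE))
  > head(clinical.trial, 3)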

  28. Concepts for Using R - Data Frames
  > info <- data.frame(gender = c("M", "M", "F"), ht = c(172, 186.5, 165), wt = c(91, 99, 74))
  > info
    gender    ht wt
  1      M 172.0 91
  2      M 186.5 99
  3      F 165.0 74
  > info[1,2]
  [1] 172
  > names(info)
  [1] "gender" "ht" "wt"
  > info$ht
  [1] 172.0 186.5 165.0
  > rownames(info) <- c("S1","S2","S3")
  > info
     gender    ht wt
  S1      M 172.0 91
  S2      M 186.5 99
  S3      F 165.0 74

  29. Concepts for Using R - Data Frames
  > height <- info$ht
  > height
  [1] 172.0 186.5 165.0
  > info$age = c(28,55,43)
  > info
     gender    ht wt age
  S1      M 172.0 91  28
  S2      M 186.5 99  55
  S3      F 165.0 74  43
  > subset(info, age > 50)
     gender    ht wt age
  S2      M 186.5 99  55

  30. R - The Work Flow
  The typical R workflow is: Data Cleanup/Reshaping --> Model/Analyze/Statistics --> Visualize. Repeat until it makes sense. Data cleanup can frequently require more time than the other two steps combined! Beware: it takes TIME to get used to transforming data. Data sets rarely arrive at your door in the ideal format, so spend time with the R input and transformation functions.

  31. R - Data CleanUp
  Data Cleanup/Reshaping: cbind, rbind, merge, cut
  > top.salaries
    name.last name.first     team position    salary
  1   Manning     Peyton    Colts       QB  18700000
  2     Brady        Tom Patriots       QB 146267720
  3    Pepper     Julius Panthers       DE  14137500
  4    Palmer     Carson  Bengals       QB  13980000
  > year <- c(2008,2008,2008,2008)
  > rank <- c(1,2,3,4)
  > more.cols <- data.frame(year, rank)
  > cbind(top.salaries, more.cols)
    name.last name.first     team position    salary year rank
  1   Manning     Peyton    Colts       QB  18700000 2008    1
  2     Brady        Tom Patriots       QB 146267720 2008    2
  3    Pepper     Julius Panthers       DE  14137500 2008    3
  4    Palmer     Carson  Bengals       QB  13980000 2008    4

  32. R - Data CleanUp
  Data Cleanup/Reshaping: cbind, rbind, merge, cut
  > next.three
    name.last name.first    team position   salary
  1     Favre      Brett Packers       QB 12899999
  2    Bailey      Champ Broncos       CB 12690050
  3  Harrison     Marvin   Colts       WR 12000000
  > rbind(top.salaries, next.three)
    name.last name.first     team position    salary
  1   Manning     Peyton    Colts       QB  18700000
  2     Brady        Tom Patriots       QB 146267720
  3    Pepper     Julius Panthers       DE  14137500
  4    Palmer     Carson  Bengals       QB  13980000
  5     Favre      Brett  Packers       QB  12899999
  6    Bailey      Champ  Broncos       CB  12690050
  7  Harrison     Marvin    Colts       WR  12000000

  33. R - Data CleanUp
  Data Cleanup/Reshaping: rbind, cbind, merge, cut
  > tb1
    indiv_id SNP1 SNP2
  1        1    1    1
  2        2    1    1
  3        3    0    0
  4        4    1    0
  > tb2
    indiv_id cov1 cov2
  1        1 1.14 74.6
  2        3 4.50 79.4
  3        4 0.80 48.2
  4        6 1.39 68.1
  > merge(tb1, tb2, by="indiv_id", all=TRUE)
    indiv_id SNP1 SNP2 cov1 cov2
  1        1    1    1 1.14 74.6
  2        2    1    1   NA   NA
  3        3    0    0 4.50 79.4
  4        4    1    0 0.80 48.2
  5        6   NA   NA 1.39 68.1

  34. R - Data CleanUp
  Data Cleanup/Reshaping: rbind, cbind, merge, cut
  > clinical.trial
      patient      age year.enroll
  1         1 47.41745        1995
  2         2 64.92090        1989
  3         3 49.66592        1987
  4         4 68.23656        1985
  .
  .
  100      49.45632           1988
  > c1 <- cut(clinical.trial$age, breaks = 4)
  > table(c1)
  c1
    (41.1,50]   (50,58.8] (58.8,67.6] (67.6,76.4]
            9          34          41          16
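  cut() can also attach human-readable labels to the bins. A small self-contained sketch (the ages here are simulated, not the clinical.trial data from the slide):
  > ages <- runif(100, min = 41, max = 76)            # 100 simulated ages
  > decade <- cut(ages, breaks = c(40, 50, 60, 70, 80),
  +               labels = c("40s", "50s", "60s", "70s"))
  > table(decade)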

  35. R - Data Modeling
  Data Modeling: Linear Regression
  > ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
  > trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
  > group <- gl(2, 10, 20, labels=c("Ctl","Trt"))
  > weight <- c(ctl, trt)
  > m.D9 <- lm(weight ~ group)
  > summary(resid(m.D9))
        Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
  -1.071e+00 -4.938e-01  6.850e-02  2.498e-17  2.462e-01  1.369e+00
  > opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
  > coef(m.D9)
  (Intercept)    groupTrt
        5.032      -0.371
  > plot(m.D9, las = 1)

  36. R - Data Modeling The typical R workflow is to 1) clean, massage, and reshape data, 2) model it, and 3) visualize it:

  37. R - Data Modeling
  Data Modeling: K-Means Clustering
  > x <- rbind(matrix(rnorm(1000, sd = 0.3), ncol = 2),
  +            matrix(rnorm(1000, mean = 1, sd = 0.3), ncol = 2))   # Create two populations
  > colnames(x) <- c("x", "y")
  > cl <- kmeans(x, 2)
  > plot(x, col = cl$cluster, main = "Rigged Example")
  > c1 <- kmeans(x, 4)

  38. R - Graphics
  > fuel
            row.names Weight  Disp. Mileage     Fuel  Type
  1    Eagle.Summit.4   2560   0.97      33 3.030303 Small
  2     Ford.Escort.4   2345 114.00      33 3.030303 Small
  3    Ford.Festiva.4   1845   0.81      37 2.702703 Small
  4     Honda.Civic.4   2260   0.91      32 3.125000 Small
  5   Mazda.Protege.4   2440 113.00      32 3.125000 Small
  6  Mercury.Tracer.4   2285   0.97      26 3.846154 Small
  .
  .
  59  Nissan.Axxess.4   3185 146.00      20 5.000000   Van
  60     Nissan.Van.4   3690 146.00      19 5.263158   Van
  > names(fuel)
  [1] "row.names" "Weight" "Disp." "Mileage" "Fuel" "Type"
  > hist(fuel$Mileage, col="green", border="black", main="R Histogram Demo",
  +      breaks=25, xlab="Auto Fuel Mileage")

  39. R - Graphics >hist(fuel$Mileage,col="green", border="black",main="R Histogram Demo",breaks=25,xlab="Auto Fuel Mileage")

  40. R - Graphics
  > boxplot(Weight ~ Mileage, data=fuel, main="Box Plot Example",
  +         xlab="Mileage", ylab="Weight in lbs")

  41. R - Graphics
  > plot(fuel$Fuel, type="l")
  > plot(fuel$Fuel, type="b"); title("both")
  > plot(fuel$Fuel, type="o"); title("overstruck")
  > plot(fuel$Fuel, type="h"); title("high density")

  42. ggplot2 - A Better Graphics Package?
  > ggplot(diamonds, aes(depth, fill = cut)) + geom_density(alpha = 0.2) + xlim(55, 70)
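  For a second taste of the grammar-of-graphics style, a hedged sketch using the built-in mtcars data rather than anything from the slides:
  > library(ggplot2)
  > ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  +   geom_point(size = 3) +
  +   labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")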

  43. BioConductor
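  Bioconductor packages are installed from their own repository rather than straight from CRAN. A hedged sketch of fetching the packages used in the slides that follow (BiocManager is the current route; older R releases used a biocLite() script instead, and some of these older packages may no longer be in the current Bioconductor release):
  > install.packages("BiocManager")       # the installer itself lives on CRAN
  > BiocManager::install(c("affy", "CLL", "affyQCReport", "affyPLM"))
  > library(affy)                         # once installed, load as usual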

  44. An Example - Affymetrix Data
  Affymetrix data is distributed in .CEL files. The Bioconductor “affy” package has methods to read these files and do some intelligent processing on them, in particular the “ReadAffy()” method.
  > library(affy); library(CLL)      # The CLL package ships some example data
  > data(CLLbatch)                   # Load the example AffyBatch
  > CLLbatch
  AffyBatch object
  size of arrays=640x640 features (91215 kb)
  cdf=HG_U95Av2 (12625 affyids)
  number of samples=24
  number of genes=12625
  annotation=hgu95av2
  notes=
  > sampleNames(CLLbatch)
   [1] "CLL10.CEL" "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" "CLL15.CEL"
   [7] "CLL16.CEL" "CLL17.CEL" "CLL18.CEL" "CLL19.CEL" "CLL1.CEL"  "CLL20.CEL"
  [13] "CLL21.CEL" "CLL22.CEL" "CLL23.CEL" "CLL24.CEL" "CLL2.CEL"  "CLL3.CEL"
  [19] "CLL4.CEL"  "CLL5.CEL"  "CLL6.CEL"  "CLL7.CEL"  "CLL8.CEL"  "CLL9.CEL"

  45. An Example - Affymetrix Data
  > class(CLLbatch)
  [1] "AffyBatch"
  attr(,"package")
  [1] "affy"
  > slotNames(CLLbatch)
  [1] "cdfName"        "nrow"           "ncol"
  [4] "assayData"      "phenoData"      "featureData"
  [7] "experimentData" "annotation"     "protocolData"
  > CLLbatch@annotation
  [1] "hgu95av2"
  > CLLbatch@phenoData
  An object of class "AnnotatedDataFrame"
    sampleNames: CLL10.CEL, CLL11.CEL, ..., CLL9.CEL (24 total)
    varLabels and varMetadata description:
      sample: arbitrary numbering
  > CLLbatch@featureData
  An object of class "AnnotatedDataFrame"
    featureNames: 1, 2, ..., 409600 (409600 total)
    varLabels and varMetadata description: none

  46. An Example - Affymetrix Data
  > CLLbatch[1, 1:3]
  AffyBatch object
  size of arrays=640x640 features (24012 kb)
  cdf=HG_U95Av2 (12625 affyids)
  number of samples=3
  number of genes=12625
  annotation=hgu95av2
  notes=
  > CLLbatch[1, 1:3]@assayData        # Look at the expression data
  $exprs
    CLL10.CEL CLL11.CEL CLL12.CEL
  1     183.0     113.0     119.0
  2   11524.8    6879.8    7891.3
  ..
  > library(affyQCReport)             # Load up some QC methods
  > saqc = qc(CLLbatch)
  > slotNames(saqc)
   [1] "scale.factors"      "target"             "percent.present"
   [4] "average.background" "minimum.background" "maximum.background"
   [7] "spikes"             "qc.probes"          "bioBCalls"
  [10] "arraytype"

  47. An Example - Affymetrix Data > plot(saqc) # Look at the Quality Control Plot

  48. An Example - Affymetrix Data
  Determine if one (or more) arrays are significantly different from the others. Compute the median of the absolute values of differences between pairs of arrays.
  > library(genefilter)                 # dist2() comes from the genefilter package
  > dd = dist2(log2(exprs(CLLbatch)))
  > dd.row = as.dendrogram(hclust(as.dist(dd)))
  > plot(dd.row)

  49. An Example - Affymetrix Data
  > library(affyPLM)                    # fitPLM() lives in the affyPLM package
  > sampleNames(CLLbatch) = sub("\\.CEL$", "", sampleNames(CLLbatch))
  > dataPLM = fitPLM(CLLbatch)
  # Plot NUSE - Normalized Unscaled Standard Error
  > boxplot(dataPLM, main="NUSE", ylim=c(0.95, 1.22), outline=FALSE,
  +         las=3, whisklty=0, staplelty=0)

  50. An Example - Affymetrix Data
  # Now plot RLE - Relative Log Expression
  > Mbox(dataPLM, ylim=c(-0.4, 0.4), main="RLE", col="red",
  +      las=3, whisklty=0, staplelty=0)
