
Introduction to R



Presentation Transcript


  1. Introduction to R • "Statistics are no substitute for judgment" (Henry Clay, U.S. congressman and senator)

  2. R • R is a free software environment for statistical computing and graphics • Object-oriented • It runs on a wide variety of platforms • Highly extensible • Command line and GUI • Conflict between extensible and GUI

  3. RStudio • Datasets • Scripts • Results • Files, plots, packages, & help

  4. Creating a project • Project > Create Project… Store all R scripts and data in the same folder or directory by creating a project

  5. Script • # CO2 parts per million for 2000-2009 • co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78) • year <- (2000:2009) # a range of values • # show values • co2 • year • #compute mean and standard deviation • mean(co2) • sd(co2) • plot(year,co2) • A script is a set of R commands • A program • c is short for combine in c(369.40, …)

  6. Exercise • Plot kWh per square foot by year for the following University of Georgia data. • Smart editing • Copy each column to a word processor • Convert table to text • Search and replace commas with nothing (delete them) • Search and replace returns with commas • Edit to put R text around numbers

  7. Datasets • A dataset is a table • One row for each observation • Columns contain observation values • Same as the relational model • R supports multiple data structures and multiple data types

  8. Data structures • co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78) • year <- (2000:2009) • co2[2] # get the second value • m <- matrix(1:12, nrow=4,ncol=3) • m[4,3] • Vector • A single row table where data are all of the same type • Matrix • A table where all data are of the same type
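The "same type" rule for vectors and matrices can be checked directly. The sketch below is illustrative only; the names v, m_col, and m_row are not from the slides. It shows that mixing types in c() coerces every element to one type, and that matrix() fills column by column unless byrow = TRUE is given.

```r
# A vector holds a single type: mixing types coerces all elements
v <- c(1, "a", TRUE)
class(v)                 # "character"

# matrix() fills column by column by default
m_col <- matrix(1:12, nrow = 4, ncol = 3)
m_col[1, 3]              # 9

# byrow = TRUE fills row by row instead
m_row <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
m_row[1, 3]              # 3
```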

  9. Exercise Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18

  10. Data structures • a <- array(1:24, c(4,3,2)) • a[1,1,1] • gender <- c("m","f","f") • age <- c(5,8,3) • df <- data.frame(gender,age) • df[1,2] • df[1,] • df[,2] • Array • Extends a matrix beyond two dimensions • Data frame • Same as a relational table • Columns can have different data types • Typically, read a file to create a data frame

  11. Data structures • l <- list(co2,m,df) • l[[3]] # third element of the list (the data frame df) • l[[1]][2] # second value of the first element (co2) • List • An ordered collection of objects • Can store a variety of objects under one name
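List elements can also be named, which is often easier to read than positional indexes. A minimal sketch, assuming shortened versions of the co2 and df objects from the earlier slides; the element names readings and people are made up for illustration.

```r
co2 <- c(369.40, 371.07, 373.17)
df <- data.frame(gender = c("m", "f", "f"), age = c(5, 8, 3))

# Name each element when building the list
l <- list(readings = co2, people = df)

l$readings[2]            # 371.07
l[["people"]]$age        # 5 8 3

# Single brackets return a sub-list, double brackets return the element itself
class(l["readings"])     # "list"
class(l[["readings"]])   # "numeric"
```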

  12. Objects • Anything that can be assigned to a variable • Constant • Data structure • Function • Graph • …

  13. Types of data • Classification • Nominal • Sorting or ranking • Ordinal • Measurement • Interval • Ratio

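  14. Factors • Nominal and ordinal data are factors • Before R 4.0.0, strings were converted to factors by default when a data frame was created; from R 4.0.0 onwards they remain character unless stringsAsFactors=TRUE is specified • Factors determine how data are analyzed and presented • Failure to realize a column contains a factor can cause confusion • Use str() to find out a data frame's structure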
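A short sketch of nominal versus ordinal factors. The size values and variable names are hypothetical, chosen only to show how levels are ordered and how str() reports a factor column.

```r
size <- c("small", "large", "medium", "small")

# Nominal: levels default to alphabetical order
f_nominal <- factor(size)
levels(f_nominal)              # "large" "medium" "small"

# Ordinal: state the ranking explicitly
f_ordinal <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
f_ordinal[1] < f_ordinal[2]    # TRUE, because small ranks below large

# str() shows which columns are factors
df <- data.frame(size = f_ordinal, count = c(3, 1, 2, 5))
str(df)
table(df$size)                 # counts reported in level order
```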

  15. Missing values • Missing values are indicated by NA (not available) • Arithmetic expressions and functions containing missing values generate missing values • Use the na.rm=T option to exclude missing values from calculations • sum(c(1,NA,2)) # returns NA • sum(c(1,NA,2),na.rm=T) # returns 3

  16. Missing values • Remove rows with missing values by using na.omit() • gender <- c("m","f","f","f") • age <- c(5,8,3,NA) • df <- data.frame(gender,age) • df2 <- na.omit(df)
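Before removing rows it can help to find where the missing values are. The sketch below reuses the gender and age vectors from the slide; is.na() and complete.cases() are base R functions, shown here as one possible way to inspect the data.

```r
gender <- c("m", "f", "f", "f")
age <- c(5, 8, 3, NA)
df <- data.frame(gender, age)

is.na(age)                     # FALSE FALSE FALSE TRUE
sum(is.na(age))                # 1 missing value

mean(age)                      # NA, because one value is missing
mean(age, na.rm = TRUE)        # mean of the three known values

# complete.cases() flags rows without any NA; keeping them matches na.omit()
df_complete <- df[complete.cases(df), ]
nrow(df_complete)              # 3
nrow(na.omit(df))              # 3
```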

  17. Packages • R's base set of packages can be extended by installing additional packages • Over 4,000 packages • Search the R Project site to identify packages and functions • Install using RStudio • Packages must be installed prior to use, and their use must be specified in a script • require(packagename)

  18. Packages • # install ONCE on your computer • # can also use RStudio to install • install.packages("knitr") • # require EVERY TIME before using a package in a session • # loads the package to memory • require(knitr)

  19. Compile a notebook • A notebook is a report of an analysis • Interweaves R code and output • File > Compile Notebook … • Select html, pdf, or Word output • Install knitr before use • Install suggested packages

  20. PDF

  21. Reading a file • R can read a wide variety of input formats • Text • Statistical package formats (e.g., SAS) • DBMS

  22. Reading a text file • t <- read.table("~/Dropbox/Documents/R/Data/centralparktemps.txt", header=T, sep=',') • The path above is the author's local file; R will not find it on your computer • Delimited text file, such as CSV • Creates a data frame • Specify as required • Presence of header • Separator • Row names
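To try read.table() without a file on disk, the text argument accepts the file contents as a string. The rows below are made-up values in the year, month, temperature layout the slides assume; they are not taken from the Central Park file.

```r
# Hypothetical sample rows, not real measurements
csv_text <- "year,month,temperature
1999,1,33.6
1999,2,38.7
2000,1,31.3"

# header = TRUE: first line holds column names; sep = ",": comma-delimited
t_sample <- read.table(text = csv_text, header = TRUE, sep = ",")

str(t_sample)        # a data frame with 3 observations of 3 variables
```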

  23. Reading a text file • Can read a file using a URL • url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt" • t <- read.table(url, header=T, sep=',')

  24. Learning about an object • Click on the name of the file in the top-right window to see its content • Click on the blue icon of the file in the top-right window to see its structure • url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt" • t <- read.table(url, header=T, sep=',') • head(t) # first few rows • tail(t) # last few rows • dim(t) # dimension • str(t) # structure of a dataset • class(t) # type of object

  25. Referencing data • Reference a column as datasetName$columnName • url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt" • t <- read.table(url, header=T, sep=',') • # qualify with the dataset name to reference columns • mean(t$temperature) • max(t$year) • range(t$month)

  26. Creating a new column • url <- "http://people.terry.uga.edu/rwatson/data/centralparktemps.txt" • t <- read.table(url, header=T, sep=',') • # compute Celsius and round to one decimal place • t$Ctemp <- round((t$temperature-32)*5/9,1)

  27. Reshaping Melt Cast • Converting data from one format to another • Wide to narrow

  28. Reshaping • require(reshape) • url <- 'http://people.terry.uga.edu/rwatson/data/meltExample.csv' • s <- read.table(url, header=F, sep=',') • colnames(s) <- c('year', 1:12) • m <- melt(s,id='year') # wide to narrow • colnames(m) <- c('year','month','co2') • c <- cast(m,year~month, value='co2') # narrow back to wide
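The same wide-to-narrow idea can be tried without installing a package. This is a sketch using base R's reshape() function, which is a different tool from the reshape package's melt() and cast(); the data frame and its values are made up and cover three months only.

```r
# Hypothetical wide table: one row per year, one column per month
wide <- data.frame(year = c(2000, 2001),
                   m1 = c(369.1, 370.2),
                   m2 = c(369.5, 370.9),
                   m3 = c(370.6, 372.4))

# Wide to narrow: one row per year and month
long <- reshape(wide, direction = "long",
                varying = list(c("m1", "m2", "m3")),  # columns to stack
                v.names = "co2",                      # name of the stacked column
                timevar = "month", times = 1:3,       # new column identifying the month
                idvar = "year")
long <- long[order(long$year, long$month), ]
nrow(long)           # 6 rows: 2 years x 3 months

# Narrow back to wide
wide_again <- reshape(long, direction = "wide",
                      idvar = "year", timevar = "month", v.names = "co2")
ncol(wide_again)     # 4 columns: year plus one per month
```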

  29. Writing files • url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt' • t <- read.table(url, header=T, sep=',') • # compute Celsius and round to one decimal place • t$Ctemp <- round((t$temperature-32)*5/9,1) • colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit • write.table(t,"centralparktempsCF.txt") • The file is stored in the project's folder

  30. sqldf • An R package for using SQL with data frames • Returns a data frame • Supports MySQL

  31. Subset • Selecting rows • trow <- t[t$year == 1999,] • trowSQL <- sqldf("select * from t where year = 1999") • Selecting columns • tcol <- t[,c(1:2,4)] • tcolSQL <- sqldf("select year, month, Ctemp from t") • Selecting rows and columns • trowcol <- t[(t$year > 1989 & t$year < 2000) ,c(1:2,4)] • trowcolSQL <- sqldf("select year, month, Ctemp from t where year > 1989 and year < 2000")
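Base R also offers subset(), which selects rows and columns by name rather than by position. A sketch on a small made-up data frame (t_sample and its values are hypothetical) with the year, month, Ftemp, and Ctemp columns used in the slides:

```r
# Hypothetical rows in the slides' column layout
t_sample <- data.frame(year  = c(1989, 1995, 1999, 2000),
                       month = c(6, 7, 8, 9),
                       Ftemp = c(71.5, 77.2, 76.4, 66.0),
                       Ctemp = c(21.9, 25.1, 24.7, 18.9))

# Rows for the 1990s, keeping three columns by name
nineties <- subset(t_sample, year > 1989 & year < 2000,
                   select = c(year, month, Ctemp))

# Bracket form selecting the same rows and columns
nineties2 <- t_sample[t_sample$year > 1989 & t_sample$year < 2000,
                      c("year", "month", "Ctemp")]
nrow(nineties)       # 2
```

Naming columns avoids the risk that positions such as c(1:2,4) point at the wrong column after the data frame changes.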

  32. Sort • Sort on year descending and month ascending • Sorting on column name • s <- t[order(-t$year, t$month),] • sSQL <- sqldf("select * from t order by year desc, month") • Sorting on column position (column 1 descending and column 2 ascending) • s <- t[order(-t[,1], t[,2]),]
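The minus sign only works for numeric columns. As a sketch of an alternative, order() accepts a vector for decreasing when method = "radix" is used, which also works for character columns. The data frame d is made up for illustration.

```r
d <- data.frame(year = c(1999, 2000, 1999, 2000),
                month = c(2, 1, 1, 2))

# Negating a numeric column sorts it in descending order
s <- d[order(-d$year, d$month), ]

# One TRUE/FALSE per sort key; requires method = "radix"
s2 <- d[order(d$year, d$month, decreasing = c(TRUE, FALSE), method = "radix"), ]

s$year      # 2000 2000 1999 1999
s$month     # 1 2 1 2
```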

  33. Recoding • Some analyses might be facilitated by the recoding of data • Split a continuous measure into two categories • t$Category <- 'Other' • t$Category[t$Ftemp >= 30] <- 'Hot'
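For two categories, ifelse() does the recoding in one step; for more than two, cut() splits a numeric column at chosen break points. The temperatures, break points, and labels below are arbitrary values picked for illustration.

```r
Ftemp <- c(25.0, 45.5, 68.2, 80.1)   # made-up Fahrenheit values

# Two categories in a single step, same threshold as the slide
two <- ifelse(Ftemp >= 30, "Hot", "Other")
two                                   # "Other" "Hot" "Hot" "Hot"

# Three categories: up to 32, above 32 up to 60, above 60
three <- cut(Ftemp, breaks = c(-Inf, 32, 60, Inf),
             labels = c("Freezing", "Mild", "Warm"))
as.character(three)                   # "Freezing" "Mild" "Warm" "Warm"
```

cut() returns a factor, so the result behaves as categorical data in tables and plots.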

  34. Deleting a column • Assign NULL to the column • t$Category <- NULL

  35. Exercise • Download the spreadsheet of monthly mean CO2 measurements (PPM) taken at the Mauna Loa Observatory from 1958 onwards http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html • Export a CSV file that contains three columns: year, month, and average CO2 • Read the file into R • Recode missing values (-99.99) to NA • Plot year versus CO2

  36. Tabulating data • Report counts • url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt' • t <- read.table(url, header=T, sep=',') • # tabulate temperatures in integer format • table(round(t$temperature,0)) • sqldf("select round(temperature,0) as temperature, count(*) as frequency from t group by round(temperature,0)") • # tabulate temperatures by month • table(round(t$temperature,0),t$month) • sqldf("select month, round(temperature,0) as temperature, count(*) as frequency from t group by month, round(temperature,0)")

  37. Aggregate data • Summarize data using a specified function • Compute the mean monthly temperature for each year • # average temperature for each year • a <- aggregate(t$temperature, by=list(t$year), FUN=mean) • # name columns • colnames(a) <- c('year', 'mean') • a • sqldf("select year, avg(temperature) as mean from t group by year")
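aggregate() also has a formula interface that keeps the column names, so the grouping column does not need to be renamed afterwards. A sketch with made-up temperatures (t_sample is hypothetical):

```r
t_sample <- data.frame(year = c(1999, 1999, 2000, 2000),
                       month = c(1, 2, 1, 2),
                       temperature = c(30, 40, 32, 36))

# Read as: temperature summarized by year
a <- aggregate(temperature ~ year, data = t_sample, FUN = mean)
a$temperature                 # 35 34

# tapply() returns the same means as a named vector
tapply(t_sample$temperature, t_sample$year, mean)
```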

  38. Merging files • There must be a common column in both files • url <- 'http://people.terry.uga.edu/rwatson/data/centralparktemps.txt' • t <- read.table(url, header=T, sep=',') • # average yearly temperature • a <- aggregate(t$temperature, by=list(t$year), FUN=mean) • # name columns • colnames(a) <- c('year', 'meanTemp') • # read carbon data (http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html) • url <- 'http://people.terry.uga.edu/rwatson/data/carbon1959-2011.txt' • carbon <- read.table(url, header=T, sep=',') • m <- merge(carbon,a,by='year') • mSQL <- sqldf("select a.year, CO2, meanTemp from a, carbon where a.year = carbon.year")
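By default merge() keeps only the rows whose key appears in both data frames. The all, all.x, and all.y arguments keep unmatched rows and fill the gaps with NA. A sketch with made-up values for a and carbon:

```r
# Hypothetical values; only the year ranges matter here
a <- data.frame(year = c(2009, 2010, 2011), meanTemp = c(54.0, 56.7, 56.4))
carbon <- data.frame(year = c(2010, 2011, 2012), CO2 = c(389.9, 391.6, 393.8))

inner <- merge(carbon, a, by = "year")                 # years in both
left  <- merge(carbon, a, by = "year", all.x = TRUE)   # every carbon row
outer <- merge(carbon, a, by = "year", all = TRUE)     # every year from either

nrow(inner)   # 2
nrow(left)    # 3
nrow(outer)   # 4
```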

  39. Correlation coefficient • cor.test(m$meanTemp,m$CO2) • Pearson's product-moment correlation • data: m$meanTemp and m$CO2 • t = 3.1173, df = 51, p-value = 0.002997 • 95 percent confidence interval: 0.1454994 0.6049393 • sample estimates: cor 0.4000598 • Significant

  40. Concatenating files • Taking a set of files with the same structure and creating a single file • Same type of data in corresponding columns • Files should be in the same directory

  41. Concatenating files • Local directory • # read the file names from a local directory • filenames <- list.files("homeC-all/homeC-power", pattern="\\.csv$", full.names=TRUE) • # append the files one after another • for (i in seq_along(filenames)) { • # create the concatenated data frame using the first file • if (i == 1) { • cp <- read.table(filenames[i], header=F, sep=',') • } else { • temp <- read.table(filenames[i], header=F, sep=',') • cp <- rbind(cp, temp) # append to existing data frame • rm(temp) # remove the temporary data frame • } • } • colnames(cp) <- c('time','watts')
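The loop can be written more compactly by reading every file with lapply() and stacking the results with do.call(rbind, ...). The sketch below is an alternative to the slide's loop, not the author's code; it writes two small made-up files to a temporary directory (power_demo is a hypothetical name) so it runs anywhere.

```r
# Set up: two small CSV files without headers in a temporary directory
demo_dir <- file.path(tempdir(), "power_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(c(1, 2), c(100, 110)), file.path(demo_dir, "a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(3, 4), c(120, 130)), file.path(demo_dir, "b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# Read each file into a list of data frames, then stack them
filenames <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
pieces <- lapply(filenames, read.table, header = FALSE, sep = ",")
cp <- do.call(rbind, pieces)
colnames(cp) <- c("time", "watts")

nrow(cp)      # 4
```

Stacking once at the end avoids copying the growing data frame on every iteration, which matters when there are many files.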

  42. Concatenating files • Remote directory with FTP • Takes a while to run • # read the file names from a remote directory (FTP) • require(RCurl) • url <- "ftp://watson_ftp:bulldawg1989@people.terry.uga.edu/rwatson/power/" • dir <- getURL(url, dirlistonly = T) • filenames <- unlist(strsplit(dir,"\n")) # split into filenames • # append the files one after another • for (i in 1:length(filenames)) { • file <- paste(url,filenames[i],sep='') # concatenate for url • if (i == 1) { • cp <- read.table(file, header=F, sep=',') • } else { • temp <- read.table(file, header=F, sep=',') • cp <- rbind(cp, temp) # append to existing data frame • rm(temp) # remove the temporary data frame • } • } • colnames(cp) <- c('time','kwh')

  43. Database access: Server • MySQL access • Might need to edit the path to the JAR file • require(rJava) • require(RJDBC) • drv <- JDBC("com.mysql.jdbc.Driver", "$:/usr/share/java/mysql-connector-java-5.1.16.jar") • # connect to the database • # change user and pwd for your values • conn <- dbConnect(drv, "jdbc:mysql://wallaby.terry.uga.edu/Weather", "student", "student") • # query the database and create data frame t for use with R • t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;") • head(t)

  44. Database access: Personal computer • MySQL access • You need the appropriate Java ARchive (JAR) file for MySQL access installed on your computer • http://dev.mysql.com/downloads/connector/j/

  45. Database access: Personal computer • Decompress and move mysql-connector-java-5.1.xx-bin.jar (xx is the release number) • OS X • Macintosh HD/Library/Java/Extensions • Windows • c:\jre\lib\ext

  46. Database access: Personal computer • MySQL access • Change path for Windows • require(RJDBC) • # load the driver; change path to point to your JAR file • drv <- JDBC("com.mysql.jdbc.Driver", "Macintosh HD/Library/Java/Extensions/mysql-connector-java-5.1.33-bin.jar") • # connect to the database • # change user and pwd for your values • conn <- dbConnect(drv, "jdbc:mysql://wallaby.terry.uga.edu/Weather", "student", "student") • # query the database and create data frame t for use with R • t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;") • head(t)

  47. Exercise • Using the Atlanta weather database and the lubridate package • Compute the average temperature at 5 pm in August • Determine the maximum temperature for each day in August for each year

  48. Resources • R books • Reference card • Quick-R

  49. Key points • R is a platform for a wide variety of data analytics • Statistical analysis • Data visualization • HDFS and MapReduce • Text mining • Energy Informatics • R is a programming language • Much to learn
