Exploratory Data Analysis

pegeen
Presentation Transcript


  1. Exploratory Data Analysis: Observations of a single variable

  2. Example: In 1798 Cavendish made 29 determinations of the density of the Earth, relative to that of water. His results are stored in R:

     > density
      [1] 5.50 5.57 5.42 5.61 5.53 5.47 4.88 5.62 5.63 4.07 5.29 5.34 5.26 5.44 5.46
     [16] 5.55 5.34 5.30 5.36 5.79 5.75 5.29 5.10 5.86 5.58 5.27 5.85 5.65 5.39

     Source: The Data and Story Library: http://lib.stat.cmu.edu/DASL. Note that these are observations of a continuous variable, as are measurements of all kinds in general.

  3. Of interest, of course, is to estimate the true density of the Earth. A useful simple display is given by stem(density), while simple summary statistics are produced by the functions mean, median, sd, summary, etc. In particular we have

     > mean(density)
     [1] 5.42
     > median(density)
     [1] 5.46

     The standard deviation of the observations is 0.34.
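The summary functions named above can be tried out in one go. The following is a minimal sketch, assuming the 29 values are typed in exactly as listed on the earlier slide; the helper vectors `centre` and `spread` are illustrative names, not part of the original presentation.

```r
# Cavendish's 29 determinations, copied from the listing in the slides.
density <- c(5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
             5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
             5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39)

stem(density)      # stem-and-leaf display
summary(density)   # minimum, quartiles, median, mean, maximum

# Measures of location and spread collected into named vectors
# ('centre' and 'spread' are hypothetical names used only here).
centre <- c(mean = mean(density), median = median(density))
spread <- c(sd = sd(density), IQR = IQR(density), mad = mad(density))

round(centre, 2)
round(spread, 2)
```

The interquartile range and median absolute deviation are included alongside the standard deviation because they react less strongly to extreme observations.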

  4. A histogram is given by > hist(density,breaks=seq(4,6,0.2), xlab = "relative density of Earth")

  5. Clearly there is at least one low outlier in the data. Thus the median may give a better estimate of the true density than the mean. Now let us investigate the extent to which the data can be modelled as a random sample from some underlying normal distribution.

  6. A normal Q-Q plot can be used to examine this. Recall that this is a plot of the sorted observations against what is effectively an idealised sample of the same size from the N(0, 1) distribution. The fitted line corresponds to the normal distribution with the same first and third quartiles as the data.

  7. The plot and the fitted line are constructed with

     > qqnorm(density)
     > qqline(density)

  8. The line has intercept 5.46 and slope 0.23, which provide reasonable estimates of the mean and standard deviation of the best-fitting normal distribution. The plot again suggests that at least the lowest observation should be ignored.
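The intercept and slope quoted above can be recovered directly from the quartiles. This is a sketch under the assumption that qqline is used with its default settings (line through the first and third quartiles, R's default quantile definition); the variable names are illustrative.

```r
# Cavendish's 29 determinations, copied from the listing in the slides.
density <- c(5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
             5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
             5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39)

# Quartiles of the data and of the standard normal distribution.
q_data   <- quantile(density, c(0.25, 0.75), names = FALSE)
q_normal <- qnorm(c(0.25, 0.75))

# Straight line through the two (normal quartile, data quartile) points.
slope     <- diff(q_data) / diff(q_normal)
intercept <- q_data[1] - slope * q_normal[1]

c(intercept = intercept, slope = slope)
```

Because the line depends only on the quartiles, its slope is much less affected by the low outlier than the sample standard deviation of 0.34.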

  9. An approximate 95% confidence interval for the true mean of the underlying distribution of the data, based on using all the data, is given by

     > mean(density) + c(-1,1) * qnorm(0.975) * sqrt(var(density)/length(density))

  10. This returns

     [1] 5.30 5.54

     To correct for the fact that the sample variance is only an estimate of the underlying true variance, we can use t.test(density), which gives a 95% confidence interval of [5.29, 5.55]. The generally accepted modern-day value for the relative density of the Earth is 5.52.
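The correction made by t.test amounts to replacing the normal quantile with a quantile of the t distribution on n - 1 degrees of freedom. The sketch below computes both intervals by hand and checks the second against t.test; the names `se`, `ci_normal`, `ci_t` and `ci_ttest` are illustrative.

```r
# Cavendish's 29 determinations, copied from the listing in the slides.
density <- c(5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
             5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
             5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39)

n  <- length(density)
se <- sd(density) / sqrt(n)   # estimated standard error of the mean

# Interval based on the normal quantile (treats the variance as known).
ci_normal <- mean(density) + c(-1, 1) * qnorm(0.975) * se

# Interval based on the t quantile with n - 1 degrees of freedom.
ci_t <- mean(density) + c(-1, 1) * qt(0.975, df = n - 1) * se

# The same interval as reported by t.test().
ci_ttest <- as.numeric(t.test(density)$conf.int)

round(rbind(normal = ci_normal, t = ci_t, t.test = ci_ttest), 2)
```

With 28 degrees of freedom the t quantile is only slightly larger than the normal one, which is why the two intervals differ only in the second decimal place.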

  11. Example: The R variable photons contains a count of the number of photons produced in each of 60 successive seconds by a very weak light source.

     > photons
      [1] 1 4 1 0 0 1 0 1 2 1 2 2 1 2 4 1 4 5 1 2 1 4 4 1 1 2 4 2 0 3 4 4 4 4 3 1 1 3
     [39] 1 2 6 2 1 2 0 3 0 2 1 2 4 6 1 2 0 1 1 0 3 4

     Here the variable photons is a count (and so discrete). We have 60 observations of it.

  12. In addition to the usual R summary functions, the R function table gives a frequency table:

     > table(photons)
     photons
      0  1  2  3  4  5  6
      8 19 13  5 12  1  2

  13. A histogram can be produced with > hist(photons, breaks=seq(-1,6)). However, since this variable is a count, it is interesting to compare its distribution with the Poisson distribution having the same mean (2.08). The appropriate diagrams are produced with the commands:

  14. > barplot(table(photons), xlab="photon count", ylab="frequency", ylim=c(0,20))

  15. >barplot(60*dpois(0:8,2.08), names=0:8, xlab="photon count", ylab="Poisson expected frequency", ylim=c(0,20))
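As an alternative to two separate diagrams, the observed and expected frequencies can be drawn side by side. This is only a sketch: the matrix `freq` and the use of `beside = TRUE` are choices made here, not taken from the slides, and the Poisson mean is computed from the data rather than rounded to 2.08.

```r
# Photon counts for 60 successive seconds, copied from the listing in the slides.
photons <- c(1, 4, 1, 0, 0, 1, 0, 1, 2, 1, 2, 2, 1, 2, 4, 1, 4, 5, 1, 2,
             1, 4, 4, 1, 1, 2, 4, 2, 0, 3, 4, 4, 4, 4, 3, 1, 1, 3, 1, 2,
             6, 2, 1, 2, 0, 3, 0, 2, 1, 2, 4, 6, 1, 2, 0, 1, 1, 0, 3, 4)

counts <- 0:8

# Observed frequencies, including zero entries for counts never seen (7 and 8).
observed <- as.vector(table(factor(photons, levels = counts)))

# Expected frequencies under a Poisson distribution with the sample mean.
expected <- length(photons) * dpois(counts, lambda = mean(photons))

# One row per series, one column per count value.
freq <- rbind(observed = observed, expected = expected)
colnames(freq) <- counts

barplot(freq, beside = TRUE, legend.text = TRUE,
        xlab = "photon count", ylab = "frequency", ylim = c(0, 20))
```

Placing the bars next to each other makes it easier to see where the two sets of frequencies disagree, most visibly at counts of 3 and 4.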

  16. The chi-squared distribution, χ², can be used to check whether there is a significant difference between the observed and the expected frequencies.

  17. The sum of the last column is 12.7558

  18. This value of χ² can then be compared with tabulated values of the chi-squared distribution for the appropriate number of degrees of freedom (here 6).

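  19. The tabulated 5% critical value for 6 degrees of freedom is 12.59, so the statistic 12.7558 lies just above it. The difference between the two distributions is therefore borderline at the 5% level of significance, and is not significant at the 1% level (critical value 16.81).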
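The chi-squared comparison can be sketched in R as follows. The table behind the slides is not available here, so the grouping of categories is an assumption: counts of 6 or more are pooled into a single class, giving seven classes, and the Poisson mean is taken as the sample mean. The resulting statistic may therefore differ slightly from the 12.7558 quoted above. The 6 degrees of freedom follow the slides; many textbooks subtract a further degree of freedom when the mean is estimated from the data.

```r
# Photon counts for 60 successive seconds, copied from the listing in the slides.
photons <- c(1, 4, 1, 0, 0, 1, 0, 1, 2, 1, 2, 2, 1, 2, 4, 1, 4, 5, 1, 2,
             1, 4, 4, 1, 1, 2, 4, 2, 0, 3, 4, 4, 4, 4, 3, 1, 1, 3, 1, 2,
             6, 2, 1, 2, 0, 3, 0, 2, 1, 2, 4, 6, 1, 2, 0, 1, 1, 0, 3, 4)

# Observed frequencies for the classes 0, 1, ..., 5 and "6 or more"
# (the pooling of the upper tail is an assumption made for this sketch).
observed <- as.vector(table(factor(pmin(photons, 6), levels = 0:6)))

# Poisson probabilities for the same classes; the last one is P(X >= 6).
lambda   <- mean(photons)
p        <- c(dpois(0:5, lambda), ppois(5, lambda, lower.tail = FALSE))
expected <- length(photons) * p

# Pearson chi-squared statistic: sum of (O - E)^2 / E over the classes.
chi_sq <- sum((observed - expected)^2 / expected)

# Compare with the 5% critical value and compute the tail probability,
# using the 6 degrees of freedom stated in the slides.
crit    <- qchisq(0.95, df = 6)
p_value <- pchisq(chi_sq, df = 6, lower.tail = FALSE)

c(statistic = chi_sq, critical_5pct = crit, p_value = p_value)
```

Several expected frequencies in the upper classes are well below 5, so the chi-squared approximation is rough here and the result is best read as indicative rather than exact.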
