Data analysis using R

Data analysis using R Jasminka Dobša Faculty of Organization and Informatics University of Zagreb Workshop on Data Analysis 2018

Outline
• Introduction to R
• Descriptive statistics and graphical representation of data
• Random variables and probability distributions
  • Binomial distribution
  • Normal distribution
  • Testing the normality of distribution
• Statistical testing
• Testing of means for two populations
  • T-test and similar nonparametric tests
• Testing of dependence of two qualitative variables
  • Chi-square test

What is R?
• R is an integrated software environment for data analysis
• It enables:
  • Handling and storing data
  • Graphical representation and visualization of data
  • Operations on vectors (sequences) and matrices
  • Data analysis using statistical and data mining methods
• It contains a simple and efficient programming language
• It can be regarded as an implementation of the S programming language

Installation and links
• We will use
  • R and R Commander (a graphical interface for R)
  • RStudio: an environment for working with R
• Installation:
  • 1st step: R project (https://www.r-project.org/)
  • 2nd step: RStudio (https://www.rstudio.com/products/rstudio/download/)
• Tutorials for R and RStudio (47 videos) by Mike Marin: https://www.youtube.com/watch?v=cX532N_XLIs&list=PLqzoL9-eJTNATicffatWXTEjwMq6N0Sf3

Variables
• Variables are qualitative or quantitative
• Qualitative
  • Takes a small number of values (modalities)
  • Example: colour of eyes (green, blue, brown)
  • Graphical representation by
    • Bar chart
    • Pie chart
• Quantitative (numerical)
  • Takes values on an interval of real numbers
  • Example: height, weight
  • Graphical representation by
    • Histogram
    • Box plot

Descriptive statistics
• A descriptive statistic is a number which
  • Is calculated from the data set
  • Gives summary information about the data set
• Examples
  • Minimum, maximum
  • Mean: the average value
  • Median: the number at the middle position of the sorted series of numbers
  • 1st quartile Q1: divides the sorted series of numbers into two parts, 25%/75%
  • 3rd quartile Q3: divides the sorted series of numbers into two parts, 75%/25%
  • Standard deviation: a measure of dispersion of the data
  • Interquartile range (IQR = Q3 - Q1): a measure of dispersion of the data, the range of the central 50% of the data
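The summary measures listed on this slide can also be computed directly in the R console. The following is a minimal sketch, assuming a small made-up vector `run_time` (hypothetical values, not the Movies data used later).

```r
# Hypothetical sample of run times in minutes (illustrative values only)
run_time <- c(68, 95, 97, 101, 106, 110, 119, 125, 187)

min(run_time)                        # minimum
max(run_time)                        # maximum
mean(run_time)                       # arithmetic mean
median(run_time)                     # middle value of the sorted data
quantile(run_time, c(0.25, 0.75))    # Q1 and Q3 (R's default quantile type)
sd(run_time)                         # sample standard deviation
IQR(run_time)                        # Q3 - Q1
```

`summary(run_time)` prints the minimum, quartiles, median, mean and maximum in one call.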

Example: Data set Movies
• Variables
  • Run time: numerical
  • Budget: numerical
  • Dramas: qualitative (takes values 0 and 1)
  • Stars (evaluation): numerical (1-5)
  • Rating: qualitative
  • Genre: qualitative (action, adventure, comedy, drama, horror, thriller)
  • USGross (earnings): numerical

Descriptive statistics and graphical representation: R Commander
• Graphical representation
  • Graphs → Histogram ...
  • Graphs → Boxplot ...
  • Graphs → Bar graph ...
  • Graphs → Pie chart ...
• Descriptive statistics
  • Statistics → Summaries → Numerical summaries ...

Numerical variable: Run time
[Figure: box plot of Run time, annotated from top to bottom: outliers; upper whisker min(Q3 + 3/2 IQR, max); Q3; median (Me); Q1; lower whisker max(Q1 - 3/2 IQR, min)]
    mean       sd IQR 0% 25% 50% 75% 100%   n
109.4083 19.60685  22 68  97 106 119  187 120

Numerical variable: Budget
[Figures: box plot and histogram of Budget]
    mean       sd      IQR    0%      25%     50%   75%    100%   n
46774375 35675471 35500000 1e+06 24500000 3.5e+07 6e+07 1.8e+08 120

Qualitative variable: Genre
[Figures: pie chart and bar chart of Genre]

Qualitative variable: Genre
• Descriptive statistics for a qualitative variable are given by a frequency table
• R Commander: Statistics → Summaries → Frequency distributions ...
counts:
Genre
   Action Adventure    Comedy     Drama    Horror  Thriller
       20        15        38        28        13         6
percentages:
Genre
   Action Adventure    Comedy     Drama    Horror  Thriller
    16.67     12.50     31.67     23.33     10.83      5.00
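The same kind of frequency table can be produced without the menu. This is a small sketch using base R's `table()` and `prop.table()`; the vector `genre` is a hypothetical stand-in for `Movies$Genre`.

```r
# Hypothetical genre labels (illustrative only, not the Movies data)
genre <- factor(c("Comedy", "Drama", "Comedy", "Action", "Drama", "Comedy"))

counts <- table(genre)                              # absolute frequencies
percentages <- round(100 * prop.table(counts), 2)   # relative frequencies in %

counts
percentages
```

With the workshop data loaded, replacing `genre` by `Movies$Genre` should give counts and percentages of the kind shown on the slide.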

Descriptive statistics by group
• Statistics can be calculated by the levels (factors) of a qualitative variable
• Example: mean by levels of the variable Genre
tapply(Movies$Run.Time..minutes., list(Genre=Movies$Genre), mean,
  na.rm=TRUE)
Genre
   Action Adventure    Comedy     Drama    Horror  Thriller
 114.4000  104.8667  101.4737  123.6786  102.2308  103.3333

Graphs by group: Budget by Genre, boxplot

Graphs by group: Run time by Genre, boxplot

Random variables
• A random event is an event that may or may not occur under a given set of conditions
• A random variable is a function from the set of elementary events to the set of real numbers
• Random variables are denoted by capital letters X, Y, Z, ...
• Discrete random variable
  • A random variable that takes values in a finite or countably infinite set
• Continuous random variable
  • A random variable that takes values on an interval of real numbers

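Discrete random variables
• If a random variable X takes a finite number of values, then it is described by its distribution table
$$X \sim \begin{pmatrix} x_1 & x_2 & \dots & x_n \\ p_1 & p_2 & \dots & p_n \end{pmatrix}$$
where x1, …, xn are the values that X takes and p1, p2, …, pn are the probabilities of these values (p1 + p2 + … + pn = 1)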

Binomial distribution
• An example of a discrete random variable
• A random variable has a binomial distribution if it counts the number of successes in a series of Bernoulli trials (e.g., the number of tails obtained in consecutive coin tosses)
• A Bernoulli trial is an experiment whose outcome is either success or failure (e.g., tossing a coin)
• The binomial distribution has two parameters
  • n: number of trials
  • p: probability of success in a single trial

Binomial distribution
• The random variable X has a binomial distribution with the parameters n and p if X takes values in the set {0, 1, …, k, …, n} with probabilities
$$P(X = k) = \binom{n}{k} p^k q^{n-k}, \quad k = 0, 1, \dots, n$$
where
  • p is the probability of success in each Bernoulli trial
  • q = 1 - p is the probability of failure in each Bernoulli trial
  • n is the number of repeated trials

Binomial distribution: example
• A die is rolled 5 times; let X be the random variable representing the number of sixes obtained
• Questions:
  • A) What is the probability that a six falls exactly 2 times?
  • B) What is the probability that a six falls at least 2 times?
• A) R Commander
  • Distributions → Discrete distributions → Binomial distribution → Binomial probabilities
  Probability
0 0.4020383488
1 0.4018453858
2 0.1606610062
3 0.0321167790
4 0.0032101364
5 0.00012834385

Binomial distribution: example
• B) R Commander
  • Distributions → Discrete distributions → Binomial distribution → Binomial tail probabilities ...
  • pbinom(c(1), size=5, prob=0.166, lower.tail=FALSE) (here 0.166 is a rounded value of 1/6)
[1] 0.1949599
[Figure: plot of the binomial distribution]
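Both questions of the dice example can also be answered in the console. The sketch below assumes a fair die and uses the exact probability 1/6, so its results differ slightly in the third decimal from the slide output, which was produced with a rounded probability.

```r
n <- 5        # number of rolls
p <- 1 / 6    # probability of a six on a fair die (assumed exact here)

# A) P(X = 2)
p_two <- dbinom(2, size = n, prob = p)

# B) P(X >= 2) = P(X > 1), the upper tail of the distribution
p_at_least_two <- pbinom(1, size = n, prob = p, lower.tail = FALSE)

# The same quantity obtained by summing the point probabilities for k = 2, ..., 5
p_check <- sum(dbinom(2:n, size = n, prob = p))

round(c(p_two, p_at_least_two, p_check), 4)   # 0.1608 0.1962 0.1962
```

Note that `lower.tail = FALSE` gives P(X > 1), which is why the first argument is 1 and not 2.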

Normal distribution
• The normal distribution is the distribution of a continuous random variable with the density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where µ is the expectation (central value) and σ is the standard deviation (measure of dispersion)

Standardized random variable
• For computation, the original random variable X is often transformed into the standardized random variable
$$Z = \frac{X - \mu}{\sigma}$$
• The expectation of the standardized random variable is 0, while its variance is 1
• The density function of the standardized normal random variable is
$$\varphi(z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{z^2}{2}}$$

Properties of the normal distribution
• The area below the normal curve is 1
• The curve approaches the x axis asymptotically
• The curve is symmetrical with respect to the line x = µ, and the area on each side of this line is ½
[Figure: standardized normal distribution with the area P(0<X<z) marked]

Example: normal distribution
• Assume that the time needed for a pizza delivery is normally distributed with expectation μ = 30 minutes and standard deviation σ = 10 minutes

Example: normal distribution
• Questions
  • A) Compute the probability that a delivery will last more than 45 minutes
  • B) Compute the time such that 90% of deliveries will last less than it
• Answers
  • A) pnorm(c(45), mean=30, sd=10, lower.tail=FALSE)
    • 0.0668072
    • R Commander: Distributions → Continuous distributions → Normal distribution → Normal probabilities ...
  • B) qnorm(c(0.9), mean=30, sd=10, lower.tail=TRUE)
    • 42.81552
    • R Commander: Distributions → Continuous distributions → Normal distribution → Normal quantiles ...
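The two answers can be cross-checked through standardization. The sketch below recomputes the pizza delivery results once directly and once through the standard normal distribution, using the usual transformation Z = (X - μ)/σ; the variable names are chosen only for this illustration.

```r
mu <- 30      # expectation in minutes
sigma <- 10   # standard deviation in minutes

# A) P(X > 45), directly and through the standardized value z
direct <- pnorm(45, mean = mu, sd = sigma, lower.tail = FALSE)
z <- (45 - mu) / sigma                        # z = 1.5
standardized <- pnorm(z, lower.tail = FALSE)
all.equal(direct, standardized)               # both routes agree

# B) 90th percentile, directly and by transforming the standard normal quantile back
q_direct <- qnorm(0.9, mean = mu, sd = sigma)
q_back <- mu + sigma * qnorm(0.9)
all.equal(q_direct, q_back)
```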

QQ plot
• A quantile is a cut point which divides the area under the graph of the density function in a given ratio of probabilities
• A QQ plot is a graphical method for comparing probability distributions by plotting their quantiles against each other
[Figure: density curve divided by a quantile into areas of 10% and 90%]

QQ plot
• In statistical testing we usually compare the empirical distribution of our data set with the normal distribution
[Figure: QQ plot, Movies: Budget]

QQ plot
[Figure: QQ plot, Movies: Run time]

QQ plot
[Figure: QQ plot, Movies: USGross]
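Normal QQ plots like the ones above can be drawn with base R's `qqnorm()` and `qqline()`. The sketch below uses simulated data (an assumption made for illustration, not the Movies variables) to contrast a roughly normal sample with a right-skewed one.

```r
set.seed(1)                                   # reproducible simulated data
x_normal <- rnorm(120, mean = 110, sd = 20)   # roughly normal sample
x_skewed <- rexp(120, rate = 1 / 50)          # right-skewed sample

op <- par(mfrow = c(1, 2))                    # two panels side by side
qqnorm(x_normal, main = "Normal sample")
qqline(x_normal)                              # reference line through the quartiles
qqnorm(x_skewed, main = "Skewed sample")
qqline(x_skewed)
par(op)                                       # restore the graphics settings
```

Points close to the reference line suggest a distribution close to normal; a curved pattern, as in the skewed sample, suggests a departure from normality.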

Testing the normality of distribution
• Some of the statistical tests for testing the normality of a distribution are
  • Chi-square test
  • Kolmogorov-Smirnov test (KS test)
  • Shapiro-Wilk test
• Hypotheses for testing
  H0: the population is normally distributed
  H1: the population is not normally distributed

Example: test of the normality of distribution
with(Movies, shapiro.test(Budget....))
        Shapiro-Wilk normality test
data:  Budget....
W = 0.86391, p-value = 4.133e-09
        Shapiro-Wilk normality test
data:  Run.Time..minutes.
W = 0.95586, p-value = 0.0005964
        Shapiro-Wilk normality test
data:  USGross....
W = 0.70229, p-value = 2.808e-14
• Decision rule: p-value < level of significance → H1; p-value > level of significance → H0
• Level of significance: 0.05 (5%) or 0.01 (1%)
• None of these variables is normally distributed!

Statistical testing
• A statistical hypothesis is a claim that refers to the whole population
• It is tested on a sample
• Hypothesis testing: the procedure or rule according to which a hypothesis is retained or rejected based on a random sample
• Statistical tests are divided into parametric and nonparametric
  • Parametric tests require that conditions on the shape and characteristics of the distribution of the numerical variables in the population are satisfied
  • Nonparametric tests do not require compliance with such conditions

Hypotheses and errors in inference
• Statistical testing starts by stating the null hypothesis (H0) and the alternative hypothesis (H1)
• It is customary to put the claim we want to prove in the alternative hypothesis
• The decision reached by testing is not categorical: it can be wrong
• Two possible types of errors
  • Type I error
    • Incorrect rejection of a true null hypothesis
    • α: level of significance, the limit value of the probability of rejecting a true null hypothesis
  • Type II error
    • Retaining a false null hypothesis
    • β: limit probability of retaining a false null hypothesis
• Power of the test, 1 - β: probability of rejecting a false null hypothesis

Decision making
• The most common way to make a decision in statistical testing is by using the p-value
• The p-value is
  • the empirical level of significance
  • the probability, computed from the sample under the assumption that the null hypothesis is true, of obtaining a result at least as extreme as the one observed
  • a measure of the degree to which the sample data discredit the null hypothesis
• In testing
  • If the p-value is greater than the significance level α, the null hypothesis is retained
  • If the p-value is less than the significance level α, the null hypothesis is rejected
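The decision rule can be written as a short function. This is only an illustrative sketch: `decide()` is a hypothetical helper (not part of R or R Commander), and the data are simulated.

```r
# Hypothetical helper: applies the p-value decision rule at significance level alpha
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "retain H0"
}

set.seed(1)
x <- rexp(120)                   # simulated, clearly non-normal data
p <- shapiro.test(x)$p.value     # H0: the population is normally distributed
decide(p)                        # the test is expected to reject normality here

decide(0.0005964)                # a p-value well below 0.05: H0 is rejected
decide(0.20)                     # a p-value above 0.05: H0 is retained
decide(0.03, alpha = 0.01)       # the decision depends on the chosen alpha
```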

Testing of means for two populations
• Statistical testing can be carried out by
  • The parametric t-test
  • Nonparametric tests
• We distinguish two types of samples: dependent and independent
• Independent samples
  • The results of observation or measurement in one sample do not depend on the results of observation or measurement in the other sample
• Dependent samples
  • Sample values are obtained by repeated observation or measurement of the selected variables on the same statistical sample, e.g. before and after an experiment (before/after training, before/after treatment)
  • Sample values are given in pairs

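T-test: Hypotheses
• Two-sided test: H0: μ1 - μ2 = 0, H1: μ1 - μ2 ≠ 0
• One-sided test at the upper limit: H0: μ1 - μ2 = 0 (or μ1 - μ2 ≤ 0), H1: μ1 - μ2 > 0
• One-sided test at the lower limit: H0: μ1 - μ2 = 0 (or μ1 - μ2 ≥ 0), H1: μ1 - μ2 < 0
where μ1 and μ2 are the means of the populations.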

T-test: Conditions
• For independent samples there are two cases
  • The variances of the populations σA² and σB² are equal
  • The variances of the populations σA² and σB² are not equal
• To test the equality of variances we will use the F-test, Levene's test and Bartlett's test
• The hypotheses for testing the equality of variances of two populations are H0: σA² = σB², H1: σA² ≠ σB²
• For small samples we will use the t-test
  • Condition: the samples come from normally distributed populations
  • For large samples this condition can be relaxed, but we always check it
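These conditions can be checked in the console before running a t-test. The sketch below uses simulated points for two hypothetical groups and only functions from base R's stats package (`var.test()`, `bartlett.test()`, `shapiro.test()`); Levene's test is not in base R and is left out of the sketch.

```r
set.seed(2)
# Simulated exam points for two hypothetical groups (not the workshop data)
points <- c(rnorm(10, mean = 78, sd = 8), rnorm(12, mean = 83, sd = 8))
group  <- factor(rep(c("A", "B"), times = c(10, 12)))

f_res <- var.test(points ~ group)        # F-test, H0: the variances are equal
b_res <- bartlett.test(points ~ group)   # Bartlett's test, H0: the variances are equal
c(F_test = f_res$p.value, Bartlett = b_res$p.value)

# Normality check within each group (Shapiro-Wilk p-values)
tapply(points, group, function(x) shapiro.test(x)$p.value)
```

If the p-values of the variance tests are above the chosen significance level, equality of variances is retained and the t-test with `var.equal = TRUE` can be considered.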

Similar nonparametric tests
• With nonparametric tests we test the difference of medians for two populations
• Testing the difference of medians for two independent samples
  • Mann-Whitney-Wilcoxon test for independent samples (MWW test, rank sum test)
• Testing the difference of medians for two dependent samples
  • Sign test
  • Wilcoxon matched-pairs signed rank test

Testing by nonparametric tests
• The conditions for nonparametric tests are weaker than for parametric tests
• The data are represented in the form of signs or ranks
  • Part of the information is lost
  • The power of nonparametric tests is lower than the power of parametric tests

Example: independent samples 1/4
A professor has two groups of students (A and B). The exam was held jointly for both groups. The table shows the points scored by the students in each group:
[Table: Points by Group]
Is it possible to conclude, at the 5% level of significance, that group A performed worse on the exam than group B? Test using the t-test for independent samples and using the MWW test.

Example: independent samples 2/4
• Using the F-test, Levene's test and Bartlett's test, it is found that the variances do not differ significantly, so we apply the t-test with the assumption of equal variances
• Using the Shapiro-Wilk test, it is established that the data are normally distributed only for the first group
• Hypotheses: H0: μ1 - μ2 = 0, H1: μ1 - μ2 < 0
• As a result, in R Commander we get:
        Two Sample t-test
data:  Points by Group
t = -1.2709, df = 19, p-value = 0.1095
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 1.952838
sample estimates:
mean in group 1 mean in group 2
       78.00000        83.41667
• According to the t-test, there is no significant difference in the exam results of the two groups (p-value = 0.1095 > 0.05)

Example: independent samples 3/4
[Figure: box plots of Points by Group]
• Central line: median
• Box: range of the central 50% of values
• There is a difference in the medians of the observed samples
• The scatter of the data within the groups is too large for the t-test to recognize that difference as significant

Example: independent samples 4/4
• Application of the MWW test in R Commander
• Hypotheses: H0: η1 - η2 = 0, H1: η1 - η2 < 0
        Wilcoxon rank sum test with continuity correction
data:  Points by Group
W = 28, p-value = 0.03475
alternative hypothesis: true location shift is less than 0
• According to the nonparametric test, there is a significant difference between the groups (p-value = 0.03475 < 0.05)
• Since the assumptions for applying the parametric test are not met, the nonparametric test is more appropriate
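Both tests of this example can also be run from the console. The sketch below uses made-up points (the original data table is not reproduced here), so its p-values will not match the slide output; only the calls and their options are the point.

```r
# Hypothetical points for 9 students in group A and 12 in group B
exam <- data.frame(
  Points = c(62, 70, 74, 75, 78, 80, 81, 85, 88,
             72, 76, 79, 82, 84, 86, 87, 89, 90, 91, 93, 95),
  Group  = factor(rep(c("A", "B"), times = c(9, 12)))
)

# t-test with equal variances, H1: mean of A minus mean of B is less than 0
t_res <- t.test(Points ~ Group, data = exam,
                alternative = "less", var.equal = TRUE)

# MWW (rank sum) test, H1: location shift of A relative to B is less than 0
w_res <- wilcox.test(Points ~ Group, data = exam,
                     alternative = "less", exact = FALSE)

c(t_test = t_res$p.value, mww = w_res$p.value)
```

The direction of the one-sided alternative follows the order of the factor levels, so "less" here means that the first level (A) is expected to score lower than the second (B).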

Example: dependent samples 1/3
For a study, married couples in which both husband and wife are employed were chosen. The following table shows the incomes of the husbands and wives in thousands of dollars.
[Table: Husband's salary and Wife's salary]
Test the claim that, among married couples, husbands have higher incomes. Use the t-test for dependent samples and the analogous nonparametric Wilcoxon test. Perform the tests at the 5% level of significance.

Example: dependent samples 2/3
• Using the Shapiro-Wilk test, it is determined that the data in both samples are normally distributed
• T-test for dependent samples in R Commander
• Hypotheses: H0: μ1 - μ2 = 0, H1: μ1 - μ2 > 0
        Paired t-test
data:  Husband.s.salary. and Wife.s.salary
t = 1.739, df = 14, p-value = 0.05198
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -0.03586299         Inf
sample estimates:
mean of the differences
                    2.8
• The null hypothesis, according to which there is no significant difference between the salaries of husband and wife, is retained (borderline!)

Example: dependent samples 3/3
• Application of the Wilcoxon test for dependent samples
• Hypotheses: H0: η1 - η2 = 0, H1: η1 - η2 > 0
        Wilcoxon signed rank test with continuity correction
data:  Husband.s.salary. and Wife.s.salary
V = 89, p-value = 0.0523
alternative hypothesis: true location shift is greater than 0
• The null hypothesis is retained (also borderline!)
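The dependent-samples tests can be sketched in the console as follows. The incomes below are invented for illustration (8 hypothetical couples, not the data of the example), so the results differ from the slide output.

```r
# Hypothetical incomes in thousands of dollars for 8 couples
husband <- c(52, 48, 61, 39, 45, 57, 50, 44)
wife    <- c(49, 50, 55, 36, 47, 52, 46, 45)

# Paired t-test, H1: mean of (husband - wife) is greater than 0
t_res <- t.test(husband, wife, paired = TRUE, alternative = "greater")

# Wilcoxon signed rank test on the same pairs
w_res <- wilcox.test(husband, wife, paired = TRUE,
                     alternative = "greater", exact = FALSE)

# The paired t-test is a one-sample t-test on the differences
d_res <- t.test(husband - wife, mu = 0, alternative = "greater")
all.equal(unname(t_res$statistic), unname(d_res$statistic))

c(t_test = t_res$p.value, wilcoxon = w_res$p.value)
```

The last comparison illustrates why the samples must be given in pairs: both tests work on the within-pair differences.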

Testing of dependence of two qualitative variables
• Chi-square test
• Let variable A have r modalities A1, A2, ..., Ai, ..., Ar
• Let variable B have c modalities B1, B2, ..., Bj, ..., Bc
• By grouping the members of the sample according to the modalities of the variables A and B, a two-dimensional contingency table of order r × c is obtained
• Hypotheses
  H0: variables A and B are independent
  H1: variables A and B are dependent

Contingency table
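A chi-square test of independence on a contingency table can be run with base R's `chisq.test()`. The sketch below assumes an invented 2 × 3 table of counts; the modality names A1, A2 and B1 to B3 are placeholders.

```r
# Hypothetical 2 x 3 contingency table (counts invented for illustration)
tab <- matrix(c(20, 15, 10,
                10, 15, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(A = c("A1", "A2"), B = c("B1", "B2", "B3")))

res <- chisq.test(tab)   # H0: variables A and B are independent

res$expected             # expected counts under independence
res$statistic            # chi-square statistic
res$parameter            # degrees of freedom: (r - 1) * (c - 1) = 2
res$p.value              # compare with the significance level
```

For two factor columns of a data frame, the table can be built first with `table()` and then passed to `chisq.test()` in the same way.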

Example: Dependence of two qualitative variables
• Data set Programming
• Variables
  • Score: Final score in the examinations (0..20)
  • F: Freshman?: 0=No, 1=Yes
  • O: Was Elect. Eng. your first option?: 0=No, 1=Yes
  • Prog: Did you learn programming at the secondary school?: 0=no; 1=scarcely; 2=a lot
  • AB: Did you learn Boole algebra at the secondary school?: 0=no; 1=scarcely; 2=a lot
  • BA: Did you learn binary arithmetic at the secondary school?: 0=no; 1=scarcely; 2=a lot
  • H: Did you learn digital systems at the secondary school?: 0=no; 1=scarcely; 2=a lot
  • K: Knowledge factor: 1 if (Prog+AB+BA+H)>=5; 0 otherwise
  • Lang: If you learned programming at the secondary school which language did you use?: 0=Pascal; 1=Basic; 2=other
