HOW TO USE STATISTICS IN YOUR RESEARCH LIES, DAMNED LIES AND STATISTICS!
What we will cover WHY HOW Graphpad EXCEL • Why do statistics • Descriptive Statistics • Distributions • Sampling & Hypotheses • Presenting Results • Chart junk
Why do you need statistics ? • Why are you doing a research project! • Also very important in everyday life • Measure things • Examine relationships • Make predictions • Test hypotheses • Explore issues • Explain activities or attitudes • Make comparisons • Draw conclusions based on samples • Develop new theories • …
Misuse of statistics Design • Ignoring some ‘inconvenient data points’ • Focus on certain variables and exclude others • Alter scales to present your data in a more positive way • Present correlation as causation
Misuse is generally accidental • Bias • Need to be particularly careful in ‘questionnaire’ type research • Also when sampling • Using the wrong statistical tests • Making incorrect inferences • In going from your sample to the general case • Incorrectly drawing conclusions based on correlations
Descriptive Statistics • Used to describe or summarise what your data shows • Not used to draw any conclusions that extend beyond your own data • Mean • Median • Mode • Variance • Standard Deviation
Mean (Average) • Imagine you have collected some data • From running an algorithm on a problem • By measuring execution time • By asking opinions • You want to summarise your data • Don’t present all the results • mean {-30, 1, 2, 3, 4} = -4 • mean {0, 1, 2, 3, 4} = 2 • Measures centrality Excel: =AVERAGE(A1:A10)
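As a rough sketch of the same calculation outside Excel, the two example sets can be averaged with Python's standard `statistics` module. The variable names are illustrative only and are not part of the original slides.

```python
from statistics import mean

# The two example data sets from the slide
with_outlier = [-30, 1, 2, 3, 4]
without_outlier = [0, 1, 2, 3, 4]

# mean = sum of the values divided by how many values there are
mean_with = mean(with_outlier)
mean_without = mean(without_outlier)

print(mean_with, mean_without)
```

A single extreme value (-30) is enough to drag the mean well away from where most of the data sits.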
The mean is not the whole story. [Figure: results for Emma’s Algorithm and Malcolm’s Algorithm]
Standard Deviation • Standard Deviation measures the spread of your data • Important as it gives you some indication of the reliability or variability of your results • sd {-30, 1, 2, 3, 4} = 14.6 • sd {0, 1, 2, 3, 4} = 1.6 • Measures spread Excel: =STDEV(A1:A10)
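The two figures above can be checked with a short sketch, assuming the sample standard deviation (dividing by n - 1) is what is wanted, which is also what Excel's STDEV computes. The variable names are illustrative.

```python
from statistics import stdev

# The two example data sets from the slide
spread_out = [-30, 1, 2, 3, 4]
tight = [0, 1, 2, 3, 4]

# stdev() is the sample standard deviation (divides by n - 1)
sd_spread = stdev(spread_out)
sd_tight = stdev(tight)

print(round(sd_spread, 1), round(sd_tight, 1))
```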
The mean is not the whole story. Emma’s Algorithm: STD 0.71. Malcolm’s Algorithm: STD 28.07
True or False ? The majority of Scots have more than the average number of legs
TRUE! Most Scots have more than the average number of legs! • (None have 3 legs) • Most have 2 legs • Some have 1 leg • Some have 0 legs • The average < 2 (~1.9) • The mean is not a relevant measure!
When can I use the mean? • The data that you are sampling should follow a normal distribution • Most values are close to the mean, and a few lie at either extreme • 68% of values within 1 SD of mean • 95% of values within 2 SD In practice, a lot of data does follow this kind of distribution
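The 68% and 95% figures can be reproduced from the normal distribution itself. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; it describes an ideal normal distribution, not any particular data set.

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within 1 and within 2 standard deviations of the mean
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(f"within 1 SD: {within_1_sd:.1%}, within 2 SD: {within_2_sd:.1%}")
```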
But not all data has a normal distribution • majority of the data is < m (the mean) • more than half the population has less than the mean value • more than half the population is “below average”! [Figure: skewed distribution with m - sd, m and m + sd marked]
The Median • Median: item with average rank • Rank the items in order, and pick the middle one • median {-30, 1, 2, 3, 4} = 2 • median {0, 1, 2, 2, 2, 3, 4, 10, 27} = 2 EXCEL: =MEDIAN(A1:A10)
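A short sketch of the same two medians, with the mean of the second set computed alongside for comparison. Only the standard `statistics` module is used; the variable names are illustrative.

```python
from statistics import mean, median

# The two example data sets from the slide
small_set = [-30, 1, 2, 3, 4]
skewed_set = [0, 1, 2, 2, 2, 3, 4, 10, 27]

# median() sorts the values and picks the middle one
median_small = median(small_set)
median_skewed = median(skewed_set)

# The two large values (10 and 27) pull the mean up but leave the median alone
mean_skewed = mean(skewed_set)

print(median_small, median_skewed, round(mean_skewed, 2))
```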
Example: Mean vs Median Suppose we ask 7 students how much money they have on them: Mean: £146 Median: £3 The median is much less affected by outliers in the data It is more representative of the sample
Measuring Spread in non-normal data • Quartiles (25th and 75th percentiles) are a nonparametric measure of spread • first quartile (Q1) = lower quartile = cuts off lowest 25% of data • third quartile (Q3) = upper quartile = cuts off lowest 75% of data [Figure: distribution with q1, med and q3 marked]
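Quartiles can be sketched with `statistics.quantiles`, here applied to the skewed example set from the median slide. Several conventions exist for interpolating quartiles, so other tools, Excel included, may report slightly different Q1 and Q3 values for the same data.

```python
from statistics import quantiles

# Example data reused from the median slide
data = [0, 1, 2, 2, 2, 3, 4, 10, 27]

# n=4 splits the data into four equal-sized groups and returns the three cut points
q1, q2, q3 = quantiles(data, n=4)

# The interquartile range (Q3 - Q1) is a spread measure that ignores the extremes
iqr = q3 - q1

print(q1, q2, q3, iqr)
```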
Bag contains 1000 balls • They are either red or black • How can we estimate what proportion is red and what proportion is black without looking at all the balls in the bag ?
Sampling • Most experiments involve taking samples from a much larger “population” of data • 20 people asked to rate a website • An algorithm run 10 times to benchmark speed • A measure of quality of service on 10 consecutive days from a network We want to assume that our sample is representative of the larger population
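The bag-of-balls question can be sketched as a small simulation. The 30% red split, the sample size of 100 and the random seed are all assumptions made up for illustration; in a real experiment the true proportion is exactly what you do not know.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical bag: 1000 balls, 300 red and 700 black (unknown to the experimenter)
bag = ["red"] * 300 + ["black"] * 700

# Draw 100 balls without replacement and use them to estimate the proportion of red
sample = rng.sample(bag, 100)
estimated_red = sample.count("red") / len(sample)

print(f"estimated proportion of red: {estimated_red:.2f}")
```

Re-running with a different seed gives a slightly different estimate each time, which is the sampling variation the following slides discuss.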
Sampling • Imagine throwing a die 600 times… • We know what the distribution of outcomes should be theoretically • Assume we throw the die 30 times • We might take ‘good’ samples • Or we might be ‘unlucky’ with our samples [Bar charts: frequency of each face, 1 to 6]
Sampling • Now imagine we have a weighted die… • We make 30 throws • The results look a lot like the ‘unlucky’ results from our previous sample… • How can we tell whether the die is really different or whether we were just unlucky during our sampling… • (In most experiments we don’t know what the underlying distribution actually is) [Bar charts: frequency of each face, 1 to 6, for the fair and the weighted die]
Statistical Tests – Student’s t-test • The t-test asks: if the two sets of data came from the same underlying distribution, how likely is a difference in means at least as large as the one we observed? That probability is the p-value • If the p-value is very small (< 5%) then we conclude the samples come from DIFFERENT distributions • We can then say the difference between the two experiments is statistically significant • But… • If it is > 5%, we cannot rule out that both samples came from the same distribution • Any differences in mean or standard deviation could be due to random sampling alone • There is no significant difference between the samples Excel: =TTEST(Range1, Range2, tails, type) Range1 – first set of data Range2 – second set of data Tails: set this to 2 (a two-tailed test) Type: set this to 2 (an unpaired t-test assuming equal variances)
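To show what the test computes, here is a minimal sketch of the unpaired, equal-variance t statistic using only the Python standard library. The timing data are hypothetical. The standard library has no t-distribution, so instead of a p-value the sketch compares |t| against 2.306, the two-tailed 5% critical value for 8 degrees of freedom; a statistics package would normally report the p-value directly.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical run times (seconds) for two algorithms, 5 runs each
runs_a = [10.1, 9.8, 10.3, 10.0, 9.9]
runs_b = [10.6, 10.4, 10.9, 10.5, 10.7]

n_a, n_b = len(runs_a), len(runs_b)
degrees_of_freedom = n_a + n_b - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * variance(runs_a) + (n_b - 1) * variance(runs_b)) / degrees_of_freedom

# t statistic: difference in means divided by its standard error
t_stat = (mean(runs_a) - mean(runs_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-tailed 5% critical value of the t distribution for 8 degrees of freedom
CRITICAL_T_DF8 = 2.306
significant = abs(t_stat) > CRITICAL_T_DF8

print(f"t = {t_stat:.2f}, df = {degrees_of_freedom}, significant at 5%: {significant}")
```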
Statistical Tests – Student’s t-test • Mary and John each write an algorithm to sort a large database. Mary claims hers is faster than John’s. • They each run their algorithms 20 times on the same machine and record the results and some descriptive statistics. • John claims she is wrong – his algorithm is definitely faster • Is he right? • Two-tailed p value = 0.25 • If Mary’s and John’s samples both came from the same distribution, there would be a 25% chance of seeing a difference at least this large • Therefore the difference in results could be down to random variation in sampling alone • There is no statistically significant difference in performance between John’s and Mary’s algorithms
Another Example • Mary and John both roll a die many times and record the mean score. • Mary claims that John’s die is biased • Is she right? • Two-tailed p value = 0.00002 • If Mary’s and John’s samples both came from the same distribution, there would be only a 0.002% chance of seeing a difference at least this large • Therefore the difference in results is statistically significant • We can conclude that John’s die is different to Mary’s
Some words of caution… • Strictly speaking, the t-test should only be used if the underlying data distribution is normal • If you don’t think it is, there are similar tests you can use: • Wilcoxon • RankSum
Some more tests • For some experiments, we might have a hypothesis: • Students have no preference as to which of 3 browsers they use when they go in the JKCC • From the hypothesis, we can calculate what we would expect to find in an experiment if the hypothesis was true • A researcher goes into the lab and records which of 3 browsers is being used by 60 students • Would expect to see 20 students using each browser • He records the actual results observed
CHITEST • The CHITEST asks: • What is the probability of finding results at least this far from the expected ones if the hypothesis was true? • It generates a number called the p-value • If p < 0.05, we REJECT the hypothesis • If p > 0.05, we ACCEPT the hypothesis (strictly, we fail to reject it) • In this case, if p < 0.05, then we reject the hypothesis that students have no preference for browsers (i.e. they do have a preference!) • In EXCEL: =CHITEST(actualValues, expectedValues)
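The calculation behind the test can be sketched in a few lines. The observed counts below are hypothetical: the slides' actual counts are not reproduced in the text, so these were chosen only so that the result lands close to the P value of 0.1423 quoted on the next slide. The closed-form p-value used here holds only for exactly 2 degrees of freedom (3 categories); other cases need a chi-square table or a statistics package.

```python
import math

# Hypothetical browser counts for 60 students (not the data from the slides)
observed = [27, 15, 18]

# Under "no preference", the 60 students split evenly: 20 per browser
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For exactly 2 degrees of freedom the upper-tail probability is exp(-x / 2)
p_value = math.exp(-chi_sq / 2)

print(f"chi-square = {chi_sq:.2f}, p = {p_value:.4f}")
```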
Chi test • Students have no preference as to which browser they use when they go in the JKCC The two-tailed P value equals 0.1423 By conventional criteria, this difference is considered to be not statistically significant. P > 0.05 so we ACCEPT the hypothesis If students really had no preference, there would be a 14% chance of seeing results at least this far from the expected distribution This is NOT statistically significant (the value must be < 0.05 to be significant) – we cannot reject the hypothesis, so we continue to assume that students have NO preference as to which browser they use
Linear Regression • Sometimes you want to find correlations between two variables: • QualityOfS & SizeOfNetwork • LinesOfCode & SpeedOfExecution • Age & TimeSpentOnSocialMedia • Show trends • Use to predict future values
Understanding the Graph • Variable on x axis: independent variable • Variable on y axis: dependent variable • Line of best fit: y = mx + c, where m is the slope of the line and c is the y-intercept • Fitted line: mark = 0.6958 × attendance + 4.9333 • R² = 0.98134: measure of quality of fit (maximum = 1) • Prediction: what mark would a student who attended 75% of the time get? Mark = 0.6958 × 75 + 4.9333 ≈ 57.12
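A least-squares line and its R² can be sketched by hand in a few lines. The function name `fit_line` and the small data set are illustrative assumptions; only the slope 0.6958 and intercept 4.9333 come from the slide, and they are used here just to repeat the 75% attendance prediction.

```python
def fit_line(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical attendance (%) and mark data, not the data behind the slide
attendance = [40, 55, 70, 85, 100]
marks = [33, 44, 52, 65, 74]
slope, intercept, r_squared = fit_line(attendance, marks)
print(f"mark = {slope:.4f} * attendance + {intercept:.4f}, R^2 = {r_squared:.4f}")

# Prediction using the fitted line quoted on the slide
predicted_mark = 0.6958 * 75 + 4.9333
print(f"predicted mark at 75% attendance: {predicted_mark:.2f}")
```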
Doing this in Excel • Scatterplot of data • Make sure it is in columns, with the independent variable (x) first • Chart Layout: • Add linear trendline • Trendline Options: choose to display the equation and the R-squared value on the chart
And finally Presenting your results
Chart Junk “The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.” “Chartjunk can turn bores into disasters, but it can never rescue a thin data set.”
Too much information! • Too much info
SUMMARY • Remember you need to use statistics to properly analyse your work • Make sure you use the right statistic • Make sure you present your data/statistics well • Don’t lie with statistics!
Dropbox links to slides and a workbook: http://bit.ly/MrW1K4 and http://bit.ly/1fanfQ3