270 likes | 562 Views
Advanced Data Analytics. Lecture 1 - Introduction. Indicative Content. Conducting Statistical Analyses Factor Analysis and Cluster Analysis Effect Size, Sample Size, Power Predictive Analysis Machine Learning Simulation Optimization. Statistics.
E N D
Advanced Data Analytics Lecture 1 - Introduction
Indicative Content • Conducting Statistical Analyses • Factor Analysis and Cluster Analysis • Effect Size, Sample Size, Power • Predictive Analysis • Machine Learning • Simulation • Optimization
Statistics • The science of data, involving collecting, classifying, summarizing, organizing, analyzing, presenting, and interpreting numerical data • Descriptive Statistics • Utilizing numerical and graphical methods to look for patterns in a dataset, to summarize information in a dataset, and to present that information in a convenient form • Inferrential Statistics • Utilizing sample data to make estimates, decisions, predictions, and other generalizations about a larger set of data
Fundamental Elements • An experimental (or observational) unit in an object (e.g. person, thing, transaction, or event) about we collect data • A population is a set of units that we are interested in studying • A variable is a characteristic or property of an individual experiment unit in the population. • A sample is a subset of the units of the population
What you should know by now • Descriptive statistics. • Probability. • z-test. • t-test. • Chi-squared goodness of fit. • Simple linear regression. • Weighted moving averages. • One way ANOVA.
What Next • What are the preconditions for the tests we have looked? • Using statistical packages to check some of these preconditions. • Looking at more than one dimension / factor with the ANOVA and regression. • Multiple linear regression • Two-way ANOVA • Dimension reduction and classification. • Factor analysis • Power and effect • Some non-parametric alternatives to the tests we have looked at. • Looking at case-studies / real examples. • Focus on practical concepts. • Focus on using a statistical package rather deep mathematical concepts.
Multivariate • Multivariate analysis consists of a collection of methods that can be used when several measurements are made on each individual or object in one or more samples. • We will refer to the measurements as variables and to the individuals or objects as units (research units, sampling units, or experimental units) or observations. • In practice, multivariate data sets are common, although they are not always analyzed as such. • Historically, the bulk of applications of multivariate techniques have been in the behavioral and biological sciences.
Factor Analysis • Factor analysis is a multivariate statistical approach commonly used in psychology, education. • Factor analysis is a multivariate statistical procedure that has many uses two of which will be briefly noted here: • Firstly, factor analysis reduces a large number of variables into a smaller set of variables (also referred to as factors). • Secondly, it establishes underlying dimensions between measured variables and latent constructs, thereby allowing the formation and refinement of theory.
Non-parametric • The classical methods that you know for dealing with a quantitative response variable have always assumed that the response variable has a normal distribution. • Nonparametric statistics plays an important role in the world of data analysis. • Nonparametric techniques can save the day when you can’t use other methods. • You want to compare two populations but you have two small random samples which do not have a normal distribution – • What can you do? • Many hypothesis-testing situations have nonparametric alternatives to be used when the usual assumptions we make are not met. • In other situations, nonparametric methods offer unique solutions to problems at hand.
The Test Statistic and Critical Values Sampling Distribution of the test statistic Region of Rejection Region of Rejection Region of Non-Rejection Critical Values In general we want to hit the ‘region of rejection’ (if we are researchers) as we want to reject the null hypothesis.
Normality Tests • There are a couple of ways to help you decide whether a population has a normal distribution, based on your sample: • You can graph the data, using a histogram, and see whether it appears to have a bell shape (a mound of data in the middle, trailing down on each side). • You can make a normal probability plot, which compares your data to that of a normal distribution, using a Q-Q-Plot. • If the data follows a normal distribution, your normal probability plot will show a straight line. • If the data do not follow a normal distribution, the normal probability plot will not show a straight line; it may show a curve off to one side or the other, for example.
Effect Size & Sample • We sometimes find that even very small differences between the sample statistic and the population parameter can be statistically significant if the sample size is large. • The left side of the graph shows a fairly large difference between the sample mean and population mean, but this difference is not statistically significant with a small sample…
Multiple Regression • The idea of regression is to build a model that estimates or predicts one quantitative variable (y) by using at least one other quantitative variable (x). • Simple linear regression uses exactly one x variable to estimate the y variable. • Multiple linear regression, on the other hand, uses more than one x variable to estimate the value of y.
Reporting Results 1. Abstract 2. Introduction a. Background b. Statement of purpose c. Hypotheses 3. Method a. Participants b. Measures c. Procedure d. Statistical plans 4. Results 5. Discussion 6. References
Statistical Packages • R • Python • Excel • We are going to use R & Python (and sometimes excel) as the main statistical package for this semester. • There are many packages available • IBM SPSS • Matlab • Octave
Why R • It’s free. • If you can work with R you will pick up transferable skills and other stats packages won’t be a problem. • It is good thing to have on your CV at the moment • There are many excellent resources online. • You can continue learning and using R in your own time. • There is a community of people using R in Dublin who organise talks and examples.
Common Statistical Mistakes • Failure to check the data for errors • Using incorrect statistical procedure • Failure to check assumptions • Failure to define meaningful significance • Failure to evaluate confidence interval as well as P values • Selectively selecting significant results • Failure to check for spurious significant results • Failure to identify a representative sample • Failure to use all of the basic principles of experiments
Mistake #1 - Failure to check the data for errors • Failing to investigate data for data entry or recording errors. • Failing to graph data and calculate basic descriptive statistics before analyzing data. • Example – Wrong decision due to error Test of mu = 26.000 vs mu not = 26.000 Variable N Mean StDev SE Mean T P With 16 25.625 3.964 0.991 -0.38 0.71 Without 15 24.733 1.792 0.463 -2.74 0.016
Check Your Data • This is not just true for statistics but for data analysis in general, e.g. data mining. • See pdf of chapter 2 of “Guide to Intelligent Data Analysis How to Intelligently Make Sense of Real Data” • http://www.springer.com/cda/content/document/cda_downloaddocument/9781848822597-c2.pdf?SGWID=0-0-45-963492-p173866313
Mistake #2 • Using the wrong statistical procedure in analyzing your data. • Includes failing to check that necessary assumptions are met. • Failing to design your study so that it has high enough power to call meaningful differences “significantly different.” • Includes concluding that the null hypothesis is true. Should be “not enough evidence to say the null is false.” Mistake #3
Mistake #4 • Failing to report a confidence interval as well as the P-value. • P-value tells you if statistically significant. • Confidence interval tells you what the population value might be. • “Fishing” for significant results. That is, performing several hypothesis tests on a data set, and reporting only those results that are significant. • If = P(Type I) = 0.05, and we perform 20 tests on the same data set, we can expect to make 1 Type I error. (0.05 ×20 = 1). Mistake #5
Mistake #6 • Forgetting that a significant result may be “spurious.” • Overstating the results of an observational study. • That is, suggesting that one variable “caused” the differences in the other variable. • As opposed to correctly saying that the two variables are “associated” or “correlated.” • Using a non-random or unrepresentative sample. • Includes extending the results of an unrepresentative sample to the population. • Failing to use all of the basic principles of experiments, including randomization, blinding, and controlling. Mistake #7 Mistake #8
Summary • What are the preconditions for the tests we have looked? • Using statistical packages to check some of these preconditions. • Looking at more than one dimension / factor with the ANOVA and regression. • Multiple linear regression • Two-way ANOVA… • Dimension reduction and classification. • Factor analysis… • Power and effect. • Some non-parametric alternatives to the tests we have looked at.