This chapter provides an overview of statistical inference and its application in learning from data, including estimation and hypothesis testing. It explores examples of estimation and hypothesis testing problems in the context of online dating profiles and experimental data. The chapter also discusses the risks involved in statistical inference.
Chapter 7 An Overview of Statistical Inference – Learning from Data Created by Kathy Fritz
Statistical Inference What You Can Learn from Data
With the increasing popularity of online dating services, the truthfulness of the information users put in their personal profiles is a topic of interest. A study was designed to investigate misrepresentation of personal characteristics. The researchers hoped to answer three questions:
• What proportion of online daters believe they have misrepresented themselves in an online profile?
• What proportion of online daters believe that others frequently misrepresent themselves?
• Are people who place a greater importance on developing a long-term, face-to-face relationship more honest in their online profiles?
The first two of these questions are estimation problems because they involve using sample data to learn something about a population characteristic. The third question is a hypothesis testing problem because it involves determining if sample data support a claim about the population of online daters.
Learning from Sample Data
When you obtain information from a sample selected from some population, it is usually because
• you want to learn something about characteristics of the population, OR
• you want to use sample data to decide whether there is support for some claim or statement about the population.
An estimation problem involves using sample data to estimate the value of a population characteristic. A hypothesis testing problem involves using sample data to test a claim about a population. Methods for estimation and hypothesis testing are called statistical inference methods because they involve generalizing (making an inference) from a sample to the population from which the sample was selected.
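The two kinds of problems can be sketched in a few lines of Python. This is only an illustration, not a method from the chapter: the function names, the simulation approach, and the numbers in the usage lines are all assumptions. The first function computes a sample proportion (an estimate). The second approximates, by simulation, how often a sample proportion at least as large as the observed one would occur if a claimed population proportion were true, which is the kind of evidence a hypothesis test weighs.

```python
import random


def sample_proportion(responses):
    """Estimate a population proportion from a list of True/False responses."""
    return sum(responses) / len(responses)


def chance_of_result_at_least(observed, claimed_p, n, trials=10_000, seed=1):
    """Approximate how often a random sample of size n, drawn from a
    population whose true proportion is claimed_p, produces a sample
    proportion greater than or equal to `observed`."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        # Simulate one sample of n yes/no responses.
        p_hat = sum(rng.random() < claimed_p for _ in range(n)) / n
        if p_hat >= observed:
            count += 1
    return count / trials


if __name__ == "__main__":
    # Hypothetical responses: True means "yes".
    responses = [True] * 33 + [False] * 17
    p_hat = sample_proportion(responses)
    print("Estimated proportion:", p_hat)
    # How surprising would this be if the population proportion were 0.5?
    print("Chance of a result this large:",
          chance_of_result_at_least(p_hat, 0.5, len(responses)))
```

A small chance suggests the sample would be unusual if the claimed proportion were true. How small is small enough is a question for later chapters.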
Learning from Data When There Are Two or More Populations
Sometimes sample data are obtained from two or more populations of interest, and the goal is to learn about differences between the populations. Consider the following example: College students spend a lot of time online, but do members of Facebook spend more time online than non-members? Data were collected from two samples of college students, one consisting of Facebook members and the other consisting of non-members. One of the variables studied was the amount of time spent on the Internet in a typical day. Based on the resulting data, it was concluded that there was no support for the claim that the mean time spent online for Facebook members was greater than the mean time for non-members. This study involves generalizing from samples, and it is a hypothesis testing problem because it involves testing a claim about the difference between the two groups.
Learning from Experimental Data
Statistical inference methods are also used to learn from experimental data. When data are obtained from an experiment, it is usually because
• you want to learn about the effect of the different experimental conditions (treatments) on the measured response. This is an estimation problem because it involves using sample data to estimate a characteristic of the treatments, such as the mean response for a treatment.
• OR you want to determine if the experimental data provide support for a claim about how the effects of two or more treatments differ. This is a hypothesis testing problem because it involves testing a claim (hypothesis) about treatment effects.
Do U Smoke After Txt?
Researchers in New Zealand investigated whether mobile phone text messaging could be used to help people stop smoking. An experiment was designed to compare two treatments. Subjects for the experiment were 1705 smokers who were older than 15 years, owned a mobile phone, and wanted to quit smoking. People in the first group received personalized text messages providing support and advice on stopping smoking. The second group was a control group, and people in this group did not receive any of these text messages. After 6 weeks, each person participating in the study was contacted and asked if he or she had smoked during the previous week. Data from the experiment were used to estimate the difference in the proportion who had quit for those who received the text messages and those who did not. Researchers estimated that the proportion of those who successfully quit smoking was greater by 0.15 for those who received text messages.
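The quantity being estimated here is a difference between two sample proportions. A minimal sketch of that arithmetic follows. The function name is an assumption, and the counts in the usage lines are made up for illustration; they are not the study's data, which this overview does not report.

```python
def difference_in_proportions(successes_1, n_1, successes_2, n_2):
    """Return (proportion in group 1) minus (proportion in group 2)."""
    if n_1 <= 0 or n_2 <= 0:
        raise ValueError("Each group must contain at least one subject.")
    return successes_1 / n_1 - successes_2 / n_2


if __name__ == "__main__":
    # Hypothetical counts, NOT the study's data: 30 of 100 people in a
    # treatment group quit, compared with 15 of 100 in a control group.
    diff = difference_in_proportions(30, 100, 15, 100)
    print(f"Estimated difference in proportions: {diff:.2f}")
```

An estimate like this is only part of the answer. The later slides on risk explain why it should be reported together with some measure of its accuracy.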
Statistical Inference Involves Risk
The risks associated with statistical inference arise because you are attempting to draw conclusions on the basis of data that provide partial rather than complete information.
In estimation problems . . . RISK – these estimates may be inaccurate.
Understand that the method used to produce the estimates and accompanying measures of accuracy might mislead.
Statistical Inference Involves Risk Continued . . .
In hypothesis testing situations . . . RISK – an inaccurate conclusion.
Understand how likely it is that the method used to decide whether or not a claim is supported might lead to an incorrect decision.
Variability in Data
When there is variability in the population, you need to consider whether this partial picture (the sample) is representative of the population. Suppose we wanted to estimate the mean length of fish in a large lake. We could catch a sample of 20 fish from the lake. One sample may have a symmetric distribution like this. Another sample may have a skewed distribution like this . . . or like this.
[Figures: distributions of three different samples of 20 fish]
This sample-to-sample variability should be considered when you assess the risk associated with drawing conclusions about the population from sample data.
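Sample-to-sample variability is easy to see in a small simulation. The sketch below assumes a hypothetical lake: the population of fish lengths is generated from a normal distribution with made-up parameters, and the function name is also an assumption. It draws several samples of 20 fish and shows that each sample gives a somewhat different estimate of the mean length.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (without replacement) of the given
    size from the population and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    # Hypothetical lake: 5000 fish lengths in centimeters (made-up values).
    rng = random.Random(42)
    lake = [rng.gauss(30, 5) for _ in range(5000)]

    print("Population mean:", round(statistics.mean(lake), 2))
    for i, m in enumerate(sample_means(lake, 20, 5), start=1):
        print(f"Mean of sample {i}: {m:.2f}")
```

The five sample means differ from one another and from the population mean, even though every sample comes from the same lake.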
Variability in Data
An experiment might be designed to determine if noise level has an effect on the time required to perform a task requiring concentration. There are 20 individuals available to serve as subjects in this experiment with two treatment conditions (quiet environment and noisy environment). The response variable is the time required to complete the task. If noise level has NO effect on completion time, the time observed for each of the 20 subjects would be the same whether they are in the quiet group or the noisy group. Any observed differences in the completion times for the two treatments would NOT be due to noise level, but to person-to-person variability and the random assignment of subjects to treatments. You must understand how differences might result from variability in the response and the random assignment to treatment groups in order to distinguish them from differences created by a treatment effect.
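The role of random assignment can be sketched with a simulation. The code below assumes noise has no effect, so each subject's completion time is fixed no matter which group the subject lands in. The 20 completion times are hypothetical values, and the function name is an assumption. Repeating the random assignment shows how large a difference in group means can appear by chance alone.

```python
import random
import statistics


def random_assignment_differences(times, n_repeats, seed=0):
    """Repeatedly split the subjects at random into two equal groups and
    return the difference in mean completion time (noisy minus quiet) for
    each split. Every subject's time is the same in either group, so any
    nonzero difference comes from the assignment, not from a treatment."""
    rng = random.Random(seed)
    half = len(times) // 2
    differences = []
    for _ in range(n_repeats):
        shuffled = list(times)
        rng.shuffle(shuffled)
        quiet, noisy = shuffled[:half], shuffled[half:]
        differences.append(statistics.mean(noisy) - statistics.mean(quiet))
    return differences


if __name__ == "__main__":
    # Hypothetical completion times in seconds for 20 subjects.
    rng = random.Random(7)
    times = [rng.gauss(60, 8) for _ in range(20)]

    diffs = random_assignment_differences(times, 1000)
    print("Largest chance difference:", round(max(abs(d) for d in diffs), 2))
```

A difference observed in a real experiment is convincing evidence of a treatment effect only if it is larger than the differences that random assignment alone tends to produce.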
Selecting an Appropriate Method Four Key Questions
In the following chapters, you will encounter different types of inference problems. The answers to the following questions will lead you to a suggested method to use.
Four Key Questions
Question Type (Q): Is the question you are trying to answer an estimation problem or a hypothesis testing problem? You will choose different methods depending on the answer to this question.
Study Type (S): Does the situation involve generalizing from a sample to learn about the population (an observational study or survey) OR does it involve generalizing from an experiment to learn about treatment effects? The answer to this question affects the choice of the method as well as the type of conclusion that can be drawn.
Four Key Questions Continued . . .
Type of Data (T): What type of data will be used to answer the question? Is the data set univariate (one variable) or bivariate (two variables)? Are the data categorical or numerical?
Univariate versus Bivariate
Identify whether these examples involve univariate or bivariate data. Explain your choice.
• The study of deception in online dating profiles investigated whether people who place a greater importance on developing a long-term, face-to-face relationship are more honest in their online profiles.
• A study was performed to learn how the proportion with a TV in the bedroom differed for children in two age groups.
Four Key Questions Continued . . .
Type of Data (T): What type of data will be used to answer the question? Is the data set univariate (one variable) or bivariate (two variables)? Are the data categorical or numerical?
Categorical versus Numerical
If you have a single variable and the data are categorical, the question of interest is probably about a population proportion. If the data are numerical, the question of interest is probably about a population mean.
Four Key Questions Continued . . .
Number of Samples or Treatments (N): How many samples are there? OR, if the data are from an experiment, how many treatments are being compared?
For situations that involve sample data, different methods are used depending on whether there are one, two, or more than two samples. Also, you may choose a different method to analyze data from an experiment with only two treatments than you would for an experiment with more than two treatments.
QSTN Think of this as the word QUESTION without the vowels.
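One way to keep the four answers together is a small record type. The sketch below is an illustration only: the class name, field names, and allowed values are assumptions, not notation from the chapter. The usage lines record the answers for the earlier fish example, where the goal was to estimate the mean length of fish in a lake from one sample of 20 fish.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QSTN:
    """Answers to the four key questions for one inference problem."""
    question_type: str  # Q: "estimation" or "hypothesis testing"
    study_type: str     # S: "sample" or "experiment"
    data_type: str      # T: for example, "one numerical variable"
    number: int         # N: number of samples or treatments

    def __post_init__(self):
        # Reject answers outside the choices described in the slides.
        if self.question_type not in ("estimation", "hypothesis testing"):
            raise ValueError("Q must be 'estimation' or 'hypothesis testing'.")
        if self.study_type not in ("sample", "experiment"):
            raise ValueError("S must be 'sample' or 'experiment'.")
        if self.number < 1:
            raise ValueError("N must be at least 1.")

    def summary(self):
        """Return the four answers as one line of text."""
        unit = "treatment" if self.study_type == "experiment" else "sample"
        plural = "" if self.number == 1 else "s"
        return (f"Q: {self.question_type}; S: {self.study_type}; "
                f"T: {self.data_type}; N: {self.number} {unit}{plural}")


if __name__ == "__main__":
    fish = QSTN("estimation", "sample", "one numerical variable", 1)
    print(fish.summary())
```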
Answering Four Key Questions to Identify An Appropriate Method You will be able to refer to this table in the following chapters to identify an appropriate method to use.
A Five-Step Process for Statistical Inference Estimation Problems Hypothesis Testing Problems
A Five-Step Process for Estimation Problems (EMC3)
Estimate: Explain what population characteristic you plan to estimate.
Method: Select a potential method using QSTN.
Check: Check to make sure that the method is appropriate. It is important to verify that any conditions are met before proceeding.
Calculate: Sample data are used to perform any necessary calculations.
Communicate Results: This is a critical step in the process. You will answer the questions of interest, explain what you have learned from the data, and acknowledge potential risk.
A Five-Step Process for Hypothesis Testing Problems (HMC3)
Hypotheses: Define the hypotheses that will be tested.
Method: Select a potential method using QSTN.
Check: Check to make sure that the method is appropriate. It is important to verify that any conditions are met before proceeding.
Calculate: Sample data are used to perform any necessary calculations.
Communicate Results: This is a critical step in the process. You will answer the questions of interest, explain what you have learned from the data, and acknowledge potential risk.