Chapter 7 Sampling Distributions

General Objectives: We begin to study samples and the statistics that describe them. These sample statistics are used to make inferences about the corresponding population parameters. This chapter covers sampling and sampling distributions, which describe the behavior of sample statistics in repeated sampling. ©1998 Brooks/Cole Publishing/ITP

Specific Topics
1. Random samples
2. Sampling plans and experimental designs
3. Statistics and sampling distributions
4. The central limit theorem
5. The sampling distribution of the sample mean, $\bar{x}$
6. The sampling distribution of the sample proportion, $\hat{p}$
7. Statistical process control: $\bar{x}$ and $p$ charts

7.1 Introduction
• Numerical descriptive measures called parameters are needed to calculate the probability of observing sample results, e.g., $p$, $\mu$, $\sigma$.
• Often the values of the parameters that specify the exact form of a distribution are unknown.
• Examples
- A pollster is sure that the responses to his “agree/disagree” question will follow a binomial distribution, but $p$, the proportion of those who “agree” in the population, is unknown.

- An agronomist believes that the yield per acre of a variety of wheat is approximately normally distributed, but the mean $\mu$ and the standard deviation $\sigma$ of the yields are unknown.
• You must rely on the sample to learn about these parameters.
• If you want the sample to provide reliable information about the population, you must select your sample in a certain way!

7.2 Sampling Plans and Experimental Designs
• The way a sample is selected is called the sampling plan or experimental design, e.g., a simple random sample, a stratified random sample, a cluster sample, a convenience sample, a judgment sample, and a quota sample.
• Simple random sampling is a commonly used sampling plan in which every sample of size $n$ has the same chance of being selected.
• The resulting sample is called a simple random sample, or just a random sample.
• Table 7.1 illustrates the ways of selecting a sample of size 2 from 4 objects.

Table 7.1 Ways of selecting a sample of size 2 from 4 objects
Sample   Observations in Sample
1        x1, x2
2        x1, x3
3        x1, x4
4        x2, x3
5        x2, x4
6        x3, x4
Definition: If a sample of $n$ elements is selected from a population of $N$ elements using a sampling plan in which each of the possible samples has the same chance of selection, then the sampling is said to be random and the resulting sample is a simple random sample.
• Example 7.1 is an example of the selection of a simple random sample.
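The six rows of Table 7.1 can be reproduced by enumerating every unordered pair of the four objects. The following is a minimal sketch; the labels x1 to x4 come from the table, while the variable names are only illustrative.

```python
from itertools import combinations
from math import comb

objects = ["x1", "x2", "x3", "x4"]  # the N = 4 objects of Table 7.1
n = 2                               # sample size

# Every possible sample of size n, ignoring the order of selection.
samples = list(combinations(objects, n))

for number, sample in enumerate(samples, start=1):
    print(number, ", ".join(sample))

# Under simple random sampling each sample is equally likely.
probability_of_each = 1 / len(samples)
print("number of samples:", len(samples), "= C(4, 2) =", comb(len(objects), n))
print("probability of each sample:", round(probability_of_each, 4))
```

With larger N and n the same enumeration quickly becomes impractical, which is why random numbers are used to draw a single sample instead.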

Example 7.1 A computer database at a downtown law firm contains files for $N = 1000$ clients. The firm wants to select $n = 5$ files for review. Select a simple random sample of 5 files from this database.
Solution You must first label each file with a number from 1 to 1000. Perhaps the files are stored alphabetically, and the computer has already assigned a number to each. Then generate a sequence of ten three-digit random numbers. If you are using Table 10 of Appendix I, select a random starting point and use a portion of the table similar to the one shown in Table 7.2. The random starting point ensures that you will not use the same sequence over and over again. The first three digits of Table 7.2 indicate the number of the first file to be reviewed.

The random number 001 corresponds to file #1, and the last file, #1000, corresponds to the random number 000. Using Table 7.2 and reading the first three digits of each entry down the first column and then from the top of the second column, you would choose the five files numbered 155, 450, 32, 882, and 350 for review.
Table 7.2 Portion of a table of random numbers
15574   35026   98924
45045   36933   28630
03225   78812   50856
88292   26053   21121
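The table lookup in Example 7.1 can be sketched in a few lines. This is an illustration only: the 4-by-3 layout of Table 7.2 and the column-by-column reading order are assumptions inferred from the order of the five selected files, and the function name `files_from_table` is hypothetical.

```python
# Table 7.2, assumed to be laid out as 4 rows by 3 columns.
TABLE_7_2 = [
    ["15574", "35026", "98924"],
    ["45045", "36933", "28630"],
    ["03225", "78812", "50856"],
    ["88292", "26053", "21121"],
]

def files_from_table(table, n_files, population_size=1000):
    """Read entries down each column and use the first three digits as a file number."""
    selected = []
    n_rows, n_cols = len(table), len(table[0])
    for col in range(n_cols):
        for row in range(n_rows):
            label = int(table[row][col][:3])
            if label == 0:            # random number 000 stands for file #1000
                label = population_size
            if label not in selected:  # sampling without replacement
                selected.append(label)
            if len(selected) == n_files:
                return selected
    return selected

chosen = files_from_table(TABLE_7_2, 5)
print(chosen)  # [155, 450, 32, 882, 350]
```

When no printed table is required, the standard library call `random.sample(range(1, 1001), 5)` draws five distinct file numbers directly.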

A simple and reliable method of sampling uses random numbers: digits generated so that the values 0 to 9 occur randomly and with equal frequency.
• Observational study: The data already existed before you decided to observe or describe their characteristics.
• You must be careful when conducting a sample survey to watch for these problems:
- Nonresponse: Are the responses you received biased because only certain subjects responded?
- Undercoverage: Does the database you used systematically exclude certain segments of the population?
- Wording bias: Questions may be too complicated or tend to confuse.
• Some research involves experimentation in which an experimental condition or treatment is imposed on the experimental units.

Some populations do not exist in fact but are hypothetical populations envisioned in the mind of the researcher.
• Sometimes the researcher cannot choose randomly and instead chooses certain samples that are believed to be representative and that behave as if they had been randomly selected from the populations of interest.
• When the population of interest consists of two or more subpopulations, called strata, a sampling plan that ensures that each subpopulation is represented in the sample is called a stratified sample.
Definition: Stratified random sampling involves selecting a simple random sample from each of a given number of subpopulations, or strata.
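The definition of stratified random sampling can be sketched as one simple random sample per stratum. The strata names, their sizes, and the function `stratified_sample` below are hypothetical and serve only to illustrate the idea.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample (without replacement) from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}

# Hypothetical population of 100 units split into two strata.
strata = {
    "urban": list(range(1, 61)),    # units 1..60
    "rural": list(range(61, 101)),  # units 61..100
}
sizes = {"urban": 6, "rural": 4}     # sample size chosen for each stratum

stratified = stratified_sample(strata, sizes, seed=1)
for name, units in stratified.items():
    print(name, sorted(units))
```

Every stratum is guaranteed to appear in the sample, which a single simple random sample of 10 units from all 100 would not ensure.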

Sometimes the available sampling units are groups of elements called clusters, such as households, city blocks, or neighborhoods.
Definition: A cluster sample is a simple random sample of clusters from the available clusters in the population.
Definition: A 1-in-$k$ systematic random sample involves the random selection of one of the first $k$ elements in an ordered population, and then the systematic selection of every $k$th element thereafter, e.g., elements 7, 17, 27, etc.
• Convenience sample: a sample that can be easily and simply obtained without random selection, e.g., people walking by a certain street corner.
• Judgment sampling allows the sampler to decide who will or will not be included in the sample, e.g., only apparently rich people.
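A 1-in-k systematic random sample, as defined above, needs only one random choice. The sketch below assumes an ordered population of 100 hypothetical units and k = 10; the function name `systematic_sample` is illustrative.

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick one of the first k elements at random, then every k-th element after it."""
    rng = random.Random(seed)
    start = rng.randrange(k)   # index 0..k-1, i.e. one of the first k elements
    return population[start::k]

population = list(range(1, 101))   # hypothetical ordered population, units 1..100
systematic = systematic_sample(population, k=10, seed=7)
print(systematic)
```

If the random start happens to be element 7, the result is 7, 17, 27, and so on, matching the example in the definition.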

Quota sampling: the makeup of the sample must reflect the makeup of the population on some selected characteristic, e.g., 90% white and 10% black, since that is the proportion in the total population.
• Nonrandom samples can be described but cannot be used for making inferences.

7.3 Statistics and Sampling Distributions
• The numerical descriptive measures you calculate from the sample are called statistics.
• Statistics are random variables.
• The probability distributions for statistics are called sampling distributions.
• In repeated sampling, they tell us what values of the statistics can occur and how often each value occurs.
Definition: The sampling distribution of a statistic is the probability distribution for the possible values of the statistic that results when random samples of size $n$ are repeatedly drawn from the population.

Three ways to find the sampling distribution of a statistic:
1. Derive the distribution mathematically using the laws of probability.
2. Approximate the distribution empirically by drawing a large number of samples of size $n$.
3. Use statistical theorems to derive exact or approximate sampling distributions.
• Example 7.3 illustrates the determination of the sampling distributions of the sample mean $\bar{x}$ and the sample median $m$.

Example 7.3 A population consists of $N = 5$ numbers: 3, 6, 9, 12, 15. If a random sample of size $n = 3$ is selected without replacement, find the sampling distribution for the sample mean $\bar{x}$ and the sample median $m$.
Solution The population from which you are sampling is shown in Figure 7.1. It contains five distinct numbers and each is equally likely, with probability $p(x) = 1/5$. You can easily find the population mean and median as $\mu = (3 + 6 + 9 + 12 + 15)/5 = 9$ and $M = 9$.

There are ten possible random samples of size $n = 3$ and each is equally likely with probability 1/10. These samples, along with the calculated values of $\bar{x}$ and $m$ for each, are listed in Table 7.3. You will notice that some values of $\bar{x}$ are more likely than others because they occur in more than one sample. For example: $P(\bar{x} = 8) = 2/10 = .2$ and $P(m = 6) = 3/10 = .3$.
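Because the population in Example 7.3 is so small, both sampling distributions can be derived by listing all ten samples. The sketch below does this by brute force; the helper name `sampling_distribution` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from statistics import mean, median

population = [3, 6, 9, 12, 15]
samples = list(combinations(population, 3))   # all samples of size 3, without replacement

def sampling_distribution(statistic):
    """Map each possible value of the statistic to its probability over all samples."""
    counts = Counter(statistic(sample) for sample in samples)
    return {value: Fraction(count, len(samples)) for value, count in sorted(counts.items())}

mean_dist = sampling_distribution(mean)
median_dist = sampling_distribution(median)

print("number of samples:", len(samples))
for value, prob in mean_dist.items():
    print("P(x-bar = %s) = %s" % (value, prob))
for value, prob in median_dist.items():
    print("P(m = %s) = %s" % (value, prob))
```

The sample mean takes the values 6 through 12, while the sample median can only be 6, 9, or 12, so the two statistics have quite different sampling distributions even though both are centered at 9.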

When the number of elements in the population is very small, it is easy to derive the sampling distributions. Otherwise, you may need to use one of these methods:
- Approximate the sampling distribution empirically.
- Rely on statistical theorems and theoretical results.

7.4 The Central Limit Theorem
• The Central Limit Theorem states that, under rather general conditions, sums and means of random samples of measurements drawn from a population tend to possess an approximately normal distribution.
• Figure 7.3 shows the probability distribution of the number appearing on a single toss of a die. Table 7.5 sums the upper faces of two dice.
• Figures 7.4–7.6 illustrate the sampling distributions of $\bar{x}$ for $n = 2, 3, 4$, respectively.

Table 7.5 Sums of the upper faces of two dice
                    First die
Second die    1    2    3    4    5    6
1             2    3    4    5    6    7
2             3    4    5    6    7    8
3             4    5    6    7    8    9
4             5    6    7    8    9   10
5             6    7    8    9   10   11
6             7    8    9   10   11   12

Central Limit Theorem If random samples of $n$ observations are drawn from a nonnormal population with finite mean $\mu$ and standard deviation $\sigma$, then, when $n$ is large, the sampling distribution of the sample mean $\bar{x}$ is approximately normally distributed, with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The approximation becomes more accurate as $n$ becomes large.
• The spread of the distribution of sample means is considerably less than the spread of the sampled population.
• The sum of the sample measurements $\sum x_i$ also has an approximately normal distribution, with mean $n\mu$ and standard deviation $\sigma\sqrt{n}$, as $n$ becomes large.
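The mean and standard deviation stated in the theorem can be checked exactly for the dice example by enumerating every outcome of n tosses. This is a sketch for small n only, since the number of outcomes grows as 6^n; the function name `mean_and_sd_of_xbar` is illustrative.

```python
from itertools import product
from math import sqrt

FACES = range(1, 7)
MU = sum(FACES) / 6                                   # population mean, 3.5
SIGMA = sqrt(sum((x - MU) ** 2 for x in FACES) / 6)   # population standard deviation

def mean_and_sd_of_xbar(n):
    """Exact mean and standard deviation of the average of n fair dice."""
    outcomes = [sum(tosses) / n for tosses in product(FACES, repeat=n)]
    center = sum(outcomes) / len(outcomes)
    spread = sqrt(sum((x - center) ** 2 for x in outcomes) / len(outcomes))
    return center, spread

for n in (1, 2, 3, 4):
    center, spread = mean_and_sd_of_xbar(n)
    # The last two columns should agree: sd of x-bar versus sigma / sqrt(n).
    print(n, round(center, 4), round(spread, 4), round(SIGMA / sqrt(n), 4))
```

The center stays at 3.5 for every n while the spread shrinks by a factor of $\sqrt{n}$, which is the behavior shown in Figures 7.4–7.6.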

How Large Is Large?
- If the sampled population is normal, then the sampling distribution of $\bar{x}$ will also be normal, no matter what the sample size.
- When the sampled population is approximately symmetric, the distribution of $\bar{x}$ becomes approximately normal for relatively small values of $n$.
- When the sampled population is skewed, the sample size must be at least 30 before the sampling distribution of $\bar{x}$ becomes approximately normal.

7.5 The Sampling Distribution of the Sample Mean
• Consider these criteria in choosing the estimator for $\mu$:
- Is it easy or hard to calculate?
- Does it produce estimates that are consistently too high or too low?
- Is it more or less variable than other possible estimators?
• If a random sample of $n$ measurements is selected from a population with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of the sample mean $\bar{x}$ will have mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.

If the population has a normal distribution, the sampling distribution of $\bar{x}$ will be exactly normally distributed, regardless of the sample size, $n$.
• If the population distribution is nonnormal, the sampling distribution of $\bar{x}$ will be approximately normally distributed for large samples (by the Central Limit Theorem).
Definition: The standard deviation of a statistic used as an estimator of a population parameter is often called the standard error of the estimator (abbreviated SE) because it refers to the precision of the estimator. Therefore, the standard deviation of $\bar{x}$, given by $\sigma/\sqrt{n}$, is referred to as the standard error of the mean (abbreviated as SEOM).

Calculating Probabilities for the Sample Mean $\bar{x}$: If you know that the sampling distribution of $\bar{x}$ is normal or approximately normal:
1. Find $\mu$ and calculate $SE(\bar{x}) = \sigma/\sqrt{n}$.
2. Write down the event of interest in terms of $\bar{x}$ and locate the appropriate area on the normal curve.
3. Convert the necessary values of $\bar{x}$ to $z$-values using $z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$.
4. Use Table 3 in Appendix I to calculate the probability.
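The four steps above can be sketched in code, with the standard normal table replaced by `statistics.NormalDist` from the Python standard library. The population values ($\mu = 50$, $\sigma = 10$, $n = 25$) and the function name `prob_xbar_between` are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def prob_xbar_between(mu, sigma, n, low, high):
    """P(low < x-bar < high), assuming x-bar is (approximately) normal."""
    se = sigma / sqrt(n)            # step 1: standard error of the mean
    z_low = (low - mu) / se         # step 3: convert both endpoints to z-values
    z_high = (high - mu) / se
    standard_normal = NormalDist()  # step 4: stands in for the normal table
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

# Hypothetical population: mu = 50, sigma = 10, samples of n = 25 (so SE = 2).
prob = prob_xbar_between(mu=50, sigma=10, n=25, low=48, high=52)
print(round(prob, 4))  # 0.6827, the area within one standard error of mu
```

A one-sided probability such as $P(\bar{x} > 52)$ follows the same pattern: compute the single $z$-value and take one minus the cumulative area.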

Properties of the Sampling Distribution of the Sample Proportion, $\hat{p}$:
• If a random sample of $n$ observations is selected from a binomial population with parameter $p$, then the sampling distribution of the sample proportion $\hat{p} = x/n$ will have mean $p$ and standard deviation $SE(\hat{p}) = \sqrt{pq/n}$, where $q = 1 - p$.
• When the sample size $n$ is large, the sampling distribution of $\hat{p}$ can be approximated by a normal distribution. The approximation will be adequate if $np > 5$ and $nq > 5$.

• Example 7.6 deals with the sampling distribution of the sample proportion $\hat{p}$.
Example 7.6 In a survey, 500 mothers and fathers were asked about the importance of sports for boys and girls. Of the parents interviewed, 60% agreed that the genders are equal and should have equal opportunities to participate in sports. Describe the sampling distribution of the sample proportion $\hat{p}$ of parents who agree that the genders are equal and should have equal opportunities.
Solution You can assume that the 500 parents represent a random sample of the parents of all boys and girls in the United States and that the true proportion in the population is equal to some unknown value that you can call $p$.

The sampling distribution of $\hat{p}$ can be approximated by a normal distribution with mean equal to $p$ (see Figure 7.10) and standard deviation $SE(\hat{p}) = \sqrt{pq/n}$, which can be estimated from the sample value $\hat{p} = .60$ as $\sqrt{(.60)(.40)/500} \approx .022$.
Calculating Probabilities for the Sample Proportion $\hat{p}$:
1. Find the necessary values of $n$ and $p$.
2. Check whether the normal approximation to the binomial distribution is appropriate ($np > 5$ and $nq > 5$).
3. Write down the event of interest in terms of $\hat{p}$ and locate the appropriate area on the normal curve.
4. Convert the necessary values of $\hat{p}$ to $z$-values using $z = (\hat{p} - p)/\sqrt{pq/n}$.
5. Use Table 3 in Appendix I to calculate the probability.
• Example 7.7 deals with the probability of observing a certain sample proportion.
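The five steps for the sample proportion can be sketched the same way. In the illustration below, $n = 500$ comes from Example 7.6, but treating $p = .60$ as the true proportion is an assumption made only to obtain numbers; the function name `prob_phat_between` is hypothetical, and `statistics.NormalDist` again stands in for the normal table.

```python
from math import sqrt
from statistics import NormalDist

def prob_phat_between(p, n, low, high):
    """P(low < p-hat < high) using the normal approximation to the binomial."""
    q = 1 - p
    if not (n * p > 5 and n * q > 5):   # step 2: is the approximation adequate?
        raise ValueError("normal approximation not appropriate: need np > 5 and nq > 5")
    se = sqrt(p * q / n)                # standard error of p-hat
    z_low = (low - p) / se              # step 4: convert endpoints to z-values
    z_high = (high - p) / se
    standard_normal = NormalDist()      # step 5: stands in for the normal table
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

# Assumed values for illustration: p = .60 (hypothetical), n = 500.
se_example = sqrt(0.60 * 0.40 / 500)
prob_within_04 = prob_phat_between(p=0.60, n=500, low=0.56, high=0.64)
print(round(se_example, 4))
print(round(prob_within_04, 4))
```

The explicit check in step 2 makes the function fail loudly for small samples rather than return a misleading normal-curve area.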

7.7 A Sampling Application: Statistical Process Control (Optional)
• The cause of a change in the variable is said to be assignable if it can be found and corrected.
• Other variation that is not controlled is regarded as random variation.
• If the variation in a process variable is solely random, the process is said to be in control.
• If the process is out of control, we must reduce the variation and get the measurements of the process variable within specified limits.
• Example 7.8 requires the construction of an $\bar{x}$ chart for monitoring the process mean.

Example 7.8 A statistical process control monitoring system samples the inside diameters of $n = 4$ bearings each hour. Table 7.6 provides the data for $k = 25$ hourly samples. Construct an $\bar{x}$ chart for monitoring the process mean.
Solution The sample mean was calculated for each of the $k = 25$ samples. For example, the mean for sample 1 is: $\bar{x} = (.992 + 1.007 + 1.016 + .991)/4 = 1.0015$

Table 7.6 25 hourly samples of bearing diameters
Sample   Sample measurements             Sample mean, $\bar{x}$
1         .992   1.007   1.016    .991   1.00150
2        1.015    .984    .976   1.000    .99375
3         .988    .993   1.011    .981    .99325
4         .996   1.020   1.004    .999   1.00475
5        1.015   1.006   1.002   1.001   1.00600
6        1.000    .982   1.005    .989    .99400
7         .989   1.009   1.019    .994   1.00275
8         .994   1.010   1.009    .990   1.00075
9        1.018   1.016    .990   1.011   1.00875
10        .997   1.005    .989   1.001    .99800
11       1.020    .986   1.002    .989    .99925
12       1.007    .986    .981    .995    .99225
13       1.016   1.002   1.010    .999   1.00675

Table 7.6 25 hourly samples of bearing diameters (cont.)
Sample   Sample measurements             Sample mean, $\bar{x}$
14        .982    .995   1.011    .987    .99375
15       1.001   1.000    .983   1.002    .99650
16        .992   1.008   1.001    .996    .99925
17       1.020    .988   1.015    .986   1.00225
18        .993    .987   1.006   1.001    .99675
19        .978   1.006   1.002    .982    .99200
20        .984   1.009    .983    .986    .99050
21        .990   1.012   1.010   1.007   1.00475
22       1.015    .983   1.003    .989    .99750
23        .983    .990    .997   1.002    .99300
24       1.011   1.012    .991   1.008   1.00550
25        .987    .987   1.007    .995    .99400

The process variables are monitored with control charts: samples of $n$ items are drawn from the process at specified intervals, and a sample statistic is computed.
A Control Chart for the Process Mean: The $\bar{x}$ Chart
- According to the CLT, the sampling distribution of $\bar{x}$ should be approximately normal; almost all of the values of $\bar{x}$ should fall into the interval $\mu \pm 3(\sigma/\sqrt{n})$, that is, within 3 standard deviations of $\bar{x}$ around the mean; estimate $\mu$ and $\sigma$ using the sample measurements.
- Every control chart has a centerline and control limits.
- The centerline for the $\bar{x}$ chart is the estimate of $\mu$, the grand average $\bar{\bar{x}}$ of all the sample statistics calculated for the measurements on the process variable.
- $\sigma$ is estimated by $s$, the standard deviation of the $nk$ measurements ($k$ samples of size $n$).
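The $\bar{x}$ chart for Example 7.8 can be sketched from the data in Table 7.6. The limits below are placed at the grand average plus or minus $3s/\sqrt{n}$, following the interval described above; using the sample standard deviation (divisor $nk - 1$) for $s$ is an assumption, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

N_PER_SAMPLE = 4
SAMPLES = [  # Table 7.6: k = 25 hourly samples of n = 4 bearing diameters
    (0.992, 1.007, 1.016, 0.991), (1.015, 0.984, 0.976, 1.000),
    (0.988, 0.993, 1.011, 0.981), (0.996, 1.020, 1.004, 0.999),
    (1.015, 1.006, 1.002, 1.001), (1.000, 0.982, 1.005, 0.989),
    (0.989, 1.009, 1.019, 0.994), (0.994, 1.010, 1.009, 0.990),
    (1.018, 1.016, 0.990, 1.011), (0.997, 1.005, 0.989, 1.001),
    (1.020, 0.986, 1.002, 0.989), (1.007, 0.986, 0.981, 0.995),
    (1.016, 1.002, 1.010, 0.999), (0.982, 0.995, 1.011, 0.987),
    (1.001, 1.000, 0.983, 1.002), (0.992, 1.008, 1.001, 0.996),
    (1.020, 0.988, 1.015, 0.986), (0.993, 0.987, 1.006, 1.001),
    (0.978, 1.006, 1.002, 0.982), (0.984, 1.009, 0.983, 0.986),
    (0.990, 1.012, 1.010, 1.007), (1.015, 0.983, 1.003, 0.989),
    (0.983, 0.990, 0.997, 1.002), (1.011, 1.012, 0.991, 1.008),
    (0.987, 0.987, 1.007, 0.995),
]

sample_means = [mean(sample) for sample in SAMPLES]
measurements = [x for sample in SAMPLES for x in sample]

centerline = mean(sample_means)        # grand average, about 0.9987
s = stdev(measurements)                # estimate of sigma from all nk measurements
standard_error = s / sqrt(N_PER_SAMPLE)
ucl = centerline + 3 * standard_error  # upper control limit
lcl = centerline - 3 * standard_error  # lower control limit

# Sample numbers (1-based) whose means fall outside the control limits.
flagged = [i for i, m in enumerate(sample_means, start=1) if not lcl <= m <= ucl]

print("centerline:", round(centerline, 4))
print("LCL, UCL:", round(lcl, 4), round(ucl, 4))
print("samples outside the limits:", flagged)
```

Because all samples have the same size, the average of the 25 sample means equals the average of all 100 measurements, so either can serve as the centerline.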

A Control Chart for the Proportion Defective: The $p$ Chart
• To monitor a process for defective items, samples of size $n$ are selected at periodic intervals and the sample proportion $\hat{p}$ is calculated.
• When the process is in control, $\hat{p}$ should fall into the interval $p \pm 3SE$, where $p$ is the proportion of defectives in the population (or the process fraction defective) with standard error $SE = \sqrt{p(1 - p)/n}$.
• The process fraction defective is unknown but can be estimated by the average of the $k$ sample proportions, $\bar{p} = \sum \hat{p}_i / k$, and $SE$ is estimated by $\sqrt{\bar{p}(1 - \bar{p})/n}$.

The centerline for the $p$ chart is located at $\bar{p}$, and the upper and lower control limits are $UCL = \bar{p} + 3\sqrt{\bar{p}(1 - \bar{p})/n}$ and $LCL = \bar{p} - 3\sqrt{\bar{p}(1 - \bar{p})/n}$.
• Example 7.9 addresses the construction of a control chart for the proportion defective in a sample.

Example 7.9 A manufacturer of ballpoint pens randomly samples 400 pens per day and tests each to see whether the ink flow is acceptable. The proportions of pens judged defective each day over a 40-day period are listed in Table 7.7. Construct a control chart for the proportion defective in samples of $n = 400$ pens selected from the process.
Solution The estimate of the process proportion defective is the average of the $k = 40$ sample proportions in Table 7.7. Therefore, the centerline of the control chart is located at $\bar{p} = \sum \hat{p}_i / k$.

An estimate of $SE$, the standard error of the sample proportions, is $\sqrt{\bar{p}(1 - \bar{p})/n}$ with $n = 400$. Therefore, the upper and lower control limits for the $p$ chart are located at $UCL = \bar{p} + 3\sqrt{\bar{p}(1 - \bar{p})/n}$ and $LCL = \bar{p} - 3\sqrt{\bar{p}(1 - \bar{p})/n}$. Here the computed lower limit is negative, so, since $p$ cannot be negative, LCL = 0. The $p$ control chart is shown in Figure 7.13. Note that all 40 sample proportions fall within the control limits. If a sample proportion collected at some time in the future falls outside the control limits, the manufacturer will be warned of a possible increase in the value of the process proportion defective, and efforts will be initiated to seek possible causes for the increase.
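The $p$ chart calculation can be sketched as a small function. Table 7.7 is not reproduced here, so the eight daily proportions below are hypothetical values used only to exercise the formulas; only $n = 400$ comes from Example 7.9, and the function name `p_chart_limits` is illustrative.

```python
from math import sqrt

def p_chart_limits(proportions, n):
    """Return (centerline, LCL, UCL) for a p chart built from k sample proportions."""
    k = len(proportions)
    p_bar = sum(proportions) / k               # estimate of the process fraction defective
    se = sqrt(p_bar * (1 - p_bar) / n)         # estimated standard error of p-hat
    ucl = p_bar + 3 * se
    lcl = max(0.0, p_bar - 3 * se)             # a proportion cannot be negative
    return p_bar, lcl, ucl

# Hypothetical daily proportions defective (NOT the values of Table 7.7).
daily = [0.0200, 0.0125, 0.0225, 0.0100, 0.0150, 0.0200, 0.0275, 0.0175]

p_bar, lcl, ucl = p_chart_limits(daily, n=400)
outside = [p for p in daily if not lcl <= p <= ucl]

print("centerline:", round(p_bar, 4))
print("LCL, UCL:", round(lcl, 4), round(ucl, 4))
print("proportions outside the limits:", outside)
```

Truncating the lower limit at zero mirrors the adjustment made in the example when the computed LCL is negative.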

Other commonly used control charts are the $R$ chart, which is used to monitor variation in the process variable by using the sample range, and the $c$ chart, which is used to monitor the number of defects per item.

Key Concepts and Formulas
I. Sampling Plans and Experimental Designs
1. Simple random sampling
a. Each possible sample is equally likely to occur.
b. Use a computer or a table of random numbers.
c. Problems are nonresponse, undercoverage, and wording bias.
2. Other sampling plans involving randomization
a. Stratified random sampling
b. Cluster sampling
c. Systematic 1-in-$k$ sampling

3. Nonrandom sampling
a. Convenience sampling
b. Judgment sampling
c. Quota sampling
II. Statistics and Sampling Distributions
1. Sampling distributions describe the possible values of a statistic and how often they occur in repeated sampling.
2. Sampling distributions can be derived mathematically, approximated empirically, or found using statistical theorems.
3. The Central Limit Theorem states that sums and averages of measurements from a nonnormal population with finite mean $\mu$ and standard deviation $\sigma$ have approximately normal distributions for large samples of size $n$.

III. Sampling Distribution of the Sample Mean
1. When samples of size $n$ are drawn from a normal population with mean $\mu$ and variance $\sigma^2$, the sample mean $\bar{x}$ has a normal distribution with mean $\mu$ and variance $\sigma^2/n$.
2. When samples of size $n$ are drawn from a nonnormal population with mean $\mu$ and variance $\sigma^2$, the Central Limit Theorem ensures that the sample mean $\bar{x}$ will have an approximately normal distribution with mean $\mu$ and variance $\sigma^2/n$ when $n$ is large ($n \ge 30$).
3. Probabilities involving the sample mean $\bar{x}$ can be calculated by standardizing the value of $\bar{x}$ using $z$: $z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$
