Chapter 7 Looking at Distributions
Modeling by a Distribution • For a given data set, we want to know which distribution fits each variable. This is a modeling problem. • When prior knowledge suggests a specific type of distribution (normal, exponential, or Poisson, for example), a goodness-of-fit test is useful. • Q-Q plots of various kinds are very useful for finding a suitable distribution to fit the data.
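As a rough illustration of the idea behind a normal Q-Q plot, the sketch below pairs each sorted observation with the quantile a fitted normal distribution would predict. The function name `normal_qq_points`, the plotting positions (i - 0.5)/n, and the sample times are assumptions made for this example, not something taken from the textbook or from SPSS.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    xs = sorted(values)
    n = len(xs)
    # Fit a normal distribution using the sample mean and standard deviation.
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n are one common convention (assumed here).
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

# Hypothetical completion times in hours, for illustration only.
times = [3.1, 3.6, 3.9, 4.0, 4.2, 4.3, 4.5, 4.9, 5.4, 6.3]
for theoretical, observed in normal_qq_points(times):
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the normal distribution fits well, the pairs lie close to the line theoretical = observed; systematic curvature suggests a different distribution.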
Two data sets • The contents of this chapter are from Chapter 7 of the textbook. • Our textbook uses the data set marathon.sav to show how to use SPSS for looking at distributions. The Chicago Marathon has been run yearly since 1977. • Because the student version of SPSS limits the number of rows and columns, we use a similar data set, mar1500.sav, instead.
Data set “mar1500.sav” • The data set involves the following variables: “age”, “sex”, “hours”, “agecat8”, and “agecat6”. • Hours = “completion time in hours” • Agecat8: 1=24 or less, 2=25-39, 3=40-44, 4=45-49, 5=50-54, 6=55-59, 7=60-64, 8=65+ • Agecat6: 1=44 or less, 2=45-49, 3=50-54, 4=55-59, 5=60-64, 6=65+
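The category codes above can be reproduced from "age" with a simple lookup. The sketch below is only an illustration of the coding scheme: the function names are hypothetical, and it assumes ages are recorded as whole years.

```python
from bisect import bisect_left

# Upper age limit of every category except the last, open-ended one.
AGECAT8_UPPER = [24, 39, 44, 49, 54, 59, 64]
AGECAT6_UPPER = [44, 49, 54, 59, 64]

def agecat8(age):
    """Code an age in whole years into the 8-level category (1 to 8)."""
    return bisect_left(AGECAT8_UPPER, age) + 1

def agecat6(age):
    """Code an age in whole years into the 6-level category (1 to 6)."""
    return bisect_left(AGECAT6_UPPER, age) + 1

print(agecat8(47), agecat6(47))
```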
Impressions from the histogram • The mean falls between 4.3 and 4.4 hours. • The distribution is not symmetric about the mean. • The distribution has a tail toward larger times. • Low marathon times are difficult to achieve. It is hard to break the world record. • Since the distribution has a tail toward larger values, the median should be somewhat less than the mean.
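The last point can be checked numerically: a few large values pull the mean upward but leave the median almost unchanged. The sketch below uses hypothetical completion times (not values from mar1500.sav) and a helper name, `mean_minus_median`, chosen only for this example.

```python
from statistics import mean, median

def mean_minus_median(values):
    """Positive when the mean exceeds the median, as with a tail toward larger values."""
    return mean(values) - median(values)

# Hypothetical completion times in hours with a tail toward larger values.
times = [3.1, 3.6, 3.9, 4.0, 4.2, 4.3, 4.5, 4.9, 5.4, 6.3]
print(mean(times), median(times), mean_minus_median(times))
```

A positive difference is only a rough indication of the direction of the tail, not a formal measure of skewness.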
Basic statistics • The 5% trimmed mean excludes the 5% largest and the 5% smallest values. It is based on the middle 90% of cases. • The trimmed mean provides an alternative to the median when there are some outliers. • In these data the 5% trimmed mean doesn’t differ much from the usual mean, because the distribution is not too far from being symmetric.
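A minimal sketch of the trimmed mean follows. It simply drops a whole number of cases from each end; this is an assumption made for the example, and statistical packages may treat fractional counts differently, so results can differ slightly from SPSS output. The data are hypothetical.

```python
from statistics import mean

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping the given proportion of cases from each end."""
    xs = sorted(values)
    k = int(len(xs) * proportion)  # whole number of cases dropped per end
    return mean(xs[k:len(xs) - k])

# Hypothetical data: 19 ordinary values and one very large value.
data = list(range(1, 20)) + [100]
print(mean(data), trimmed_mean(data))
```

With 20 cases, one case is dropped from each end, so the single large value no longer affects the result.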
Comparison of completion times by gender • At every percentile shown, the completion times of men and women differ by about 0.4882 hour. • Weighted percentiles and Tukey’s hinges are two different ways of calculating sample percentiles. For more details, see p. 120.
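To see why the two methods can give different answers, the sketch below computes quartiles both ways on a small made-up sample. The weighted version interpolates at position (n + 1)p, which is one common definition and is assumed here; it is not guaranteed to match the textbook's formula on p. 120. Both function names are hypothetical.

```python
from statistics import median

def tukey_hinges(values):
    """Lower and upper hinges: medians of the lower and upper halves."""
    xs = sorted(values)
    half = (len(xs) + 1) // 2  # each half includes the median when n is odd
    return median(xs[:half]), median(xs[-half:])

def weighted_percentile(values, p):
    """Percentile by linear interpolation at position (n + 1) * p."""
    xs = sorted(values)
    n = len(xs)
    position = (n + 1) * p
    k = int(position)
    if k < 1:
        return xs[0]
    if k >= n:
        return xs[-1]
    return xs[k - 1] + (position - k) * (xs[k] - xs[k - 1])

sample = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up values
print(tukey_hinges(sample))
print(weighted_percentile(sample, 0.25), weighted_percentile(sample, 0.75))
```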
Remarks • Average completion times for men and women of different ages are shown. • For every age group, the average time for men is less than the average time for women. • For men and women younger than 45, age does not seem to matter very much. • For both men and women, the variability of completion times is very stable, except in the oldest age group.
Detecting outliers • Cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box are called outliers and are designated with an “o”. • Cases with values of more than 3 box lengths from the upper or lower edge of the box are called extreme values. They are designated with “*”.
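The rule above can be written as a short function. In this sketch the box edges (hinges) are passed in directly, the box length is the distance between them, and the hinge values and times are hypothetical; the function name is chosen only for this example.

```python
def classify_case(value, lower_hinge, upper_hinge):
    """Label a case as 'ordinary', 'outlier' (o) or 'extreme' (*)."""
    box_length = upper_hinge - lower_hinge
    # Distance beyond the nearer edge of the box; 0 if the value is inside the box.
    distance = max(lower_hinge - value, value - upper_hinge, 0)
    if distance > 3 * box_length:
        return "extreme"
    if distance > 1.5 * box_length:
        return "outlier"
    return "ordinary"

# Hypothetical box edges at 4.0 and 5.0 hours (box length 1.0).
for time in [2.0, 4.5, 6.0, 7.0, 8.5]:
    print(time, classify_case(time, 4.0, 5.0))
```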
Stem-and-leaf plots • A stem-and-leaf plot is a display very much like a histogram, but it includes more information about the data. • In a stem-and-leaf plot, each row corresponds to a stem and each case is represented by a leaf.
Stem-and-leaf plots • The following are the prices paid by 15 students eating lunch at a fast-food restaurant: 5.35, 4.75, 4.30, 5.47, 4.85, 6.62, 3.54, 4.87, 6.26, 5.48, 7.27, 8.45, 6.05, 4.76, 5.91 • The first value, 5.35, is rounded to 5.4, and the second value, 4.75, is rounded to 4.8. • Their stems are 5 and 4, respectively. • Their leaves are 4 and 8, respectively.

 Frequency   Stem | Leaf
     1         3  | 5
     5         4  | 83998
     4         5  | 4559
     3         6  | 631
     1         7  | 3
     1         8  | 5
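The construction of the lunch-price display can be sketched in a few lines: round each price to one decimal place, take the whole part as the stem and the first decimal as the leaf. Rounding halves upward (so that 4.85 becomes 4.9) is inferred from the display; the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICES = ["5.35", "4.75", "4.30", "5.47", "4.85", "6.62", "3.54", "4.87",
          "6.26", "5.48", "7.27", "8.45", "6.05", "4.76", "5.91"]

def stem_and_leaf(prices):
    """Map each stem to its string of leaves, keeping the data order."""
    rows = {}
    for text in prices:
        # Round to one decimal place, with halves rounded up.
        rounded = Decimal(text).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        stem = int(rounded)
        leaf = int((rounded - stem) * 10)
        rows.setdefault(stem, []).append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(rows.items())}

for stem, leaves in stem_and_leaf(PRICES).items():
    print(f"{len(leaves):2d}  {stem} | {leaves}")
```

Each printed row shows the number of cases, the stem, and the leaves in the order in which the prices were listed.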
Stem-and-leaf plots

completion time in hours Stem-and-Leaf Plot for agecat6= 45-49

 Frequency    Stem &  Leaf
     2.00        2 .  99
    13.00        3 .  0022223344444
    40.00        3 .  5555566777777788888888899999999999999999
    35.00        4 .  00000001111111122222233333333334444
    21.00        4 .  555666666777778888899
    12.00        5 .  000111111234
     9.00        5 .  667778889
     4.00        6 .  0011
     4.00 Extremes    (>=6.2)

 Stem width:   1.00
 Each leaf:    1 case(s)