Review of Previous Lecture

Jimmy

Presentation Transcript


  1. Review of Previous Lecture • Range • The difference between the largest and smallest values • Interquartile range • The difference between the 25th and 75th percentiles • Variance • The sum of squared deviations from the mean, divided by the population size (population variance) or by the sample size minus one (sample variance) • Standard deviation • The square root of the variance • Z-scores • The number of standard deviations an observation is away from the mean
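The review measures can be checked on a small data set. The sketch below uses Python's standard `statistics` module; the eight observations are hypothetical and were chosen only to keep the arithmetic simple, and `z_score` is an illustrative helper name, not something from the lecture.

```python
import statistics

# Hypothetical observations (not from the lecture)
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)            # largest minus smallest
mean = statistics.mean(data)
pop_variance = statistics.pvariance(data)     # sum of squares / N
sample_variance = statistics.variance(data)   # sum of squares / (n - 1)
pop_sd = statistics.pstdev(data)              # square root of pop_variance


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


z_of_9 = z_score(9, mean, pop_sd)

print("range:", data_range)
print("population variance:", pop_variance)
print("sample variance:", round(sample_variance, 3))
print("z-score of 9:", z_of_9)
```

Whether to divide by N or by n - 1 depends on whether the data are treated as a whole population or as a sample from one.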

  2. Outline of Session • Another Measure of Dispersion • Coefficient of Variation (CV) • Histograms • Skewness • Kurtosis • Other Descriptive Summary Measures

  3. Measures of Dispersion – Coefficient of Variation • Coefficient of variation (CV) measures the spread of a set of data as a proportion of its mean. • It is the ratio of the sample standard deviation to the sample mean • It is sometimes expressed as a percentage • There is an equivalent definition for the coefficient of variation of a population

  4. Measures of Dispersion – Coefficient of Variation • A standard application of the Coefficient of Variation (CV) is to characterize the variability of geographic variables over space or time • Coefficient of Variation (CV) is particularly applied to characterize the interannual variability of climate variables (e.g., temperature or precipitation) or biophysical variables (leaf area index (LAI), biomass, etc.)

  5. Coefficient of Variation (CV) • It is a dimensionless number that can be used to compare the amount of variation between populations with different means
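As a minimal sketch of the definition above (sample standard deviation divided by sample mean), the following compares two hypothetical precipitation series whose means differ by a factor of ten. The site values and the function name `coefficient_of_variation` are assumptions made for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical annual precipitation totals (mm) at two sites
site_a = [800.0, 900.0, 1000.0, 1100.0, 1200.0]
site_b = [80.0, 90.0, 100.0, 110.0, 120.0]

cv_a = coefficient_of_variation(site_a)
cv_b = coefficient_of_variation(site_b)

print(f"CV site A: {cv_a:.3f} ({cv_a * 100:.1f}%)")
print(f"CV site B: {cv_b:.3f} ({cv_b * 100:.1f}%)")
```

The two sites have very different standard deviations in millimetres but the same CV, which is the sense in which the CV allows comparison between populations with different means.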

  6. Source: http://www.daymet.org

  7. Source: http://www.daymet.org

  8. [Figure: NDVI, temperature, and precipitation, 1990 - 2000; panels compare CV of NDVI with CV of temperature and CV of NDVI with CV of precipitation. Source: Xiao & Moody, 2004]

  9. Measures of Skewness and Kurtosis • A fundamental task in many statistical analyses is to characterize the location and variability of a data set (Measures of central tendency vs. measures of dispersion) • Neither kind of measure tells us anything about the shape of the distribution • A further characterization of the data includes skewness and kurtosis • The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set

  10. Histograms Fig. 3. Histogram of crown width (m) measured in situ for a random sample of Quercus robur trees in Frame Wood (n = 63; mean = 9.3 m; SD = 4.64 m). Source: Koukoulas & Blackburn, 2005. Journal of Vegetation Science: Vol. 16, No. 5, pp. 587–596

  11. Frequency & Distribution • A histogram is one way to depict a frequency distribution • Frequency is the number of times a variable takes on a particular value • Note that any variable has a frequency distribution • e.g. roll a pair of dice several times and record the resulting values (constrained to being between 2 and 12), counting the number of times any given value occurs (the frequency of that value occurring), and take these all together to form a frequency distribution

  12. Frequency & Distribution • Frequencies can be absolute (when the frequency provided is the actual count of the occurrences) or relative (when they are normalized by dividing the absolute frequency by the total number of observations, giving values in [0, 1]) • Relative frequencies are particularly useful if you want to compare distributions drawn from two different sources (i.e. when the numbers of observations from each source may be different)
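The dice example can be sketched in a few lines. This is an illustration only: the number of rolls, the seed, and the helper names `roll_pair_totals` and `frequency_table` are assumptions, not part of the lecture.

```python
import random
from collections import Counter


def roll_pair_totals(n_rolls, seed=0):
    """Simulate n_rolls throws of two dice and return the totals (2..12)."""
    rng = random.Random(seed)
    return [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)]


def frequency_table(values):
    """Return (absolute, relative) frequencies keyed by observed value."""
    absolute = Counter(values)
    total = len(values)
    relative = {value: count / total for value, count in absolute.items()}
    return absolute, relative


totals = roll_pair_totals(500)
absolute, relative = frequency_table(totals)

for value in sorted(absolute):
    print(f"{value:>2}  count={absolute[value]:>3}  relative={relative[value]:.3f}")
```

Because relative frequencies sum to 1 regardless of how many rolls were made, two runs with different numbers of rolls can be compared directly.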

  13. Histograms • We may summarize our data by constructing histograms, which are vertical bar graphs • A histogram is used to graphically summarize the distribution of a data set • A histogram divides the range of values in a data set into intervals • Over each interval is placed a bar whose height represents the frequency of data values in the interval.

  14. Building a Histogram • To construct a histogram, the data are first grouped into categories • The histogram contains one vertical bar for each category • The height of the bar represents the number of observations in the category (i.e., frequency) • It is common to note the midpoint of the category on the horizontal axis

  15. Building a Histogram – Example • 1. Develop an ungrouped frequency table • That is, we build a table that counts the number of occurrences of each variable value from lowest to highest:

      TMI Value   Ungrouped Freq.
      4.16        2
      4.17        4
      4.18        0
      …           …
      13.71       1

      • We could attempt to construct a bar chart from this table, but it would have too many bars to really be useful

  16. Building a Histogram – Example • 2. Construct a grouped frequency table • Select an appropriate number of classes • [Table not reproduced; one column was labeled "Percentage"]

  17. Building a Histogram – Example • 3. Plot the frequencies of each class • All that remains is to create the bar graph • [Figure: histogram, annotated "A proxy for Soil Moisture"]
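The three steps above (ungrouped counts, grouped classes, bar graph) can be sketched as follows. The data values are hypothetical, the choice of five equal-width classes is an assumption, and `grouped_frequencies` is an illustrative name; the lecture does not say how its classes were chosen.

```python
def grouped_frequencies(values, n_classes):
    """Group values into n_classes equal-width classes.

    Returns a list of (lower, upper, midpoint, count) tuples. The largest
    value is placed in the last class rather than opening a new one.
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are equal; classes would have zero width")
    width = (highest - lowest) / n_classes
    counts = [0] * n_classes
    for value in values:
        index = min(int((value - lowest) / width), n_classes - 1)
        counts[index] += 1
    table = []
    for i, count in enumerate(counts):
        lower = lowest + i * width
        upper = lower + width
        table.append((lower, upper, (lower + upper) / 2, count))
    return table


# Hypothetical index values (not the lecture's data set)
values = [4.16, 4.16, 4.17, 4.17, 4.17, 4.17, 6.2, 7.5, 8.1, 9.9, 13.71]
table = grouped_frequencies(values, n_classes=5)

# A text "bar graph": one row per class, labeled by the class midpoint
for lower, upper, midpoint, count in table:
    print(f"{midpoint:6.2f} | {'#' * count} ({count})")
```

Labeling each bar by its class midpoint follows the convention mentioned earlier for the horizontal axis.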

  18. Further Moments of the Distribution • While measures of dispersion are useful for helping us describe the width of the distribution, they tell us nothing about the shape of the distribution • Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 91.

  19. Further Moments of the Distribution • There are further statistics that describe the shape of the distribution, using formulae that are similar to those of the mean and variance • 1st moment - Mean (describes central value) • 2nd moment - Variance (describes dispersion) • 3rd moment - Skewness (describes asymmetry) • 4th moment - Kurtosis (describes peakedness)

  20. Further Moments – Skewness • Skewness measures the degree of asymmetry exhibited by the data • If skewness equals zero, the histogram is symmetric about the mean • Positive skewness vs negative skewness

  21. Further Moments – Skewness Source: http://library.thinkquest.org/10030/3smodsas.htm

  22. Further Moments – Skewness • Positive skewness • There are more observations below the mean than above it • When the mean is greater than the median • Negative skewness • There are a small number of low observations and a large number of high ones • When the median is greater than the mean
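The transcript does not reproduce the lecture's skewness formula, so the sketch below assumes one common moment-based definition: the third central moment divided by the second central moment raised to the power 3/2. The data sets and function names are hypothetical.

```python
import statistics


def central_moment(values, k):
    """k-th central moment: mean of (x - mean) ** k."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness, m3 / m2 ** 1.5 (an assumed convention)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]   # a few high values pull the mean up
left_skewed = [0, 7, 8, 8, 9, 9]     # a few low values pull the mean down

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(f"{name:>12}: skewness={skewness(data):+.3f}  "
          f"mean={statistics.fmean(data):.2f}  "
          f"median={statistics.median(data):.2f}")
```

In the right-skewed example the mean exceeds the median, and in the left-skewed example the median exceeds the mean, matching the rules of thumb on the slide.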

  23. Further Moments – Kurtosis • Kurtosis measures how peaked the histogram is • The kurtosis of a normal distribution is 0 (under the excess-kurtosis convention, in which 3 is subtracted from the standardized fourth moment) • Kurtosis characterizes the relative peakedness or flatness of a distribution compared to the normal distribution

  24. Further Moments – Kurtosis • Platykurtic – When the kurtosis < 0, the frequencies throughout the curve are closer to being equal (i.e., the curve is flatter and wider) • Thus, negative kurtosis indicates a relatively flat distribution • Leptokurtic – When the kurtosis > 0, there are high frequencies in only a small part of the curve (i.e., the curve is more peaked) • Thus, positive kurtosis indicates a relatively peaked distribution

  25. Further Moments – Kurtosis • [Figure: platykurtic and leptokurtic curves. Source: http://www.riskglossary.com/link/kurtosis.htm] • Kurtosis is based on the size of a distribution's tails. • Negative kurtosis (platykurtic) – distributions with short tails • Positive kurtosis (leptokurtic) – distributions with relatively long tails
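A minimal sketch of the sign convention described above, assuming excess kurtosis is computed as the fourth central moment divided by the squared second central moment, minus 3. The lecture's exact formula is not shown in the transcript, and the two data sets are hypothetical.

```python
import statistics


def central_moment(values, k):
    """k-th central moment: mean of (x - mean) ** k."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** k for v in values) / len(values)


def excess_kurtosis(values):
    """m4 / m2 ** 2 - 3, so that a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


flat = [1, 2, 3, 4, 5]                       # evenly spread, short tails
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]     # sharp peak with long tails

print(f"flat:   {excess_kurtosis(flat):+.2f}  (platykurtic if < 0)")
print(f"peaked: {excess_kurtosis(peaked):+.2f}  (leptokurtic if > 0)")
```

Some software reports the unadjusted ratio instead, for which the normal distribution scores 3, so it is worth checking which convention a given tool uses.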

  26. Why Do We Need Kurtosis? • These two distributions have the same variance, approximately the same skew, but differ markedly in kurtosis. Source: http://davidmlane.com/hyperstat/A53638.html

  27. How to Graphically Summarize Data? • Histograms • Box plots

  28. Functions of a Histogram • The function of a histogram is to graphically summarize the distribution of a data set • The histogram graphically shows the following: 1. Center (i.e., the location) of the data 2. Spread (i.e., the scale) of the data 3. Skewness of the data 4. Kurtosis of the data 5. Presence of outliers 6. Presence of multiple modes in the data.

  29. Functions of a Histogram • The histogram can be used to answer the following questions: 1. What kind of population distribution do the data come from? 2. Where are the data located? 3. How spread out are the data? 4. Are the data symmetric or skewed? 5. Are there outliers in the data?

  30. Source: http://www.robertluttman.com/vms/Week5/page9.htm (First three) http://office.geog.uvic.ca/geog226/frLab1.html (Last)

  31. Box Plots • We can also use a box plot to graphically summarize a data set • A box plot represents a graphical summary of what is sometimes called a “five-number summary” of the distribution • Minimum • 25th percentile • Median • 75th percentile • Maximum • The box spans the interquartile range (IQR), from the 25th to the 75th percentile • [Figure: box plot labeled min., 25th %-ile, median, 75th %-ile, max. Source: Rogerson, p. 8.]

  32. Box Plots • Example – Consider the first 9 Commodore prices (in $1,000s) 6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0 • Arrange these in order of magnitude 3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0 • The median is Q2 = 6.7 (there are 4 values on either side) • Q1 ≈ 5.9 (median of the 4 smallest values) • Q3 ≈ 10.2 (median of the 4 largest values) • IQR = Q3 – Q1 = 10.2 - 5.9 = 4.3

  33. Example (ranked) 3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0 • The median is Q2 = 6.7 • Q1 ≈ 5.9 • Q3 ≈ 10.2 • IQR = Q3 – Q1 = 10.2 - 5.9 = 4.3
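The quartile rule used in the Commodore example (take the median, then take the medians of the values below and above it) can be sketched as follows. The function name `five_number_summary` is illustrative; for an odd number of observations the sketch leaves the median itself out of both halves, as the slide does.

```python
import statistics


def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "min": ordered[0],
        "q1": statistics.median(lower_half),
        "median": statistics.median(ordered),
        "q3": statistics.median(upper_half),
        "max": ordered[-1],
    }


prices = [6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0]  # in $1,000s
summary = five_number_summary(prices)
iqr = summary["q3"] - summary["q1"]

for key, value in summary.items():
    print(f"{key:>6}: {value}")
print(f"   IQR: {iqr:.2f}")
```

Unrounded, Q1 is 5.895 and Q3 is 10.2375, which the slide reports as 5.9 and 10.2.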

  34. Box Plots • Example: Table 1.1 Commuting data (Rogerson, p5) • Ranked commuting times: 5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47 • 25th percentile is represented by observation (30+1)/4 = 7.75, i.e., three quarters of the way from the 7th value (11) to the 8th value (12) • 75th percentile is represented by observation 3(30+1)/4 = 23.25, i.e., one quarter of the way from the 23rd value (26) to the 24th value (26) • 25th percentile: 11.75 • 75th percentile: 26 • Interquartile range: 26 – 11.75 = 14.25

  35. Example (Ranked commuting times): 5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47 25th percentile: 11.75 75th percentile: 26 Interquartile range: 26 – 11.75 = 14.25
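The commuting example locates each percentile at rank p(n+1) and interpolates between neighbouring observations. A minimal sketch of that rule follows; the function name `percentile_n_plus_1` is an assumption, and the clamping at the ends of the data is a choice made here for ranks that fall outside 1..n.

```python
def percentile_n_plus_1(sorted_values, p):
    """Percentile at 1-based rank p * (n + 1), with linear interpolation."""
    n = len(sorted_values)
    position = p * (n + 1)
    if position <= 1:
        return sorted_values[0]
    if position >= n:
        return sorted_values[-1]
    lower_rank = int(position)            # whole part of the rank
    fraction = position - lower_rank      # how far toward the next value
    below = sorted_values[lower_rank - 1]
    above = sorted_values[lower_rank]
    return below + fraction * (above - below)


commute = [5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21,
           21, 21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47]

p25 = percentile_n_plus_1(commute, 0.25)
p75 = percentile_n_plus_1(commute, 0.75)
print("25th percentile:", p25)
print("75th percentile:", p75)
print("IQR:", p75 - p25)
```

Other percentile rules exist and can give slightly different quartiles for the same data, so results from statistical software may not match these exactly.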

  36. Other Descriptive Summary Measures • Descriptive statistics provide an organization and summary of a dataset • A small number of summary measures replaces the entirety of a dataset • We’ll briefly talk about other simple descriptive summary measures

  37. Other Descriptive Summary Measures • You're likely already familiar with some simple descriptive summary measures • Ratios • Proportions • Percentages • Rates of Change • Location Quotients

  38. Other Descriptive Summary Measures • Ratios – (# of observations in A) / (# of observations in B), e.g., A = 6 overcast days and B = 24 mostly cloudy days give a ratio of 6/24 • Proportions – Relate one part or category of data to the entire set of observations, e.g., a box of marbles that contains 4 yellow, 6 red, 5 blue, and 2 green gives a yellow proportion of 4/17, or color = {yellow, red, blue, green}, count = {4, 6, 5, 2}

  39. Other Descriptive Summary Measures • Proportions - Sum of all proportions = 1. These are useful for comparing two sets of data with different sizes and category counts, e.g., a different box of marbles gives a yellow proportion of 2/23, and in order for this to be a reasonable comparison we need to know the totals for both samples • Percentages - Calculated as proportion x 100, e.g., 2/23 x 100% = 8.696%; use of these should be restricted to larger sample sizes, perhaps 20+ observations
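The marble and weather examples can be reproduced in a few lines. This is a sketch only: the helper name `proportions` is hypothetical, and the second box is represented just by the two numbers the slide gives (2 yellow out of 23).

```python
def proportions(counts):
    """Map each category to its share of the total count."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# First box of marbles, as given on the slide
box_1 = {"yellow": 4, "red": 6, "blue": 5, "green": 2}
shares_1 = proportions(box_1)

# Second box: only the yellow count and the total are given
yellow_share_2 = 2 / 23
yellow_percent_2 = yellow_share_2 * 100

# Ratio of overcast days to mostly cloudy days
overcast_to_cloudy = 6 / 24

print(f"yellow share, box 1: {shares_1['yellow']:.3f}")
print(f"yellow share, box 2: {yellow_share_2:.3f} ({yellow_percent_2:.3f}%)")
print(f"overcast : mostly cloudy = {overcast_to_cloudy}")
```

The two yellow shares can be compared directly even though the boxes hold different numbers of marbles, which is the point of normalizing by the total.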

  40. Other Descriptive Summary Measures • Location Quotients - An index of relative concentration in space, comparing a region's share of some quantity with its share of a base total (here, area) • Example – Suppose we have a region of 1000 km² which we subdivide into three smaller areas of 200, 300, and 500 km² (labeled A, B, & C) • The region has an influenza outbreak with 150 cases in A, 100 in B, and 350 in C (a total of 600 flu cases):

      Area   Proportion of Area   Proportion of Cases   Location Quotient
      A      200/1000 = 0.200     150/600 = 0.250       0.250/0.200 = 1.25
      B      300/1000 = 0.300     100/600 = 0.167       0.167/0.300 = 0.56
      C      500/1000 = 0.500     350/600 = 0.583       0.583/0.500 = 1.17
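The influenza example can be computed without intermediate rounding. In this sketch the function name `location_quotients` and the dictionary layout are assumptions; the area and case figures are those given on the slide.

```python
def location_quotients(activity, base):
    """Share of activity divided by share of base, per sub-region."""
    total_activity = sum(activity.values())
    total_base = sum(base.values())
    return {
        region: (activity[region] / total_activity) / (base[region] / total_base)
        for region in activity
    }


area_km2 = {"A": 200, "B": 300, "C": 500}
flu_cases = {"A": 150, "B": 100, "C": 350}

lq = location_quotients(flu_cases, area_km2)
for region in sorted(lq):
    print(f"{region}: LQ = {lq[region]:.2f}")
```

A quotient above 1 means the sub-region holds a larger share of cases than of area, and a quotient below 1 means a smaller share. Rounding the proportions before dividing can shift the second decimal place.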

  41. Assignment II • Due by Thursday (02/09/2006) • Downloadable from Course website: • http://www.unc.edu/courses/2006spring/geog/090/001/www/
