STATISTICS


  1. STATISTICS Summarizing, Visualizing and Understanding Data

  2. I. Populations, Variables, and Data

  3. Populations and Samples To a statistician, the population is the set or collection under investigation. Individual members of the population are not usually of interest. Rather, investigators try to infer with some degree of confidence the general features of the population.

  4. Examples • Students currently enrolled at a certain university. • Registered voters in a certain Congressional district. • The population of large-mouthed bass in a certain lake. • The population of all decay times of a radioactive isotope.

  5. Statistical Inference • Drawing and quantifying the reliability of conclusions about a population from observations on a smaller subset of the population. • Sample: The subset observed.

  6. Variables and Data • A population variable is a descriptive number or label associated with each member of a population. • The values of a population variable are the various numbers (or labels) that occur as we consider all the members of the population. • Values of variables that have been recorded for a population or a sample from a population constitute data.

  7. Types of Data • Nominal variables are variables whose values are labels. • Ordinal variables are variables whose values have a natural order. • Interval variables have values represented by numbers referring to a scale of measurement. • Ratio variables have values that are positive numbers on a scale with a unit of measurement and a natural zero point.

  8. Guess the Type • Age • Questionnaire responses: 1="strongly agree", 2="agree", …, 5="strongly disagree" • Letter grades • Reading comprehension scores • Gender • Zip codes • Molecular velocities

  9. II. Summarizing Data

  10. Location Measures (Measures of Central Tendency) A location measure or measure of central tendency for a variable is a single value or number that is taken as representing all the values of the variable. Different location measures are appropriate for different types of data.

  11. The Mean • For interval or ratio variables x • N individuals in the sample or population • $x_i$ = value of x for the ith individual • Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ The mean of a population variable is denoted by μ (the Greek letter mu).

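  12. The Mean with Repeated Values • Distinct values of x: $v_1, v_2, \ldots, v_k$ • $n_j$ = frequency of occurrence of $v_j$ • Mean: $\mu = \frac{1}{N}\sum_{j=1}^{k} n_j v_j$

  13. The Mean with Repeated Values • Relative frequencies: $f_j = n_j / N$ • Mean: $\mu = \sum_{j=1}^{k} f_j v_j$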
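The three ways of writing the mean (over individuals, over distinct values weighted by frequency, and over distinct values weighted by relative frequency) always give the same number. A minimal Python sketch of that equivalence follows; the data and the names `freq` and `rel_freq` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

x = [-2.0, 1.5, 3.1, 3.1, 3.1]  # illustrative data with a repeated value
N = len(x)

# Form 1: add up every individual's value and divide by N.
mean_direct = sum(x) / N

# Form 2: weight each distinct value by its frequency of occurrence.
freq = Counter(x)  # maps distinct value -> frequency
mean_freq = sum(n_j * v_j for v_j, n_j in freq.items()) / N

# Form 3: weight each distinct value by its relative frequency n_j / N.
rel_freq = {v_j: n_j / N for v_j, n_j in freq.items()}
mean_rel = sum(f_j * v_j for v_j, f_j in rel_freq.items())

print(mean_direct, mean_freq, mean_rel)
```

The relative frequencies sum to 1, which is why the third form needs no division by N.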

  14. Example

  15. The Median • Informally, the "middle" value when all the values are arranged in order • A number m is a median of x if at least half the individuals i in the population have $x_i \le m$ and at least half of them have $x_i \ge m$

  16. The Median – Example 1 • x: –2.0, 1.5, 2.2, 3.1, 5.7 (no repetitions) • median(x)=2.2

  17. The Median – Example 2 • x: -2.0, 1.5, 3.1, 3.1, 3.1 • median(x) = 3.1

  18. The Median – Example 3 • x: -2.0, 1.5, 3.1, 5.7, 5.9, 7.1 • median(x) = any number in [3.1, 5.7] • By convention, for an even number of individuals choose the midpoint between the smallest and largest medians, e.g., $(3.1 + 5.7)/2 = 4.4$

  19. Example • Change 7.1 to 71. What happens to the mean and the median? • The mean changes from 3.55 to 14.2 • No change in the median • The median is much less sensitive to outliers (which may be mistakes in recording data)
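The outlier experiment on this slide is easy to rerun. The sketch below uses Python's standard `statistics` module on the six values from Example 3 and on the same values with 7.1 replaced by 71; the variable names are illustrative.

```python
import statistics

x = [-2.0, 1.5, 3.1, 5.7, 5.9, 7.1]
x_outlier = [-2.0, 1.5, 3.1, 5.7, 5.9, 71.0]  # 7.1 recorded as 71

# For an even number of values, statistics.median returns the midpoint
# of the two middle values, matching the convention on slide 18.
results = {
    "original": (statistics.mean(x), statistics.median(x)),
    "with outlier": (statistics.mean(x_outlier), statistics.median(x_outlier)),
}

for label, (mean, median) in results.items():
    print(f"{label}: mean = {mean:.2f}, median = {median:.2f}")
```

Only the largest value changed, so the two middle values, and therefore the median, are untouched, while the mean absorbs the full size of the error.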

  20. The Median for Ordered Categories N=100. The median grade is B-.

  21. The Mode • The data value with the greatest frequency • Not useful for interval or ordinal data if recorded with precision • The only useful location measure for strictly nominal data

  22. Example The modes are B and B-.
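A variable can have more than one mode, as in the example above. A small Python sketch, assuming a hypothetical list of letter grades (the original frequency table is not reproduced here), uses `statistics.multimode` to return every value that ties for the greatest frequency.

```python
import statistics

# Hypothetical letter grades; "B" and "B-" each occur twice.
grades = ["B", "B-", "A", "B", "C", "B-", "A-"]

# multimode returns all most-frequent values, in order of first appearance.
modes = statistics.multimode(grades)
print(modes)
```

Because the mode only counts occurrences, it works for labels with no order or arithmetic, which is why it suits strictly nominal data.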

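  23. Cumulative Frequencies and Percentiles • x is an interval or ratio variable. • Ordered distinct values: $v_1 < v_2 < \cdots < v_k$ • Relative frequencies: $f_j = n_j / N$, where $n_j$ is the frequency of $v_j$

  24. Cumulative Frequencies and Percentiles • Cumulative frequencies: $N_j = n_1 + n_2 + \cdots + n_j$ • Cumulative relative frequencies: $F_j = f_1 + f_2 + \cdots + f_j$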

  25. The Weather Person’s Prediction Errors x

  26. Exercise From the table above, what fraction of the data is less than 1? What fraction is greater than 3? What fraction is greater than or equal to 3?

  27. Percentiles • x: an interval or ratio variable • A number a is a pth percentile of x if at least p% of the values of x are less than or equal to a and at least (100-p)% of the values of x are greater than or equal to a. • The 25th percentile is called the first quartile of x and the 75th percentile is the third quartile of x. • The 50th percentile is the second quartile or median.

  28. Example For the weather person’s errors, the 25th percentile is 3. The 50th percentile and third quartile are both 4.
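The percentile definition can be checked directly by counting. The following Python sketch is one way to do that; the function name `is_percentile` is hypothetical, and the data are the six values from the third median example rather than the weather person's errors.

```python
def is_percentile(values, a, p):
    """Return True if a meets the definition of a p-th percentile of values."""
    n = len(values)
    at_or_below = sum(1 for v in values if v <= a)
    at_or_above = sum(1 for v in values if v >= a)
    # "At least p% are <= a" means at_or_below / n >= p / 100;
    # cross-multiplying keeps the comparison in exact integer arithmetic.
    return 100 * at_or_below >= p * n and 100 * at_or_above >= (100 - p) * n

x = [-2.0, 1.5, 3.1, 5.7, 5.9, 7.1]

# Every number in [3.1, 5.7] is a 50th percentile (median); 5.8 is not.
checks = {a: is_percentile(x, a, 50) for a in (3.1, 4.4, 5.7, 5.8)}
print(checks)
```

The check shows why a percentile need not be unique: with an even number of values, a whole interval can satisfy both conditions.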

  29. Measures of Variability Statisticians are not only interested in describing the values of a variable by a single measure of location. They also want to describe how much the values of the variable are dispersed about that location.

  30. Population Variance and Standard Deviation • x: an interval or ratio variable. • N = number of individuals in population. • Variance of x: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$ • Standard deviation of x: $\sigma = \sqrt{\sigma^2}$

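  31. Sample Variance and Standard Deviation • n: the number of individuals in a sample from a population • Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean • Sample standard deviation: $s = \sqrt{s^2}$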
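The population and sample versions differ only in the divisor. The sketch below assumes the usual convention of dividing by N for a population and by n - 1 for a sample, and compares hand-computed values with Python's `statistics` module; the data are illustrative.

```python
import math
import statistics

x = [-2.0, 1.5, 3.1, 5.7, 5.9, 7.1]  # illustrative data
n = len(x)
mean = sum(x) / n
sum_sq_dev = sum((x_i - mean) ** 2 for x_i in x)  # squared deviations from the mean

pop_var = sum_sq_dev / n         # treat x as the whole population: divide by N
samp_var = sum_sq_dev / (n - 1)  # treat x as a sample: divide by n - 1 (assumed convention)
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# The standard library implements the same two conventions.
assert math.isclose(pop_var, statistics.pvariance(x))
assert math.isclose(samp_var, statistics.variance(x))
print(pop_var, samp_var, pop_sd, samp_sd)
```

For the same data the sample variance is always the larger of the two, and the gap shrinks as n grows.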

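  32. Alternative Formulas for the Variance • Using frequencies: $\sigma^2 = \frac{1}{N}\sum_{j=1}^{k} n_j (v_j - \mu)^2$, where $v_1, \ldots, v_k$ are the distinct values of x and $n_j$ is the frequency of $v_j$ • Using relative frequencies: $\sigma^2 = \sum_{j=1}^{k} f_j (v_j - \mu)^2$, where $f_j = n_j / N$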
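As with the mean, the variance can be computed from a frequency table instead of the raw values. A brief Python sketch, assuming illustrative data and the hypothetical names `var_freq` and `var_rel`, confirms that the frequency-weighted forms agree with the direct population formula.

```python
from collections import Counter

x = [-2.0, 1.5, 3.1, 3.1, 3.1]  # illustrative data with a repeated value
N = len(x)
mu = sum(x) / N

# Direct form: average squared deviation over all individuals.
var_direct = sum((x_i - mu) ** 2 for x_i in x) / N

# Frequency form: one term per distinct value, weighted by its count.
freq = Counter(x)
var_freq = sum(n_j * (v_j - mu) ** 2 for v_j, n_j in freq.items()) / N

# Relative-frequency form: weights n_j / N already sum to 1.
var_rel = sum((n_j / N) * (v_j - mu) ** 2 for v_j, n_j in freq.items())

print(var_direct, var_freq, var_rel)
```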

  33. The Interquartile Range • Q1, Q3: 1st and 3rd quartiles, respectively • Interquartile range: $\mathrm{IQR} = Q_3 - Q_1$ • Not influenced by a few extremely large or small observations (outliers)

  34. The Range • The difference between the largest data value and the smallest • Range of sample values is not a reliable indicator of the range of a population variable

  35. III. Graphical Methods

  36. Pie Charts (Circle Graphs) Sources: AT&T (1961), The World's Telephones; R: A Language and Environment for Statistical Computing, the R Core Development Team.

  37. Bar Charts (Bar Graphs)

  38. Pros and Cons • Bar chart has a scale of measurement – more precise information • Pie chart gives more vivid impression of relative proportions, e.g., obvious at a glance that N. America had more than half the telephones in the world.

  39. Stemplots (Stem and Leaf Diagrams) Grades of 50 students on a test:

      Stem | Leaves                | Cumulative Frequency
      4    | 7                     | 1
      5    | 448889                | 7
      6    | 34789                 | 12
      7    | 012234455666888889999 | 33
      8    | 0022234457799         | 46
      9    | 0457                  | 50

  40. Find the Median

      Stem | Leaves                | Cumulative Frequency
      4    | 7                     | 1
      5    | 448889                | 7
      6    | 34789                 | 12
      7    | 012234455666888889999 | 33
      8    | 0022234457799         | 46
      9    | 0457                  | 50

      The 25th and 26th leaves (both 8 on stem 7) are circled. Median = 78

  41. Exercise

      Stem | Leaves                | Cumulative Frequency
      4    | 7                     | 1
      5    | 448889                | 7
      6    | 34789                 | 12
      7    | 012234455666888889999 | 33
      8    | 0022234457799         | 46
      9    | 0457                  | 50

      The 1st quartile is 70 and the 3rd quartile is 82.
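The median and quartiles read off the stemplot can be reproduced by expanding the stems and leaves into the 50 grades. The Python sketch below does this with a hypothetical helper, `percentile`, that follows the percentile definition from slide 27 and takes the midpoint when a whole interval qualifies; it is one possible convention, not necessarily the one used to produce the slides.

```python
# Stem -> string of leaf digits, copied from the stemplot.
stemplot = {
    4: "7",
    5: "448889",
    6: "34789",
    7: "012234455666888889999",
    8: "0022234457799",
    9: "0457",
}

# Each grade is 10 * stem + leaf.
grades = sorted(10 * stem + int(leaf)
                for stem, leaves in stemplot.items()
                for leaf in leaves)

def percentile(sorted_values, p):
    """p-th percentile for an integer p with 0 < p < 100.

    If p% of n is a whole number k, every value between the k-th and
    (k+1)-th ordered values qualifies, so return their midpoint.
    Otherwise exactly one ordered value qualifies.
    """
    n = len(sorted_values)
    k, remainder = divmod(p * n, 100)
    if remainder == 0:
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    return sorted_values[k]  # 0-based index k is the (k+1)-th ordered value

q1 = percentile(grades, 25)
median = percentile(grades, 50)
q3 = percentile(grades, 75)
iqr = q3 - q1
print(len(grades), q1, median, q3, iqr)
```

With 50 grades, 25% of n is 12.5, so the first quartile is the 13th ordered grade, which is the first leaf on stem 7; the cumulative frequency column gives the same answer without listing the data.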

  42. Boxplots (Box and Whisker Diagrams)

  43. Elements of a Boxplot (figure labels: largest outlier, box, whisker, quartiles, median)

  44. Boxplot Shows Distribution Skewed to the Left

  45. Histograms • For interval or ratio data • Data is grouped into class intervals • Superficially like a bar chart

  46. Frequency Histogram • Height of bar = bin frequency • Horizontal axis: class interval (bin) Source: R: A Language and Environment for Statistical Computing, the R Core Development Team.

  47. Probability Histogram • Area of bar = relative bin frequency • E.g., 0.011 × 25 = 0.275
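In a probability histogram the bar height is the relative bin frequency divided by the bin width, so that area, not height, carries the frequency. A minimal Python sketch of that scaling follows; the data, the bin edges, and the function name `density_heights` are hypothetical.

```python
def density_heights(data, edges):
    """Bar heights for a probability histogram with bins [edges[i], edges[i+1])."""
    n = len(data)
    heights = []
    for lo, hi in zip(edges, edges[1:]):
        count = sum(1 for v in data if lo <= v < hi)
        heights.append(count / n / (hi - lo))  # relative frequency / bin width
    return heights

data = [3, 8, 12, 14, 21, 22, 27, 35, 41, 48]  # hypothetical observations
edges = [0, 25, 50]                            # two bins of width 25

heights = density_heights(data, edges)
# Multiplying each height by its bin width recovers the relative frequency.
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:])]
print(heights, areas)
```

Because every observation falls in exactly one bin, the bar areas add up to 1, which is what lets the histogram be read as a density.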

  48. Ogives (Cumulative Frequency Polygons) • Related to probability histograms • Examples of cumulative distribution functions • Probability histograms are examples of density functions

  49. Example Ogive

  50. Relationship Between Probability Histogram and Ogive The height of the ogive is the cumulative area under the histogram
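The relationship can be sketched numerically: accumulate the bar areas of a probability histogram from left to right and the running totals are the ogive's heights at the bin edges. In the Python sketch below the bin edges and bar heights are hypothetical, chosen so that one bar matches the 0.011 × 25 example above.

```python
from itertools import accumulate

edges = [0, 25, 50, 75, 100]            # hypothetical bin edges
heights = [0.004, 0.011, 0.017, 0.008]  # hypothetical probability-histogram heights

# Area of each bar = height * bin width = relative bin frequency.
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:])]

# Ogive height at each edge = total area of the bars to its left.
ogive = [0.0] + list(accumulate(areas))

for edge, height in zip(edges, ogive):
    print(f"x = {edge:3d}: cumulative relative frequency = {height:.3f}")
```

The ogive starts at 0, never decreases, and ends at 1, the defining behavior of a cumulative distribution function.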
