
Lecture 2


amma

Presentation Transcript


  1. Lecture 2 Examining Data Characteristics: Descriptive Statistics and Data Screening

  2. Objectives: • Describe and display categorical data; • Generate and interpret frequency tables, bar charts and pie charts; • Generate and interpret histograms to display the distribution of a quantitative variable; • Describe the shape, centre and spread of a distribution; • Compute descriptive statistics and select between mean/median and standard deviation / interquartile range; and • Explain data screening and its purpose, and be able to assess a distribution for normality.

  3. Statistics in Business In the business world, statistics has these important uses: • To summarise business data • To draw conclusions from those data • To make reliable forecasts about business activities • To improve business processes The statistical methods used for these tasks come from one of the two branches of statistics: • Descriptive Statistics – methods that help collect, summarise, present, and analyse a set of data • Inferential Statistics – methods that use the data collected from a small group to draw conclusions about a larger group.

  4. Categorical Data Describing, Displaying & Generating in SPSS

  5. What are Frequencies? A frequency distribution is a count of the number of responses associated with different values of a variable • Where only ONE variable is considered at a time They can be represented in tabular or graphical form • Numerical – Histograms • Categorical – Bar Charts Frequencies are reported by: • Counts • Percentages

  6. Frequencies and Data • What type of data? • Usually Categorical • Nominal • Ordinal • Examples: • Gender (Male / Female) • Marital Status (Married / Divorced / Single etc) • Age categories (18-25 / 26-35 / 36-45 / 46+) NOTE: Frequencies can be used for any scale (e.g. age in years) BUT may not be useful if there are too many divisions.

  7. Frequencies – Output (Gender) From a sample of 1450 people, males and females were fairly evenly represented. There were 810 female responses which accounted for 55.9% of the sample, and 44.1% of the sample were male.
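A frequency table like the one described on this slide can be sketched outside SPSS as well. The example below is a minimal illustration, assuming the raw responses can be rebuilt from the reported counts (810 females, so 1450 - 810 = 640 males); the function name `frequency_table` is hypothetical and not part of the lecture.

```python
from collections import Counter

# Hypothetical raw responses rebuilt from the counts on the slide:
# 810 females and 640 males (1450 - 810) in a sample of 1450.
responses = ["Female"] * 810 + ["Male"] * 640


def frequency_table(values):
    """Return {category: (count, percent)} for one categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}


table = frequency_table(responses)
for category, (count, percent) in table.items():
    # One row per category: label, count, percentage of the sample
    print(f"{category:<8}{count:>6}{percent:>7.1f}%")
```

The percentages reproduce the 55.9% and 44.1% quoted on the slide, which is a quick way to check that a reported table is internally consistent.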

  8. Frequencies – Output (Age) 575 of the 1450 respondents were aged 18-24 (39.7%). The proportions of 25-29, 30-34 and 40-45 year olds were relatively equal, each contributing approximately 16% of the sample, with only 12.4% of the respondents aged between 35 and 39.

  9. Frequencies in SPSS - Words • Analyze • Descriptive Statistics • Frequencies • Select the variable/s and use the > button to move into the variables list • Click on Charts… • Choose appropriate chart (Pie Chart or Bar Chart) • Click OK.

  10. Frequencies in SPSS - Visuals (screenshots of steps 1 to 3)

  11. Frequencies in SPSS – Visuals Cont.. 4. Select the variable/s 5. Use the > button to move into the variables list 6. Click on Charts… 7. Click on the Bar charts button

  12. Quantitative Data Describing, Displaying & Generating in SPSS

  13. Review: Frequencies 575 of the 1450 respondents were aged 18-24 (39.7%). The proportions of 25-29, 30-34 and 40-45 year olds were relatively equal, each contributing approximately 16% of the sample, with only 12.4% of the respondents aged between 35 and 39. What about when the information is too detailed?

  14. How do we make sense of this?

  15. Descriptive Statistics Descriptive Statistics are used to SUMMARISE the data Shape Centre Spread What type of data? Quantitative Interval Ratio

  16. We describe the shape of a distribution in terms of its modes, its symmetry, and whether it has any gaps or outlying values. Modes Peaks or humps seen in a histogram are called the modes of a distribution. A distribution whose histogram has one main peak is called unimodal, two peaks – bimodal, three or more – multimodal. Shape

  17. Modes A distribution whose histogram doesn’t appear to have any mode and in which all the bars are approximately the same height is called uniform. Shape

  18. Symmetry A distribution is symmetric if the halves on either side of the center look, at least approximately, like mirror images. A symmetrical distribution has a skewness statistic of ‘0’. Shape

  19. Symmetry The thinner ends of a distribution are called the tails. If one tail stretches out farther than the other, the distribution is said to be skewed to the side of the longer tail. If a distribution is skewed to the right, the skewness statistic is positive. If a distribution is skewed to the left, the skewness statistic is negative. To which direction is this distribution skewed? Shape
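The sign convention for the skewness statistic can be checked with a short calculation. The sketch below uses the simple moment-based coefficient (third central moment divided by the 1.5 power of the second); this is an assumption made for illustration. SPSS reports a sample-adjusted version, so its values differ somewhat, but the sign agrees. The three data sets are made up.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (not the SPSS-adjusted statistic)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]                # mirror image around 3
right_skewed = [1, 2, 3, 5, 6, 8, 30]      # long tail on the right
left_skewed = [-x for x in right_skewed]   # same shape, mirrored

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(f"{name:<13} skewness = {skewness(data):+.3f}")
```

The symmetric data give a skewness of zero, the long right tail gives a positive value, and mirroring the data flips the sign.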

  20. Outliers • Always be careful to point out the outliers in a distribution: those values that stand off away from the body of the distribution. Outliers … • can affect every statistical method we will study. • may be an error in the data. • should be discussed in any conclusions drawn about the data. Shape

  21. Centre (Location) Describe the centre of the distribution of the data • Repeated measurements for the variable of interest will group around this centre point in some way • Mean • Average • Median • Middle value • Mode • Occurs Most frequently

  22. To find the mean of a variable y, add all the values of the variable and divide that sum by the number of data values, n. The mean is a natural summary for unimodal, symmetric distributions. The mode is the value that occurs the most frequently in a data set. Centre (Mean and Mode)

  23. Centre (Median) • Middle value when data is arranged in ascending or descending order • The value that splits the histogram into two equal areas • 50th percentile • Measure of centre: ordinal, interval and ratio data • Even number of values: 1, 8, 2, 6, 5, 3 → ordered: 1, 2, 3, 5, 6, 8 → median = (3+5) / 2 = 4 • Odd number of values: 1, 2, 6, 5, 3 → ordered: 1, 2, 3, 5, 6 → median = 3
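The median rule on this slide (sort, then take the middle value, or the average of the two middle values when the count is even) can be written out as a small function. This is a sketch for illustration; the function name is hypothetical.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle
    values when there is an even number of observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


even_case = [1, 8, 2, 6, 5, 3]  # six values: average the 3rd and 4th
odd_case = [1, 2, 6, 5, 3]      # five values: take the 3rd

print(median(even_case))  # (3 + 5) / 2
print(median(odd_case))
```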

  24. When you have a symmetric distribution… • the number of values on either side of the centre of the distribution is the same and, if the distribution is unimodal, the mean, mode and median are equal • The mean is NOT resistant to unusual observations (outliers) and to the shape of the distribution • When the distribution is unimodal and symmetric, the mean is a natural summary statistic.

  25. Skewed Left (Negative) / Skewed Right (Positive) When you have a distribution that is skewed… • There are more values on one side of the distribution than the other • The median is relatively resistant to unusual observations and to the shape of the distribution • Therefore, the median is usually a better choice for skewed data

  26. Mean versus Median Demonstration Mean is sensitive to outliers (extremely small/big values) • 1, 8, 2, 6, 5, 3: Mean = 4.17, Median = 4 • 1, 8, 2, 6, 5, 3, 30: Mean = 7.86, Median = 5
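The demonstration on this slide can be reproduced with Python's standard `statistics` module. The sketch below uses the slide's own two data sets; the variable names are illustrative.

```python
from statistics import mean, median

original = [1, 8, 2, 6, 5, 3]
with_outlier = original + [30]  # same data plus one extreme value

# Store (mean rounded to 2 d.p., median) for each data set
summary = {
    "original": (round(mean(original), 2), median(original)),
    "with outlier": (round(mean(with_outlier), 2), median(with_outlier)),
}

for name, (m, med) in summary.items():
    print(f"{name:<13} mean = {m:.2f}  median = {med}")
```

Adding the single value 30 moves the mean from 4.17 to 7.86, while the median only moves from 4 to 5, which is why the median is described as resistant to outliers.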

  27. Mean versus Median This histogram below depicts monthly trading volume of AIG shares (in millions of shares) for the period 2002 to 2007. The mean is 170.1 million shares and the median is 135.9 million shares. Which is better to use to describe the centre of the distribution and why?

  28. Range • difference between largest (max) and smallest (min) values of a distribution • Directly affected by outliers • IQR • range of the middle 50% of the data Spread
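To see how the range and the IQR react to an outlier, the sketch below computes both for the small data set used in the mean versus median demonstration, with and without the value 30. It assumes the "inclusive" quartile method of Python's `statistics.quantiles`; quartile conventions differ between packages, so SPSS may report slightly different quartiles for the same data.

```python
from statistics import quantiles


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def iqr(values):
    """Range of the middle 50% of the data: Q3 - Q1.
    Assumes the 'inclusive' quartile convention."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


original = [1, 8, 2, 6, 5, 3]
with_outlier = original + [30]

for name, data in [("original", original), ("with outlier", with_outlier)]:
    print(f"{name:<13} range = {value_range(data)}  IQR = {iqr(data)}")
```

The range jumps from 7 to 29 when the outlier is added, while the IQR only moves from 3.5 to 4.5.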

  29. Taking into account how far each value is from the mean gives a powerful measure of the spread of a distribution Spread • Variance • The average of the squared deviation of all the values from the mean • Standard Deviation • square root of variance

  30. Variance Demonstration Mean squared deviation from the mean Deviation from the mean = difference between the mean and an observed value
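The variance calculation can be followed step by step in code. The data below are made up for illustration. Note one assumption: the slide describes variance as the average squared deviation (dividing by n), whereas statistical packages such as SPSS divide by n - 1 when the data are a sample, so the sketch offers both.

```python
from math import sqrt


def variance(values, sample=True):
    """Mean squared deviation from the mean.
    sample=True divides by n - 1; sample=False divides by n."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1 if sample else n)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

population_var = variance(data, sample=False)
sample_var = variance(data)

print("population variance:", population_var)
print("population std dev: ", sqrt(population_var))
print("sample variance:    ", round(sample_var, 3))
print("sample std dev:     ", round(sqrt(sample_var), 3))
```

The squared deviations here sum to 32, so dividing by n = 8 gives a variance of 4 and a standard deviation of 2, while dividing by n - 1 = 7 gives slightly larger values.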

  31. Spread (figure: example distributions with σ = 0.45, σ = 1 and σ = 2.24)

  32. Five Number Summary The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum). Below is the five-number summary of monthly trading volume of AIG shares (in millions of shares) for the period 2002 to 2007.
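A five-number summary is straightforward to compute once the quartiles are available. The sketch below does not use the AIG figures, which are not reproduced in this transcript; it uses a small made-up data set and assumes the "inclusive" quartile method of `statistics.quantiles`, which may differ slightly from the quartiles SPSS reports.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": max(values),
    }


data = [1, 8, 2, 6, 5, 3, 30]  # hypothetical values
summary = five_number_summary(data)

for label, value in summary.items():
    print(f"{label:<7}{value}")
```

These five values are the ones a boxplot draws: the box spans the two quartiles, the line inside marks the median, and the extremes show how far the data stretch.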

  33. Once we have a five-number summary of a variable, we can display that information in a boxplot. Five Number Summary

  34. Outliers What should be done with outliers? • They should be understood in the context of the data. An outlier for a year of data may not be an outlier for the month in which it occurred and vice versa. • They should be investigated to determine if they are in error. The values may have simply been entered incorrectly. If a value can be corrected, it should be. • They should be investigated to determine why they are so different from the rest of the data. For example, were extra sales or fewer sales seen because of a special event like a holiday?

  35. How to use SPSS Output • Half the people took 20 mins or less to get to work, and half took 20 mins or more (the median) • The average time taken to get to work was 19.98 mins • The maximum time it took someone to get to work was 33 mins • The minimum time it took someone to get to work was 8 mins • Skewness value close to 0, therefore fairly symmetrical (also mean and median very close) • The bigger the S.D., the more spread out (more varied) the times to get to work were. The S.D. is 5 mins. Is this spread out? Consider context. • The range of the middle 50% of the data (the IQR) is 7 mins.

  36. Descriptive Statistics in SPSS - Words • Analyze • Descriptive Statistics • Explore • Select the variable/s and use the > button to move into the DEPENDENT LIST box • Click on the PLOTS button • Select • Histogram • Normality plots with tests • Click OK.

  37. Descriptive Statistics in SPSS - Visuals 1. 2. 3.

  38. Descriptive Statistics in SPSS – Visuals Cont.. 4. Select the variable/s 5. Use the > button to move into the Dependent List 6. Click on Plots… 7. Select Histogram and Normality plots with tests

  39. Descriptive Statistics SPSS Output

  40. Descriptive Statistics Interpretation In this sample of 1,450 respondents the number of times fast food was eaten ranged from 1 to 80 in the last four weeks. The distribution is skewed to the right, with 50% of the respondents having eaten fast food 6 times or fewer in the last four weeks (remember 50% of values are above and 50% below the median). The average number of times fast food was eaten was 7.73 times, though the results were quite varied with a standard deviation of 6.577 and an IQR of 6. • Shape • Skewness – 2.850 • Kurtosis – 16.621 • Centre (Location) • Mean – 7.73 • Median – 6.00 • Spread (Variability) • Min – 1 • Max – 80 • Range – 79 • IQR – 6 • Std Deviation – 6.577 NOTE: Median and IQR are more appropriate when the data is skewed.

  41. Your Turn! Describe this distribution Students were asked to respond to the statement “I am scared to study Statistics” on a scale of 1 (Strongly Disagree) to 5 (Strongly Agree). Below are the results. Remember: SHAPE, CENTRE, SPREAD!

  42. Data Screening

  43. Data screening and transformation techniques are useful for: • Ensuring that data have been correctly entered • Checking for missing values • Recoding data • Ensuring distributions used in analysis are normal • If distributions do vary from normal: • Non-normal distributions may be transformed • Non-parametric techniques may be used for dramatic deviations Data Screening

  44. Errors in Data Entry Includes consistency checks and treatment of missing responses • Consistency checks • Out of range • Logically inconsistent • Extreme values • After data entry • Run Frequencies • Identify missing and out of range data

  45. When do data look wrong? • Values all the same • Holes – values “missing” • Values out of range

  46. Errors in Data Entry Identify data that are out of range, logically inconsistent, or extreme
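The screening step described here (finding missing and out-of-range values after data entry) can be sketched as a simple check. The example assumes a 1 to 5 response scale and represents missing responses as `None`; both the data and the function name `screen` are hypothetical.

```python
def screen(values, low, high):
    """Return positions of missing values and (position, value) pairs
    for entries that fall outside the valid range [low, high]."""
    missing = [i for i, v in enumerate(values) if v is None]
    out_of_range = [
        (i, v) for i, v in enumerate(values)
        if v is not None and not low <= v <= high
    ]
    return missing, out_of_range


# Hypothetical ratings on a 1-5 scale; 55 looks like a double keystroke
ratings = [3, 5, None, 1, 7, 2, 55, 4]
missing, out_of_range = screen(ratings, low=1, high=5)

print("missing at positions:", missing)
print("out of range:", out_of_range)
```

Flagged entries still need to be checked against the original questionnaires before they are corrected or treated as missing.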

  47. Transformation Reverse coding is done so that positive actions always fall on the same end of the response scale • Example • I am satisfied with: 1 2 3 4 5 • I am dissatisfied with: 1 2 3 4 5 • Recode: 1 → 5, 2 → 4, 3 → 3, 4 → 2, 5 → 1
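The recoding on this slide follows a simple rule: on a scale from 1 to 5, the reversed score is 6 minus the original score (in general, lowest plus highest minus the score). The sketch below applies that rule to a hypothetical set of responses to a negatively worded item; the function name is an assumption for illustration.

```python
def reverse_code(score, low=1, high=5):
    """Reverse a score on a scale from low to high, e.g. 1 -> 5 and 5 -> 1."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the {low}-{high} scale")
    return low + high - score


# Hypothetical responses to "I am dissatisfied with ..."
dissatisfied = [1, 2, 3, 4, 5, 2]
recoded = [reverse_code(s) for s in dissatisfied]

print(recoded)
```

As in the SPSS procedure, the recoded values are stored as a new variable, so the original responses are kept unchanged.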

  48. Data Cleaning: Recoding • This creates a new variable • Transform – Recode into different variables • Select the variable • Under “Output Variable”, type a Name & Label for the new variable • Select ‘Old and New Values..’

  49. Data Cleaning: Recoding Old Value → New Value: 1 → 5, 2 → 4, 3 → 3, 4 → 2, 5 → 1 • Type in Old Value (e.g. 1) • Type in New Value (e.g. 5) • Click Add • Repeat for other values • Click Continue when finished

  50. Example: Below we display the skewed distribution of total compensation for the CEOs of the 500 largest companies. What is the “centre” of this distribution? Are there outliers? Transformation
