
Chapter 1



Presentation Transcript


  1. Chapter 1: Looking at Data - Distributions

  2. Introduction • Goal: Using Data to Gain Knowledge • Terms/Definitions: • Individuals: Units described by or used to obtain data, such as humans, animals, objects (aka experimental or sampling units) • Variables: Characteristics corresponding to individuals that can take on different values among individuals • Categorical Variable: Levels correspond to one of several groups or categories • Quantitative Variable: Takes on numeric values such that arithmetic operations make sense

  3. Introduction • Spreadsheets for Statistical Analyses • Rows: Represent Individuals • Columns: Represent Variables • SPSS, Minitab, EXCEL are examples • Measuring Variables • Instrument: Tool used to make quantitative measurement on subjects (e.g. psychological test or physical fitness measurement) • Independent and Dependent Variables • Independent Variable: Describes a group an individual comes from (categorical) or its level (quantitative) prior to observation • Dependent Variable: Random outcome of interest

  4. Independent and Dependent Variables • Dependent Variables are also called response variables • Independent Variables are also called explanatory variables • Marketing: Does amount of exposure affect attitudes? • I.V.: Exposure (in time or number), different subjects receive different levels • D.V.: Measurement of liking of a product or brand • Medicine: Does a new drug reduce heart disease? • I.V.: Treatment (Active Drug vs Placebo) • D.V.: Presence/Absence of heart disease in a time period • Psychology/Finance: Risk Perceptions • I.V.: Framing of Choice (Loss vs Gain) • D.V.: Choice Taken (Risky vs Certain)

  5. Rates and Proportions • Categorical Variables: Typically we count the number with some characteristic in a group of individuals. • The raw count alone is not a useful summary, since it cannot be compared across groups of different sizes. More useful summaries include: • Proportion: The number with the characteristic divided by the group size (will lie between 0 and 1) • Percent: # with characteristic per 100 individuals (proportion*100) • Rate per 100,000: proportion*100,000
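
The three summaries on this slide can be sketched in a few lines of Python. The function name `summarize_count` and the counts (30 of 120) are made up for illustration and do not come from the slides.

```python
def summarize_count(count, group_size):
    """Return proportion, percent, and rate per 100,000 for a count."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0 <= count <= group_size:
        raise ValueError("count must be between 0 and group_size")
    proportion = count / group_size
    return {
        "proportion": proportion,                # lies between 0 and 1
        "percent": proportion * 100,             # per 100 individuals
        "rate_per_100k": proportion * 100_000,   # per 100,000 individuals
    }

# Hypothetical numbers: 30 of 120 individuals have the characteristic.
summary = summarize_count(30, 120)
print(summary)
```

All three summaries carry the same information; the rate per 100,000 is mainly convenient when the proportion is very small.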

  6. Graphical Displays of Distributions • Graphs of Categorical Variables • Bar Graph: Horizontal axis defines the various categories, heights of bars represent numbers of individuals • Pie Chart: Breaks down a circle (pie) such that the sizes of the slices represent the numbers or percentages of individuals in the categories.

  7. Example - AAA Ratings of FL Hotels (Bar Chart)

  8. Example - AAA Ratings of FL Hotels (Pie Chart)

  9. Graphical Displays of Distributions • Graphs of Numeric Variables • Stemplot: Crude, but quick method of displaying the entire set of data and observing shape of distribution • Stem: All but rightmost digit, Leaf: Rightmost Digit • Put stems in vertical column (small at top), draw vertical line • Put leaves in appropriate row in increasing order from stem • Histogram: Breaks data into equally spaced ranges on horizontal axis, heights of bars represent frequencies or percentages

  10. Example: Time (Hours/Year) Lost to Traffic • Stems: 10s of hours, Leaves: Hours • Step 1: Stems: 1 2 3 4 5 • Step 2: Stems and Leaves:
      1 | 48
      2 | 01244699
      3 | 0112244457778
      4 | 122222245566
      5 | 0336
      Source: Texas Transportation Institute (5/7/2001).
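
As a rough sketch, the stemplot construction can be automated. The 39 values below are read back from the stems and leaves on this slide; the function name `stemplot` is an illustrative choice, and it assumes non-negative integers with the ones digit as the leaf.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem = all but the last digit, leaf = last digit."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)
    rows = []
    # Include every stem from smallest to largest, even if it has no leaves.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}")
    return rows

# Hours per year lost to traffic, reconstructed from the slide's stemplot.
hours_lost = [
    14, 18,
    20, 21, 22, 24, 24, 26, 29, 29,
    30, 31, 31, 32, 32, 34, 34, 34, 35, 37, 37, 37, 38,
    41, 42, 42, 42, 42, 42, 42, 44, 45, 45, 46, 46,
    50, 53, 53, 56,
]

for row in stemplot(hours_lost):
    print(row)
```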

  11. Example: Time (Hours/Year) Lost to Traffic - EXCEL Output • Note: in the histogram, each bin label is the upper limit of its range, and the bin includes that value (e.g. T≤14, 14<T≤21, …, 42<T≤49, T>49)

  12. Comparing 2 Groups - Back-to-back Stemplots • Places Stems in Middle, group 1 to left, group 2 to right • Example: Maze Learning: • Groups (I.V.): Adults vs Children • Measured Response (D.V.): Average number of Errors in series of Trials

  13. Example - Maze Learning (Average Errors) Stems: Integer parts Leaves: Decimal Parts

  14. Examining Distributions • Overall Pattern and Deviations • Shape: symmetric, stretched to one direction, multiple humps • Center: Typical values • Spread: Wide or narrow • Outlier: Individual whose value is far from others (see bottom right corner of previous slide) • May be due to data entry error, instrument malfunction, or the individual being unusual with respect to the others

  15. Time Plots - Variable Measured Over Time

  16. Time Plot with Trend/Seasonality

  17. Numeric Descriptions of Distributions • Measures of Central Tendency • Arithmetic Mean: Total equally divided among individual cases • Median: Midpoint of the distribution (M) • Measures of Spread (Dispersion) • Quartiles (first/third): Points that break out the smallest and largest 25% of distribution (Q1, Q3) • 5 Number Summary: (Minimum, Q1, M, Q3, Maximum) • Interquartile Range: IQR = Q3-Q1 • Boxplot: Graphical summary of 5 Number Summary • Variance: “Average” squared deviation from mean (s²) • Standard Deviation: Square root of variance (s)

  18. Measures of Central Tendency • Arithmetic Mean: Obtain the total by summing all values and divide by sample size (“equal allotment” among individuals) • Median: Midpoint of Distribution • Sort values from smallest to largest • If n odd, take the (n+1)/2 ordered value • If n even, take average of n/2 and (n/2)+1 ordered values

  19. 2005 Oscar Nominees (Best Picture) • Movie: Domestic Gross/Worldwide Gross • The Aviator: $103M / $214M • Finding Neverland: $52M / $116M • Million Dollar Baby: $100M / $216M • Ray: $75M / $97M • Sideways: $72M / $108M • Mean & Median Domestic Gross among nominees ($M): Mean = (103+52+100+75+72)/5 = 402/5 = 80.4, Median = 75 (the 3rd ordered value among 52, 72, 75, 100, 103)
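
The mean and median rules from the previous slide can be checked on the domestic gross figures with a minimal Python sketch. The function names are illustrative; Python's standard `statistics` module offers equivalent `mean` and `median` functions.

```python
def mean(values):
    """Sum all values and divide by the sample size."""
    return sum(values) / len(values)

def median(values):
    """Middle ordered value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # the (n+1)/2-th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of n/2 and (n/2)+1

# Domestic gross ($M) of the five 2005 Best Picture nominees, from the slide.
domestic_gross = [103, 52, 100, 75, 72]

print(mean(domestic_gross))    # 80.4
print(median(domestic_gross))  # 75
```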

  20. Delta Flight Times - ATL/MCO Oct,2004 • N=372 Flights 10/1/2004-10/31/2004 • Total actual time: 30536 Minutes • Mean Time: 30536/372 = 82.1 Minutes • Median: 372/2=186, (372/2)+1=187 • 186th and 187th ordered times are 81 minutes: M=81

  21. Measures of Spread • Quartiles: First (Q1, aka Lower) and Third (Q3, aka Upper) • Q1 is the median of the values below the median position • Q3 is the median of the values above the median position • Notes (see examples on next slide): • If n is odd, the median position is (n+1)/2, and this value is not included when finding the quartiles. • If n is even, the median position is treated (most commonly) as (n+1)/2, and the two values (positions) used to compute the median are included when finding the quartiles.

  22. Oscar Nominations: • # of Individuals: n=5 • Median Position: (5+1)/2=3 • Positions Below Median Position: 1-2 • Positions Above Median Position: 4-5 • Median of Lower Positions: 1.5 • Median of Upper Positions: 4.5 • ATL/MCO Flights: • # of Individuals: n=372 • Median Position: (372+1)/2=186.5 • Positions Below Median Position: 1-186 • Positions Above Median Position: 187-372 • Median of Lower Positions: 93.5 • Median of Upper Positions: 279.5
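
The position rules above can be sketched as a small Python function. This follows the slide's convention (exclude the median value when n is odd); other software may use different quartile definitions and give slightly different answers. The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the 'median of each half' rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # positions below the median position
    upper = ordered[half + (n % 2):]   # positions above it (skips the median if n is odd)
    return median(lower), median(ordered), median(upper)

# Oscar nominees' domestic gross ($M): Q1 averages positions 1-2, Q3 positions 4-5.
print(quartiles([103, 52, 100, 75, 72]))   # (62.0, 75, 101.5)

# Using the positions 1..372 as data reproduces the flight-example positions.
print(quartiles(list(range(1, 373))))      # (93.5, 186.5, 279.5)
```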

  23. Outliers - 1.5xIQR Rule • Outlier: Value that falls a long way from other values in the distribution • 1.5xIQR Rule: An observation may be considered an outlier if it falls more than 1.5 times the interquartile range above the third (upper) quartile or more than the same distance below the first (lower) quartile. • ATL/MCO Data: Q1=76 Q3=86 IQR=10 1.5xIQR=15 • “High” Outliers: Above 86+15=101 minutes • “Low” Outliers: Below 76-15=61 minutes • 12 Flights are at 102 minutes or more (Highest is 122). See (modified) boxplot below
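
A minimal sketch of the 1.5xIQR rule in Python, using the quartiles quoted on the slide (Q1=76, Q3=86). The function names `iqr_fences` and `is_outlier` are illustrative, and the individual flight times are not available in the transcript, so only the cut-offs are reproduced.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (low, high) cut-offs of the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True if x lies strictly beyond either cut-off."""
    low, high = iqr_fences(q1, q3, k)
    return x < low or x > high

low_fence, high_fence = iqr_fences(76, 86)
print(low_fence, high_fence)          # 61.0 101.0
print(is_outlier(102, 76, 86))        # True: above 101 minutes
print(is_outlier(101, 76, 86))        # False: exactly on the cut-off
```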

  24. Measures of Spread - Variance and S.D. • Deviation: Difference between an observed value and the overall mean (sign is important): xᵢ - x̄ • Variance: “Average” squared deviation; divides the sum of squared deviations by n-1 (as opposed to n) for reasons we see later: s² = Σ(xᵢ - x̄)² / (n-1) • Standard Deviation: Positive square root of s²: s = √(s²)

  25. Example - 2005 Oscar Movie Revenues • Mean: x̄ = 80.4 • The Aviator: i=1, x₁=103, Deviation: 103-80.4 = 22.6 • Finding Neverland: i=2, x₂=52, Deviation: 52-80.4 = -28.4 • Million Dollar Baby: i=3, x₃=100, Deviation: 100-80.4 = 19.6 • Ray: i=4, x₄=75, Deviation: 75-80.4 = -5.4 • Sideways: i=5, x₅=72, Deviation: 72-80.4 = -8.4
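
Carrying the deviations through to the variance and standard deviation can be sketched as follows. This is a plain-Python illustration of the n-1 formula rather than the slide's own computation; the function names are illustrative, and the standard library's `statistics.variance` and `statistics.stdev` give the same results.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Domestic gross ($M) of the five nominees.
domestic_gross = [103, 52, 100, 75, 72]

# Squared deviations: 510.76 + 806.56 + 384.16 + 29.16 + 70.56 = 1801.2
variance = sample_variance(domestic_gross)   # 1801.2 / 4, about 450.3
sd = sample_sd(domestic_gross)               # about 21.22
print(round(variance, 1), round(sd, 2))
```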

  26. Computer Output of Summary Measures and Boxplot (SPSS) - ATL/MCO Data

  27. Linear Transformations • Often work with transformed data • Linear Transformation: x_new = a + bx for constants a and b (e.g. transforming from metric to U.S. units, Celsius to Fahrenheit, etc.) • Effects: • Multiplying by b causes the mean to be multiplied by b and the standard deviation to be multiplied by |b| • Adding a shifts the mean and all percentiles by a but does not affect the standard deviation or spread • Note that for locations, multiplication by b precedes addition of a
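
The effects listed above can be checked numerically with a short sketch. The temperatures are hypothetical values chosen for illustration (they are not from the slides), and `linear_transform` is an illustrative helper; Celsius to Fahrenheit uses a = 32 and b = 1.8.

```python
from statistics import mean, stdev

def linear_transform(values, a, b):
    """Apply x_new = a + b*x to every value."""
    return [a + b * x for x in values]

# Hypothetical temperatures in degrees Celsius (not from the slides).
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
fahrenheit = linear_transform(celsius, a=32.0, b=1.8)

# Mean: multiplied by b, then shifted by a.
print(mean(celsius), mean(fahrenheit))
# Standard deviation: multiplied by |b|, unaffected by a.
print(stdev(celsius), stdev(fahrenheit))
```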

  28. Density Curves/Normal Distributions • Continuous (or practically continuous) variables that can lie along a continuous (practically) range of values • Obtain a histogram of data (will be irregular with rigid blocks as bars over ranges) • Density curves are smooth approximations (models) to the coarse histogram • Curve lies above the horizontal axis • Total area under curve is 1 • Area under the curve over a range of values represents the probability of that range • Normal Distributions - Family of density curves with very specific properties

  29. Mean and Median of a Density Curve • Mean is the balance point of a distribution of measurements. If the height of the curve represented weight, it is where the density curve would balance • Median is the point where half the area is below and half the area is above the point • Symmetric Densities: Mean = Median • Right Skew Densities: Mean > Median • Left Skew Densities: Mean < Median • We will mainly work with means. Notation:

  30. Symmetric (Normal) Distribution

  31. Right Skewed Density Curve

  32. Mean is the Balance Point

  33. Normal Distribution • Bell-shaped, symmetric family of distributions • Classified by 2 parameters: Mean (μ) and standard deviation (σ). These represent location and spread • Random variables that are approximately normal have the following properties with respect to individual measurements: • Approximately half (50%) fall above (and below) the mean • Approximately 68% fall within 1 standard deviation of the mean • Approximately 95% fall within 2 standard deviations of the mean • Virtually all fall within 3 standard deviations of the mean • Notation when X is normally distributed with mean μ and standard deviation σ: X ~ N(μ, σ)
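
The 68% / 95% / "virtually all" figures can be verified with Python's standard-library `statistics.NormalDist`. This is only a numerical check; `prob_within` is an illustrative helper name, not part of the slides.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma <= X <= mu + k*sigma) for X ~ N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# Roughly 0.6827, 0.9545 and 0.9973 for k = 1, 2, 3.
```

The result does not depend on the particular mean and standard deviation, which is why the rule applies to every member of the normal family.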

  34. Two Normal Distributions

  35. Normal Distribution

  36. Example - Heights of U.S. Adults • Female and Male adult heights (in inches) are well approximated by normal distributions: X_F ~ N(63.7, 2.5), X_M ~ N(69.1, 2.6) • Source: Statistical Abstract of the U.S. (1992)

  37. Standard Normal (Z) Distribution • Problem: Unlimited number of possible normal distributions (-∞ < μ < ∞, σ > 0) • Solution: Standardize the random variable to have mean 0 and standard deviation 1: Z = (X - μ)/σ • Probabilities of certain ranges of values and specific percentiles of interest can be obtained through the standard normal (Z) distribution

  38. Standard Normal (Z) Distribution (figure: the table area is the area under the curve to the left of z; the area to the right of z is 1 - table area)

  39. Standard Normal (Z) Table (rows: integer part and 1st decimal place of z; columns: 2nd decimal place)

  40. Standard Normal (Z) Table (rows: integer part and 1st decimal place of z; columns: 2nd decimal place)

  41. Finding Probabilities of Specific Ranges • Step 1 - Identify the normal distribution of interest (i.e. its mean (μ) and standard deviation (σ)) • Step 2 - Identify the range of values that you wish to determine the probability of observing (XL, XU), where often the upper or lower bound is ∞ or -∞ • Step 3 - Transform XL and XU into Z-values: ZL = (XL - μ)/σ, ZU = (XU - μ)/σ • Step 4 - Obtain P(ZL ≤ Z ≤ ZU) from Z-table

  42. Example - Adult Female Heights • What is the probability a randomly selected female is 5’10” (70 inches) or taller? • Step 1 - X ~ N(63.7, 2.5) • Step 2 - XL = 70.0, XU = ∞ • Step 3 - ZL = (70.0 - 63.7)/2.5 = 2.52, ZU = ∞ • Step 4 - P(X ≥ 70) = P(Z ≥ 2.52) = 1 - P(Z ≤ 2.52) = 1 - .9941 = .0059 (≈ 1/170)
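
The four-step procedure can be sketched in Python, with `statistics.NormalDist` standing in for the printed Z-table. The function name `normal_range_probability` is illustrative, and the software value differs slightly from the table value because the table rounds to four decimals.

```python
import math
from statistics import NormalDist

def normal_range_probability(mu, sigma, lower=-math.inf, upper=math.inf):
    """P(lower <= X <= upper) for X ~ N(mu, sigma), via standardized Z-values."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1

    def table_area(x):
        # Area to the left of the Z-value for x; infinite bounds handled explicitly.
        if x == math.inf:
            return 1.0
        if x == -math.inf:
            return 0.0
        return z.cdf((x - mu) / sigma)   # Step 3: transform to a Z-value

    return table_area(upper) - table_area(lower)   # Step 4

# Steps 1-2: X ~ N(63.7, 2.5), range [70, infinity)
p_tall = normal_range_probability(63.7, 2.5, lower=70.0)
print(round(p_tall, 4))   # 0.0059
```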

  43. Finding Percentiles of a Distribution • Step 1 - Identify the normal distribution of interest (i.e. its mean (μ) and standard deviation (σ)) • Step 2 - Determine the percentile of interest 100p% (e.g. the 90th percentile is the cut-off where 90% of scores are below and 10% are above). • Step 3 - Find p in the body of the z-table and its corresponding z-value (zp) on the outer edge: • If 100p < 50 then use left-hand page of table • If 100p ≥ 50 then use right-hand page of table • Step 4 - Transform zp back to original units: xp = μ + zp·σ

  44. Example - Adult Male Heights • Above what height do the tallest 5% of males lie? • Step 1 - X ~ N(69.1, 2.6) • Step 2 - Want to determine 95th percentile (p = .95) • Step 3 - P(Z ≤ 1.645) = .95, so z.95 = 1.645 • Step 4 - x.95 = 69.1 + (1.645)(2.6) = 73.4 inches (6’1.4”)
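
The percentile steps can be sketched the same way, with `NormalDist().inv_cdf` playing the role of the reverse table lookup. The function name `normal_percentile` is illustrative.

```python
from statistics import NormalDist

def normal_percentile(p, mu, sigma):
    """Return the value x_p with P(X <= x_p) = p for X ~ N(mu, sigma)."""
    z_p = NormalDist().inv_cdf(p)   # Step 3: z-value with area p to its left
    return mu + z_p * sigma         # Step 4: back to original units

# Steps 1-2: male heights X ~ N(69.1, 2.6), 95th percentile
x_95 = normal_percentile(0.95, 69.1, 2.6)
print(round(x_95, 1))   # 73.4
```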

  45. Statistical Models • When making statistical inference it is useful to write random variables in terms of model parameters and random errors: X = μ + ε • Here μ is a fixed constant and ε is a random variable • In practice μ will be unknown, and we will use sample data to estimate or make statements regarding its value
