Chapter1 Looking at Data - Distributions

Introduction • Goal: Using Data to Gain Knowledge • Terms/Definitions: • Individuals: Units described by or used to obtain data, such as humans, animals, objects (aka experimental or sampling units) • Variables: Characteristics corresponding to individuals that can take on different values among individuals • Categorical Variable: Levels correspond to one of several groups or categories • Quantitative Variable: Takes on numeric values such that arithmetic operations make sense

Introduction • Spreadsheets for Statistical Analyses • Rows: Represent Individuals • Columns: Represent Variables • SPSS, Minitab, EXCEL are examples • Measuring Variables • Instrument: Tool used to make quantitative measurement on subjects (e.g. psychological test or physical fitness measurement) • Independent and Dependent Variables • Independent Variable: Describes the group an individual comes from (categorical) or its level (quantitative) prior to observation • Dependent Variable: Random outcome of interest

Independent and Dependent Variables • Dependent variables are also called response variables • Independent variables are also called explanatory variables • Marketing: Does amount of exposure affect attitudes? • I.V.: Exposure (in time or number), different subjects receive different levels • D.V.: Measurement of liking of a product or brand • Medicine: Does a new drug reduce heart disease? • I.V.: Treatment (Active Drug vs Placebo) • D.V.: Presence/Absence of heart disease in a time period • Psychology/Finance: Risk Perceptions • I.V.: Framing of Choice (Loss vs Gain) • D.V.: Choice Taken (Risky vs Certain)

Rates and Proportions • Categorical Variables: Typically we count the number with some characteristic in a group of individuals. • The actual count is not a useful summary. More useful summaries include: • Proportion: The number with the characteristic divided by the group size (will lie between 0 and 1) • Percent: # with characteristic per 100 individuals (proportion*100) • Rate per 100,000: proportion*100,000
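A minimal sketch of the three summaries above. The function name and the example figures (30 individuals with the characteristic in a group of 1,200) are made up for illustration and do not come from the slides.

```python
def summarize_count(count, group_size):
    """Return (proportion, percent, rate per 100,000) for a count within a group."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    proportion = count / group_size      # lies between 0 and 1
    percent = proportion * 100           # number per 100 individuals
    rate = proportion * 100_000          # number per 100,000 individuals
    return proportion, percent, rate


# Hypothetical example: 30 of 1,200 individuals have the characteristic.
proportion, percent, rate = summarize_count(30, 1200)
print(proportion, percent, rate)
```

Unlike the raw count, all three summaries can be compared across groups of different sizes.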

Graphical Displays of Distributions • Graphs of Categorical Variables • Bar Graph: Horizontal axis defines the various categories, heights of bars represent numbers of individuals • Pie Chart: Breaks down a circle (pie) such that the sizes of the slices represent the numbers or percentages of individuals in the categories.

Example - AAA Ratings of FL Hotels (Bar Chart)

Example - AAA Ratings of FL Hotels (Pie Chart)

Graphical Displays of Distributions • Graphs of Numeric Variables • Stemplot: Crude, but quick method of displaying the entire set of data and observing shape of distribution • Stem: All but rightmost digit, Leaf: Rightmost Digit • Put stems in vertical column (small at top), draw vertical line • Put leaves in appropriate row in increasing order from stem • Histogram: Breaks data into equally spaced ranges on horizontal axis, heights of bars represent frequencies or percentages

Example: Time (Hours/Year) Lost to Traffic • Stems: 10s of hours • Leaves: Hours • Step 1: Stems: 1 2 3 4 5 • Step 2: Stems and Leaves:
1 | 48
2 | 01244699
3 | 0112244457778
4 | 122222245566
5 | 0336
Step 3: [figure not reproduced] • Source: Texas Transportation Institute (5/7/2001).
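The construction steps can be sketched in a few lines of Python. The `hours` list is read back from the stems and leaves of this example; the function name `stemplot` is illustrative only.

```python
from collections import defaultdict

# Hours lost per year, reconstructed from the stems and leaves of the example.
hours = [14, 18,
         20, 21, 22, 24, 24, 26, 29, 29,
         30, 31, 31, 32, 32, 34, 34, 34, 35, 37, 37, 37, 38,
         41, 42, 42, 42, 42, 42, 42, 44, 45, 45, 46, 46,
         50, 53, 53, 56]


def stemplot(values):
    """Return stemplot rows: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):             # sorting puts leaves in increasing order
        leaves[v // 10].append(v % 10)
    rows = []
    # Walk every stem from smallest to largest so empty stems still get a row.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
    return rows


rows = stemplot(hours)
print("\n".join(rows))
```

This sketch assumes non-negative integer data with a one-digit leaf; other data would first need rounding or rescaling.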

Example: Time (Hours/Year) Lost to Traffic - EXCEL Output • Note: in the histogram, each bin is labeled by its upper limit and includes that value (e.g. T ≤ 14, 14 < T ≤ 21, …, 42 < T ≤ 49, T > 49)
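A rough sketch of this "up to and including" binning rule, applied to the traffic data. The slide only names the edges 14, 21, 42 and 49; the intermediate edges 28 and 35 are an assumption (equal 7-hour steps), and `bin_counts` is an illustrative name, not an Excel function.

```python
from bisect import bisect_left

# Hours lost per year, reconstructed from the stemplot example.
hours = [14, 18,
         20, 21, 22, 24, 24, 26, 29, 29,
         30, 31, 31, 32, 32, 34, 34, 34, 35, 37, 37, 37, 38,
         41, 42, 42, 42, 42, 42, 42, 44, 45, 45, 46, 46,
         50, 53, 53, 56]

# Upper bin limits; 28 and 35 are assumed, the slide elides them with "…".
upper_edges = [14, 21, 28, 35, 42, 49]


def bin_counts(values, edges):
    """Count values per bin, where each bin includes its upper edge."""
    counts = [0] * (len(edges) + 1)      # last slot holds values above the last edge
    for v in values:
        # bisect_left returns the first edge >= v, so v == edge lands in that bin.
        counts[bisect_left(edges, v)] += 1
    return counts


counts = bin_counts(hours, upper_edges)
labels = [f"<= {e}" for e in upper_edges] + [f"> {upper_edges[-1]}"]
for label, count in zip(labels, counts):
    print(f"{label:>6}: {count}")
```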

Comparing 2 Groups - Back-to-back Stemplots • Places Stems in Middle, group 1 to left, group 2 to right • Example: Maze Learning: • Groups (I.V.): Adults vs Children • Measured Response (D.V.): Average number of Errors in series of Trials

Example - Maze Learning (Average Errors) Stems: Integer parts Leaves: Decimal Parts

Examining Distributions • Overall Pattern and Deviations • Shape: symmetric, stretched to one direction, multiple humps • Center: Typical values • Spread: Wide or narrow • Outlier: Individual whose value is far from others (see bottom right corner of previous slide) • May be due to data entry error, instrument malfunction, or the individual being unusual with respect to others

Time Plots - Variable Measured Over Time

Time Plot with Trend/Seasonality

Numeric Descriptions of Distributions • Measures of Central Tendency • Arithmetic Mean: Total equally divided among individual cases • Median: Midpoint of the distribution (M) • Measures of Spread (Dispersion) • Quartiles (first/third): Points that cut off the smallest and largest 25% of the distribution (Q1, Q3) • 5 Number Summary: (Minimum, Q1, M, Q3, Maximum) • Interquartile Range: IQR = Q3 - Q1 • Boxplot: Graphical summary of 5 Number Summary • Variance: “Average” squared deviation from mean (s^2) • Standard Deviation: Square root of variance (s)

Measures of Central Tendency • Arithmetic Mean: Obtain the total by summing all values and divide by sample size (“equal allotment” among individuals) • Median: Midpoint of Distribution • Sort values from smallest to largest • If n odd, take the (n+1)/2 ordered value • If n even, take average of n/2 and (n/2)+1 ordered values

2005 Oscar Nominees (Best Picture) • Movie: Domestic Gross/Worldwide Gross • The Aviator: $103M / $214M • Finding Neverland: $52M / $116M • Million Dollar Baby: $100M / $216M • Ray: $75M / $97M • Sideways: $72M / $108M • Mean & Median Domestic Gross among nominees ($M): Mean = (103+52+100+75+72)/5 = 402/5 = 80.4 • Median = 75 (the 3rd of the ordered values 52, 72, 75, 100, 103)
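As a check on the domestic gross figures, here is a short sketch that applies the mean and median rules from the previous slide. The dictionary and function names are illustrative.

```python
# Domestic gross in $M, taken from the slide.
domestic_gross = {
    "The Aviator": 103,
    "Finding Neverland": 52,
    "Million Dollar Baby": 100,
    "Ray": 75,
    "Sideways": 72,
}


def mean(values):
    """Total divided equally among the individual cases."""
    return sum(values) / len(values)


def median(values):
    """Middle ordered value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                 # the (n+1)/2-th ordered value
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # n/2-th and (n/2+1)-th values


gross = list(domestic_gross.values())
print("mean:", mean(gross))
print("median:", median(gross))
```

The mean (80.4) sits above the median (75) because the two highest-grossing films pull the total upward.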

Delta Flight Times - ATL/MCO Oct 2004 • N=372 Flights 10/1/2004-10/31/2004 • Total actual time: 30536 Minutes • Mean Time: 30536/372 = 82.1 Minutes • Median (n even): average of the ordered values at positions 372/2=186 and (372/2)+1=187 • 186th and 187th ordered times are both 81 minutes: M=81

Measures of Spread • Quartiles: First (Q1, aka Lower) and Third (Q3, aka Upper) • Q1 is the median of the values below the median position • Q3 is the median of the values above the median position • Notes (see examples on next page): • If n is odd, median position is (n+1)/2, and finding quartiles does not include this value. • If n is even, median position is treated (most commonly) as (n+1)/2, and the two values (positions) used to compute the median are included when finding the quartiles, one in each half.

Oscar Nominations: • # of Individuals: n=5 • Median Position: (5+1)/2=3 • Positions Below Median Position: 1-2 • Positions Above Median Position: 4-5 • Median of Lower Positions: 1.5 • Median of Upper Positions: 4.5 • ATL/MCO Flights: • # of Individuals: n=372 • Median Position: (372+1)/2=186.5 • Positions Below Median Position: 1-186 • Positions Above Median Position: 187-372 • Median of Lower Positions: 93.5 • Median of Upper Positions: 279.5

Outliers - 1.5xIQR Rule • Outlier: Value that falls a long way from other values in the distribution • 1.5xIQR Rule: An observation may be considered an outlier if it falls more than 1.5 times the interquartile range above the third (upper) quartile or more than the same distance below the first (lower) quartile. • ATL/MCO Data: Q1=76 Q3=86 IQR=10 1.5xIQR=15 • “High” Outliers: Above 86+15=101 minutes • “Low” Outliers: Below 76-15=61 minutes • 12 Flights are at 102 minutes or more (Highest is 122). See (modified) boxplot below
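The quartile convention (median of the values below and above the median position) and the 1.5xIQR rule can be sketched as follows. Function names are illustrative. The Oscar quartiles are computed here and do not appear on the slides; the flight fences use the Q1 and Q3 reported above, since the 372 individual flight times are not available.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2                        # drops the middle value when n is odd
    return median(ordered[:half]), median(ordered[n - half:])


def outlier_fences(q1, q3):
    """Cut-offs beyond which an observation may be considered an outlier."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Oscar domestic gross ($M): positions 1-2 and 4-5 form the two halves.
q1, q3 = quartiles([103, 52, 100, 75, 72])
print("Oscar Q1, Q3:", q1, q3)

# ATL/MCO flights: fences from the reported quartiles Q1=76, Q3=86.
low, high = outlier_fences(76, 86)
print("Flight fences:", low, high)
```

Other software may use slightly different quartile conventions, so quartiles from SPSS or EXCEL can differ a little from this rule.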

Measures of Spread - Variance and S.D. • Deviation: Difference between an observed value and the overall mean (sign is important): $x_i - \bar{x}$ • Variance: “Average” squared deviation (divides the sum of squared deviations by n-1, as opposed to n, for reasons we see later): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ • Standard Deviation: Positive square root of the variance: $s = \sqrt{s^2}$

Example - 2005 Oscar Movie Revenues • Mean: x̄ = 80.4 • The Aviator: i=1, x_1 = 103, Deviation: 103-80.4 = 22.6 • Finding Neverland: i=2, x_2 = 52, Deviation: 52-80.4 = -28.4 • Million Dollar Baby: i=3, x_3 = 100, Deviation: 100-80.4 = 19.6 • Ray: i=4, x_4 = 75, Deviation: 75-80.4 = -5.4 • Sideways: i=5, x_5 = 72, Deviation: 72-80.4 = -8.4
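Carrying these deviations through to the variance and standard deviation might look like the sketch below. The resulting values are computed here, not quoted from the slides, and the variable names are illustrative.

```python
import math

# Domestic gross ($M) for the five nominees, in slide order.
gross = [103, 52, 100, 75, 72]

n = len(gross)
x_bar = sum(gross) / n                                    # sample mean
deviations = [x - x_bar for x in gross]                   # signed deviations from the mean
variance = sum(d ** 2 for d in deviations) / (n - 1)      # divide by n-1, not n
std_dev = math.sqrt(variance)                             # positive square root

print("deviations:", [round(d, 1) for d in deviations])
print("variance:", round(variance, 1))
print("std dev:", round(std_dev, 2))
```

The deviations always sum to zero (up to rounding), which is why they are squared before averaging.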

Computer Output of Summary Measures and Boxplot (SPSS) - ATL/MCO Data

Linear Transformations • Often work with transformed data • Linear Transformation: x_new = a + bx for constants a and b (e.g. transforming from metric system to U.S., Celsius to Fahrenheit, etc.) • Effects: • Multiplying by b causes the mean to be multiplied by b and the standard deviation to be multiplied by |b| • Addition of a shifts the mean and all percentiles by a but does not affect the standard deviation or spread • Note that for locations, multiplication by b precedes addition of a
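A small numerical check of these effects, using the Celsius to Fahrenheit conversion (a = 32, b = 1.8). The temperature values are made up for illustration.

```python
import math
import statistics

# Hypothetical temperatures in degrees Celsius (not from the slides).
celsius = [10, 15, 20, 25, 30]

a, b = 32, 1.8                                   # Fahrenheit = 32 + 1.8 * Celsius
fahrenheit = [a + b * x for x in celsius]

mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)

# The mean is multiplied by b and then shifted by a.
print(mean_f, a + b * mean_c)
# The standard deviation is multiplied by |b| and is unaffected by a.
print(sd_f, abs(b) * sd_c)
```

`statistics.stdev` uses the n-1 divisor, matching the definition of s in these slides.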

Density Curves/Normal Distributions • Continuous (or practically continuous) variables that can lie along a continuous (practically) range of values • Obtain a histogram of data (will be irregular with rigid blocks as bars over ranges) • Density curves are smooth approximations (models) to the coarse histogram • Curve lies on or above the horizontal axis • Total area under curve is 1 • Area under the curve over a range of values represents the probability of that range • Normal Distributions - Family of density curves with very specific properties

Mean and Median of a Density Curve • Mean is the balance point of a distribution of measurements. If the height of the curve represented weight, it's where the density curve would balance • Median is the point where half the area is below and half the area is above the point • Symmetric Densities: Mean = Median • Right Skew Densities: Mean > Median • Left Skew Densities: Mean < Median • We will mainly work with means. Notation:

Symmetric (Normal) Distribution

Right Skewed Density Curve

Mean is the Balance Point

Normal Distribution • Bell-shaped, symmetric family of distributions • Classified by 2 parameters: Mean (μ) and standard deviation (σ). These represent location and spread • Random variables that are approximately normal have the following properties with respect to individual measurements: • Approximately half (50%) fall above (and below) mean • Approximately 68% fall within 1 standard deviation of mean • Approximately 95% fall within 2 standard deviations of mean • Virtually all fall within 3 standard deviations of mean • Notation when X is normally distributed with mean μ and standard deviation σ: X ~ N(μ, σ)
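The "68%, 95%, virtually all" properties can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `within_k_sd` is illustrative.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)


def within_k_sd(k):
    """Probability that a normal value falls within k standard deviations of its mean."""
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {within_k_sd(k):.4f}")
```

Because any normal variable can be standardized, the same three probabilities hold for every mean and standard deviation.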

Two Normal Distributions

Normal Distribution

Example - Heights of U.S. Adults • Female and male adult heights (in inches) are well approximated by normal distributions: X_F ~ N(63.7, 2.5), X_M ~ N(69.1, 2.6) • Source: Statistical Abstract of the U.S. (1992)

Standard Normal (Z) Distribution • Problem: Unlimited number of possible normal distributions (-∞ < μ < ∞, σ > 0) • Solution: Standardize the random variable to have mean 0 and standard deviation 1: $Z = \frac{X - \mu}{\sigma}$ • Probabilities of certain ranges of values and specific percentiles of interest can be obtained through the standard normal (Z) distribution

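Standard Normal (Z) Distribution • Figure: the table area is the area under the curve to the left of z; the area to the right of z is 1 - table area

Z-Table Layout • Rows: integer part and 1st decimal place of z • Columns: 2nd decimal place of z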

Finding Probabilities of Specific Ranges • Step 1 - Identify the normal distribution of interest (e.g. its mean (μ) and standard deviation (σ)) • Step 2 - Identify the range of values that you wish to determine the probability of observing (X_L, X_U), where often the upper or lower bound is ∞ or -∞ • Step 3 - Transform X_L and X_U into Z-values: $Z_L = \frac{X_L - \mu}{\sigma}$, $Z_U = \frac{X_U - \mu}{\sigma}$ • Step 4 - Obtain P(Z_L ≤ Z ≤ Z_U) from Z-table

Example - Adult Female Heights • What is the probability a randomly selected female is 5’10” (70 inches) or taller? • Step 1 - X ~ N(63.7, 2.5) • Step 2 - X_L = 70.0, X_U = ∞ • Step 3 - Z_L = (70.0 - 63.7)/2.5 = 2.52, Z_U = ∞ • Step 4 - P(X ≥ 70) = P(Z ≥ 2.52) = 1 - P(Z ≤ 2.52) = 1 - .9941 = .0059 (≈ 1/170)
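The same four steps can be reproduced without a printed table. This is a sketch using `statistics.NormalDist` (Python 3.8 or later) in place of the Z-table; variable names are illustrative.

```python
from statistics import NormalDist

# Step 1: distribution of adult female heights, in inches.
mu, sigma = 63.7, 2.5

# Step 2: range of interest is 70 inches or taller (upper bound is infinity).
x_lower = 70.0

# Step 3: transform the lower bound into a Z-value.
z_lower = (x_lower - mu) / sigma

# Step 4: P(Z >= z_lower) = 1 - P(Z <= z_lower), using the standard normal CDF.
prob = 1 - NormalDist().cdf(z_lower)

print(f"z = {z_lower:.2f}, P(X >= 70) = {prob:.4f}")
```

The computed probability agrees with the table-based answer to four decimal places.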

Finding Percentiles of a Distribution • Step 1 - Identify the normal distribution of interest (e.g. its mean (μ) and standard deviation (σ)) • Step 2 - Determine the percentile of interest 100p% (e.g. the 90th percentile is the cut-off where 90% of scores are below and 10% are above) • Step 3 - Find p in the body of the z-table and its corresponding z-value (z_p) on the outer edge: • If 100p < 50 then use left-hand page of table • If 100p ≥ 50 then use right-hand page of table • Step 4 - Transform z_p back to original units: $x_p = \mu + z_p\sigma$

Example - Adult Male Heights • Above what height do the tallest 5% of males lie? • Step 1 - X ~ N(69.1, 2.6) • Step 2 - Want to determine 95th percentile (p = .95) • Step 3 - P(Z ≤ 1.645) = .95 • Step 4 - X_.95 = 69.1 + (1.645)(2.6) = 73.4 inches (6’1.4”)
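A sketch of the percentile steps in Python, with `statistics.NormalDist.inv_cdf` (Python 3.8 or later) standing in for the reverse table lookup. Variable names are illustrative.

```python
from statistics import NormalDist

# Step 1: distribution of adult male heights, in inches.
mu, sigma = 69.1, 2.6

# Step 2: the tallest 5% lie above the 95th percentile.
p = 0.95

# Step 3: z-value with area p to its left under the standard normal curve.
z_p = NormalDist().inv_cdf(p)

# Step 4: transform back to original units.
x_p = mu + z_p * sigma

print(f"z_p = {z_p:.3f}, 95th percentile = {x_p:.1f} inches")
```

Calling `NormalDist(mu, sigma).inv_cdf(p)` directly gives the same height in one step.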

Statistical Models • When making statistical inference it is useful to write random variables in terms of model parameters and random errors: $X = \mu + \varepsilon$ • Here μ is a fixed constant and ε is a random variable • In practice μ will be unknown, and we will use sample data to estimate or make statements regarding its value
