IRCCS San Raffaele Pisana, Rome, Italy, 28 February - 2 March 2018

Describing data (types of data, data visualization, descriptive statistics, concept of sample and population)

Descriptive Statistics
• Summarize, organize, or reduce large numbers of observations
• Sometimes called summary statistics

Data may be collected as quantitative or as qualitative! Questions may be closed or open!
• What is your working status?
  • Housewife with no paid job outside the home
  • In part-time employment (less than 25 hours per week)
  • In full-time employment
  • Unemployed, seeking work
  • Retired due to disability or illness
  • Retired for other causes
  • Other (please specify)
• How many years have you lived in this town?
  • Less than 5
  • 5-9
  • 10-19
  • 20-29
  • 30-39
  • 40 or more
• Your weight: ............... kg

Data may be collected as quantitative or as qualitative! Questions may be closed or open!
How much stress have you had in the last month with (circle a number):

                        None  A little  Some  Much  Very much
Your spouse?              1       2       3     4       5
Friends?                  1       2       3     4       5
Your boss?                1       2       3     4       5
Your health?              1       2       3     4       5
Your mother-in-law?       1       2       3     4       5

Qualitative data are summarized as frequencies; the proportions or percentages of the appropriate totals are called relative frequencies. Example: 2 aberrant cells out of 103 metaphases scored (2/103 ≈ 1.9%).
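As a minimal sketch of how absolute and relative frequencies are obtained, the snippet below rebuilds the slide's example of 2 aberrant cells among 103 metaphases. The list `cells` and the labels "aberrant" and "normal" are illustrative assumptions, not part of the original data file.

```python
from collections import Counter

# Hypothetical raw data: 103 metaphases scored, 2 of them aberrant (as on the slide).
cells = ["aberrant"] * 2 + ["normal"] * 101

counts = Counter(cells)          # absolute frequencies per category
total = sum(counts.values())     # the appropriate total (103 metaphases)

# Relative frequency = absolute frequency divided by the total.
relative = {category: n / total for category, n in counts.items()}

for category, n in counts.items():
    print(f"{category}: {n}/{total} = {relative[category]:.1%}")
```

The same pattern applies to any categorical variable, such as the working-status question: count each category, then divide by the number of respondents.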

Measures of Central Tendency
• What is a measure of central tendency?
• Measures of Central Tendency
  • Mode
  • Median
  • Mean
• Shape of the Distribution
• Considerations for Choosing an Appropriate Measure of Central Tendency

What is a measure of Central Tendency? • Numbers that describe what is average or typical of the distribution • You can think of this value as where the middle of a distribution lies.

The Mode • The category or score with the largest frequency (or percentage) in the distribution. • The mode can be calculated for variables with levels of measurement that are: nominal, ordinal, or interval-ratio.

The Mode: An Example
• Example: Number of Votes for Candidates for Mayor. The mode, in this case, gives you the “central” response of the voters: the most popular candidate.
  • Candidate A – 11,769 votes
  • Candidate B – 39,443 votes
  • Candidate C – 78,331 votes
• The Mode: “Candidate C”
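For a nominal variable like this one, the mode is simply the category with the largest count. A small sketch using the vote totals from the example (the dictionary name `votes` is an assumption made for illustration):

```python
# Vote counts from the mayoral example: category -> frequency.
votes = {"Candidate A": 11_769, "Candidate B": 39_443, "Candidate C": 78_331}

# The mode is the category with the largest frequency.
mode = max(votes, key=votes.get)

# Its percentage of all votes cast, for reporting alongside the mode.
total_votes = sum(votes.values())
mode_share = votes[mode] / total_votes

print(f"Mode: {mode} ({mode_share:.1%} of {total_votes} votes)")
```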

The Median • The score that divides the distribution into two equal parts, so that half the cases are above it and half below it. • The median is the middle score, or average of middle scores in a distribution.

Median Exercise #1 (N is odd)
Calculate the median for this hypothetical distribution:

Job Satisfaction   Frequency
1) Very High           2
2) High                3
3) Moderate            5
4) Low                 7
5) Very Low            4
TOTAL                 21

Ordered scores: 1 1 2 2 2 3 3 3 3 3 [4] 4 4 4 4 4 4 5 5 5 5
The middle (11th) score is 4, so the median is “Low”.
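The exercise can be checked with a short sketch: expand the frequency table into the ordered list of scores and pick the middle one. The manual step below assumes N is odd, as in this exercise; `statistics.median` from the Python standard library handles both odd and even N.

```python
import statistics

# Frequency table from the exercise: score code -> number of respondents.
frequencies = {1: 2, 2: 3, 3: 5, 4: 7, 5: 4}

# Expand the table into the ordered list of individual scores.
scores = [score for score, f in sorted(frequencies.items()) for _ in range(f)]

n = len(scores)
middle_position = (n + 1) // 2          # position of the middle score when N is odd
median = scores[middle_position - 1]    # Python lists are 0-indexed

print(f"N = {n}, middle position = {middle_position}, median = {median}")

# Cross-check with the standard library; the mode is shown for comparison.
print(statistics.median(scores), statistics.mode(scores))
```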

The Mean • The arithmetic average obtained by adding up all the scores and dividing by the total number of scores.

Formula for the Mean: $\bar{Y} = \frac{\sum Y}{N}$. “Y bar” equals the sum of all the scores, Y, divided by the number of scores, N.

Mean: Grouped Scores
Calculating the mean with grouped scores: $\bar{Y} = \frac{\sum fY}{N}$, where $fY$ is a score multiplied by its frequency and $N$ is the total number of scores.
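A minimal sketch of the grouped-score calculation follows. The frequency table is hypothetical (it is not the table from the original slide); each score Y is multiplied by its frequency f, the products are summed, and the sum is divided by N.

```python
# Hypothetical grouped data (illustrative only): score Y -> frequency f.
grouped = {0: 3, 1: 5, 2: 7, 3: 4, 4: 1}

N = sum(grouped.values())                          # total number of scores
sum_fY = sum(f * y for y, f in grouped.items())    # sum of (frequency x score)
mean = sum_fY / N

print(f"N = {N}, sum of fY = {sum_fY}, mean = {mean}")

# Same result from the ungrouped scores, as a check on the formula.
ungrouped = [y for y, f in grouped.items() for _ in range(f)]
assert sum(ungrouped) / len(ungrouped) == mean
```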


Mean = 4.6 Median = 5.0 Mode = 4.0

Shape of the Distribution
• Symmetrical (mean is about equal to median)
• Skewed
  • Negatively (example: years of education): mean < median
  • Positively (example: income): mean > median
• Bimodal (two distinct modes)
• Multi-modal (more than 2 distinct modes)

Distribution Shape

Considerations for Choosing a Measure of Central Tendency • For a nominal variable, the mode is the only measure that can be used. • For ordinal variables, the mode and the median may be used. The median provides more information (taking into account the ranking of categories.) • For interval-ratio variables, the mode, median, and mean may all be calculated. The mean provides the most information about the distribution, but the median is preferred if the distribution is skewed.

Central Tendency

Measures of Dispersion
Measures of central tendency alone are not adequate to describe data: two data sets can have the same mean and still be entirely different. To describe data, one also needs to know the extent of variability, which is given by the measures of dispersion. Range, interquartile range, and standard deviation are the three commonly used measures of dispersion.
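To illustrate the point that equal means can hide very different data, here is a small sketch with two made-up data sets (`a` and `b` are hypothetical values chosen for illustration). Both have the same mean, but their range and standard deviation differ widely.

```python
import statistics

# Two hypothetical data sets with the same mean but different variability.
a = [48, 49, 50, 51, 52]
b = [10, 30, 50, 70, 90]

mean_a, mean_b = statistics.mean(a), statistics.mean(b)
range_a, range_b = max(a) - min(a), max(b) - min(b)

# pstdev divides by the number of observations, matching the slide's definition of SD.
sd_a, sd_b = statistics.pstdev(a), statistics.pstdev(b)

print(f"a: mean={mean_a}, range={range_a}, SD={sd_a:.2f}")
print(f"b: mean={mean_b}, range={range_b}, SD={sd_b:.2f}")
```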

Range
The range is the difference between the largest and the smallest observation in the data. Its prime advantage is that it is easy to calculate. On the other hand, it has several disadvantages: it is very sensitive to outliers and does not use all the observations in a data set.[1] It is more informative to report the minimum and the maximum values than the range alone.

Interquartile Range
The interquartile range is defined as the difference between the 75th and 25th percentiles (also called the third and first quartiles). Hence it describes the middle 50% of observations. A large interquartile range means that the middle 50% of observations are spread widely apart. An important advantage is that it can be used as a measure of variability even if the extreme values are not recorded exactly (as with open-ended class intervals in a frequency distribution). Another advantage is that it is not affected by extreme values. The main disadvantage of the interquartile range as a measure of dispersion is that it is not amenable to mathematical manipulation.
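The contrast between the range and the interquartile range can be sketched as follows. The data are hypothetical, and the quartiles use the default method of Python's `statistics.quantiles`; other software may use a slightly different quartile convention and give slightly different values.

```python
import statistics

def value_range(data):
    """Difference between the largest and smallest observation."""
    return max(data) - min(data)

def iqr(data):
    """Third quartile minus first quartile (default 'exclusive' method)."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

# Hypothetical observations, and the same data with one extreme value.
clean = [2, 4, 4, 5, 6, 7, 8, 9, 10]
with_outlier = [2, 4, 4, 5, 6, 7, 8, 9, 100]

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name}: range={value_range(data)}, IQR={iqr(data)}")
```

In this sketch the single extreme value inflates the range while leaving the interquartile range unchanged.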

Standard Deviation
Standard deviation (SD) is the most commonly used measure of dispersion. It is a measure of the spread of data about the mean. SD is the square root of the sum of squared deviations from the mean divided by the number of observations.

• Deviation from the mean: $x_i - \bar{x}$
• Sum of the deviations: $\sum (x_i - \bar{x}) = 0$
• Sum of the squared deviations: $\sum (x_i - \bar{x})^2$
• Variance: $\sum (x_i - \bar{x})^2 / N$

Variance and Standard Deviation
• Variance
  • A measure of dispersion among individual observations about their average value
  • Computed before the standard deviation
• Standard deviation
  • Another measure of dispersion
  • For normally distributed data, about 68% of observations should be within ± 1 standard deviation of the mean
  • 95% will be within ± 1.96 standard deviations

The extent of variability can be described using deviations from the mean, i.e., $x_i - \bar{x}$.
The mean of the deviations from the mean is useless, since some of these will be negative and some positive, i.e., $\sum (x_i - \bar{x}) = 0$.
A practical way to get out of this problem is to square the differences and calculate a mean value of the squared differences, i.e., $\frac{\sum (x_i - \bar{x})^2}{n}$ = Variance.
To have measures in the original units (and not squared) we generally use the standard deviation, i.e., $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$ = Standard Deviation.
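The steps from deviations to standard deviation can be sketched in a few lines. The data values are hypothetical. The code divides by the number of observations, as in the definition given here; many software packages report the sample version, which divides by n - 1 instead (`statistics.stdev` in Python).

```python
import math
import statistics

# Hypothetical observations (illustrative only).
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
mean = sum(x) / n

# Step 1: deviations from the mean; they always sum to zero.
deviations = [xi - mean for xi in x]

# Step 2: square the deviations so negatives and positives no longer cancel.
squared = [d ** 2 for d in deviations]

# Step 3: variance = mean of the squared deviations.
variance = sum(squared) / n

# Step 4: standard deviation = square root of the variance (original units).
sd = math.sqrt(variance)

print(f"mean={mean}, sum of deviations={sum(deviations)}, variance={variance}, SD={sd}")

# Cross-check against the standard library's population formulas.
assert math.isclose(variance, statistics.pvariance(x))
assert math.isclose(sd, statistics.pstdev(x))
```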

Standard Deviation [Figure: Curve A and Curve B]

Why the "Normal distribution" is important. The distribution of many test statistics is normal or follows some form that can be derived from the normal distribution. In this sense, philosophically speaking, the Normal distribution represents one of the empirically verified elementary "truths about the general nature of reality," and its status can be compared to that of the fundamental laws of the natural sciences. The exact shape of the normal distribution (the characteristic "bell curve") is defined by a function that has only two parameters: mean and standard deviation.

A characteristic property of the Normal distribution is that about 68% of all of its observations fall within a range of ±1 standard deviation from the mean, and a range of ±2 standard deviations includes about 95% of the scores. In other words, in a Normal distribution, observations that have a standardized value of less than -2 or more than +2 have a relative frequency of 5% or less. (Standardized value means that a value is expressed in terms of its difference from the mean, divided by the standard deviation.) The exact values of probability associated with different values in the normal distribution can be calculated; for example, a Z value (i.e., standardized value) of 4 has an associated probability of less than .0001, because in the normal distribution almost all observations (i.e., more than 99.99%) fall within the range of ±4 standard deviations.
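These probabilities can be checked with a short sketch based on `statistics.NormalDist` from the Python standard library. The helper names `within` and `standardize`, and the example values passed to `standardize`, are assumptions introduced for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def within(k):
    """Probability that a Normal observation lies within +/- k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

def standardize(value, mean, sd):
    """Standardized value (z-score): difference from the mean divided by the SD."""
    return (value - mean) / sd

for k in (1, 1.96, 2, 4):
    print(f"within +/-{k} SD: {within(k):.5f}   outside: {1 - within(k):.5f}")

# Hypothetical example: a score of 130 when the mean is 100 and the SD is 15.
print(standardize(130, 100, 15))
```

Note that ±2 standard deviations covers slightly more than 95% of observations; the multiplier that gives 95% is 1.96.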

From populations to samples .…..

Consider the situation where you take many samples and determine a mean and standard deviation for each sample. The obtained mean values would themselves follow a normal distribution, centered on the same mean as the raw scores but with a smaller spread.

Distribution of Sample Means
You could apply the same normal curve graph to the means as was applied to the individual scores. For example, 95% of the sample means would fall within the range of -1.96 to +1.96 z-scores.

In a practical situation, however, you have only one sample mean with which you could work. You would have no idea whether this sample mean is near or far from the real population mean. Wouldn't it be nice to have an estimate of the standard deviation of sample means, which describes their spread? There is a way to obtain this estimate: divide the standard deviation by the square root of the number of observations, $SEM = SD / \sqrt{n}$.
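A small simulation can make this estimate plausible. The sketch below assumes a hypothetical Normal population (mean 100, SD 15) and samples of 25; it compares the observed spread of many sample means with the standard deviation divided by the square root of the sample size, and then shows the estimate obtained from a single sample.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters and sample size (illustrative only).
population_mean, population_sd, n = 100.0, 15.0, 25

def draw_sample():
    """One random sample of n observations from the assumed Normal population."""
    return [random.gauss(population_mean, population_sd) for _ in range(n)]

# Draw many samples and keep the mean of each one.
sample_means = [statistics.fmean(draw_sample()) for _ in range(2000)]

observed_spread = statistics.stdev(sample_means)   # SD of the sample means
theoretical_sem = population_sd / math.sqrt(n)     # SD / sqrt(n)

# In practice only one sample is available: estimate the SEM from it alone.
one_sample = draw_sample()
estimated_sem = statistics.stdev(one_sample) / math.sqrt(n)

print(f"observed spread of means: {observed_spread:.2f}")
print(f"SD / sqrt(n):             {theoretical_sem:.2f}")
print(f"estimate from one sample: {estimated_sem:.2f}")
```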

Standard error of the mean
The standard error of the mean (SEM) is the standard deviation of the sample mean, or an estimate of that statistic. Because the sample mean is an unbiased estimator, the SEM can equally be understood as the standard deviation of the error in the sample mean with respect to the true population mean.

Example: Suppose that the population mean of males' serum uric acid levels is 5.4 mg per 100 ml and the standard deviation is 1. If you drew 100 samples of 25 men in each sample and computed 100 sample means, how many of those means would you expect to fall within 1.96 standard errors of the population mean? With SEM = 1/√25 = 0.2, the range is 5.4-1.96*0.2 to 5.4+1.96*0.2 = (5.008-5.792). The answer is about 95. If you drew a sample and found a mean serum uric acid level of 8.2, would you assume this was "significantly" different from the population mean? Yes, because a mean of that magnitude could occur less than 5 times in 100.
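The arithmetic of this example can be sketched as follows, under the assumption that the standard deviation of 1 describes individual men, so that the standard error for samples of 25 is 1/√25 = 0.2. Variable names are illustrative.

```python
import math
from statistics import NormalDist

# Values from the example; sigma is assumed to be the SD of individual measurements.
mu, sigma, n = 5.4, 1.0, 25
sem = sigma / math.sqrt(n)                 # standard error of the mean

# Range expected to contain 95% of the sample means.
low, high = mu - 1.96 * sem, mu + 1.96 * sem

# Share of sample means inside that range under the Normal sampling distribution.
sampling_distribution = NormalDist(mu, sem)
share_inside = sampling_distribution.cdf(high) - sampling_distribution.cdf(low)

# How unusual is an observed sample mean of 8.2?
observed_mean = 8.2
z = (observed_mean - mu) / sem
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"SEM = {sem:.2f}, 95% range = ({low:.3f}, {high:.3f})")
print(f"expected means inside, out of 100: {100 * share_inside:.0f}")
print(f"z = {z:.1f}, two-sided p < 0.05: {p_two_sided < 0.05}")
```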
