
Graphs of Frequency Distribution





    1. Graphs of Frequency Distribution Introduction to Statistics Chapter 2 Jan 21, 2010 Class #2

    2. Two-dimensional graphs: Basic Set-Up

    3. Commonly Used Graphs Histogram: height of bars proportional to frequency; width proportional to class boundaries. Bar chart: height proportional to frequency; width not really significant. Frequency polygon: plot points, then connect them with straight lines.
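As a sketch of the counting behind a histogram's bars (using the 20 memory-test scores introduced on a later slide; the class boundaries here are my own illustrative choice), plain Python is enough:

```python
# Counting scores into class intervals: the computation behind a histogram.
# Data: the 20 short-term memory scores used later in this presentation.
scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10, 10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

# Class boundaries (real limits): 2.5-4.5, 4.5-6.5, 6.5-8.5, 8.5-10.5, 10.5-12.5
boundaries = [2.5, 4.5, 6.5, 8.5, 10.5, 12.5]

counts = [0] * (len(boundaries) - 1)
for s in scores:
    for i in range(len(counts)):
        if boundaries[i] <= s < boundaries[i + 1]:
            counts[i] += 1
            break

print(counts)  # the heights of the histogram bars
```

The bar heights are just these counts; the bar widths come from the class boundaries, which is exactly the histogram rule stated above.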

    4. Histograms

    5. Simple Bar Graph

    6. Grouped Bar Graph

    7. Frequency Polygons

    9. Shape of Frequency Distribution Symmetrical: if you can draw a vertical line through the middle (so that you have a mirror image), the scores are evenly distributed. Positively skewed: scores piled up on the left with the tail on the right. Negatively skewed: scores piled up on the right with the tail on the left.
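One numeric consequence of skew, not stated on the slide but consistent with it: the tail pulls the mean away from the median. A quick check with made-up scores:

```python
from statistics import mean, median

# Positively skewed: scores piled up on the left, tail to the right.
pos_skew = [1, 2, 2, 2, 3, 3, 4, 9]
# Negatively skewed: scores piled up on the right, tail to the left.
neg_skew = [1, 6, 7, 7, 8, 8, 8, 9]

print(mean(pos_skew), median(pos_skew))  # mean pulled above the median
print(mean(neg_skew), median(neg_skew))  # mean pulled below the median
```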

    10. Frequency Distribution: Different Distribution Shapes This slide shows some examples of the different shapes a distribution can take. The top left is the normal distribution already described: the majority of scores fall around a midpoint, with fewer and fewer as the scores become more extreme. The top right shows another distribution shape: here the majority of scores fall around two values. The bottom two graphs show shapes that again cluster around a central value, but unlike the top two they are not symmetrical (you could not draw a vertical line through the middle and have one side mirror the other). Let's describe these distributions in terms of what they mean for actual scores. Imagine these are all graphs of exam results for different exams, with higher results further along the x-axis. The top-left graph shows an exam where most people got results around the midpoint: very few got very low and very few got very high scores. The top-right graph shows two peaks, so many people got exam results at either one particular lower level (left peak) or a higher level (right peak), with fewer getting very low scores, very high scores, or scores in the middle. The bottom-left graph shows that most people got very low scores, as the peak of the curve is near the beginning of the x-axis, with very few getting high scores (a very hard exam). The bottom-right graph is the opposite, with the peak near the end of the x-axis: most people scored highly on this exam. Distributions where scores pile up at one end while the tail tapers off toward the other end are called skewed distributions. The bottom left is an example of positive skew, so called because the tail tapers off toward the positive end of the x-axis (it looks like a p facing upward); negative skew is so called because the tail tapers off toward the negative end of the x-axis. This kind of information is clear just from looking at a distribution.

    11. Be careful… See next slide for tricks researchers might use with graphs…

    13. Plotting Data: Describing the Spread of Data A researcher is investigating short-term memory capacity: the number of symbols remembered is recorded for 20 participants: 4, 6, 3, 7, 5, 7, 8, 4, 5, 10, 10, 6, 8, 9, 3, 5, 6, 4, 11, 6 We can describe our data by using a frequency distribution. This can be presented as a table or a graph. It always presents: the set of categories that made up the original measurement scale, and the frequency of each score/category. Three important characteristics: shape, central tendency, and variability. Let's consider some actual raw data and see what kind of description we could produce. The data record the number of different symbols a person can remember after being shown a series of 20, with an immediate memory test. These are all the scores in the order in which they were obtained, which at the moment tells us very little; this presentation of raw data becomes even more confusing the more data there is. So one of the first things we might want to do is organize the data in some logical way, and one way is a frequency distribution. This presents our data so that scores are categorized according to how often they occur: if the data above were shown as a frequency distribution, we would be able to see how many people remembered 4 symbols, how many remembered 5, and so on. This is especially useful for larger data sets, which a frequency distribution can summarize neatly. A frequency distribution is commonly represented as a table that lists the different scores or measurements, then the number of times each score or measurement is observed in the data set. Presenting data in this way can show us how many people got each score, what the most common score was (an indication of the central tendency of the data set, discussed later), and how spread out the scores are relative to each other: the variability of the data set.

    14. Frequency Distribution Tables Highest score is placed at top All observed scores are listed Gives information about distribution, variability, and centrality X = score value f = frequency fX = total value associated with that frequency Σf = N ΣX = ΣfX So here is the data from the example memory study presented as a frequency distribution table: the column labelled X refers to the different scores (here, the numbers of symbols that were remembered) and the f column shows the frequency of those scores (how many people remembered 3, and so on). A frequency distribution table always has the highest score at the top and lists all possible scores down to the lowest score at the bottom. With such a table it is easy to see that most participants remembered around 6 symbols. The last column on the right, labelled fX, is not strictly necessary, and shows how much of the data set can be attributed to each possible score: for the score of 11 symbols remembered, 1 person scored 11, so the fX value is 11. For the score of 10 symbols remembered, 2 people remembered 10, which makes fX 2 x 10 = 20. You should see that, using the table, you can work out certain values: the total number of participants (N) is obtained simply by adding all the values in the f column (which should give us 20), and the total sum of the scores is obtained by adding all the values in the fX column (which gives us 127).
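The table arithmetic above (f, fX, Σf = N, ΣfX) can be reproduced in a few lines of Python, sketched here with the standard library's Counter; the printed layout only mimics the table:

```python
from collections import Counter

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10, 10, 6, 8, 9, 3, 5, 6, 4, 11, 6]
freq = Counter(scores)  # maps each score X to its frequency f

# List every score from highest to lowest, as in the table.
print(" X   f   fX")
for x in range(max(scores), min(scores) - 1, -1):
    f = freq[x]
    print(f"{x:2d}  {f:2d}  {f * x:3d}")

N = sum(freq.values())                        # Σf = N
total = sum(f * x for x, f in freq.items())   # ΣfX, equal to ΣX
print(N, total)
```

Running this gives N = 20 and a score total of 127, matching the values quoted in the notes.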

    15. Frequency Table Additions Frequency tables can display more detailed information about the distribution Percentages and proportions p = fraction of the total group associated with each score (relative frequency) p = f/N As a %: p(100) = 100(f/N) What does this tell us about the distribution of scores? As well as information about scores and their frequencies, a frequency distribution table can give us more detailed information about the distribution of scores within a data set. Two common measures concern the proportion of score frequency. This involves adding columns that show the information in the f and fX columns (i.e. the frequency of each score and the total value associated with that frequency) as a proportion or percentage. So the column p shows what proportion of the group remembered each number of symbols: we can see that a proportion of 0.05 of participants recalled 11 symbols. We convert the frequencies into proportions simply by dividing each frequency by the number of participants, so 1/20 = 0.05. We can also convert the frequencies into percentages simply by multiplying the proportion by 100. So we know that 1 participant recalled 11 symbols: this is a proportion of 0.05 (1/20), and 5% (0.05 x 100). Including the proportions and percentages is another way of summarizing the data: again it shows which score was most common and how many symbols were recalled how often, so it clearly gives information about the distribution of the scores.
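The proportion and percentage columns can be added the same way, again a sketch built on Counter and the same 20 scores:

```python
from collections import Counter

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10, 10, 6, 8, 9, 3, 5, 6, 4, 11, 6]
freq = Counter(scores)
N = len(scores)

for x in sorted(freq, reverse=True):  # highest score first, as in the table
    p = freq[x] / N   # proportion (relative frequency): p = f/N
    pct = 100 * p     # percentage: p(100) = 100(f/N)
    print(f"X={x:2d}  f={freq[x]}  p={p:.2f}  %={pct:.0f}")
```

For the score of 11 this prints p = 0.05 and 5%, as worked out in the notes; the proportions across all rows necessarily sum to 1.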

    16. Steps in Constructing a Grouped Frequency Distribution 1. Determine the Class Interval Size Ideally, we wish to generate a frequency distribution with about 10 class intervals. We would like the size (width) of each class interval to be 1, 2, 3, 5, 10, 20, 30, or 50, or one of these values multiplied by a power of 10.

    17. Steps in Constructing a Grouped Frequency Distribution 1. Determine the Class Interval Size (continued) To Achieve These Goals, We Employ the Following Procedure:

    18. Grouped Frequency Distribution Tables Sometimes the spread of data is too wide Grouped tables present scores as class intervals About 10 intervals An interval width should be a simple round number (2, 5, 10, etc.), and all intervals the same width The bottom score should be a multiple of the width Class intervals represent a continuous variable X: e.g. 51 is bounded by the real limits of 50.5-51.5 If X is 8 and f is 3, that does not mean all three had the same score: they all fell somewhere between 7.5 and 8.5 Frequency distributions are a useful tool, but what happens when a set of data covers a much wider range of values than our previous example? Say you had data with scores ranging from 50 to 100: that is 50 rows in the X column if you list each possible score, which is not practical. In cases such as these, grouped frequency distribution tables are used: we simply group ranges of scores into class intervals and record the number of cases whose scores fall into each particular interval. This slide shows an example of such a table: the scores range from 50 to 100, so each interval covers 5 points. The grouped table follows the same conventions as the normal frequency distribution table: it starts with the highest interval, then lists all possible intervals down to the lowest. A grouped table should have about 10 intervals; any more than 10 is too many. Score groups, or class intervals as they are called, should be based on a simple round number, and all intervals should be the same width (so in our example every interval has a range of 5). The bottom score should be a multiple of the width: if you have an interval width of 5, the lowest score included in the table should be a multiple of 5. Something to be aware of is the continuous nature of these representations. The arrangement of the grouped table means that scores recorded in the same class interval are not necessarily all the same: there are 4 scores recorded in the class interval 70-74, and one could be 71, one 70, one 73, and so on. This is also true of the frequencies in simpler frequency distribution tables: class intervals in grouped tables, and the X scores in ordinary tables, always represent continuous variables (a variable that is continuously divisible). This means that the scores obtained do not represent separate data points but intervals on a continuous scale, which is why we differentiate between the apparent limits of an interval and its real limits. For example, when we say 51 we are actually simplifying, using 51 to mean any value that falls within the upper and lower real limits of that value (50.5-51.5); likewise, if X is 8 and f is 3, the three scores all fell somewhere between 7.5 and 8.5.
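A minimal sketch of building such a grouped table. The function name grouped_frequency and the example data are my own; the data are invented to sit in the 50-100 range the notes mention, with four scores in the 70-74 interval as in the slide:

```python
def grouped_frequency(scores, width):
    """Group scores into class intervals of the given width.
    The bottom interval starts at a multiple of the width,
    and the table is listed from the highest interval down."""
    low = (min(scores) // width) * width           # multiple of the width
    start = ((max(scores) - low) // width) * width + low
    table = []
    while start >= low:
        f = sum(1 for s in scores if start <= s < start + width)
        table.append((start, start + width - 1, f))  # (apparent limits, f)
        start -= width
    return table

# Hypothetical scores in the 50-100 range mentioned in the notes:
data = [52, 55, 61, 63, 64, 70, 71, 73, 74, 88, 95]
for lo, hi, f in grouped_frequency(data, 5):
    print(f"{lo}-{hi}: {f}")
```

Each row stores the apparent limits (e.g. 70-74); the real limits of that interval would be 69.5-74.5.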

    19. Percentiles and Percentile Ranks Percentile rank = the percentage of the sample with scores at or below the particular value This can be represented by a cumulative frequency column Cumulative percentage obtained by: c% = (cf/N)(100) This gives information about relative position in the data distribution X values = raw scores, without context One final, very useful column that can be added to a frequency distribution table is one that gives context to the frequencies. Such a table gives us a description of a whole set of scores, but it can also give us information about an individual score and its position relative to the other scores (higher/lower) in the data set. We do this by figuring out the percentile rank of a particular score: the percentage of people who score at or below that score. Say, for example, we wanted to know what percentage of people remembered 8 or fewer symbols. We do this by adding a cumulative frequency column, which just means that we add up the number of participants for each score as we go along: 2 people remembered 3 symbols and 3 people remembered 4 symbols, so 5 people remembered 4 or fewer symbols (3 + 2). 3 people remembered 5 symbols, so we add this to 5 to give us 8 people who remembered 5 or fewer symbols. To find the percentile ranks we simply convert the values in the cf column into percentages of the total group size by dividing each value by 20 (the number of participants) and multiplying by 100. So 16 people remembered 8 symbols or fewer (the 2 who remembered 8 plus the 14 who remembered fewer). If we divide 16 by 20 and multiply by 100 we get 80: 80% of people remembered 8 or fewer symbols. So say you remembered 8 symbols: your score has a percentile rank of 80%, and is the 80th percentile (so called because it is identified by its percentile rank).
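The cumulative frequency column and the c% = (cf/N)(100) conversion can be sketched with the same 20 memory scores:

```python
from collections import Counter

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10, 10, 6, 8, 9, 3, 5, 6, 4, 11, 6]
freq = Counter(scores)
N = len(scores)

cf = 0
rank = {}
for x in sorted(freq):      # lowest score first, accumulating upward
    cf += freq[x]           # cumulative frequency: people scoring x or less
    rank[x] = 100 * cf / N  # percentile rank: c% = (cf/N)(100)
    print(f"X={x:2d}  cf={cf:2d}  c%={rank[x]:.0f}")
```

The row for X = 8 shows cf = 16 and c% = 80, the worked example from the notes: a score of 8 is the 80th percentile.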

    20. Steps in Constructing a Grouped Frequency Distribution 1. Determine the Class Interval Size (continued) Example: Given the following data: 100 74 84 95 95 110 99 87 100 108 85 103 99 83 91 91 84 110 113 105 100 98 100 108 100 98 100 107 79 86 123 107 87 105 88 85 99 101 93 99

    21. Steps in Constructing a Grouped Frequency Distribution 2. Determine the Starting Point (First Class Interval) of the Frequency Distribution Start the frequency distribution with a class interval to which the following guidelines apply: the first number of the class interval is a multiple of the class interval size, and the first interval includes the lowest number or value in the data set.
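The two steps can be applied to the data on slide 20. The roughly-10-intervals target and the preferred widths come from slide 16; the tie-breaking rule here (pick the width whose interval count is closest to 10) is my own assumption:

```python
data = [100, 74, 84, 95, 95, 110, 99, 87, 100, 108,
        85, 103, 99, 83, 91, 91, 84, 110, 113, 105,
        100, 98, 100, 108, 100, 98, 100, 107, 79, 86,
        123, 107, 87, 105, 88, 85, 99, 101, 93, 99]

span = max(data) - min(data)  # range of the data

# Preferred widths from slide 16: 1, 2, 3, 5 (and 10x, 100x these), 20, 30, 50.
widths = sorted({w * m for w in (1, 2, 3, 5) for m in (1, 10, 100)} | {20, 30, 50})

# Step 1: pick the width that gets closest to roughly 10 class intervals.
width = min(widths, key=lambda w: abs(span / w - 10))

# Step 2: start at a multiple of the width that includes the lowest value.
start = (min(data) // width) * width

print(span, width, start)
```

With these 40 scores the range is 49, so a width of 5 gives about ten intervals, and the first interval starts at 70 (a multiple of 5 that includes the lowest value, 74).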

    22. Credits http://www.statcan.ca/english/edu/power/ch8/frequency.htm http://www.le.ac.uk/pc/sk219/introtostats1.ppt#259,4,Plotting Data: describing spread of data http://leeds-faculty.colorado.edu/luftig/Past_Course_Websites/APPM_4570_5570/Website_without_Sound/Lecture_Slides/CHAPTER2/Chap_2.ppt
