Frequency Tables, Bar Graphs, and Histograms: Handout #5

This document discusses methods for summarizing data, including frequency tables, bar graphs, and histograms. It provides an example of a frequency table created from a student survey with 12 respondents and 10 variables. The table summarizes the responses for one of the variables, party identification, by coding each response, tallying the frequencies, and calculating absolute and relative frequencies. It also discusses how to construct frequency bar graphs to visually represent the frequency distribution of a variable.

Uploaded by

NadiahNasir
Copyright
© Attribution Non-Commercial (BY-NC)

FREQUENCY TABLES, BAR GRAPHS, AND HISTOGRAMS

Handout #5
“Results” of Student Survey
ID  V1  V2  V3  V4  V5  V6  V7  V8  V9  V10
 1   1   2   3   1   1   3   2   2   2   2
 2   3   1   3   3   1   2   1   5   2   1
 3   2   3   3   1   2   2   2   1   2   2
 4   1   2   1   3   2   1   3   4   3   2
 5   1   1   1   1   1   3   3   4   2   2
 6   1   2   5   5   2   2   2   4   2   2
 7   1   2   5   2   1   3   3   4   2   2
 8   1   1   5   1   1   3   3   1   2   2
 9   3   2   2   3   2   2   1   5   2   1
10   2   2   3   5   1   2   1   1   2   1
11   1   1   1   1   2   2   2   4   2   2
12   1   1   1   1   1   3   2   1   2   2

One can’t just stare at this and grasp what the data is “saying.”
– The numbers don’t “speak for themselves”
• even apart from being numerically coded.
Data Needs to be “Boiled Down” to Reveal Meaningful Information,
Patterns, and Relationships
• How you do this depends on the nature of the data, e.g.,
– nominal, ordinal, etc.
• This “boiling down” is commonly referred to as “number
crunching.”
• This boiling down can now be quickly accomplished for
even very large data sets by using computer software
such as SPSS.
• For a small data set like the Student Survey, it is feasible
(but still tedious) to do this by hand.
Boiling Down Data (cont.)
• One variable at a time (Univariate Analysis)
• Two variables at a time (Bivariate Analysis)
• Multiple variables at a time (Multivariate Analysis)

• Two stages:
– reduce the data to a single relatively compact table (frequency
table, crosstabulation, control table, etc.) or corresponding chart
(frequency bar graph, histogram, dot chart, box chart,
scattergram, etc.)
– reduce it further to one or several summary statistical measures
(measures of central tendency, dispersion, association,
correlation and regression coefficients, etc.).
• We first look at the process of boiling univariate data
down to frequency tables, frequency bar graphs, and
histograms.
– Then (univariate) measures of central tendency and dispersion.
Constructing Frequency Tables for Discrete
Variables in the Student Survey Data
Recall that the first question in the Student Survey was the
following:

Generally speaking, do you think of yourself as a
Republican, a Democrat, an Independent, or what?
(1) Democrat
(2) Independent
(3) Republican
(4) Other; minor party
(5) Don't know

[Above is from the Questionnaire/Codebook]


Frequency Table Worksheet
FREQUENCY TABLE OF PARTY ID (V1)
Value Code Tallies IDs Abs Freq Rel Freq Adj Rel Freq
Dem. 1
Ind. 2
Rep. 3
Other 4
DK 5
NA 9
Total________________________________________________________
ID V1 ID V1
1 1 6 1
2 3 7 1
3 2 8 1
4 1 9 3
5 1 10 2
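
The tallying step can also be sketched in code. The following is a minimal Python sketch, not part of the original handout. It counts the V1 (party ID) codes of the 12 respondents in the “Results” table and uses the code labels from the Questionnaire/Codebook. The names `v1_codes`, `labels`, and `absolute_frequencies` are illustrative assumptions.

```python
from collections import Counter

# V1 (party ID) codes for respondents 1-12, copied from the "Results" table.
v1_codes = [1, 3, 2, 1, 1, 1, 1, 1, 3, 2, 1, 1]

# Code labels from the Questionnaire/Codebook (9 = NA, as in the worksheet).
labels = {1: "Dem.", 2: "Ind.", 3: "Rep.", 4: "Other", 5: "DK", 9: "NA"}


def absolute_frequencies(codes, labels):
    """Return {code: count}, listing every possible code even if unused."""
    tallies = Counter(codes)
    return {code: tallies.get(code, 0) for code in labels}


abs_freq = absolute_frequencies(v1_codes, labels)
for code, count in abs_freq.items():
    print(f"{labels[code]:<6}{code:>3}{count:>4}")
print(f"{'Total':<9}{sum(abs_freq.values()):>4}")
```

Listing every possible code, including those with a count of zero, mirrors the worksheet, which has a row for each value whether or not any respondent chose it.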
Absolute (Case Counts) vs. Relative (Percentages) Frequencies
Count up tallies to get absolute frequencies.

Use relative frequencies (percentages) to make valid
comparisons across data sets:
e.g., one student survey with another or (especially)
student survey with national data.
Relative Frequency (%) = (Absolute Frequency / Total N of Cases) × 100%

It is also usually best to set aside missing data, including “don’t
know,” “no opinion,” “other,” etc., cases.
Adjusted Rel. Frequency (%) = [Absolute Frequency / (N of Cases - N of Missing/Invalid Cases)] × 100%
Frequency Table Worksheet (cont.)

FREQUENCY TABLE OF PARTY ID (V1)

Values  Code  Abs Freqs.  Rel Freqs.  Adj Rel. Freqs.
Dem.     1       20          46%          49%
Ind.     2       12          28%          29%
Rep.     3        9          21%          22%
Other    4        0           0%
DK       5        2           5%
NA       9        0           0%
Total            43         100%         100%

[Tallies and IDs columns not shown]

Percentages have been rounded to the nearest whole percent. (SPSS rounds to
the nearest tenth of a percent.) Rounding may produce rounding error, so
that a total that should come out to precisely 100% may actually add up to
101% or 99.9%, etc.
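
The two percentage formulas can be checked against the absolute frequencies in the table. This is a minimal Python sketch under one stated assumption: “Other,” “DK,” and “NA” are the categories set aside as missing/invalid, which is consistent with the 41 valid cases the adjusted percentages imply. The function names are illustrative. Note that 20/43 is about 46.5%, which rounds to 47%, so the recomputed relative frequencies sum to 101%, an instance of the rounding error described above.

```python
# Absolute frequencies from the PARTY ID (V1) table.
abs_freq = {"Dem.": 20, "Ind.": 12, "Rep.": 9, "Other": 0, "DK": 2, "NA": 0}

# Assumption: these categories are treated as missing/invalid.
missing = {"Other", "DK", "NA"}


def relative_frequencies(abs_freq):
    """Percent of ALL cases falling in each category."""
    total = sum(abs_freq.values())
    return {value: 100 * n / total for value, n in abs_freq.items()}


def adjusted_relative_frequencies(abs_freq, missing):
    """Percent of VALID (non-missing) cases falling in each valid category."""
    valid_total = sum(n for value, n in abs_freq.items() if value not in missing)
    return {
        value: 100 * n / valid_total
        for value, n in abs_freq.items()
        if value not in missing
    }


rel = relative_frequencies(abs_freq)
adj = adjusted_relative_frequencies(abs_freq, missing)
for value in abs_freq:
    adj_text = f"{round(adj[value])}%" if value in adj else ""
    print(f"{value:<6}{abs_freq[value]:>4}{round(rel[value]):>5}%{adj_text:>6}")
```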
A “Presentation Grade”
Frequency Table

PARTY IDENTIFICATION AMONG POLI 300 STUDENTS, FALL 2006

Democratic 49%
Independent 29%
Republican 22%
Total 100%
(n = 41)

Source: POLI 300 Student Political Attitudes Survey, Fall 2006


Another Example
FREQUENCY TABLE OF DEMOCRATIC PARTY THERMOMETER SCALE (V16)

Value    Code  Abs.    Rel.    Adj. Rel.  Cum. Rel. Freqs.
               Freqs.  Freqs.  Freqs.     Downward  Upward
0-20      1      8      19%     19%         19%      100%
21-40     2      4       9%      9%         28%       81%
41-60     3     16      37%     37%         65%       72%
61-80     4      8      19%     19%         84%       35%
81-100    5      7      16%     16%        100%       16%
Missing   9      0       0%
Total           43     100%    100%

• Note that V16 is (at least) ordinal in nature. The list of values should follow the
natural ordering.
• If the ordering runs from “Low” to “High,” the lowest value is conventionally put at the
top of the list and the highest value at the bottom (illogical though this may seem).
• There is no missing data, so Adjusted Relative Frequency = Relative Frequency.
• In this table, we have shown one other type of percentage, namely cumulative
(adjusted relative) frequencies, where the cumulation can proceed either downward
or upward. Thus the “61-80” row of the table shows that 19% of the respondents
have 61 to 80 degrees of “warmth” toward the Democratic Party, 84% have this level
of warmth or cooler (i.e., 80 degrees or less), and 35% have this level of warmth or
warmer (i.e., 61 degrees or more).
• Cumulative frequencies make no sense if the variable in question is merely nominal
in nature (or if the table does not list ordinal values in their natural order).
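
The downward and upward cumulation can be sketched as running sums. This is a minimal Python sketch using the rounded adjusted relative frequencies from the V16 table. The variable names are illustrative, and the approach assumes the values are already listed in their natural order.

```python
from itertools import accumulate

# Values in natural (low-to-high) order with their adjusted relative
# frequencies (whole percents) from the V16 table.
values = ["0-20", "21-40", "41-60", "61-80", "81-100"]
adj_rel = [19, 9, 37, 19, 16]

# Downward: this value or lower. Upward: this value or higher.
cum_down = list(accumulate(adj_rel))
cum_up = list(accumulate(reversed(adj_rel)))[::-1]

for value, pct, down, up in zip(values, adj_rel, cum_down, cum_up):
    print(f"{value:<8}{pct:>4}%{down:>5}%{up:>5}%")
```

For the “61-80” row this gives 84% downward and 35% upward, matching the table.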
SPSS Frequency Tables
for ANES Discrete Variables
V25 DEMOCRATIC CANDIDATE THERMOMETER SCORE (1972-2004)

                  Code   Freq.   Percent  Valid Percent  Cum Percent
Valid    0-20       1     2359     12.9       13.4           13.4
         21-40      2     2787     15.3       15.8           29.2
         41-60      3     5003     27.4       28.4           57.6
         61-80      4     3376     18.5       19.2           76.8
         81-100     5     4097     22.4       23.2          100.0
         Total           17623     96.5      100.0
Missing  NA         9      638      3.5
Total                    18260    100.0

SPSS uses somewhat different labels for different types of frequencies:
Frequency = Absolute Frequency
Percent = Relative Frequency
Valid Percent = Adjusted Relative Frequency
SPSS Frequency Tables for ANES
Discrete Variables (cont.)
V30 MOST IMPORTANT NATIONAL PROBLEM (1972-2004)

                               Code   Freq.   Percent  Valid %  Cum %
Valid    economy                 1     4581     25.1     36.9     36.9
         foreign affairs         2     2116     11.6     17.0     53.9
         social welfare          3     3029     16.6     24.4     78.2
         crime, public order     4     1889     10.3     15.2     93.4
         other                   5      816      4.5      6.6    100.0
         Total                        12430     68.1    100.0
Missing  NA                      9     5830     31.9
Total                                 18260    100.0

Note: SPSS calculates and displays cumulative frequencies automatically,
even when they make no substantive sense (as with the
nominal variable MOST IMPORTANT NATIONAL PROBLEM).
SPSS doesn’t “know better”: it operates on the code values and
cannot tell the difference between different types of variables.
Frequency Charts
• Frequency distribution information is often presented by a bar chart.
• To construct a frequency bar chart, first draw a horizontal line and
place tick marks at equal intervals along the line.
– Each tick mark represents a possible value of the (qualitative or
discrete) variable.
– If the variable is ordinal, the marks should follow the natural ordering of
the values.
– Conventionally (and plausibly), values increase from left to right.
• We then erect a vertical axis that represents (absolute or relative)
frequency.
– The vertical axis can be calibrated in terms of either absolute or relative
frequencies.
• Relative frequencies are more typically displayed, especially with
data from surveys (where the actual number of cases depends on
sample size and is of no special interest).
• It is possible, of course, to have two axes (e.g., one at the left and
the other at the right edge of the chart) displaying both absolute
and relative frequencies.
• Above each tick mark, we erect a bar with some standard width and
the height of which is proportional to the frequency of that value.
– Conventionally the sides of the bars do not touch each other,
representing the fact that the values of the variable are discrete.
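
As a rough illustration of these construction rules, the following minimal Python sketch draws a text-only frequency bar chart for the party identification percentages from the presentation-grade table. It is an assumption-laden stand-in for a drawn chart: the bars run horizontally rather than vertically, one `#` stands for one percentage point, and the function name `text_bar_chart` is illustrative.

```python
def text_bar_chart(rel_freq):
    """Return one text bar per value; bar length equals the percentage."""
    lines = []
    for value, pct in rel_freq.items():
        bar = "#" * pct  # length proportional to relative frequency
        lines.append(f"{value:<12}|{bar} {pct}%")
    return lines


# Adjusted relative frequencies (whole percents) for party identification.
party_id = {"Democratic": 49, "Independent": 29, "Republican": 22}

chart = text_bar_chart(party_id)
print("\n\n".join(chart))  # blank lines keep the bars from touching
```

Every bar has the same thickness (one text line), so only its length carries information, which is the defining property of a frequency bar chart.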
Bar Chart Work Sheet
FREQUENCY BAR CHART
FREQUENCY BAR CHART (cont.)
Is a Picture Worth a Thousand Words?
• Not always: Sixty-eight senators voted for the bill and
thirty-two voted against.
PIE CHARTS

• Since pie charts do not show values in a linear order, they are
especially appropriate for displaying frequencies of nominal
variables.
• Since such charts show how a “pie” is “divided up,” they are also
especially appropriate for displaying “shares,” such as how parties
divide up popular votes, electoral votes, or seats in a legislature, or
how a budget is divided up among different spending categories.
• Using different colors (or hatching) for each slice can help the
reader quickly grasp the information in the chart.
Comparing Frequency Distributions for Subsets of Cases, for
Different (but “comparable”) Variables, or from different Data
Sets (e.g., Student Survey and SETUPS/ANES)

• Clearly, merged or clustered bar graphs like these should display
relative frequencies if the data sets (or subsets) being compared are of
different sizes.
• You might merge (hand-drawn) bar graphs in this manner when you
compare Student Survey and SETUPS/NES data in Problem Set #5A.
“Stacked” Bar Graphs
• Another way to compress and merge bar graphs is to “stack” all the bars of an
ordinary bar graph on top of one another to form a single bar representing
100% of the (valid) cases.
• We can then combine nine such stacked bars to “tell the story” of the changing
perceived importance of different types of issues in Presidential elections over
the last 33 years.
Frequencies of Continuous Variables

• Remember: the first step in constructing a frequency
table was to list all possible values of the variable.
• But we cannot do this if the variable of interest is
quantitative and continuous in nature,
– because such a variable has an infinite number of possible
values.
• Remember that all points along (some interval of) the
real number line represent possible values of a
continuous (and interval) variable.
• One way to proceed is to divide the line representing
values of the variable up into a (relatively small) number
of segments called class intervals.
Class Intervals
• We noted in Problem Set #3B that some of the variables
in the SETUPS/NES data are “truly continuous” in nature
but have in effect been turned into discrete variables.
– This was accomplished by creating class intervals for
such variables as V60 (AGE), V65B and V65C
(DOLLAR INCOME), and all the “Thermometer
Scales.”
• Once such class intervals have been created, we can
proceed to create frequency tables and charts in the
same manner as with discrete variables.
– Indeed, we have already done this with respect to
V25 DEMOCRATIC CANDIDATE THERMOMETER
SCORE.
States by Percent of Population Aged 65 or Older
• Note: The data is not recorded entirely precisely; it is rounded off to
the nearest one-tenth of one percent.
– For example, IL, IN, and MS (all recorded as 12.1%) almost certainly
have different values on the variable.
• To boil the data down to a frequency table or graph, we might create
class intervals one percentage point wide, i.e., 0-1%, 1-2%, etc.
– We need some rule (disclosed to readers) about whether (for
example) a case with a rounded value of 1.0% goes into the
0-1% or the 1-2% interval.
– The numerical bounds on adjacent intervals must “touch” each other so
that every possible value is included in some interval. [See =>]
• Note: The AGE intervals in the SETUPS/NES Codebook appear not
to “touch” in this way. Presumably the 17-24 interval actually
includes everyone who has not yet turned 25 (and so would be
better written as 17-25), and likewise for other AGE intervals.
• The following slide shows an SPSS histogram for this data with
class intervals one percentage point wide.
– The intervals are 3.5-4.5% and so forth and the value labels are
the whole numbers at the mid-point of these intervals.
– You can verify that the 11.5 and 12.5 observations are included
in the 11.5-12.5 and 12.5-13.5 intervals respectively.
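
The boundary rule described for the SPSS histogram (a value exactly on a boundary goes into the interval that begins there) can be sketched as follows. This is a minimal Python sketch, not SPSS's own procedure. The function name `interval_midpoint` is an assumption, and the five observations are only the values explicitly mentioned above (12.1 for IL, IN, and MS, plus the 11.5 and 12.5 observations), not the full data set.

```python
import math
from collections import Counter


def interval_midpoint(value):
    """Whole-number midpoint m of the interval [m - 0.5, m + 0.5) containing value.

    The lower bound is included and the upper bound excluded, so adjacent
    intervals "touch" and every value falls into exactly one interval.
    """
    return math.floor(value + 0.5)


# Only the values mentioned in the text; the full data set is not reproduced.
observations = [12.1, 12.1, 12.1, 11.5, 12.5]

counts = Counter(interval_midpoint(x) for x in observations)
for mid in sorted(counts):
    print(f"{mid - 0.5}-{mid + 0.5}%: {counts[mid]} case(s)")
```

Whatever rule is chosen, the point is that it is stated explicitly and applied the same way at every boundary.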
SETUPS Codebook
Histogram of Percent of Population 65+

• That there are outliers becomes immediately apparent.
• This histogram is logically equivalent to a frequency bar chart, with
the merely cosmetic difference that the bars touch each other
(reflecting the continuous nature of the variable).
Histogram vs. Frequency Bar Graph
• The preceding histogram is essentially no different from
a frequency bar chart because all class intervals have
the same width (in this case, 1 percentage point).
• Otherwise (i.e., if the class intervals are not all of equal
width), a bar chart and a histogram of the same data
may look quite different,
– in which event the bar chart may present a misleading
picture of the data.
• This can be illustrated by focusing on the SETUPS/NES
variable V65D (DOLLAR INCOME IN 2004),
– for which unequal class intervals were created.
SPSS Frequency Table for V65D

V65D DOLLAR INCOME (2004)

                                Freq.  Percent  Valid %  Cum. %
Valid    Less than $15,000       145    12.0     13.7     13.7
         $15,000 to $25,000      121    10.0     11.4     25.2
         $25,000 to $35,000      102     8.4      9.7     34.9
         $35,000 to $50,000      154    12.7     14.6     49.5
         $50,000 to $80,000      246    20.3     23.3     72.8
         $80,000 to $120,000     167    13.8     15.8     88.6
         More than $120,000      120     9.9     11.4    100.0
         Total                  1055    87.0    100.0
Missing  NA                      157    13.0
Total                           1212   100.0
SPSS Bar Chart for V65D

• The bar chart appears to display a distribution of income that is
approximately “uniform,” i.e., all bars are approximately the same height, except for
a distinctive peak (or “mode”) in the third highest income category.
– Indeed, the impression the bar graph conveys to the eye is that there
are more well-off than not-so-well-off people.
• However, this impression is quite misleading, as you can begin to
understand when you look more closely at the income class intervals and notice
that they are not of equal width.
Histogram for V65D

• The fundamental difference between a bar graph and a histogram:
– in a bar graph, frequency is represented by the height of the
bars (all of which have the same width);
– in a histogram, frequency is represented by the area of the
“bars” (which may have different widths).
• With equal class intervals, the area of a bar depends only on its
height.
• With unequal class intervals, the area of a bar depends on both its
height and its width.
How to Construct the Histogram of
V65D

• To draw this histogram, we first draw a horizontal line,
i.e., a real number line, representing the possible values
of the variable.
• Since the variable is interval and continuous, we can
place tick marks (like a ruler) at equal intervals to mark
equal increments in the value of the variable, e.g., $0K,
$20K, $40K, etc. for INCOME.
How to Construct the Histogram of
V65D (cont.)

• Next we put other [red] marks along the scale at the
points that separate the class intervals we are using
– in this case at $0, $15K, $25K, $35K, $50K, $80K, and $120K.
• Note that the highest class interval has no definite upper
bound and thus no definite width.
– Here I have set an upper bound more or less arbitrarily at
$250K.
– In contrast, the lowest class interval has a definite width, since
INCOME is a ratio variable and cannot have values less than 0.
• We will remove these interval marks later.
How to Construct the Histogram of V65D (cont.)

• Next we erect a vertical axis (analogous to the vertical axis that
indicates frequency in a bar graph).
• However, this axis in fact does not indicate frequency and (like the
red interval marks) is only temporary “scaffolding” that is erected to
help us construct the histogram but which will be taken down once
the construction is finished.
• The scale marked on the vertical axis is drawn to accommodate the
height of the “bars” of the histogram.
• It is essential that the scale begin at zero.
Histogram of V65D (cont.)
• Next we erect a rectangle (a “bar,” if you wish) on each
class interval, so that the area [not height] of each
rectangle is proportional to the frequency associated with
that class interval.
• How tall should each rectangle be?
• The width of each rectangle is the width of the class
interval, and [from 3rd grade we remember that]
Area = Height × Width so Height = Area / Width
• Since Area here represents Frequency, we have the
formula:
Height = Frequency / Width,
where Width is the width of the class interval.
Histogram of V65D (cont.)
• Now we can calculate the following (relative) heights. (Since only
relative magnitudes matter, we can ignore the $000 = $K in
INCOME values.)

Class Interval  Width  Freq.  Freq/Width   Height
0-15              15   13.7   13.7 / 15  = 0.913
15-25             10   11.4   11.4 / 10  = 1.140
25-35             10    9.7    9.7 / 10  = 0.970
35-50             15   14.6   14.6 / 15  = 0.973
50-80             30   23.3   23.3 / 30  = 0.777
80-120            40   15.8   15.8 / 40  = 0.395
120-250          130   11.4   11.4 / 130 = 0.088

• Now we can draw the appropriate scale on the vertical axis.


• The tallest rectangle has a (relative) height of about 1.14, so the
axis should extend a bit higher than this.
• Having constructed the bars/rectangles, we should remove the
vertical axis and scale.
– Otherwise, readers are likely to (mis)interpret it as representing
frequency, like the vertical axis in a bar graph.
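
The height calculation can be reproduced with a few lines of code. This is a minimal Python sketch of Height = Frequency / Width applied to the V65D class intervals. The names `class_intervals` and `histogram_heights` are illustrative, and the $250K upper bound on the top interval is the arbitrary choice made earlier in the handout.

```python
# (lower bound, upper bound, valid percent), with income in $K.
# The 250 upper bound on the top interval is an arbitrary choice.
class_intervals = [
    (0, 15, 13.7),
    (15, 25, 11.4),
    (25, 35, 9.7),
    (35, 50, 14.6),
    (50, 80, 23.3),
    (80, 120, 15.8),
    (120, 250, 11.4),
]


def histogram_heights(intervals):
    """Height = Frequency / Width, so that bar AREA equals frequency."""
    return [freq / (upper - lower) for lower, upper, freq in intervals]


heights = histogram_heights(class_intervals)
for (lower, upper, freq), height in zip(class_intervals, heights):
    print(f"{lower:>3}-{upper:<3}  width {upper - lower:>3}  "
          f"freq {freq:>4}  height {height:.3f}")
```

Multiplying each height by its interval width recovers the original frequency, which is what it means for area rather than height to represent frequency.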
• Given that height in a histogram does not represent frequency, what does it
represent?
• The answer is that height represents density — that is, how closely
observed values of cases are “packed into” each class interval.
– Note that the class interval $50-80K includes about twice as many
cases (23.3%) as the interval $15-25K (11.4%).
– This fact is reflected in the bar graph in Figure 9 by the fact that the bar
on the $50-80K interval is about twice as high as the bar over the $15-
25K interval.
– It is reflected in the histogram in Figure 10 by the fact that the “bar”
(rectangle) on the $50-80K interval has about twice the area of the bar
on the $15-25K interval.
– But the 23.3% of the cases in the $50-80K interval are spread over an
income interval that is three times as wide as the interval into which
the 11.4% of the cases in the $15-25K interval are packed, so the
former is only a bit over two-thirds as tall as the latter.
Areas, Populations, and Population
Densities
• It might (or might not) be helpful to point out that a
histogram type of diagram could be used to display the
area, population, and population density of each U.S.
state.
• Each state would be represented by a segment of the horizontal
axis proportional to its area [square miles].
• The total population of each state would be represented by the
area of the rectangle erected on its interval.
• The height of the rectangle would represent the state’s
population density [people per square mile].
• Only if all states had the same area would their populations
depend solely on their population densities.
“Cute” Bar Charts

• Popular newspapers (especially USA Today), magazines,
advertisements, etc., like to present bar graphs but usually can’t
resist the temptation of making them “cute” by letting figures of
one sort or other take the place of simple bars.
• Often, as the heights of the figures vary, their widths also vary
in a proportionate manner.
• The eye then tends to compare areas rather than heights,
producing distinctly misleading impressions.
• Cuteness trumps clarity.
“Cute” Bar Charts (cont.)

• This is a bar chart; height represents frequency.
• But, as the heights of the “bars” (bullets) vary, their widths
also vary proportionately.
• The eye then tends to compare areas rather than heights,
producing distinctly misleading impressions.
• The U.S. has only about twice as many firearms per capita as
Switzerland or Finland but its bullet is about four times as
large as theirs.
• This problem is mitigated by the fact that the actual numerical
values of firearms per capita are shown.
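
The “twice as tall, four times as large” effect is simple geometry: when a figure's width is scaled by the same factor as its height, its area grows by the square of that factor. A minimal Python sketch follows, with `area_ratio` as an illustrative name and the assumption that the figures keep the same shape as they are scaled.

```python
def area_ratio(height_ratio):
    """Area ratio of two same-shaped figures whose heights differ by height_ratio.

    Width scales by the same factor as height, so area scales by its square.
    """
    return height_ratio ** 2


print(area_ratio(2))  # a figure twice as tall covers 4 times the area
```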
Continuous Densities
• The INCOME histogram was based on a small number
of (rather wide) class intervals and a modest number of
cases (n = 1212).
• Remember that INCOME is interval and (essentially)
continuous.
• Suppose we have INCOME data that is recorded very
precisely, e.g., to the nearest dollar or even cent.
• Suppose also we have a huge (even infinite) number
of cases.
• We could then refine INCOME into narrower and narrower
(i.e., more precise) class intervals, redrawing the
histogram accordingly.
• If we pushed this process to the limit, we would end up
with what would be an essentially continuous (and
probably fairly smooth) density curve.
• This is illustrated in the following series of charts using a
symmetric (“normal”) distribution.
Approaching a Continuous Density Curve
Cut the width of the Class Intervals in Half
Cut the width of the Class Intervals in Half Again
And Again
And Again
We Approach a Continuous Density [Normal] Curve
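
The limiting process shown in these charts can be imitated numerically. The following minimal Python sketch assumes a standard normal distribution (mean 0, standard deviation 1), a starting interval width of 2, and a class interval centered on the mean. None of these specifics come from the handout's charts. It shows that as the interval width is repeatedly halved, the bar height (area divided by width) approaches the height of the continuous density curve at that point.

```python
import math


def normal_cdf(x):
    """Cumulative probability of a standard normal variable up to x."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def normal_pdf(x):
    """Height of the standard normal density curve at x."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def bar_height(center, width):
    """Histogram bar height = share of cases in the interval / interval width."""
    area = normal_cdf(center + width / 2) - normal_cdf(center - width / 2)
    return area / width


width = 2.0
for _ in range(5):
    print(f"width {width:<6} bar height {bar_height(0.0, width):.4f}")
    width /= 2  # cut the width of the class interval in half
print(f"density curve height at 0: {normal_pdf(0.0):.4f}")
```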
A Continuous Income Density Curve [“Eyeball estimate”]

• Contrary to the (hypothetical) continuous density curve for INCOME provided with
Problem Set #5C, the SETUPS/NES data suggests that the distribution of
household income has two “peaks” (or modes), one at about $18K and
another at about $43K, with a slight “valley” between them.
• This probably results from the fact that there are two types of households:
family or multi-person households (typically two or more adults and often
children as well) and single-person households (typically widows/widowers
or young adults who have recently “flown the nest” but are not yet married
with children). On average, the former type of household has (and needs)
higher income than the latter. This tends to produce two peaks in the
overall distribution of household income.
A Symmetric [Normal] Density Curve
An Asymmetric Density Curve
